
Ronan Collobert

ronan@collobert.com

Introduction:

x → W1 → tanh(·) → W2 → score

How to train them?

Why does it generalize?

What about real-life inputs (other than vectors x)?

Any applications?

Biological Neuron

Excitatory and inhibitory signals are integrated

If stimulus reaches a threshold, the neuron fires along the axon

Neurons as linear threshold units

  f(x) = 1 if w·x > T, 0 otherwise

A unit can perform OR and AND operations

Combine these units to represent any boolean function

How to train them?
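A minimal sketch of such units (the particular weights and thresholds are illustrative choices, not from the slides): one unit computes OR, another AND, and combining units yields boolean functions no single unit can compute, such as XOR.

```python
def unit(w, T):
    # Linear threshold unit: output 1 iff w.x exceeds the threshold T
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) > T else 0

OR  = unit([1, 1], 0.5)   # fires if at least one input is on
AND = unit([1, 1], 1.5)   # fires only if both inputs are on

def XOR(x):
    # built by combining units: XOR(x) = OR(x) AND NOT AND(x)
    return AND([OR(x), 1 - AND(x)])

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([OR(x) for x in inputs])    # [0, 1, 1, 1]
print([AND(x) for x in inputs])   # [0, 0, 0, 1]
print([XOR(x) for x in inputs])   # [0, 1, 1, 0]
```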


Perceptron:

Rosenblatt (1957)

Separating hyperplane: w·x + b = 0

Input: retina x ∈ R^n

Associative area: any kind of (fixed) function Φ(x) ∈ R^d

Decision function:

  f(x) = 1 if w·Φ(x) > 0, −1 otherwise

Training: minimize Σ_t max(0, −y^t w·Φ(x^t)), given (Φ(x^t), y^t) ∈ R^d × {−1, 1}

Update rule:

  w^{t+1} = w^t + y^t Φ(x^t)   if y^t w^t·Φ(x^t) ≤ 0
  w^{t+1} = w^t                otherwise
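The update rule translates into a few lines of code; a minimal sketch (the toy data and the constant-feature trick for the bias b are illustrative assumptions):

```python
def perceptron(data, epochs=100):
    # w <- w + y * phi(x) whenever the example is on the wrong side (y * w.phi(x) <= 0)
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        mistakes = 0
        for phi_x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, phi_x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, phi_x)]
                mistakes += 1
        if mistakes == 0:   # converged: every example correctly classified
            break
    return w

# phi(x) = (x1, x2, 1): the constant feature plays the role of the bias b
data = [([1.0, 2.0, 1.0], 1), ([2.0, 1.0, 1.0], 1),
        ([-1.0, -2.0, 1.0], -1), ([-2.0, -1.0, 1.0], -1)]
w = perceptron(data)
assert all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in data)
```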

Perceptron: Convergence

Assuming the classes are separable, let u define the maximum-margin separating hyperplane, normalized so that y^t u·x^t ≥ 1 for all t; the margin is then ρ_max = 2/||u||.

When we make a mistake (with R = max_t ||x^t||):

  u·w^t = u·w^{t−1} + y^t u·x^t ≥ u·w^{t−1} + 1   ⇒   u·w^t ≥ t

  ||w^t||² ≤ ||w^{t−1}||² + R²   ⇒   ||w^t||² ≤ t R²

Combining with u·w^t ≤ ||u|| ||w^t||, we get:

  t ≤ ||u||² R² = 4 R² / ρ²_max

Adaline:

Problems with the Perceptron:
- Separable case: does not find a hyperplane equidistant from the two classes
- Non-separable case: does not converge

Adaline (Widrow & Hoff, 1960) minimizes

  (1/2) Σ_t (y^t − w·x^t)²

Delta rule:

  w^{t+1} = w^t + λ (y^t − w^t·x^t) x^t
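The delta rule is plain stochastic gradient descent on the squared error; a minimal sketch (learning rate and toy data are illustrative choices):

```python
def adaline(data, lam=0.05, epochs=200):
    # Delta rule: w <- w + lam * (y - w.x) * x
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            err = y - sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + lam * err * xi for wi, xi in zip(w, x)]
    return w

# fit y = 2*x + 1, with a constant feature acting as the bias
data = [([x, 1.0], 2 * x + 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
w = adaline(data)
print(w)  # close to [2.0, 1.0]
```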

Perceptron: Margin

(Krauth & Mézard, 1987), (Collobert, 2004)

Poor generalization capabilities in practice

No control on the margin: only ρ = 2/||w^T|| ≥ ρ_max / R²

Margin Perceptron: minimize Σ_t max(0, 1 − y^t w·Φ(x^t))

Update rule:

  w^{t+1} = w^t + y^t Φ(x^t)   if y^t w·Φ(x^t) ≤ 1
  w^{t+1} = w^t                otherwise

Finite number of updates:

  t ≤ 2 (1 + R²) / ρ²_max

Control on the margin:

  ρ ≥ ρ_max / (2 + R²)

Perceptron: In Practice

[Figure: decision boundaries found by the original Perceptron after 10/40/60 iterations]

Regularization

In many machine-learning algorithms (including SVMs!),
early stopping (on a validation set) is a good idea.

Weight decay: add a penalty λ ||w||² to the cost, e.g.

  Σ_t max(0, 1 − y^t w·Φ(x^t)) + λ ||w||²

This is the SVM cost!

Going Non-Linear:

(1/2)

  f(x) = 1 if w·Φ(x) > 0, −1 otherwise

Non-linearity achieved by hand-crafting a non-linear Φ(·)

[Figure: data that is not linearly separable in input space becomes separable after the mapping Φ(x1, x2) = (x1^2, √2 x1 x2, x2^2)]

Going Non-Linear:

(2/2)

Consider now the Perceptron update

  w^{t+1} = w^t + y^t Φ(x^t)   if y^t w·Φ(x^t) ≤ 0
  w^{t+1} = w^t                otherwise

The decision function at the t-th example can be written as:

  f(x) = Σ_{t updated} y^t Φ(x^t)·Φ(x)

Only dot products between Φ's are needed: replace them by a kernel.
E.g., for Φ(x) = Φ(x1, x2) = (x1^2, √2 x1 x2, x2^2), a possible kernel is

  K(x, x^t) = Φ(x)·Φ(x^t) = (x·x^t)²

K(·, ·) is a kernel if, for all g such that ∫ g(x)² dx < ∞,

  ∫∫ K(x, x') g(x) g(x') dx dx' ≥ 0

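A quick numeric check that the polynomial kernel computes the dot product in the hand-crafted feature space (the sample points are arbitrary):

```python
import math

def phi(x):
    # the hand-crafted map (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def K(x, xp):
    # polynomial kernel (x . xp)^2: no need to build phi explicitly
    return sum(a * b for a, b in zip(x, xp)) ** 2

x, xp = (1.0, 2.0), (3.0, -1.0)
dot_in_feature_space = sum(a * b for a, b in zip(phi(x), phi(xp)))
print(dot_in_feature_space, K(x, xp))  # equal, up to float rounding
```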

Support Vector Machines unify nicely all the previous concepts

- Perceptron + Margin + Regularization
  = soft-margin Support Vector Machines (Cortes & Vapnik, 1995)

- Perceptron + Margin + Regularization + Kernel
  = non-linear Support Vector Machines
  (Hard-margin SVMs: Boser, Guyon & Vapnik, 1992)

For non-linear kernels, sparsity issues arise with gradient descent in the primal:
efficient algorithms exist in the dual, or consider a budget

see SVM course


Going Non-Linear:

Adding Layers

(1/2)

Neocognitron: (Fukushima, 1980)


Going Non-Linear:

Adding Layers

(2/2)

Multi-Layer Perceptron

  x → W1 → tanh(·) → W2 → score

Any function g : R^d → R can be approximated (on a compact) by a two-layer neural network

  x → W1 → tanh(·) → W2 → score

Cybenko (1989) used:
- The Hahn–Banach theorem
- The Riesz representation theorem

Gradient Descent

(1/4)

we want to minimize the cost over the parameters θ:

  C(θ) = Σ_{t=1}^T c(f(x^t), y^t)

Batch gradient descent:

  θ ← θ − λ ∂C(θ)/∂θ

- Variants: see your optimization book (conjugate gradient, BFGS...)
- Slow in practice

Take advantage of redundancy: stochastic gradient descent.
Pick a random example t:

  θ ← θ − λ ∂c(f(x^t), y^t)/∂θ

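Stochastic gradient descent in a few lines, on a least-squares toy problem (the particular cost, data, and learning rate are illustrative assumptions; any differentiable c works):

```python
import random

def sgd(data, grad_c, theta, lam=0.1, steps=2000):
    # pick a random example t, step along the negative gradient of its cost
    for _ in range(steps):
        x, y = random.choice(data)
        g = grad_c(theta, x, y)
        theta = [t_k - lam * g_k for t_k, g_k in zip(theta, g)]
    return theta

def grad_c(theta, x, y):
    # c = 1/2 (y - theta.x)^2  =>  dc/dtheta = -(y - theta.x) x
    err = y - sum(t_k * x_k for t_k, x_k in zip(theta, x))
    return [-err * x_k for x_k in x]

random.seed(0)
data = [([x, 1.0], 3 * x - 2) for x in [-1.0, 0.0, 1.0, 2.0]]
theta = sgd(data, grad_c, [0.0, 0.0])
print(theta)  # close to [3.0, -2.0]
```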

Gradient Descent:

Learning Rate

(2/4)

Good idea to use a validation set


Gradient Descent:

Caveats

(3/4)

  x → w1 → tanh(·) → w2 → log(1 + e^{−y·score})

Saddle points, plateaux...


Gradient Descent:

(4/4)

Initialize the weights properly:
- Not too big: tanh(·) saturates
- Not too small: all units would do the same!

Normalize properly your data (mean/variance):
- Again, you want to be in the right part of the tanh(·)

Use a second-order approach (H is the Hessian):

  C(θ + Δθ) ≈ C(θ) + Δθ·∂C(θ)/∂θ + (1/2) Δθᵀ H(θ) Δθ

- Estimated on a training subset
- Be sure it is positive definite!
- Can be backpropagated as the gradient
- Update each parameter θ_k with its own learning rate:

    λ_k = η / (∂²C/∂θ_k²)


Gradient Backpropagation

(1/2)

Possible previous references for the chain rule include (Leibniz, 1675) and (Newton, 1687)

View the network + loss as a stack of layers:

  x → f1(·) → f2(·) → f3(·) → f4(·)

How to compute ∂f/∂w_l ?

E.g., for a one-layer network with squared loss:
- f1(x) = w1·x
- f2(f1) = (1/2) (y − f1)²

Chain rule:

  ∂f/∂w1 = (∂f2/∂f1) · (∂f1/∂w1) = −(y − f1) · x


Gradient Backpropagation

(2/2)

  x → f1(·) → f2(·) → f3(·) → f4(·)

Brutal way:

  ∂f/∂w_l = (∂f_L/∂f_{L−1}) (∂f_{L−1}/∂f_{L−2}) ... (∂f_{l+1}/∂f_l) (∂f_l/∂w_l)

Instead, each module l can locally:
- Receive the gradient w.r.t. its own outputs f_l
- Compute the gradient w.r.t. its own inputs f_{l−1} (backward):

    ∂f/∂f_{l−1} = (∂f/∂f_l) (∂f_l/∂f_{l−1})

- Compute the gradient w.r.t. its own parameters w_l (if any):

    ∂f/∂w_l = (∂f/∂f_l) (∂f_l/∂w_l)

Often, gradients are efficiently computed using outputs of the module:
do a forward before each backward!


Examples Of Modules

For simplicity, we denote:
- x the input and y the output of a module f_l(x)
- z the target of a loss module
- g the gradient w.r.t. the output of each module

Module   | Forward             | Backward (w.r.t. input) | Gradient (w.r.t. parameters)
Linear   | y = W x             | W^T g                   | g x^T
MSE Loss | y = (1/2) (x − z)²  | x − z                   | —
Tanh     | y = tanh(x)         | g (1 − y²)              | —
Sigmoid  | y = 1/(1 + e^{−x})  | g y (1 − y)             | —

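The table translates directly into code; a minimal sketch of the module interface (weights and data are arbitrary), with a finite-difference check of one Linear parameter gradient:

```python
import math

class Linear:
    def __init__(self, W):
        self.W = W                          # weight matrix as a list of rows
    def forward(self, x):
        self.x = x
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
    def backward(self, g):
        # gradient w.r.t. parameters: g x^T ; w.r.t. input: W^T g
        self.gW = [[gi * xi for xi in self.x] for gi in g]
        return [sum(self.W[i][j] * g[i] for i in range(len(g)))
                for j in range(len(self.x))]

class Tanh:
    def forward(self, x):
        self.y = [math.tanh(v) for v in x]
        return self.y
    def backward(self, g):
        return [gi * (1 - yi * yi) for gi, yi in zip(g, self.y)]

class MSE:
    def forward(self, x, z):
        self.d = [a - b for a, b in zip(x, z)]
        return 0.5 * sum(v * v for v in self.d)
    def backward(self):
        return self.d                       # x - z

lin, act, loss = Linear([[0.3, -0.2], [0.1, 0.4]]), Tanh(), MSE()
x, z = [1.0, 2.0], [0.5, -0.5]
cost = lambda: loss.forward(act.forward(lin.forward(x)), z)

c0 = cost()                                  # forward before each backward!
lin.backward(act.backward(loss.backward()))  # fills lin.gW
eps = 1e-6
lin.W[0][0] += eps
assert abs((cost() - c0) / eps - lin.gW[0][0]) < 1e-4
```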

Classification: (1/2)

We want to maximize the (log-)likelihood:

  log Π_{t=1}^T p(y^t | x^t) = Σ_{t=1}^T log p(y^t | x^t)

Interpret scores as conditional probabilities using a softmax:

  p(y|x) = e^{f_y(x)} / Σ_i e^{f_i(x)}

  log p(y|x) = f_y(x) − log Σ_i e^{f_i(x)}
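Computed naively, the softmax overflows for large scores; the usual log-sum-exp shift by the maximum (an implementation detail, not on the slides) makes log p(y|x) stable:

```python
import math

def log_softmax(scores, y):
    # log p(y|x) = f_y(x) - log sum_i exp(f_i(x)), shifted by the max for stability
    m = max(scores)
    logsum = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[y] - logsum

scores = [1000.0, 999.0, -2.0]              # naive math.exp(1000.0) would overflow
logps = [log_softmax(scores, i) for i in range(len(scores))]
print(sum(math.exp(lp) for lp in logps))    # ~ 1.0: probabilities sum to one
```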


Classification: (2/2)

In the two-class case:

  log p(y = 1|x) = log [ e^{f_1(x)} / (e^{f_1(x)} + e^{f_{−1}(x)}) ]
                 = −log(1 + e^{−(f_1(x) − f_{−1}(x))})

and in general log p(y|x) = −log(1 + e^{−y (f_1(x) − f_{−1}(x))}).

Taking z = y (f_1(x) − f_{−1}(x)):

  z ↦ log(1 + e^{−z}) is a smooth version of the SVM cost

[Figure: log(1 + exp(−z)) versus the hinge |1 − z|_+]


Regression:

The target variables y ∈ R are now continuous.

We often consider

  y|x ∼ N(f(x), σ²)

In this case,

  −log p(y|x) = 1/(2σ²) ||y − f(x)||² + const

Not great for classification

Unsupervised Training

(1/2)

Deep architectures are hard to train: how to pretrain each layer?

Auto-encoder/bottleneck network: try to reconstruct the input

  x → W^1 → tanh(·) → W^2 → tanh(·) → W^3 → reconstruction of x

Caveats:
- It is a bottleneck mapping...

Unsupervised Training

(2/2)

  x → W^1 → tanh(·) → W^2 → tanh(·) → W^3 → reconstruction of x

Possible improvements:
- No W^3 layer, W^2 = [W^1]^T (Bengio et al., 2006)
- Impose sparsity constraints on the projection (Kavukcuoglu et al., 2008)

Specialized Layers:

RBF

  x → RBF (W^1) → W^2 → score

  f_{1,i}(x) = exp( −||x − W^1_{·,i}||² / (2σ²) )

Gradient is zero if the W^1 columns are far from the training examples

Specialized Layers: 1D Convolutions

  x → Conv 1D → Subsampl. 1D → tanh(·) → Conv 1D → tanh(·) → W^2 → ...

The same weights W are applied over each window of successive columns of the input matrix X = (X^1, X^2, ...):

  (X^1, X^2) ↦ W (X^1, X^2)
  (X^2, X^3) ↦ W (X^2, X^3)   convolution (local embedding for each input column)
  (X^3, X^4) ↦ W (X^3, X^4)

Robustness to time shifts:
apply sub-sampling (as a convolution, but where W_{·,i} contains a single value)

Also called Time Delay Neural Networks (TDNNs)

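A sketch of the two operations on a scalar sequence (window size, filter values, and the averaging subsampler are illustrative assumptions, simplified from the matrix-valued case):

```python
def conv1d(X, w):
    # apply the same filter w to every window of len(w) consecutive frames
    k = len(w)
    return [sum(wi * xi for wi, xi in zip(w, X[t:t + k]))
            for t in range(len(X) - k + 1)]

def subsample(X, k=2):
    # sub-sampling as a stride-k convolution whose filter holds a single value (1/k)
    return [sum(X[t:t + k]) / k for t in range(0, len(X) - k + 1, k)]

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = conv1d(X, [1.0, -1.0])        # local difference of neighbouring frames
print(Y)                          # [-1.0, -1.0, -1.0, -1.0]
print(subsample(Y))               # [-1.0, -1.0]

# robustness to time shifts: the subsampled code of a shifted input changes little
print(subsample(conv1d([0.0] + X[:-1], [1.0, -1.0])))
```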

Specialized Layers:

2D Convolutions

[Figure: 2D convolutional network with stacked convolution/subsampling layers W^1, W^2, W^3]

Specialized Training:

Non-Linear CRF

(1/2)

The network score for class k at the t-th frame is f([x]_1^T, k, t, θ)

A_{kl} is the transition score to jump from class k to class l

[Figure: lattice of classes 1, 2, 3, 4, ... over frames x_1, ..., x_6; a label path selects one class per frame]

Sentence score for a class label path [i]_1^T:

  s([x]_1^T, [i]_1^T, θ) = Σ_{t=1}^T ( A_{[i]_{t−1} [i]_t} + f([x]_1^T, [i]_t, t, θ) )

Log-likelihood of the true path [y]_1^T:

  log p([y]_1^T | [x]_1^T, θ) = s([x]_1^T, [y]_1^T, θ) − logadd_{[j]_1^T} s([x]_1^T, [j]_1^T, θ)

Specialized Training:

Non-Linear CRF

(2/2)

Recursion over frames:

  δ_t(j) = f([x]_1^T, j, t, θ) + logadd_i ( δ_{t−1}(i) + A_{ij} )

Termination:

  logadd_{[j]_1^T} s([x]_1^T, [j]_1^T, θ) = logadd_i δ_T(i)

Non-linear CRFs: Graph Transformer Networks (Bottou et al., 1997)

Compared to CRFs, we train features (network parameters and

transitions scores Akl )

Inference: Viterbi algorithm (replace logAdd by max)
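The recursion and its Viterbi counterpart differ only in the combining operator, which a sketch can make literal (the frame and transition scores below are illustrative values; δ holds one entry per class):

```python
import math

def logadd(vals):
    # log sum_i exp(v_i), shifted by the max for numerical stability
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def path_scores(f, A, combine):
    # delta_t(j) = f[t][j] + combine_i( delta_{t-1}(i) + A[i][j] )
    delta = list(f[0])
    for t in range(1, len(f)):
        delta = [f[t][j] + combine([delta[i] + A[i][j] for i in range(len(delta))])
                 for j in range(len(delta))]
    return combine(delta)                  # termination: combine over delta_T

f = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.4]]   # frame scores f(j, t), K = 2 classes
A = [[0.0, -0.1], [-0.2, 0.0]]             # transition scores A_kl

logadd_all = path_scores(f, A, logadd)     # logadd over all label paths
viterbi = path_scores(f, A, max)           # best single path (inference)
print(logadd_all, viterbi)
```

Passing `max` instead of `logadd` is exactly the "replace logadd by max" substitution that turns the likelihood recursion into Viterbi inference.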

