
Knowledge Discovery WS 14/15

Non-Linear Classifiers (ANNs) 5  


Prof. Dr. Rudi Studer, Dr. Achim Rettinger*, Dipl.-Inform. Lei Zhang
{rudi.studer, achim.rettinger, l.zhang}@kit.edu

INSTITUT FÜR ANGEWANDTE INFORMATIK UND FORMALE BESCHREIBUNGSVERFAHREN (AIFB)

KIT – University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
www.kit.edu
Knowledge Discovery Lecture WS14/15
22.10.2014  Einführung (Introduction)            Basics, Overview
29.10.2014  Design of KD-experiments
05.11.2014  Linear Classifiers
12.11.2014  Data Warehousing & OLAP
19.11.2014  Non-Linear Classifiers (ANNs)        Supervised Techniques,
26.11.2014  Kernels, SVM                         Vector+Label Representation
03.12.2014  entfällt (cancelled)
10.12.2014  Decision Trees
17.12.2014  IBL & Clustering                     Unsupervised Techniques
07.01.2015  Relational Learning I                Semi-supervised Techniques,
14.01.2015  Relational Learning II               Relational Representation
21.01.2015  Relational Learning III
28.01.2015  Textmining
04.02.2015  Gastvortrag (guest lecture)          Meta-Topics
11.02.2015  CRISP, Visualisierung (visualization)

2 Institut AIFB
The Data Matrix for Supervised Learning

X_j                                  j-th input variable
X = (X_0, …, X_{M-1})^T              vector of input variables
M                                    number of input variables
N                                    number of data points
Y                                    output variable
x_i = (x_{i,0}, …, x_{i,M-1})^T      i-th input vector
x_{i,j}                              j-th component of x_i
y_i                                  i-th target value
d_i = (x_{i,0}, …, x_{i,M-1}, y_i)^T i-th pattern
D = {d_1, …, d_N}                    (training) data set
z                                    test input vector
t                                    unknown test target for z
X = (x_1, …, x_N)^T                  design matrix

3 Institut AIFB
Recap: Linear Models

Vector of inputs: X = (x_1, …, x_N)

Single output: Y

Linear model: ŷ_i = Σ_{j=1}^{M} w_j x_{i,j} + b

Estimate: w, b

4 Institut AIFB
Recap: Perceptron – Components

Model class

•  Perceptrons = linear discriminant functions = separating hyperplanes

Learning algorithm

•  Gradient descent / Delta Rule

Optimization criterion

•  Minimize squared error at output layer

5 Institut AIFB
Linear Separability and the XOR problem
Famous perceptron tasks: simulate simple Boolean functions
(i.e. encode corresponding truth tables) by encoding inputs as -1
(for false) and +1 (for true)

[Figure: truth-table diagrams for AND and XOR; the AND case is solved by w1 = 1, w2 = 1, b = -1, the XOR case is marked with '??']

Sad but true: even a simple XOR cannot be simulated adequately by a linear threshold function.

6 Institut AIFB
Linear separability and the Iris Dataset…

7 Institut AIFB
Linear Separability and the XOR problem

  The Perceptron is able to simulate many different Boolean functions, e.g. AND, OR, NAND, NOR.
  Example: for the AND function we can choose (among others) w1 = 1, w2 = 1, b = -1 (see the sketch below).
  NB: AND and OR are special cases of the 'm-of-n functions': at least m out of n inputs need to be 1.
  However, XOR and other more complicated expressions in propositional logic cannot be expressed by a Perceptron.

  This is a problem of the model class, not of the training algorithm involved.

Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, USA.
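The following is a small sketch of these two points (my own illustration, not part of the slides), using the -1/+1 encoding introduced above: the threshold unit with w1 = 1, w2 = 1, b = -1 reproduces the AND truth table, while a brute-force search over a grid of weights finds no linear threshold unit that reproduces XOR.

```python
import itertools
import numpy as np

def threshold_unit(x, w, b):
    """Linear threshold unit: returns +1 if w.x + b > 0, else -1."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Inputs encoded as -1 (false) and +1 (true), as on the slide.
inputs = [np.array(p) for p in itertools.product([-1, 1], repeat=2)]
AND = {tuple(x): (1 if (x == 1).all() else -1) for x in inputs}
XOR = {tuple(x): (1 if x[0] != x[1] else -1) for x in inputs}

# w1 = 1, w2 = 1, b = -1 simulates AND exactly.
print(all(threshold_unit(x, np.array([1, 1]), -1) == AND[tuple(x)] for x in inputs))  # True

# Brute-force search over a coarse weight grid: no linear threshold unit gets XOR right.
grid = np.linspace(-2, 2, 9)
solvable = any(
    all(threshold_unit(x, np.array([w1, w2]), b) == XOR[tuple(x)] for x in inputs)
    for w1 in grid for w2 in grid for b in grid
)
print(solvable)  # False: XOR is not linearly separable
```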

8 Institut AIFB
Basis expansion (Prelude for ANNs & SVMs)
  Linear model: f(x_i, w) = w_0 + Σ_{j=1}^{M-1} w_j x_{i,j}

  Idea: augment the inputs with additional functions φ(x)

  Basis-expanded linear model: f(x_i, w) = w_0 + Σ_{j=1}^{M-1} w_j φ_j(x_{i,j})
11 Institut AIFB
Example 1: Basis Expansion

  XOR is not linearly separable.

  Add a basis expansion: φ(x_i) = (x_{i,1}, x_{i,2}, x_{i,3} = z_3 = x_{i,1} · x_{i,2})

  w = (1, -2, -2, 4)
  f(x) = 1 - 2x_1 - 2x_2 + 4x_1x_2
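A quick check of this example (a sketch of my own; it assumes a 0/1 encoding of the truth table and reads the sign of f as the class): with the added feature x1·x2, the four XOR patterns become linearly separable in the expanded space.

```python
import numpy as np

def f(x1, x2, w=(1.0, -2.0, -2.0, 4.0)):
    """Basis-expanded linear model from the slide: features (1, x1, x2, x1*x2)."""
    phi = np.array([1.0, x1, x2, x1 * x2])  # z3 = x1*x2 is the added basis function
    return float(np.dot(w, phi))

# Assuming a 0/1 encoding of the XOR truth table:
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "XOR =", x1 ^ x2, " f(x) =", f(x1, x2))
# f(x) = 1 - 2*x1 - 2*x2 + 4*x1*x2 is positive on one class (XOR = 0)
# and negative on the other (XOR = 1): the expanded data is linearly separable.
```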

12 Institut AIFB
Example 2: Basis Expansion

Polynomial: f(x) = x - 0.3x³

  Basis expansion: z = (1, x, x², x³)

  Weights: w = (0, 1, 0, -0.3)

  Linear regression + basis expansion = non-linear model
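A minimal sketch of this recipe (my own illustration; the noisy samples of the polynomial are an assumption, not data from the lecture): ordinary least squares on the expanded features z = (1, x, x², x³) recovers weights close to (0, 1, 0, -0.3).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the target polynomial f(x) = x - 0.3*x**3 plus a little noise.
x = rng.uniform(-3, 3, size=200)
y = x - 0.3 * x**3 + rng.normal(scale=0.1, size=x.shape)

# Basis expansion z = (1, x, x^2, x^3): the model stays linear in the weights.
Z = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares on the expanded features.
w, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(np.round(w, 2))  # approximately [ 0.  1.  0. -0.3]
```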

13 Institut AIFB
Popular Basis Expansions

  φ_j(x) = x_1, x_2, …, x_{M-1} (the inputs themselves)
  φ_j(x) = x_1², x_1x_2, … (polynomial terms)
  φ_j(x) = log(x_1), √x_1, … (non-linear transformations of single inputs)
  Piecewise polynomials, splines, wavelet bases
  φ_j(x) = exp(-v‖x - x_m‖²) (radial basis functions)
  φ(x) = sig(Σ_{j=0}^{M-1} v_j x_j) with sig(arg) = 1/(1 + exp(-arg)) (sigmoidal basis functions; see the sketch below)
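As a small illustration of the last two families (the centres x_m, the scale v and the projection vectors v_j below are arbitrary placeholders, not values from the lecture):

```python
import numpy as np

def rbf_features(X, centres, v=1.0):
    """phi_m(x) = exp(-v * ||x - x_m||^2) for every centre x_m."""
    # X: (N, M) inputs, centres: (K, M) chosen centre vectors.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-v * d2)

def sigmoid_features(X, V):
    """phi_k(x) = sig(sum_j v_{k,j} * x_j) with sig(a) = 1 / (1 + exp(-a))."""
    # V: (K, M) projection vectors, one per basis function.
    return 1.0 / (1.0 + np.exp(-X @ V.T))

X = np.array([[0.0, 0.0], [1.0, 2.0]])           # two example inputs
centres = np.array([[0.0, 0.0], [1.0, 1.0]])     # assumed RBF centres
V = np.array([[1.0, -1.0], [0.5, 0.5]])          # assumed projection vectors
print(rbf_features(X, centres, v=0.5))
print(sigmoid_features(X, V))
```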
14 Institut AIFB
Challenges

How do we determine and parametrize the basis expansions?

ANNs have proven to achieve good results with few basis functions.

15 Institut AIFB
Chapter 4.2.

Multi-Layer Neural Networks

16 Institut AIFB
Abstract Model of the Neuron

[Figure: a biological neuron (dendrites, cell body, axon) next to its abstract model: weighted summation of the inputs followed by an activation function.]

17 Institut AIFB
Feed-Forward Networks

[Figure: feed-forward network. The input vector enters the input layer; weighted & adaptive connections lead to the hidden layer and from there, again via weighted & adaptive connections, to the output layer.]
18 Institut AIFB
Application: ALVINN [Mitchell 1997]

Figure 1: A neural network learning to steer an autonomous vehicle [Mitchell 1997]
19 Institut AIFB
Components of an Artificial Neural Network

  Network topology (= model class!)


  Set of artificial neurons
  Structure of the connections between the neurons ("synapses")

  Actual parameter settings (= model)


  Connectivity between the neurons (excitatory vs. inhibitory)
  Activation thresholds within the neurons

  Training Algorithm
  "Learning" means determining appropriate connection and
threshold weights

20 Institut AIFB
Multi-Layer Neural Networks

  Basic idea: combine / stack multiple neurons in multiple layers to achieve more complex prediction models.

  The activation at the "hidden layers" corresponds to a non-linear transformation of the input space.

  Example (next slide): a two-layer neural network (we don't count the input layer) with one "hidden" layer (3 units) and one output layer (1 unit).

21 Institut AIFB
Expressivity of Multi-Layer ANNs

[Figure: a two-layer threshold network solving XOR: input layer (x1, x2), a hidden layer with two threshold units (t=1 and t=2), and an output unit (t=1) that receives an inhibitory weight of -2 from the t=2 unit.]

  t = threshold (the unit returns 1 if its net input is >= t)

  The XOR problem can be solved with only one extra layer (see the sketch below).
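A sketch of one consistent reading of the figure (assuming 0/1 inputs and the rule "output 1 if the net input is >= t"): a hidden OR-like unit (t = 1), a hidden AND-like unit (t = 2) and an output unit (t = 1) with an inhibitory weight of -2 from the AND unit together compute XOR.

```python
def step(net, t):
    """Threshold unit from the slide: output 1 if net input >= t, else 0."""
    return 1 if net >= t else 0

def xor_net(x1, x2):
    h1 = step(1 * x1 + 1 * x2, t=1)    # hidden unit 1: fires for OR(x1, x2)
    h2 = step(1 * x1 + 1 * x2, t=2)    # hidden unit 2: fires for AND(x1, x2)
    return step(1 * h1 - 2 * h2, t=1)  # output: OR and not AND = XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))  # 0, 1, 1, 0
```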

22 Institut AIFB
Interpretation of Multi-Layer ANNs

  Each neuron in the hidden layer computes a single non-linear feature of the data (and this feature can itself be adapted during training!).

  Cf. the basis functions from the previous section.

23 Institut AIFB
Sigmoid Activation Function

  Introduces non-linearity into the network's calculations, as it "squashes" the neuron's activation into the range [0,1]: sig(net) = 1/(1 + exp(-net)).
  Has an extremely simple derivative, which we need for gradient descent: sig'(net) = sig(net) · (1 - sig(net)).
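A tiny sketch of these two facts (my own illustration):

```python
import numpy as np

def sig(net):
    """Logistic sigmoid: squashes any activation into the interval between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-net))

def sig_prime(net):
    """Derivative needed for gradient descent: sig'(net) = sig(net) * (1 - sig(net))."""
    s = sig(net)
    return s * (1.0 - s)

net = np.linspace(-5, 5, 5)
print(sig(net))        # values between 0 and 1
print(sig_prime(net))  # maximal at net = 0, vanishing for large |net|
```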

24 Institut AIFB
Notation

  Network layers 1…n; layer 0 denotes the input vector
  w_ij(l) – connection weight from the i-th neuron in layer l-1 to the j-th neuron in layer l
  x_i(l) – i-th input component to any neuron in layer l
  net_j(l) – net input (activation) of the j-th neuron in layer l, with net_j(l) = Σ_i w_ij(l) · x_i(l)
  o_j(l) – output of the j-th neuron in layer l, with o_j(l) = Sig(net_j(l)) = x_j(l+1)
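A sketch of a forward pass written directly in this notation (the weight matrices below are randomly initialized placeholders, not weights from the lecture):

```python
import numpy as np

def sig(net):
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, weights):
    """Forward pass: x_i(l) are the inputs to layer l, net_j(l) = sum_i w_ij(l) x_i(l),
    o_j(l) = sig(net_j(l)) = x_j(l+1)."""
    outputs = []               # o(l) for every layer l = 1..n
    x_l = x                    # layer 0: the input vector
    for W in weights:          # each W has shape (#neurons in layer l-1, #neurons in layer l)
        net_l = x_l @ W        # net_j(l) = sum_i w_ij(l) * x_i(l)
        o_l = sig(net_l)       # o_j(l) = sig(net_j(l))
        outputs.append(o_l)
        x_l = o_l              # o_j(l) becomes x_j(l+1), the input to the next layer
    return outputs

rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]  # 2 inputs, 3 hidden, 1 output
print(forward(np.array([1.0, 0.5]), weights))
```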

25 Institut AIFB
Example – 2-layer ANN:
final activation at output layer

[Figure: a 2-layer ANN. Layer 0 (the input) produces outputs o_i(0) = x_i(1); layer 1 produces outputs o_j(1) = x_j(2); layer 2 (the output layer) produces the final activation. The drawn weights w_11(1), w_11(2), w_21(2) follow the notation above.]

NB: To simplify notation we again drop the bias parameter b and include it in the weight vector: w_0 = b. (Imagine this to be a dummy feature that equals 1 for all inputs.)

26 Institut AIFB
Backpropagation – General Considerations

  Recall: to simplify notation we again drop the bias term b and include it in the weight vector as w_0 = b.

  Generalization: we consider networks with multiple output units; the corresponding quadratic error over the k outputs becomes
  E(w) = ½ Σ_{d∈D} Σ_{k∈outputs} (t_{k,d} - o_{k,d})²,
  or, for an individual training example d,
  E_d(w) = ½ Σ_{k∈outputs} (t_k - o_k)².

  Backpropagation works like (incremental) gradient descent:
  Updates at the output layer can be computed directly from the quadratic error.
  Updates at the hidden layers need to take into account how the hidden units influence the eventual quadratic error indirectly.

27 Institut AIFB
Backpropagation – Illustration

[Figure: the data stream (propagation) flows forward from the input layer through the hidden layers to the output layer; the error stream (backpropagation) flows backward from the output error through the hidden layers.]

28 Institut AIFB
Backpropagation – General Considerations (cont)

  Like for the individual Perceptron, we can imagine an error surface over all weights in the network (or subsets thereof).

  For any weight w_jk(l) in the network we would like to know its influence on the final error E, i.e. the corresponding component of the gradient, ∂E/∂w_jk(l).

  Note that the weight w_jk(l) influences the rest of the network (and thus the error) only through the net input net_k(l) of the neuron k it connects to.
  We can therefore rewrite the partial derivative, regardless of further details, as
  ∂E/∂w_jk(l) = (∂E/∂net_k(l)) · (∂net_k(l)/∂w_jk(l)) = (∂E/∂net_k(l)) · x_j(l),
  with the auxiliary variable δ_k(l) = -∂E/∂net_k(l), which eventually leads to the weight update (see later): Δw_jk(l) = η · δ_k(l) · x_j(l).

29 Institut AIFB
Computing δ – Case I: Units at the output layer (n)

We thus have, for a sigmoid output unit k in the output layer n with target t_k:
δ_k(n) = -∂E_d/∂net_k(n) = o_k(n) · (1 - o_k(n)) · (t_k - o_k(n))

30 Institut AIFB
Computing δ – Case II: Units at the hidden layer (h)

We thus have, for a sigmoid hidden unit j in a hidden layer h, summing over all units k in the next layer that j feeds into:
δ_j(h) = o_j(h) · (1 - o_j(h)) · Σ_k w_jk(h+1) · δ_k(h+1)

31 Institut AIFB
Backpropagation – Algorithm

Random start values (small random initial weights) can help against local minima; a sketch of the incremental backpropagation loop follows below.
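The algorithm itself is given as a figure on the slide; the following is a sketch of the usual incremental (stochastic) variant for one hidden layer of sigmoid units with squared error, reusing the delta formulas from the previous slides and the XOR example from earlier. It is my own illustration, not the slide's pseudocode.

```python
import numpy as np

def sig(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_backprop(X, Y, n_hidden=3, eta=0.5, epochs=10000, seed=0):
    """Incremental backpropagation, one hidden layer of sigmoid units, squared error."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))      # small random start values
    W2 = rng.normal(scale=0.5, size=(n_hidden + 1, Y.shape[1]))  # +1 row: output-layer bias
    for _ in range(epochs):
        for x, t in zip(X, Y):                          # one training example at a time
            h = sig(x @ W1)                             # hidden outputs o_j(1)
            h_b = np.append(h, 1.0)                     # dummy bias feature for the output layer
            o = sig(h_b @ W2)                           # network outputs o_k(2)
            delta_out = o * (1 - o) * (t - o)           # output deltas: o_k(1-o_k)(t_k-o_k)
            delta_hid = h * (1 - h) * (W2[:-1] @ delta_out)  # hidden deltas (Case II)
            W2 += eta * np.outer(h_b, delta_out)        # w_jk += eta * delta_k * x_j
            W1 += eta * np.outer(x, delta_hid)
    return W1, W2

def predict(X, W1, W2):
    H = sig(X @ W1)
    return sig(np.column_stack([H, np.ones(len(H))]) @ W2)

# Example: the XOR problem (0/1 encoding; leading 1 = bias feature for the hidden layer).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_backprop(X, Y)
print(np.round(predict(X, W1, W2), 2))  # should approach [[0], [1], [1], [0]]
```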

32 Institut AIFB
Multi-Layer ANNs – Properties

  Universal approximation property: every continuous function can be approximated arbitrarily closely by a multi-layer ANN with just one hidden layer (under further assumptions).
  Non-linearity: capable of modeling complex functions.
  Robustness: ignores irrelevant inputs and noise.
  Adaptability: the weights can be adapted when the environment changes.
  Easy to use: black-box view; can be used with little knowledge about the function to be modeled.

33 Institut AIFB
Review: Multi-Layer ANNs – Components

Model class

•  Non-linear functions through arbitrarily complex feed-forward network topologies

Learning algorithm

•  Gradient descent / Backpropagation

Optimization criterion

•  Minimize squared error at output layer

34 Institut AIFB
Problems of Multi-Layer ANNs

  ANNs tend to overfit easily
  recall that overfitting becomes worse as complexity / expressivity increases
  careful evaluation (estimation of the generalization capability) is necessary (see later)

  ANNs tend to run into local minima (in contrast to single-layer networks, the error surface isn't parabolic)

  Training times tend to be very long for complex networks

35 Institut AIFB
Problems of Multi-Layer ANNs

  Long oscillations in 'narrow valleys'
  Stagnation on flat surfaces (plateaus)
  Local minima

[Figures: sketches of the error surface E illustrating each of the three cases]

36 Institut AIFB
Problems of Multi-Layer ANNs: Overfitting

Figure 8a: Plots of the error E as a function of the number of weight updates, for two different robot perception tasks. Figure taken from Mitchell (1997).

37 Institut AIFB
Chapter 3.2.c

Other Network Types

39 Institut AIFB
Other Network Types

  So far we have only looked at feed-forward network topologies, where input vectors are 'fed' through the network in the forward direction:
  single layer (linear patterns) vs. multi-layer (non-linear patterns)
  classification vs. regression

  Other network topologies can solve different tasks:
  Recurrent Networks (the output may feed back as input)
  Hopfield and Auto-associative Networks: input = output; aim: memorizing patterns
  Competitive Learning (clustering tasks)
  Kohonen Networks / Self-Organizing Maps (clustering tasks)
  Deep Learning

40 Institut AIFB
Recurrent Networks

Substantially increased expressivity vs. substantially increased complexity

Figure taken from Mitchell (1997)

41 Institut AIFB
Deep Learning

  Deep networks are broadly divided into different categories:

  Sparse Coding
  Autoencoders
  Restricted Boltzmann Machines
  Deep Belief Networks

  Advanced deep models:
  Deep Boltzmann Machines

42 Institut AIFB
Deep Learning - Motivation

Learning more adaptive, robust and structured representations.

Application areas: object detection, text and image retrieval, speech recognition, multi-modal learning.

43 Institut AIFB
Sparse Coding [Olshausen & Field, 1996]

  Used for unsupervised feature learning.
  Initially developed to explain early visual processing in the brain.

  Input: image patches x_1, …, x_N.
  Learn: a dictionary of bases b_1, …, b_K.

  Each input data vector is represented as a sparse linear combination of the bases:
  x ≈ Σ_j a_j b_j, where the a_j represent the "activations" and the activation vector a is sparse.

44 Institut AIFB
Sparse Coding – Learned Bases (Example)
Images → learned bases ("edges").
A new sample is encoded by its coefficients over the learned bases:
[0, 0, …, 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)

45 Institut AIFB
Sparse Coding - Application

Caltech101 Object Category Dataset — image classification (e.g. SVM) on the learned feature representation.

46 Institut AIFB
Sparse Coding (Training)

  Objective: reconstruction error + sparsity penalty, e.g.
  min_{b,a} Σ_i ( ½ ‖x_i - Σ_j a_{i,j} b_j‖² + λ ‖a_i‖_1 )

  Optimization approach (alternating minimization; see the sketch below):

  1.  Fix the bases b, solve for the activations a (Lasso problem)
  2.  Fix the activations a, solve for the bases b (convex QP problem)
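A rough sketch of this alternating scheme (my own illustration; the Lasso step is solved by proximal gradient / ISTA, the dictionary step by regularized least squares with renormalization, and all sizes are placeholders):

```python
import numpy as np

def soft(u, tau):
    """Soft-thresholding, the proximal operator of the L1 penalty."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def sparse_coding(X, K=32, lam=0.1, outer=20, inner=50, seed=0):
    """Alternating minimization for min_{B,A} 0.5*||X - B A||_F^2 + lam*||A||_1.
    X: (d, N) data matrix, B: (d, K) dictionary of bases, A: (K, N) activations."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    B = rng.normal(size=(d, K))
    B /= np.linalg.norm(B, axis=0, keepdims=True)      # unit-norm bases
    A = np.zeros((K, N))
    for _ in range(outer):
        # 1. Fix the bases, solve for the activations (Lasso) by proximal gradient (ISTA).
        step = 1.0 / np.linalg.norm(B.T @ B, 2)         # 1 / Lipschitz constant
        for _ in range(inner):
            A = soft(A - step * B.T @ (B @ A - X), step * lam)
        # 2. Fix the activations, update the bases by regularized least squares + renorm.
        B = X @ A.T @ np.linalg.inv(A @ A.T + 1e-6 * np.eye(K))
        B /= np.linalg.norm(B, axis=0, keepdims=True) + 1e-12
    return B, A

# Illustrative run on random "patches"; real inputs would be image patches.
X = np.random.default_rng(1).normal(size=(64, 200))     # 64-dim patches, 200 samples
B, A = sparse_coding(X, K=32, lam=0.2)
print(B.shape, A.shape, float((np.abs(A) > 1e-8).mean()))  # dictionary, sparse activations
```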

47 Institut AIFB
Sparse Coding (Testing)

  Learn the feature representation of a new sample using the learned bases.

  Input: image patch x + dictionary of bases
  Output: sparse representation a of the new sample

  Example: for an image patch,
  [0, 0, …, 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)

48 Institut AIFB
Autoencoder [Hinton & Zemel, 1994]

[Figure: autoencoder. A bottom-up, feed-forward encoder maps the input to a feature representation; a top-down, generative (feed-back) decoder maps the representation back to a reconstruction of the input.]

49 Institut AIFB
Autoencoder - Application

  Functions for encoder and decoder:

  Input: image x
  Encoder: z = σ(Wx), where W (weight matrix) is the encoder filter and σ is a (possibly identity) non-linearity
  Decoder: x̂ = Dz, where D (weight matrix) is the decoder filter

50 Institut AIFB
Autoencoder (Training)
Given an input x, the reconstruction through encoder and decoder is x̂ = D·σ(Wx) (decoder applied to the encoder output),
where D denotes the dimensionality of the input and output and K the dimensionality of the hidden layer (K < D).
Optimization: W and the decoder are determined by minimizing the reconstruction error, e.g. Σ_i ‖x_i - x̂_i‖².
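A compact sketch of such an autoencoder (my own illustration: a sigmoid encoder W and a linear decoder, trained by gradient descent on the squared reconstruction error; the dimensions 64 and 16 are placeholders):

```python
import numpy as np

def sig(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, K=16, eta=0.05, epochs=500, seed=0):
    """Minimize sum_i ||x_i - Dec sig(W x_i)||^2 over the encoder W (K x D_in)
    and the decoder Dec (D_in x K), with K < D_in."""
    rng = np.random.default_rng(seed)
    D_in = X.shape[1]
    W = rng.normal(scale=0.1, size=(K, D_in))    # encoder filter
    Dec = rng.normal(scale=0.1, size=(D_in, K))  # decoder filter
    for _ in range(epochs):
        Z = sig(X @ W.T)                         # encoder: z = sig(W x)
        Xhat = Z @ Dec.T                         # decoder: x_hat = Dec z
        R = Xhat - X                             # reconstruction residual
        grad_Dec = R.T @ Z                       # gradient of 0.5*||R||^2 w.r.t. Dec
        grad_W = ((R @ Dec) * Z * (1 - Z)).T @ X # gradient w.r.t. W (chain rule through sig)
        Dec -= eta * grad_Dec / len(X)
        W -= eta * grad_W / len(X)
    return W, Dec

X = np.random.default_rng(1).normal(size=(200, 64))    # placeholder "images", D = 64
W, Dec = train_autoencoder(X, K=16)                    # K = 16 hidden units (K < D)
Z = sig(X @ W.T)                                       # learned feature representation
print(Z.shape, float(np.mean((Z @ Dec.T - X) ** 2)))   # (200, 16), reconstruction error
```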

51 Institut AIFB
Knowledge Discovery Lecture WS14/15
22.10.2014  Einführung (Introduction)            Basics, Overview
29.10.2014  Design of KD-experiments
05.11.2014  Linear Classifiers
12.11.2014  Data Warehousing & OLAP
19.11.2014  Non-Linear Classifiers (ANNs)        Supervised Techniques,
26.11.2014  Kernels, SVM                         Vector+Label Representation
03.12.2014  entfällt (cancelled)
10.12.2014  Decision Trees
17.12.2014  IBL & Clustering                     Unsupervised Techniques
07.01.2015  Relational Learning I                Semi-supervised Techniques,
14.01.2015  Relational Learning II               Relational Representation
21.01.2015  Relational Learning III
28.01.2015  Textmining
04.02.2015  Gastvortrag (guest lecture)          Meta-Topics
11.02.2015  CRISP, Visualisierung (visualization)

52 Institut AIFB
