
Knowledge Discovery WS 14/15

Non-Linear Classifiers (ANNs) 5  


Prof. Dr. Rudi Studer, Dr. Achim Rettinger*, Dipl.-Inform. Lei Zhang
{rudi.studer, achim.rettinger, l.zhang}@kit.edu

INSTITUT FÜR ANGEWANDTE INFORMATIK UND FORMALE BESCHREIBUNGSVERFAHREN (AIFB)

KIT – University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
www.kit.edu
Knowledge Discovery Lecture WS14/15
22.10.2014  Einführung (Introduction)            Basics, Overview
29.10.2014  Design of KD-experiments
05.11.2014  Linear Classifiers
12.11.2014  Data Warehousing & OLAP
19.11.2014  Non-Linear Classifiers (ANNs)        Supervised Techniques,
26.11.2014  Kernels, SVM                         Vector+Label Representation
03.12.2014  entfällt (cancelled)
10.12.2014  Decision Trees
17.12.2014  IBL & Clustering                     Unsupervised Techniques
07.01.2015  Relational Learning I                Semi-supervised Techniques,
14.01.2015  Relational Learning II               Relational Representation
21.01.2015  Relational Learning III
28.01.2015  Textmining
04.02.2015  Gastvortrag (guest lecture)          Meta-Topics
11.02.2015  CRISP, Visualisierung (visualization)

2 Institut AIFB
The Data Matrix for Supervised Learning

X_j                                  j-th input variable
X = (X_0, …, X_{M-1})^T              vector of input variables
M                                    number of input variables
N                                    number of data points
Y                                    output variable
x_i = (x_{i,0}, …, x_{i,M-1})^T      i-th input vector
x_{i,j}                              j-th component of x_i
y_i                                  i-th target value
d_i = (x_{i,0}, …, x_{i,M-1}, y_i)^T i-th pattern
D = {d_1, …, d_N}                    (training) data set
z                                    test input vector
t                                    unknown test target for z
X = (x_1, …, x_N)^T                  design matrix

3 Institut AIFB
Recap: Linear Models

Vector of inputs: X = (x_1, …, x_N)

Single output: Y

Linear model: ŷ_i = Σ_{j=1}^{M} w_j x_{i,j} + b

Estimate: w, b

4 Institut AIFB
Recap: Perceptron – Components

Model class

•  Perceptrons = linear discriminant functions = separating hyperplanes

Learning algorithm

•  Gradient descent / Delta Rule

Optimization criterion

•  Minimize squared error at output layer

5 Institut AIFB
Linear Separability and the XOR problem
Famous perceptron tasks: simulate simple Boolean functions
(i.e. encode corresponding truth tables) by encoding inputs as -1
(for false) and +1 (for true)

[Figure: truth-table diagrams for AND and XOR; the AND case is solved by w1 = 1, w2 = 1, b = -1, the XOR case is marked with '??']

Sad but true: even a simple XOR cannot be simulated adequately by a linear threshold function.

6 Institut AIFB
Linear separability and the Iris Dataset…

7 Institut AIFB
Linear Separability and the XOR problem

  The Perceptron is able to simulate many different Boolean functions, e.g. AND, OR, NAND, NOR.
  Example: for the AND function we can choose (among others) w1 = 1, w2 = 1, b = -1 (see the sketch below).
  NB: AND and OR are special cases of the 'm-of-n functions': at least m out of n inputs need to be 1.
  However, XOR and other more complicated expressions in propositional logic cannot be expressed by a Perceptron.

  This is a problem of the model class, not of the training algorithm involved.

Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, USA.
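The following is a small sketch of these two points (my own illustration, not part of the slides), using the -1/+1 encoding introduced above: the threshold unit with w1 = 1, w2 = 1, b = -1 reproduces the AND truth table, while a brute-force search over a grid of weights finds no linear threshold unit that reproduces XOR.

```python
import itertools
import numpy as np

def threshold_unit(x, w, b):
    """Linear threshold unit: returns +1 if w.x + b > 0, else -1."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Inputs encoded as -1 (false) and +1 (true), as on the slide.
inputs = [np.array(p) for p in itertools.product([-1, 1], repeat=2)]
AND = {tuple(x): (1 if (x == 1).all() else -1) for x in inputs}
XOR = {tuple(x): (1 if x[0] != x[1] else -1) for x in inputs}

# w1 = 1, w2 = 1, b = -1 simulates AND exactly.
print(all(threshold_unit(x, np.array([1, 1]), -1) == AND[tuple(x)] for x in inputs))  # True

# Brute-force search over a coarse weight grid: no linear threshold unit gets XOR right.
grid = np.linspace(-2, 2, 9)
solvable = any(
    all(threshold_unit(x, np.array([w1, w2]), b) == XOR[tuple(x)] for x in inputs)
    for w1 in grid for w2 in grid for b in grid
)
print(solvable)  # False: XOR is not linearly separable
```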

8 Institut AIFB
Basis expansion (Prelude for ANNs & SVMs)
  Linear model: f(x_i, w) = w_0 + Σ_{j=1}^{M-1} w_j x_{i,j}

  Idea: augment the inputs with additional functions φ(x)

  Basis-expanded linear model: f(x_i, w) = w_0 + Σ_{j=1}^{M-1} w_j φ_j(x_{i,j})
11 Institut AIFB
Example 1: Basis Expansion

  XOR is not linearly separable.

  Add a basis expansion: φ(x_i) = (x_{i,1}, x_{i,2}, x_{i,3} = z_3 = x_{i,1} · x_{i,2})

  w = (1, -2, -2, 4)
  f(x) = 1 - 2x_1 - 2x_2 + 4x_1x_2
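A quick check of this example (a sketch of my own; it assumes a 0/1 encoding of the truth table and reads the sign of f as the class): with the added feature x1·x2, the four XOR patterns become linearly separable in the expanded space.

```python
import numpy as np

def f(x1, x2, w=(1.0, -2.0, -2.0, 4.0)):
    """Basis-expanded linear model from the slide: features (1, x1, x2, x1*x2)."""
    phi = np.array([1.0, x1, x2, x1 * x2])  # z3 = x1*x2 is the added basis function
    return float(np.dot(w, phi))

# Assuming a 0/1 encoding of the XOR truth table:
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "XOR =", x1 ^ x2, " f(x) =", f(x1, x2))
# f(x) = 1 - 2*x1 - 2*x2 + 4*x1*x2 is positive on one class (XOR = 0)
# and negative on the other (XOR = 1): the expanded data is linearly separable.
```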

12 Institut AIFB
Example 2: Basis Expansion

Polynomial: f(x) = x - 0.3x³

  Basis expansion: z = (1, x, x², x³)

  Weights: w = (0, 1, 0, -0.3)

  Linear regression + basis expansion = non-linear model
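A minimal sketch of this recipe (my own illustration; the noisy samples of the polynomial are an assumption, not data from the lecture): ordinary least squares on the expanded features z = (1, x, x², x³) recovers weights close to (0, 1, 0, -0.3).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the target polynomial f(x) = x - 0.3*x**3 plus a little noise.
x = rng.uniform(-3, 3, size=200)
y = x - 0.3 * x**3 + rng.normal(scale=0.1, size=x.shape)

# Basis expansion z = (1, x, x^2, x^3): the model stays linear in the weights.
Z = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares on the expanded features.
w, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(np.round(w, 2))  # approximately [ 0.  1.  0. -0.3]
```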

13 Institut AIFB
Popular Basis Expansions

  φ_j(x) = x_1, x_2, …, x_{M-1} (the inputs themselves)
  φ_j(x) = x_1², x_1x_2, … (polynomial terms)
  φ_j(x) = log(x_1), √x_1, … (non-linear transformations of single inputs)
  Piecewise polynomials, splines, wavelet bases
  φ_j(x) = exp(-v‖x - x_m‖²) (radial basis functions)
  φ(x) = sig(Σ_{j=0}^{M-1} v_j x_j) with sig(arg) = 1/(1 + exp(-arg)) (sigmoidal basis functions; see the sketch below)
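As a small illustration of the last two families (the centres x_m, the scale v and the projection vectors v_j below are arbitrary placeholders, not values from the lecture):

```python
import numpy as np

def rbf_features(X, centres, v=1.0):
    """phi_m(x) = exp(-v * ||x - x_m||^2) for every centre x_m."""
    # X: (N, M) inputs, centres: (K, M) chosen centre vectors.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-v * d2)

def sigmoid_features(X, V):
    """phi_k(x) = sig(sum_j v_{k,j} * x_j) with sig(a) = 1 / (1 + exp(-a))."""
    # V: (K, M) projection vectors, one per basis function.
    return 1.0 / (1.0 + np.exp(-X @ V.T))

X = np.array([[0.0, 0.0], [1.0, 2.0]])           # two example inputs
centres = np.array([[0.0, 0.0], [1.0, 1.0]])     # assumed RBF centres
V = np.array([[1.0, -1.0], [0.5, 0.5]])          # assumed projection vectors
print(rbf_features(X, centres, v=0.5))
print(sigmoid_features(X, V))
```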
14 Institut AIFB
Challenges

How do we determine and parametrize the basis expansions?

ANNs have proven to achieve good results with few basis functions.

15 Institut AIFB
Chapter 4.2.

Multi-Layer Neural Networks

16 Institut AIFB
Abstract Model of the Neuron

[Figure: a biological neuron (dendrites, cell body, axon) next to its abstract model: weighted summation of the inputs followed by an activation function.]

17 Institut AIFB
Feed-Forward Networks

[Figure: feed-forward network. The input vector enters the input layer; weighted & adaptive connections lead to the hidden layer and from there, again via weighted & adaptive connections, to the output layer.]
18 Institut AIFB
Application: ALVINN [Mitchell 1997]

Figure 1: A neural network learning to steer an autonomous vehicle [Mitchell 1997]
19 Institut AIFB
Components of an Artificial Neural Network

  Network topology (= model class!)


  Set of artificial neurons
  Structure of the connections between the neurons ("synapses")

  Actual parameter settings (= model)


  Connectivity between the neurons (excitatory vs. inhibitory)
  Activation thresholds within the neurons

  Training Algorithm
  "Learning" means determining appropriate connection and
threshold weights

20 Institut AIFB
Multi-Layer Neural Networks

  Basic idea: combine / stack multiple neurons in multiple layers to achieve more complex prediction models.

  The activation at the "hidden layers" corresponds to a non-linear transformation of the input space.

  Example (next slide): a two-layer neural network (we don't count the input layer) with one "hidden" layer (3 units) and one output layer (1 unit).

21 Institut AIFB
Expressivity of Multi-Layer ANNs

[Figure: a two-layer threshold network solving XOR: input layer (x1, x2), a hidden layer with two threshold units (t=1 and t=2), and an output unit (t=1) that receives an inhibitory weight of -2 from the t=2 unit.]

  t = threshold (the unit returns 1 if its net input is >= t)

  The XOR problem can be solved with only one extra layer (see the sketch below).
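A sketch of one consistent reading of the figure (assuming 0/1 inputs and the rule "output 1 if the net input is >= t"): a hidden OR-like unit (t = 1), a hidden AND-like unit (t = 2) and an output unit (t = 1) with an inhibitory weight of -2 from the AND unit together compute XOR.

```python
def step(net, t):
    """Threshold unit from the slide: output 1 if net input >= t, else 0."""
    return 1 if net >= t else 0

def xor_net(x1, x2):
    h1 = step(1 * x1 + 1 * x2, t=1)    # hidden unit 1: fires for OR(x1, x2)
    h2 = step(1 * x1 + 1 * x2, t=2)    # hidden unit 2: fires for AND(x1, x2)
    return step(1 * h1 - 2 * h2, t=1)  # output: OR and not AND = XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))  # 0, 1, 1, 0
```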

22 Institut AIFB
Interpretation of Multi-Layer ANNs

  Each neuron in the hidden layer computes a single non-linear feature of the data (and this feature can itself be adapted during training!).

  Cf. the basis functions from the previous section.

23 Institut AIFB
Sigmoid Activation Function

  Introduces non-linearity into the network's calculations, as it "squashes" the neuron's activation into the range [0,1]: sig(net) = 1/(1 + exp(-net)).
  Has an extremely simple derivative, which we need for gradient descent: sig'(net) = sig(net) · (1 - sig(net)).
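A tiny sketch of these two facts (my own illustration):

```python
import numpy as np

def sig(net):
    """Logistic sigmoid: squashes any activation into the interval between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-net))

def sig_prime(net):
    """Derivative needed for gradient descent: sig'(net) = sig(net) * (1 - sig(net))."""
    s = sig(net)
    return s * (1.0 - s)

net = np.linspace(-5, 5, 5)
print(sig(net))        # values between 0 and 1
print(sig_prime(net))  # maximal at net = 0, vanishing for large |net|
```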

24 Institut AIFB
Notation

  Network layers 1…n; layer 0 denotes the input vector
  w_ij(l) – connection weight from the i-th neuron in layer l-1 to the j-th neuron in layer l
  x_i(l) – i-th input component to any neuron in layer l
  net_j(l) – net input (activation) of the j-th neuron in layer l, with net_j(l) = Σ_i w_ij(l) · x_i(l)
  o_j(l) – output of the j-th neuron in layer l, with o_j(l) = Sig(net_j(l)) = x_j(l+1)
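A sketch of a forward pass written directly in this notation (the weight matrices below are randomly initialized placeholders, not weights from the lecture):

```python
import numpy as np

def sig(net):
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, weights):
    """Forward pass: x_i(l) are the inputs to layer l, net_j(l) = sum_i w_ij(l) x_i(l),
    o_j(l) = sig(net_j(l)) = x_j(l+1)."""
    outputs = []               # o(l) for every layer l = 1..n
    x_l = x                    # layer 0: the input vector
    for W in weights:          # each W has shape (#neurons in layer l-1, #neurons in layer l)
        net_l = x_l @ W        # net_j(l) = sum_i w_ij(l) * x_i(l)
        o_l = sig(net_l)       # o_j(l) = sig(net_j(l))
        outputs.append(o_l)
        x_l = o_l              # o_j(l) becomes x_j(l+1), the input to the next layer
    return outputs

rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]  # 2 inputs, 3 hidden, 1 output
print(forward(np.array([1.0, 0.5]), weights))
```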

25 Institut AIFB
Example – 2-layer ANN:
final activation at output layer

[Figure: a 2-layer ANN. Layer 0 (the input) produces outputs o_i(0) = x_i(1); layer 1 produces outputs o_j(1) = x_j(2); layer 2 (the output layer) produces the final activation. The drawn weights w_11(1), w_11(2), w_21(2) follow the notation above.]

NB: To simplify notation we again drop the bias parameter b and include it in the weight vector: w_0 = b. (Imagine this to be a dummy feature that equals 1 for all inputs.)

26 Institut AIFB
Backpropagation – General Considerations

  Recall: to simplify notation we again drop the bias term b and include it in the weight vector as w_0 = b.

  Generalization: we consider networks with multiple output units; the corresponding quadratic error over the k outputs becomes
  E(w) = ½ Σ_{d∈D} Σ_{k∈outputs} (t_{k,d} - o_{k,d})²,
  or, for an individual training example d,
  E_d(w) = ½ Σ_{k∈outputs} (t_k - o_k)².

  Backpropagation works like (incremental) gradient descent:
  Updates at the output layer can be computed directly from the quadratic error.
  Updates at the hidden layers need to take into account how the hidden units influence the eventual quadratic error indirectly.

27 Institut AIFB
Backpropagation – Illustration

[Figure: the data stream (propagation) flows forward from the input layer through the hidden layers to the output layer; the error stream (backpropagation) flows backward from the output error through the hidden layers.]

28 Institut AIFB
Backpropagation – General Considerations (cont)

  Like for the individual Perceptron, we can imagine an error surface over all weights in the network (or subsets thereof).

  For any weight w_jk(l) in the network we would like to know its influence on the final error E, i.e. the corresponding component of the gradient, ∂E/∂w_jk(l).

  Note that the weight w_jk(l) influences the rest of the network (and thus the error) only through the net input net_k(l) of the neuron k it connects to.
  We can therefore rewrite the partial derivative, regardless of further details, as
  ∂E/∂w_jk(l) = (∂E/∂net_k(l)) · (∂net_k(l)/∂w_jk(l)) = (∂E/∂net_k(l)) · x_j(l),
  with the auxiliary variable δ_k(l) = -∂E/∂net_k(l), which eventually leads to the weight update (see later): Δw_jk(l) = η · δ_k(l) · x_j(l).

29 Institut AIFB
Computing δ – Case I: Units at the output layer (n)

We thus have, for a sigmoid output unit k in the output layer n with target t_k:
δ_k(n) = -∂E_d/∂net_k(n) = o_k(n) · (1 - o_k(n)) · (t_k - o_k(n))

30 Institut AIFB
Computing δ – Case II: Units at the hidden layer (h)

We thus have, for a sigmoid hidden unit j in a hidden layer h, summing over all units k in the next layer that j feeds into:
δ_j(h) = o_j(h) · (1 - o_j(h)) · Σ_k w_jk(h+1) · δ_k(h+1)

31 Institut AIFB
Backpropagation – Algorithm

Random start values (small random initial weights) can help against local minima; a sketch of the incremental backpropagation loop follows below.
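The algorithm itself is given as a figure on the slide; the following is a sketch of the usual incremental (stochastic) variant for one hidden layer of sigmoid units with squared error, reusing the delta formulas from the previous slides and the XOR example from earlier. It is my own illustration, not the slide's pseudocode.

```python
import numpy as np

def sig(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_backprop(X, Y, n_hidden=3, eta=0.5, epochs=10000, seed=0):
    """Incremental backpropagation, one hidden layer of sigmoid units, squared error."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))      # small random start values
    W2 = rng.normal(scale=0.5, size=(n_hidden + 1, Y.shape[1]))  # +1 row: output-layer bias
    for _ in range(epochs):
        for x, t in zip(X, Y):                          # one training example at a time
            h = sig(x @ W1)                             # hidden outputs o_j(1)
            h_b = np.append(h, 1.0)                     # dummy bias feature for the output layer
            o = sig(h_b @ W2)                           # network outputs o_k(2)
            delta_out = o * (1 - o) * (t - o)           # output deltas: o_k(1-o_k)(t_k-o_k)
            delta_hid = h * (1 - h) * (W2[:-1] @ delta_out)  # hidden deltas (Case II)
            W2 += eta * np.outer(h_b, delta_out)        # w_jk += eta * delta_k * x_j
            W1 += eta * np.outer(x, delta_hid)
    return W1, W2

def predict(X, W1, W2):
    H = sig(X @ W1)
    return sig(np.column_stack([H, np.ones(len(H))]) @ W2)

# Example: the XOR problem (0/1 encoding; leading 1 = bias feature for the hidden layer).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_backprop(X, Y)
print(np.round(predict(X, W1, W2), 2))  # should approach [[0], [1], [1], [0]]
```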

32 Institut AIFB
Multi-Layer ANNs – Properties

  Universal approximation property: every continuous function can be approximated arbitrarily closely by a multi-layer ANN with just one hidden layer (under further assumptions).
  Non-linearity: capable of modeling complex functions.
  Robustness: ignores irrelevant inputs and noise.
  Adaptability: the weights can be adapted when the environment changes.
  Easy to use: black-box view; can be used with little knowledge about the function to be modeled.

33 Institut AIFB
Review: Multi-Layer ANNs – Components

Model class

•  Non-linear functions through arbitrarily complex feed-forward network topologies

Learning algorithm

•  Gradient descent / Backpropagation

Optimization criterion

•  Minimize squared error at output layer

34 Institut AIFB
Problems of Multi-Layer ANNs

  ANNs tend to overfit easily
  recall that overfitting becomes worse as complexity / expressivity increases
  careful evaluation (estimation of the generalization capability) is necessary (see later)

  ANNs tend to run into local minima (in contrast to single-layer networks, the error surface isn't parabolic)

  Training times tend to be very long for complex networks

35 Institut AIFB
Problems of Multi-Layer ANNs

  Long oscillations in 'narrow valleys'
  Stagnation on flat surfaces (plateaus)
  Local minima

[Figures: sketches of the error surface E illustrating each of the three cases]

36 Institut AIFB
Problems of Multi-Layer ANNs: Overfitting

Figure 8a: Plots of the error E as a function of the number of weight updates, for two different robot perception tasks. Figure taken from Mitchell (1997).

37 Institut AIFB
Chapter 3.2.c

Other Network Types

39 Institut AIFB
Other Network Types

  So far we have only looked at feed-forward network topologies, where input vectors are 'fed' through the network in the forward direction:
  single layer (linear patterns) vs. multi-layer (non-linear patterns)
  classification vs. regression

  Other network topologies can solve different tasks:
  Recurrent Networks (the output may feed back as input)
  Hopfield and Auto-associative Networks: input = output; aim: memorizing patterns
  Competitive Learning (clustering tasks)
  Kohonen Networks / Self-Organizing Maps (clustering tasks)
  Deep Learning

40 Institut AIFB
Recurrent Networks

Substantially increased expressivity vs. substantially increased complexity

Figure taken from Mitchell (1997)

41 Institut AIFB
Deep Learning

  Deep networks are broadly divided into different categories:

  Sparse Coding
  Autoencoders
  Restricted Boltzmann Machines
  Deep Belief Networks

  Advanced deep models:
  Deep Boltzmann Machines

42 Institut AIFB
Deep Learning - Motivation

Learning more adaptive, robust and structured representations.

Application areas: object detection, text and image retrieval, speech recognition, multi-modal learning.

43 Institut AIFB
Sparse Coding [Olshausen & Field, 1996]

  Used for unsupervised feature learning.
  Initially developed to explain early visual processing in the brain.

  Input: image patches x_1, …, x_N.
  Learn: a dictionary of bases b_1, …, b_K.

  Each input data vector is represented as a sparse linear combination of the bases:
  x ≈ Σ_j a_j b_j, where the a_j represent the "activations" and the activation vector a is sparse.

44 Institut AIFB
Sparse Coding – Learned Bases (Example)
Images → learned bases ("edges").
A new sample is encoded by its coefficients over the learned bases:
[0, 0, …, 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)

45 Institut AIFB
Sparse Coding - Application

Caltech101 Object Category Dataset — image classification (e.g. SVM) on the learned feature representation.

46 Institut AIFB
Sparse Coding (Training)

  Objective: reconstruction error + sparsity penalty, e.g.
  min_{b,a} Σ_i ( ½ ‖x_i - Σ_j a_{i,j} b_j‖² + λ ‖a_i‖_1 )

  Optimization approach (alternating minimization; see the sketch below):

  1.  Fix the bases b, solve for the activations a (Lasso problem)
  2.  Fix the activations a, solve for the bases b (convex QP problem)
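A rough sketch of this alternating scheme (my own illustration; the Lasso step is solved by proximal gradient / ISTA, the dictionary step by regularized least squares with renormalization, and all sizes are placeholders):

```python
import numpy as np

def soft(u, tau):
    """Soft-thresholding, the proximal operator of the L1 penalty."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def sparse_coding(X, K=32, lam=0.1, outer=20, inner=50, seed=0):
    """Alternating minimization for min_{B,A} 0.5*||X - B A||_F^2 + lam*||A||_1.
    X: (d, N) data matrix, B: (d, K) dictionary of bases, A: (K, N) activations."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    B = rng.normal(size=(d, K))
    B /= np.linalg.norm(B, axis=0, keepdims=True)      # unit-norm bases
    A = np.zeros((K, N))
    for _ in range(outer):
        # 1. Fix the bases, solve for the activations (Lasso) by proximal gradient (ISTA).
        step = 1.0 / np.linalg.norm(B.T @ B, 2)         # 1 / Lipschitz constant
        for _ in range(inner):
            A = soft(A - step * B.T @ (B @ A - X), step * lam)
        # 2. Fix the activations, update the bases by regularized least squares + renorm.
        B = X @ A.T @ np.linalg.inv(A @ A.T + 1e-6 * np.eye(K))
        B /= np.linalg.norm(B, axis=0, keepdims=True) + 1e-12
    return B, A

# Illustrative run on random "patches"; real inputs would be image patches.
X = np.random.default_rng(1).normal(size=(64, 200))     # 64-dim patches, 200 samples
B, A = sparse_coding(X, K=32, lam=0.2)
print(B.shape, A.shape, float((np.abs(A) > 1e-8).mean()))  # dictionary, sparse activations
```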

47 Institut AIFB
Sparse Coding (Testing)

  Learn the feature representation of a new sample using the learned bases.

  Input: image patch x + dictionary of bases
  Output: sparse representation a of the new sample

  Example: for an image patch,
  [0, 0, …, 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)

48 Institut AIFB
Autoencoder [Hinton & Zemel, 1994]

[Figure: autoencoder. A bottom-up, feed-forward encoder maps the input to a feature representation; a top-down, generative (feed-back) decoder maps the representation back to a reconstruction of the input.]

49 Institut AIFB
Autoencoder - Application

  Functions for encoder and decoder:

  Input: image x
  Encoder: z = σ(Wx), where W (weight matrix) is the encoder filter and σ is a (possibly identity) non-linearity
  Decoder: x̂ = Dz, where D (weight matrix) is the decoder filter

50 Institut AIFB
Autoencoder (Training)
Given an input x, the reconstruction through encoder and decoder is x̂ = D·σ(Wx) (decoder applied to the encoder output),
where D denotes the dimensionality of the input and output and K the dimensionality of the hidden layer (K < D).
Optimization: W and the decoder are determined by minimizing the reconstruction error, e.g. Σ_i ‖x_i - x̂_i‖².
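A compact sketch of such an autoencoder (my own illustration: a sigmoid encoder W and a linear decoder, trained by gradient descent on the squared reconstruction error; the dimensions 64 and 16 are placeholders):

```python
import numpy as np

def sig(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, K=16, eta=0.05, epochs=500, seed=0):
    """Minimize sum_i ||x_i - Dec sig(W x_i)||^2 over the encoder W (K x D_in)
    and the decoder Dec (D_in x K), with K < D_in."""
    rng = np.random.default_rng(seed)
    D_in = X.shape[1]
    W = rng.normal(scale=0.1, size=(K, D_in))    # encoder filter
    Dec = rng.normal(scale=0.1, size=(D_in, K))  # decoder filter
    for _ in range(epochs):
        Z = sig(X @ W.T)                         # encoder: z = sig(W x)
        Xhat = Z @ Dec.T                         # decoder: x_hat = Dec z
        R = Xhat - X                             # reconstruction residual
        grad_Dec = R.T @ Z                       # gradient of 0.5*||R||^2 w.r.t. Dec
        grad_W = ((R @ Dec) * Z * (1 - Z)).T @ X # gradient w.r.t. W (chain rule through sig)
        Dec -= eta * grad_Dec / len(X)
        W -= eta * grad_W / len(X)
    return W, Dec

X = np.random.default_rng(1).normal(size=(200, 64))    # placeholder "images", D = 64
W, Dec = train_autoencoder(X, K=16)                    # K = 16 hidden units (K < D)
Z = sig(X @ W.T)                                       # learned feature representation
print(Z.shape, float(np.mean((Z @ Dec.T - X) ** 2)))   # (200, 16), reconstruction error
```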

51 Institut AIFB
Knowledge Discovery Lecture WS14/15
22.10.2014  Einführung (Introduction)            Basics, Overview
29.10.2014  Design of KD-experiments
05.11.2014  Linear Classifiers
12.11.2014  Data Warehousing & OLAP
19.11.2014  Non-Linear Classifiers (ANNs)        Supervised Techniques,
26.11.2014  Kernels, SVM                         Vector+Label Representation
03.12.2014  entfällt (cancelled)
10.12.2014  Decision Trees
17.12.2014  IBL & Clustering                     Unsupervised Techniques
07.01.2015  Relational Learning I                Semi-supervised Techniques,
14.01.2015  Relational Learning II               Relational Representation
21.01.2015  Relational Learning III
28.01.2015  Textmining
04.02.2015  Gastvortrag (guest lecture)          Meta-Topics
11.02.2015  CRISP, Visualisierung (visualization)

52 Institut AIFB
