Training Neural Networks
Robert Turetsky
Columbia University
rjt72@columbia.edu
Systems, Man and Cybernetics Society
IEEE North Jersey Chapter
December 12, 2000
Objective
• Introduce fundamental concepts in Artificial Neural Networks
• Discuss methods of training ANNs
• Explore some uses of ANNs
• Assess the accuracy of artificial neurons as models for biological neurons
• Discuss current views, ideas and research
Organization
• Why Neural Networks?
• Single TLUs
• Training Neural Nets: Backpropagation
• Working with Neural Networks
• Modeling the neuron
• The multi-agent architecture
• Directions and destinations
Why Neural Networks?
The "Von Neumann" architecture
• Memory for programs and data
• CPU for math and logic
• Control unit to steer program flow
Von Neumann vs. ANNs
Von Neumann:
• Follows rules
• Solution can/must be formally specified
• Cannot generalize
• Not error tolerant
Neural Net:
• Learns from data
• Rules on data are not visible
• Able to generalize
• Copes well with noise
Circuits that LEARN
• Three types of learning:
  – Supervised Learning
  – Unsupervised Learning
  – Reinforcement Learning
• Hebbian networks: reward 'good' paths, punish 'bad' paths (see the toy sketch below)
• Train a neural net by adjusting its weights
• PAC (Probably Approximately Correct) theory: Kearns & Vazirani 1994, Haussler 1990
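As a toy illustration of Hebbian adjustment (my own sketch, not from the slides; the inputs and learning rate eta are assumptions):

```python
def hebbian_step(w, x, y, eta=0.1):
    """Hebb's rule: strengthen weight w[i] when input x[i] and
    output y are active together ('reward good paths')."""
    return [wi + eta * xi * y for wi, xi in zip(w, x)]

w = [0.0, 0.0, 0.0]
for _ in range(3):
    y = 1  # suppose the unit fired on input [1, 0, 1]
    w = hebbian_step(w, [1, 0, 1], y)
print(w)  # co-active connections strengthen: ~[0.3, 0.0, 0.3]
```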
Supervised Learning Concepts
• Training set Ξ: input/output pairs
  – Supervised learning because we know the correct action for every input in Ξ
  – We want our neural net to act correctly on as many training vectors as possible
  – Choose the training set to be a typical set of inputs
  – The neural net will (hopefully) generalize to all inputs based on the training set
• Validation set: check how well our training generalizes
Neural Net Applications
• Miros Corp.: face recognition
• Handwriting recognition
• BrainMaker: medical diagnosis
• Bushnell: neural net for combinational automatic test pattern generation
• ALVINN: Knight Rider in real life!
• Getting rich: LBS Capital Management predicts the S&P 500
History of Neural Networks
• 1943: McCulloch and Pitts - modeling the neuron for parallel distributed processing
• 1958: Rosenblatt - the Perceptron
• 1969: Minsky and Papert publish limits on the ability of a perceptron to generalize
• 1970s and 1980s: ANN renaissance
• 1986: Rumelhart, Hinton and Williams present backpropagation
• 1989: Tsividis: neural network on a chip
Threshold Logic Units
The building blocks of
Neural Networks
The TLU at a glance
• TLU: Threshold Logic Unit
• Loosely based on the firing of biological neurons
• Many inputs, one binary output
• Threshold: biasing function
• Squashing function compresses infinite input into the range 0 to 1
The TLU in Action
Training TLUs: Notation
• θ = threshold of the TLU
• X = input vector
• W = weight vector
• s = X · W
  i.e. if s ≥ θ, output = 1; if s < θ, output = 0
• d = desired output of the TLU
• f = output of the TLU with current X and W
Augmented Vectors
• Motivation: train the threshold θ at the same time as the input weights
• X · W ≥ θ is the same as X · W − θ ≥ 0
• Set the threshold of the TLU to 0
• Augment W: W = [w₁, w₂, …, wₙ, −θ]
• Augment X: X = [x₁, x₂, …, xₙ, 1]
• New TLU equation: X · W ≥ 0 (for augmented X and W) — see the sketch below
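To make the augmented-vector convention concrete, here is a minimal TLU sketch in Python (my own illustration; the AND-gate weights and threshold are assumptions chosen for the example):

```python
def tlu(x, w):
    """Augmented-vector TLU: append the constant input 1 to x,
    fold the threshold into w as its last component (-theta),
    and fire iff the dot product is >= 0."""
    x_aug = list(x) + [1]                        # augment X with constant 1
    s = sum(xi * wi for xi, wi in zip(x_aug, w))
    return 1 if s >= 0 else 0

# AND gate: w1 = w2 = 1, theta = 1.5, so augmented W = [1, 1, -1.5]
w_and = [1.0, 1.0, -1.5]
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, tlu(x, w_and))   # fires only for (1, 1)
```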
Gradient Descent Methods
• Error function: how far off are we?
  – Example error function: ε = Σᵢ (dᵢ − fᵢ)²
• ε depends on the weight values
• Gradient descent: minimize the error by moving the weights along the decreasing slope of the error
• The idea: iterate through the training set and adjust the weights to minimize the gradient of the error
Gradient Descent: The Math
We have ε = (d − f)²
Gradient of ε:
  ∂ε/∂W = [∂ε/∂w₁, …, ∂ε/∂wᵢ, …, ∂ε/∂wₙ₊₁]
Using the chain rule:
  ∂ε/∂W = (∂ε/∂s)(∂s/∂W)
Since ∂s/∂W = X, we have:
  ∂ε/∂W = (∂ε/∂s) X
Also:
  ∂ε/∂s = −2(d − f) ∂f/∂s
Which finally gives:
  ∂ε/∂W = −2(d − f) (∂f/∂s) X
Gradient Descent: Back to reality
• So we have ∂ε/∂W = −2(d − f) (∂f/∂s) X
• The problem: ∂f/∂s does not exist for a threshold unit (f is a step function)
• Three solutions (sketched in code below):
  – Ignore it: the error-correction procedure
  – Fudge it: Widrow-Hoff
  – Approximate it: the generalized delta procedure
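A hedged sketch of what the three fixes look like as per-pattern weight updates (my own illustration, assuming an augmented input vector x, target d, and learning rate c):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def update(w, x, d, c=0.1, method="delta"):
    """One weight update on augmented input x with target d.
    'error_correction' ignores df/ds and uses the thresholded output;
    'widrow_hoff' fudges it by training the linear output s directly;
    'delta' approximates it with the sigmoid derivative f(1-f)."""
    s = np.dot(x, w)
    if method == "error_correction":
        f = 1.0 if s >= 0 else 0.0
        return w + c * (d - f) * x
    if method == "widrow_hoff":
        return w + c * (d - s) * x
    f = sigmoid(s)                      # generalized delta procedure
    return w + c * (d - f) * f * (1 - f) * x
```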
Training a TLU: Example
• Train a neural network to match the following linearly separable training set:
Behind the scenes: Planes
and Hyperplanes
What can a TLU learn?
Linearly Separable Functions
• A single TLU can implement any linearly separable function
• AB′ is linearly separable
• A ⊕ B (XOR) is not — a small demonstration follows
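A small demonstration (mine, with assumed weights) that a single TLU computes AB′ but cannot compute XOR:

```python
def tlu(x, w, theta):
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) >= theta else 0

# AB' (A and not B): weight A positively, B negatively
w, theta = [1.0, -1.0], 0.5
for a in (0, 1):
    for b in (0, 1):
        print((a, b), tlu((a, b), w, theta))  # 1 only for (1, 0)
# XOR has no such (w, theta): its positive examples (0,1) and (1,0)
# cannot be separated from (0,0) and (1,1) by a single line.
```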
NEURAL NETWORKS
An Architecture for Learning
Neural Network Fundamentals
• Chain multiple TLUs together
• Three layers:
  – Input layer
  – Hidden layers
  – Output layer
• Two classifications:
  – Feed-forward
  – Recurrent
Neural Network Terminology
Training ANNs: Backpropagation
• Main idea: distribute the error function across the hidden layers, according to their effect on the output
• Works on feed-forward networks
• Use sigmoid units for training; afterwards they can be replaced with threshold functions
Back-Propagation: Bird's-eye view
• Repeat:
  – Choose a training pair and copy it to the input layer
  – Cycle that pattern through the net
  – Calculate the error derivative between the output activation and the target output
  – Back-propagate the summed product of the weights and errors in the output layer to calculate the error on the hidden units
  – Update the weights according to the error on that unit
• Until the error is low or the net settles
Back-Prop: Sharing the Blame
• We want to assign:
  W_i^j ← W_i^j + c_i^j δ_i^j X^{j-1}
  – W_i^j = weights of the i-th sigmoid in the j-th layer
  – X^{j-1} = inputs to our TLU (outputs from the previous layer)
  – c_i^j = learning rate constant of the i-th sigmoid in the j-th layer
  – δ_i^j = sensitivity of the network output to changes in the input of our TLU
  – Important equation: δ_i^j = −∂ε/∂s_i^j = 2(d − f) ∂f/∂s_i^j
Back-Prop: Calculating δ_i^j
• For the output layer (layer k): δ_i^j = δ_k
  – δ_i^j = δ_k = (d − f) ∂f/∂s_k
  – δ_i^j = (d − f) f (1 − f) for a sigmoid
  – Therefore W_k ← W_k + c_k (d − f) f (1 − f) X^{k-1}
• For the hidden layers:
  – See Nilsson 1998 for the calculation
  – Recursive formula, with base case δ_k = (d − f) f (1 − f):
    δ_i^j = f_i^j (1 − f_i^j) Σ_{l=1}^{m^{j+1}} δ_l^{j+1} w_{il}^{j+1}
Back-Prop: Example
• Train a 2-layer neural net with the following input (a runnable sketch follows):
  – x₁⁰ = 1, x₂⁰ = 0, x₃⁰ = 1, d = 0
  – x₁⁰ = 0, x₂⁰ = 0, x₃⁰ = 1, d = 1
  – x₁⁰ = 0, x₂⁰ = 1, x₃⁰ = 1, d = 0
  – x₁⁰ = 1, x₂⁰ = 1, x₃⁰ = 1, d = 1
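A minimal runnable sketch of this example (my own code, not from the slides; the hidden-layer width of 2, learning rate, seed, and epoch count are assumptions; the constant third input serves as the augmentation introduced earlier):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Training set from the slide; x3 = 1 is the augmented constant input
X = np.array([[1, 0, 1], [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
D = np.array([0.0, 1.0, 0.0, 1.0])

rng = np.random.default_rng(1)
W1 = rng.uniform(-1, 1, (3, 2))   # input layer -> 2 hidden sigmoids
W2 = rng.uniform(-1, 1, 3)        # [h1, h2, 1] -> output sigmoid
c = 2.0                            # assumed learning rate

for epoch in range(10000):
    for x, d in zip(X, D):
        h = sigmoid(x @ W1)
        h_aug = np.append(h, 1.0)                     # re-augment for output
        f = sigmoid(h_aug @ W2)
        delta_out = (d - f) * f * (1 - f)             # base case (d-f)f(1-f)
        delta_hid = h * (1 - h) * delta_out * W2[:2]  # recursive formula
        W2 += c * delta_out * h_aug
        W1 += c * np.outer(x, delta_hid)

out = sigmoid(np.append(sigmoid(X @ W1), np.ones((4, 1)), axis=1) @ W2)
print(np.round(out, 2))  # should approach [0, 1, 0, 1]; gradient descent
                         # can stall in a local extremum, so re-seed if needed
```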
Back-Prop: Problems
• The learning rate is non-optimal
  – One solution: "learn" the learning rate
• Network paralysis: weights grow so large that f_i^j (1 − f_i^j) → 0, and the net never learns
• Local extrema: gradient descent is a greedy method
• These problems are acceptable in many cases, even if workarounds can't be found
Back-Prop: Momentum
• We want to choose a learning rate that is as large as possible
  – Speed up convergence
  – Avoid oscillations
• Add a momentum term dependent on the past weight change (see the sketch below):
  Δw_{i,j}(t+1) = η δ_j x_i + α Δw_{i,j}(t)
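A hedged one-step sketch of the momentum update (my illustration; the eta and alpha defaults are assumptions):

```python
import numpy as np

def momentum_step(w, dw_prev, delta_j, x, eta=0.5, alpha=0.9):
    """One momentum update: the new weight change blends the current
    gradient step (eta * delta_j * x) with the previous change dw_prev."""
    dw = eta * delta_j * x + alpha * dw_prev
    return w + dw, dw   # carry dw forward as dw_prev for the next step
```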

Another Method: ALOPEX
• Introduced by Tzanakou and Harth (1973), originally for receptive field mapping in the visual pathway of frogs
• The main ideas (a rough sketch follows):
  – Use cross-correlation to determine a direction of movement in the gradient field
  – Add a random element to avoid local extrema
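A rough sketch of the ALOPEX idea (a simplification of mine, not the original algorithm's exact form; gamma and sigma are assumed step and noise sizes):

```python
import numpy as np

def alopex_step(w, dw_prev, E, E_prev, gamma=0.01, sigma=0.005,
                rng=np.random.default_rng()):
    """Move each weight against the correlation of its previous change
    with the previous change in the error E, plus Gaussian noise to
    escape local extrema."""
    dE = E - E_prev
    dw = -gamma * np.sign(dw_prev * dE) + sigma * rng.standard_normal(w.shape)
    return w + dw, dw
```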
WORKING WITH
NEURAL NETS
AI the easy way!
ANN Project Lifecycle
• Task identification and design
• Feasibility
• Data Coding
• Network Design
• Data Collection
• Data Checking
• Training and Testing
• Error Analysis
• Network Analysis
• System Implementation
ANN Design Tradeoffs
(Favoring generalization vs. favoring accuracy)
• Resolution / level of accuracy: low vs. high
• Definition of the problem: … vs. well-posed
• Data coding: dimensionality reduction vs. many dimensions
• Number of network units: low (less data needed) vs. high (more data needed)
• Data collection: sparse data, even distribution, noisy data tolerated vs. dense data, uneven distribution, no noise tolerated
• Test criteria: generalizes well to unseen data vs. reaches the required level of accuracy
• Main problem in training: … vs. overfitting
• A good design will find a balance between these two extremes!
ANN Design Balance: Depth
[Chart: Percent Error (0-40) vs. number of hidden units (2-10); training-set error keeps falling while validation-set error eventually rises]
• Too few hidden layers will cause errors in accuracy
• Too many will cause errors in generalization!
Modeling the neuron
Wetware: Biological Neurons
The Process: Neuron Firing
• Each electrical signal received at a synapse causes neurotransmitter release
• The neurotransmitter travels across the synaptic cleft and is received by the other neuron at a receptor site
• The Post-Synaptic Potential (PSP) either increases (hyperpolarizes) or decreases (depolarizes) the polarization of the post-synaptic membrane (the receptors)
• In hyperpolarization, the spike train is inhibited; in depolarization, the spike train is excited
The Process: Part 2
• Each PSP travels along the dendrite of the new neuron and spreads itself over the cell body
• When the effects of the PSP reach the axon hillock, they are summed with other PSPs
• If the sum is greater than a certain threshold, the neuron fires a spike along the axon
• Once the spike reaches the synapse of an efferent neuron, the process starts in that neuron
The neuron to the TLU
• Cell body (soma) = accumulator plus its threshold function
• Dendrites = inputs to the TLU
• Axon = output of the TLU
• Information encoding:
  – Neurons use frequency
  – TLUs use value
Modeling the Neuron: Capabilities
• Humans and neural nets are both:
  – Good at pattern recognition
  – Bad at mathematical calculation
  – Good at compressing lots of information into a yes/no decision
  – Taught via a training period
• TLUs win because neurons are slow
• Wetware wins because we have a cheap source of billions of neurons
Do ANNs model neuron structures?
• No: hundreds of types of specialized neurons, only one TLU
• No: weights to the neural threshold are controlled by many neurotransmitters, not just one
• Yes: most of the complexity in the neuron is devoted to sustaining life, not information processing
• Maybe: there is no real mechanism for backpropagation in the brain; instead, firing of neurons increases connection strength
High Level: Agent Architecture
• Our minds are composed of a series of non-intelligent agents
• The hierarchy, interconnections, and interactions between the agents create our intelligence
• There is no one agent in control
• We learn by forming new connections between agents
• We improve by dealing with agents at a higher level, i.e. creating mental 'scripts'
Agent Hierarchy: Playing with Blocks
[Diagram: the Builder agent and its sub-agents, e.g. Begin, Add, See, Find, Grasp, Move, Get, Put, Release]
From the outside, Builder knows how to build towers. From inside, Builder just turns on other agents.
How We Remember: K-Line Theory
New Knowledge: Connections
• Sandcastles in the sky: everything we know is connected to everything else we know
• Knowledge is acquired by making new connections between "things" we already know
[Diagram: the same objects classified two ways: by alive (animal: bird, fish; plant: oak, fir; virus) vs. not alive (wood: boat; metal: plane; stone: house), and by habitat: air (bird, plane), land (dog, car), sea (fish, boat)]
Learning Meaning
• Uniframing: combining several descriptions into one
• Accumulating: collecting incompatible descriptions
• Reformulating: modifying a description's character
• Transforming: bridging between structures and functions or actions
The Exception Principle
• It rarely pays to tamper with a rule that nearly always works. It is better to complement it with an accumulation of exceptions
• Birds can fly
• Birds can fly, unless they are penguins or ostriches
The Exception Principle: Overfitting
• Birds can fly, unless they are penguins and ostriches, or if they happen to be dead, or have broken wings, or are confined to cages, or have their feet stuck in cement, or have undergone experiences so dreadful as to render them psychologically incapable of flight
• In real thought, finding exceptions to everything is usually unnecessary
Minsky's Principles
• Most new knowledge is simply finding a new way to relate things we already know
• There is nothing wrong with circular logic or having imperfect rules
• Any idea will seem self-evident... once you've forgotten learning it
• Easy things are hard: we're least aware of what our minds do best
TO THE FUTURE AND
BEYOND
Why you should be nice
to your computer
I'm lonely and I'm bored. Come play with me!
Computers are Dumb
• "Deep Blue might be able to win at chess, but it won't know to come in from the rain."
• Computers can only know what they're told, or what they're told to learn
• Computers lack a sense of mortality and a physical self to preserve
• All of this will change when computers can reach 'consciousness'
I, Silicon Consciousness
• Kurzweil: by 2019, a $1000 computer will be equivalent to the human brain
• By 2029, machines will claim to be conscious. We will believe them
• By 2049, nanobot swarms will make virtual reality obsolete in real reality
• By 2099, man and machine will have completely merged
You mean to tell me?????
• We humans will gradually introduce machines into our bodies, as implants
• Our machines will grow more human as they learn, and learn to design themselves
• The Neo-Luddite scenarios:
  – AI succeeds in creating conscious beings. All life is at the mercy of the machines
  – Humans retain control: workers are obsolete. The power to decide the fate of the masses is now completely in the hands of the elite
Neural Networks: Conclusions
• Neural networks are a powerful tool for:
  – Pattern recognition
  – Generalizing to a problem
  – Machine learning
• Training neural networks
  – Can be done, but exercise great care
  – Still has room for improvement
• Understanding and creating consciousness?
  – Still working on it :)
