
Analog Computation and Learning in VLSI

Thesis by
Vincent F. Koosh
In Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
California Institute of Technology
Pasadena, California
2001
(Submitted April 20, 2001)
© 2001
Vincent F. Koosh
All Rights Reserved
Acknowledgements
First, I would like to thank my advisor, Rod Goodman, for providing me with the
resources and tools necessary to produce this work. Next, I would like to thank
the members of my thesis defense committee, Jehoshua Bruck, Christof Koch, Alain
Martin, and Chris Diorio. I would also like to thank Petr Vogel for being on my
candidacy committee. Thanks also goes out to some of my undergraduate professors,
Andreas Andreou and Gert Cauwenberghs, for starting me off on this analog VLSI
journey and for providing occasional guidance along the way. Many thanks go to
the members of my lab and the Caltech VLSI community for fruitful conversations
and useful insights. This includes Jeff Dickson, Bhusan Gupta, Samuel Tang, Theron
Stanford, Reid Harrison, Alyssa Apsel, John Cortese, Art Zirger, Tom Olick, Bob
Freeman, Cindy Daniell, Shih-Chii Liu, Brad Minch, Paul Hasler, Dave Babcock,
Adam Hayes, William Agassounon, Phillip Tsao, Joseph Chen, Alcherio Martinoli,
and many others. I thank Karin Knott for providing administrative support for our
lab. I must also thank my parents, Rosalyn and Charles Soffar, for making all things
possible for me, and I thank my brothers, Victor Koosh and Jonathan Soffar. Thanks
also goes to Lavonne Martin for always lending a helpful hand and a kind ear. I also
thank Grace Esteban for providing encouragement for many years. Furthermore, I
would like to thank many of my other friends and colleagues, too numerous to name,
for the many joyous times.
Abstract
Nature has evolved highly advanced systems capable of performing complex compu-
tations, adaptation, and learning using analog components. Although digital sys-
tems have significantly surpassed analog systems in terms of performing precise, high
speed, mathematical computations, digital systems cannot outperform analog sys-
tems in terms of power. Furthermore, nature has evolved techniques to deal with
imprecise analog components by using redundancy and massive connectivity. In this
thesis, analog VLSI circuits are presented for performing arithmetic functions and
for implementing neural networks. These circuits draw on the power of the analog
building blocks to perform low power and parallel computations.
The arithmetic function circuits presented are based on MOS transistors operating
in the subthreshold region with capacitive dividers as inputs to the gates. Because
the inputs to the gates of the transistors are floating, digital switches are used to
dynamically reset the charges on the floating gates to perform the computations.
Circuits for performing squaring, square root, and multiplication/division are shown.
A circuit that performs a vector normalization, based on cascading the preceding
circuits, is shown to display the ease with which simpler circuits may be combined to
obtain more complicated functions. Test results are shown for all of the circuits.
Two feedforward neural network implementations are also presented. The first
uses analog synapses and neurons with a digital serial weight bus. The chip is trained
in the loop, with the computer performing control and weight updates. By training with
the chip in the loop, it is possible to learn around circuit offsets. The second neural
network also uses a computer for the global control operations, but all of the local op-
erations are performed on chip. The weights are implemented digitally, and counters
are used to adjust them. A parallel perturbative weight update algorithm is used.
The chip uses multiple, locally generated, pseudorandom bit streams to perturb all
of the weights in parallel. If the perturbation causes the error function to decrease,
the weight change is kept; otherwise, it is discarded. Test results are shown for both
networks successfully learning digital functions such as AND and XOR.
Contents

Acknowledgements

Abstract

1 Introduction
1.1 Overview
1.1.1 Analog Computation
1.1.2 VLSI Neural Networks

2 Analog Computation with Translinear Circuits
2.1 BJT Translinear Circuits
2.2 Subthreshold MOS Translinear Circuits
2.3 Above Threshold MOS Translinear Circuits
2.4 Translinear Circuits With Multiple Linear Input Elements

3 Analog Computation with Dynamic Subthreshold Translinear Circuits
3.1 Floating Gate Translinear Circuits
3.1.1 Squaring Circuit
3.1.2 Square Root Circuit
3.1.3 Multiplier/Divider Circuit
3.2 Dynamic Gate Charge Restoration
3.3 Root Mean Square (Vector Sum) Normalization Circuit
3.4 Experimental Results
3.5 Discussion
3.5.1 Early Effects
3.5.2 Clock Effects
3.5.3 Speed, Accuracy Tradeoffs
3.6 Floating Gate Versus Dynamic Comparison
3.7 Conclusion

4 VLSI Neural Network Algorithms, Limitations, and Implementations
4.1 VLSI Neural Network Algorithms
4.1.1 Backpropagation
4.1.2 Neuron Perturbation
4.1.3 Serial Weight Perturbation
4.1.4 Summed Weight Neuron Perturbation
4.1.5 Parallel Weight Perturbation
4.1.6 Chain Rule Perturbation
4.1.7 Feedforward Method
4.2 VLSI Neural Network Limitations
4.2.1 Number of Learnable Functions with Limited Precision Weights
4.2.2 Trainability with Limited Precision Weights
4.2.3 Analog Weight Storage
4.3 VLSI Neural Network Implementations
4.3.1 Contrastive Neural Network Implementations
4.3.2 Perturbative Neural Network Implementations

5 VLSI Neural Network with Analog Multipliers and a Serial Digital Weight Bus
5.1 Synapse
5.2 Neuron
5.3 Serial Weight Bus
5.4 Feedforward Network
5.5 Training Algorithm
5.6 Test Results
5.7 Conclusions

6 A Parallel Perturbative VLSI Neural Network
6.1 Linear Feedback Shift Registers
6.2 Multiple Pseudorandom Noise Generators
6.3 Multiple Pseudorandom Bit Stream Circuit
6.4 Up/down Counter Weight Storage Elements
6.5 Parallel Perturbative Feedforward Network
6.6 Inverter Block
6.7 Training Algorithm
6.8 Error Comparison
6.9 Test Results
6.10 Conclusions

7 Conclusions
7.1 Contributions
7.1.1 Dynamic Subthreshold MOS Translinear Circuits
7.1.2 VLSI Neural Networks
7.2 Future Directions

Bibliography
List of Figures

3.1 Capacitive voltage divider
3.2 Squaring circuit
3.3 Square root circuit
3.4 Multiplier/divider circuit
3.5 Dynamically restored squaring circuit
3.6 Dynamically restored square root circuit
3.7 Dynamically restored multiplier/divider circuit
3.8 Root mean square (vector sum) normalization circuit
3.9 Squaring circuit results
3.10 Square root circuit results
3.11 Multiplier/divider circuit results
3.12 RMS normalization circuit results
5.1 Binary weighted current source circuit
5.2 Synapse circuit
5.3 Synapse differential output current as a function of differential input voltage for various digital weight settings
5.4 Neuron circuit
5.5 Neuron differential output voltage, $\Delta V_{out}$, as a function of $\Delta I_{in}$, for various values of $V_{gain}$
5.6 Neuron positive output, $V_{out+}$, as a function of $\Delta I_{in}$, for various values of $V_{offset}$
5.7 Serial weight bus shift register
5.8 Nonoverlapping clock generator
5.9 2 input, 2 hidden unit, 1 output neural network
5.10 Differential to single ended converter and digital buffer
5.11 Parallel Perturbative algorithm
5.12 Training of a 2:1 network with AND function
5.13 Training of a 2:2:1 network with XOR function starting with small random weights
5.14 Network with “ideal” weights chosen for implementing XOR
5.15 Training of a 2:2:1 network with XOR function starting with “ideal” weights
6.1 Amplified thermal noise of a resistor
6.2 2-tap, m-bit linear feedback shift register
6.3 Maximal length LFSRs with only 2 feedback taps
6.4 Dual counterpropagating LFSR multiple random bit generator
6.5 Synchronous up/down counter
6.6 Full adder for up/down counter
6.7 Counter static register
6.8 Block diagram of parallel perturbative neural network circuit
6.9 Inverter block
6.10 Pseudocode for parallel perturbative learning network
6.11 Error comparison section
6.12 Error comparison operation
6.13 Training of a 2:1 network with the AND function
6.14 Training of a 2:2:1 network with the XOR function
6.15 Training of a 2:2:1 network with the XOR function with more iterations

List of Tables

5.1 Computations for network with “ideal” weights chosen for implementing XOR
6.1 LFSR with m=4, n=3 yielding 1 maximal length sequence
6.2 LFSR with m=4, n=2, yielding 3 possible subsequences
Chapter 1 Introduction
Although many applications are well suited for digital solutions, analog circuit design
continues to have an important place. Analog circuits are required, and excel, where
digital equivalents cannot live up to the task. For
example, ultra high speed circuits, such as in wireless applications, require the use
of analog design techniques. Another important area where digital circuits may not
be sufficient is that of low power applications such as in biomedical, implantable
devices or portable, handheld devices. Finally, the importance of analog circuitry in
interfacing with the world will never be displaced because the world itself consists of
analog signals.
In this thesis, several circuit designs are described that fit well into the ana-
log arena. First, a set of circuits for performing basic analog calculations such as
squaring, square rooting, multiplication and division are described. These circuits
work with extremely low current levels and would fit well into applications requir-
ing ultra low power designs. Next, several circuits are presented for implementing
neural network architectures. Neural networks have proven useful in areas requiring
man-machine interactions such as handwriting or speech recognition. Although these
neural networks can be implemented with digital microprocessors, the large growth
in portable devices with limited battery life increases the need for custom low
power solutions. Furthermore, the area of operation of the neural network circuits
can be modified from low power to high speed to meet the needs of the specific ap-
plication. The inherent parallelism of neural networks allows a compact high speed
solution in analog VLSI.
1.1 Overview
1.1.1 Analog Computation
In chapter 2, a brief review of analog computational circuits is given with an empha-
sis on translinear circuits. First, traditional translinear circuits are discussed which
are implemented with devices showing an exponential characteristic, such as bipolar
transistors. Next, MOS transistor versions of translinear circuits are discussed. In
subthreshold MOS translinear circuits the exponential characteristic is exploited sim-
ilarly to bipolar transistors; whereas, above threshold MOS translinear circuits use
the square law characteristic and require a different design style. Also mentioned are
translinear circuits that utilize networks of multiple, linear input elements. These
form the basis of the circuits described in the next chapter.
In chapter 3, a class of analog CMOS circuits that can be used to perform many
analog computational tasks is presented. The circuits utilize MOSFETs in their
subthreshold region as well as capacitors and switches to produce the computations.
A few basic current-mode building blocks that perform squaring, square root, and
multiplication/division are shown. This should be sufficient to gain understanding of
how to implement other power law circuits. These circuit building blocks are then
combined into a more complicated circuit that normalizes a current by the square
root of the sum of the squares (vector sum) of the currents. Each of these circuits
has switches at the inputs of its floating gates, which are used to dynamically set
and restore the charges at the floating gates to proceed with the computation.
1.1.2 VLSI Neural Networks
In chapter 4, a brief review of VLSI neural networks is given. First, the various
learning rules appropriate for analog VLSI implementation are discussed. Next, some
of the issues associated with analog VLSI networks are presented, such as limited
precision weights and weight storage. Also, a brief discussion
of several VLSI neural network implementations is presented.
In chapter 5, a VLSI feedforward neural network is presented that makes use of
digital weights and analog synaptic multipliers and neurons. The network is trained
in a chip-in-the-loop fashion, with a host computer implementing the training algorithm.
The chip uses a serial digital weight bus implemented by a long shift register to input
the weights. The inputs and outputs of the network are provided directly at pins on
the chip. The training algorithm used is a parallel weight perturbation technique.
Training results are shown for a 2 input, 1 output network trained with an AND
function, and for a 2 input, 2 hidden unit, 1 output network trained with an XOR
function.
In chapter 6, a VLSI neural network that uses a parallel perturbative weight
update technique is presented. The network uses the same synapses and neurons
as the previous network, but all of the local, parallel, weight update computations
are performed on-chip. This includes the generation of random perturbations and
counters for updating the digital words where the weights are stored. Training results
are also shown for a 2 input, 1 output network trained with an AND function, and
for a 2 input, 2 hidden unit, 1 output network trained with an XOR function.
Chapter 2 Analog Computation with
Translinear Circuits
There are many forms of analog computation. For example, analog circuits can per-
form many types of signal filtering including lowpass, bandpass, and highpass filters.
Operational amplifier circuits are commonly used for these tasks as well as for generic
derivative, integrator, and gain circuits. The emphasis of this chapter, however, shall
be on circuits with the ability to implement arithmetic functions such as signal mul-
tiplication, division, power law implementations, and trigonometric functions. A well-
known class of circuits, called translinear circuits, exists to perform such tasks. The
basic idea behind using these types of circuits is that of using logarithms for con-
verting multiplications and divisions into additions and subtractions. This type of
operation was well known to users of slide rules in the past.
2.1 BJT Translinear Circuits
The earliest forms of these type of circuits were based on exploiting the exponential
current-voltage relationships of diodes or bipolar junction transistors[1][2].
For a bipolar junction transistor, the collector current, $I_C$, as a function of its
base-emitter voltage, $V_{be}$, is given as

$$I_C = I_S \exp\left(\frac{V_{be}}{U_t}\right)$$

where $I_S$ is the saturation current, and $U_t = kT/q$ is the thermal voltage, which
is approximately 26mV at room temperature. In using this formula, the base-emitter
voltage is considered the input, and the collector current is the output which performs
the exponentiation or antilog operation. It is also possible to think of the collector
current as the input and the base-emitter voltage as the output such as in a diode
connected transistor, where the collector has a wire, or possibly a voltage buffer to
remove base current offsets, connecting it to the base. In this case, the equation can
be seen as follows:

$$V_{be} = U_t \ln\left(\frac{I_C}{I_S}\right)$$
From here it is possible to connect the transistors, using log and antilog techniques,
to obtain various multiplication, division and power law circuits. However, it is not
always trivial to obtain the best circuit configuration for a given application. It is
also possible to obtain trigonometric functions by using other power law functional
approximations[3].
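As a purely numerical illustration of the log-antilog idea (a sketch with assumed ideal devices and an illustrative saturation current, not a circuit from this thesis), summing and subtracting base-emitter voltages around a loop corresponds to multiplying and dividing collector currents:

```python
import numpy as np

UT = 0.026    # thermal voltage U_t = kT/q, about 26 mV at room temperature
IS = 1e-15    # assumed saturation current (illustrative value only)

def vbe(ic):
    """'Log' operation: base-emitter voltage of an ideal BJT carrying collector current ic."""
    return UT * np.log(ic / IS)

def ic(v):
    """'Antilog' operation: collector current of an ideal BJT with base-emitter voltage v."""
    return IS * np.exp(v / UT)

# One-quadrant multiply/divide: the voltage sum V_be1 + V_be2 - V_be3 drives an output device
i1, i2, i3 = 2e-9, 5e-9, 1e-9
i_out = ic(vbe(i1) + vbe(i2) - vbe(i3))
print(i_out, i1 * i2 / i3)   # both evaluate to ~1e-8 A
```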
2.2 Subthreshold MOS Translinear Circuits
Because of the similarities of the subthreshold MOS transistor equation and the bipo-
lar transistor equation, it is sometimes possible to directly translate BJT translinear
circuits into subthreshold MOS translinear circuits. However, care must be taken
because of the nonidealities introduced in subthreshold MOS.
For a subthreshold MOS transistor, the drain current, $I_d$, as a function of its
gate-source voltage, $V_{gs}$, is given as[30]

$$I_d = I_o \exp\left(\frac{\kappa V_{gs}}{U_t}\right)$$

where $I_o$ is the preexponential current factor, and κ is a process dependent pa-
rameter which is normally around 0.7. Thus, it is clear that the subthreshold MOS
transistor equation differs from the BJT equation with the introduction of the pa-
rameter κ.
Due to this extra parameter, direct copies of the functional form of BJT translinear
circuits into subthreshold MOS translinear circuits do not always lead to the same
implemented arithmetic function. For example, a particular circuit that performs a
square rooting operation on a current using BJTs, when converted to the subthreshold
MOS equivalent, will actually raise the current to the $\kappa/(\kappa+1)$ power[30]. When κ = 1,
this reduces to the appropriate 1/2 power, but since normally κ ≈ 0.7, the power
becomes ≈ 0.41. Nevertheless, several types of subthreshold MOS circuits which
display κ independence do exist[4][5]. These involve a few added restrictions on the
types of translinear loop topologies allowed.
Subthreshold MOS translinear circuits are more limited in their dynamic range
than BJT translinear circuits because of entrance into strong inversion. Furthermore,
the characteristics slowly decay in the transition region from subthreshold to above
threshold. However, due to the lack of a required base current, lower power operation
may be possible with the subthreshold MOS circuits. Furthermore, easier integration
with standard digital circuits in a CMOS process is possible.
2.3 Above Threshold MOS Translinear Circuits
The ability of above threshold MOS transistors to exhibit translinear behavior has
also been demonstrated[6].
For an above threshold MOS transistor in saturation, the drain current, $I_d$, as a
function of its gate-source voltage, $V_{gs}$, is given as

$$I_d = K\left(V_{gs} - V_t\right)^2$$
Because of the square law characteristic used, these circuits are sometimes called
quadratic-translinear or MOS translinear to distinguish them from translinear circuits
based on exponential characteristics. Several interesting circuits have been demon-
strated such as those for performing squaring, multiplication, division, and vector
sum computation[6][10][11]. However, because of the inflexibility of the square law
compared to an exponential law, general function synthesis is usually not as straight-
forward as for exponential translinear circuits[7][9]. Although many of the circuits
rely upon transistors in saturation, some also take advantage of transistors biased in
the triode or linear region[12].
These circuits are usually more restricted in their dynamic range than bipolar
translinear circuits. They are limited at the low end by weak inversion and at the high
end by mobility reduction[7]. Furthermore, the square-law current-voltage equation
is usually not as good an approximation to the actual circuit behavior as for both
bipolar circuits and MOS circuits that are deep within the subthreshold region. Also,
as submicron channel lengths continue to shrink, the current-voltage equation begins
to approximate a linear function more than a quadratic[8].
2.4 Translinear Circuits With Multiple Linear Input Elements
Another class of translinear circuits exists that involves the addition of a linear net-
work of elements used in conjunction with exponential translinear devices. The linear
elements are used to add, subtract and multiply variables by a constant. From the
point of view of log-antilog mathematics, it is clear that multiplying values by a
constant, n, can be transformed into raising a value to the n-th power. Thus, the ad-
dition of a linear network provides greater flexibility in function synthesis. When the
exponential translinear devices used are bipolar transistors, the linear networks are
usually defined by resistor networks[13][14], and when the exponential translinear de-
vices used are subthreshold MOS transistors, the linear networks are usually defined
by capacitor networks[15][16]. This class of circuits can be viewed as a generalization
of the translinear principle.
Chapter 3 Analog Computation with
Dynamic Subthreshold Translinear
Circuits
A class of analog CMOS circuits has been presented which made use of MOS tran-
sistors operating in their subthreshold region[15][16]. These circuits use capacitively
coupled inputs to the gate of the MOSFET in a capacitive voltage divider configura-
tion. Since the gate has no DC path to ground, it is floating, and some means must
be used to initially set the gate charge and, hence, voltage value. The gate voltage is
initially set at the foundry by a process which puts different amounts of charge into
the oxides. Thus, when one gets a chip back from the foundry, the gate voltages are
somewhat random and must be equalized for proper circuit operation. One technique
is to expose the circuit to ultraviolet light while grounding all of the pins. This has
the effect of reducing the effective resistance of the oxide and allowing a conduction
path. Although this technique ensures that all the gate charges are equalized, it does
not always constrain the actual value of the gate voltage to a particular value. This
technique works in certain cases[17], such as when the floating gate transistors are
in a fully differential configuration, because the actual gate charge is not critical so
long as the gate charges are equalized. If the floating gate circuits are operated over
a sufficient length of time, stray charge may again begin to accumulate requiring an-
other ultraviolet exposure. For the circuits presented here it is imperative that the
initial voltages on the gates are set precisely and effectively to the same value near
the circuit ground. Therefore, a dynamic restoration technique is utilized that makes
it possible to operate the circuits indefinitely.
Figure 3.1: Capacitive voltage divider
3.1 Floating Gate Translinear Circuits
The analysis begins by describing the math that governs the implementation of these
circuits. A more thorough analysis for circuit synthesis is given elsewhere[15][16]. A
few simple circuit building blocks that should give the main idea of how to implement
other designs will be presented.
For the capacitive voltage divider shown in fig. 3.1, if all of the voltages are
initially set to zero volts and then $V_1$ and $V_2$ are applied, the voltage at the node $V_T$
becomes

$$V_T = \frac{C_1 V_1 + C_2 V_2}{C_1 + C_2}$$

If $C_1 = C_2$, then this equation reduces to

$$V_T = \frac{V_1}{2} + \frac{V_2}{2}$$
The current-voltage relation for a MOSFET transistor operating in subthreshold
and in saturation ($V_{ds} > 4U_t$, where $U_t = kT/q$) is given by[30]:

$$I_{ds} = I_o \exp\left(\kappa V_{gs}/U_t\right)$$
Using the above equation for a transistor operating in subthreshold, as well as a
capacitive voltage divider, the equations for the desired computations can
be produced.
In the following formulations, all of the transistors are assumed to be identical.
Figure 3.2: Squaring circuit
Also, all of the capacitors are of the same size.
3.1.1 Squaring Circuit
For the squaring circuit shown in figure 3.2, the I-V equations for each of the tran-
sistors can be written as follows.

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{out}}{I_o}\right) = V_{in}$$

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{ref}}{I_o}\right) = V_{ref}$$

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{in}}{I_o}\right) = \frac{V_{in}}{2} + \frac{V_{ref}}{2}$$

Then, these results can be combined to obtain the following:

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{in}}{I_o}\right) = \frac{1}{2}\,\frac{U_t}{\kappa}\ln\left(\frac{I_{out}}{I_o}\right) + \frac{1}{2}\,\frac{U_t}{\kappa}\ln\left(\frac{I_{ref}}{I_o}\right)$$

$$I_{out} = I_o\left[\frac{I_{in}/I_o}{\left(I_{ref}/I_o\right)^{1/2}}\right]^2 = \frac{I_{in}^2}{I_{ref}}$$
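As a numerical sanity check on this derivation (a sketch only, assuming ideal identical subthreshold devices, equal capacitors, and an illustrative preexponential current $I_o$):

```python
import numpy as np

UT, KAPPA, IO = 0.026, 0.7, 1e-18   # assumed subthreshold device constants

def fg_voltage(i):
    """Gate voltage an ideal subthreshold transistor needs to carry current i."""
    return (UT / KAPPA) * np.log(i / IO)

def squaring_circuit(i_in, i_ref):
    """Solve the three gate-voltage relations of the squaring circuit for I_out:
    V_ref is set by the diode-connected reference device, the capacitive divider
    forces V_in/2 + V_ref/2 = fg_voltage(I_in), and V_in drives the output device."""
    v_ref = fg_voltage(i_ref)
    v_in = 2.0 * fg_voltage(i_in) - v_ref
    return IO * np.exp(KAPPA * v_in / UT)

i_in, i_ref = 3e-9, 10e-9
print(squaring_circuit(i_in, i_ref), i_in**2 / i_ref)   # both ~9e-10 A
```

Note that κ drops out of the final ratio, consistent with the κ-independent result above.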
Figure 3.3: Square root circuit
3.1.2 Square Root Circuit
The equations for the square root circuit shown in figure 3.3 are shown similarly as
follows:

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{ref}}{I_o}\right) = V_{ref}$$

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{in}}{I_o}\right) = V_{in}$$

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{out}}{I_o}\right) = \frac{V_{in}}{2} + \frac{V_{ref}}{2}$$

Combining the equations then gives

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{out}}{I_o}\right) = \frac{1}{2}\,\frac{U_t}{\kappa}\ln\left(\frac{I_{in}}{I_o}\right) + \frac{1}{2}\,\frac{U_t}{\kappa}\ln\left(\frac{I_{ref}}{I_o}\right)$$

$$I_{out} = \sqrt{I_{ref}\, I_{in}}$$
Figure 3.4: Multiplier/divider circuit
3.1.3 Multiplier/Divider Circuit
For the multiplier/divider circuit shown in figure 3.4, the calculations are again per-
formed as follows:

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{in1}}{I_o}\right) = \frac{V_{in1}}{2}$$

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{in2}}{I_o}\right) = \frac{V_{in2}}{2}$$

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{ref}}{I_o}\right) = \frac{V_{ref}}{2} + \frac{V_{in2}}{2}$$

$$\frac{V_{ref}}{2} = \frac{U_t}{\kappa}\ln\left(\frac{I_{ref}}{I_o}\right) - \frac{U_t}{\kappa}\ln\left(\frac{I_{in2}}{I_o}\right)$$

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{out}}{I_o}\right) = \frac{V_{ref}}{2} + \frac{V_{in1}}{2}$$

$$\frac{U_t}{\kappa}\ln\left(\frac{I_{out}}{I_o}\right) = \frac{U_t}{\kappa}\left[\ln\left(\frac{I_{ref}}{I_o}\right) - \ln\left(\frac{I_{in2}}{I_o}\right) + \ln\left(\frac{I_{in1}}{I_o}\right)\right]$$

$$I_{out} = \frac{I_{ref}\,I_{in1}}{I_{in2}}$$
Although the input currents are labeled in such a way as to emphasize the dividing
nature of this circuit, it is possible to rename the inputs such that the divisor is the
reference current and the two currents in the numerator would be the multiplying
inputs.
Note that the divider circuit output is only valid when $I_{ref}$ is larger than $I_{in2}$.
This is because the gate of the transistor with current $I_{ref}$ is limited to the voltage
$V_{fg_{ref}}$ at the gate by the current source driving $I_{ref}$. Since this gate is part of a
capacitive voltage divider between $V_{ref}$ and $V_{in2}$, when $V_{in2} > V_{fg_{ref}}$, the voltage at
the node $V_{ref}$ is zero and cannot go lower. Thus, no extra charge is coupled onto the
output transistors, and when $I_{in2} > I_{ref}$, $I_{out} \approx I_{in1}$.

Also, from the input/output relation, it would seem that $I_{in1}$ and $I_{ref}$ are
interchangeable inputs. Unfortunately, based on the previous discussion, whenever
$I_{in2} > I_{in1}$, $I_{out} \approx I_{ref}$. This would mean that no divisions with fractional outputs of
$I_{in1}/I_{in2} < 1$ could be performed. When such a fractional output is desired, the circuit
in this configuration would not perform properly.
The preceding discussion indicates that the final simplified input/output relation
of the circuit is slightly deceptive. Care must be taken to ensure that the circuit is
operated in the proper operating region with appropriately sized inputs and reference
currents. Furthermore, when doing the analysis for similar power law circuits the
circuit designer must be aware of the charge limitations at the individual floating
gates.
A more accurate input/output relation for the multiplier/divider circuit is thus
given as follows:
$$I_{out} = \begin{cases} \dfrac{I_{ref}\,I_{in1}}{I_{in2}} & \text{if } I_{in2} < I_{ref} \\[6pt] I_{in1} & \text{if } I_{in2} > I_{ref} \end{cases}$$

Figure 3.5: Dynamically restored squaring circuit
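A minimal numerical sketch of this piecewise relation (idealized: Early effects, leakage, and the soft transition near $I_{in2} = I_{ref}$ are ignored):

```python
def mult_div(i_in1, i_in2, i_ref):
    """Idealized multiplier/divider: I_out = I_ref * I_in1 / I_in2 while the reference
    gate can still move; once I_in2 exceeds I_ref the output simply follows I_in1."""
    if i_in2 < i_ref:
        return i_ref * i_in1 / i_in2
    return i_in1

print(mult_div(1e-9, 2e-9, 10e-9))    # 5e-9 A: normal multiply/divide region
print(mult_div(1e-9, 20e-9, 10e-9))   # 1e-9 A: clipped region, output ~ I_in1
```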
3.2 Dynamic Gate Charge Restoration
All of the above circuits assume that some means is available to initially set the gate
charge level so that when all currents are set to zero, the gate voltages are also zero.
One method of doing so which lends itself well to actual circuit implementation is that
of using switches to dynamically set the charge during one phase of operation, and
then to allow the circuit to perform computations during a second phase of operation.
The squaring circuit in figure 3.5 is shown with switches now added to dynami-
cally equalize the charge. A non-overlapping clock generator generates the two clock
signals. During the first phase of operation, $\phi_1$ is high and $\phi_2$ is low. This is the
reset phase of operation. Thus, the input currents do not affect the circuit, and all
sides of the capacitors are discharged and the floating gates of the transistors are
also grounded and discharged. This establishes an initial condition with no current
through the transistors, corresponding to zero gate voltage. Then, during the second
phase of operation, $\phi_1$ goes low and $\phi_2$ goes high. This is the compute phase of
operation. The transistor gates are allowed to float, and the input currents are reapplied.
The circuit now exactly resembles the aforementioned floating gate squaring circuit.
Thus, it is able to perform the necessary computation.

Figure 3.6: Dynamically restored square root circuit

The square root (figure 3.6) and multiplier/divider circuit (figure 3.7) are also
constructed in the same manner by adding switches connected to $\phi_2$ at the drains of
each of the transistors and switches connected to $\phi_1$ at each of the floating gate and
capacitor terminals.

Figure 3.7: Dynamically restored multiplier/divider circuit
3.3 Root Mean Square (Vector Sum) Normalization Circuit
The above circuits can be used as building blocks and combined with current mirrors
to perform a number of useful computations. For example, a normalization stage can
be made which normalizes a current by the square root of the sum of the squares
of other currents. Such a stage is useful in many signal processing tasks. This
normalization stage is seen in figure 3.8. The reference current, $I_{ref}$, is mirrored to
all of the reference inputs of the individual stages. The reference current is doubled
with the 1:2 current mirror into the divider stage. This is necessary because the
reference current must be larger than the largest current that will be divided by.
Since the current that is being divided will be the square root of a sum of squares of
two currents, when $I_{in1} = I_{in2} = I_{max}$, it is necessary to make sure that the reference
current for the divider section is greater than $\sqrt{2}\,I_{max}$. Using the 1:2 current mirror
and setting $I_{ref} = I_{max}$, the reference current in the divider section will be $2I_{max}$,
which is sufficient to enforce the condition.

Figure 3.8: Root mean square (vector sum) normalization circuit

The first two stages that read the input currents are the squaring circuit stages.
The outputs of these stages are summed and then fed back into the square root stage
with a current mirror. The input to the square root stage is thus given by

$$\frac{I_{in1}^2}{I_{ref}} + \frac{I_{in2}^2}{I_{ref}}$$

The square root stage then performs the computation

$$\sqrt{I_{ref}\left(\frac{I_{in1}^2}{I_{ref}} + \frac{I_{in2}^2}{I_{ref}}\right)} = \sqrt{I_{in1}^2 + I_{in2}^2}$$

The output of the square root stage is then fed into the multiplier/divider stage as
the divisor current. The reference current for the divider stage is twice the reference
current for the other stages, as discussed before. The other input to the divider stage
will be a mirrored copy of one of the input currents. Thus, the output of the divider
stage is given by

$$\frac{2I_{ref}\,I_{in1}}{\sqrt{I_{in1}^2 + I_{in2}^2}}$$

The output can then be fed back through another 2:1 current mirror (not shown)
to remove the factor of 2. Thus, the overall transfer function computed would be

$$I_{out} = \frac{I_{ref}\,I_{in1}}{\sqrt{I_{in1}^2 + I_{in2}^2}}$$
The addition of other divider stages can be used to normalize other input currents.
This circuit easily extends to more variables by adding more input squaring stages
and connecting them all to the input of the current mirror that outputs to the square
root stage.
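A numerical sketch of the cascaded signal flow (ideal building blocks only; $I_{ref}$ and the input range are chosen, as required above, so that every stage stays in its valid operating region):

```python
import math

def rms_normalizer(i_in1, i_in2, i_ref):
    """Cascade of ideal building blocks: two squaring stages, a square root stage,
    and a divider whose reference is the 1:2 mirrored copy of I_ref, followed by
    a 2:1 output mirror that removes the factor of 2."""
    sq_sum = i_in1**2 / i_ref + i_in2**2 / i_ref   # summed outputs of the squaring stages
    vec_sum = math.sqrt(i_ref * sq_sum)            # square root stage: sqrt(I_in1^2 + I_in2^2)
    i_div = (2 * i_ref) * i_in1 / vec_sum          # divider stage with a 2*I_ref reference
    return i_div / 2                               # 2:1 output mirror removes the factor of 2

i_ref = 10e-9
print(rms_normalizer(6e-9, 8e-9, i_ref))   # 10 nA * 6/10 = 6e-9 A
```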
3.4 Experimental Results
The above circuits were fabricated in a 1.2µm double poly CMOS process. All of the
pfet transistors used for the mirrors were W=16.8µm, L=6µm. The nfet switches were all
W=3.6µm, L=3.6µm. The floating gate nfets were all W=30µm, L=30µm. The capacitors
were all 2.475 pF.
The data gathered from the current squaring circuit, square root circuit and mul-
tiplier/divider circuit is shown in figures 3.9, 3.10, and 3.11, respectively. Figure 3.12
shows the data from the vector sum normalization circuit.
The solid line in the figures represents the ideal fit. The circle markers represent
the actual data points.
The circuits show good performance over several orders of magnitude in current
range. At the high end of current, the circuit deviates from the ideal when one or
more transistors leaves the subthreshold region. The subthreshold region for these
transistors is below approximately 100nA. For example, in figure 3.10, an offset from
the theoretical fit for the $I_{ref}$ = 100nA case is seen. Since the reference transistor
is leaving the subthreshold region, more gate voltage is necessary per amount of
current. This is because in above threshold, the current goes as the square of the
voltage; whereas, in subthreshold it follows the exponential characteristic. Since
the diode connected reference transistor sets up a slightly higher gate voltage, more
charge and, hence, more voltage is coupled onto the output transistor. This causes
the output current to be slightly larger than expected.
Extra current range at the high end can be achieved by increasing the W/L of
the transistors so that they remain subthreshold at higher current levels. The present
W/L is 1; increasing W/L to 10 would change the subthreshold current range of
these circuits to be below approximately 1µA and hence increase the high end of
the dynamic range appropriately. Leakage currents limit the low end of the dynamic
range.
Some of the anomalous points, such as those seen in figures 3.9 and 3.11, are
believed to be caused by autoranging errors in the instrumentation used to collect
the data.
The divider circuit, as previously discussed, does not perform the division after
$I_{in2} > I_{ref}$, and instead outputs $I_{in1}$, as is seen in the figure.
The normalization circuit shows very good performance. This is because the
reference current and the maximum input currents were chosen to keep the divider
and all circuits within the proper subthreshold operating regions of the building block
circuits. This is helped by the compressive nature of the function.
The reference current for the normalization circuit, $I_{ref}$, at the input of the refer-
ence current mirror array was set to 10nA. However, the ideal fit required a value of
14nA to be used. This was not seen in the other circuits; thus, it is assumed that this
is due to the Early effect of the current mirrors. In fact, the SPICE simulations of
the circuit also predict the value of 14nA. Therefore, it is possible to use the SPICE
simulations to change the mirror transistor ratios to obtain the desired output. Since
this is merely a multiplicative effect, it is possible to simply scale the W/L of the final
output mirror stage to correct it. Alternatively, it is possible to increase the length
of the mirror transistors to reduce Early effect or to use a more complicated mirror
structure such as a cascoded mirror.
3.5 Discussion
3.5.1 Early Effects
Although setting the floating gates near the circuit ground has been the goal of the
dynamic technique discussed above, it is possible to allow the floating gate values
to be reset to something other than the circuit ground. The main effect this has
is to change the dynamic range. If the floating gates are reset to a value higher
than ground, the transistors will enter the above threshold region with smaller input
currents. In fact, in the circuits previously discussed, the switches will have a small
voltage drop across them and the floating gates are actually set slightly above ground.
It may also be possible to use a reference voltage smaller than ground in order to set
the floating gates in such a way as to increase the dynamic range.

Figure 3.9: Squaring circuit results. $I_{out} = I_{in}^2/I_{ref}$, measured for $I_{ref}$ = 100 pA, 1 nA, 10 nA, and 100 nA.

Figure 3.10: Square root circuit results. $I_{out} = \sqrt{I_{in}\,I_{ref}}$, measured for $I_{ref}$ = 100 pA, 1 nA, 10 nA, and 100 nA.

Figure 3.11: Multiplier/divider circuit results. $I_{out} = I_{ref}\,I_1/I_2$ with $I_{ref}$ = 10 nA, measured versus $I_2$ for $I_1$ = 100 pA, 1 nA, and 10 nA.

Figure 3.12: RMS normalization circuit results. $I_{out} = I_{ref}\,I_1/\sqrt{I_1^2 + I_2^2}$, measured versus $I_1$ (for $I_2$ = 100 pA, 1 nA, 10 nA) and versus $I_2$ (for $I_1$ = 100 pA, 1 nA, 10 nA).
One problem with these circuits is the presence of a large Early effect. This effect
comes from two sources. First, the standard Early effect comes from an increase in
channel current due to channel length modulation[30]. This can be modeled by an
extra linear term that is dependent on the drain-source voltage, $V_{ds}$:

$$I_{ds} = I_o\, e^{\kappa V_{gs}/U_t}\left(1 + \frac{V_{ds}}{V_o}\right)$$

The voltage $V_o$, called the Early voltage, depends on process parameters and is
directly proportional to the channel length, L. Another aspect of the Early effect
which is important for circuits with floating gates is due to the overlap capacitance of
the gate to the drain/source region of the transistor[15]. In these circuits, the main
overlap capacitance of interest is the floating gate to drain capacitance, $C_{fg-d}$. This
also depends on process parameters and is directly proportional to the width, W.
This parasitic capacitance acts as another input capacitor from the drain voltage, $V_d$.
In this way, the drain voltage couples onto the floating gate by an amount $V_d\,C_{fg-d}/C_T$,
where $C_T$ is the total input capacitance. Since this is coupled directly onto the gate, it
has an exponential effect on the current level.
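The combined effect can be sketched numerically as follows (all parameter values here, including $V_o$, $C_{fg-d}$, and the drive voltage, are illustrative assumptions rather than measured values from this chip):

```python
import numpy as np

UT, KAPPA, IO = 0.026, 0.7, 1e-18      # assumed subthreshold device constants

def ids(v_fg_set, v_d, v_o=15.0, c_fgd=5e-15, c_total=2.475e-12):
    """Subthreshold drain current including both Early terms discussed above:
    the linear (1 + V_ds/V_o) factor and the exponential term from the drain
    coupling onto the floating gate through C_fg-d / C_T (source at ground)."""
    v_fg = v_fg_set + v_d * c_fgd / c_total   # drain couples onto the floating gate
    return IO * np.exp(KAPPA * v_fg / UT) * (1.0 + v_d / v_o)

# Same intended floating gate voltage, two different drain voltages
print(ids(0.3, 0.2), ids(0.3, 3.0))   # the higher drain voltage yields a noticeably larger current
```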
The method chosen to overcome these combined Early effects is to first make the
transistors long. If L is sufficiently long, then the Early voltage, $V_o$, will be large and
this will reduce the contribution of the linear Early effect. This also requires increasing
the width, W, by the same amount as the length, L, to keep the same subthreshold
current levels and not lose any dynamic range. However, this increase in W increases
$C_{fg-d}$, which worsens the exponential Early effect. Thus, it is necessary to further
increase the size of the input capacitors to make the total input capacitance large
compared to the floating gate to drain parasitic capacitance to reduce the impact
of the exponential Early effect. In this way, it is possible to trade circuit size for
accuracy.
It may be possible to use some of the switches as cascode transistors to reduce the
Early effect in the output transistors of the various stages. The input and reference
transistors already have their drain voltages fixed due to the diode connection and
would not benefit from this, but the output transistor drains are allowed to go near
$V_{dd}$. This would involve not setting $\phi_2$ all the way to $V_{dd}$ during the computation and
instead setting it to some lower cascode voltage. This may allow a reduction in the
size of the transistors and capacitors. Furthermore, it is important to make the input
capacitors large enough that other parasitic capacitances are very small compared to
them.
3.5.2 Clock Effects
Unlike switched capacitor circuits that require a very high clock rate compared to the
input frequencies, the clock rate for these circuits is determined solely by the leakage
rate of the switch transistors. Thus, it is possible to make the clock as slow as 1 Hz
or slower. The input can change faster than the clock rate and the output will be
valid during most of the computation phase.
The output does require a short settling period due to the presence of glitches in
the output current from charge injection by the switches. The amount of charge injec-
tion is determined by the area of the switches and is minimized with small switches.
Also, making the input capacitors large with respect to the size of the switches will re-
duce the effects of the glitches by making the injected charge less effective at inducing
a voltage on the gate.
It may be possible to use a current mode filter or use two transistors in comple-
mentary phase at the output to compensate for the glitches[18].
Other techniques are also available which can improve matching characteristics
and reduce the size of the circuits. One such technique would involve using a single
transistor with a multiphase clock that can be used as a replacement for all the input
and output transistors in the building blocks[18].
3.5.3 Speed, Accuracy Tradeoffs
It is necessary to increase the size of the input capacitors and reduce the size of
the switches in order to improve the accuracy and decrease the effect of glitches.
However, this also has an effect on the speeds at which the circuits may be operated
and the settling times of the signals. Subthreshold circuits are already limited to
low currents which in turn limits their speed of operation since it takes longer to
charge up a capacitor or gate capacitance with small currents. Thus, the increases in
input capacitance necessary to increase accuracy further limit the speed of operation.
Furthermore, the addition of a switch along the current path reduces the speed of the
circuits because of the effective on-resistance of the switch.
The switches feeding into the capacitor inputs can be viewed as a lowpass R-C
filter at the inputs of the circuits. It is possible to make the switches larger in order to
decrease the effective resistance, but this will increase the parasitic capacitance and
require an increase in the size of the input capacitors to obtain the same accuracy.
Thus, there is an inherent tradeoff between the speed of operation and the accuracy
of the circuits.
3.6 Floating Gate Versus Dynamic Comparison
As previously discussed, it is possible to use these circuits in either the purely floating
gate version or in the dynamic version. In this section, a brief comparison is made
between the advantages and disadvantages of the two design styles.
Floating gate advantages
• Larger dynamic range - The switches of the dynamic version introduce leakage
currents and offsets which reduce dynamic range.
• No glitches - The floating gate version does not suffer from the appearance of
glitches on the output and, thus, does not have the same settling time require-
ments.
• Constant operation - The output is always valid.
• Better matching with smaller input capacitors - There are fewer parasitics be-
cause of the lack of switches; thus, it is possible to reduce the size of input
capacitors and still have good matching.
• Less area - There are fewer transistors because of the lack of switches and also
smaller input capacitors may be used.
• Less power - It is possible to go to smaller current values and, also, since there
are fewer stacked transistors, smaller operating voltages are possible. Also,
power is saved by not operating any switches.
• Faster - Since it is possible to use smaller input capacitors and there is no added
switch resistance along the current path, it is possible to operate the circuits at
higher speeds.
Dynamic advantages
• No UV radiation required - In many cases it is desirable not to introduce an
extra post processing step in order to operate the circuits. Furthermore, in
some instances, ultraviolet irradiation fails to place these circuits into a proper
subthreshold operating range and instead places the floating gate in a state
where the transistors are biased above threshold.
• Decreased packaging costs - Packaging costs are increased when it is necessary
to accommodate a UV transparent window[19].
• Robust against long term drift - Transistor threshold drift is a known problem
in CMOS circuits[20] and Flash EEPROMs[19]. Much of the threshold shift is
attributed to hot electron injection. These electrons get stuck in trap states in
the oxide, thereby causing the shift. Much effort is placed into reducing the
number of oxide traps to minimize this effect. The electrons can then exit the
oxide and normally contribute to a very small gate current. However, in floating
gate circuits, the electrons are not removed and remain as excess charge which
leads to increased threshold drift. Furthermore, as process scales continue to
decrease, direct gate tunneling effects will further exacerbate the problem. The
dynamically restored circuits remove these excess charges during the reset phase
of operation and provide greater immunity against long term drift.
Although the advantages that the pure floating gate circuits provide outnumber those
of the dynamic circuits, the advantages afforded by the dynamic technique are of
such significant importance that many circuit designers would choose to utilize the
technique. In practice, a full consideration of each style’s constraints with respect
to the overall system design criteria would favor one or the other of the two design
styles.
3.7 Conclusion
A set of circuits that may be useful for analog computation and neural network
applications has been presented. It is hoped that it is clear
from the derivations how to obtain other power law circuits that may be necessary
and how to combine them to perform useful complex calculations. The dynamic
charge restoration technique is shown to be a useful implementation of this class
of analog circuits. Furthermore, the dynamic charge restoration technique may be
applied to other floating gate computational circuits that may otherwise require initial
ultraviolet illumination or other methods to set the initial conditions.
Chapter 4 VLSI Neural Network
Algorithms, Limitations, and
Implementations
4.1 VLSI Neural Network Algorithms
4.1.1 Backpropagation
Early work in neural networks was hindered by the problem that single layer networks
could only represent linearly separable functions[41]. Although it was known that mul-
tilayer networks could learn any boolean function and, hence, did not suffer from this
limitation, the lack of a learning rule to program multilayered networks reduced inter-
est in neural networks for some time[40]. The backpropagation algorithm[42], based
on gradient descent, provided a viable means for programming multilayer networks
and created a resurgence of interest in neural networks.
A typical sum-squared error function used for neural networks is defined as follows:
$$E(\vec{w}) = \frac{1}{2}\sum_i \left(T_i - O_i\right)^2$$

where $T_i$ is the target output for input pattern i from the training set, and $O_i$ is
the respective network output. The network output is a composite function of the
network weights and composed of the underlying neuron sigmoidal function, s(.).
In backpropagation, weight updates, $\Delta w_{ij} \propto \frac{\partial E}{\partial w_{ij}}$, are calculated for each of
the weights. The functional form of the update involves the weights themselves
propagating the errors backwards to more internal nodes, which leads to the name
backpropagation. Furthermore, the functional form of the weight update rule is
composed of derivatives of the neuron sigmoidal function. Thus, for example, if
s(x) = tanh(ax), then the weight update rule would involve functions of the form
$s'(x) = a\left(1 - \tanh^2(ax)\right)$. For software implementations of the neural
network, the ease of computing this type of function allows an effective means to
implement the backpropagation algorithm.
However, if the hardware implementation of the sigmoidal function tends to de-
viate from an ideal function such as the tanh function, it becomes necessary to com-
pute the derivative of the function actually being implemented which may not have
a simple functional form. Otherwise, unwanted errors would get backpropagated and
accumulate. Furthermore, deviations in the multiplier circuits can add offsets which
would also accumulate and significantly reduce the ability of the network to learn.
For example, in one implementation, it was discovered that offsets of 1% degraded
the learning success rate of the XOR problem by 50%[43]. In typical analog VLSI
implementations, offsets much larger than 1% are quite common, which would lead
to seriously degraded operation.
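For reference, a minimal software sketch of backpropagation on a 2:2:1 tanh network (standard textbook form, not the hardware algorithm used in later chapters; the learning rate, initialization, and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.5, size=(2, 3))   # hidden layer weights (2 inputs + bias)
w2 = rng.normal(scale=0.5, size=(1, 3))   # output layer weights (2 hidden + bias)

def forward(x):
    h = np.tanh(w1 @ np.append(x, 1.0))   # hidden layer, s(x) = tanh(x)
    o = np.tanh(w2 @ np.append(h, 1.0))   # output layer
    return h, o

def backprop_step(x, t, eta=0.2):
    """One gradient step on E = 0.5*(t - o)^2, using s'(net) = 1 - tanh^2(net)."""
    global w1, w2
    h, o = forward(x)
    delta_o = (t - o) * (1.0 - o**2)                   # output error term
    delta_h = (w2[:, :2].T @ delta_o) * (1.0 - h**2)   # error propagated back through w2
    w2 += eta * np.outer(delta_o, np.append(h, 1.0))
    w1 += eta * np.outer(delta_h, np.append(x, 1.0))

data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]   # XOR with +/-1 coding
for _ in range(2000):
    for x, t in data:
        backprop_step(np.array(x, float), np.array([t], float))
print([round(float(forward(np.array(x, float))[1][0]), 2) for x, _ in data])
# a successful run approaches [-1, 1, 1, -1]; a poor initialization can stall in a local minimum
```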
4.1.2 Neuron Perturbation
Because of the problems associated with analog VLSI backpropagation implementa-
tions, efforts have been made to find suitable learning algorithms which are indepen-
dent of the model of the neuron[44]. For example, in the above example the neuron
was modeled as a tanh function. In a model-free learning algorithm, the functional
form of the neuron, as well as the functional form of its derivative, does not enter
into the weight update rule. Furthermore, analog implementations of the derivative
can prove difficult and implementations which avoid these difficulties are desirable.
Most of the algorithms developed for analog VLSI implementation utilize stochastic
techniques to approximate the gradients rather than to directly compute them.
One such implementation[45], called the Madaline Rule III (MR III), approximates
a gradient by applying a perturbation to the input of each neuron and then measuring
the difference in error, ∆E = E(with perturbation)-E(without perturbation). This
error is then used for the weight update rule.
$$\Delta w_{ij} = -\eta\,\frac{\Delta E}{\Delta net_i}\, x_j$$

where $net_i$ is the weighted, summed input to the neuron $i$ that is perturbed, and
$x_j$ is either a network input or a neuron output from a previous layer which feeds into
neuron $i$.
This weight update rule requires access to every neuron input and every neuron
output. Furthermore, multiplication/division circuitry would be required to imple-
ment the learning rule. However, no explicit derivative hardware is required, and no
assumptions are made about the neuron model.
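A software sketch of this update for a single neuron (the neuron is treated as a black box whose error is only measured; the input pattern, perturbation size, and learning rate are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=3)           # weights feeding one neuron (2 inputs + bias)
x, t = np.array([0.5, -0.3, 1.0]), 0.8      # one input pattern and its target

def error(net_offset=0.0):
    """Measured network error with an optional perturbation added to the neuron's
    summed input; no model of the neuron is used by the learning rule itself."""
    return 0.5 * (t - np.tanh(w @ x + net_offset)) ** 2

eta, pert = 0.5, 1e-3
for _ in range(200):
    dE_dnet = (error(pert) - error()) / pert   # Delta E / Delta net_i, measured by perturbation
    w -= eta * dE_dnet * x                     # Delta w_ij = -eta (Delta E / Delta net_i) x_j
print(np.tanh(w @ x))                          # approaches the target of 0.8
```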
Another variant of this method, called the virtual targets method and which is not
model-free, required the use of the neuron function derivatives in the weight update
rule[47]. The method showed good robustness against noise, but studies of the effects
of inaccurate derivative approximations were not done.
4.1.3 Serial Weight Perturbation
Another implementation, better suited to analog hardware, involves
serially perturbing the weights rather than perturbing the neurons[23]. The weight
updates are based on a finite difference approximation to the gradient of the mean
squared error with respect to a weight. The weight update rule is given as follows:
$$\Delta w_{ij} = \frac{E\left(w_{ij} + pert_{ij}\right) - E\left(w_{ij}\right)}{pert_{ij}}$$
This is the forward difference approximation to the gradient. For small enough
perturbations, $pert_{ij}$, a reasonable approximation is achieved. Smaller perturbations
provide better approximations to the gradient at the expense of very small steps in
the gradient direction which then requires many extra iterations to find the minimum.
Thus, a natural tradeoff between speed and accuracy exists.
The gradient approximation can also be improved with the central difference ap-
proximation:

$$\Delta w_{ij} = \frac{E\left(w_{ij} + \frac{pert_{ij}}{2}\right) - E\left(w_{ij} - \frac{pert_{ij}}{2}\right)}{pert_{ij}}$$
terms. A usual implementation would use a weight update rule of the following form:
∆w
ij
= −
η
pert

E

w
ij
+ pert
ij

−E (w
ij
)

Since pert and η are both small constants, the above weight update rule merely
involves changing the weights by the error difference scaled by a small constant.
It is clear that this implementation should require simpler circuitry than the neu-
ron perturbation algorithm. The multiplication/division step is replaced by scaling
with a constant. Also, there is no need to provide access to the computations of
internal nodes. Studies have also shown that this algorithm performs better than
neuron perturbation in networks with limited precision weights[46].
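A sketch of the serial, one-weight-at-a-time procedure in the usual form above (the quadratic error function here is only a stand-in for a network's measured error; perturbation size and learning rate are placeholders):

```python
import numpy as np

def serial_weight_perturbation(error, w, pert=1e-3, eta=0.1):
    """One pass of serial weight perturbation: perturb each weight by itself,
    measure the error difference, and step by -(eta/pert) times that difference."""
    for idx in np.ndindex(w.shape):
        base = error(w)
        w[idx] += pert                   # apply the perturbation to this weight only
        diff = error(w) - base           # E(w_ij + pert_ij) - E(w_ij)
        w[idx] -= pert                   # remove the perturbation
        w[idx] += -(eta / pert) * diff   # weight update
    return w

target = np.array([1.0, -2.0, 0.5])                  # stand-in 'network': quadratic error bowl
error = lambda w: float(np.sum((w - target) ** 2))
w = np.zeros(3)
for _ in range(200):
    w = serial_weight_perturbation(error, w)
print(w)   # close to [1, -2, 0.5]
```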
4.1.4 Summed Weight Neuron Perturbation
The summed weight neuron perturbation approach[24] combines elements from both
the neuron perturbation technique and the serial weight perturbation technique. Since
the serial weight perturbation technique applies perturbations to the weights one at
a time, its computational complexity was shown to be of higher order than that of
the neuron perturbation technique.
Applying a weight perturbation to each of the weights connected to a particular
neuron is equivalent to perturbing the input of that neuron as is done in neuron
perturbation. Thus, in summed weight neuron perturbation, weight perturbations
are applied for each particular neuron; the error is then calculated, and weight updates
are made using the same weight update rule as for serial weight perturbation.
This allows for a reduction in the number of feedforward passes, yet still does not
require access to the outputs of internal nodes of the network. However, this speed
up is at the cost of increased hardware to store weight perturbations. Normally, the
weight perturbations are kept at a fixed small size, but the sign of the perturbation
is random. Thus, the weight perturbation storage can be achieved with 1 bit per
weight.
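A minimal sketch of the idea is given below, assuming a single layer whose weight matrix
rows hold the weights feeding each neuron; the fixed perturbation size, learning rate,
and error function are illustrative placeholders, and the random signs correspond to the
1 bit of perturbation storage per weight mentioned above.

import numpy as np

def summed_weight_neuron_perturbation_pass(W, error_fn, pert=1e-3, eta=0.1,
                                           rng=np.random.default_rng()):
    # W[i, :] holds the weights feeding neuron i; all of them are perturbed at once.
    for i in range(W.shape[0]):
        signs = rng.choice([-1.0, 1.0], size=W.shape[1])  # 1 bit per weight
        base_error = error_fn(W)
        W[i, :] += pert * signs              # perturb every weight into neuron i
        delta_E = error_fn(W) - base_error   # one extra feedforward pass per neuron
        W[i, :] -= pert * signs              # restore the weights
        # Same update form as serial weight perturbation, sign taken per weight:
        W[i, :] -= (eta / pert) * delta_E * signs
    return W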
4.1.5 Parallel Weight Perturbation
The parallel weight perturbation techniques[21][25][48] extend the above methods by
simultaneously applying perturbations to all weights. The perturbed error is then
measured and compared to the unperturbed error. The same weight update rule is
then used as in the serial weight perturbation method. Thus, all of the weights are
updated in parallel.
If there are W total weights in a network, a serial weight perturbation approach
gives a factor W reduction in speed over a straight gradient descent approach because
the overall gradient is calculated based on W local approximations to the gradient.
With parallel weight perturbation, on average, there is a factor W^{1/2} reduction in
speed over direct gradient descent[25]. This can be better understood if the parallel
perturbative method is thought of in analogy to Brownian motion, where there is
considerable movement of the network through weight space, but with a general
guiding force in the proper direction. Thus, a speedup of W^{1/2} is achieved over serial
weight perturbation. This increase in speed may not always be seen in practice
because it depends upon certain conditions being met such as the perturbation size
being sufficiently small.
It is also possible to improve the convergence properties of the network by av-
eraging over multiple perturbations for each input pattern presentation[21]. This is
feasible only if the speed of averaging over multiple perturbations is greater than the
speed of applying input patterns. This usually holds true because the input patterns
must be obtained externally; whereas, the multiple perturbations are applied locally.
Furthermore, convergence rates can be improved by using an annealing schedule
where large perturbation sizes are used initially and then reduced. In this way, large
weight changes are made first, and the perturbation size is gradually decreased to get
finer and finer resolution in approaching the minimum.
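The sketch below combines the parallel scheme with the two refinements just described,
averaging the gradient estimate over several perturbations per pattern presentation and
annealing the perturbation size; the constants and function names are illustrative only.

import numpy as np

def parallel_perturbation_train(w, error_fn, iters=1000, eta=0.05,
                                pert0=0.1, decay=0.995, n_avg=4, seed=0):
    rng = np.random.default_rng(seed)
    pert = pert0                                   # annealed perturbation size
    for _ in range(iters):
        base_error = error_fn(w)
        grad_est = np.zeros_like(w)
        for _ in range(n_avg):                     # average over several perturbations
            signs = rng.choice([-1.0, 1.0], size=w.shape)
            delta_E = error_fn(w + pert * signs) - base_error
            grad_est += (delta_E / pert) * signs   # stochastic gradient estimate
        w = w - eta * grad_est / n_avg             # all weights updated in parallel
        pert *= decay                              # annealing schedule
    return w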
4.1.6 Chain Rule Perturbation
The chain rule perturbation method is a semi-parallel method that mixes ideas from
both serial weight perturbation and neuron perturbation[26]. In a 1 hidden layer
network, the output layer weights are first updated serially using the weight update
formula from the serial weight perturbation method. Next, a hidden layer neuron
output, n_i, is perturbed by an amount npert_i. The effect on the error is measured and
a partial gradient term, ∆E/npert_i, is stored. This neuron output perturbation is
repeated sequentially, and a partial error gradient term is stored for every neuron in
the hidden layer. Then, each of the weights, w_ij, feeding into the hidden layer neurons
is perturbed in parallel by an amount wpert_ij. The resultant change in hidden layer
neuron outputs, ∆n_i, from this parallel weight perturbation is measured and then used
to calculate and store another partial gradient term, ∆n_i/wpert_ij, for each
weight/neuron pair. Finally, the hidden layer weights are updated in parallel according
to the following rule:

\Delta w_{ij} = -\eta\,\frac{\Delta E}{wpert} = -\eta\,\frac{\Delta E}{npert_i}\,\frac{\Delta n_i}{wpert_{ij}}

Hence, the name of the method comes from the analogy of the weight update rule to the
chain rule of differential calculus. For more hidden layers, the same hidden layer
weight update procedure is repeated.
This method allows a speedup over serial weight perturbation by a factor ap-
proaching the number of hidden layer neurons at the expense of the extra hardware
necessary to store the partial gradient terms. For example, in a network with i inputs,
j hidden nodes, and k outputs, the chain rule perturbation method requires k·j +j +i
perturbations; whereas, serial perturbation requires k · j +j · i perturbations.
4.1.7 Feedforward Method
The feedforward method is a variant of serial weight perturbation that incorporates
a direct search technique[49]. First, for a weight, w_ij, the direction of search is found
by applying a fraction of the perturbation. For example, if 0.1·pert is applied and
the error goes down, then the direction of search is positive; otherwise it is negative.
Next, in the forward phase, perturbations of size pert continue to be applied in the
direction of search until the error does not decrease or some predetermined maximum
number of forward perturbations is reached. Then, in the backward phase, a step of
size pert/2 in the reverse direction is taken. Successive backwards steps are taken with
each step decreasing by another factor of 2 until the error again goes up or another
predetermined maximum number of backward perturbations is reached. These steps
are repeated for every weight.
This method is effectively a sophisticated annealing schedule for the serial weight
perturbation method.
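The following sketch spells out this search for a single weight; the limits on the number
of forward and backward steps and the error function are placeholders, and the handling
of ties is a guess since the description above does not specify it.

def feedforward_search_one_weight(w, ij, error_fn, pert, max_fwd=8, max_bwd=4):
    # Probe with a fraction of the perturbation to find the direction of search.
    base = error_fn(w)
    w[ij] += 0.1 * pert
    direction = 1.0 if error_fn(w) < base else -1.0
    w[ij] -= 0.1 * pert

    # Forward phase: keep stepping by pert while the error keeps decreasing.
    best = error_fn(w)
    for _ in range(max_fwd):
        w[ij] += direction * pert
        e = error_fn(w)
        if e >= best:
            break
        best = e

    # Backward phase: reverse steps of pert/2, pert/4, ... until the error rises.
    step = pert / 2.0
    for _ in range(max_bwd):
        w[ij] -= direction * step
        e = error_fn(w)
        if e >= best:
            w[ij] += direction * step   # undo the step that made the error worse
            break
        best = e
        step /= 2.0
    return w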
Many other variations and combinations of these stochastic learning rules are pos-
sible. Most of the stochastic algorithms and implementations concentrate on iterative
approaches to weight updates as discussed above, but continuous-time implementa-
tions are also possible[50].
4.2 VLSI Neural Network Limitations
4.2.1 Number of Learnable Functions with Limited Precision
Weights
Clearly in a neural network with a small, fixed number of bits per weight, the ability
of the network to implement multiple functions is limited. Assume that each weight is
composed of b bits of information and that there are W total weights. The total
number of bits in the network is then B = bW. Since each bit can take on only 1
of 2 values, the total number of different functions that the network can implement
is no more than 2^B. In computer simulations of neural networks where floating point
precision is used and b is on the order of 32 or 64, this is usually not a limitation to
learning. However, if b is a small number like 5, then the ability of the network to
learn a specific function may be compromised. Furthermore, 2^B is merely an upper
bound. There are many different settings of the weights that will lead to the same
function. For example, the hidden neuron outputs can be permuted since the final
output of the network does not depend upon the order of the hidden layer neurons.
Also, the sign of the output of a hidden layer neuron can be flipped by adjusting the
weights coming into that neuron and flipping the sign of the weight on its output to
ensure the same overall output. In one result, the
total number of boolean functions which can be implemented is given approximately
by 2^B / (N! 2^N) for a fully connected network with N units in each layer[51].
4.2.2 Trainability with Limited Precision Weights
Several studies have been done on the effect of the number of bits per weight con-
straining the ability of a neural network to properly train[52][53][54]. Most of these
are specific to a particular learning algorithm. In one study, the effects of the number
of bits of precision on backpropagation were investigated[52]. The weight precision
for a feedforward pass of the network, the weight precision for the backpropagation
weight update calculation, and the output neuron quantization were all tested for a
range of bits. The effect of the number of bits on the ability of the network to prop-
erly converge was determined. The empirical results showed that a weight precision
of 6 bits was sufficient for feedforward operation. Also, neuron output quantization
of 6 bits was sufficient. This implies that standard analog VLSI neurons should be
sufficient for a VLSI neural network implementation. For the backpropagation weight
update calculation, at least 12 bits were required to obtain reasonable convergence.
This shows that a perturbative algorithm that relies only on feedforward operation
can be effective in training with as few as half the number of bits required for a back-
propagation implementation. Due to the increased hardware complexity necessary
to achieve the extra number of bits, the perturbative approaches will be desirable in
analog VLSI implementations.
4.2.3 Analog Weight Storage
Due to the inherent size requirements to store and update digital weights with A/D
and D/A converters, some analog VLSI implementations concentrate on reducing
the space necessary for weight storage by using analog weights. Usually, this involves
storing weights on a capacitor. Since the capacitors slowly discharge, it is necessary to
implement some sort of scheme to continuously refresh the weight. One such scheme,
which showed long term 8 bit precision, involves quantization and refresh of the
weights by use of A/D/A converters which can be shared by multiple weights[60][61].
Other schemes involve periodic refresh by use of quantized levels defined by input
ramp functions or by periodic repetition of input training patterns[59].
Since significant space and circuit complexity overhead exists for capacitive stor-
age of weights, attempts have also been made at direct implementation of nonvolatile
analog memory. In one such implementation, a dense array of synapses is achieved
with only two transistors required per synapse[64]. The weights are stored perma-
nently on a floating gate transistor which can be updated under UV illumination. A
second drive transistor is used as a switch to sample the input drive voltage onto a
drive electrode in order to adjust the floating gate voltage, and, hence, weight value.
The weight multiplication with an input voltage is achieved by biasing the floating gate
transistor into the triode region, which produces an output current roughly proportional
to the product of its floating gate voltage, representing the weight, and its drain to
source voltage, representing the input. The use of UV illu-
mination also allowed a simple weight decay term to be introduced into the learning
rule if necessary. Another approach which also implements a dense array of synapses
based on floating gates utilized tunneling and injection of electrons onto the floating
gate rather than UV illumination[28][29]. Using this technique, only one floating gate
transistor per synapse was required. A learning rule was developed which balances
the effects of the tunneling and injection. In another approach, a 3 transistor analog
memory cell was shown to have up to 14 bits of resolution[65]. Some of the weight
update speeds obtainable using the tunneling and injection techniques may be limited
because rapid writing of the floating gates requires large oxide currents which may
damage the oxide. Nevertheless, these techniques can be combined with dynamic
refresh techniques where weights and weight updates are quickly stored and learned
on capacitors, but eventually written into nonvolatile analog memories[66].
4.3 VLSI Neural Network Implementations
Many neural network implementations in VLSI can be found in the literature. This in-
cludes purely digital approaches, purely analog approaches and hybrid analog-digital
approaches. Although the digital implementations are more straightforward to build
using standard digital design techniques, they often scale poorly with large networks
that require many bits of precision to implement the backpropagation algorithms[57][58].
However, some size reductions can be realized by reducing the bit precision and us-
ing the stochastic learning techniques available[55][56]. Other techniques use time
multiplexed multipliers to reduce the number of multipliers necessary.
An early attempt at implementation of an analog neural network involved making
separate VLSI chip modules for the neurons, synapses and routing switches[62]. The
modules could all be interconnected and controlled by a computer. The weights
were digital and were controlled by a serial weight bus from the computer. Since
the components were modular, it could be scaled to accommodate larger networks.
However, each of the individual circuit blocks was quite large and hence was not well
suited for dense integration on a single neural network chip. For example, the circuits
for the neuron blocks consisted of three multistage operational amplifiers and three
100k resistors. The learning algorithms were based on output information from the
neurons and no circuitry for local weight updates was provided.
4.3.1 Contrastive Neural Network Implementations
Some of the earliest VLSI implementations of neural networks were not based on
the stochastic weight update algorithms, but on contrastive weight updates[67]. The
basic idea behind a contrastive technique is to run the network in two phases. In the
first phase, a teacher signal is sent, and the outputs of the network are clamped. In
the second phase, the network outputs are unclamped and free to change. The weight
updates are based on a contrast function that measures the difference between the
clamped, teacher phase, and the unclamped, free phase.
One of the first fully integrated VLSI neural networks used one of these contrastive
techniques and showed rapid training on the XOR function[33]. The chip incorporated
a local learning rule, feedback connections and stochastic elements. The learning rule
was based on a Boltzmann machine algorithm, which uses bidirectional, symmetric
weights[63]. Briefly, the algorithm works by checking correlations between the input
and output neuron of a synapse. Correlations are checked between a teacher phase,
where the output neurons are clamped to the correct output, and a student phase,
where the outputs are unclamped. If the teacher and student phases share the same
correlation, then no change is made to the weight. If the neurons are correlated in
the teacher phase, but not the student phase, then the weight is incremented. If
it is reversed, then the weight is decremented. The weights were stored as 5 bit
digital words. On each pattern presentation, the network weights are perturbed
with noise and the correlations counted. Although the system actually needs noise
in order to learn, the weight update rule is not based on the previously mentioned
stochastic weight update rules, but is an example of contrastive learning that requires
randomness in order to learn. Analog noise sources were used to produce the random
perturbations. Unfortunately, the gains required to amplify thermal noise caused the
system to be unstable, and highly correlated oscillations were present between all of
the noise sources. Thus, it was necessary to intermittently apply the noise sources in
order to facilitate learning. The network showed good success with a 2:2:2:1 network
learning XOR. The learning rate was shown to be over 100,000 times faster than that
achieved with simulations on a digital computer.
Another implementation showed a contrastive version of backpropagation[43]. The
basic idea was to also use both a clamped phase and a free phase. The weight updates
are calculated in both phases using the standard backpropagation weight update rules.
The actual weight update was taken to be the difference between the clamped weight
change and the free weight change. The use of the contrastive technique allowed a
form of backpropagation despite circuit offsets. However, this implementation was
not model free, and a derivative generator was also necessary in order to compute
the weight updates. The weights were stored as analog voltages on capacitors with
the eventual goal of using nonvolatile analog weights. Due to the size of the circuitry,
only one layer of the network was integrated on a single chip. The multilayer network
was obtained by cascading several chips together. Successful learning of the XOR
function was shown.
4.3.2 Perturbative Neural Network Implementations
A low power chip implementing the summed weight neuron perturbation algorithm
has been demonstrated[22]. The chip was designed for implantable applications which
necessitated the need for extremely low power and small area. The weights were stored
digitally and a serial weight bus was used to set the weights and weight updates. The
chip was trained in loop with a computer applying the input patterns, controlling
the serial weight bus and reading the output values for the error computation. To
reduce power, subthreshold synapses and neurons were used. The synapses consisted
of transconductance amplifiers with the weights encoded as a digital word which
set the bias current. Neurons consisted of current to voltage converters using a diode
connected transistor. They also included a few extra transistors to deal with common
mode cancelation. A 10:6:3 network was trained to distinguish two cardiac arrhythmia
patterns. The chips showed a success rate of over 95% on patient data. With a 3V
supply, only 200 nW was consumed. This is a good example of a situation where an
analog VLSI neural network is very suitable as compared to a digital solution.
Another perturbative implementation was used for training a recurrent neural
network on a continuous time trajectory[39]. A parallel perturbative method was
used with analog volatile weights and local weight refresh circuitry. Two counter-
propagating linear feedback shift registers, which were combined with a network of
XORs, were used to digitally generate the random perturbations. The weight updates
were computed locally. The global functions of error accumulation and comparison
were performed off chip. Some of the weight refresh circuitry was also implemented
off chip. The network was successfully trained to generate two quadrature phase
sinusoidal outputs.
The chain rule perturbation algorithm has also been successfully implemented on
chip[66]. The analog weights were stored dynamically and provisions were made on
the chip for nonvolatile analog storage using tunneling and injection, but its use was
not demonstrated. The weight updates were computed locally. Many of the global
functions such as error computation were also performed on the chip. Only a few
control signals from a computer were necessary for its operation. The size of the per-
turbations was externally settable. However, all perturbations seemed to be of the
same sign, thus not requiring any specialized pseudorandom number generation. The
network was shown to successfully train on XOR and other functions. However, the
error function during learning showed some very peculiar behavior. Rather than grad-
ually decreasing to a small value, the error actually increased steadily, then entered
large oscillations between the high error and near zero error. After the oscillations,
the error was close to zero and stayed there. The authors attribute this strange and
erratic behavior to the fact that a large learning rate and a large perturbation size
were used.
Chapter 5 VLSI Neural Network with
Analog Multipliers and a Serial Digital
Weight Bus
Training an analog neural network directly on a VLSI chip provides additional benefits
over using a computer for the initial training and then downloading the weights. The
analog hardware is prone to have offsets and device mismatches. By training with
the chip in the loop, the neural network will also learn these offsets and adjust the
weights appropriately to account for them. A VLSI neural network can be applied in
many situations requiring fast, low power operation such as handwriting recognition
for PDAs or pattern detection for implantable medical devices[22].
There are several issues that must be addressed to implement an analog VLSI
neural network chip. First, an appropriate algorithm suitable for VLSI implementa-
tion must be found. Traditional error backpropagation approaches for neural network
training require too many bits of floating point precision to implement efficiently in
an analog VLSI chip. Techniques that are more suitable involve stochastic weight
perturbation[21],[23],[24],[25],[26],[27], where a weight is perturbed in a random di-
rection, its effect on the error is determined and the perturbation is kept if the error
was reduced; otherwise, the old weight is restored. In this approach, the network
observes the gradient rather than actually computing it.
Serial weight perturbation[23] involves perturbing each weight sequentially. This
requires a number of iterations that is directly proportional to the number of weights.
A significant speed-up can be obtained if all weights are perturbed randomly in parallel,
the effect on the error is measured, and all of the perturbations are kept if the error
is reduced.
Both the parallel and serial methods can potentially benefit from the use of anneal-
ing the perturbation. Initially, large perturbations are applied to move the weights
quickly towards a minimum. Then, the perturbation sizes are occasionally decreased
to achieve finer selection of the weights and a smaller error. In general, however,
optimized gradient descent techniques converge more rapidly than the perturbative
techniques.
Next, the issue of how to appropriately store the weights on-chip in a non-volatile
manner must be addressed. If the weights are simply stored as charge on a capacitor,
they will ultimately decay due to parasitic conductance paths. One method would
be to use an analog memory cell [28],[29]. This would allow directly storing the
analog voltage value. However, this technique requires using large voltages to obtain
tunneling and/or injection through the gate oxide and is still being investigated.
Another approach would be to use traditional digital storage with EEPROMs. This
would then require having A/D/A (one A/D and one D/A) converters for the weights.
A single A/D/A converter would only allow a serial weight perturbation scheme that
would be slow. A parallel scheme, which would perturb all weights at once, would
require one A/D/A per weight. This would be faster, but would require more area.
One alternative would remove the A/D requirement by replacing it with a digital
counter to adjust the weight values. This would then require one digital counter and
one D/A per weight.
5.1 Synapse
A small synapse with one D/A per weight can be achieved by first making a binary
weighted current source (Figure 5.1) and then feeding the binary weighted currents
into diode connected transistors to encode them as voltages. These voltages are then
fed to transistors on the synapse to convert them back to currents. Thus, many D/A
converters are achieved with only one binary weighted array of transistors. It is clear
that the linearity of the D/A will be poor because of matching errors between the
current source array and synapses which may be located on opposite sides of the chip.
This is not a concern because the network will be able to learn around these offsets.
Figure 5.1: Binary weighted current source circuit

Figure 5.2: Synapse circuit

The synapse[26],[22] is shown in figure 5.2. The synapse performs the weighting
of the inputs by multiplying the input voltages by a weight stored in a digital word
denoted by b0 through b5. The sign bit, b5, changes the direction of current to
achieve the appropriate sign.
In the subthreshold region of operation, the transistor equation is given by[30]
I_d = I_{d0}\, e^{\kappa V_{gs}/U_t}

and the output of the synapse is given by[30],[22]

\Delta I_{out} = I_{out+} - I_{out-} =
\begin{cases}
+I_0\, W \tanh\left(\frac{\kappa (V_{in+} - V_{in-})}{2U_t}\right), & b5 = 1 \\
-I_0\, W \tanh\left(\frac{\kappa (V_{in+} - V_{in-})}{2U_t}\right), & b5 = 0
\end{cases}
where W is the weight of the synapse encoded by the digital word and I_0 is the
least significant bit (LSB) current.
Thus, in the subthreshold linear region, the output is approximately given by

\Delta I_{out} \approx g_m\, \Delta V_{in} = \frac{\kappa I_0}{2U_t}\, W\, \Delta V_{in}
In the above threshold regime, the transistor equation in saturation is approxi-
mately given by
I_D \approx K\,(V_{gs} - V_t)^2
The synapse output is no longer described by a simple tanh function, but is
nevertheless still sigmoidal with a wider “linear” range.
In the above threshold linear region, the output is approximately given by
\Delta I_{out} \approx g_m\, \Delta V_{in} = 2\sqrt{K I_0 W}\, \Delta V_{in}
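To make the subthreshold expression concrete, the short script below evaluates the tanh
model for a few weight codes, assuming typical values of κ ≈ 0.7 and U_t ≈ 25 mV and
the 100 nA LSB current used later in this chapter; it is only meant to indicate the shape
and scaling of the curves in Figure 5.3, since the larger weights actually push the
synapse above threshold.

import numpy as np

kappa, Ut, I0 = 0.7, 0.025, 100e-9        # assumed kappa, thermal voltage (V), LSB current (A)

def synapse_delta_Iout(dVin, W, b5=1):
    # Subthreshold model: delta_Iout = +/- I0 * W * tanh(kappa * dVin / (2 * Ut))
    sign = 1.0 if b5 else -1.0
    return sign * I0 * W * np.tanh(kappa * dVin / (2 * Ut))

for W in (4, 16, 31):
    print(W, synapse_delta_Iout(0.05, W))  # output current at a 50 mV differential input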
Figure 5.3: Synapse differential output current as a function of differential input
voltage for various digital weight settings
It is clear that above threshold, the synapse is not doing a pure weighting of the
input voltage. However, since the weights are learned on chip, they will be adjusted
accordingly to the necessary value. Furthermore, it is possible that some synapses will
operate below threshold while others above, depending on the choice of LSB current.
Again, on-chip learning will be able to set the weights to account for these different
modes of operation.
Figure 5.3 shows the differential output current of the synapse as a function of
differential input voltage for various digital weight settings. The input current of the
binary weighted current source was set to 100 nA. The output currents range from
the subthreshold region for the smaller weights to the above threshold region for the
large weights. All of the curves show their sigmoidal characteristics. Furthermore, it is
clear that the width of the linear region increases as the current moves from subthreshold
to above threshold. For the smaller weights, W = 4, the linear region spans only
approximately 0.2V - 0.4V. For the largest weights, W = 31, the linear range has
expanded to roughly 0.6V - 0.8V. As was discussed above, when the current range
moves above threshold, the synapse does not perform a pure linear weighting. The
largest synapse output current is not 3.1µA as would be expected from a linear
weighting of 31 × 100nA, but a smaller number. Notice that the zero crossing of
∆I_out occurs slightly positive of ∆V_in = 0. This is a circuit offset that is primarily
due to slight W/L differences of the differential input pair of the synapse, and it is
caused by minor fabrication variations.
5.2 Neuron
The synapse circuit outputs a differential current that will be summed in the neuron
circuit shown in figure 5.4. The neuron circuit performs the summation from all of the
input synapses. The neuron circuit then converts the currents back into a differential
voltage feeding into the next layer of synapses. Since the outputs of the synapse will
all have a common mode component, it is important for the neuron to have common
mode cancelation[22]. Since one side of the differential current inputs may have a
larger share of the common mode current, it is important to distribute this common
mode to keep both differential currents within a reasonable operating range.
I_{in+} = I_{in+}^{d} + \frac{I_{in+}^{d} + I_{in-}^{d}}{2} = I_{in+}^{d} + I_{cm}^{d}

I_{in-} = I_{in-}^{d} + \frac{I_{in+}^{d} + I_{in-}^{d}}{2} = I_{in-}^{d} + I_{cm}^{d}

\Delta I = I_{in+} - I_{in-} = I_{in+}^{d} - I_{in-}^{d}
Figure 5.4: Neuron circuit
I_{cm} = \frac{I_{in+} + I_{in-}}{2} = \frac{I_{in+}^{d} + I_{in-}^{d} + 2I_{cm}^{d}}{2} = 2I_{cm}^{d}

\Rightarrow I_{in+}^{d} = I_{in+} - \frac{I_{cm}}{2} = \frac{\Delta I}{2} + \frac{I_{cm}}{2}

\Rightarrow I_{in-}^{d} = I_{in-} - \frac{I_{cm}}{2} = -\frac{\Delta I}{2} + \frac{I_{cm}}{2}
If ∆I is of equal size or larger than I_cm, the transistor carrying I_{in-}^{d} may begin
to cut off and the above equations would not hold exactly; however, the current cutoff
is graceful and should not normally affect performance. With the common mode
signal properly equalized, the differential currents are then mirrored into the current
to voltage transformation stage. This stage effectively takes the differential input
currents and uses a transistor in the triode region to provide a differential output.
This stage will usually be operating above threshold, because the V_offset and V_cm
controls are used to ensure that the output voltages are approximately mid-rail. This
is done by simply adding additional current to the diode connected transistor stack.
Having the outputs mid-rail is important for proper biasing of the next stage of
synapses. The above threshold transistor equation in the triode region is given by
I_d = 2K\left(V_{gs} - V_t - \frac{V_{ds}}{2}\right)V_{ds} \approx 2K\,(V_{gs} - V_t)\,V_{ds}

for small enough V_ds. If K_1 denotes the prefactor with W/L of the cascode transistor
and K_2 denotes the same for the transistor with gate V_out, the voltage output of the
neuron will then be given by

I_{in} = K_1\,(V_{gain} - V_t - V_1)^2 \approx K_2\left(2\,(V_{out} - V_t)\,V_1\right)

V_1 = V_{gain} - V_t - \sqrt{\frac{I_{in}}{K_1}}

I_{in} = 2K_2\,(V_{out} - V_t)\left(V_{gain} - V_t - \sqrt{\frac{I_{in}}{K_1}}\right)

V_{out} = \frac{I_{in}}{2K_2\,(V_{gain} - V_t) - 2\sqrt{\frac{K_2^2}{K_1}\,I_{in}}} + V_t

If \frac{W_1}{L_1} = \frac{W_2}{L_2}, then K_1 = K_2 = K, and

\Rightarrow V_{out} = \frac{I_{in}}{2K\,(V_{gain} - V_t) - 2\sqrt{K I_{in}}} + V_t

For small I_{in},

R \approx \frac{1}{2K\,(V_{gain} - V_t)}

Thus, it is clear that V_gain can be used to adjust the effective gain of the stage.
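As a rough numeric illustration of this result, the sketch below evaluates V_out and the
small-signal resistance for the V_gain values swept in Figure 5.5, using placeholder
values for K and V_t; the actual numbers depend on the process and the transistor sizing
and are not taken from the fabricated chip.

import numpy as np

K, Vt = 5e-6, 0.8                           # assumed prefactor (A/V^2) and threshold (V)

def neuron_vout(Iin, Vgain):
    # V_out = Iin / (2K(Vgain - Vt) - 2*sqrt(K*Iin)) + Vt, for the K1 = K2 = K case
    return Iin / (2 * K * (Vgain - Vt) - 2 * np.sqrt(K * Iin)) + Vt

def small_signal_R(Vgain):
    # R ~ 1 / (2K(Vgain - Vt)) for small input currents
    return 1.0 / (2 * K * (Vgain - Vt))

for Vgain in (1.5, 1.65, 2.0):              # the V_gain values used in Figure 5.5
    print(Vgain, small_signal_R(Vgain), neuron_vout(1e-6, Vgain))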
Figure 5.5 shows how the neuron differential output voltage, ∆V_out, varies as a
function of differential input current for several values of V_gain. The neuron shows
fairly linear performance with a sharp bend on either side of the linear region. This
sharp bend occurs when one of the two linearized, diode connected transistors with
gate attached to V_out turns off.
Figure 5.6 displays only the positive output, V_out+, of the neuron. The diode
connected transistor with gate attached to V_out+ turns off where the output goes flat
on the left side of the curves. This corresponds to the left bend point in figure 5.5.
Furthermore, the baseline output voltage corresponds roughly to V_offset. However, as
V_offset increases in value, its ability to increase the baseline voltage is reduced
because of the cascode transistor on its drain. At some point, especially for small
values of V_gain, the V_cm transistor becomes necessary to provide additional offset.
Overall, the neuron shows very good linear current to voltage conversion with separate
gain and offset controls.

Figure 5.5: Neuron differential output voltage, ∆V_out, as a function of ∆I_in, for
various values of V_gain (1.5 V, 1.65 V, and 2 V)

Figure 5.6: Neuron positive output, V_out+, as a function of ∆I_in, for various values
of V_offset (2 V, 2.2 V, and 2.4 V)
5.3 Serial Weight Bus
A serial weight bus is used to apply weights to the synapses. The weight bus merely
consists of a long shift register to cover all of the possible weight and threshold bits.
Figure 5.7 shows the circuit schematic of the shift register used to implement the
weight bus. The input to the serial weight bus comes from a host computer which
implements the learning algorithm. The last output bit of the shift register, Q5,
feeds into the data input, D, of the next shift register to form a long compound shift
register.
The shift register storage cell is accomplished by using two staticizer or jamb
latches[34] in series. The latch consists of two interlocked inverters. The weak in-
verter, labeled with a W in the figure, is made sufficiently weak such that a signal
coming from the main inverter and passing through the pass gate is sufficiently strong
to overpower the weak inverter and set the state. Once the pass gate is turned off,
the weak inverter keeps the current bit state. The shift register is controlled by a dual
phase nonoverlapping clock. During the read phase of operation, when φ1 is high, the
data from the previous register is read into the first latch. Then, during the write
phase of operation, when φ2 is high, the bit is written to the second latch for storage.
Figure 5.8 shows the nonoverlapping clock generator[35]. A single CLK input
comes from an off chip clock. The clock generator ensures that the two clock phases
do not overlap in order that no race conditions exist between reading and writing.
The delay inverters are made weak by making the transistors long. They set the
appropriate delay such that after one clock transitions from high to low, the next
one does not transition from low to high until after the small delay. Since the clock
lines go throughout the chip and drive long shift registers, it is necessary to make
the buffer transistors with large W/L ratios to drive the large capacitances on the
output.
Figure 5.7: Serial weight bus shift register
Figure 5.8: Nonoverlapping clock generator
5.4 Feedforward Network
Using the synapse and neuron circuit building blocks it is possible to construct a
multilayer feed-forward neural network. Figure 5.9 shows a 2 input, 2 hidden unit,
1 output feedforward network for training of XOR.
Note that the nonlinear squashing function is actually performed in the next layer
of synapse circuits rather than in the neuron as in a traditional neural network.
However, this is equivalent as long as the inputs to the first layer are kept within the
linear range of the synapses. For digital functions, the inputs need not be constrained
as the synapses will pass roughly the same current regardless of whether the digital
inputs are at the flat part of the synapse curve near the linear region or all the way at
the end of the flat part of the curve. Furthermore, for non-differential digital signals,
it is possible to simply tie the negative input to mid-rail and apply the standard digital
signal to the positive synapse input. The biases, or thresholds, for each neuron are
simply implemented as synapses tied to fixed bias voltages. The biases are learned in
the same way as the weights.
Also, depending on the type of network outputs desired, additional circuitry may
be needed for the final squashing function. For example, if a roughly linear output
is desired, the differential output can be taken directly from the neuron outputs.
In figure 5.9, a differential to single ended converter is shown on the output. The
gain of this converter determines the size of the linear region for the final output.
Normally, during training, a somewhat linear output with low gain is desired to have
a reasonable slope to learn the function on. However, after training, it is possible to
take the output after a dual inverter digital buffer to get a strong standard digital
signal to send off-chip or to other sections of the chip.
Figure 5.10 shows a schematic of a differential to single ended converter and a dig-
ital buffer. The differential to single ended converter is based on the same transcon-
ductance amplifier design used for the synapse. The V_b knob is used to set the bias
current of the amplifier. This bias current then controls the gain of the converter.
The gain is also controlled by sizing of the transistors.
Figure 5.9: 2 input, 2 hidden unit, 1 output neural network
Figure 5.10: Differential to single ended converter and digital buffer
Initialize Weights;
Get Error;
while (Error > Error Goal)
    Perturb Weights;
    Get New Error;
    if (New Error < Error)
        Weights = New Weights;
        Error = New Error;
    else
        Restore Old Weights;
    end
end

Figure 5.11: Parallel Perturbative algorithm
5.5 Training Algorithm
The neural network is trained by using a parallel perturbative weight update rule[21].
The perturbative technique requires generating random weight increments to adjust
the weights during each iteration. These random perturbations are then applied to
all of the weights in parallel. In batch mode, all input training patterns are applied
and the error is accumulated. This error is then checked to see if it was higher
or lower than the unperturbed iteration. If the error is lower, the perturbations are
kept, otherwise they are discarded. This process repeats until a sufficiently low error is
achieved. An outline of the algorithm is given in figure 5.11. Since the weight updates
are calculated offline, other suitable algorithms may also be used. For example, it
is possible to apply an annealing schedule wherein large perturbations are initially
applied and gradually reduced as the network settles.
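The sketch below shows how the loop of Figure 5.11 might be driven from the host
computer; set_weights and measure_error are hypothetical stand-ins for shifting a weight
vector down the serial weight bus and reading the summed output error over all training
patterns, and the ±1 LSB perturbation, 6-bit weight clipping, and error goal are
illustrative choices.

import numpy as np

def train_chip_in_loop(n_weights, set_weights, measure_error,
                       error_goal=2.0, max_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.integers(-3, 4, size=n_weights)    # small random initial weights
    set_weights(weights)
    error = measure_error()                          # batch error over all input patterns
    for _ in range(max_iters):
        if error <= error_goal:
            break
        pert = rng.choice([-1, 1], size=n_weights)   # perturb every weight in parallel
        new_weights = np.clip(weights + pert, -31, 31)
        set_weights(new_weights)
        new_error = measure_error()
        if new_error < error:                        # keep the perturbation
            weights, error = new_weights, new_error
        else:                                        # otherwise restore the old weights
            set_weights(weights)
    return weights, error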
5.6 Test Results
A chip implementing the above circuits was fabricated in a 1.2µm CMOS process. All
synapse and neuron transistors were 3.6µm/3.6µm to keep the layout small. The unit
size current source transistors were also 3.6µm/3.6µm. An LSB current of 100nA was
chosen for the current source. The above neural network circuits were trained with
some simple digital functions such as 2 input AND and 2 input XOR. The results of
some training runs are shown in figures 5.12-5.13. As can be seen from the figures, the
network weights slowly converge to a correct solution. Since the training was done on
digital functions, a differential to single ended converter was placed on the output of
the final neuron. This was simply a 5 transistor transconductance amplifier. The error
voltages were calculated as a total sum voltage error over all input patterns at the
output of the transconductance amplifier. Since V_dd was 5V, the output would only
easily move to within about 0.5V of V_dd because the transconductance amplifier
had low gain. Thus, when the error gets to around 2V, it means that all of the outputs
are within about 0.5V from their respective rail and functionally correct. A double
inverter buffer can be placed at the final output to obtain good digital signals. At the
beginning of each of the training runs, the error voltage starts around or over 10V
indicating that at least 2 of the input patterns give an incorrect output.
Figure 5.12 shows the results from a 2 input, 1 output network learning an AND
function. This network has only 2 synapses and 1 bias for a total of 3 weights. The
weight values can go from -31 to +31 because of the 6 bit D/A converters used on
the synapses.
Figure 5.13 shows the results of training a 2 input, 2 hidden unit, 1 output network
with the XOR function. The weights are initialized as small random numbers. The
weights slowly diverge and the error monotonically decreases until the function is
learned. As with gradient techniques, occasional training runs resulted in the network
getting stuck in a local minimum and the error would not go all the way down.
An ideal neural network with weights appropriately chosen to implement the XOR
function is shown in figure 5.14. The inputs for the network and neuron outputs are
chosen to be (-1,+1), as opposed to (0,1). This choice is made because the actual
synapses which implement the squashing function give maximum outputs of ±I_out.
The neurons are assumed to be hard thresholds. The function computed by each of
the neurons is given by
Out = \begin{cases} +1, & \text{if } \left(\sum_i W_i X_i\right) + t \ge 0 \\ -1, & \text{if } \left(\sum_i W_i X_i\right) + t < 0 \end{cases}

Figure 5.12: Training of a 2:1 network with AND function

Figure 5.13: Training of a 2:2:1 network with XOR function starting with small
random weights

Figure 5.14: Network with “ideal” weights chosen for implementing XOR (input weights of
20 into hidden units A and B with thresholds of +20 and −20, and weights of 10 and −20
into output unit C with threshold −20)

X1   X2    A           B           C
-1   -1    −20 ⇒ −1    −60 ⇒ −1    −10 ⇒ −1
-1   +1    +20 ⇒ +1    −20 ⇒ −1    +10 ⇒ +1
+1   -1    +20 ⇒ +1    −20 ⇒ −1    +10 ⇒ +1
+1   +1    +60 ⇒ +1    +20 ⇒ +1    −30 ⇒ −1

Table 5.1: Computations for network with “ideal” weights chosen for implementing XOR
Table 5.1 shows the inputs, intermediate computations, and outputs of the ideal-
ized network. It is clear that the XOR function is properly implemented. The weights
were chosen to be within the range of possible weight values, -31 to +31, of the actual
network. This ideal network would perform perfectly well with smaller weights. For
example, all of the input weights could be set to 1 as opposed to 20, and the A and
B thresholds would then be set to 1 and -1 respectively, without any change in the
network function. However, weights of large absolute value within the possible range
were chosen (20 as opposed to 1), because small weights would be within the linear
region of the synapses. Also, small weights such as +/- 1 might tend to get flipped
due to circuit offsets. A SPICE simulation was done on the actual circuit using these
ideal weights and the outputs were seen to be the correct XOR outputs.
Figure 5.15 shows the 2:2:1 network trained with XOR, but with the initial weights
chosen as the mathematically correct weights for the ideal synapses and neurons.
Although the ideal weights should, both in theory and based on simulations, start off
with correct outputs, the offsets and mismatches of the circuit cause the outputs to be
incorrect. However, since the weights start near where they should be, the error goes
down rapidly to the correct solution. This is an example of how a more complicated
network could be trained on computer first to obtain good initial weights and then
the training could be completed with the chip in the loop.

Figure 5.15: Training of a 2:2:1 network with XOR function starting with “ideal” weights

Also, for more complicated
networks, using a more sophisticated model of the synapses and neurons that more
closely approximates the actual circuit implementation would be advantageous for
computer pretraining.
5.7 Conclusions
A VLSI implementation of a neural network has been demonstrated. Digital weights
are used to provide stable weight storage. Analog multipliers are used because full
digital multipliers would occupy considerable space for large networks. Although the
functions learned were digital, the network is able to accept analog inputs and provide
analog outputs for learning other functions. A parallel perturbation technique was
used to train the network successfully on the 2-input AND and XOR functions.
Chapter 6 A Parallel Perturbative VLSI
Neural Network
A fully parallel perturbative algorithm cannot truly be realized with a serial weight
bus, because the act of changing the weights is performed by a serial operation. Thus,
it is desirable to add circuitry to allow for parallel weight updates.
First, a method for applying random perturbation is necessary. The randomness
is necessary because it defines the direction of search for finding the gradient. Since
the gradient is not actually calculated, but observed, it is necessary to search for
the downward gradient. It is possible to use a technique which does a nonrandom
search. However, since no prior information about the error surface is known, in the
worst case a nonrandom technique would spend much more time investigating upward
gradients which the network would not follow.
A conceptually simple technique of generating random perturbations would be
to amplify the thermal noise of a diode or resistor (figure 6.1). Unfortunately, the
extremely large value of gain required for the amplifier makes the amplifier susceptible
to crosstalk. Any noise generated from neighboring circuits would also get amplified.
Since some of this noise may come from clocked digital sections, the noise would
become very regular, and would likely lead to oscillations rather than the uncorrelated
noise sources that are desired.
Such a scheme was attempted with separate thermal noise generators for each
neuron[33]. The gain required for the amplifier was nearly 1 million and highly
correlated oscillations of a few MHz were observed among all the noise generators.
Therefore, another technique is required.
Instead, the random weight increments can be generated digitally with linear
feedback shift registers that produce a long pseudorandom sequence. These random
bits are used as inputs to a counter that stores and updates the weights. The counter
69
A
V
noise
Figure 6.1: Amplified thermal noise of a resistor
1 2 3 n m ... ...
out
Figure 6.2: 2-tap, m-bit linear feedback shift register
outputs go directly to the D/A converter inputs of the synapses. If the weight updates
led to a reduction in error, the update is kept. Otherwise, an inverter block is activated
which inverts the counter inputs coming from the linear feedback shift registers. This
has the effect of restoring the original weights. A block diagram of the full neural
network circuit function is provided in figure 6.8.
6.1 Linear Feedback Shift Registers
The use of linear feedback shift registers for generating pseudorandom sequences has
been known for some time[31]. A simple 2 tap linear feedback shift register is shown
in figure 6.2. A shift register with m bits is used. In the figure shown, an XOR is
performed from bit locations m and n and its output is fed back to the input of the
shift register. In general, however, the XOR can be replaced with a multiple input
parity function from any number of input taps. If one of the feedback taps is not
1 0 0 0
0 1 0 0
0 0 1 0
1 0 0 1
1 1 0 0
0 1 1 0
1 0 1 1
0 1 0 1
1 0 1 0
1 1 0 1
1 1 1 0
1 1 1 1
0 1 1 1
0 0 1 1
0 0 0 1
1 0 0 0
Table 6.1: LFSR with m=4, n=3 yielding 1 maximal length sequence
Subsequence 1 (length 6): 1000 → 0100 → 1010 → 0101 → 0010 → 0001 → 1000
Subsequence 2 (length 6): 1100 → 1110 → 1111 → 0111 → 0011 → 1001 → 1100
Subsequence 3 (length 3): 0110 → 1011 → 1101 → 0110
Table 6.2: LFSR with m=4, n=2, yielding 3 possible subsequences
taken from the final bit, m, but instead the largest feedback tap number is from some
earlier bit, l, then this is equivalent to an l-bit linear feedback shift register and the
output is merely delayed by the extra m−l shift register bits.
A maximal length sequence from such a linear feedback shift register would be of
length 2^M − 1 and includes all 2^M possible patterns of 0's and 1's with the exception
of the all 0’s sequence. The all 0’s sequence is considered a dead state, because
XOR(0, 0) = 0 and the sequence never changes. It is important to note that not all
feedback tap selections will lead to maximal length sequences.
Table 6.1 shows an example of a maximal length sequence by choosing feedback
taps of positions 4 and 3 from a 4 bit LFSR. It is clear that as long as the shift register
does not begin in the all 0's state, it will progress through all 2^M − 1 = 15 states
and then repeat. Table 6.2 shows an example where non-maximal length sequences
are possible by utilizing feedback taps 4 and 2 from a 4 bit LFSR. In this situation it
is possible to enter one of three possible subsequences. The initial state will determine
which subsequence is entered. There is no overlap of the subsequences so it is not
possible to transition from one to another. In this situation, the subsequence lengths
are of size 6, 6, and 3. The sum of the sizes of the subsequences equals the size of a
maximal length sequence.
It is desirable to use the maximal length sequences because they allow the longest
possible pseudorandom sequence for the given complexity of the LFSR. Thus, it is
important to choose the correct feedback taps. Tables have been compiled which
describe the lengths of possible sequences and subsequences based on LFSR lengths
and feedback tap locations[31].
LFSRs of certain lengths cannot achieve maximal length sequences with only 2
feedback tap locations. For example, for m = 8, it is required to have an XOR(4,5,6)
to get a length 255 sequence, and, for m = 16, it is required to have XOR(4,13,15)
to obtain the length 65535 sequence[32]. Table 6.3 provides a list of LFSRs that can
obtain maximal length sequences with only 2 feedback taps[32].
It is clear that it is possible to make very long pseudorandom sequences with
moderately sized shift registers and a single XOR gate.
In applications where analog noise is desired, it is possible to low pass filter the
output of the digital pseudorandom noise sequence. This is done by using a filter
corner frequency which is much less than the clock frequency of the shift register.
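The following sketch, a software stand-in for the 2-tap LFSR of Figure 6.2, reproduces
the sequence lengths of Tables 6.1 and 6.2; the feedback bit is the XOR of bits m and n
and is shifted into bit 1.

def lfsr_cycle(m, n, seed):
    # Collect successive states of a 2-tap, m-bit LFSR until the cycle repeats.
    state = list(seed)
    seen = []
    while tuple(state) not in seen:
        seen.append(tuple(state))
        feedback = state[m - 1] ^ state[n - 1]   # XOR of bit m and bit n
        state = [feedback] + state[:-1]          # shift, feedback enters bit 1
    return seen

print(len(lfsr_cycle(4, 3, [1, 0, 0, 0])))   # 15: maximal length sequence (Table 6.1)
print(len(lfsr_cycle(4, 2, [1, 0, 0, 0])))   # 6:  one subsequence of Table 6.2
print(len(lfsr_cycle(4, 2, [0, 1, 1, 0])))   # 3:  the length-3 subsequence of Table 6.2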
6.2 Multiple Pseudorandom Noise Generators
From the previous discussion it is clear that linear feedback shift registers are a
useful technique for generating pseudorandom noise. However, a parallel perturbative
neural network requires as many uncorrelated noise sources as there are weights.
Unfortunately, an LFSR only provides one such noise source. It is not possible to use
m n Length
3 2 7
4 3 15
5 3 31
6 5 63
7 6 127
9 5 511
10 7 1023
11 9 2047
15 14 32767
17 14 131071
18 11 262143
20 17 1048575
21 19 2097151
22 21 4194303
23 18 8388607
25 22 33554431
28 25 268435455
29 27 536870911
31 28 2147483647
33 20 8589934591
35 33 34359738367
36 25 68719476735
39 35 549755813887
Figure 6.3: Maximal length LFSRs with only 2 feedback taps
the different taps of a single LFSR as separate noise sources because these taps are
merely the same noise pattern offset in time and thus highly correlated. One solution
would be to use one LFSR with different feedback taps and/or initial states for every
noise source required. For large networks with long training times, this approach
becomes prohibitively expensive in terms of area and possibly power required to
implement the scheme. Because of this, several approaches have been taken to resolve
the problem, many of which utilize cellular automata.
In one approach[36], the authors build from a standard LFSR with the addition of an
XOR network with inputs coming from the taps of the LFSR and with outputs providing
the multiple noise sources. First, the analysis begins with an LFSR consisting of shift
register cells numbered a_0 to a_{N-1}, with the cell marked a_0 receiving the feedback.
Each of the N units in the LFSR except for the a_0 unit is given by

a_i^{t+1} = a_{i-1}^{t}

The first unit is described by

a_0^{t+1} = \bigoplus_{i=0}^{N-1} c_i\, a_i^{t}

where ⊕ represents the modulo 2 summation and the c_i are the boolean feedback
coefficients, where a coefficient of 1 represents a feedback tap at that cell.
The outputs are given by

g^{t} = \bigoplus_{i=0}^{N-1} d_i\, a_i^{t}

A detailed discussion is given as to the appropriate choice of the c_i and d_i
coefficients to ensure good noise properties.
In another approach, a standard boolean cellular automaton is used[37][38]. The
boolean cellular automaton is comprised of N unit cells, a_i, wherein their state at
time t + 1 depends only on the states of all of the unit cells at time t.
a_i^{t+1} = f\left(a_0^{t}, a_1^{t}, \ldots, a_{N-1}^{t}\right)
An analysis was performed to choose a cellular automaton with good independence
between output bits. The following transition rule was chosen:
a_i^{t+1} = \left(a_{i+1}^{t} + a_i^{t}\right) \oplus a_{i-1}^{t}
Furthermore, every 4th bit is chosen as an output to ensure independent, unbiased bits.
6.3 Multiple Pseudorandom Bit Stream Circuit
Another simplified approach utilizes two counterpropagating LFSRs with an XOR
network to combine outputs from different taps to obtain uncorrelated noise[39]. For
example, figure 6.4 shows two LFSRs with m = 7, n = 6 and m = 6, n = 5. Since
the two LFSRs are counterpropagating, the length of the output sequences obtained
from the XOR network is (2^6 − 1)(2^7 − 1) = 8001 bits[31]. In the figure, the bits
are combined to provide 9 pseudorandom bit sequences. It is possible to obtain more
channels and larger sequence lengths with the use of larger LFSRs. This was the
scheme that was ultimately implemented in hardware.
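A software sketch of the same idea is given below: two maximal length LFSRs of
different lengths are clocked together (the counterpropagation of the hardware is
modeled simply as two independent registers), and tap pairs are XORed to form several
bit streams whose combined period is (2^6 − 1)(2^7 − 1) = 8001 clock cycles; the
particular tap pairings are arbitrary and not those of Figure 6.4.

def lfsr_states(m, n, seed):
    # Generator of successive states of a 2-tap, m-bit LFSR (XOR of bits m and n).
    state = list(seed)
    while True:
        yield state
        feedback = state[m - 1] ^ state[n - 1]
        state = [feedback] + state[:-1]

a = lfsr_states(7, 6, [1, 0, 0, 0, 0, 0, 0])     # period 2^7 - 1 = 127
b = lfsr_states(6, 5, [1, 0, 0, 0, 0, 0])        # period 2^6 - 1 = 63

channels = [[] for _ in range(3)]                # three of the nine channels, for brevity
for _ in range(8001):                            # one full combined period
    sa, sb = next(a), next(b)
    channels[0].append(sa[0] ^ sb[0])            # each channel XORs one tap from each LFSR
    channels[1].append(sa[3] ^ sb[2])
    channels[2].append(sa[6] ^ sb[5])

print([sum(c) for c in channels])                # each 8001-bit stream is about half ones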
6.4 Up/down Counter Weight Storage Elements
The weights in the network are represented directly as the bits of an up/down digital
counter. The output bits of the counter feed directly into the digital input word
weight bits of the synapse circuit. Updating the weights becomes a simple matter
of incrementing or decrementing the counter to the desired value. The counter used
is a standard synchronous up/down counter using full adders and registers (figure
6.5)[34].
Figure 6.4: Dual counterpropagating LFSR multiple random bit generator
Figure 6.5: Synchronous up/down counter
Figure 6.6: Full adder for up/down counter
Figure 6.7: Counter static register
The full adder is shown in figure 6.6[34]. Since every weight requires 1 counter,
and every counter requires 1 full adder for each bit, the adders take up a considerable
amount of room. Thus, the adder is optimized for space instead of for speed. Most
of the adder transistors are made close to minimum size.
The register cell (figure 6.7) used in the counter is simply a 1 bit version of the
serial weight bus shown in figure 5.7. This register cell is both small and static, since
the output must remain valid for standard feedforward operation after learning is
complete.
6.5 Parallel Perturbative Feedforward Network
Figure 6.8 shows a block diagram of the parallel perturbative neural network circuit
operation. The synapse, binary weighted encoder and neuron circuits are the same as
those used for the serial weight bus neural network. However, instead of the synapses
interfacing with a serial weight bus, counters with their respective registers are used
to store and update the weights.
6.6 Inverter Block
The counter up/down inputs originate in the pseudorandom bit generator and pass
through an inverter block (figure 6.9). The inverter block is essentially composed of
pass gates and inverters. If the invert signal is low, the bit passes unchanged. If the
invert bit is high, then the inversion of the pseudorandom bit gets passed. Control of
the inverter block is what allows weight updates to either be kept or discarded.
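The keep/discard mechanism can be sketched as follows: every counter is stepped up or
down by its pseudorandom bit, and if the new error is worse the same bits are passed
through the inverter block so that every counter steps back to its previous value. The
array arithmetic below is only a software stand-in for the 6-bit up/down counters and
ignores counter overflow.

import numpy as np

def step_counters(counters, bits):
    # Step each up/down counter: up when its bit is 1, down when its bit is 0.
    return counters + np.where(bits == 1, 1, -1)

rng = np.random.default_rng(0)
counters = rng.integers(-31, 32, size=8)         # 6-bit signed weight values
bits = rng.integers(0, 2, size=8)                # pseudorandom up/down bits

perturbed = step_counters(counters, bits)        # parallel perturbation of all weights
restored = step_counters(perturbed, 1 - bits)    # inverter block passes complemented bits

assert np.array_equal(restored, counters)        # the original weights are recovered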
Figure 6.8: Block diagram of parallel perturbative neural network circuit
Figure 6.9: Inverter block
6.7 Training Algorithm
The algorithm implemented by the network is a parallel perturbative method[21][25].
The basic idea of the algorithm is to perform a modified gradient descent search of
the error surface without calculating derivatives or using explicit knowledge of the
functional form of the neurons. The way that this is done is by applying a set of
small perturbations on the weights and measuring the effect on the network error.
In a typical perturbative algorithm, after perturbations are applied to the weights, a
weight update rule of the following form is used:

∆w = −η (E(w + pert) − E(w)) / pert

where w is a weight vector of all the weights in the network, pert is a perturbation
vector applied to the weights whose entries are normally of a fixed size, pert, but
with random signs, and η is a small parameter used to set the learning rate. Thus,
the weight update rule can be seen as based on an approximate derivative of the error
function with respect to the weights:

∂E/∂w ≈ ∆E/∆w = (E(w + pert) − E(w)) / ((w + pert) − w) = (E(w + pert) − E(w)) / pert
However, this type of weight update rule requires weight changes which are pro-
portional to the difference in error terms. A simpler, but functionally equivalent,
weight update rule can be achieved as follows. First, the network error, E(w), is
measured with the current weight vector, w. Next, a perturbation vector, pert, of
fixed magnitude but random sign is applied to the weight vector, yielding a new
weight vector, w + pert. Afterwards, the effect on the error, ∆E = E(w + pert) − E(w),
is measured. If the error decreases then the perturbations are kept and the next
iteration is performed. If the error increases, the original weight vector is restored.
Thus, the weight update rule is of the following form:
w(t+1) = w(t) + pert(t), if E(w(t) + pert(t)) < E(w(t))
w(t+1) = w(t), if E(w(t) + pert(t)) > E(w(t))
The use of this rule may require more iterations since it does not perform an
actual weight change every iteration and since the weight updates are not scaled
with the resulting changes in error. Nevertheless, it significantly simplifies the weight
update circuitry. For either update rule, some means must be available to apply
the weight perturbations; however, this rule does not require additional circuitry to
change the weight values proportionately with the error difference, and, instead, relies
on the same circuitry for the weight update as for applying the random perturbations.
Some extra circuitry is required to remove the perturbations when the error does not
decrease, but this merely involves inverting the signs of all of the perturbations and
reapplying them in order to cancel out the initial perturbation.
A pseudocode version of the algorithm is presented in figure 6.10.
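For reference, the loop of figure 6.10 can also be written out in software. The sketch
below is behavioral only: get_error stands in for applying all input patterns and
measuring the summed output error, the random signs stand in for the on-chip
pseudorandom bit generator, and each weight change corresponds to clocking a counter
up or down by one LSB.

```python
import random

def train(weights, get_error, error_goal=0.05, max_iters=10000):
    """Parallel perturbative training loop (software sketch of figure 6.10)."""
    error = get_error(weights)
    for _ in range(max_iters):
        if error <= error_goal:
            break
        # Perturb Weights: invert bit low, every counter clocked up or down at random.
        signs = [random.choice((+1, -1)) for _ in weights]
        for i, s in enumerate(signs):
            weights[i] += s
        new_error = get_error(weights)
        if new_error < error:
            error = new_error                 # keep the perturbation
        else:
            # Unperturb Weights: invert bit high, the same bits clock the counters back.
            for i, s in enumerate(signs):
                weights[i] -= s
    return weights, error
```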
6.8 Error Comparison
The error comparison section is responsible for calculating the error of the current
iteration and interfacing with a control section to implement the algorithm. Both
sections could be performed off-chip by using a computer for chip-in-loop training, as
was chosen for the current implementation. This allows flexibility in performing the
global functions necessary for implementing the training algorithm, while the local
functions are performed on-chip. However, the control section could be implemented
on chip as a standard digital section such as a finite state machine. Also, there are
several alternatives for implementing the error comparison on chip. First, the error
comparison could simply consist of analog to digital converters which would then pass
the digital information to the control block. Another approach would be to have the
error comparison section compare the analog errors directly and then output digital
control signals.
Initialize Weights;
Get Error;
while (Error > Error Goal)
    Perturb Weights;
    Get New Error;
    if (New Error < Error)
        Error = New Error;
    else
        Unperturb Weights;
    end
end

function Perturb Weights;
    Set Invert Bit Low;
    Clock Pseudorandom bit generator;
    Clock counters;

function Unperturb Weights;
    Set Invert Bit High;
    Clock counters;
Figure 6.10: Pseudocode for parallel perturbative learning network
Figure 6.11: Error comparison section
[Panels: a) StoreEo high, b) StoreEp high, c) CompareE high.]
Figure 6.12: Error comparison operation
A sample error comparison section is shown in figure 6.11. The section uses digital
control signals to pass error voltages onto capacitor terminals and then combines the
capacitors in such a way to subtract the error voltages. This error difference is then
compared with a reference voltage to determine if the error decreased or increased.
The output consists of a digital signal which shows the appropriate error change.
The error input terminals, Ein+ and Ein−, are connected, in sequence, to two storage
capacitors which are then compared. StoreEo is asserted when the unperturbed,
initial error is stored. StoreEp is asserted in order to store the perturbed error after
the weight perturbations have been applied. CompareE is asserted to compute the
error difference, and Eout shows the sign of the difference computation.
Figure 6.12 shows the operation of the error comparison section in greater detail.
In figure 6.12a, StoreEo is high and the other signals are low. The capacitor where
Ep is stored is left floating, while the inputs Ein+ and Ein− are connected to Eo+
and Eo−, respectively. This stores the unperturbed error voltages on the capacitor.
Next, in figure 6.12b, StoreEp is held high while the other signals are low. In the
same manner as before, the inputs Ein+ and Ein− are connected to Ep+ and Ep−,
respectively. This stores the perturbed error voltages on the capacitor. Note that the
polarities of the two storage capacitors are drawn reversed. This will facilitate the
subtraction operation in the final step. In figure 6.12c, CompareE is asserted and
the other input signals are low. The two storage capacitors are connected together
and their charges combine. The total charge, QT, across the parallel capacitors is
given as follows:

QT = Q+ − Q− = C(Ep+ + Eo−) − C(Ep− + Eo+) = C((Ep+ − Ep−) − (Eo+ − Eo−))
If both capacitors are of size C, the parallel capacitance will be of size 2C. The
voltage across the parallel capacitors can be obtained from QT = 2C·VT. Thus,

VT = QT / (2C) = (1/2)((Ep+ − Ep−) − (Eo+ − Eo−)) = (1/2)(Ep − Eo)
Since these voltages are floating and must be referenced to some other voltage,
the combined capacitor is placed onto a voltage, Vref, which is usually chosen to be
approximately half the supply voltage to allow for the greatest dynamic range.
Also, during this step, this error difference feeds into a high gain amplifier to
perform the comparison. The high gain amplifier is shown as a differential transcon-
ductance amplifier which feeds into a double inverter buffer to output a strong digital
signal. The final output, Eout, is given as follows:
Eout = 1, if Ep > Eo
Eout = 0, if Ep < Eo
where the 1 represents a digital high signal, and the 0 represents a digital low
signal.
This output can then be sent to a control section which will use it to determine
the next step in the algorithm.
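As a quick check of the charge-sharing arithmetic, the comparison can be modeled
behaviorally as below; the comparator is treated as an ideal sign function about Vref,
and charge injection and parasitics are ignored.

```python
def error_comparator(Eo_plus, Eo_minus, Ep_plus, Ep_minus):
    """Return Eout: 1 if the perturbed error exceeds the original error, else 0."""
    # Charge sharing of the two equal, reversed-polarity capacitors gives
    # VT = ((Ep+ - Ep-) - (Eo+ - Eo-)) / 2, referenced to Vref.
    v_t = ((Ep_plus - Ep_minus) - (Eo_plus - Eo_minus)) / 2.0
    return 1 if v_t > 0 else 0

# Example: the perturbed error (0.5 V) is below the original error (0.8 V), so Eout = 0
# and the control section keeps the perturbation.
assert error_comparator(1.2, 0.4, 1.0, 0.5) == 0
```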
The error inputs, which are formed as a difference of the actual output and a
target output, can be computed with a similar analog switched capacitor technique.
However, it might also be desirable to insert an absolute value computation stage.
Simple circuits for obtaining voltage absolute values can be found elsewhere[30].
Furthermore, this sample error computation stage can easily be expanded to ac-
commodate a multiple output neural network.
6.9 Test Results
In this section several sample training run results from the chip are shown. A chip
implementing the parallel perturbative neural network was fabricated in a 1.2µm
CMOS process. All synapse and neuron transistors were 3.6µm/3.6µm to keep the
layout small. The unit size current source transistors were also 3.6µm/3.6µm. An
LSB current of 100nA was chosen for the current source. The above neural network
circuits were trained with some simple digital functions such as 2 input AND and 2
input XOR. The results of some training runs are shown in figures 6.13-6.15. As can be
seen from the figures, the network weights slowly converge to a correct solution. The
error voltages were taken directly from the voltage output of the output neuron. The
error voltages were calculated as a total sum voltage error over all input patterns. The
actual output voltage error is arbitrary and depends on the circuit parameters. What
is important is that the error goes down. The differential to single-ended converter
and a double inverter buffer can be placed at the final output to obtain good digital
signals. Also shown is a digital error signal that shows the number of input patterns
where the network gives the incorrect answer for the output. The network analog
output voltage for each pattern is also displayed.
The actual weight values were not made accessible and, thus, are not shown;
however, another implementation might also add a serial weight bus to read the
weights and to set initial weight values. This would also be useful when pretraining
with a computer to initialize the weights in a good location.
Figure 6.13 shows the results from a 2 input, 1 output network learning an AND
function. This network has only 2 synapses and 1 bias for a total of 3 weights. The
network starts with 2 incorrect outputs which is to be expected with initial random
weights. Since the AND function is a very simple function to learn, after relatively
few iterations, all of the outputs are digitally correct. However, the network continues
to train and moves the weight vector in order to better match the training outputs.
Figures 6.14 and 6.15 show the results of training a 2 input, 2 hidden unit, 1
output network with the XOR function. In both figures, although the error voltage is
monotonically decreasing, the digital error occasionally increases. This is because the
network weights occasionally transition through a region of reduced analog output
error that is used for training, but which actually increases the digital error. This
seems to be occasionally necessary for the network to ultimately reach a reduced
digital error. Both figures also show that the function is essentially learned after only
several hundred iterations. In figure 6.15, the network is allowed to continue learning.
After being stuck in a local minimum for quite some time, the network finally finds
another location where it is able to go slightly closer towards the minimum.
Some of the analog output values occasionally show a large jump from one iteration
[Plot: Errors and Outputs vs. Iteration for training AND network; error (volts), digital error, and analog outputs (volts) over 100 iterations.]
Figure 6.13: Training of a 2:1 network with the AND function
[Plot: Errors and Outputs vs. Iteration for training XOR network; error (volts), digital error, and analog outputs (volts) over 1000 iterations.]
Figure 6.14: Training of a 2:2:1 network with the XOR function
[Plot: Errors and Outputs vs. Iteration for training XOR network; error (volts), digital error, and analog outputs (volts) over 2000 iterations.]
Figure 6.15: Training of a 2:2:1 network with the XOR function with more iterations
to another. This occurs when a weight value which is at maximum magnitude
overflows and resets to zero. The weights are merely stored in counters, and no spe-
cial circuitry was added to deal with these overflow conditions. A simple logic block
could ensure that a weight at maximum magnitude does not overflow and reset to
zero when incremented. However, this circuitry would need to be added to every
weight counter and would be an unnecessary increase in size. In practice, these
overflow conditions should not normally be a problem. Since the algorithm only
accepts weight changes that decrease the error, if an overflow and reset on a weight
is undesirable, the weight reset will simply be discarded. In fact, the weight reset
may occasionally be useful for breaking out of local minima, where a weight value
has been pushed up against an edge that leads to a local minimum, but a sign flip
or significant weight reduction is necessary to reach the global minimum. In
this type of situation, the network will be unwilling to increase the error as necessary
to obtain the global minimum.
6.10 Conclusions
A VLSI neural network implementing a parallel perturbative algorithm has been
demonstrated. Digital weights are used to provide stable weight storage. They are
stored and updated via up/down counters and registers. Analog multipliers are used
to decrease the space that a full parallel array of digital multipliers would occupy in
a large network. The network was successfully trained on the 2-input AND and XOR
functions. Although the functions learned were digital, the network is able to accept
analog inputs and provide analog outputs for learning other functions.
The chip was trained using a computer in a chip-in-loop fashion. All of the
local weight update computations and random noise perturbations were generated on
chip. The computer performed the global operations including applying the inputs,
measuring the outputs, calculating the error function and applying algorithm control
signals. This chip-in-loop configuration is the most likely to be used if the chip were
used as a fast, low power component in a larger digital system. For example, in a
handheld PDA, it may be desirable to use a low power analog neural network to assist
in the handwriting recognition task with the microprocessor performing the global
functions.
Chapter 7 Conclusions
7.1 Contributions
7.1.1 Dynamic Subthreshold MOS Translinear Circuits
The main contribution of this section involved the transformation of a set of floating
gate circuits useful for analog computation into an equivalent set of circuits using
dynamic techniques. This makes the circuits more robust and easier to work with,
since they do not require ultraviolet illumination to initialize the circuits. Fur-
thermore, the simplicity with which these circuits can be combined and cascaded to
implement more complicated functions was demonstrated.
7.1.2 VLSI Neural Networks
The design of the neuron circuit, although similar to another neuron design[22], is
new. The main distinction is a change in the current to voltage conversion element
used. The previous element was a diode connected transistor. This element proved
to have too large a gain, and, in particular, learning XOR was difficult with such
high neuron gain. Thus, a diode connected cascoded transistor with linearized gain
and a gain control stage was used instead.
The parallel perturbative training algorithm with simplified weight update rule
was used in the serial weight bus based neural network. Although many different
algorithms could have been used, such as serial weight perturbation, this was done
with an eye toward the eventual fully parallel perturbative implementation. Excellent
training results were shown for training of a 2 input, 1 output network on the AND
function and for training a 2 input, 2 hidden unit, 1
output network on the XOR function. Furthermore, an XOR network was trained
which was initialized with “ideal” weights. This was done in order to show the effects
of circuit offsets in making the outputs initially incorrect, but also the ability to
significantly decrease the training time required by using good initial weights.
The overall system design and implementation of the fully parallel perturbative
neural network with local computation of the simplified weight update rule was also
demonstrated. Although many of the subcircuits and techniques were borrowed from
previous designs, the overall system architecture is novel. Excellent training results
were again shown for a 2 input, 1 output network trained on the AND function and
for a 2 input, 2 hidden unit, 1 output network trained on the XOR function.
Thus, the ability of a neural network to learn using analog components for imple-
mentation of the synapses and neurons, together with 6 bit digital weights, has been
successfully demonstrated. Six bits were chosen for the digital weights to demonstrate
that learning is possible with limited bit precision. The circuits can easily
be extended to use 8 bit weights. Using more than 8 bits may not be desirable since
the analog circuitry itself may not have sufficient precision to take advantage of the
extra bits per weight. Significant strides can be taken to improve the matching char-
acteristics of the analog circuits, but then the inherent benefits of using a compact,
parallel, analog implementation may be lost.
The size of the neuron cell in dimensionless units was 300λ x 96λ, the synapse was
80λ x 150λ, and the weight counter/register was 340λ x 380λ. In the 1.2µm process
used to make the test chips, λ was equal to 0.6µm. In a modern process, such as a
0.35µm process, it would be possible to make a network with over 100 neurons and
over 10,000 weights in a 1cm x 1cm chip.
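A rough software check of this estimate is shown below; it assumes λ = 0.175µm for
a 0.35µm process (half the feature size, by analogy with λ = 0.6µm in the 1.2µm
process) and ignores routing, pads, and control overhead.

```python
lam = 0.35 / 2.0                                  # um, assumed lambda for a 0.35 um process
weight_area  = (340 * lam) * (380 * lam)          # counter/register cell per weight
synapse_area = (80 * lam) * (150 * lam)
neuron_area  = (300 * lam) * (96 * lam)
chip_area    = 1e4 * 1e4                          # 1 cm x 1 cm expressed in um^2

n_weights = chip_area / (weight_area + synapse_area)
print(round(n_weights))                           # ~23,000 weight cells: "over 10,000 weights"
print(100 * neuron_area / chip_area)              # 100 neurons occupy well under 1% of the chip
```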
7.2 Future Directions
The serial weight bus neural network and parallel perturbative neural network can be
combined. By adding a serial weight bus to the parallel perturbative neural network, it
would also become possible to initialize the neural network with weights obtained from
computer pretraining. In a typical design cycle, significant computer simulation would
be carried out prior to circuit fabrication. Part of this simulation may involve detailed
training runs of the network using either idealized models or full mixed-signal training
simulation with the actual circuit. With the serial weight bus, it would then be possible
to initialize the network with the results of this simulation. Additionally,
for certain applications such as handwriting recognition, it may be desirable to train
one chip with a standard set of inputs, then read out those weights and copy them
to other chips. The next steps would also involve building larger networks for the
purposes of performing specific tasks.
Bibliography
[1] B. Gilbert, “Translinear Circuits: A Proposed Classification”, Electronics Let-
ters, vol. 11, no. 1, pp. 14-16, 1975.
[2] B. Gilbert, “Current-Mode Circuits From A Translinear Viewpoint: A Tutorial”,
Analogue IC Design: The Current-Mode Approach, C. Toumazou, F. J. Ledgey,
and D. G. Haigh, Eds., London: Peter Peregrinus Ltd., pp. 11-91, 1990.
[3] B. Gilbert, “A Monolithic Microsystem for Analog Synthesis of Trigonometric
Functions and Their Inverses”, IEEE Journal of Solid-State Circuits, vol. SC-17,
no. 6, pp. 1179-1191, 1982.
[4] A. G. Andreou and K. A. Boahen, “Translinear Circuits in Subthreshold MOS”,
Analog Integrated Circuits and Signal Processing, vol. 9, pp. 141-166, 1996.
[5] T. Serrano-Gotarredona, B. Linares-Barranco, and A. G. Andreou, “A General
Translinear Principle for Subthreshold MOS Transistors”, IEEE Transactions on
Circuits and Systems - I: Fundamental Theory and Applications, vol. 46, no. 5,
May 1999.
[6] K. Bult and H. Wallinga, “A Class of Analog CMOS Circuits Based on the
Square-Law Characteristic of an MOS Transistor in Saturation”, IEEE Journal
of Solid-State Circuits, vol. SC-22, no. 3, June 1987.
[7] E. Seevinck and R. J. Wiegerink, “Generalized Translinear Circuit Principle”,
IEEE Journal of Solid-State Circuits, vol. 26, no. 8, Aug. 1991.
[8] P. R. Gray and R. G. Meyer, Analysis and Design of Analog Integrated Circuits,
3rd Ed., New York: John Wiley and Sons, 1993.
[9] R. J. Wiegerink, “Computer Aided Analysis and Design of MOS Translinear
Circuits Operating in Strong Inversion”, Analog Integrated Circuits and Signal
Processing, vol. 9, pp. 181-187, 1996.
[10] W. Gai, H. Chen and E. Seevinck, “Quadratic-Translinear CMOS Multiplier-
Divider Circuit”, Electronics Letters, vol. 33, no. 10, May 1997.
[11] S. I. Liu and C. C. Chang, “A CMOS Square-Law Vector Summation Circuit”,
IEEE Transactions on Circuits and Systems - II: Analog and Digital Signal Pro-
cessing, vol. 43, no. 7, July 1996.
[12] C. Dualibe, M. Verleysen and P. Jespers, “Two-Quadrant CMOS Analogue Di-
vider”, Electronics Letters, vol. 34, no. 12, June 1998.
[13] X. Arreguit, E. A. Vittoz, and M. Merz, “Precision Compressor Gain Controller
in CMOS Technology”, IEEE Journal of Solid-State Circuits, vol. SC-22, no. 3,
pp. 442-445, 1987.
[14] E. A. Vittoz, “Analog VLSI Signal Processing: Why, Where, and How?” Analog
Integrated Circuits and Signal Processing, vol. 6, no. 1, pp. 27-44, July 1994.
[15] B. A. Minch, C. Diorio, P. Hasler, and C. A. Mead, “Translinear Circuits using
Subthreshold Floating-Gate MOS Transistors”, Analog Integrated Circuits and
Signal Processing, vol. 9, no. 2, pp. 167-179, 1996.
[16] B. A. Minch, “Analysis, Synthesis, and Implementation of Networks of Multiple-
Input Translinear Elements”, Ph.D. Thesis, California Institute of Technology,
1997.
[17] K. W. Yang and A. G. Andreou, “A Multiple-Input Differential-Amplifier Based
on Charge Sharing on a Floating-Gate MOSFET”, Analog Integrated Circuits
and Signal Processing, vol. 6, no. 3, pp. 167-179, 1994.
[18] E. Vittoz, “Dynamic Analog Techniques”, Design of Analog-Digital VLSI Cir-
cuits for Telecommunications and Signal Processing, J. Franca and Y. Tsividis,
Eds., Englewood Cliffs, New Jersey: Prentice Hall, 1994.
[19] S. Aritome, R. Shirota, G. Hemink, T. Endoh, and F. Masuoka, “Reliability
Issues of Flash Memory Cells”, Proceedings of the IEEE, vol. 81, no. 5, pp. 776-
788, May 1993.
[20] R. W. Keyes, The Physics of VLSI Systems, New York: Addison-Wesley, 1987.
[21] J. Alspector, R. Meir, B. Yuhas, A. Jayakumar, and D. Lippe, “A Parallel Gra-
dient Descent Method for Learning in Analog VLSI Neural Networks”, Advances
in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufman
Publishers, vol. 5, pp. 836-844, 1993.
[22] R. Coggins, M. Jabri, B. Flower, and S. Pickard, “A Hybrid Analog and Digital
VLSI Neural Network for Intracardiac Morphology Classification”, IEEE Journal
of Solid-State Circuits, vol. 30, no. 5, pp. 542-550, May 1995.
[23] M. Jabri and B. Flower, “Weight Perturbation: An Optimal Architecture and
Learning Technique for Analog VLSI Feedforward and Recurrent Multilayer Net-
works”, IEEE Transactions on Neural Networks, vol. 3, no. 1, pp. 154-157, Jan.
1992.
[24] B. Flower and M. Jabri, “Summed Weight Neuron Perturbation: An O(N) Im-
provement over Weight Perturbation”, Advances in Neural Information Process-
ing Systems, San Mateo, CA: Morgan Kaufman Publishers, vol. 5, pp. 212-219,
1993.
[25] G. Cauwenberghs, “A Fast Stochastic Error-Descent Algorithm for Supervised
Learning and Optimization”, Advances in Neural Information Processing Sys-
tems, San Mateo, CA: Morgan Kaufman Publishers, vol. 5, pp. 244-251, 1993.
[26] P. W. Hollis and J. J. Paulos, “A Neural Network Learning Algorithm Tailored
for VLSI Implementation”, IEEE Transactions on Neural Networks, vol. 5, no.
5, pp. 784-791, Sept. 1994.
[27] G. Cauwenberghs, “Analog VLSI Stochastic Perturbative Learning Architec-
tures”, Analog Integrated Circuits and Signal Processing, 13, 195-209, 1997.
[28] C. Diorio, P. Hasler, B. A. Minch, and C. A. Mead, “A Single-Transistor Silicon
Synapse”, IEEE Transactions on Electron Devices, vol. 43, no. 11, pp. 1972-1980,
Nov. 1996.
[29] C. Diorio, P. Hasler, B. A. Minch, and C. A. Mead, “A Complementary Pair of
Four-Terminal Silicon Synapses”, Analog Integrated Circuits and Signal Process-
ing, vol. 13, no. 1-2, pp. 153-166, May-June 1997.
[30] C. Mead, Analog VLSI and Neural Systems, New York: Addison-Wesley, 1989.
[31] S. W. Golomb, Shift Register Sequences, Revised Ed., Laguna Hills, CA: Aegean
Park Press, 1982.
[32] P. Horowitz, and W. Hill, The Art of Electronics, Second Ed., Cambridge, UK:
Cambridge University Press, pp. 655-664, 1989.
[33] J. Alspector, B. Gupta, and R. B. Allen, “Performance of a stochastic learn-
ing microchip” in Advances in Neural Information Processing Systems 1, D. S.
Touretzky, Ed., San Mateo, CA: Morgan Kaufman, 1989, pp. 748-760.
[34] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Second
Ed., Reading, Mass.: Addison-Wesley, 1993.
[35] R. Gregorian and G. C. Temes, Analog MOS Integrated Circuits for Signal Pro-
cessing, New York: John Wiley & Sons, 1986.
[36] J. Alspector, J. W. Gannett, S. Haber, M. B. Parker, and R. Chu, “A VLSI-
Efficient Technique for Generating Multiple Uncorrelated Noise Source and Its
Application to Stochastic Neural Networks”, IEEE Transactions on Circuits and
Systems, vol. 38, no. 1, pp. 109-123, Jan. 1991.
[37] S. Wolfram, Theory and Applications of Cellular Automata, World Scientific
Publishing Co. Pte. Ltd., 1986.
[38] A. Dupret, E. Belhaire, and P. Garda, “Scalable Array of Gaussian White Noise
Sources for Analogue VLSI Implementation”, Electronics Letters, vol. 31, no. 17,
pp. 1457-1458, Aug. 1995.
[39] G. Cauwenberghs, “An Analog VLSI Recurrent Neural Network Learning a
Continuous-Time Trajectory”, IEEE Transactions on Neural Networks, vol. 7,
no. 2, pp. 346-361, March 1996.
[40] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural
Computation, New York: Addison-Wesley, 1991.
[41] M. L. Minsky and S. A. Papert, Perceptrons, Cambridge: MIT Press, 1969.
[42] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning Representations
by Back-Propagating Errors”, Nature, vol. 323, pp. 533-536, Oct. 9, 1986.
[43] T. Morie and Y. Amemiya, “An All-Analog Expandable Neural Network LSI with
On-Chip Backpropagation Learning”, IEEE Journal of Solid-State Circuits, vol.
29, no. 9, pp. 1086-1093, Sept. 1994.
[44] A. Dembo and T. Kailath, “Model-Free Distributed Learning”, IEEE Transac-
tions on Neural Networks, vol. 1, no. 1, pp. 58-70, March 1990.
[45] B. Widrow and M. A. Lehr, “30 Years of Adaptive Neural Networks: Perceptron,
Madaline, and Backpropagation”, Proceedings of the IEEE, vol. 78, no. 9, pp.
1415-1442, Sept. 1990.
[46] Y. Xie and M. Jabri, “Analysis of the effects of quantization in multilayer neural
networks using a statistical model”, IEEE Transactions on Neural Networks, vol.
3, no. 2, pp. 334-338, 1992.
[47] A. F. Murray, “Multilayer Perceptron Learning Optimized for On-Chip Imple-
mentation: A Noise-Robust System”, Neural Computation, no. 4, pp. 367-381,
1992.
[48] Y. Maeda, J. Hirano, and Y. Kanata, “A Learning Rule of Neural Networks via
Simultaneous Perturbation and Its Hardware Implementation”, Neural Networks,
vol. 8, no. 2, pp. 251-259, 1995.
[49] V. Petridis and K. Paraschidis, “On the Properties of the Feedforward Method:
A Simple Training Law for On-Chip Learning”, IEEE Transactions on Neural
Networks, vol. 6, no. 6, Nov. 1995.
[50] D. B. Kirk, D. Kerns, K. Fleischer, and A. H. Barr, “Analog VLSI Implemen-
tation of Multi-dimensional Gradient Descent”, Advances in Neural Information
Processing Systems, San Mateo, CA: Morgan Kaufman Publishers, vol. 5, pp.
789-796, 1993.
[51] J. S. Denker and B. S. Wittner, “Network Generality, Training Required, and
Precision Required”, Neural Information Processing Systems, D. Z. Anderson
Ed., New York: American Institute of Physics, pp. 219-222, 1988.
[52] P. W. Hollis, J. S. Harper, and J. J. Paulos, “The Effects of Precision Constraints
in a Backpropagation Learning Network”, Neural Computation, vol. 2, pp. 363-
373, 1990.
[53] M. Stevenson, R. Winter, and B. Widrow, “Sensitivity of Feedforward Neural
Networks to Weight Errors”, IEEE Transactions on Neural Networks, vol. 1, no.
1, March 1990.
[54] A. J. Montalvo, P. W. Hollis, and J. J. Paulos, “On-Chip Learning in the Analog
Domain with Limited Precision Circuits”, International Joint Conference on
Neural Networks, vol. 1, pp. 196-201, June 1992.
[55] T. G. Clarkson, C. K. Ng, and Y. Guan, “The pRAM: An Adaptive VLSI Chip”,
IEEE Transactions on Neural Networks, vol. 4, no. 3, May 1993.
[56] A. Torralba, F. Colodro, E. Ibáñez, and L. G. Franquelo, “Two Digital Circuits
for a Fully Parallel Stochastic Neural Network”, IEEE Transactions on Neural
Networks, vol. 6, no. 5, Sept. 1995.
[57] H. Card, “Digital VLSI Backpropagation Networks”, Can. J. Elect. & Comp.
Eng., vol. 20, no. 1, 1995.
[58] H. C. Card, S. Kamarsu, and D. K. McNeill, “Competitive Learning and Vector
Quantization in Digital VLSI Systems”, Neurocomputing, vol. 18, pp. 195-227,
1998.
[59] H. C. Card, C. R. Schneider, and R. S. Schneider, “Learning Capacitive Weights
in Analog CMOS Neural Networks”, Journal of VLSI Signal Processing, vol. 8,
pp. 209-225, 1994.
[60] G. Cauwenberghs and A. Yariv, “Fault-Tolerant Dynamic Multilevel Storage
in Analog VLSI”, IEEE Transactions of Circuits and Systems - II: Analog and
Digital Signal Processing, vol. 41, no. 12, Dec. 1994.
[61] G. Cauwenberghs, “A Micropower CMOS Algorithmic A/D/A Converter”, IEEE
Transactions on Circuits and Systems - I: Fundamental Theory and Applications,
vol. 42, no. 11, Nov. 1995.
[62] P. Mueller, J. Van Der Spiegel, D. Blackman, T. Chiu, T. Clare, C. Donham,
T. P. Hsieh, and M. Loinaz, “Design and Fabrication of VLSI Components for
a General Purpose Analog Neural Computer”, Analog VLSI Implementation of
Neural Systems, C. Mead and M. Ismail, Eds., Norwell, Mass.: Kluwer, pp.
135-169, 1989.
[63] R. B. Allen and J. Alspector, “Learning of Stable States in Stochastic Asym-
metric Networks”, IEEE Transactions on Neural Networks, vol. 1, no. 2, June
1990.
[64] G. Cauwenberghs, C. F. Neugebauer, and A. Yariv, “Analysis and Verification
of an Analog VLSI Incremental Outer-Product Learning System”, IEEE Trans-
actions on Neural Networks, vol. 3, no. 3, May 1992.
[65] C. Diorio, S. Mahajan, P. Hasler, B. Minch, and C. Mead, “A High-Resolution
Nonvolatile Analog Memory Cell”, Proceedings of the 1995 IEEE International
Symposium on Circuits and Systems, vol. 3, pp. 2233-2236, 1995.
[66] A. J. Montalvo, R. S. Gyurcsik, and J. J. Paulos, “An Analog VLSI Neural
Network with On-Chip Perturbation Learning”, IEEE Journal of Solid-State
Circuits, vol. 32, no. 4, April 1997.
[67] P. Baldi and F. Pineda, “Contrastive Learning and Neural Oscillations”, Neural
Computation, vol. 3, no. 4, pp. 526-545, Winter 1991.

ii

c

2001

Vincent F. Koosh All Rights Reserved

iii

Acknowledgements
First, I would like to thank my advisor, Rod Goodman, for providing me with the resources and tools necessary to produce this work. Next, I would like to thank the members of my thesis defense committee, Jehoshua Bruck, Christof Koch, Alain Martin, and Chris Diorio. I would also like to thank Petr Vogel for being on my candidacy committee. Thanks also goes out to some of my undergraduate professors, Andreas Andreou and Gert Cauwenberghs, for starting me off on this analog VLSI journey and for providing occasional guidance along the way. Many thanks go to the members of my lab and the Caltech VLSI community for fruitful conversations and useful insights. This includes Jeff Dickson, Bhusan Gupta, Samuel Tang, Theron Stanford, Reid Harrison, Alyssa Apsel, John Cortese, Art Zirger, Tom Olick, Bob Freeman, Cindy Daniell, Shih-Chii Liu, Brad Minch, Paul Hasler, Dave Babcock, Adam Hayes, William Agassounon, Phillip Tsao, Joseph Chen, Alcherio Martinoli, and many others. I thank Karin Knott for providing administrative support for our lab. I must also thank my parents, Rosalyn and Charles Soffar, for making all things possible for me, and I thank my brothers, Victor Koosh and Jonathan Soffar. Thanks also goes to Lavonne Martin for always lending a helpful hand and a kind ear. I also thank Grace Esteban for providing encouragement for many years. Furthermore, I would like to thank many of my other friends and colleagues, too numerous to name, for the many joyous times.

locally generated. nature has evolved techniques to deal with imprecise analog components by using redundancy and massive connectivity.iv Abstract Nature has evolved highly advanced systems capable of performing complex computations. A parallel perturbative weight update algorithm is used. adaptation. The chip uses multiple. Test results are shown for all of the circuits. The second neural network also uses a computer for the global control operations. it is possible to learn around circuit offsets. A circuit that performs a vector normalization. If the perturbation causes the error function to decrease. but all of the local operations are performed on chip. The chip is trained in loop with the computer performing control and weight updates. The arithmetic function circuits presented are based on MOS transistors operating in the subthreshold region with capacitive dividers as inputs to the gates. mathematical computations. and multiplication/division are shown. . digital systems cannot outperform analog systems in terms of power. Although digital systems have significantly surpassed analog systems in terms of performing precise. Two feedforward neural network implementations are also presented. pseudorandom bit streams to perturb all of the weights in parallel. is shown to display the ease with which simpler circuits may be combined to obtain more complicated functions. Furthermore. Circuits for performing squaring. and learning using analog components. Because the inputs to the gates of the transistors are floating. The first uses analog synapses and neurons with a digital serial weight bus. based on cascading the preceding circuits. digital switches are used to dynamically reset the charges on the floating gates to perform the computations. square root. The weights are implemented digitally. analog VLSI circuits are presented for performing arithmetic functions and for implementing neural networks. By training with the chip in the loop. high speed. and counters are used to adjust them. In this thesis. These circuits draw on the power of the analog building blocks to perform low power and parallel computations.

otherwise it is discarded. .v the weight change is kept. Test results are shown of both networks successfully learning digital functions such as AND and XOR.

1.1 Overview . . . . .1. . . . . . . .2 3. 3 Analog Computation with Dynamic Subthreshold Translinear Circuits 3. . . . . Multiplier/Divider Circuit . . . . . . . . . . . .3 2. . . . . . .5. . . . . . . . . . . . . . . . . . Above Threshold MOS Translinear Circuits . . . . . . . . . . . . . . . . . .1 3. . . . . . . . . . . . . . Subthreshold MOS Translinear Circuits . . . . . .1 1. . . . iii iv 1 2 2 2 4 4 5 6 7 2 Analog Computation with Translinear Circuits 2. . . . . . . . . . . . . . . . . . . . . . . . . 8 9 10 11 12 14 16 19 20 20 26 Squaring Circuit . . . .1 2. .2 Early Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 3. . . . . . . . . . . . . . . . . . Dynamic Gate Charge Restoration . . . . . . . . . .5. . . .5 . . . . . . . . . . . . .3 3. . . . . . . . . .1 3. . . . . . . . . . . . . . . . .1. . . . . .1. . . . . . . . Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .vi Contents Acknowledgements Abstract 1 Introduction 1.2 Analog Computation . VLSI Neural Networks . . . . . . . .4 BJT Translinear Circuits . . 3. . . . . . . . . . . . Translinear Circuits With Multiple Linear Input Elements . .4 3. . . . . . . . . . . . . . . . . . . 1. .2 3. Clock Effects . . . . . . . . . . . . . . . Root Mean Square (Vector Sum) Normalization Circuit. . . . . . . . . . . . . . . . Square Root Circuit . . . . . . . . . . . . . . . .2 2. . . . .1. .1 Floating Gate Translinear Circuits 3. . . . Discussion . . . . . . .

.1 4. . . . . . . . Test Results . Feedforward Network . . . . . . .5. . . . . . . . . Conclusions . .1. . . . . . .6 5. .3 4. . . . . . . . . . . . . . .6 4. . . . . . . . . . . . . . . .7 4. . . . . . . . Perturbative Neural Network Implementations . . . . Feedforward Method . . . . . . . . . . . . . . . . . . .3 3. . . . . . . . . . .7 Synapse .1. . . . . Limitations. . .1 5.2. . . . . . . . .7 Speed. . .2. . . . . . . Serial Weight Perturbation . . . . . . . . .1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Neuron . . . . . . . . . . 27 27 29 Floating Gate Versus Dynamic Comparison Conclusion . . . .1. . . . . .5 4. . . . .2 4. 4 VLSI Neural Network Algorithms. . . . . . . . . . . . . . . . . . . . . . .4 5. . . . . .3 Number of Learnable Functions with Limited Precision Weights 36 Trainability with Limited Precision Weights . .5 5. . . . . . .2 4. . . . . . . . . . . .2 Backpropagation . . . . . .1.4 4. .1. . . . . . . 4. . . . . . . . . . .2 Contrastive Neural Network Implementations . . . . . . . . . .3 5. . . . . . . . . . . . . . . . . . .3 VLSI Neural Network Implementations . . . .1 4. . . . . . . . and Implementations 4. . . . . . . 4. . . . . . . . . . . . . . . . . . . . . . . Summed Weight Neuron Perturbation . . . . . . 43 44 48 54 57 60 60 67 . . .3. . . . . . . . . . . . Serial Weight Bus .1. . . . . . . . . . . . . . . Accuracy Tradeoffs . .3. . . . . . . . . . . . . . Analog Weight Storage . . . . . . . . . .1 4. . . . . . . . . . . . . . . . . . . . .6 3. . . . . . . . . . . . . . . . . . . . . . . . . Training Algorithm . . . . Parallel Weight Perturbation . 30 30 30 31 32 33 34 35 36 36 VLSI Neural Network Limitations 4. .1 VLSI Neural Network Algorithms . . . . . . . . . . . . . . . . . . . . . . . 37 38 39 40 41 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .vii 3. . . . 5 VLSI Neural Network with Analog Multipliers and a Serial Digital Weight Bus 5. . . . Chain Rule Perturbation . . . Neuron Perturbation . .2. . . . . . . . .2 5. . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1. . . . . . . . . . . . . . . . . 68 69 71 74 74 78 78 81 82 86 91 93 93 93 93 94 96 6. . . . . . . . . . . . . . . . . . 7. VLSI Neural Networks . . . . . . . . . . . . . .2 Dynamic Subthreshold MOS Translinear Circuits .7 6. . . .8 6. .viii 6 A Parallel Perturbative VLSI Neural Network 6. . . . . . . Multiple Pseudorandom Bit Stream Circuit . . Multiple Pseudorandom Noise Generators . . . . . Up/down Counter Weight Storage Elements . .4 6. . . . . Future Directions . . . . . . . .3 6. . . . .9 Linear Feedback Shift Registers . . . . Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Error Comparison . . . . . . . . . . . . . . . . . .1.2 7. . . . . . . . . . . . . .10 Conclusions . . .5 6. . . . . . . . . . . . . . . . . . Bibliography . .1 Contributions . .1 7. . . . . . . . . . . . . . Training Algorithm . . . . . . . . . . . . . Inverter Block . . . . . . Parallel Perturbative Feedforward Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6 6. . . . . . 7 Conclusions 7. . . . . .2 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 6. . . . . . . . . . .

. . . . . . .10 Square root circuit results . . . . . . . 2 hidden unit. . . . . . . . Squaring circuit results .4 5. . . . . . Vout+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 3. . . . . . . . . .ix List of Figures 3. . . . for various values of Vof f set . . . 2 input. . . . . as a function of ∆Iin . . . Dynamically restored squaring circuit . . Multiplier/divider circuit . . .7 5. .6 3. . . . . . . . 5. . . . . . . . . . . . . . . . . . . . . .2 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 Neuron circuit . . . . . . . . . . . .1 5. . . . . . . . . . . . . . .10 Differential to single ended converter and digital buffer . . . . . . .5 3. . . 1 output neural network . . Dynamically restored multiplier/divider circuit . . . . . . . . . . . . . . . . . . Dynamically restored square root circuit . Synapse circuit . . . . . . . . . . Neuron differential output voltage. . . . . . . Squaring circuit . . . . . . .1 3. . .3 Binary weighted current source circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 49 52 53 55 56 58 59 5. . 5. . . . . . . . . as a function of ∆Iin . . . . . . . . . . . . . . . . . Root mean square (vector sum) normalization circuit . . . . . . . . . . .9 Capacitive voltage divider . . . 5. . . . . . .4 3.6 Neuron positive output. . . . . . . . .8 5. . . . . . . . . for various values of Vgain . .2 5. . 9 10 11 12 14 15 16 17 21 22 23 24 45 45 3. Square root circuit . . . . . . . . . . . . . . . . . . ∆Vout . . . . . . . . . . . . . . . . . . .12 RMS normalization circuit results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Synapse differential output current as a function of differential input voltage for various digital weight settings .8 3. . 3. . . . . . .7 3. .11 Multiplier/divider circuit results . . Nonoverlapping clock generator . . . . . . . . . . . . . .9 Serial weight bus shift register . . 3. .

15 Training of a 2:2:1 network with the XOR function with more iterations 90 . . . . . .15 Training of a 2:2:1 network with XOR function starting with “ideal” weights . . . . . . 6. . . . . . . . . . . . . . . . .12 Training of a 2:1 network with AND function . . . . . . . . . .11 Parallel Perturbative algorithm . . . .4 6. . . . . . . . . . .7 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . .8 6. . . . . . . . . . . . . . . . .10 Pseudocode for parallel perturbative learning network . .2 6. . . . . . . . . . . . . . . . . . . . . . . . . .13 Training of a 2:1 network with the AND function . . . 6. . . . . . . . . . . . . . . . . . . . . . . Full adder for up/down counter . . . 6. . 5. . . 66 69 69 72 75 76 77 78 79 80 83 83 84 88 89 63 64 6. .11 Error comparison section . . . . . . . . . . . . . . . . . . . . . Block diagram of parallel perturbative neural network circuit . . . . . . . . . 2-tap. . . . . . . . . . . . .12 Error comparison operation . . Maximal length LFSRs with only 2 feedback taps . . . . . . . . . . Synchronous up/down counter . . . . . . 6. . . . . . . . . 5. . . . . .14 Network with “ideal” weights chosen for implementing XOR . .9 Amplified thermal noise of a resistor . . . . . . . .14 Training of a 2:2:1 network with the XOR function . . . . Dual counterpropagating LFSR multiple random bit generator . m-bit linear feedback shift register . 60 62 5. . . Counter static register . . . . . . . . . . . . .3 6. .13 Training of a 2:2:1 network with XOR function starting with small random weights . . . 6. 5. . . . . . . . . Inverter block . . . . . . .6 6. . . . . . . .x 5. . . . . . . . . 6. . . . . . . . . . . . . . . . .5 6. . . . . . . . . . . .1 6. . . . . . . . . . . . . . . . . .

6. . . . . . . . . . . . n=2. . . . . . . .1 6. . . . . . . . . . . . LFSR with m=4. . . . 65 70 70 . . . . .1 Computations for network with “ideal” weights chosen for implementing XOR . .2 LFSR with m=4. . n=3 yielding 1 maximal length sequence . . . . yielding 3 possible subsequences . . . . .xi List of Tables 5.

The places where analog circuits are required and excel are the places where digital equivalents cannot live up to the task. These circuits work with extremely low current levels and would fit well into applications requiring ultra low power designs. For example. ultra high speed circuits. handheld devices. Although these neural networks can be implemented with digital microprocessors. The inherent parallelism of neural networks allows a compact high speed solution in analog VLSI. several circuits are presented for implementing neural network architectures.1 Chapter 1 Introduction Although many applications are well suited for digital solutions. First. In this thesis. the large growth in portable devices with limited battery life increases the need of finding custom low power solutions. such as in wireless applications. require the use of analog design techniques. a set of circuits for performing basic analog calculations such as squaring. Furthermore. square rooting. the area of operation of the neural network circuits can be modified from low power to high speed to meet the needs of the specific application. Finally. Another important area where digital circuits may not be sufficient is that of low power applications such as in biomedical. implantable devices or portable. analog circuit design continues to have an important place. Next. the importance of analog circuitry in interfacing with the world will never be displaced because the world itself consists of analog signals. multiplication and division are described. Neural networks have proven useful in areas requiring man-machine interactions such as handwriting or speech recognition. several circuit designs are described that fit well into the analog arena. .

linear input elements. First. such as bipolar transistors. Each of these circuits have switches at the inputs of their floating gates which are used to dynamically set and restore the charges at the floating gates to proceed with the computation.1 1. A class of analog CMOS circuits that can be used to perform many analog computational tasks is presented.1. whereas. a brief discussion of several VLSI neural network implementations is presented. These circuit building blocks are then combined into a more complicated circuit that normalizes a current by the square root of the sum of the squares (vector sum) of the currents. Next. a brief review of VLSI neural networks is given. Next. a discussion of the various learning rules which are appropriate for analog VLSI implementation are discussed. 1. Also. traditional translinear circuits are discussed which are implemented with devices showing an exponential characteristic. Also mentioned are translinear circuits that utilize networks of multiple. A few basic current-mode building blocks that perform squaring. and multiplication/division are shown.1 Overview Analog Computation In chapter 2.1. square root. MOS transistor versions of translinear circuits are discussed. In subthreshold MOS translinear circuits the exponential characteristic is exploited similarly to bipolar transistors. a brief review of analog computational circuits is given with an emphasis on translinear circuits. above threshold MOS translinear circuits use the square law characteristic and require a different design style. First. some of the issues associated with analog VLSI networks is presented such as limited precision weights and weight storage issues. This should be sufficient to gain understanding of how to implement other power law circuits. .2 VLSI Neural Networks In chapter 4. The circuits utilize MOSFETs in their subthreshold region as well as capacitors and switches to produce the computations.2 1. These form the basis of the circuits described in the next chapter. In chapter 3.

3 In chapter 5. 2 hidden layer. The inputs and outputs of the network are provided directly at pins on the chip. and for a 2 input. 1 output network trained with an XOR function. The chip uses a serial digital weight bus implemented by a long shift register to input the weights. 1 output network trained with an AND function. 2 hidden layer. Training results are also shown for a 2 input. The training algorithm used is a parallel weight perturbation technique. The network is trained in a chip-in-loop fashion with a host computer implementing the training algorithm. This includes the generation of random perturbations and counters for updating the digital words where the weights are stored. parallel. but all of the local. a VLSI feedforward neural network is presented that makes use of digital weights and analog synaptic multipliers and neurons. . In chapter 6. Training results are shown for a 2 input. a VLSI neural network that uses a parallel perturbative weight update technique is presented. and for a 2 input. 1 output network trained with an XOR function. 1 output network trained with an AND function. weight update computations are performed on-chip. The network uses the same synapses and neurons as the previous network.

and the collector current is the output which performs the exponentiation or antilog operation. which is approximately 26mV at room temperature. as a function of its base-emitter voltage. Operational amplifier circuits are commonly used for these tasks as well as for generic derivative. 2. and highpass filters. It is also possible to think of the collector . the base-emitter voltage is considered the input. analog circuits can perform many types of signal filtering including lowpass. This type of operation was well know to users of slide rules in the past. For a bipolar junction transistor.4 Chapter 2 Analog Computation with Translinear Circuits There are many forms of analog computation. bandpass. power law implementations. A well known class of circuits exists to perform such tasks called translinear circuits. integrator. and Ut = kT /q is the thermal voltage. is given as IC = IS exp Vbe Ut where IS is the saturation current. In using this formula. the collector current. and gain circuits. For example. The emphasis of this chapter. IC . The basic idea behind using these types of circuits is that of using logarithms for converting multiplications and divisions into additions and subtractions. Vbe . shall be on circuits with the ability to implement arithmetic functions such as signal multiplication. and trigonometric functions. however. division.1 BJT Translinear Circuits The earliest forms of these type of circuits were based on exploiting the exponential current-voltage relationships of diodes or bipolar junction transistors[1][2].

the equation can be seen as follows: Vbe = Ut ln IC IS From here it is possible to connect the transistors. and κ is a process dependent parameter which is normally around 0. or possibly a voltage buffer to remove base current offsets.5 current as the input and the base-emitter voltage as the output such as in a diode connected transistor.7. care must be taken because of the nonidealities introduced in subthreshold MOS. Id . where the collector has a wire. as a function of its gate-source voltage. connecting it to the base. For a subthreshold MOS transistor. using log and antilog techniques. the drain current. Vgs . It is also possible to obtain trigonometric functions by using other power law functional approximations[3]. it is sometimes possible to directly translate BJT translinear circuits into subthreshold MOS translinear circuits. is given as[30] κVgs Ut Id = Io exp where Io is the preexponential current factor. However. Due to this extra parameter. In this case. Thus.2 Subthreshold MOS Translinear Circuits Because of the similarities of the subthreshold MOS transistor equation and the bipolar transistor equation. to obtain various multiplication. it is clear that the subthreshold MOS transistor equation differs from the BJT equation with the introduction of the parameter κ. 2. direct copies of the functional form of BJT translinear circuits into subthreshold MOS translinear circuits do not always lead to the same . However. division and power law circuits. it is not always trivial to obtain the best circuit configuration for a given application.

3 Above Threshold MOS Translinear Circuits The ability of above threshold MOS transistors to exhibit translinear behavior has also been demonstrated[6]. but since normally κ ≈ 0. Nevertheless. division. and vector sum computation[6][10][11]. general function synthesis is usually not as straightforward as for exponential translinear circuits[7][9]. 2. For an above threshold MOS transistor in saturation. the characteristics slowly decay in the transition region from subthreshold to above threshold. Several interesting circuits have been demonstrated such as those for performing squaring. Furthermore. the power becomes ≈ 0. Although many of the circuits . However. Subthreshold MOS translinear circuits are more limited in their dynamic range than BJT translinear circuits because of entrance into strong inversion. is given as Id = K (Vgs − Vt )2 Because of the square law characteristic used. When κ = 1. power. easier integration with standard digital circuits in a CMOS process is possible. Furthermore.7. because of the inflexibility of the square law compared to an exponential law. Vgs . lower power operation may be possible with the subthreshold MOS circuits. These involve a few added restrictions on the types of translinear loop topologies allowed. for a particular circuit that performs a square rooting operation on a current using BJTs. However.6 implemented arithmetic function. For example. due to the lack of a required base current. these circuits are sometimes called quadratic-translinear or MOS translinear to distinguish them from translinear circuits based on exponential characteristics. I d . as a function of its gate-source voltage. when converted to the subthreshold MOS equivalent. multiplication. several types of subthreshold MOS circuits which display κ independence do exist[4][5]. the drain current.41. it will actually raise the current to the this reduces to the appropriate 1 2 κ κ+1 power[30].

The linear elements are used to add. for decreasingly small submicron channel lengths. and when the exponential translinear devices used are subthreshold MOS transistors. Thus. Furthermore. it is clear that multiplying values by a constant. subtract and multiply variables by a constant. the linear networks are usually defined by resistor networks[13][14]. These circuits are usually more restricted in their dynamic range than bipolar translinear circuits. can be transformed into raising a value to the n-th power. When the exponential translinear devices used are bipolar transistors. the square-law current-voltage equation is usually not as good an approximation to the actual circuit behavior as for both bipolar circuits and MOS circuits that are deep within the subthreshold region. This class of circuits can be viewed as a generalization of the translinear principle. . They are limited at the low end by weak inversion and at the high end by mobility reduction[7]. From the point of view of log-antilog mathematics. some also take advantage of transistors biased in the triode or linear region[12]. the addition of a linear network provides greater flexibility in function synthesis. 2.4 Translinear Circuits With Multiple Linear Input Elements Another class of translinear circuits exists that involves the addition of a linear network of elements used in conjunction with exponential translinear devices. the linear networks are usually defined by capacitor networks[15][16]. n. Also.7 rely upon transistors in saturation. the current-voltage equation begins to approximate a linear function more than a quadratic[8].

One technique is to expose the circuit to ultraviolet light while grounding all of the pins. These circuits use capacitively coupled inputs to the gate of the MOSFET in a capacitive voltage divider configuration. because the actual gate charge is not critical so long as the gate charges are equalized. stray charge may again begin to accumulate requiring another ultraviolet exposure. the gate voltages are somewhat random and must be equalized for proper circuit operation. a dynamic restoration technique is utilized that makes it possible to operate the circuits indefinitely. Since the gate has no DC path to ground it is floating and some means must be used to initially set the gate charge and.8 Chapter 3 Analog Computation with Dynamic Subthreshold Translinear Circuits A class of analog CMOS circuits has been presented which made use of MOS transistors operating in their subthreshold region[15][16]. such as when the floating gate transistors are in a fully differential configuration. The gate voltage is initially set at the foundry by a process which puts different amounts of charge into the oxides. when one gets a chip back from the foundry. This has the effect of reducing the effective resistance of the oxide and allowing a conduction path. voltage value. Thus. it does not always constrain the actual value of the gate voltage to a particular value. Although this technique ensures that all the gate charges are equalized. Therefore. For the circuits presented here it is imperative that the initial voltages on the gates are set precisely and effectively to the same value near the circuit ground. hence. . This technique works in certain cases[17]. If the floating gate circuits are operated over a sufficient length of time.

3.1 Floating Gate Translinear Circuits

The analysis begins by describing the math that governs the implementation of these circuits. A few simple circuit building blocks that should give the main idea of how to implement other designs will be presented. A more thorough analysis for circuit synthesis is given elsewhere[15][16]. In the following formulations, all of the transistors are assumed to be identical.

Figure 3.1: Capacitive voltage divider

For the capacitive voltage divider shown in figure 3.1, if all of the voltages are initially set to zero volts and then V_1 and V_2 are applied, the voltage at the node V_T becomes

V_T = \frac{C_1 V_1 + C_2 V_2}{C_1 + C_2}

If C_1 = C_2, then this equation reduces to

V_T = \frac{V_1}{2} + \frac{V_2}{2}

The current-voltage relation for a MOSFET transistor operating in subthreshold and in saturation (V_ds > 4U_t, where U_t = kT/q) is given by[30]:

I_{ds} = I_o \exp(\kappa V_{gs} / U_t)

Using the above equation for a transistor operating in subthreshold, as well as a capacitive voltage divider, the necessary equations of the computations desired can be produced.
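As a quick numerical illustration of these two relations, the short Python sketch below evaluates the divider voltage and the resulting subthreshold current. It is only a sketch; the constants I_o, κ, and U_t are illustrative placeholders and are not measured values from the fabricated process.

    import math

    # Illustrative constants (assumed values, not extracted from the actual process)
    I_o = 1e-15      # subthreshold pre-exponential current, A
    kappa = 0.7      # gate coupling coefficient
    U_t = 0.0259     # thermal voltage kT/q at room temperature, V

    def divider_voltage(V1, V2, C1, C2):
        """Floating-gate voltage of a two-input capacitive divider."""
        return (C1 * V1 + C2 * V2) / (C1 + C2)

    def subthreshold_current(V_gs):
        """Saturated subthreshold MOSFET current, Ids = Io*exp(kappa*Vgs/Ut)."""
        return I_o * math.exp(kappa * V_gs / U_t)

    V1, V2 = 0.30, 0.40          # example input voltages, V
    C = 2.475e-12                # equal input capacitors, F
    V_T = divider_voltage(V1, V2, C, C)
    print(V_T, (V1 + V2) / 2)            # identical when C1 == C2
    print(subthreshold_current(V_T))     # drain current at the divided gate voltage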

3.1.1 Squaring Circuit

Figure 3.2: Squaring circuit

For the squaring circuit shown in figure 3.2, all of the capacitors are of the same size. The I-V equations for each of the transistors can be written as follows:

\frac{U_t}{\kappa} \ln\frac{I_{out}}{I_o} = V_{in}

\frac{U_t}{\kappa} \ln\frac{I_{ref}}{I_o} = V_{ref}

\frac{U_t}{\kappa} \ln\frac{I_{in}}{I_o} = \frac{V_{in}}{2} + \frac{V_{ref}}{2}

Then, these results can be combined to obtain the following:

\frac{U_t}{\kappa} \ln\frac{I_{in}}{I_o} = \frac{1}{2}\frac{U_t}{\kappa} \ln\frac{I_{out}}{I_o} + \frac{1}{2}\frac{U_t}{\kappa} \ln\frac{I_{ref}}{I_o}

I_{out} = I_o \frac{(I_{in}/I_o)^2}{I_{ref}/I_o} = \frac{I_{in}^2}{I_{ref}}
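A minimal numerical check of this derivation can be done in software by solving the same three transistor equations; the sketch below does exactly that and compares the result against I_in^2/I_ref. The subthreshold constants are again assumed, illustrative values.

    import math

    I_o, kappa, U_t = 1e-15, 0.7, 0.0259   # illustrative subthreshold constants (assumed)

    def gate_voltage(I):
        # invert Ids = Io*exp(kappa*V/Ut) for a diode-connected transistor
        return (U_t / kappa) * math.log(I / I_o)

    def squaring_circuit(I_in, I_ref):
        V_ref = gate_voltage(I_ref)                 # reference transistor gate voltage
        # the input transistor's floating gate sits at (V_in + V_ref)/2, so solve for V_in
        V_in = 2.0 * gate_voltage(I_in) - V_ref
        return I_o * math.exp(kappa * V_in / U_t)   # output transistor current

    I_ref = 1e-8
    for I_in in (1e-9, 1e-8, 1e-7):
        print(I_in, squaring_circuit(I_in, I_ref), I_in**2 / I_ref)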

3.1.2 Square Root Circuit

Figure 3.3: Square root circuit

The equations for the square root circuit shown in figure 3.3 are derived similarly as follows:

\frac{U_t}{\kappa} \ln\frac{I_{ref}}{I_o} = V_{ref}

\frac{U_t}{\kappa} \ln\frac{I_{in}}{I_o} = V_{in}

\frac{U_t}{\kappa} \ln\frac{I_{out}}{I_o} = \frac{V_{in}}{2} + \frac{V_{ref}}{2}

Combining the equations then gives

\frac{U_t}{\kappa} \ln\frac{I_{out}}{I_o} = \frac{1}{2}\frac{U_t}{\kappa} \ln\frac{I_{in}}{I_o} + \frac{1}{2}\frac{U_t}{\kappa} \ln\frac{I_{ref}}{I_o}

I_{out} = \sqrt{I_{ref} I_{in}}

Figure 3.4: Multiplier/divider circuit

3.1.3 Multiplier/Divider Circuit

For the multiplier/divider circuit shown in figure 3.4, the calculations are again performed as follows:

\frac{U_t}{\kappa} \ln\frac{I_{in1}}{I_o} = \frac{V_{in1}}{2}

\frac{U_t}{\kappa} \ln\frac{I_{in2}}{I_o} = \frac{V_{in2}}{2}

\frac{U_t}{\kappa} \ln\frac{I_{ref}}{I_o} = \frac{V_{in2}}{2} + \frac{V_{ref}}{2}

\Rightarrow \frac{V_{ref}}{2} = \frac{U_t}{\kappa} \ln\frac{I_{ref}}{I_o} - \frac{U_t}{\kappa} \ln\frac{I_{in2}}{I_o}

\frac{U_t}{\kappa} \ln\frac{I_{out}}{I_o} = \frac{V_{in1}}{2} + \frac{V_{ref}}{2}

\Rightarrow \frac{U_t}{\kappa} \ln\frac{I_{out}}{I_o} = \frac{U_t}{\kappa}\left(\ln\frac{I_{ref}}{I_o} - \ln\frac{I_{in2}}{I_o} + \ln\frac{I_{in1}}{I_o}\right)

I_{out} = I_{ref}\,\frac{I_{in1}}{I_{in2}}

Although the input currents are labeled in such a way as to emphasize the dividing nature of this circuit, it is possible to rename the inputs such that the divisor is the reference current and the two currents in the numerator would be the multiplying inputs. Note that the divider circuit output is only valid when Iref is larger than Iin2. This is because the gate of the transistor with current Iref is limited to the voltage Vfgref at the gate by the current source driving Iref. Since this gate is part of a capacitive voltage divider between Vref and Vin2, when Vin2 > Vfgref, the voltage at the node Vref is zero and cannot go lower. Thus, no extra charge is coupled onto the output transistors, and when Iin2 > Iref, Iout ≈ Iin1. Also, from the input/output relation, it would seem that Iin1 and Iref are interchangeable inputs. Unfortunately, based on the previous discussion, whenever Iin2 > Iin1, Iout ≈ Iref. This would mean that no divisions with fractional outputs of Iin1/Iin2 < 1 could be performed. When such a fractional output is desired, the circuit in this configuration would not perform properly. The preceding discussion indicates that the final simplified input/output relation of the circuit is slightly deceptive. Care must be taken to ensure that the circuit is operated in the proper operating region with appropriately sized inputs and reference currents. Furthermore, when doing the analysis for similar power law circuits, the circuit designer must be aware of the charge limitations at the individual floating gates. A more accurate input/output relation for the multiplier/divider circuit is thus given as follows:

I_{out} = \begin{cases} I_{ref}\,I_{in1}/I_{in2} & \text{if } I_{in2} < I_{ref} \\ I_{in1} & \text{if } I_{in2} > I_{ref} \end{cases}
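The piecewise relation above is easy to capture as a small behavioral model, which can be useful when composing these blocks into larger systems. The sketch below is only an idealized model of that clipping behavior, not a device-level simulation, and the current values are arbitrary examples.

    def multiplier_divider(I_in1, I_in2, I_ref):
        """Idealized multiplier/divider with the clipping described above:
        the division is only performed while I_in2 < I_ref; otherwise the
        output simply follows I_in1."""
        if I_in2 < I_ref:
            return I_ref * I_in1 / I_in2
        return I_in1

    I_ref = 10e-9                        # example 10 nA reference
    for I_in2 in (1e-9, 5e-9, 20e-9):    # sweep the divisor past I_ref
        print(I_in2, multiplier_divider(1e-9, I_in2, I_ref))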

3.2 Dynamic Gate Charge Restoration

Figure 3.5: Dynamically restored squaring circuit

All of the above circuits assume that some means is available to initially set the gate charge level so that when all currents are set to zero, the gate voltages are also zero. One method of doing so which lends itself well to actual circuit implementation is that of using switches to dynamically set the charge during one phase of operation, and then to allow the circuit to perform computations during a second phase of operation. The squaring circuit in figure 3.5 is shown with switches now added to dynamically equalize the charge. A non-overlapping clock generator generates the two clock signals.

During the first phase of operation, φ1 is high and φ2 is low. This is the reset phase of operation: the input currents do not affect the circuit, all sides of the capacitors are discharged, and the floating gates of the transistors are also grounded and discharged. This establishes an initial condition with no current through the transistors corresponding to zero gate voltage. Then, during the second phase of operation, φ1 goes low and φ2 goes high. The transistor gates are allowed to float, and the input currents are reapplied. The circuit now exactly resembles the aforementioned floating gate squaring circuit and is thus able to perform the necessary computation. This is the compute phase of operation.

Figure 3.6: Dynamically restored square root circuit

The square root circuit (figure 3.6) and multiplier/divider circuit (figure 3.7) are also constructed in the same manner by adding switches connected to φ2 at the drains of each of the transistors and switches connected to φ1 at each of the floating gate and capacitor terminals.

Figure 3.7: Dynamically restored multiplier/divider circuit

3.3 Root Mean Square (Vector Sum) Normalization Circuit

The above circuits can be used as building blocks and combined with current mirrors to perform a number of useful computations. For example, a normalization stage can be made which normalizes a current by the square root of the sum of the squares of other currents. Such a stage is useful in many signal processing tasks. This normalization stage is seen in figure 3.8. The reference current, Iref, is mirrored to all of the reference inputs of the individual stages. The reference current is doubled with the 1:2 current mirror into the divider stage. This is necessary because the reference current must be larger than the largest current that will be divided by. Since the current that is being divided will be the square root of a sum of squares of two currents, for example when Iin1 = Iin2 = Imax, it is necessary to make sure that the reference current for the divider section is greater than √2·Imax. Using the 1:2 current mirror and setting Iref = Imax, the reference current in the divider section will be 2Imax, which is sufficient to enforce the condition.

Figure 3.8: Root mean square (vector sum) normalization circuit

The first two stages that read the input currents are the squaring circuit stages. The outputs of these stages are summed and then fed back into the square root stage with a current mirror. The input to the square root stage is thus given by

\frac{I_{in1}^2}{I_{ref}} + \frac{I_{in2}^2}{I_{ref}}

The square root stage then performs the computation

\sqrt{\left(\frac{I_{in1}^2}{I_{ref}} + \frac{I_{in2}^2}{I_{ref}}\right) I_{ref}} = \sqrt{I_{in1}^2 + I_{in2}^2}

The output of the square root stage is then fed into the multiplier/divider stage as the divisor current. The other input to the divider stage will be a mirrored copy of one of the input currents. The reference current for the divider stage is twice the reference current for the other stages, as discussed before. Thus, the output of the divider stage is given by

\frac{I_{in1}}{\sqrt{I_{in1}^2 + I_{in2}^2}}\,2 I_{ref}

The output can then be fed back through another 2:1 current mirror (not shown) to remove the factor of 2. Thus, the overall transfer function computed would be

I_{out} = I_{ref}\,\frac{I_{in1}}{\sqrt{I_{in1}^2 + I_{in2}^2}}

The addition of other divider stages can be used to normalize other input currents. This circuit easily extends to more variables by adding more input squaring stages and connecting them all to the input of the current mirror that outputs to the square root stage.
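To show how easily the building blocks compose, the following sketch chains idealized squaring, square root, and divider models exactly as described above and compares the result to Iref·Iin1/sqrt(Iin1² + Iin2²). It is a behavioral model only; the 2:1 output mirror is represented by a simple division by two, and the current values are arbitrary examples.

    import math

    def square(I_in, I_ref):        return I_in**2 / I_ref
    def square_root(I_in, I_ref):   return math.sqrt(I_in * I_ref)
    def divide(I_in1, I_in2, I_ref):
        return I_ref * I_in1 / I_in2 if I_in2 < I_ref else I_in1

    def normalize(I_in1, I_in2, I_ref):
        # squaring stages share the same reference; their outputs are summed
        summed = square(I_in1, I_ref) + square(I_in2, I_ref)
        divisor = square_root(summed, I_ref)          # = sqrt(I_in1^2 + I_in2^2)
        out = divide(I_in1, divisor, 2.0 * I_ref)     # divider runs off 2*I_ref
        return out / 2.0                              # 2:1 mirror removes the factor of 2

    I_ref = 10e-9
    I1, I2 = 4e-9, 3e-9
    print(normalize(I1, I2, I_ref), I_ref * I1 / math.hypot(I1, I2))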

3.4 Experimental Results

The above circuits were fabricated in a 1.2µm double poly CMOS process. The floating gate nfets were all W=30µ, L=30µ. The nfet switches were all W=3.8µ, L=3.6µ. All of the pfet transistors used for the mirrors were W=16.6µ, L=6µ. The capacitors were all 2.475 pF. The data gathered from the current squaring circuit, square root circuit, and multiplier/divider circuit are shown in figures 3.9, 3.10, and 3.11, respectively. Figure 3.12 shows the data from the vector sum normalization circuit. The circle markers represent the actual data points, and the solid line in the figures represents the ideal fit.

The circuits show good performance over several orders of magnitude in current range. The subthreshold region for these transistors is below approximately 100nA. Leakage currents limit the low end of the dynamic range. At the high end of current, the circuit deviates from the ideal when one or more transistors leaves the subthreshold region. This is because in above threshold, the current goes as the square of the voltage, whereas in subthreshold it follows the exponential characteristic, so more gate voltage is necessary per amount of current. For example, in figure 3.9, an offset from the theoretical fit for the Iref = 100nA case is seen. Since the reference transistor is leaving the subthreshold region, the diode connected reference transistor sets up a slightly higher gate voltage; hence, more charge and more voltage are coupled onto the output transistor. This causes the output current to be slightly larger than expected. Extra current range at the high end can be achieved by increasing the W/L of the transistors so that they remain subthreshold at higher current levels. The current W/L is 1; increasing W/L to 10 would change the subthreshold current range of these circuits to be below approximately 1µA and hence increase the high end of the dynamic range appropriately. Some of the anomalous points, such as those seen in figures 3.9 and 3.10, are believed to be caused by autoranging errors in the instrumentation used to collect the data.

The reference current for the normalization circuit, Iref, at the input of the reference current mirror array was set to 10nA. However, the ideal fit required a value of 14nA to be used. This was not seen in the other circuits; thus it is assumed that this is due to the Early effect of the current mirrors. In fact, the SPICE simulations of the circuit also predict the value of 14nA. Since this is merely a multiplicative effect, it is possible to simply scale the W/L of the final output mirror stage to correct it. Alternatively, it is possible to increase the length of the mirror transistors to reduce the Early effect, to use a more complicated mirror structure such as a cascoded mirror, or to use the SPICE simulations to change the mirror transistor ratios to obtain the desired output. The normalization circuit shows very good performance. This is helped by the compressive nature of the function. The divider circuit, as previously discussed, does not perform the division after Iin2 > Iref and instead outputs Iin1, as is seen in the figure. This was not a problem here because the reference current and the maximum input currents were chosen to keep the divider and all circuits within the proper subthreshold operating regions of the building block circuits.

3.5 Discussion

3.5.1 Early Effects

Although setting the floating gates near the circuit ground has been the goal of the dynamic technique discussed above, in the circuits previously discussed, the switches will have a small voltage drop across them and the floating gates are actually set slightly above ground. The main effect this has is to change the dynamic range. If the floating gates are reset to a value higher than ground, the transistors will enter the above threshold region with smaller input currents. In fact, it is possible to allow the floating gate values to be reset to something other than the circuit ground. It may also be possible to use a reference voltage smaller than ground in order to set the floating gates in such a way as to increase the dynamic range.

Figure 3.9: Squaring circuit results

Figure 3.10: Square root circuit results

Figure 3.11: Multiplier/divider circuit results

Figure 3.12: RMS normalization circuit results

One problem with these circuits is the presence of a large Early effect. This effect comes from two sources. First, the standard Early effect comes from an increase in channel current due to channel length modulation[30]. This can be modeled by an extra linear term that is dependent on the drain-source voltage, V_ds:

I_{ds} = I_o e^{\kappa V_{gs}/U_t}\left(1 + \frac{V_{ds}}{V_o}\right)

The voltage V_o, called the Early voltage, depends on process parameters and is directly proportional to the channel length, L. Another aspect of the Early effect which is important for circuits with floating gates is due to the overlap capacitance of the gate to the drain/source region of the transistor[15]. In these circuits, the main overlap capacitance of interest is the floating gate to drain capacitance, C_fg-d. This also depends on process parameters and is directly proportional to the width, W. This parasitic capacitance acts as another input capacitor from the drain voltage, V_d, so the drain voltage couples onto the floating gate by an amount V_d C_fg-d / C_T, where C_T is the total input capacitance. Since this is coupled directly onto the gate, it has an exponential effect on the current level. The method chosen to overcome these combined Early effects is to first make the transistors long. If L is sufficiently long, then the Early voltage, V_o, will be large and this will reduce the contribution of the linear Early effect. This also requires increasing the width, W, by the same amount as the length, L, to keep the same subthreshold current levels and not lose any dynamic range. However, this increase in W increases C_fg-d, which worsens the exponential Early effect. Thus, it is necessary to further increase the size of the input capacitors to make the total input capacitance large compared to the floating gate to drain parasitic capacitance to reduce the impact of the exponential Early effect. In this way, it is possible to trade circuit size for accuracy. It may be possible to use some of the switches as cascode transistors to reduce the Early effect in the output transistors of the various stages.

This would involve not setting φ2 all the way to Vdd during the computation and instead setting it to some lower cascode voltage. The input and reference transistors already have their drain voltages fixed due to the diode connection and would not benefit from this, but the output transistor drains are allowed to go near Vdd. Furthermore, it is important to make the input capacitors large enough that other parasitic capacitances are very small compared to them.

3.5.2 Clock Effects

Unlike switched capacitor circuits that require a very high clock rate compared to the input frequencies, the clock rate for these circuits is determined solely by the leakage rate of the switch transistors. Thus, it is possible to make the clock as slow as 1 Hz or slower. The input can change faster than the clock rate, and the output will be valid during most of the computation phase. The output does require a short settling period due to the presence of glitches in the output current from charge injection by the switches. The amount of charge injection is determined by the area of the switches and is minimized with small switches. Also, making the input capacitors large with respect to the size of the switches will reduce the effects of the glitches by making the injected charge less effective at inducing a voltage on the gate. It may be possible to use a current mode filter or to use two transistors in complementary phase at the output to compensate for the glitches[18]. Other techniques are also available which can improve matching characteristics and reduce the size of the circuits. One such technique would involve using a single transistor with a multiphase clock that can be used as a replacement for all the input and output transistors in the building blocks[18]. This may allow a reduction in the size of the transistors and capacitors.

3.5.3 Speed, Accuracy Tradeoffs

It is necessary to increase the size of the input capacitors and reduce the size of the switches in order to improve the accuracy and decrease the effect of glitches. However, this also has an effect on the speeds at which the circuits may be operated and the settling times of the signals. Subthreshold circuits are already limited to low currents, which in turn limits their speed of operation since it takes longer to charge up a capacitor or gate capacitance with small currents. Thus, the increases in input capacitance necessary to increase accuracy further limit the speed of operation. Furthermore, the addition of a switch along the current path reduces the speed of the circuits because of the effective on-resistance of the switch. The switches feeding into the capacitor inputs can be viewed as a lowpass R-C filter at the inputs of the circuits. It is possible to make the switches larger in order to decrease the effective resistance, but this will increase the parasitic capacitance and require an increase in the size of the input capacitors to obtain the same accuracy. Thus, there is an inherent tradeoff between the speed of operation and the accuracy of the circuits.

3.6 Floating Gate Versus Dynamic Comparison

As previously discussed, it is possible to use these circuits in either the purely floating gate version or in the dynamic version. In this section, a brief comparison is made between the advantages and disadvantages of the two design styles.

Floating gate advantages

• Larger dynamic range - The switches of the dynamic version introduce leakage currents and offsets which reduce dynamic range.

• No glitches - The floating gate version does not suffer from the appearance of glitches on the output and, thus, does not have the same settling time requirements.

• Constant operation - The output is always valid.

• Faster - Since it is possible to use smaller input capacitors and there is no added switch resistance along the current path, it is possible to operate the circuits at higher speeds.

• Less power - It is possible to go to smaller current values and, thus, lower power. Also, since there are fewer stacked transistors, smaller operating voltages are possible.

• Less area - There are fewer transistors because of the lack of switches, and also smaller input capacitors may be used.

• Better matching with smaller input capacitors - There are fewer parasitics because of the lack of switches; thus, it is possible to reduce the size of input capacitors and still have good matching.

Dynamic advantages

• No UV radiation required - In many cases it is desirable not to introduce an extra post processing step in order to operate the circuits. Furthermore, in some instances, ultraviolet irradiation fails to place these circuits into a proper subthreshold operating range and instead places the floating gate in a state where the transistors are biased above threshold.

• Decreased packaging costs - Packaging costs are increased when it is necessary to accommodate a UV transparent window[19].

• Robust against long term drift - Transistor threshold drift is a known problem in CMOS circuits[20] and Flash EEPROMs[19]. Much of the threshold shift is attributed to hot electron injection. These electrons get stuck in trap states in the oxide, thereby causing the shift. Much effort is placed into reducing the number of oxide traps to minimize this effect. The electrons can then exit the oxide and normally contribute to a very small gate current.

However, in floating gate circuits, the electrons are not removed and remain as excess charge, which leads to increased threshold drift. The dynamically restored circuits remove these excess charges during the reset phase of operation and provide greater immunity against long term drift. Furthermore, as process scales continue to decrease, direct gate tunneling effects will further exacerbate the problem.

Although the advantages that the pure floating gate circuits provide outnumber those of the dynamic circuits, the advantages afforded by the dynamic technique are of such significant importance that many circuit designers would choose to utilize the technique. In practice, a full consideration of each style's constraints with respect to the overall system design criteria would favor one or the other of the two design styles.

3.7 Conclusion

A set of circuits for analog circuit design that may be useful for analog computation circuits and neural network circuits has been presented. The dynamic charge restoration technique is shown to be a useful implementation of this class of analog circuits. Furthermore, the dynamic charge restoration technique may be applied to other floating gate computational circuits that may otherwise require initial ultraviolet illumination or other methods to set the initial conditions. It is hoped that it is clear from the derivations how to obtain other power law circuits that may be necessary and how to combine them to perform useful complex calculations.

Chapter 4
VLSI Neural Network Algorithms, Limitations, and Implementations

4.1 VLSI Neural Network Algorithms

4.1.1 Backpropagation

Early work in neural networks was hindered by the problem that single layer networks could only represent linearly separable functions[41]. Although it was known that multilayer networks could learn any boolean function and, hence, did not suffer from this limitation, the lack of a learning rule to program multilayered networks reduced interest in neural networks for some time[40]. The backpropagation algorithm[42], based on gradient descent, provided a viable means for programming multilayer networks and created a resurgence of interest in neural networks. A typical sum-squared error function used for neural networks is defined as follows:

E(\vec{w}) = \frac{1}{2}\sum_i (T_i - O_i)^2

where T_i is the target output for input pattern i from the training set, and O_i is the respective network output. In backpropagation, weight updates, \Delta w_{ij} \propto \partial E / \partial w_{ij}, are calculated for each of the weights. The network output is a composite function of the network weights and is composed of the underlying neuron sigmoidal function, s(·). Thus, the functional form of the weight update rules is composed of derivatives of the neuron sigmoidal function. The functional form of the update involves the weights themselves propagating the errors backwards to more internal nodes, which leads to the name backpropagation.

If the functional form of the neuron is, for example, s(x) = tanh(ax), then the weight update rule would involve functions of the form

s'(x) = a\left(1 - \tanh^2(ax)\right)

For software implementations of the neural network, the ease of computing this type of function allows an effective means to implement the backpropagation algorithm. However, in typical analog VLSI implementations, if the hardware implementation of the sigmoidal function tends to deviate from an ideal function such as the tanh function, it becomes necessary to compute the derivative of the function actually being implemented, which may not have a simple functional form. Otherwise, unwanted errors would get backpropagated and accumulate, which would lead to seriously degraded operation. For example, in one implementation, it was discovered that offsets of 1% degraded the learning success rate of the XOR problem by 50%[43]. In typical analog VLSI implementations, offsets much larger than 1% are quite common. Furthermore, deviations in the multiplier circuits can add offsets which would also accumulate and significantly reduce the ability of the network to learn. Also, analog implementations of the derivative can prove difficult, and implementations which avoid these difficulties are desirable.
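To make the role of the neuron derivative concrete, the toy Python sketch below trains a single tanh unit (with a bias input) by gradient descent on the sum-squared error, using exactly the derivative s'(x) = a(1 − tanh²(ax)). It is illustrative only; the learning rate, gain a, and the small AND-like training set are arbitrary assumptions and not related to any circuit in this thesis.

    import math, random

    a, eta = 1.0, 0.1
    random.seed(0)
    w = [random.uniform(-0.5, 0.5) for _ in range(3)]   # two inputs plus a bias weight

    def s(x):        return math.tanh(a * x)
    def s_prime(x):  return a * (1.0 - math.tanh(a * x) ** 2)   # derivative of tanh(ax)

    # Toy AND-like problem in {-1, +1}; the constant 1 provides the bias input
    patterns = [((-1, -1, 1), -1), ((-1, 1, 1), -1), ((1, -1, 1), -1), ((1, 1, 1), 1)]

    for epoch in range(500):
        for x, target in patterns:
            net = sum(wi * xi for wi, xi in zip(w, x))
            out = s(net)
            delta = (target - out) * s_prime(net)       # dE/dnet for E = 0.5*(T - O)^2
            w = [wi + eta * delta * xi for wi, xi in zip(w, x)]

    for x, target in patterns:
        print(x[:2], target, round(s(sum(wi * xi for wi, xi in zip(w, x))), 2))

If the hardware sigmoid deviates from tanh, the same update applied with the tanh derivative no longer follows the true gradient, which is the failure mode described above.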

4.1.2 Neuron Perturbation

Because of the problems associated with analog VLSI backpropagation implementations, efforts have been made to find suitable learning algorithms which are independent of the model of the neuron[44]. Most of the algorithms developed for analog VLSI implementation utilize stochastic techniques to approximate the gradients rather than to directly compute them. In a model-free learning algorithm, the functional form of the neuron, as well as the functional form of its derivative, does not enter into the weight update rule (in the above example the neuron was modeled as a tanh function). One such implementation[45], called the Madaline Rule III (MR III), approximates a gradient by applying a perturbation to the input of each neuron and then measuring the difference in error, ∆E = E(with perturbation) − E(without perturbation). This error is then used for the weight update rule:

\Delta w_{ij} = -\eta \frac{\Delta E}{\Delta net_i} x_j

where net_i is the weighted, summed input to the neuron i that is perturbed, and x_j is either a network input or a neuron output from a previous layer which feeds into neuron i. This weight update rule requires access to every neuron input and every neuron output. Furthermore, multiplication/division circuitry would be required to implement the learning rule. However, no explicit derivative hardware is required, and no assumptions are made about the neuron model. The method showed good robustness against noise, but studies of the effects of inaccurate derivative approximations were not done. Another variant of this method, called the virtual targets method and which is not model-free, required the use of the neuron function derivatives in the weight update rule[47]. For small enough perturbations, a reasonable approximation to the gradient is achieved. Smaller perturbations provide better approximations to the gradient at the expense of very small steps in the gradient direction, which then requires many extra iterations to find the minimum. Thus, a natural tradeoff between speed and accuracy exists.

4.1.3 Serial Weight Perturbation

Another implementation which is more optimal for analog implementation involves serially perturbing the weights rather than perturbing the neurons[23]. The weight updates are based on a finite difference approximation to the gradient of the mean squared error with respect to a weight, using a perturbation pert_{ij}. The weight update rule is given as follows:

\Delta w_{ij} = \frac{E(w_{ij} + pert_{ij}) - E(w_{ij})}{pert_{ij}}

This is the forward difference approximation to the gradient.
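As a software illustration of the serial, forward-difference update just described, the sketch below trains a tiny 2-2-1 tanh network on XOR, perturbing one weight at a time and scaling the measured error change by a small constant. The network size, seeds, learning rate, and perturbation size are all illustrative assumptions; the real chip-in-the-loop setup described later measures the error on hardware rather than in software.

    import math, random

    random.seed(1)
    eta, pert = 0.2, 0.01

    # Tiny 2-2-1 tanh network; each row holds two input weights plus a bias
    w_hid = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
    w_out = [random.uniform(-0.5, 0.5) for _ in range(3)]
    patterns = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]   # XOR

    def forward(x):
        h = [math.tanh(wr[0]*x[0] + wr[1]*x[1] + wr[2]) for wr in w_hid]
        return math.tanh(w_out[0]*h[0] + w_out[1]*h[1] + w_out[2])

    def error():
        return 0.5 * sum((t - forward(x)) ** 2 for x, t in patterns)

    # Serial weight perturbation: perturb one weight, measure the error change,
    # restore the weight, then apply the scaled forward-difference update.
    slots = [(r, c) for r in range(2) for c in range(3)] + [(None, c) for c in range(3)]
    for it in range(2000):
        for r, c in slots:
            vec = w_out if r is None else w_hid[r]
            e0 = error()
            vec[c] += pert
            de = error() - e0
            vec[c] -= pert
            vec[c] += -eta * de / pert

    print(round(error(), 4), [round(forward(x), 2) for x, _ in patterns])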

The gradient approximation can also be improved with the central difference approximation:

\Delta w_{ij} = \frac{E(w_{ij} + pert_{ij}/2) - E(w_{ij} - pert_{ij}/2)}{pert_{ij}}

However, this requires more feedforward passes of the network to obtain the error terms. A usual implementation would use a weight update rule of the following form:

\Delta w_{ij} = -\frac{\eta}{pert}\left[E(w_{ij} + pert_{ij}) - E(w_{ij})\right]

Since pert and η are both small constants, the above weight update rule merely involves changing the weights by the error difference scaled by a small constant. Thus, the multiplication/division step is replaced by scaling with a constant. It is clear that this implementation should require simpler circuitry than the neuron perturbation algorithm. Also, there is no need to provide access to the computations of internal nodes. Studies have also shown that this algorithm performs better than neuron perturbation in networks with limited precision weights[46]. However, its computational complexity was shown to be of higher order than that of the neuron perturbation technique.

4.1.4 Summed Weight Neuron Perturbation

The summed weight neuron perturbation approach[24] combines elements from both the neuron perturbation technique and the serial weight perturbation technique. Whereas the serial weight perturbation technique applies perturbations to the weights one at a time, in summed weight neuron perturbation, weight perturbations are applied in parallel for each particular neuron. Applying a weight perturbation to each of the weights connected to a particular neuron is equivalent to perturbing the input of that neuron, as is done in neuron perturbation. The error is then calculated and weight updates are made using the same weight update rule as for serial weight perturbation. This allows for a reduction in the number of feedforward passes, yet still does not require access to the outputs of internal nodes of the network.

However, this speedup is at the cost of increased hardware to store the weight perturbations for each weight/neuron pair. Furthermore, the weight perturbation storage can be achieved with 1 bit per weight.

4.1.5 Parallel Weight Perturbation

The parallel weight perturbation techniques[21][25][48] extend the above methods by simultaneously applying perturbations to all weights. Normally, the weight perturbations are kept at a fixed small size, but the sign of the perturbation is random. The perturbed error is then measured and compared to the unperturbed error. The same weight update rule is then used as in the serial weight perturbation method. Thus, all of the weights are updated in parallel. If there are W total weights in a network, a serial weight perturbation approach gives a factor W reduction in speed over a straight gradient descent approach because the overall gradient is calculated based on W local approximations to the gradient. With parallel weight perturbation, there is, on average, a factor W^{1/2} reduction in speed over direct gradient descent[25]. Thus, a speedup of W^{1/2} is achieved over serial weight perturbation. This increase in speed may not always be seen in practice because it depends upon certain conditions being met, such as the perturbation size being sufficiently small. This can be better understood if the parallel perturbative method is thought of in analogy to Brownian motion, where there is considerable movement of the network through weight space, but with a general guiding force in the proper direction. It is also possible to improve the convergence properties of the network by averaging over multiple perturbations for each input pattern presentation[21]. This is feasible only if the speed of averaging over multiple perturbations is greater than the speed of applying input patterns. This usually holds true because the input patterns must be obtained externally, whereas the multiple perturbations are applied locally. Furthermore, convergence rates can be improved by using an annealing schedule where large perturbation sizes are used initially and then reduced. In this way, large weight changes are discovered first and gradually decreased to get finer and finer resolution in approaching the minimum.
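The sketch below captures the parallel perturbative update with random-sign perturbations and a crude annealing schedule. To keep it self-contained, the "error" is a simple quadratic bowl standing in for a network error surface; the targets, perturbation size, learning rate, and schedule are all illustrative assumptions.

    import random

    random.seed(2)

    def parallel_perturb_step(weights, error_fn, pert_size, eta):
        # Every weight gets a fixed-size perturbation of random sign; the single
        # measured error change drives all weight updates at once.
        signs = [random.choice((-1.0, 1.0)) for _ in weights]
        e0 = error_fn(weights)
        perturbed = [w + s * pert_size for w, s in zip(weights, signs)]
        de = error_fn(perturbed) - e0
        return [w - eta * de * s / pert_size for w, s in zip(weights, signs)]

    # Stand-in error surface (a quadratic bowl), not an actual network error.
    targets = (0.3, -0.8, 0.5)
    def error(w):
        return sum((wi - ti) ** 2 for wi, ti in zip(w, targets))

    w, pert, eta = [0.0, 0.0, 0.0], 0.2, 0.05
    for step in range(300):
        w = parallel_perturb_step(w, error, pert, eta)
        if step % 100 == 99:
            pert *= 0.5                  # crude annealing of the perturbation size
    print([round(x, 2) for x in w], round(error(w), 4))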

4.1.6 Chain Rule Perturbation

The chain rule perturbation method is a semi-parallel method that mixes ideas from both serial weight perturbation and neuron perturbation[26]. In a 1 hidden layer network, the output layer weights are first updated serially using the weight update formula from the serial weight perturbation method. Then, a hidden layer neuron output, n_i, is perturbed by an amount npert_i. The effect on the error, ∆E, is measured and a partial gradient term is stored. This neuron output perturbation is repeated sequentially and a partial error gradient term stored for every neuron in the hidden layer. Next, each of the weights, w_ij, feeding into the hidden layer neurons is perturbed in parallel by an amount wpert_ij for each weight/neuron pair. The resultant change in hidden layer neuron outputs, ∆n_i, from this parallel weight perturbation is measured and then used to calculate and store another partial gradient term. Finally, the hidden layer weights are updated in parallel according to the following rule:

\Delta w_{ij} = -\eta \frac{\Delta E}{\Delta n_i}\frac{\Delta n_i}{wpert_{ij}} = -\eta \frac{\Delta E}{npert_i}\frac{\Delta n_i}{wpert_{ij}}

Hence, the name of the method comes from the analogy of the weight update rule to the chain rule in derivative calculus. For more hidden layers, the same hidden layer weight update procedure is repeated. This method allows a speedup over serial weight perturbation by a factor approaching the number of hidden layer neurons, at the expense of the extra hardware necessary to store the partial gradient terms. For example, in a network with i inputs, j hidden nodes, and k outputs, the chain rule perturbation method requires k·j + j + i perturbations, whereas serial perturbation requires k·j + j·i perturbations.

4.1.7 Feedforward Method

The feedforward method is a variant of serial weight perturbation that incorporates a direct search technique[49]. First, in the forward phase, the direction of search for a weight, w_ij, is found by applying a fraction of the perturbation. For example, if 0.1·pert is applied and the error goes down, then the direction of search is positive; otherwise it is negative. Then, perturbations of size pert continue to be applied in the direction of search until the error does not decrease or some predetermined maximum number of forward perturbations is reached. Next, in the backward phase, a step of size pert/2 in the reverse direction is taken. Successive backwards steps are taken, with each step decreasing by another factor of 2, until the error again goes up or another predetermined maximum number of backward perturbations is reached. These steps are repeated for every weight. This method is effectively a sophisticated annealing schedule for the serial weight perturbation method. Many other variations and combinations of these stochastic learning rules are possible. Most of the stochastic algorithms and implementations concentrate on iterative approaches to weight updates as discussed above, but continuous-time implementations are also possible[50].

4.2 VLSI Neural Network Limitations

4.2.1 Number of Learnable Functions with Limited Precision Weights

Clearly, in a neural network with a small, fixed number of bits per weight, the ability of the network to classify multiple functions is limited. Assume that each weight is comprised of b bits of information and that there are W total weights. The total number of bits in the network is given as B = bW. Since each bit can take on only 1 of 2 values, the total number of different functions that the network can implement is no more than 2^B. In computer simulations of neural networks where floating point precision is used and b is on the order of 32 or 64, this is usually not a limitation to learning.

However, 2^B is merely an upper bound. There are many different settings of the weights that will lead to the same function. For example, the hidden neuron outputs can be permuted, since the final output of the network does not depend upon the order of the hidden layer neurons. Also, the signs of the outputs of the hidden layer neurons can be flipped by adjusting the weights coming into that neuron and by also flipping the sign of the weight on the output of the neuron to ensure the same output. The total number of boolean functions which can be implemented is therefore given approximately by 2^B / (N! 2^N) for a fully connected network with N units in each layer[51]. However, if b is a small number like 5, then the ability of the network to learn a specific function may be compromised.

4.2.2 Trainability with Limited Precision Weights

Several studies have been done on the effect of the number of bits per weight constraining the ability of a neural network to properly train[52][53][54]. Most of these are specific to a particular learning algorithm. In one study, the effects of the number of bits of precision on backpropagation were investigated[52]. The weight precision for a feedforward pass of the network, the weight precision for the backpropagation weight update calculation, and the output neuron quantization were all tested for a range of bits, and the effect of the number of bits on the ability of the network to properly converge was determined. The empirical results showed that a weight precision of 6 bits was sufficient for feedforward operation. Also, neuron output quantization of 6 bits was sufficient. This implies that standard analog VLSI neurons should be sufficient for a VLSI neural network implementation. For the backpropagation weight update calculation, however, at least 12 bits were required to obtain reasonable convergence. In one result, this shows that a perturbative algorithm that relies only on feedforward operation can be effective in training with as few as half the number of bits required for a backpropagation implementation. Due to the increased hardware complexity necessary to achieve the extra number of bits, the perturbative approaches will be desirable in analog VLSI implementations.
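The counting argument of section 4.2.1 is easy to evaluate numerically. The snippet below computes both the raw 2^B bound and the symmetry-corrected estimate 2^B / (N! 2^N); the example network size and bit width are arbitrary and chosen only for illustration.

    import math

    def max_functions(bits_per_weight, num_weights, units_per_layer):
        """Upper bound 2^B and the symmetry-corrected estimate 2^B / (N! * 2^N)."""
        B = bits_per_weight * num_weights
        upper = 2 ** B
        corrected = upper // (math.factorial(units_per_layer) * 2 ** units_per_layer)
        return upper, corrected

    print(max_functions(bits_per_weight=5, num_weights=9, units_per_layer=2))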

4.2.3 Analog Weight Storage

Due to the inherent size requirements to store and update digital weights with A/D and D/A converters, some analog VLSI implementations concentrate on reducing the space necessary for weight storage by using analog weights. Usually, this involves storing weights on a capacitor. Since the capacitors slowly discharge, it is necessary to implement some sort of scheme to continuously refresh the weight. One such scheme, which showed long term 8 bit precision, involves quantization and refresh of the weights by use of A/D/A converters which can be shared by multiple weights[60][61]. Other schemes involve periodic refresh by use of quantized levels defined by input ramp functions or by periodic repetition of input training patterns[59].

Since significant space and circuit complexity overhead exists for capacitive storage of weights, attempts have also been made at direct implementation of nonvolatile analog memory. In one such implementation, a dense array of synapses is achieved with only two transistors required per synapse[64]. The weights are stored permanently on a floating gate transistor which can be updated under UV illumination. A second drive transistor is used as a switch to sample the input drive voltage onto a drive electrode in order to adjust the floating gate voltage. The weight multiplication with an input voltage is achieved by biasing the floating gate transistor into the triode region, which produces an output current which is roughly proportionate to the product of its floating gate voltage, representing the weight, and its drain to source voltage, representing the input. The use of UV illumination also allowed a simple weight decay term to be introduced into the learning rule if necessary. Another approach, which also implements a dense array of synapses based on floating gates, utilized tunneling and injection of electrons onto the floating gate rather than UV illumination[28][29]. A learning rule was developed which balances the effects of the tunneling and injection. Using this technique, only one floating gate transistor per synapse was required. In another approach, a 3 transistor analog memory cell was shown to have up to 14 bits of resolution[65].

Some of the weight update speeds obtainable using the tunneling and injection techniques may be limited because rapid writing of the floating gates requires large oxide currents which may damage the oxide. Nevertheless, these techniques can be combined with dynamic refresh techniques where weights and weight updates are quickly stored and learned on capacitors, but eventually written into nonvolatile analog memories[66].

4.3 VLSI Neural Network Implementations

Many neural network implementations in VLSI can be found in the literature. This includes purely digital approaches, purely analog approaches, and hybrid analog-digital approaches. Although the digital implementations are more straightforward to build using standard digital design techniques, they often scale poorly with large networks that require many bits of precision to implement the backpropagation algorithms[57][58]. However, some size reductions can be realized by reducing the bit precision and using the stochastic learning techniques available[55][56]. Other techniques use time multiplexed multipliers to reduce the number of multipliers necessary.

An early attempt at implementation of an analog neural network involved making separate VLSI chip modules for the neurons, synapses, and routing switches[62]. The modules could all be interconnected and controlled by a computer. For example, the circuits for the neuron blocks consisted of three multistage operational amplifiers and three 100k resistors. The weights were digital and were controlled by a serial weight bus from the computer. The learning algorithms were based on output information from the neurons and no circuitry for local weight updates was provided. Since the components were modular, the system could be scaled to accommodate larger networks. However, each of the individual circuit blocks was quite large and hence was not well suited for dense integration on a single neural network chip.

4.3.1 Contrastive Neural Network Implementations

Some of the earliest VLSI implementations of neural networks were not based on the stochastic weight update algorithms, but on contrastive weight updates[67]. The basic idea behind a contrastive technique is to run the network in two phases. In the first phase, a teacher signal is sent, and the outputs of the network are clamped. In the second phase, the network outputs are unclamped and free to change. The weight updates are based on a contrast function that measures the difference between the clamped, teacher phase, and the unclamped, free phase.

One of the first fully integrated VLSI neural networks used one of these contrastive techniques and showed rapid training on the XOR function[33]. The learning rule was based on a Boltzmann machine algorithm, which uses bidirectional, asymmetric weights[63], feedback connections, and stochastic elements. Thus, the weight update rule is not based on the previously mentioned stochastic weight update rules, but is an example of contrastive learning that requires randomness in order to learn. The chip incorporated a local learning rule. Briefly, the algorithm works by checking correlations between the input and output neuron of a synapse. Correlations are checked between a teacher phase, where the output neurons are clamped to the correct output, and a student phase, where the outputs are unclamped. On each pattern presentation, the network weights are perturbed with noise and the correlations counted. If the teacher and student phases share the same correlation, then no change is made to the weight. If the neurons are correlated in the teacher phase, but not the student phase, then the weight is incremented. If it is reversed, then the weight is decremented. The weights were stored as 5 bit digital words. Analog noise sources were used to produce the random perturbations. Unfortunately, the gains required to amplify thermal noise caused the system to be unstable, and highly correlated oscillations were present between all of the noise sources. Although the system actually needs noise in order to learn, it was necessary to intermittently apply the noise sources in order to facilitate learning. The network showed good success with a 2:2:2:1 network learning XOR. The learning rate was shown to be over 100,000 times faster than that achieved with simulations on a digital computer.

Another implementation showed a contrastive version of backpropagation[43]. The basic idea was to also use both a clamped phase and a free phase. The weight updates are calculated in both phases using the standard backpropagation weight update rules, and the actual weight update was taken to be the difference between the clamped weight change and the free weight change. The use of the contrastive technique allowed a form of backpropagation despite circuit offsets. However, this implementation was not model free, and a derivative generator was also necessary in order to compute the weight updates. Due to the size of the circuitry, only one layer of the network was integrated on a single chip. The multilayer network was obtained by cascading several chips together. The weights were stored digitally, and a serial weight bus was used to set the weights and weight updates. The chip was trained in loop with a computer applying the input patterns, controlling the serial weight bus, and reading the output values for the error computation. Successful learning of the XOR function was shown.

4.3.2 Perturbative Neural Network Implementations

A low power chip implementing the summed weight neuron perturbation algorithm has been demonstrated[22]. The chip was designed for implantable applications, which necessitated the need for extremely low power and small area. To reduce power, subthreshold synapses and neurons were used. The synapses consisted of transconductance amplifiers with the weights encoded as a digital word which set the bias current. Neurons consisted of current to voltage converters using a diode connected transistor. They also included a few extra transistors to deal with common mode cancelation. A 10:6:3 network was trained to distinguish two cardiac arrhythmia patterns. The chips showed a success rate with over 95% of patient data. With a 3V supply, only 200 nW was consumed. This is a good example of a situation where an analog VLSI neural network is very suitable as compared to a digital solution.

Another perturbative implementation was used for training a recurrent neural network on a continuous time trajectory[39]. A parallel perturbative method was used with analog volatile weights and local weight refresh circuitry. Some of the weight refresh circuitry was also implemented off chip. The weight updates were computed locally, while the global functions of error accumulation and comparison were performed off chip. Two counterpropagating linear feedback shift registers, which were combined with a network of XORs, were used to digitally generate the random perturbations, thus not requiring any specialized pseudorandom number generation. The size of the perturbations was externally settable. The network was successfully trained to generate two quadrature phase sinusoidal outputs. However, the error function during learning showed some very peculiar behavior. Rather than gradually decreasing to a small value, the error actually increased steadily, then entered large oscillations between the high error and near zero error. After the oscillations, the error was close to zero and stayed there. The authors attribute this strange and erratic behavior to the fact that a large learning rate and a large perturbation size were used. Also, all perturbations seemed to be of the same sign.

The chain rule perturbation algorithm has also been successfully implemented on chip[66]. The analog weights were stored dynamically, and provisions were made on the chip for nonvolatile analog storage using tunneling and injection, but its use was not demonstrated. The weight updates were computed locally. Many of the global functions, such as error computation, were also performed on the chip, so only a few control signals from a computer were necessary for its operation. The network was shown to successfully train on XOR and other functions.

Chapter 5
VLSI Neural Network with Analog Multipliers and a Serial Digital Weight Bus

Training an analog neural network directly on a VLSI chip provides additional benefits over using a computer for the initial training and then downloading the weights. The analog hardware is prone to have offsets and device mismatches. By training with the chip in the loop, the neural network will also learn these offsets and adjust the weights appropriately to account for them. A VLSI neural network can be applied in many situations requiring fast, low power operation, such as handwriting recognition for PDAs or pattern detection for implantable medical devices[22].

There are several issues that must be addressed to implement an analog VLSI neural network chip. First, an appropriate algorithm suitable for VLSI implementation must be found. Traditional error backpropagation approaches for neural network training require too many bits of floating point precision to implement efficiently in an analog VLSI chip. Techniques that are more suitable involve stochastic weight perturbation[21],[23],[24],[25],[26],[27], where a weight is perturbed in a random direction, its effect on the error is determined, and the perturbation is kept if the error was reduced; otherwise, the old weight is restored. In this approach, the network observes the gradient rather than actually computing it. Serial weight perturbation[23] involves perturbing each weight sequentially. This requires a number of iterations that is directly proportional to the number of weights. A significant speed-up can be obtained if all weights are perturbed randomly in parallel, measuring the effect on the error and keeping them all if the error reduces. Both the parallel and serial methods can potentially benefit from annealing the perturbation. Initially, large perturbations are applied to move the weights quickly towards a minimum. Then, the perturbation sizes are occasionally decreased to achieve finer selection of the weights and a smaller error.

In general, optimized gradient descent techniques converge more rapidly than the perturbative techniques.

Next, the issue of how to appropriately store the weights on-chip in a non-volatile manner must be addressed. One method would be to use an analog memory cell[28],[29]. This would allow directly storing the analog voltage value. However, this technique requires using large voltages to obtain tunneling and/or injection through the gate oxide and is still being investigated. If the weights are simply stored as charge on a capacitor, they will ultimately decay due to parasitic conductance paths. Another approach would be to use traditional digital storage with EEPROMs. This would then require having A/D/A (one A/D and one D/A) converters for the weights. A single A/D/A converter would only allow a serial weight perturbation scheme that would be slow. A parallel scheme, which would perturb all weights at once, would require one A/D/A per weight. This would be faster, but would require more area. One alternative would remove the A/D requirement by replacing it with a digital counter to adjust the weight values. This would then require one digital counter and one D/A per weight.

5.1 Synapse

A small synapse with one D/A per weight can be achieved by first making a binary weighted current source (figure 5.1) and then feeding the binary weighted currents into diode connected transistors to encode them as voltages. These voltages are then fed to transistors on the synapse to convert them back to currents. Thus, many D/A converters are achieved with only one binary weighted array of transistors. It is clear that the linearity of the D/A will be poor because of matching errors between the current source array and synapses which may be located on opposite sides of the chip. This is not a concern because the network will be able to learn around these offsets. The synapse[26],[22] is shown in figure 5.2.

Figure 5.1: Binary weighted current source circuit

Figure 5.2: Synapse circuit

The synapse performs the weighting of the inputs by multiplying the input voltages by a weight stored in a digital word denoted by b0 through b5. The sign bit, b5, changes the direction of current to achieve the appropriate sign. In the subthreshold region of operation, the transistor equation is given by[30]

I_d = I_{d0} e^{\kappa V_{gs}/U_t}

and the output of the synapse is given by[30],[22]

\Delta I_{out} = I_{out+} - I_{out-} = \begin{cases} +I_0 W \tanh\left(\frac{\kappa(V_{in+} - V_{in-})}{2U_t}\right) & b5 = 1 \\ -I_0 W \tanh\left(\frac{\kappa(V_{in+} - V_{in-})}{2U_t}\right) & b5 = 0 \end{cases}

where W is the weight of the synapse encoded by the digital word and I_0 is the least significant bit (LSB) current. In the subthreshold linear region, the output is approximately given by

\Delta I_{out} \approx g_m \Delta V_{in} = \frac{\kappa I_0 W}{2U_t} \Delta V_{in}

In the above threshold regime, the transistor equation in saturation is approximately given by

I_D \approx K(V_{gs} - V_t)^2

The synapse output is no longer described by a simple tanh function, but is nevertheless still sigmoidal with a wider "linear" range. In the above threshold linear region, the output is approximately given by

\Delta I_{out} \approx g_m \Delta V_{in} = \sqrt{2 K I_0 W}\, \Delta V_{in}
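For reference, the subthreshold expression above is simple enough to evaluate directly in software. The sketch below does only that; it does not model the above-threshold regime, so for large W·I_0 products (where the real synapse departs from the tanh characteristic) it is not accurate. The values of κ and U_t are assumed, and I_0 is taken as 100 nA only because that is the LSB current used for the measurements below.

    import math

    kappa, U_t = 0.7, 0.0259      # illustrative subthreshold constants (assumed)
    I_0 = 100e-9                  # LSB current, 100 nA

    def synapse_output(dV_in, weight_word):
        """Subthreshold model of the synapse: 6-bit sign-magnitude weight word,
        b5 is the sign bit, b4..b0 the magnitude W (0..31)."""
        sign = 1.0 if (weight_word >> 5) & 1 else -1.0
        W = weight_word & 0x1F
        return sign * I_0 * W * math.tanh(kappa * dV_in / (2.0 * U_t))

    for word in (0b100001, 0b100100, 0b111111, 0b011111):   # +1, +4, +31, -31
        print(bin(word), synapse_output(0.05, word))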

Figure 5.3: Synapse differential output current as a function of differential input voltage for various digital weight settings

Figure 5.3 shows the differential output current of the synapse as a function of differential input voltage for various digital weight settings. The input current of the binary weighted current source was set to 100 nA. It is clear that above threshold, the synapse is not doing a pure weighting of the input voltage. However, since the weights are learned on chip, they will be adjusted accordingly to the necessary value. Furthermore, depending on the choice of LSB current, it is possible that some synapses will operate below threshold while others operate above. Again, on-chip learning will be able to set the weights to account for these different modes of operation. The output currents range from the subthreshold region for the smaller weights to the above threshold region for the larger weights.

All of the curves show their sigmoidal characteristics, and it is clear the width of the linear region increases as the current moves from subthreshold to above threshold. For the smaller weights, W = 4, the linear region spans only approximately 0.2V. For the largest weight, W = 31, the linear range has expanded to roughly 0.6V. As was discussed above, when the current range moves above threshold, the synapse does not perform a pure linear weighting. The largest synapse output current is not 3.1µA, as would be expected from a linear weighting of 31 × 100nA, but a smaller number. Notice also that the zero crossing of ∆Iout occurs slightly positive of ∆Vin = 0. This is a circuit offset that is primarily due to slight W/L differences of the differential input pair of the synapse, and it is caused by minor fabrication variations.

5.2 Neuron

The synapse circuit outputs a differential current that will be summed in the neuron circuit shown in figure 5.4. The neuron circuit performs the summation from all of the input synapses and then converts the currents back into a differential voltage feeding into the next layer of synapses. Since the outputs of the synapses will all have a common mode component, it is important for the neuron to have common mode cancelation[22]. Since one side of the differential current inputs may have a larger share of the common mode current, it is important to distribute this common mode to keep both differential currents within a reasonable operating range.

Figure 5.4: Neuron circuit

The input currents can be decomposed into differential and common mode components as follows:

I_{in+} = I_{in+d} + \frac{I_{in+d} + I_{in-d}}{2} = I_{in+d} + I_{cmd}

I_{in-} = I_{in-d} + \frac{I_{in+d} + I_{in-d}}{2} = I_{in-d} + I_{cmd}

\Delta I = I_{in+} - I_{in-} = I_{in+d} - I_{in-d}

I_{cm} = \frac{I_{in+} + I_{in-}}{2} = \frac{I_{in+d} + I_{in-d} + 2I_{cmd}}{2} = 2I_{cmd}

\Rightarrow I_{in+d} = I_{in+} - \frac{I_{cm}}{2} = \frac{\Delta I}{2} + \frac{I_{cm}}{2}

\Rightarrow I_{in-d} = I_{in-} - \frac{I_{cm}}{2} = -\frac{\Delta I}{2} + \frac{I_{cm}}{2}

If ∆I is of equal size or larger than Icm, the transistor with Iin-d may begin to cut off and the above equations would not exactly hold; however, the current cutoff is graceful and should not normally affect performance. With the common mode signal properly equalized, the differential currents are then mirrored into the current to voltage transformation stage. This stage effectively takes the differential input currents and uses a transistor in the triode region to provide a differential output. This stage will usually be operating above threshold, because the Voffset and Vcm controls are used to ensure that the output voltages are approximately mid-rail. Having the outputs mid-rail is important for proper biasing of the next stage of synapses. This is done by simply adding additional current to the diode connected transistor stack. The above threshold transistor equation in the triode region is given by

I_d = 2K\left(V_{gs} - V_t - \frac{V_{ds}}{2}\right)V_{ds} \approx 2K(V_{gs} - V_t)V_{ds}

for small enough V_ds. If K_1 denotes the prefactor with W/L of the cascode transistor and K_2 denotes the same for the transistor with gate V_out, the voltage output of the neuron will then be given by

I_{in} = K_1(V_{gain} - V_t - V_1)^2 = 2K_2(V_{out} - V_t)V_1

V_1 = V_{gain} - V_t - \sqrt{\frac{I_{in}}{K_1}}

With the common mode signal properly equalized, the differential currents are then mirrored into the current to voltage transformation stage. This stage effectively takes the differential input currents and uses a transistor in the triode region to provide a differential output. This stage will usually be operating above threshold, because the Voffset and Vcm controls are used to ensure that the output voltages are approximately mid-rail. This is done by simply adding additional current to the diode connected transistor stack. Having the outputs mid-rail is important for proper biasing of the next stage of synapses. The above threshold transistor equation in the triode region is given by

I_d = 2K\left(V_{gs} - V_t - \frac{V_{ds}}{2}\right)V_{ds} \approx 2K(V_{gs} - V_t)V_{ds} \quad \text{for small enough } V_{ds}.

If K1 denotes the prefactor with W/L of the cascode transistor and K2 denotes the same for the transistor with gate Vout, the voltage output of the neuron will then be given by

I_{in} = K_1(V_{gain} - V_t - V_1)^2 \approx 2K_2(V_{out} - V_t)V_1

V_1 = V_{gain} - V_t - \sqrt{\frac{I_{in}}{K_1}}

I_{in} = 2K_2(V_{out} - V_t)\left(V_{gain} - V_t - \sqrt{\frac{I_{in}}{K_1}}\right)

V_{out} = \frac{I_{in}}{2K_2(V_{gain} - V_t) - 2\sqrt{\frac{K_2^2}{K_1}I_{in}}} + V_t

If \frac{W_1}{L_1} = \frac{W_2}{L_2}, then K_1 = K_2 = K and

V_{out} = \frac{I_{in}}{2K(V_{gain} - V_t) - 2\sqrt{K I_{in}}} + V_t

\Rightarrow R \approx \frac{1}{2K(V_{gain} - V_t)} \quad \text{for small } I_{in}.

Thus, it is clear that Vgain can be used to adjust the effective gain of the stage. Figure 5.5 shows how the neuron differential output voltage, ∆Vout, varies as a function of differential input current for several values of Vgain. The neuron shows fairly linear performance with a sharp bend on either side of the linear region. This sharp bend occurs when one of the two linearized, diode connected transistors with gate attached to Vout turns off. Figure 5.6 displays only the positive output, Vout+, of the neuron, as a function of differential input current for several values of Voffset. The baseline output voltage corresponds roughly to Voffset. However, as Voffset increases in value, its ability to increase the baseline voltage is reduced because of the cascode transistor on its drain. At some point, especially for small values of Vgain, the Vcm transistor becomes necessary to provide additional offset. The diode connected transistor with gate attached to Vout+ turns off where the output goes flat on the left side of the curves; this corresponds to the left bend point in figure 5.5. Overall, the neuron shows very good linear current to voltage conversion with separate gain and offset controls.
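To see how Vgain sets the conversion gain, the closed-form expression above can be evaluated directly. The sketch below (assumed Python; K and Vt are assumed illustrative device parameters, not values from the chip) prints Vout for a few input currents at two Vgain settings and the small-signal resistance 1/(2K(Vgain − Vt)).

```python
import math

# Evaluate the neuron current-to-voltage expression derived above.
# K and Vt are assumed, illustrative device parameters.

K, Vt = 50e-6, 0.8          # A/V^2 prefactor and threshold voltage (assumed)

def vout(i_in, v_gain):
    return i_in / (2 * K * (v_gain - Vt) - 2 * math.sqrt(K * i_in)) + Vt

for v_gain in (1.6, 2.0):
    r_small = 1 / (2 * K * (v_gain - Vt))   # small-current effective resistance
    print(v_gain, r_small,
          [round(vout(i, v_gain), 3) for i in (0.2e-6, 1e-6, 2e-6)])
```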

Figure 5.5: Neuron differential output voltage, ∆Vout, as a function of ∆Iin, for various values of Vgain

Figure 5.6: Neuron positive output, Vout+, as a function of ∆Iin, for various values of Voffset

5.3 Serial Weight Bus

A serial weight bus is used to apply weights to the synapses. The weight bus merely consists of a long shift register to cover all of the possible weight and threshold bits. The input to the serial weight bus comes from a host computer which implements the learning algorithm. Figure 5.7 shows the circuit schematic of the shift register used to implement the weight bus. The shift register storage cell is accomplished by using two staticizer or jamb latches[34] in series. Each latch consists of two interlocked inverters. The weak inverter, labeled with a W in the figure, is made sufficiently weak such that a signal coming from the main inverter and passing through the pass gate is sufficiently strong to overpower the weak inverter and set the state. Once the pass gate is turned off, the weak inverter keeps the current bit state.

The shift register is controlled by a dual phase nonoverlapping clock. During the read phase of operation, when φ1 is high, the data from the previous register is read into the first latch. Then, during the write phase of operation, when φ2 is high, the bit is written to the second latch for storage. The last output bit of the shift register, Q5, feeds into the data input, D, of the next shift register to form a long compound shift register. Figure 5.8 shows the nonoverlapping clock generator[35]. A single CLK input comes from an off chip clock. The clock generator ensures that the two clock phases do not overlap in order that no race conditions exist between reading and writing. The delay inverters set the appropriate delay such that after one clock transitions from high to low, the next one does not transition from low to high until after the small delay. The delay inverters are made weak by making the transistors long. Since the clock lines go throughout the chip and drive long shift registers, it is necessary to make the buffer transistors with large W/L ratios to drive the large capacitances on the output.
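A purely behavioral picture of the weight bus can be useful when writing host-side training software. The sketch below (assumed Python, with an assumed MSB-first bit ordering; it abstracts each φ1/φ2 clock cycle into a single shift and is not a circuit model) shows how a stream of bits from the host ends up distributed across the chained 6-bit registers.

```python
# Behavioral sketch of the compound serial weight bus: a chain of 6-bit
# shift registers loaded from a single data line D, one bit per full
# phi1/phi2 clock cycle.  Bit ordering is an assumption for illustration.

class SerialWeightBus:
    def __init__(self, n_registers, bits_per_reg=6):
        self.bits = [0] * (n_registers * bits_per_reg)

    def clock(self, d):
        """One nonoverlapping clock cycle: shift one bit in at D."""
        self.bits = [d] + self.bits[:-1]

    def load_weights(self, weights, bits_per_reg=6):
        # Stream all weight words, the word for the last register first.
        for w in reversed(weights):
            for b in range(bits_per_reg - 1, -1, -1):
                self.clock((w >> b) & 1)

bus = SerialWeightBus(n_registers=9)
bus.load_weights([17, 5, 0, 63, 1, 2, 3, 4, 5])
print(bus.bits[:6])   # the 6 bits now sitting in the first register
```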

Figure 5.7: Serial weight bus shift register

Figure 5.8: Nonoverlapping clock generator

5.4 Feedforward Network

Using the synapse and neuron circuit building blocks it is possible to construct a multilayer feedforward neural network. Figure 5.9 shows a 2 input, 2 hidden unit, 1 output feedforward network for training of XOR. The biases, or thresholds, for each neuron are simply implemented as synapses tied to fixed bias voltages. The biases are learned in the same way as the weights. Note that the nonlinear squashing function is actually performed in the next layer of synapse circuits rather than in the neuron as in a traditional neural network. Thus, depending on the type of network outputs desired, additional circuitry may be needed for the final squashing function. For example, if a roughly linear output is desired, the differential output can be taken directly from the neuron outputs.

Furthermore, in figure 5.9, a differential to single ended converter is shown on the output. Figure 5.10 shows a schematic of a differential to single ended converter and a digital buffer. The differential to single ended converter is based on the same transconductance amplifier design used for the synapse. The Vb knob is used to set the bias current of the amplifier. This bias current then controls the gain of the converter. The gain is also controlled by sizing of the transistors. The gain of this converter determines the size of the linear region for the final output. Normally, during training, a somewhat linear output with low gain is desired to have a reasonable slope to learn the function on. For digital functions, after training, it is possible to take the output after a dual inverter digital buffer to get a strong standard digital signal to send off-chip or to other sections of the chip. Also, for non-differential digital signals, it is possible to simply tie the negative input to mid-rail and apply the standard digital signal to the positive synapse input. This is equivalent as long as the inputs to the first layer are kept within the linear range of the synapses. For digital functions, the inputs need not be constrained, as the synapses will pass roughly the same current regardless of whether the digital inputs are at the flat part of the synapse curve near the linear region or all the way at the end of the flat part of the curve.

Figure 5.9: 2 input, 2 hidden unit, 1 output neural network

Figure 5.10: Differential to single ended converter and digital buffer

5.5 Training Algorithm

The neural network is trained by using a parallel perturbative weight update rule[21]. Since the weight updates are calculated offline, other suitable algorithms may also be used. For example, as with gradient techniques, it is possible to apply an annealing schedule wherein large perturbations are initially applied and gradually reduced as the network settles. The perturbative technique requires generating random weight increments to adjust the weights during each iteration. These random perturbations are then applied to all of the weights in parallel. In batch mode, all input training patterns are applied and the error is accumulated. This error is then checked to see if it was higher or lower than the unperturbed iteration. If the error is lower, the perturbations are kept; otherwise they are discarded. This process repeats until a sufficiently low error is achieved. An outline of the algorithm is given in figure 5.11.

    Initialize Weights.
    Get Error.
    while (Error > Error Goal)
        Perturb Weights.
        Get New Error.
        if (New Error < Error)
            Weights = New Weights.
            Error = New Error.
        else
            Restore Old Weights.
        end
    end

Figure 5.11: Parallel perturbative algorithm
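The procedure of figure 5.11 maps directly onto the loop the host computer runs around the chip. The sketch below (assumed Python) is a software stand-in: chip_error() abstracts downloading the weights over the serial bus, applying all training patterns and reading back the accumulated error, and the weight range clamp and perturbation size are illustrative choices rather than values fixed by the hardware.

```python
import random

WMIN, WMAX = -31, 31                      # 6-bit signed weight range

def train(chip_error, n_weights, goal=0.1, max_iters=5000, step=1):
    """Parallel perturbative, chip-in-the-loop training (batch mode)."""
    weights = [random.randint(-3, 3) for _ in range(n_weights)]   # small random init
    error = chip_error(weights)           # apply all patterns, accumulate error
    it = 0
    while error > goal and it < max_iters:
        pert = [random.choice([-step, step]) for _ in range(n_weights)]
        trial = [min(WMAX, max(WMIN, w + p)) for w, p in zip(weights, pert)]
        new_error = chip_error(trial)
        if new_error < error:             # keep the perturbation
            weights, error = trial, new_error
        it += 1                           # otherwise discard and try again
    return weights, error

# Example with a purely software stand-in for the chip:
fake = lambda w: sum((wi - 10) ** 2 for wi in w) / 100
print(train(fake, 3, goal=0.05))
```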

5.6 Test Results

A chip implementing the above circuits was fabricated in a 1.2µm CMOS process. All synapse and neuron transistors were 3.6µm/3.6µm to keep the layout small. The unit size current source transistors were also 3.6µm/3.6µm. An LSB current of 100nA was chosen for the current source. The above neural network circuits were trained with some simple digital functions such as 2 input AND and 2 input XOR. Since the training was done on digital functions, a differential to single ended converter was placed on the output of the final neuron. This was simply a 5 transistor transconductance amplifier. The error voltages were calculated as a total sum voltage error over all input patterns at the output of the transconductance amplifier. Since Vdd was 5V, the output would only easily move to within about 0.5V from Vdd because the transconductance amplifier had low gain. Thus, when the error gets to around 2V, it means that all of the outputs are within about 0.5V from their respective rail and functionally correct. A double inverter buffer can be placed at the final output to obtain good digital signals. The weights are initialized as small random numbers. The weight values can go from -31 to +31 because of the 6 bit D/A converters used on the synapses. At the beginning of each of the training runs, the error voltage starts around or over 10V, indicating that at least 2 of the input patterns give an incorrect output.

The results of some training runs are shown in figures 5.12-5.14. As can be seen from the figures, the network weights slowly converge to a correct solution. As with gradient techniques, occasional training runs resulted in the network getting stuck in a local minimum and the error would not go all the way down. Figure 5.12 shows the results from a 2 input, 1 output network learning an AND function. This network has only 2 synapses and 1 bias for a total of 3 weights. The weights slowly diverge and the error monotonically decreases until the function is learned. Figure 5.13 shows the results of training a 2 input, 2 hidden unit, 1 output network with the XOR function.

An ideal neural network with weights appropriately chosen to implement the XOR function is shown in figure 5.14. The neurons are assumed to be hard thresholds. The inputs for the network and neuron outputs are chosen to be (-1, +1), as opposed to (0, 1). This choice is made because the actual synapses which implement the squashing function give maximum outputs of ±Iout.

Figure 5.12: Training of a 2:1 network with AND function

Figure 5.13: Training of a 2:2:1 network with XOR function starting with small random weights

Figure 5.14: Network with "ideal" weights chosen for implementing XOR

The function computed by each of the neurons is given by

\mathrm{Out} = \begin{cases} +1, & \text{if } \left(\sum_i W_i X_i\right) + t \ge 0 \\ -1, & \text{if } \left(\sum_i W_i X_i\right) + t < 0 \end{cases}

Table 5.1 shows the inputs, intermediate computations, and outputs of the idealized network. It is clear that the XOR function is properly implemented.

    X1   X2   A            B            C
    -1   -1   -20 ⇒ -1     -60 ⇒ -1     -10 ⇒ -1
    -1   +1   +20 ⇒ +1     -20 ⇒ -1     +10 ⇒ +1
    +1   -1   +20 ⇒ +1     -20 ⇒ -1     +10 ⇒ +1
    +1   +1   +60 ⇒ +1     +20 ⇒ +1     -30 ⇒ -1

Table 5.1: Computations for network with "ideal" weights chosen for implementing XOR

The weights were chosen to be within the range of possible weight values, -31 to +31. Also, weights of large absolute value within the possible range were chosen (20 as opposed to 1). This ideal network would perform perfectly well with smaller weights, both in theory and based on simulations. For example, all of the input weights could be set to 1 as opposed to 20, and the A and B thresholds would then be set to 1 and -1 respectively, without any change in the network function. However, because small weights would be within the linear region of the synapses, small weights such as +/-1 might tend to get flipped due to circuit offsets.

Figure 5.15 shows the 2:2:1 network trained with XOR, but with the initial weights chosen as the mathematically correct weights for the ideal synapses and neurons. A SPICE simulation was done on the actual circuit using these ideal weights, and the outputs were seen to be the correct XOR outputs. Although the ideal weights should start off with correct outputs, the offsets and mismatches of the circuit cause the outputs of the actual network to be incorrect. However, since the weights start near where they should be, the error goes down rapidly to the correct solution. This is an example of how a more complicated network could be trained on computer first to obtain good initial weights and then the training could be completed with the chip in the loop.
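The entries of table 5.1 can be checked mechanically. The short sketch below (assumed Python) applies the hard-threshold rule above to the "ideal" weights of figure 5.14 and reproduces the table's pre-threshold sums and the XOR output.

```python
# Verify the "ideal" XOR network of figure 5.14 with hard-threshold units.
# A and B each see X1, X2 with weight 20; t_A = +20, t_B = -20;
# C sees A with weight 10 and B with weight -20; t_C = -20.

sign = lambda s: 1 if s >= 0 else -1

for x1 in (-1, 1):
    for x2 in (-1, 1):
        a_sum = 20 * x1 + 20 * x2 + 20
        b_sum = 20 * x1 + 20 * x2 - 20
        a, b = sign(a_sum), sign(b_sum)
        c_sum = 10 * a - 20 * b - 20
        print(x1, x2, a_sum, b_sum, c_sum, "->", sign(c_sum))  # matches table 5.1
```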

Figure 5.15: Training of a 2:2:1 network with XOR function starting with "ideal" weights

Also, for more complicated networks, using a more sophisticated model of the synapses and neurons that more closely approximates the actual circuit implementation would be advantageous for computer pretraining.

5.7 Conclusions

A VLSI implementation of a neural network has been demonstrated. Digital weights are used to provide stable weight storage. Analog multipliers are used because full digital multipliers would occupy considerable space for large networks. Although the functions learned were digital, the network is able to accept analog inputs and provide analog outputs for learning other functions. A parallel perturbation technique was used to train the network successfully on the 2-input AND and XOR functions.


Chapter 6

A Parallel Perturbative VLSI Neural Network
A fully parallel perturbative algorithm cannot truly be realized with a serial weight bus, because the act of changing the weights is performed by a serial operation. Thus, it is desirable to add circuitry to allow for parallel weight updates. First, a method for applying random perturbations is necessary. The randomness is necessary because it defines the direction of search for finding the gradient. Since the gradient is not actually calculated, but observed, it is necessary to search for the downward gradient. It is possible to use a technique which does a nonrandom search. However, since no prior information about the error surface is known, in the worst case a nonrandom technique would spend much more time investigating upward gradients which the network would not follow. A conceptually simple technique of generating random perturbations would be to amplify the thermal noise of a diode or resistor (figure 6.1). Unfortunately, the extremely large value of gain required for the amplifier makes the amplifier susceptible to crosstalk. Any noise generated from neighboring circuits would also get amplified. Since some of this noise may come from clocked digital sections, the noise would become very regular, and would likely lead to oscillations rather than the uncorrelated noise sources that are desired. Such a scheme was attempted with separate thermal noise generators for each neuron[33]. The gain required for the amplifier was nearly 1 million and highly correlated oscillations of a few MHz were observed among all the noise generators. Therefore, another technique is required. Instead, the random weight increments can be generated digitally with linear feedback shift registers that produce a long pseudorandom sequence. These random bits are used as inputs to a counter that stores and updates the weights.

Figure 6.1: Amplified thermal noise of a resistor

Figure 6.2: 2-tap, m-bit linear feedback shift register

The counter outputs go directly to the D/A converter inputs of the synapses. If the weight updates led to a reduction in error, the update is kept. Otherwise, an inverter block is activated which inverts the counter inputs coming from the linear feedback shift registers. This has the effect of restoring the original weights. A block diagram of the full neural network circuit function is provided in figure 6.8.

6.1 Linear Feedback Shift Registers

The use of linear feedback shift registers for generating pseudorandom sequences has been known for some time[31]. A simple 2 tap linear feedback shift register is shown in figure 6.2. A shift register with m bits is used. In the figure shown, an XOR is performed from bit locations m and n and its output is fed back to the input of the shift register. In general, however, the XOR can be replaced with a multiple input parity function from any number of input taps.

If one of the feedback taps is not m, but instead the largest feedback tap number is from some earlier bit, l, then this is equivalent to an l-bit linear feedback shift register and the output is merely delayed by the extra m − l shift register bits. It is important to note that not all feedback tap selections will lead to maximal length sequences. A maximal length sequence from such a linear feedback shift register would be of length 2^m − 1 and includes all 2^m possible patterns of 0's and 1's with the exception of the all 0's sequence. The all 0's sequence is considered a dead state, because XOR(0, 0) = 0 and the sequence never changes. Table 6.1 shows an example of a maximal length sequence by choosing feedback taps of positions 4 and 3 from a 4 bit LFSR. It is clear that as long as the shift register does not begin in the all 0's state, it will progress through all 2^m − 1 = 15 sequences and then repeat.

Table 6.1: LFSR with m=4, n=3 yielding 1 maximal length sequence

Table 6.2: LFSR with m=4, n=2, yielding 3 possible subsequences taken from the final bit

Table 6.2 shows an example where non-maximal length sequences are possible by utilizing feedback taps 4 and 2 from a 4 bit LFSR. In this situation it is possible to enter one of three possible subsequences, and the initial state will determine which subsequence is entered. The subsequence lengths are of size 6, 6, and 3; the sum of the sizes of the subsequences equals the size of a maximal length sequence. There is no overlap of the subsequences, so it is not possible to transition from one to another. It is desirable to use the maximal length sequences because they allow the longest possible pseudorandom sequence for the given complexity of the LFSR. Thus, it is important to choose the correct feedback taps. Tables have been tabulated which describe the lengths of possible sequences and subsequences based on LFSR lengths and feedback tap locations[31]. Unfortunately, LFSRs of certain lengths cannot achieve maximal length sequences with only 2 feedback tap locations. For example, for m = 8, it is required to have XOR(4,5,6,8) to get a length 255 sequence, and, for m = 16, it is required to have an XOR(4,13,15,16) to obtain the length 65535 sequence[32]. Table 6.3 provides a list of LFSRs that can obtain maximal length sequences with only 2 feedback taps[32]. It is clear that it is possible to make very long pseudorandom sequences with moderately sized shift registers and a single XOR gate. In applications where analog noise is desired, it is possible to low pass filter the output of the digital pseudorandom noise sequence. This is done by using a filter corner frequency which is much less than the clock frequency of the shift register.

     m     n     Length
     3     2     7
     4     3     15
     5     3     31
     6     5     63
     7     6     127
     9     5     511
    10     7     1023
    11     9     2047
    15    14     32767
    17    14     131071
    18    11     262143
    20    17     1048575
    21    19     2097151
    22    21     4194303
    23    18     8388607
    25    22     33554431
    28    25     268435455
    29    27     536870911
    31    28     2147483647
    33    20     8589934591
    35    33     34359738367
    36    25     68719476735
    39    35     549755813887

Table 6.3: Maximal length LFSRs with only 2 feedback taps
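The dependence on tap selection discussed above is easy to reproduce in a few lines of code. The sketch below (assumed Python, a software model rather than the chip circuit) steps a 2-tap Fibonacci LFSR and reports the cycle length, showing the maximal 15-state sequence for m = 4, n = 3 and the shorter 6- and 3-state subsequences for m = 4, n = 2.

```python
# Minimal 2-tap Fibonacci LFSR model: the feedback bit is the XOR of bit
# positions m and n and is shifted in at position 1.

def lfsr_period(m, n, seed=1):
    """Number of clock cycles before the LFSR returns to `seed`."""
    state = seed
    count = 0
    while True:
        fb = ((state >> (m - 1)) ^ (state >> (n - 1))) & 1  # XOR of taps m and n
        state = ((state << 1) | fb) & ((1 << m) - 1)         # shift, keep m bits
        count += 1
        if state == seed:
            return count

print(lfsr_period(4, 3))          # 15 = 2**4 - 1: the maximal length sequence
print(lfsr_period(4, 2))          # 6:  one of the non-maximal subsequences
print(lfsr_period(4, 2, seed=6))  # 3:  the short subsequence
```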

6.2 Multiple Pseudorandom Noise Generators

From the previous discussion it is clear that linear feedback shift registers are a useful technique for generating pseudorandom noise. However, a parallel perturbative neural network requires as many uncorrelated noise sources as there are weights, and an LFSR only provides one such noise source. It is not possible to use the different taps of a single LFSR as separate noise sources because these taps are merely the same noise pattern offset in time and thus highly correlated. One solution would be to use one LFSR with different feedback taps and/or initial states for every noise source required. For large networks with long training times, this approach becomes prohibitively expensive in terms of area and possibly power required to implement the scheme. Because of this, several approaches have been taken to resolve the problem, many of which utilize cellular automata.

In one approach[36], they build from a standard LFSR with the addition of an XOR network, with inputs coming from the taps of the LFSR and with outputs providing the multiple noise sources. First, the analysis begins with an LFSR consisting of shift register cells numbering a0 to aN−1, with the cell marked a0 receiving the feedback. Each of the N units in the LFSR except for the a0 unit is given by

a_i^{t+1} = a_{i-1}^t

The first unit is described by

a_0^{t+1} = \bigoplus_{i=0}^{N-1} c_i \, a_i^t

where ⊕ represents the modulo 2 summation and the c_i are the boolean feedback coefficients, where a coefficient of 1 represents a feedback tap at that cell. The outputs are given by

g^t = \bigoplus_{i=0}^{N-1} d_i \, a_i^t

A detailed discussion is given as to the appropriate choice of the c_i and d_i coefficients to ensure good noise properties.

In another approach, a standard boolean cellular automaton is used[37][38]. The boolean cellular automaton is comprised of N unit cells, a_i, wherein their state at time t + 1 depends only on the states of all of the unit cells at time t,

a_i^{t+1} = f\left(a_0^t, a_1^t, \ldots, a_{N-1}^t\right)

An analysis was performed to choose a cellular automaton with good independence between output bits. The following transition rule was chosen:

a_i^{t+1} = \left(a_{i+1}^t + a_i^t\right) \oplus a_{i-1}^t

Furthermore, every 4th bit is chosen as an output to ensure independent, unbiased bits.

6.3 Multiple Pseudorandom Bit Stream Circuit

Another simplified approach utilizes two counterpropagating LFSRs with an XOR network to combine outputs from different taps to obtain uncorrelated noise[39]. This was the scheme that was ultimately implemented in hardware. Figure 6.4 shows two LFSRs with m = 7, n = 6 and m = 6, n = 5. In the figure, the bits are combined to provide 9 pseudorandom bit sequences. Since the two LFSRs are counterpropagating, the length of the output sequences obtained from the XOR network is (2^6 − 1)(2^7 − 1) = 8001 bits[31]. It is possible to obtain more channels and larger sequence lengths with the use of larger LFSRs.

6.4 Up/down Counter Weight Storage Elements

The weights in the network are represented directly as the bits of an up/down digital counter. The output bits of the counter feed directly into the digital input word weight bits of the synapse circuit. Updating the weights becomes a simple matter of incrementing or decrementing the counter to the desired value. The counter used is a standard synchronous up/down counter using full adders and registers (figure 6.5)[34].

Figure 6.4: Dual counterpropagating LFSR multiple random bit generator
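The idea behind figure 6.4 and section 6.3 can be illustrated in software. The sketch below (assumed Python) runs two maximal length LFSRs of coprime periods and XORs pairs of their bits to obtain several pseudorandom streams whose joint period is 127 × 63 = 8001; the particular tap pairing per channel is an assumption for illustration and differs from the chip's actual XOR wiring.

```python
# Software model of combining two maximal length LFSRs into multiple
# pseudorandom bit streams, as described for the dual-LFSR generator.

def lfsr_stream(m, n, seed=1):
    """Yield successive states of a 2-tap Fibonacci LFSR."""
    state = seed
    while True:
        fb = ((state >> (m - 1)) ^ (state >> (n - 1))) & 1
        state = ((state << 1) | fb) & ((1 << m) - 1)
        yield state

a = lfsr_stream(7, 6)   # period 127
b = lfsr_stream(6, 5)   # period 63

# Each channel k XORs one bit of LFSR A with one bit of LFSR B; the
# combined stream repeats only every 127 * 63 = 8001 clock cycles.
streams = [[] for _ in range(9)]
for _ in range(20):
    sa, sb = next(a), next(b)
    for k in range(9):
        i, j = k % 7, k % 6
        streams[k].append(((sa >> i) ^ (sb >> j)) & 1)

print(streams[0])   # first few bits of one of the 9 combined streams
```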

Figure 6.5: Synchronous up/down counter

Figure 6.6: Full adder for up/down counter

The full adder is shown in figure 6.6[34]. Since every weight requires 1 counter, and every counter requires 1 full adder for each bit, the adders take up a considerable amount of room. Thus, the adder is optimized for space instead of for speed. Most of the adder transistors are made close to minimum size. The register cell (figure 6.7) used in the counter is simply a 1 bit version of the serial weight bus shift register shown in figure 5.7. This register cell is both small and static, since the output must remain valid for standard feedforward operation after learning is complete.

Figure 6.7: Counter static register

6.5 Parallel Perturbative Feedforward Network

Figure 6.8 shows a block diagram of the parallel perturbative neural network circuit operation. The synapse, binary weighted encoder and neuron circuits are the same as those used for the serial weight bus neural network. However, instead of the synapses interfacing with a serial weight bus, counters with their respective registers are used to store and update the weights.

6.6 Inverter Block

The counter up/down inputs originate in the pseudorandom bit generator and pass through an inverter block (figure 6.9). The inverter block is essentially composed of pass gates and inverters. If the invert signal is low, the bit passes unchanged. If the invert bit is high, then the inversion of the pseudorandom bit gets passed. Control of the inverter block is what allows weight updates to either be kept or discarded.
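The counter-plus-inverter datapath can be summarized behaviorally as follows. The sketch below (assumed Python; the unsigned wraparound is an assumption standing in for the actual weight encoding) applies one pseudorandom up/down bit to each weight counter and then reapplies the same bits through the inverter block to undo the perturbation, which is exactly how a rejected update is discarded.

```python
# Behavioral model of one weight-update step: pseudorandom bit -> inverter
# block -> synchronous up/down counter.  Re-clocking the same bits with
# the invert signal high restores the original weights.

class WeightCounter:
    def __init__(self, bits=6):
        self.value = 0
        self.mask = (1 << bits) - 1          # counter wraps on overflow

    def clock(self, up_bit, invert=False):
        direction = up_bit ^ invert          # inverter block
        self.value = (self.value + (1 if direction else -1)) & self.mask

weights = [WeightCounter() for _ in range(4)]
rand_bits = [1, 0, 1, 1]                     # one pseudorandom bit per weight

for w, b in zip(weights, rand_bits):         # perturb all weights in parallel
    w.clock(b)
print([w.value for w in weights])

for w, b in zip(weights, rand_bits):         # error increased: invert and reapply
    w.clock(b, invert=True)
print([w.value for w in weights])            # original weights restored
```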

Figure 6.8: Block diagram of parallel perturbative neural network circuit

Figure 6.9: Inverter block

6.7 Training Algorithm

The algorithm implemented by the network is a parallel perturbative method[21][25]. The basic idea of the algorithm is to perform a modified gradient descent search of the error surface without calculating derivatives or using explicit knowledge of the functional form of the neurons. The way that this is done is by applying a set of small perturbations on the weights and measuring the effect on the network error. In a typical perturbative algorithm, a weight update rule of the following form is used:

\Delta \vec{w} = -\eta \, \frac{E(\vec{w} + \overrightarrow{pert}) - E(\vec{w})}{\overrightarrow{pert}}

where \vec{w} is a weight vector of all the weights in the network, \overrightarrow{pert} is a perturbation vector applied to the weights which is normally of fixed size, but with random signs, and \eta is a small parameter used to set the learning rate. Thus, the weight update rule can be seen as based on an approximate derivative of the error function with respect to the weights:

\frac{\partial E}{\partial \vec{w}} \approx \frac{\Delta E}{\Delta \vec{w}} = \frac{E(\vec{w} + \overrightarrow{pert}) - E(\vec{w})}{(\vec{w} + \overrightarrow{pert}) - \vec{w}} = \frac{E(\vec{w} + \overrightarrow{pert}) - E(\vec{w})}{\overrightarrow{pert}}

However, this type of weight update rule requires weight changes which are proportional to the difference in error terms. A simpler, but functionally equivalent, weight update rule can be achieved as follows. First, the network error, E(\vec{w}), is measured with the current weight vector, \vec{w}. Next, a perturbation vector, \overrightarrow{pert}, of fixed magnitude but random sign, is applied to the weight vector, yielding a new weight vector, \vec{w} + \overrightarrow{pert}. Afterwards, the effect on the error, \Delta E = E(\vec{w} + \overrightarrow{pert}) - E(\vec{w}), is measured. If the error decreases, then the perturbations are kept and the next iteration is performed. If the error increases, the original weight vector is restored. Thus, the weight update rule is of the following form:

\vec{w}_{t+1} = \begin{cases} \vec{w}_t + \overrightarrow{pert}_t, & \text{if } E(\vec{w}_t + \overrightarrow{pert}_t) < E(\vec{w}_t) \\ \vec{w}_t, & \text{if } E(\vec{w}_t + \overrightarrow{pert}_t) > E(\vec{w}_t) \end{cases}
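A small numerical experiment helps show why the two rules are functionally interchangeable. The sketch below (assumed Python/NumPy, on an arbitrary quadratic stand-in for the network error) runs both the proportional rule and the accept/reject rule; neither uses derivatives, and both drive the error down.

```python
import numpy as np

# Toy comparison of the proportional and accept/reject perturbative rules
# on an assumed quadratic error surface (not the network's real error).

rng = np.random.default_rng(0)
E = lambda w: np.sum((w - 1.0) ** 2)

def proportional_step(w, delta=0.05, eta=0.01):
    pert = delta * rng.choice([-1.0, 1.0], size=w.shape)
    return w - eta * (E(w + pert) - E(w)) / pert      # element-wise division

def accept_reject_step(w, delta=0.05):
    pert = delta * rng.choice([-1.0, 1.0], size=w.shape)
    return w + pert if E(w + pert) < E(w) else w      # keep only if error drops

w1 = np.zeros(3)
w2 = np.zeros(3)
for _ in range(200):
    w1 = proportional_step(w1)
    w2 = accept_reject_step(w2)
print(E(w1), E(w2))   # both far below the starting error E(0) = 3
```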

The use of this rule may require more iterations, since it does not perform an actual weight change every iteration and since the weight updates are not scaled with the resulting changes in error. However, it significantly simplifies the weight update circuitry. First, this rule does not require additional circuitry to change the weight values proportionately with the error difference, and, instead, relies on the same circuitry for the weight update as for applying the random perturbations. Some extra circuitry is required to remove the perturbations when the error doesn't decrease, but this merely involves inverting the signs of all of the perturbations and reapplying them in order to cancel out the initial perturbation. A pseudocode version of the algorithm is presented in figure 6.10.

6.8 Error Comparison

The error comparison section is responsible for calculating the error of the current iteration and interfacing with a control section to implement the algorithm. For either update rule, some means must be available to apply the weight perturbations. Both sections could be performed off-chip by using a computer for chip-in-loop training, as was chosen for the current implementation. This allows flexibility in performing the global functions necessary for implementing the training algorithm, while the local functions are performed on-chip. Nevertheless, there are several alternatives for implementing the error comparison on chip. First, the error comparison could simply consist of analog to digital converters which would then pass the digital information to the control block. Also, the control section could be implemented on chip as a standard digital section such as a finite state machine. Another approach would be to have the error comparison section compare the analog errors directly and then output digital control signals.

    Initialize Weights.
    Get Error.
    while (Error > Error Goal)
        Perturb Weights.
        Get New Error.
        if (New Error < Error)
            Error = New Error.
        else
            Unperturb Weights.
        end
    end

    function Perturb Weights:
        Clock Pseudorandom bit generator.
        Set Invert Bit Low.
        Clock counters.

    function Unperturb Weights:
        Set Invert Bit High.
        Clock counters.

Figure 6.10: Pseudocode for parallel perturbative learning network

Figure 6.11: Error comparison section

Figure 6.12: Error comparison operation: (a) StoreEo high, (b) StoreEp high, (c) CompareE high

A sample error comparison section is shown in figure 6.11. The section uses digital control signals to pass error voltages onto capacitor terminals and then combines the capacitors in such a way as to subtract the error voltages. The error input terminals, Ein+ and Ein−, are connected, in sequence, to two storage capacitors which are then compared. StoreEo is asserted when the unperturbed, initial error is stored. Next, StoreEp is asserted in order to store the perturbed error after the weight perturbations have been applied. Finally, CompareE is asserted to compute the error difference. This error difference is then compared with a reference voltage to determine if the error decreased or increased. The output consists of a digital signal which shows the appropriate error change.

Figure 6.12 shows the operation of the error comparison section in greater detail. In figure 6.12a, StoreEo is high and the other signals are low. The inputs Ein+ and Ein− are connected to Eo+ and Eo− respectively. This stores the unperturbed error voltages on the capacitor, while the capacitor where Ep is stored is left floating. In figure 6.12b, StoreEp is held high while the other signals are low. In the same manner as before, the inputs Ein+ and Ein− are connected to Ep+ and Ep− respectively. This stores the perturbed error voltages on the capacitor. Finally, in figure 6.12c, CompareE is asserted and the other input signals are low. The two storage capacitors are connected together and their charges combine. Note that the polarities of the two storage capacitors are drawn reversed; this will facilitate the subtraction operation in the final step. The total charge, Q_T, across the parallel capacitors is given as follows:

Q_T = Q_+ - Q_- = C\,(E_{p+} + E_{o-}) - C\,(E_{p-} + E_{o+}) = C\,\big((E_{p+} - E_{p-}) - (E_{o+} - E_{o-})\big)

If both capacitors are of size C, the parallel capacitance will be of size 2C. The voltage across the parallel capacitors can be obtained from Q_T = 2C\,V_T. Thus,

V_T = \frac{Q_T}{2C} = \frac{(E_{p+} - E_{p-}) - (E_{o+} - E_{o-})}{2} = \frac{E_p - E_o}{2}
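The charge-sharing result is easy to confirm numerically. The sketch below (assumed Python, with arbitrary example voltages and capacitance) evaluates the charge expression and shows that the shared voltage equals (Ep − Eo)/2.

```python
# Quick numeric check of the switched-capacitor subtraction derived above.

C = 1e-12                      # 1 pF, arbitrary illustrative value
Eo_p, Eo_m = 2.60, 2.40        # unperturbed differential error, Eo = 0.20 V
Ep_p, Ep_m = 2.55, 2.45        # perturbed differential error,   Ep = 0.10 V

Q_T = C * (Ep_p + Eo_m) - C * (Ep_m + Eo_p)
V_T = Q_T / (2 * C)

print(V_T)                                    # -0.05 V: error decreased
print(((Ep_p - Ep_m) - (Eo_p - Eo_m)) / 2)    # same value, (Ep - Eo) / 2
```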

Since these voltages are floating and must be referenced to some other voltage, during this step the combined capacitor is placed onto a reference voltage, Vref, which is usually chosen to be approximately half the supply voltage to allow for the greatest dynamic range. Finally, this error difference feeds into a high gain amplifier to perform the comparison. The high gain amplifier is shown as a differential transconductance amplifier which feeds into a double inverter buffer to output a strong digital signal. The final output, Eout, is given as follows:

E_{out} = \begin{cases} 1, & \text{if } E_p > E_o \\ 0, & \text{if } E_p < E_o \end{cases}

where the 1 represents a digital high signal, and the 0 represents a digital low signal. This output can then be sent to a control section which will use it to determine the next step in the algorithm. The error inputs, which are computed by a difference of the actual output and a target output, can be computed with a similar analog switched capacitor technique. However, it might also be desirable to insert an absolute value computation stage. Simple circuits for obtaining voltage absolute values can be found elsewhere[30]. Also, this sample error computation stage can easily be expanded to accommodate a multiple output neural network.

6.9 Test Results

In this section several sample training run results from the chip are shown. A chip implementing the parallel perturbative neural network was fabricated in a 1.2µm CMOS process. All synapse and neuron transistors were 3.6µm/3.6µm to keep the layout small. The unit size current source transistors were also 3.6µm/3.6µm. An LSB current of 100nA was chosen for the current source. The above neural network circuits were trained with some simple digital functions such as 2 input AND and 2 input XOR. The results of some training runs are shown in figures 6.13-6.15.

As can be seen from the figures, after relatively few iterations, the network weights slowly converge to a correct solution. The error voltages were calculated as a total sum voltage error over all input patterns and were taken directly from the voltage output of the output neuron. The actual output voltage error is arbitrary and depends on the circuit parameters; what is important is that the error goes down. The differential to single-ended converter and a double inverter buffer can be placed at the final output to obtain good digital signals. The network analog output voltage for each pattern is also displayed. Also shown is a digital error signal that shows the number of input patterns where the network gives the incorrect answer for the output. The actual weight values were not made accessible and, thus, are not shown. Another implementation might also add a serial weight bus to read the weights and to set initial weight values. This would also be useful when pretraining with a computer to initialize the weights in a good location.

Figure 6.13 shows the results from a 2 input, 1 output network learning an AND function. This network has only 2 synapses and 1 bias for a total of 3 weights. The network starts with 2 incorrect outputs, which is to be expected with initial random weights. Since the AND function is a very simple function to learn, after relatively few iterations all of the outputs are digitally correct. However, the network is allowed to continue learning, and it continues to train and moves the weight vector in order to better match the training outputs.

Figures 6.14 and 6.15 show the results of training a 2 input, 2 hidden unit, 1 output network with the XOR function. Both figures show that the function is essentially learned after only several hundred iterations. In both figures, although the error voltage is monotonically decreasing, the digital error occasionally increases. This is because the network weights occasionally transition through a region of reduced analog output error, which is used for training, but which actually increases the digital error. This seems to be occasionally necessary for the network to ultimately reach a reduced digital error. In figure 6.15, after being stuck in a local minimum for quite some time, the network finally finds another location where it is able to go slightly closer towards the minimum. Some of the analog output values occasionally show a large jump from one iteration to another.

Figure 6.13: Training of a 2:1 network with the AND function

Figure 6.14: Training of a 2:2:1 network with the XOR function

Figure 6.15: Training of a 2:2:1 network with the XOR function with more iterations

Such a jump occurs when a weight value which is at maximum magnitude overflows and resets to zero. The weights are merely stored in counters, and no special circuitry was added to deal with these overflow conditions. Since the algorithm only accepts weight changes that decrease the error, these overflow conditions should normally not be a problem: if the reset increases the error, the weight reset will simply be discarded. If an overflow and reset on a weight is undesirable, it would require a simple logic block to ensure that if a weight is at maximum magnitude and was incremented, it would not overflow and reset to zero. However, this circuitry would need to be added to every weight counter and would be an unnecessary increase in size. In fact, the weight reset may occasionally be useful for breaking out of local minima. For example, a weight value may have been pushed up against an edge which leads to a local minimum, but a sign flip or significant weight reduction is necessary to reach the global minimum. The network will be unwilling to increase the error as necessary to obtain the global minimum, and in this type of situation the weight reset may provide the necessary escape.

6.10 Conclusions

A VLSI neural network implementing a parallel perturbative algorithm has been demonstrated. Digital weights are used to provide stable weight storage; they are stored and updated via up/down counters and registers. Analog multipliers are used to decrease the space that a full parallel array of digital multipliers would occupy in a large network. All of the local weight update computations and random noise perturbations were generated on chip. The chip was trained using a computer in a chip-in-loop fashion. The computer performed the global operations, including applying the inputs, measuring the outputs, calculating the error function and applying algorithm control signals. The network was successfully trained on the 2-input AND and XOR functions. Although the functions learned were digital, the network is able to accept analog inputs and provide analog outputs for learning other functions. This chip-in-loop configuration is the most likely to be used if the chip were used as a fast, low power component in a larger digital system.

For example, in a handheld PDA, it may be desirable to use a low power analog neural network to assist in the handwriting recognition task, with the microprocessor performing the global functions.

Chapter 7

Conclusions

7.1 Contributions

7.1.1 Dynamic Subthreshold MOS Translinear Circuits

The main contribution of this section involved the transformation of a set of floating gate circuits useful for analog computation into an equivalent set of circuits using dynamic techniques. This makes the circuits more robust, and easier to work with since they do not require using ultraviolet illumination to initialize the circuits. Furthermore, the simplicity with which these circuits can be combined and cascaded to implement more complicated functions was demonstrated.

7.1.2 VLSI Neural Networks

The design of the neuron circuit, although similar to another neuron design[22], is new. The main distinction is a change in the current to voltage conversion element used. The previous element was a diode connected transistor. This element proved to have too large of a gain, and, in particular, learning XOR was difficult with such high neuron gain. Thus, a diode connected cascoded transistor with linearized gain and a gain control stage was used instead.

The parallel perturbative training algorithm with simplified weight update rule was used in the serial weight bus based neural network. Although many different algorithms could have been used, such as serial weight perturbation, this was done with an eye toward the eventual implementation of the fully parallel perturbative network. Excellent training results were shown for training of a 2 input, 1 output network on the AND function and for training a 2 input, 2 hidden layer, 1 output network on the XOR function. Furthermore, an XOR network was trained which was initialized with "ideal" weights.

This was done in order to show not only the effects of circuit offsets in making the outputs initially incorrect, but also the ability to significantly decrease the training time required by using good initial weights.

The overall system design and implementation of the fully parallel perturbative neural network with local computation of the simplified weight update rule was also demonstrated. Although many of the subcircuits and techniques were borrowed from previous designs, the overall system architecture is novel. Excellent training results were again shown for a 2 input, 1 output network trained on the AND function and for a 2 input, 2 hidden layer, 1 output network trained on the XOR function. Thus, the ability of a neural network using analog components for implementation of the synapses and neurons, together with 6 bit digital weights, to learn has been successfully demonstrated.

The choice of 6 bits for the digital weights was made to demonstrate that learning was possible with limited bit precision. The circuits can easily be extended to use 8 bit weights. Using more than 8 bits may not be desirable, since the analog circuitry itself may not have significant precision to take advantage of the extra bits per weight. The size of the neuron cell in dimensionless units was 300λ x 96λ, the synapse was 80λ x 150λ, and the weight counter/register was 340λ x 380λ. In the 1.2µm process used to make the test chips, λ was equal to 0.6µm. In a modern process, such as a 0.35µm process, it would be possible to make a network with over 100 neurons and over 10,000 weights in a 1cm x 1cm chip.

7.2 Future Directions

The serial weight bus neural network and parallel perturbative neural network can be combined. By adding a serial weight bus to the parallel perturbative neural network, it would be possible to initialize the neural network with weights obtained from computer pretraining. Significant strides can also be taken to improve the matching characteristics of the analog circuits, but then the inherent benefits of using a compact, parallel, analog implementation may be lost. In a typical design cycle, significant computer simulation would be carried out prior to circuit fabrication. Part of this simulation may involve detailed training runs of the network using either idealized models or full mixed-signal training simulation with the actual circuit.

With the serial weight bus, it would then be possible to initialize the network with the results of this simulation. The next steps would also involve building larger networks for the purposes of performing specific tasks, for certain applications such as handwriting recognition. Additionally, it may be desirable to train one chip with a standard set of inputs, then read out those weights and copy them to other chips.

G. 1179-1191. vol. Analog Integrated Circuits and Signal Processing. vol. J. R. 46. May 1999. SC-17. “Translinear Circuits: A Proposed Classification”.. June 1987. 1991. IEEE Journal of Solid-State Circuits. “A Monolithic Microsystem for Analog Synthesis of Trigonometric Functions and Their Inverses”. no. pp. F. “Translinear Circuits in Subthreshold MOS”. 1990. no. J. 3. 3rd Ed. G. Haigh. 1. vol. 141-166. Andreou and K. Boahen. Gilbert. Aug. no. Gilbert. Gray and R. Ledgey. IEEE Journal of Solid-State Circuits. Gilbert. Serrano-Gotarredona. 9. 11. 1982. Wiegerink. pp. Toumazou. Bult and H. Seevinck and R. 11-91. [7] E. “Generalized Translinear Circuit Principle”. [5] T. no. IEEE Journal of Solid-State Circuits. vol. Eds. [2] B. 5. B. 6. 1993. Wallinga.. Electronics Letters. and D. London: Peter Peregrinus Ltd.I: Fundamental Theory and Applications.. [6] K. 1996. [3] B. Linares-Barranco. and A. IEEE Transactions on Circuits and Systems . no. G. 26. pp. “Current-Mode Circuits From A Translinear Viewpoint: A Tutorial”. [4] A. Andreou. 14-16. C. vol. pp. “A Class of Analog CMOS Circuits Based on the Square-Law Characteristic of an MOS Transistor in Saturation”. SC-22. G. [8] P. Analogue IC Design: The Current-Mode Approach. Meyer. “A General Translinear Principle for Subthreshold MOS Transistors”. A. . 8. New York: John Wiley and Sons.96 Bibliography [1] B. Analysis and Design of Analog Integrated Circuits. vol. 1975.

”Analysis. [10] W. [17] K. W. No. ”A Multiple-Input Differential-Amplifier Based on Charge Sharing on a Floating-Gate MOSFET”. and Implementation of Networks of MultipleInput Translinear Elements”. 1994. [12] C. 9. Analog Integrated Circuits and Signal Processing.D. M. C. J. “Computer Aided Analysis and Design of MOS Translinear Circuits Operating in Strong Inversion”. [14] E. 6. and How?” Analog Integrated Circuits and Signal Processing. Verleysen and P. 442-445. Chen and E. 33. Arreguit. A. Analog Integrated Circuits and Signal Processing. June 1998. 43. 1. “Analog VLSI Signal Processing: Why. pp. Ph. Chang. 7. Liu and C. [11] S. 9. 12. SC-22. vol. Minch. A. 1987. Vittoz. 181-187. and M. vol. IEEE Transactions on Circuits and Systems . Merz. I. Dualibe. 2. A. Seevinck. IEEE Journal of Solid-State Circuits. Gai. Vittoz. Diorio. G. pp. “Quadratic-Translinear CMOS MultiplierDivider Circuit”. no. 10. H. no. pp. P. [16] B. July 1994. no. 27-44. May 1997. A. vol. 1996. 3. “Precision Compressor Gain Controller in CMOS Technology”. Minch. Electronics Letters. Wiegerink. A. Where. vol. vol. Vol. C. “Two-Quadrant CMOS Analogue Divider”. “A CMOS Square-Law Vector Summation Circuit”. pp. 6. California Institute of Technology. E. Yang and A. Analog Integrated Circuits and Signal Processing. vol. 1996. Mead. 167-179. July 1996. Jespers. Synthesis. no. and C. 3.97 [9] R. Thesis. 34. 1997. 167-179.II: Analog and Digital Signal Processing. ”Translinear Circuits using Subthreshold floating-gate MOS transistors”. no. vol. Andreou. pp. [15] B. Electronics Letters. [13] X. Hasler. no. .

“A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization”. 5. 5. 542-550. [23] M. May 1995. Jayakumar. Masuoka. vol.98 [18] E. [24] B. Yuhas. CA: Morgan Kaufman Publishers.. New York: Addison-Wesley. and D. Keyes. 5. 1993. vol. Jabri. “A Hybrid Analog and Digital VLSI Neural Network for Intracardiac Morphology Classification”. The Physics of VLSI Systems. 1993. Flower and M. 244-251. Flower. 81. and S. vol. Advances in Neural Information Processing Systems. Jabri. [19] S. [21] J. San Mateo. A. pp. “Summed Weight Neuron Perturbation: An O(N) Improvement over Weight Perturbation”. [20] R. Vittoz. 30. Advances in Neural Information Processing Systems. . [25] G. [22] R. Flower. pp. pp. 1987. Shirota. 5. no. Englewood Cliffs. B. IEEE Journal of Solid-State Circuits. Ed. G. Pickard. 1993. pp. vol. vol. M. Cauwenberghs. IEEE Transactions on Neural Networks. Franca J. May 1993. T. Hemink. pp. Meir. CA: Morgan Kaufman Publishers. Jabri and B. 5. Endoh. 1. 3. New Jersey. R. 1992. “Weight Perturbation: An Optimal Architecture and Learning Technique for Analog VLSI Feedforward and Recurrent Multilayer Networks”. pp. and F. Proceedings of the IEEE. Alspector. Prentice Hall. Coggins. Advances in Neural Information Processing Systems. San Mateo. 1994. CA: Morgan Kaufman Publishers. Lippe. 154-157. Design of Analog-Digital VLSI Circuits for Telecommunications and Signal Processing. Tsividis Y. San Mateo. 836-844. 212-219. R. vol. “Reliability Issues of Flash Memory Cells”. 776788.. no. Aritome. no. W. “A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks”. Jan. B. ”Dynamic Analog Techniques”.

pp. 1994. “Analog VLSI Stochastic Perturbative Learning Architectures”. “A VLSIEfficient Technique for Generating Multiple Uncorrelated Noise Source and Its . B. 11. Mass. Analog VLSI and Neural Systems. Revised Ed. “A Single-Transistor Silicon Synapse”. Paulos. “A Complementary Pair of Four-Terminal Silicon Synapses”. B. and C.. [31] S. [28] C. 748-760 [34] N. Cauwenberghs. A. S. Cambridge. Eshraghian. 5. 1989. and R. Horowitz. Mead. pp. Second Ed. Analog Integrated Circuits and Signal Processing. W. no. Mead. Laguna Hills. 153-166. Touretzky. Golomb. Diorio. UK: Cambridge University Press. [27] G. Sept. pp. 1989. Hill. Analog MOS Integrated Circuits for Signal Processing. 1982. J. Nov. [30] C. no. C. 1997.. Mead.. vol. B. 784-791. 655-664. Hasler. Gupta. Second Ed. “Performance of a stochastic learning microchip” in Advances in Neural Information Processing Systems 1. 1972-1980. vol. pp. P. Minch. Ed. [35] R. “A Neural Network Learning Algorithm Tailored for VLSI Implementation”. [29] C. M. B. Haber. S. 5. CA: Aegean Park Press. Alspector. 13. Analog Integrated Circuits and Signal Processing. Shift Register Sequences. 1996. Weste and K. A. San Mateo. Reading. IEEE Transactions on Electron Devices.. W. New York: John Wiley & Sons. Principles of CMOS VLSI Design. 1-2. Hollis and J. Parker.: Addison-Wesley.99 [26] P. 13. P. no. Gannett. Alspector. E. J. 195-209. H. 1989. [33] J. W. May-June 1997. Chu. [32] P. Diorio. vol. 1993. Gregorian and G. New York: Addison-Wesley. IEEE Transactions on Neural Networks. B. CA: Morgan Kaufman. Temes. and W. A. [36] J. Hasler. The Art of Electronics. D. Minch. pp. 43. A. Allen. 1986. and C. and R.

Krogh. Xie and M. 1986. 323. and Backpropagation”. [37] S. Dembo and T. 1992. Jan. Electronics Letters. 1. pp. Dupret. vol. Nature. Perceptrons. pp. vol. World Scientific Publishing Co. Garda. pp. New York: Addison-Wesley. 58-70. Cauwenberghs. “Model-Free Distributed Learning”. Wolfram. no. vol. 9. 2. Papert. no. vol. IEEE Transactions on Neural Networks. and P. A. . 78. “An Analog VLSI Recurrent Neural Network Learning a Continuous-Time Trajectory”. Introduction to the Theory of Neural Computation. [43] T. J. and R. vol. Williams. Sept. Cambridge: MIT Press. 31. 1457-1458. Jabri. 1991. pp.. pp. [39] G. Belhaire. Morie and Y. [45] B. 1086-1093. [44] A. IEEE Transactions on Circuits and Systems. no. Hertz. 7. Pte. 29. 38. pp. and R. 9. vol. vol. “30 Years of Adaptive Neural Networks: Perceptron. Oct. no. A. 334-338. pp. March 1990. Ltd. “Scalable Array of Gaussian White Noise Sources for Analogue VLSI Implementation”. E. Widrow and M. [46] Y. 2. Rumelhart. Palmer. 1. G. 346-361. no. Hinton. vol. Proceedings of the IEEE. IEEE Journal of Solid-State Circuits. 533-536. 1986. [42] D. 9. “Analysis of the effects of quantization in multilayer neural networks using a statistical model”. pp. 1969. Theory and Applications of Cellular Automata. no. Madaline. Lehr.100 Application to Stochastic Neural Networks”. L. E. 109-123. 1. G. “Learning Representations by Back-Propagating Errors”. Minsky and S. 1995. Aug. 1994. [40] J. Kailath. 17. Amemiya. 1990. IEEE Transactions on Neural Networks. E. A. “An All-Analog Expandable Neural Network LSI with On-Chip Backpropagation Learning”. no. March 1996. 1415-1442. 3. Sept. IEEE Transactions on Neural Networks. [41] M. [38] A. 1991.

4. Paraschidis. [48] Y. Training Required. and A. 363373. W. International Joint Conference on Neural Networks. “Multilayer Perceptron Learning Optimized for On-Chip Implementation: A Noise-Robust System”. no. pp. vol. IEEE Transactions on Neural Networks. and B. J. Z. vol. P. pp. R. C. no. 251-259. May 1993. S. J.. Barr. and J. Hirano. no. 8. “Sensitivity of Feedforward Neural Networks to Weight Errors”. “On-Chip Learning in the Analog Domain with Limited Precision Circuits”. Denker and B. 1992. Clarkson. 1. “The pRAM: An Adaptive VLSI Chip”. Stevenson. W. 1. [55] T. 196-201. vol. pp. San Mateo. [51] J. D. vol. S. 1990. 1995. [52] P. D. “Network Generality. Montalvo. no. Paulos. 1995. 789-796. “A Learning Rule of Neural Networks via Simultaneous Perturbation and Its Hardware Implementation”. Neural Networks. K.101 [47] A. pp. [49] V. K. 367-381. and J. vol. IEEE Transactions on Neural Networks. Anderson Ed. G. Hollis. Winter. J. 4. [54] A. New York: American Institute of Physics. Kerns. and Precision Required”. Nov. vol. Guan. [50] D. F. no. Maeda. [53] M. Widrow. J. Neural Information Processing Systems. Hollis. 2. and Y. J. “On the Properties of the Feedforward Method: A Simple Training Law for On-Chip Learning”. 1993. 3. June 1992. Neural Computation. 6. B. pp. 1988. Ng. 219-222. 6. 2. “Analog VLSI Implementation of Multi-dimensional Gradient Descent”. Advances in Neural Information Processing Systems. 1. . 5. Murray. Fleischer. “The Effects of Precision Constraints in a Backpropagation Learning Network”. and Y. Harper. Kanata. Paulos. Wittner. Kirk. CA: Morgan Kaufman Publishers. S. March 1990. vol. H. Neural Computation. IEEE Transactions on Neural Networks. Petridis and K. pp.

[57] H. [61] G. K. no. Colodro. Chiu. “Design and Fabrication of VLSI Components for a General Purpose Analog Neural Computer”. no. 6. and R. Schneider. vol. 41. Eds. Mueller. 1. D. pp. [60] G. 1994. [58] H. Card. 1995. Yariv. [62] P. Blackman.: Kluwer. Loinaz. 209-225. 1. J. T. 1995. Mass. vol. J. “Fault-Tolerant Dynamic Multilevel Storage in Analog VLSI”. C. Nov. 1995. C. vol. IEEE Transactions on Neural Networks. C. & Comp. Eng. G. Cauwenberghs. Clare. and M. Can. “A Micropower CMOS Algorithmic A/D/A Converter”. no. Allen and J. Van Der Spiegel. 12. “Learning of Stable States in Stochastic Asymmetric Networks”. vol. Ib´nez. Dec. 135-169. 1998. P. vol. Elect. 1989.I: Fundamental Theory and Applications. pp. Alspector. Hsieh. E. F. IEEE Transactions of Circuits and Systems . 11. Journal of VLSI Signal Processing. . 5.102 [56] A.. no. 8. T. “Competitive Learning and Vector Quantization in Digital VLSI Systems”. Ismail. Sept. R. Schneider. June 1990. Analog VLSI Implementation of Neural Systems. Mead and M. C.. [63] R. McNeill. Torralba. Norwell. and L. 195-227. Card. T. S. [59] H. Franquelo. Cauwenberghs and A. 18. “Learning Capacitive Weights in Analog CMOS Neural Networks”. S. Donham. 42. B. “Digital VLSI Backpropagation Networks”. and D. Kamarsu. IEEE Transactions on Neural Networks.II: Analog and Digital Signal Processing. 2. IEEE Transactions on Circuits and Systems . 1994. 20. Neurocomputing. pp. vol. vol. C. Card. “Two Digital Circuits a˜ for a Fully Parallel Stochastic Neural Network”. no.

Mahajan. B. R. 4. J. Pineda. May 1992. 3. Montalvo. Baldi and F. IEEE Journal of Solid-State Circuits. pp. Minch. “A High-Resolution Nonvolatile Analog Memory Cell”. Diorio. Neugebauer. F. and A. vol. pp. vol. 3. C. Cauwenberghs. “An Analog VLSI Neural Network with On-Chip Perturbation Learning”. 526-545. . no. [65] C. Gyurcsik. 2233-2236. April 1997. Winter 1991. 1995. Neural Computation. [66] A.103 [64] G. 4. “Contrastive Learning and Neural Oscillations”. vol. and C. Yariv. vol. S. “Analysis and Verification of an Analog VLSI Incremental Outer-Product Learning System”. no. S. Proceedings of the 1995 IEEE International Symposium on Circuits and Systems. Mead. 3. IEEE Transactions on Neural Networks. Hasler. J. [67] P. 32. Paulos. no. and J. P. 3.