
Thesis by

Vincent F. Koosh

In Partial Fulﬁllment of the Requirements

for the Degree of

Doctor of Philosophy


California Institute of Technology

Pasadena, California

2001

(Submitted April 20, 2001)


© 2001

Vincent F. Koosh

All Rights Reserved


Acknowledgements

First, I would like to thank my advisor, Rod Goodman, for providing me with the

resources and tools necessary to produce this work. Next, I would like to thank

the members of my thesis defense committee, Jehoshua Bruck, Christof Koch, Alain

Martin, and Chris Diorio. I would also like to thank Petr Vogel for being on my

candidacy committee. Thanks also goes out to some of my undergraduate professors,

Andreas Andreou and Gert Cauwenberghs, for starting me oﬀ on this analog VLSI

journey and for providing occasional guidance along the way. Many thanks go to

the members of my lab and the Caltech VLSI community for fruitful conversations

and useful insights. This includes Jeﬀ Dickson, Bhusan Gupta, Samuel Tang, Theron

Stanford, Reid Harrison, Alyssa Apsel, John Cortese, Art Zirger, Tom Olick, Bob

Freeman, Cindy Daniell, Shih-Chii Liu, Brad Minch, Paul Hasler, Dave Babcock,

Adam Hayes, William Agassounon, Phillip Tsao, Joseph Chen, Alcherio Martinoli,

and many others. I thank Karin Knott for providing administrative support for our

lab. I must also thank my parents, Rosalyn and Charles Soﬀar, for making all things

possible for me, and I thank my brothers, Victor Koosh and Jonathan Soﬀar. Thanks

also goes to Lavonne Martin for always lending a helpful hand and a kind ear. I also

thank Grace Esteban for providing encouragement for many years. Furthermore, I

would like to thank many of my other friends and colleagues, too numerous to name,

for the many joyous times.


Abstract

Nature has evolved highly advanced systems capable of performing complex compu-

tations, adaptation, and learning using analog components. Although digital sys-

tems have signiﬁcantly surpassed analog systems in terms of performing precise, high

speed, mathematical computations, digital systems cannot outperform analog sys-

tems in terms of power. Furthermore, nature has evolved techniques to deal with

imprecise analog components by using redundancy and massive connectivity. In this

thesis, analog VLSI circuits are presented for performing arithmetic functions and

for implementing neural networks. These circuits draw on the power of the analog

building blocks to perform low power and parallel computations.

The arithmetic function circuits presented are based on MOS transistors operating

in the subthreshold region with capacitive dividers as inputs to the gates. Because

the inputs to the gates of the transistors are ﬂoating, digital switches are used to

dynamically reset the charges on the ﬂoating gates to perform the computations.

Circuits for performing squaring, square root, and multiplication/division are shown.

A circuit that performs a vector normalization, based on cascading the preceding

circuits, is shown to display the ease with which simpler circuits may be combined to

obtain more complicated functions. Test results are shown for all of the circuits.

Two feedforward neural network implementations are also presented. The ﬁrst

uses analog synapses and neurons with a digital serial weight bus. The chip is trained

in the loop, with the computer performing control and weight updates. By training with

the chip in the loop, it is possible to learn around circuit oﬀsets. The second neural

network also uses a computer for the global control operations, but all of the local op-

erations are performed on chip. The weights are implemented digitally, and counters

are used to adjust them. A parallel perturbative weight update algorithm is used.

The chip uses multiple, locally generated, pseudorandom bit streams to perturb all

of the weights in parallel. If the perturbation causes the error function to decrease,


the weight change is kept, otherwise it is discarded. Test results are shown of both

networks successfully learning digital functions such as AND and XOR.


Contents

Acknowledgements iii

Abstract iv

1 Introduction 1

1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.1 Analog Computation . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.2 VLSI Neural Networks . . . . . . . . . . . . . . . . . . . . . . 2

2 Analog Computation with Translinear Circuits 4

2.1 BJT Translinear Circuits . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Subthreshold MOS Translinear Circuits . . . . . . . . . . . . . . . . . 5

2.3 Above Threshold MOS Translinear Circuits . . . . . . . . . . . . . . 6

2.4 Translinear Circuits With Multiple Linear Input Elements . . . . . . 7

3 Analog Computation with Dynamic Subthreshold Translinear Cir-

cuits 8

3.1 Floating Gate Translinear Circuits . . . . . . . . . . . . . . . . . . . 9

3.1.1 Squaring Circuit . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.1.2 Square Root Circuit . . . . . . . . . . . . . . . . . . . . . . . 11

3.1.3 Multiplier/Divider Circuit . . . . . . . . . . . . . . . . . . . . 12

3.2 Dynamic Gate Charge Restoration . . . . . . . . . . . . . . . . . . . 14

3.3 Root Mean Square (Vector Sum) Normalization Circuit . . . . . . . 16

3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.5.1 Early Eﬀects . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.5.2 Clock Eﬀects . . . . . . . . . . . . . . . . . . . . . . . . . . . 26


3.5.3 Speed, Accuracy Tradeoﬀs . . . . . . . . . . . . . . . . . . . . 27

3.6 Floating Gate Versus Dynamic Comparison . . . . . . . . . . . . . . 27

3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 VLSI Neural Network Algorithms, Limitations, and Implementa-

tions 30

4.1 VLSI Neural Network Algorithms . . . . . . . . . . . . . . . . . . . . 30

4.1.1 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.1.2 Neuron Perturbation . . . . . . . . . . . . . . . . . . . . . . . 31

4.1.3 Serial Weight Perturbation . . . . . . . . . . . . . . . . . . . . 32

4.1.4 Summed Weight Neuron Perturbation . . . . . . . . . . . . . . 33

4.1.5 Parallel Weight Perturbation . . . . . . . . . . . . . . . . . . . 34

4.1.6 Chain Rule Perturbation . . . . . . . . . . . . . . . . . . . . . 35

4.1.7 Feedforward Method . . . . . . . . . . . . . . . . . . . . . . . 36

4.2 VLSI Neural Network Limitations . . . . . . . . . . . . . . . . . . . 36

4.2.1 Number of Learnable Functions with Limited Precision Weights 36

4.2.2 Trainability with Limited Precision Weights . . . . . . . . . . 37

4.2.3 Analog Weight Storage . . . . . . . . . . . . . . . . . . . . . . 38

4.3 VLSI Neural Network Implementations . . . . . . . . . . . . . . . . . 39

4.3.1 Contrastive Neural Network Implementations . . . . . . . . . 40

4.3.2 Perturbative Neural Network Implementations . . . . . . . . . 41

5 VLSI Neural Network with Analog Multipliers and a Serial Digital

Weight Bus 43

5.1 Synapse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.2 Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.3 Serial Weight Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.4 Feedforward Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.5 Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.6 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67


6 A Parallel Perturbative VLSI Neural Network 68

6.1 Linear Feedback Shift Registers . . . . . . . . . . . . . . . . . . . . . 69

6.2 Multiple Pseudorandom Noise Generators . . . . . . . . . . . . . . . 71

6.3 Multiple Pseudorandom Bit Stream Circuit . . . . . . . . . . . . . . . 74

6.4 Up/down Counter Weight Storage Elements . . . . . . . . . . . . . . 74

6.5 Parallel Perturbative Feedforward Network . . . . . . . . . . . . . . . 78

6.6 Inverter Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

6.7 Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.8 Error Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6.9 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

6.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

7 Conclusions 93

7.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

7.1.1 Dynamic Subthreshold MOS Translinear Circuits . . . . . . . 93

7.1.2 VLSI Neural Networks . . . . . . . . . . . . . . . . . . . . . . 93

7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

Bibliography 96


List of Figures

3.1 Capacitive voltage divider . . . . . . . . . . . . . . . . . . . . . . . . 9

3.2 Squaring circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.3 Square root circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.4 Multiplier/divider circuit . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.5 Dynamically restored squaring circuit . . . . . . . . . . . . . . . . . . 14

3.6 Dynamically restored square root circuit . . . . . . . . . . . . . . . . 15

3.7 Dynamically restored multiplier/divider circuit . . . . . . . . . . . . . 16

3.8 Root mean square (vector sum) normalization circuit . . . . . . . . . 17

3.9 Squaring circuit results . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.10 Square root circuit results . . . . . . . . . . . . . . . . . . . . . . . . 22

3.11 Multiplier/divider circuit results . . . . . . . . . . . . . . . . . . . . . 23

3.12 RMS normalization circuit results . . . . . . . . . . . . . . . . . . . . 24

5.1 Binary weighted current source circuit . . . . . . . . . . . . . . . . . 45

5.2 Synapse circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.3 Synapse diﬀerential output current as a function of diﬀerential input

voltage for various digital weight settings . . . . . . . . . . . . . . . . 47

5.4 Neuron circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.5 Neuron differential output voltage, ∆V_out, as a function of ∆I_in, for various values of V_gain . . . . . . . . . . . . . 52

5.6 Neuron positive output, V_out+, as a function of ∆I_in, for various values of V_offset . . . . . . . . . . . . . 53

5.7 Serial weight bus shift register . . . . . . . . . . . . . . . . . . . . . . 55

5.8 Nonoverlapping clock generator . . . . . . . . . . . . . . . . . . . . . 56

5.9 2 input, 2 hidden unit, 1 output neural network . . . . . . . . . . . . 58

5.10 Diﬀerential to single ended converter and digital buﬀer . . . . . . . . 59


5.11 Parallel Perturbative algorithm . . . . . . . . . . . . . . . . . . . . . 60

5.12 Training of a 2:1 network with AND function . . . . . . . . . . . . . 62

5.13 Training of a 2:2:1 network with XOR function starting with small

random weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.14 Network with “ideal” weights chosen for implementing XOR . . . . . 64

5.15 Training of a 2:2:1 network with XOR function starting with “ideal”

weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

6.1 Ampliﬁed thermal noise of a resistor . . . . . . . . . . . . . . . . . . 69

6.2 2-tap, m-bit linear feedback shift register . . . . . . . . . . . . . . . . 69

6.3 Maximal length LFSRs with only 2 feedback taps . . . . . . . . . . . 72

6.4 Dual counterpropagating LFSR multiple random bit generator . . . . 75

6.5 Synchronous up/down counter . . . . . . . . . . . . . . . . . . . . . . 76

6.6 Full adder for up/down counter . . . . . . . . . . . . . . . . . . . . . 77

6.7 Counter static register . . . . . . . . . . . . . . . . . . . . . . . . . . 78

6.8 Block diagram of parallel perturbative neural network circuit . . . . . 79

6.9 Inverter block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.10 Pseudocode for parallel perturbative learning network . . . . . . . . . 83

6.11 Error comparison section . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.12 Error comparison operation . . . . . . . . . . . . . . . . . . . . . . . 84

6.13 Training of a 2:1 network with the AND function . . . . . . . . . . . 88

6.14 Training of a 2:2:1 network with the XOR function . . . . . . . . . . 89

6.15 Training of a 2:2:1 network with the XOR function with more iterations 90


List of Tables

5.1 Computations for network with “ideal” weights chosen for implement-

ing XOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

6.1 LFSR with m=4, n=3 yielding 1 maximal length sequence . . . . . . 70

6.2 LFSR with m=4, n=2, yielding 3 possible subsequences . . . . . . . . 70


Chapter 1 Introduction

Although many applications are well suited for digital solutions, analog circuit design

continues to have an important place. The places where analog circuits are required

and excel are those where digital equivalents are not up to the task. For

example, ultra high speed circuits, such as in wireless applications, require the use

of analog design techniques. Another important area where digital circuits may not

be suﬃcient is that of low power applications such as in biomedical, implantable

devices or portable, handheld devices. Finally, the importance of analog circuitry in

interfacing with the world will never be displaced because the world itself consists of

analog signals.

In this thesis, several circuit designs are described that ﬁt well into the ana-

log arena. First, a set of circuits for performing basic analog calculations such as

squaring, square rooting, multiplication and division are described. These circuits

work with extremely low current levels and would ﬁt well into applications requir-

ing ultra low power designs. Next, several circuits are presented for implementing

neural network architectures. Neural networks have proven useful in areas requiring

man-machine interactions such as handwriting or speech recognition. Although these

neural networks can be implemented with digital microprocessors, the large growth

in portable devices with limited battery life increases the need for custom low

power solutions. Furthermore, the area of operation of the neural network circuits

can be modiﬁed from low power to high speed to meet the needs of the speciﬁc ap-

plication. The inherent parallelism of neural networks allows a compact high speed

solution in analog VLSI.


1.1 Overview

1.1.1 Analog Computation

In chapter 2, a brief review of analog computational circuits is given with an empha-

sis on translinear circuits. First, traditional translinear circuits are discussed which

are implemented with devices showing an exponential characteristic, such as bipolar

transistors. Next, MOS transistor versions of translinear circuits are discussed. In

subthreshold MOS translinear circuits the exponential characteristic is exploited sim-

ilarly to bipolar transistors; whereas, above threshold MOS translinear circuits use

the square law characteristic and require a diﬀerent design style. Also mentioned are

translinear circuits that utilize networks of multiple, linear input elements. These

form the basis of the circuits described in the next chapter.

In chapter 3, a class of analog CMOS circuits that can be used to perform many

analog computational tasks is presented. The circuits utilize MOSFETs in their

subthreshold region as well as capacitors and switches to produce the computations.

A few basic current-mode building blocks that perform squaring, square root, and

multiplication/division are shown. This should be suﬃcient to gain understanding of

how to implement other power law circuits. These circuit building blocks are then

combined into a more complicated circuit that normalizes a current by the square

root of the sum of the squares (vector sum) of the currents. Each of these circuits

has switches at the inputs of its floating gates, which are used to dynamically set

and restore the charges at the ﬂoating gates to proceed with the computation.

1.1.2 VLSI Neural Networks

In chapter 4, a brief review of VLSI neural networks is given. First, the various learning rules appropriate for analog VLSI implementation are discussed. Next, some of the issues associated with analog VLSI networks, such as limited precision weights and weight storage, are presented. Also, a brief discussion

of several VLSI neural network implementations is presented.


In chapter 5, a VLSI feedforward neural network is presented that makes use of

digital weights and analog synaptic multipliers and neurons. The network is trained

in a chip-in-loop fashion with a host computer implementing the training algorithm.

The chip uses a serial digital weight bus implemented by a long shift register to input

the weights. The inputs and outputs of the network are provided directly at pins on

the chip. The training algorithm used is a parallel weight perturbation technique.

Training results are shown for a 2 input, 1 output network trained with an AND

function, and for a 2 input, 2 hidden layer, 1 output network trained with an XOR

function.

In chapter 6, a VLSI neural network that uses a parallel perturbative weight

update technique is presented. The network uses the same synapses and neurons

as the previous network, but all of the local, parallel, weight update computations

are performed on-chip. This includes the generation of random perturbations and

counters for updating the digital words where the weights are stored. Training results

are also shown for a 2 input, 1 output network trained with an AND function, and

for a 2 input, 2 hidden layer, 1 output network trained with an XOR function.
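The keep-or-discard rule of the parallel perturbative update described above can be sketched in software. The Python below is a behavioral illustration only, not the chip's circuitry: the error function, step size, and toy two-weight task are hypothetical placeholders, and the random signs stand in for the chip's locally generated pseudorandom bit streams.

```python
import random

random.seed(0)  # reproducible toy run

def perturbative_step(weights, error_fn, delta=0.05):
    """One parallel perturbative update: perturb every weight at once with a
    random sign, and keep the change only if the error decreases."""
    base_error = error_fn(weights)
    # One random +/-delta per weight (the chip derives these from LFSR bit streams).
    trial = [w + (delta if random.random() < 0.5 else -delta) for w in weights]
    if error_fn(trial) < base_error:
        return trial   # perturbation helped: keep it
    return weights     # otherwise discard it

# Toy task: drive two weights toward a (hypothetical) target vector.
target = [0.3, -0.7]
err = lambda w: sum((wi - ti) ** 2 for wi, ti in zip(w, target))
w = [0.0, 0.0]
for _ in range(2000):
    w = perturbative_step(w, err)
```

Because a trial is accepted only when the error strictly decreases, the error is monotonically non-increasing, which is what makes the scheme robust to offsets in the analog circuitry.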


Chapter 2 Analog Computation with

Translinear Circuits

There are many forms of analog computation. For example, analog circuits can per-

form many types of signal ﬁltering including lowpass, bandpass, and highpass ﬁlters.

Operational ampliﬁer circuits are commonly used for these tasks as well as for generic

derivative, integrator, and gain circuits. The emphasis of this chapter, however, shall

be on circuits with the ability to implement arithmetic functions such as signal mul-

tiplication, division, power law implementations, and trigonometric functions. A well

known class of circuits exists to perform such tasks called translinear circuits. The

basic idea behind using these types of circuits is that of using logarithms for con-

verting multiplications and divisions into additions and subtractions. This type of

operation was well known to users of slide rules in the past.

2.1 BJT Translinear Circuits

The earliest forms of these types of circuits were based on exploiting the exponential

current-voltage relationships of diodes or bipolar junction transistors[1][2].

For a bipolar junction transistor, the collector current, I_C, as a function of its base-emitter voltage, V_be, is given as

I_C = I_S exp(V_be / U_t)

where I_S is the saturation current, and U_t = kT/q is the thermal voltage, which

is approximately 26mV at room temperature. In using this formula, the base-emitter

voltage is considered the input, and the collector current is the output which performs

the exponentiation or antilog operation. It is also possible to think of the collector


current as the input and the base-emitter voltage as the output such as in a diode

connected transistor, where the collector has a wire, or possibly a voltage buﬀer to

remove base current oﬀsets, connecting it to the base. In this case, the equation can

be seen as follows:

V_be = U_t ln(I_C / I_S)

From here it is possible to connect the transistors, using log and antilog techniques,

to obtain various multiplication, division and power law circuits. However, it is not

always trivial to obtain the best circuit conﬁguration for a given application. It is

also possible to obtain trigonometric functions by using other power law functional

approximations[3].
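The log-antilog arithmetic underlying these circuits can be stated numerically. The sketch below is only an illustration of the principle, not a circuit model; the function name and current values are arbitrary.

```python
import math

def translinear_product(i1, i2, i_ref):
    """Log-antilog arithmetic: summing and subtracting logarithms implements
    I1 * I2 / Iref, the same trick a slide rule uses."""
    return math.exp(math.log(i1) + math.log(i2) - math.log(i_ref))

# Example with nanoampere-scale currents (values are arbitrary):
out = translinear_product(2e-9, 3e-9, 1e-9)   # close to 6e-9
```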

2.2 Subthreshold MOS Translinear Circuits

Because of the similarities of the subthreshold MOS transistor equation and the bipo-

lar transistor equation, it is sometimes possible to directly translate BJT translinear

circuits into subthreshold MOS translinear circuits. However, care must be taken

because of the nonidealities introduced in subthreshold MOS.

For a subthreshold MOS transistor, the drain current, I_d, as a function of its gate-source voltage, V_gs, is given as[30]

I_d = I_o exp(κV_gs / U_t)

where I_o is the preexponential current factor, and κ is a process dependent parameter which is normally around 0.7. Thus, it is clear that the subthreshold MOS transistor equation differs from the BJT equation with the introduction of the parameter κ.

Due to this extra parameter, direct copies of the functional form of BJT translinear

circuits into subthreshold MOS translinear circuits do not always lead to the same


implemented arithmetic function. For example, for a particular circuit that performs a

square rooting operation on a current using BJTs, when converted to the subthreshold

MOS equivalent, it will actually raise the current to the κ/(κ+1) power[30]. When κ = 1, this reduces to the appropriate 1/2 power, but since normally κ ≈ 0.7, the power

becomes ≈ 0.41. Nevertheless, several types of subthreshold MOS circuits which

display κ independence do exist[4][5]. These involve a few added restrictions on the

types of translinear loop topologies allowed.
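As a quick numerical check of the exponent shift just described (the κ values are the illustrative ones from the text):

```python
def effective_power(kappa):
    """Exponent actually realized when a BJT square-root topology is copied
    directly into subthreshold MOS: kappa / (kappa + 1) instead of 1/2."""
    return kappa / (kappa + 1.0)

p_ideal = effective_power(1.0)  # 0.5, the intended square root
p_typ = effective_power(0.7)    # about 0.41 for a typical process
```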

Subthreshold MOS translinear circuits are more limited in their dynamic range

than BJT translinear circuits because of entrance into strong inversion. Furthermore,

the characteristics slowly decay in the transition region from subthreshold to above

threshold. However, due to the lack of a required base current, lower power operation

may be possible with the subthreshold MOS circuits. Furthermore, easier integration

with standard digital circuits in a CMOS process is possible.

2.3 Above Threshold MOS Translinear Circuits

The ability of above threshold MOS transistors to exhibit translinear behavior has

also been demonstrated[6].

For an above threshold MOS transistor in saturation, the drain current, I_d, as a function of its gate-source voltage, V_gs, is given as

I_d = K (V_gs − V_t)^2

Because of the square law characteristic used, these circuits are sometimes called

quadratic-translinear or MOS translinear to distinguish them from translinear circuits

based on exponential characteristics. Several interesting circuits have been demon-

strated such as those for performing squaring, multiplication, division, and vector

sum computation[6][10][11]. However, because of the inﬂexibility of the square law

compared to an exponential law, general function synthesis is usually not as straight-

forward as for exponential translinear circuits[7][9]. Although many of the circuits


rely upon transistors in saturation, some also take advantage of transistors biased in

the triode or linear region[12].

These circuits are usually more restricted in their dynamic range than bipolar

translinear circuits. They are limited at the low end by weak inversion and at the high

end by mobility reduction[7]. Furthermore, the square-law current-voltage equation

is usually not as good an approximation to the actual circuit behavior as for both

bipolar circuits and MOS circuits that are deep within the subthreshold region. Also,

for decreasingly small submicron channel lengths, the current-voltage equation begins

to approximate a linear function more than a quadratic[8].

2.4 Translinear Circuits With Multiple Linear In-

put Elements

Another class of translinear circuits exists that involves the addition of a linear net-

work of elements used in conjunction with exponential translinear devices. The linear

elements are used to add, subtract and multiply variables by a constant. From the

point of view of log-antilog mathematics, it is clear that multiplying values by a

constant, n, can be transformed into raising a value to the n-th power. Thus, the ad-

dition of a linear network provides greater ﬂexibility in function synthesis. When the

exponential translinear devices used are bipolar transistors, the linear networks are

usually deﬁned by resistor networks[13][14], and when the exponential translinear de-

vices used are subthreshold MOS transistors, the linear networks are usually deﬁned

by capacitor networks[15][16]. This class of circuits can be viewed as a generalization

of the translinear principle.


Chapter 3 Analog Computation with

Dynamic Subthreshold Translinear

Circuits

A class of analog CMOS circuits has been presented which made use of MOS tran-

sistors operating in their subthreshold region[15][16]. These circuits use capacitively

coupled inputs to the gate of the MOSFET in a capacitive voltage divider conﬁgura-

tion. Since the gate has no DC path to ground it is ﬂoating and some means must

be used to initially set the gate charge and, hence, voltage value. The gate voltage is

initially set at the foundry by a process which puts diﬀerent amounts of charge into

the oxides. Thus, when one gets a chip back from the foundry, the gate voltages are

somewhat random and must be equalized for proper circuit operation. One technique

is to expose the circuit to ultraviolet light while grounding all of the pins. This has

the eﬀect of reducing the eﬀective resistance of the oxide and allowing a conduction

path. Although this technique ensures that all the gate charges are equalized, it does

not always constrain the actual value of the gate voltage to a particular value. This

technique works in certain cases[17], such as when the ﬂoating gate transistors are

in a fully diﬀerential conﬁguration, because the actual gate charge is not critical so

long as the gate charges are equalized. If the ﬂoating gate circuits are operated over

a suﬃcient length of time, stray charge may again begin to accumulate requiring an-

other ultraviolet exposure. For the circuits presented here it is imperative that the

initial voltages on the gates are set precisely and eﬀectively to the same value near

the circuit ground. Therefore, a dynamic restoration technique is utilized that makes

it possible to operate the circuits indeﬁnitely.


Figure 3.1: Capacitive voltage divider

3.1 Floating Gate Translinear Circuits

The analysis begins by describing the math that governs the implementation of these

circuits. A more thorough analysis for circuit synthesis is given elsewhere[15][16]. A

few simple circuit building blocks that should give the main idea of how to implement

other designs will be presented.

For the capacitive voltage divider shown in ﬁg. 3.1, if all of the voltages are

initially set to zero volts and then V_1 and V_2 are applied, the voltage at the node V_T becomes

V_T = (C_1 V_1 + C_2 V_2) / (C_1 + C_2)

If C_1 = C_2, then this equation reduces to

V_T = V_1/2 + V_2/2
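The divider relation can be captured in a minimal numeric sketch; the capacitor and voltage values below are arbitrary examples, not values from the thesis.

```python
def divider_voltage(v1, v2, c1, c2):
    """Floating-node voltage of a two-input capacitive divider, assuming the
    node starts with zero net charge."""
    return (c1 * v1 + c2 * v2) / (c1 + c2)

# With equal capacitors the node sits at the average of the two inputs.
vt = divider_voltage(1.0, 0.5, 100e-15, 100e-15)   # 0.75 V
```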

The current-voltage relation for a MOS transistor operating in subthreshold and in saturation (V_ds > 4U_t, where U_t = kT/q) is given by[30]:

I_ds = I_o exp(κV_gs / U_t)

Using the above equation for a transistor operating in subthreshold, as well as a

capacitive voltage divider, the equations for the desired computations can be derived.

In the following formulations, all of the transistors are assumed to be identical.


Figure 3.2: Squaring circuit

Also, all of the capacitors are of the same size.

3.1.1 Squaring Circuit

For the squaring circuit shown in figure 3.2, the I-V equations for each of the transistors can be written as follows:

(U_t/κ) ln(I_out/I_o) = V_in

(U_t/κ) ln(I_ref/I_o) = V_ref

(U_t/κ) ln(I_in/I_o) = V_in/2 + V_ref/2

Then, these results can be combined to obtain the following:

(U_t/κ) ln(I_in/I_o) = (1/2)(U_t/κ) ln(I_out/I_o) + (1/2)(U_t/κ) ln(I_ref/I_o)

I_out = I_o [(I_in/I_o) / (I_ref/I_o)^(1/2)]^2 = I_in^2 / I_ref
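The squaring relation can be verified numerically from the subthreshold model itself. The device constants below are illustrative placeholders (not measured values from the thesis), and κ cancels exactly as it does in the derivation.

```python
import math

# Illustrative device constants (not measured values from the thesis).
Ut, kappa, Io = 0.0259, 0.7, 1e-15

gate_v = lambda i: (Ut / kappa) * math.log(i / Io)   # gate voltage for current i
drain_i = lambda v: Io * math.exp(kappa * v / Ut)    # current for gate voltage v

i_in, i_ref = 4e-9, 2e-9
v_ref = gate_v(i_ref)
# The input transistor's gate is the capacitive average of V_in and V_ref,
# so V_in must satisfy gate_v(i_in) = (v_in + v_ref) / 2.
v_in = 2 * gate_v(i_in) - v_ref
i_out = drain_i(v_in)   # equals i_in**2 / i_ref; kappa cancels as in the derivation
```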


Figure 3.3: Square root circuit

3.1.2 Square Root Circuit

The equations for the square root circuit shown in figure 3.3 are derived similarly as follows:

(U_t/κ) ln(I_ref/I_o) = V_ref

(U_t/κ) ln(I_in/I_o) = V_in

(U_t/κ) ln(I_out/I_o) = V_in/2 + V_ref/2

Combining the equations then gives

(U_t/κ) ln(I_out/I_o) = (1/2)(U_t/κ) ln(I_in/I_o) + (1/2)(U_t/κ) ln(I_ref/I_o)

I_out = √(I_ref I_in)
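The same subthreshold model confirms the square root (geometric mean) relation numerically; again, the device constants are illustrative placeholders.

```python
import math

# Illustrative device constants (not measured values from the thesis).
Ut, kappa, Io = 0.0259, 0.7, 1e-15

gate_v = lambda i: (Ut / kappa) * math.log(i / Io)

i_in, i_ref = 9e-9, 4e-9
# Here the *output* transistor's gate sits at the capacitive average of
# V_in and V_ref, so its current is the geometric mean of the two currents.
v_fg = (gate_v(i_in) + gate_v(i_ref)) / 2
i_out = Io * math.exp(kappa * v_fg / Ut)   # sqrt(i_ref * i_in)
```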


Figure 3.4: Multiplier/divider circuit

3.1.3 Multiplier/Divider Circuit

For the multiplier/divider circuit shown in figure 3.4, the calculations are again performed as follows:

(U_t/κ) ln(I_in1/I_o) = V_in1/2

(U_t/κ) ln(I_in2/I_o) = V_in2/2

(U_t/κ) ln(I_ref/I_o) = V_ref/2 + V_in2/2

V_ref/2 = (U_t/κ) ln(I_ref/I_o) − (U_t/κ) ln(I_in2/I_o)

(U_t/κ) ln(I_out/I_o) = V_ref/2 + V_in1/2

(U_t/κ) ln(I_out/I_o) = (U_t/κ) [ln(I_ref/I_o) − ln(I_in2/I_o) + ln(I_in1/I_o)]

I_out = I_ref I_in1 / I_in2

Although the input currents are labeled in such a way as to emphasize the dividing

nature of this circuit, it is possible to rename the inputs such that the divisor is the

reference current and the two currents in the numerator would be the multiplying

inputs.

Note that the divider circuit output is only valid when I_ref is larger than I_in2. This is because the gate of the transistor carrying I_ref is limited to the voltage V_fg_ref at the gate by the current source driving I_ref. Since this gate is part of a capacitive voltage divider between V_ref and V_in2, when V_in2 > V_fg_ref, the voltage at the node V_ref is zero and cannot go lower. Thus, no extra charge is coupled onto the output transistors, and when I_in2 > I_ref, I_out ≈ I_in1.

Also, from the input/output relation, it would seem that I_in1 and I_ref are interchangeable inputs. Unfortunately, based on the previous discussion, whenever I_in2 > I_in1, I_out ≈ I_ref. This would mean that no divisions with fractional outputs of I_in1/I_in2 < 1 could be performed. When such a fractional output is desired, the circuit in this configuration would not perform properly.

The preceding discussion indicates that the ﬁnal simpliﬁed input/output relation

of the circuit is slightly deceptive. Care must be taken to ensure that the circuit is

operated in the proper operating region with appropriately sized inputs and reference

currents. Furthermore, when doing the analysis for similar power law circuits the

circuit designer must be aware of the charge limitations at the individual ﬂoating

gates.

A more accurate input/output relation for the multiplier/divider circuit is thus

given as follows:

\[ I_{out} = \begin{cases} \dfrac{I_{ref}\, I_{in1}}{I_{in2}} & \text{if } I_{in2} < I_{ref} \\[4pt] I_{in1} & \text{if } I_{in2} > I_{ref} \end{cases} \]

Figure 3.5: Dynamically restored squaring circuit
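This piecewise behavior is easy to capture in a small numerical model. The function below is a hypothetical idealization of the input/output relation, including the clipping that occurs when I_in2 exceeds I_ref; it is not a device-level simulation:

```python
def divider_output(i_ref, i_in1, i_in2):
    """Idealized multiplier/divider output, including the clipping that
    occurs when I_in2 exceeds I_ref (a sketch, not a device simulation)."""
    if i_in2 < i_ref:
        return i_ref * i_in1 / i_in2   # normal translinear division
    return i_in1                        # reference gate node clips; division fails

print(divider_output(10e-9, 1e-9, 2e-9))   # 5 nA: valid division
print(divider_output(10e-9, 1e-9, 20e-9))  # 1 nA: clipped, returns I_in1
```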

3.2 Dynamic Gate Charge Restoration

All of the above circuits assume that some means is available to initially set the gate

charge level so that when all currents are set to zero, the gate voltages are also zero.

One method of doing so which lends itself well to actual circuit implementation is that

of using switches to dynamically set the charge during one phase of operation, and

then to allow the circuit to perform computations during a second phase of operation.

The squaring circuit in figure 3.5 is shown with switches now added to dynamically equalize the charge. A non-overlapping clock generator generates the two clock signals. During the first phase of operation, φ1 is high and φ2 is low. This is the reset phase of operation. The input currents thus do not affect the circuit; all sides of the capacitors are discharged, and the floating gates of the transistors are grounded and discharged as well. This establishes an initial condition with no current through the transistors, corresponding to zero gate voltage. Then, during the second phase of operation, φ1 goes low and φ2 goes high. This is the compute phase of operation. The transistor gates are allowed to float, and the input currents are reapplied. The circuit now exactly resembles the aforementioned floating gate squaring circuit and is thus able to perform the necessary computation.

Figure 3.6: Dynamically restored square root circuit

The square root (figure 3.6) and multiplier/divider (figure 3.7) circuits are also constructed in the same manner by adding switches connected to φ2 at the drains of each of the transistors and switches connected to φ1 at each of the floating gate and capacitor terminals.

Figure 3.7: Dynamically restored multiplier/divider circuit

3.3 Root Mean Square (Vector Sum) Normalization Circuit

The above circuits can be used as building blocks and combined with current mirrors to perform a number of useful computations. For example, a normalization stage can be made which normalizes a current by the square root of the sum of the squares of other currents. Such a stage is useful in many signal processing tasks. This normalization stage is seen in figure 3.8. The reference current, I_ref, is mirrored to all of the reference inputs of the individual stages. The reference current is doubled with the 1:2 current mirror into the divider stage. This is necessary because the reference current must be larger than the largest current that will be divided by. Since the current that is being divided will be the square root of a sum of squares of two currents, when I_in1 = I_in2 = I_max, it is necessary to make sure that the reference current for the divider section is greater than √2·I_max. Using the 1:2 current mirror and setting I_ref = I_max, the reference current in the divider section will be 2·I_max, which is sufficient to enforce the condition.

Figure 3.8: Root mean square (vector sum) normalization circuit

The ﬁrst two stages that read the input currents are the squaring circuit stages.

The outputs of these stages are summed and then fed back into the square root stage

with a current mirror. The input to the square root stage is thus given by

\[ \frac{I_{in1}^2}{I_{ref}} + \frac{I_{in2}^2}{I_{ref}} \]

The square root stage then performs the computation

\[ \sqrt{I_{ref}\left(\frac{I_{in1}^2}{I_{ref}} + \frac{I_{in2}^2}{I_{ref}}\right)} = \sqrt{I_{in1}^2 + I_{in2}^2} \]

The output of the square root stage is then fed into the multiplier/divider stage as

the divisor current. The reference current for the divider stage is twice the reference

current for the other stages as discussed before. The other input to the divider stage

will be a mirrored copy of one of the input currents. Thus, the output of the divider

stage is given by

\[ \frac{2 I_{ref}\, I_{in1}}{\sqrt{I_{in1}^2 + I_{in2}^2}} \]

The output can then be fed back through another 2:1 current mirror (not shown)

to remove the factor of 2. Thus, the overall transfer function computed would be

\[ I_{out} = \frac{I_{ref}\, I_{in1}}{\sqrt{I_{in1}^2 + I_{in2}^2}} \]

The addition of other divider stages can be used to normalize other input currents.

This circuit easily extends to more variables by adding more input squaring stages

and connecting them all to the input of the current mirror that outputs to the square

root stage.
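The stage-by-stage composition can be checked numerically. The sketch below chains the idealized stage equations (an illustration only; real stages carry the subthreshold and Early-effect caveats discussed in this chapter) and includes the 1:2 and 2:1 mirrors:

```python
import math

def square_stage(i_in, i_ref):       # I_out = I_in^2 / I_ref
    return i_in**2 / i_ref

def sqrt_stage(i_in, i_ref):         # I_out = sqrt(I_in * I_ref)
    return math.sqrt(i_in * i_ref)

def divide_stage(i_ref, i_in1, i_in2):   # I_out = I_ref * I_in1 / I_in2
    return i_ref * i_in1 / i_in2

def rms_normalize(i_in1, i_in2, i_ref):
    summed = square_stage(i_in1, i_ref) + square_stage(i_in2, i_ref)
    vec = sqrt_stage(summed, i_ref)            # sqrt(I_in1^2 + I_in2^2)
    out = divide_stage(2 * i_ref, i_in1, vec)  # divider gets 2*I_ref via 1:2 mirror
    return out / 2                             # final 2:1 mirror removes the factor of 2

i1, i2, iref = 3e-9, 4e-9, 10e-9
print(rms_normalize(i1, i2, iref))  # iref * 3/5 = 6 nA
```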


3.4 Experimental Results

The above circuits were fabricated in a 1.2 µm double-poly CMOS process. All of the pfet transistors used for the mirrors were W = 16.8 µm, L = 6 µm. The nfet switches were all W = 3.6 µm, L = 3.6 µm. The floating gate nfets were all W = 30 µm, L = 30 µm. The capacitors were all 2.475 pF.

The data gathered from the current squaring circuit, square root circuit and mul-

tiplier/divider circuit is shown in ﬁgures 3.9, 3.10, and 3.11, respectively. Figure 3.12

shows the data from the vector sum normalization circuit.

The solid line in the ﬁgures represents the ideal ﬁt. The circle markers represent

the actual data points.

The circuits show good performance over several orders of magnitude in current range. At the high end of current, the circuit deviates from the ideal when one or more transistors leaves the subthreshold region. The subthreshold region for these transistors is below approximately 100 nA. For example, in figure 3.10, an offset from the theoretical fit for the I_ref = 100 nA case is seen. Since the reference transistor is leaving the subthreshold region, more gate voltage is necessary per amount of current. This is because above threshold the current goes as the square of the voltage, whereas in subthreshold it follows the exponential characteristic. Since the diode-connected reference transistor sets up a slightly higher gate voltage, more charge and, hence, more voltage is coupled onto the output transistor. This causes the output current to be slightly larger than expected.

Extra current range at the high end can be achieved by increasing the W/L of the transistors so that they remain in subthreshold at higher current levels. The current W/L is 1; increasing W/L to 10 would change the subthreshold current range of these circuits to be below approximately 1 µA and hence increase the high end of the dynamic range appropriately. Leakage currents limit the low end of the dynamic range.

Some of the anomalous points, such as those seen in ﬁgures 3.9 and 3.11, are

believed to be caused by autoranging errors in the instrumentation used to collect


the data.

The divider circuit, as previously discussed, does not perform the division after I_in2 > I_ref, and instead outputs I_in1, as is seen in the figure.

The normalization circuit shows very good performance. This is because the

reference current and the maximum input currents were chosen to keep the divider

and all circuits within the proper subthreshold operating regions of the building block

circuits. This is helped by the compressive nature of the function.

The reference current for the normalization circuit, I_ref, at the input of the reference current mirror array was set to 10 nA. However, the ideal fit required a value of 14 nA to be used. This was not seen in the other circuits, so it is assumed to be due to the Early effect of the current mirrors. In fact, SPICE simulations of the circuit also predict the value of 14 nA. Therefore, it is possible to use the SPICE simulations to change the mirror transistor ratios to obtain the desired output. Since this is merely a multiplicative effect, it is possible to simply scale the W/L of the final output mirror stage to correct it. Alternatively, it is possible to increase the length of the mirror transistors to reduce the Early effect, or to use a more complicated mirror structure such as a cascoded mirror.

3.5 Discussion

3.5.1 Early Eﬀects

Although setting the ﬂoating gates near the circuit ground has been the goal of the

dynamic technique discussed above, it is possible to allow the ﬂoating gate values

to be reset to something other than the circuit ground. The main eﬀect this has

is to change the dynamic range. If the ﬂoating gates are reset to a value higher

than ground, the transistors will enter the above threshold region with smaller input

currents. In fact, in the circuits previously discussed, the switches will have a small

voltage drop across them and the ﬂoating gates are actually set slightly above ground.

It may also be possible to use a reference voltage smaller than ground in order to set the floating gates in such a way as to increase the dynamic range.

Figure 3.9: Squaring circuit results. Iout = Iin²/Iref, for Iref = 100 pA, 1 nA, 10 nA, and 100 nA.

Figure 3.10: Square root circuit results. Iout = sqrt(Iin · Iref), for Iref = 100 pA, 1 nA, 10 nA, and 100 nA.

Figure 3.11: Multiplier/divider circuit results. Iout = Iref · I1/I2 with Iref = 10 nA, for I1 = 100 pA, 1 nA, and 10 nA.

Figure 3.12: RMS normalization circuit results. Iout = Iref · I1/sqrt(I1² + I2²), measured against I1 and against I2 with the other input at 100 pA, 1 nA, and 10 nA.

One problem with these circuits is the presence of a large Early effect. This effect comes from two sources. First, the standard Early effect comes from an increase in channel current due to channel length modulation[30]. This can be modeled by an extra linear term that is dependent on the drain-source voltage, V_ds:

\[ I_{ds} = I_o\, e^{\kappa V_{gs}/U_t}\left(1 + \frac{V_{ds}}{V_o}\right) \]

The voltage V_o, called the Early voltage, depends on process parameters and is directly proportional to the channel length, L. Another aspect of the Early effect which is important for circuits with floating gates is due to the overlap capacitance of the gate to the drain/source region of the transistor[15]. In these circuits, the main overlap capacitance of interest is the floating gate to drain capacitance, C_fg-d. This also depends on process parameters and is directly proportional to the width, W. This parasitic capacitance acts as another input capacitor from the drain voltage, V_d. In this way, the drain voltage couples onto the floating gate by an amount V_d · C_fg-d/C_T, where C_T is the total input capacitance. Since this is coupled directly onto the gate, it has an exponential effect on the current level.
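The size of this effect can be estimated directly from the capacitor ratio. In the sketch below, the overlap capacitance C_fg-d and the process constants are assumed illustrative values; only the 2.475 pF input capacitor comes from the fabricated circuits:

```python
import math

kappa, U_t = 0.7, 0.0259            # assumed process values
C_in = 2.475e-12                    # input capacitance per side (as fabricated)
C_fg_d = 5e-15                      # assumed gate-drain overlap capacitance
C_T = 2 * C_in + C_fg_d             # approximate total floating-gate capacitance

def current_error(delta_vd):
    """Multiplicative current error from drain-to-gate capacitive coupling."""
    dv_gate = delta_vd * C_fg_d / C_T
    return math.exp(kappa * dv_gate / U_t)

print(current_error(3.0))  # ≈ 1.085: roughly an 8-9% error for a 3 V drain swing
```

Even a femtofarad-scale parasitic produces percent-level errors, which is why the input capacitors must be made large relative to C_fg-d.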

The method chosen to overcome these combined Early effects is first to make the transistors long. If L is sufficiently long, then the Early voltage, V_o, will be large, which reduces the contribution of the linear Early effect. This also requires increasing the width, W, by the same amount as the length, L, to keep the same subthreshold current levels and not lose any dynamic range. However, this increase in W increases C_fg-d, which worsens the exponential Early effect. Thus, it is necessary to further increase the size of the input capacitors to make the total input capacitance large compared to the floating gate to drain parasitic capacitance, reducing the impact of the exponential Early effect. In this way, it is possible to trade circuit size for accuracy.

It may be possible to use some of the switches as cascode transistors to reduce the Early effect in the output transistors of the various stages. The input and reference transistors already have their drain voltages fixed due to the diode connection and would not benefit from this, but the output transistor drains are allowed to go near V_dd. This would involve not setting φ2 all the way to V_dd during the computation and instead setting it to some lower cascode voltage. This may allow a reduction in the size of the transistors and capacitors. Furthermore, it is important to make the input capacitors large enough that other parasitic capacitances are very small compared to them.

3.5.2 Clock Eﬀects

Unlike switched capacitor circuits that require a very high clock rate compared to the

input frequencies, the clock rate for these circuits is determined solely by the leakage

rate of the switch transistors. Thus, it is possible to make the clock as slow as 1 Hz

or slower. The input can change faster than the clock rate and the output will be

valid during most of the computation phase.

The output does require a short settling period due to the presence of glitches in

the output current from charge injection by the switches. The amount of charge injec-

tion is determined by the area of the switches and is minimized with small switches.

Also making the input capacitors large with respect to the size of the switches will re-

duce the eﬀects of the glitches by making the injected charge less eﬀective at inducing

a voltage on the gate.

It may be possible to use a current mode ﬁlter or use two transistors in comple-

mentary phase at the output to compensate for the glitches[18].

Other techniques are also available which can improve matching characteristics

and reduce the size of the circuits. One such technique would involve using a single

transistor with a multiphase clock that can be used as a replacement for all the input

and output transistors in the building blocks[18].


3.5.3 Speed, Accuracy Tradeoﬀs

It is necessary to increase the size of the input capacitors and reduce the size of

the switches in order to improve the accuracy and decrease the eﬀect of glitches.

However, this also has an eﬀect on the speeds at which the circuits may be operated

and the settling times of the signals. Subthreshold circuits are already limited to

low currents which in turn limits their speed of operation since it takes longer to

charge up a capacitor or gate capacitance with small currents. Thus, the increases in

input capacitance necessary to increase accuracy further limits the speed of operation.

Furthermore, the addition of a switch along the current path reduces the speed of the

circuits because of the eﬀective on-resistance of the switch.

The switches feeding into the capacitor inputs can be viewed as a lowpass R-C

ﬁlter at the inputs of the circuits. It is possible to make the switches larger in order to

decrease the eﬀective resistance, but this will increase the parasitic capacitance and

require an increase in the size of the input capacitors to obtain the same accuracy.

Thus, there is an inherent tradeoﬀ between the speed of operation and the accuracy

of the circuits.
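This switch/capacitor lowpass can be quantified with a rough corner-frequency estimate. The on-resistance below is an assumed value for illustration; the middle capacitance is the fabricated 2.475 pF:

```python
import math

R_on = 100e3        # assumed switch on-resistance, ohms
for C_in in (0.5e-12, 2.475e-12, 10e-12):    # candidate input capacitors
    f_c = 1.0 / (2 * math.pi * R_on * C_in)  # lowpass corner frequency of the input
    print(f"C = {C_in*1e12:.3f} pF -> f_c = {f_c/1e3:.0f} kHz")
```

Enlarging the input capacitor for accuracy lowers the corner frequency in direct proportion, which is the speed/accuracy tradeoff in quantitative form.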

3.6 Floating Gate Versus Dynamic Comparison

As previously discussed, it is possible to use these circuits in either the purely ﬂoating

gate version or in the dynamic version. In this section, a brief comparison is made

between the advantages and disadvantages of the two design styles.

Floating gate advantages

• Larger dynamic range - The switches of the dynamic version introduce leakage

currents and oﬀsets which reduce dynamic range.

• No glitches - The ﬂoating gate version does not suﬀer from the appearance of

glitches on the output and, thus, does not have the same settling time require-

ments.


• Constant operation - The output is always valid.

• Better matching with smaller input capacitors - There are fewer parasitics be-

cause of the lack of switches, thus, it is possible to reduce the size of input

capacitors and still have good matching.

• Less area - There are fewer transistors because of the lack of switches and also

smaller input capacitors may be used.

• Less power - It is possible to go to smaller current values and, also, since there

are fewer stacked transistors, smaller operating voltages are possible. Also,

power is saved by not operating any switches.

• Faster - Since it is possible to use smaller input capacitors and there is no added

switch resistance along the current path, it is possible to operate the circuits at

higher speeds.

Dynamic advantages

• No UV radiation required - In many cases it is desirable not to introduce an

extra post processing step in order to operate the circuits. Furthermore, in

some instances, ultraviolet irradiation fails to place these circuits into a proper

subthreshold operating range and instead places the ﬂoating gate in a state

where the transistors are biased above threshold.

• Decreased packaging costs - Packaging costs are increased when it is necessary

to accommodate a UV transparent window[19].

• Robust against long term drift - Transistor threshold drift is a known problem

in CMOS circuits[20] and Flash EEPROMs[19]. Much of the threshold shift is

attributed to hot electron injection. These electrons get stuck in trap states in

the oxide, thereby causing the shift. Much eﬀort is placed into reducing the

number of oxide traps to minimize this eﬀect. The electrons can then exit the

oxide and normally contribute to a very small gate current. However, in ﬂoating


gate circuits, the electrons are not removed and remain as excess charge which

leads to increased threshold drift. Furthermore, as process scales continue to

decrease, direct gate tunneling eﬀects will further exacerbate the problem. The

dynamically restored circuits remove these excess charges during the reset phase

of operation and provide greater immunity against long term drift.

Although the advantages that the pure ﬂoating gate circuits provide outnumber those

of the dynamic circuits, the advantages aﬀorded by the dynamic technique are of

such signiﬁcant importance that many circuit designers would choose to utilize the

technique. In practice, a full consideration of each style’s constraints with respect

to the overall system design criteria would favor one or the other of the two design

styles.

3.7 Conclusion

A set of circuits that may be useful for analog computation and neural network applications has been presented. It is hoped that it is clear

from the derivations how to obtain other power law circuits that may be necessary

and how to combine them to perform useful complex calculations. The dynamic

charge restoration technique is shown to be a useful implementation of this class

of analog circuits. Furthermore, the dynamic charge restoration technique may be

applied to other ﬂoating gate computational circuits that may otherwise require initial

ultraviolet illumination or other methods to set the initial conditions.


Chapter 4 VLSI Neural Network Algorithms, Limitations, and Implementations

4.1 VLSI Neural Network Algorithms

4.1.1 Backpropagation

Early work in neural networks was hindered by the problem that single layer networks

could only represent linearly separable functions[41]. Although it was known that mul-

tilayer networks could learn any boolean function and, hence, did not suﬀer from this

limitation, the lack of a learning rule to program multilayered networks reduced inter-

est in neural networks for some time[40]. The backpropagation algorithm[42], based

on gradient descent, provided a viable means for programming multilayer networks

and created a resurgence of interest in neural networks.

A typical sum-squared error function used for neural networks is defined as follows:

\[ E(\vec{w}) = \frac{1}{2}\sum_i \left(T_i - O_i\right)^2 \]

where T_i is the target output for input pattern i from the training set, and O_i is the respective network output. The network output is a composite function of the network weights and is composed of the underlying neuron sigmoidal function, s(·).

In backpropagation, weight updates, Δw_ij ∝ ∂E/∂w_ij, are calculated for each of the weights. The functional form of the update involves the weights themselves propagating the errors backwards to more internal nodes, which leads to the name backpropagation. Furthermore, the functional forms of the weight update rules are composed of derivatives of the neuron sigmoidal function. Thus, for example, if s(x) = tanh(ax), then the weight update rule would involve functions of the form s'(x) = a(1 − tanh²(ax)). For software implementations of the neural network, the ease of computing this type of function allows an effective means to implement the backpropagation algorithm.
However, if the hardware implementation of the sigmoidal function tends to de-

viate from an ideal function such as the tanh function, it becomes necessary to com-

pute the derivative of the function actually being implemented which may not have

a simple functional form. Otherwise, unwanted errors would get backpropagated and

accumulate. Furthermore, deviations in the multiplier circuits can add oﬀsets which

would also accumulate and signiﬁcantly reduce the ability of the network to learn.

For example, in one implementation, it was discovered that oﬀsets of 1% degraded

the learning success rate of the XOR problem by 50%[43]. In typical analog VLSI

implementations, oﬀsets much larger than 1% are quite common, which would lead

to seriously degraded operation.

4.1.2 Neuron Perturbation

Because of the problems associated with analog VLSI backpropagation implementa-

tions, eﬀorts have been made to ﬁnd suitable learning algorithms which are indepen-

dent of the model of the neuron[44]. For example, in the above example the neuron

was modeled as a tanh function. In a model-free learning algorithm, the functional

form of the neuron, as well as the functional form of its derivative, does not enter

into the weight update rule. Furthermore, analog implementations of the derivative

can prove diﬃcult and implementations which avoid these diﬃculties are desirable.

Most of the algorithms developed for analog VLSI implementation utilize stochastic

techniques to approximate the gradients rather than to directly compute them.

One such implementation[45], called the Madaline Rule III (MR III), approximates

a gradient by applying a perturbation to the input of each neuron and then measuring

the diﬀerence in error, ∆E = E(with perturbation)-E(without perturbation). This

error is then used for the weight update rule.

32

∆w

ij

= −η

∆E

∆net

i

x

j

where net

i

is the weighted, summed input to the neuron i that is perturbed, and

x

j

is either a network input or a neuron output from a previous layer which feeds into

neuron i.

This weight update rule requires access to every neuron input and every neuron

output. Furthermore, multiplication/division circuitry would be required to imple-

ment the learning rule. However, no explicit derivative hardware is required, and no

assumptions are made about the neuron model.

Another variant of this method, called the virtual targets method and which is not

model-free, required the use of the neuron function derivatives in the weight update

rule[47]. The method showed good robustness against noise, but studies of the eﬀects

of inaccurate derivative approximations were not done.

4.1.3 Serial Weight Perturbation

Another implementation which is more optimal for analog implementation involves

serially perturbing the weights rather than perturbing the neurons[23]. The weight

updates are based on a ﬁnite diﬀerence approximation to the gradient of the mean

squared error with respect to a weight. The weight update rule is given as follows:

∆w

ij

=

E

w

ij

+ pert

ij

−E (w

ij

)

pert

ij

This is the forward diﬀerence approximation to the gradient. For small enough

perturbations, pert

ij

, a reasonable approximation is achieved. Smaller perturbations

provide better approximations to the gradient at the expense of very small steps in

the gradient direction which then requires many extra iterations to ﬁnd the minimum.

Thus, a natural tradeoﬀ between speed and accuracy exists.

The gradient approximation can also be improved with the central diﬀerence ap-

33

proximation:

∆w

ij

=

E

w

ij

+

pert

ij

2

−E

w

ij

−

pert

ij

2

pert

ij

However, this requires more feedforward passes of the network to obtain the error

terms. A usual implementation would use a weight update rule of the following form:

∆w

ij

= −

η

pert

E

w

ij

+ pert

ij

−E (w

ij

)

**Since pert and η are both small constants, the above weight update rule merely
**

involves changing the weights by the error diﬀerence scaled by a small constant.

It is clear that this implementation should require simpler circuitry than the neu-

ron perturbation algorithm. The multiplication/division step is replaced by scaling

with a constant. Also, there is no need to provide access to the computations of

internal nodes. Studies have also shown that this algorithm performs better than

neuron perturbation in networks with limited precision weights[46].
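A minimal sketch of serial weight perturbation, using the forward-difference update above on a toy two-weight error surface (the quadratic stands in for the network error; all constants are illustrative):

```python
# Serial weight perturbation on a toy error surface.
def E(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

w = [0.0, 0.0]
eta, pert = 0.1, 1e-4

for _ in range(500):
    for i in range(len(w)):          # perturb one weight at a time
        base = E(w)
        w[i] += pert
        dE = E(w) - base             # measured error change
        w[i] -= pert                 # restore the weight
        w[i] -= (eta / pert) * dE    # update: -eta/pert * (E(w+pert) - E(w))

print(w)   # approaches the minimum at [1.0, -2.0]
```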

4.1.4 Summed Weight Neuron Perturbation

The summed weight neuron perturbation approach[24] combines elements from both

the neuron perturbation technique and the serial weight perturbation technique. Since

the serial weight perturbation technique applies perturbations to the weights one at

a time, its computational complexity was shown to be of higher order than that of

the neuron perturbation technique.

Applying a weight perturbation to each of the weights connected to a particular

neuron is equivalent to perturbing the input of that neuron as is done in neuron

perturbation. Thus, in summed weight neuron perturbation, weight perturbations

are applied for each particular neuron, the error is then calculated and weight updates

are made using the same weight update rule as for serial weight perturbation.

This allows for a reduction in the number of feedforward passes, yet still does not

require access to the outputs of internal nodes of the network. However, this speed


up is at the cost of increased hardware to store weight perturbations. Normally, the

weight perturbations are kept at a ﬁxed small size, but the sign of the perturbation

is random. Thus, the weight perturbation storage can be achieved with 1 bit per

weight.

4.1.5 Parallel Weight Perturbation

The parallel weight perturbation techniques[21][25][48] extend the above methods by

simultaneously applying perturbations to all weights. The perturbed error is then

measured and compared to the unperturbed error. The same weight update rule is

then used as in the serial weight perturbation method. Thus, all of the weights are

updated in parallel.

If there are W total weights in a network, a serial weight perturbation approach gives a factor W reduction in speed over a straight gradient descent approach because the overall gradient is calculated based on W local approximations to the gradient. With parallel weight perturbation, on average, there is a factor W^{1/2} reduction in speed over direct gradient descent[25]. This can be better understood if the parallel perturbative method is thought of in analogy to Brownian motion, where there is considerable movement of the network through weight space but with a general guiding force in the proper direction. Thus, a speedup of W^{1/2} is achieved over serial weight perturbation. This increase in speed may not always be seen in practice because it depends upon certain conditions being met, such as the perturbation size being sufficiently small.
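A sketch of the parallel scheme: all weights are perturbed at once with random signs, and every weight is updated from the single measured error difference. The error surface and all constants are illustrative:

```python
import random

random.seed(1)
targets = [1.0, -2.0, 0.5]

def E(w):   # toy error surface standing in for the network error
    return sum((wi - t) ** 2 for wi, t in zip(w, targets))

w = [0.0, 0.0, 0.0]
eta, pert = 0.05, 1e-3

for _ in range(2000):
    signs = [random.choice((-1.0, 1.0)) for _ in w]    # random perturbation signs
    wp = [wi + s * pert for wi, s in zip(w, signs)]    # perturb all weights at once
    dE = E(wp) - E(w)                                  # one error measurement
    # each weight moves against the measured error change, per its own sign
    w = [wi - eta * (dE / pert) * s for wi, s in zip(w, signs)]

print(w)   # drifts toward [1.0, -2.0, 0.5]
```

The cross terms between weights act as zero-mean noise, which is the Brownian-motion behavior described above.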

It is also possible to improve the convergence properties of the network by av-

eraging over multiple perturbations for each input pattern presentation[21]. This is

feasible only if the speed of averaging over multiple perturbations is greater than the

speed of applying input patterns. This usually holds true because the input patterns

must be obtained externally; whereas, the multiple perturbations are applied locally.

Furthermore, convergence rates can be improved by using an annealing schedule

where large perturbation sizes are used initially and then reduced. In this way, large


weight changes are discovered ﬁrst and gradually decreased to get ﬁner and ﬁner

resolution in approaching the minimum.

4.1.6 Chain Rule Perturbation

The chain rule perturbation method is a semi-parallel method that mixes ideas from both serial weight perturbation and neuron perturbation[26]. In a 1 hidden layer network, the output layer weights are first updated serially using the weight update formula from the serial weight perturbation method. Next, a hidden layer neuron output, n_i, is perturbed by an amount npert_i. The effect on the error is measured and a partial gradient term, ΔE/npert_i, is stored. This neuron output perturbation is repeated sequentially, and a partial error gradient term is stored for every neuron in the hidden layer. Then, each of the weights, w_ij, feeding into the hidden layer neurons is perturbed in parallel by an amount wpert_ij. The resultant change in hidden layer neuron outputs, Δn_i, from this parallel weight perturbation is measured and then used to calculate and store another partial gradient term, Δn_i/wpert_ij, for each weight/neuron pair. Finally, the hidden layer weights are updated in parallel according to the following rule:

\[ \Delta w_{ij} = -\eta\,\frac{\Delta E}{\mathrm{wpert}_{ij}} = -\eta\,\frac{\Delta E}{\mathrm{npert}_i}\cdot\frac{\Delta n_i}{\mathrm{wpert}_{ij}} \]

Hence, the name of the method comes from the analogy of the weight update rule to the chain rule in derivative calculus. For more hidden layers, the same hidden layer weight update procedure is repeated.

This method allows a speedup over serial weight perturbation by a factor approaching the number of hidden layer neurons, at the expense of the extra hardware necessary to store the partial gradient terms. For example, in a network with i inputs, j hidden nodes, and k outputs, the chain rule perturbation method requires k·j + j + i perturbations, whereas serial perturbation requires k·j + j·i perturbations.

4.1.7 Feedforward Method

The feedforward method is a variant of serial weight perturbation that incorporates

a direct search technique[49]. First, for a weight, w

ij

, the direction of search is found

by applying a fraction of the perturbation. For example if 0.1pert is applied and

the error goes down, then the direction of search is positive; otherwise it is negative.

Next, in the forward phase, perturbations of size pert continue to be applied in the

direction of search until the error does not decrease or some predetermined maximum

number of forward perturbations is reached. Then, in the backward phase, a step of size pert/2 in the reverse direction is taken. Successive backwards steps are taken with

each step decreasing by another factor of 2 until the error again goes up or another

predetermined maximum number of backward perturbations is reached. These steps

are repeated for every weight.

This method is eﬀectively a sophisticated annealing schedule for the serial weight

perturbation method.
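The per-weight direct search can be sketched as follows. This is an illustrative implementation, not the referenced hardware; the default perturbation size and the forward/backward step limits are assumptions.

```python
def direct_search_update(weights, idx, error_fn, pert=0.1,
                         max_fwd=8, max_bwd=4):
    """Feedforward direct-search update for one weight (sketch).

    error_fn() re-evaluates the network error with the current weights.
    """
    # Probe with a fraction of the perturbation to pick the search direction.
    base = error_fn()
    weights[idx] += 0.1 * pert
    direction = 1.0 if error_fn() < base else -1.0
    weights[idx] -= 0.1 * pert

    # Forward phase: keep stepping by pert while the error decreases.
    prev = base
    for _ in range(max_fwd):
        weights[idx] += direction * pert
        e = error_fn()
        if e >= prev:
            break
        prev = e

    # Backward phase: reverse with steps pert/2, pert/4, ... while error drops.
    prev = error_fn()
    step = pert / 2
    for _ in range(max_bwd):
        weights[idx] -= direction * step
        e = error_fn()
        if e >= prev:
            weights[idx] += direction * step   # undo the unhelpful step
            break
        prev = e
        step /= 2
    return error_fn()
```

Applied to a one-dimensional quadratic error, the forward phase overshoots the minimum by at most one step and the halving backward phase then closes most of the remaining gap, which is the annealing behavior described above.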

Many other variations and combinations of these stochastic learning rules are pos-

sible. Most of the stochastic algorithms and implementations concentrate on iterative

approaches to weight updates as discussed above, but continuous-time implementa-

tions are also possible[50].

4.2 VLSI Neural Network Limitations

4.2.1 Number of Learnable Functions with Limited Precision

Weights

Clearly in a neural network with a small, ﬁxed number of bits per weight, the ability

of the network to classify multiple functions is limited. Assume that each weight is

comprised of b bits of information and that there are W total weights. The total

number of bits in the network is given as B = bW. Since each bit can take on only 1

of 2 values, the total number of diﬀerent functions that the network can implement

is no more than 2^B. In computer simulations of neural networks where floating point


precision is used and b is on the order of 32 or 64, this is usually not a limitation to

learning. However, if b is a small number like 5, then the ability of the network to

learn a specific function may be compromised. Furthermore, 2^B is merely an upper

bound. There are many diﬀerent settings of the weights that will lead to the same

function. For example, the hidden neuron outputs can be permuted since the ﬁnal

output of the network does not depend upon the order of the hidden layer neurons.

Also, the sign of a hidden layer neuron's output can be flipped by adjusting the weights coming into that neuron and likewise flipping the sign of the weight on the output of the neuron to ensure the same output. In one result, the

total number of boolean functions which can be implemented is given approximately by 2^B/(N! 2^N) for a fully connected network with N units in each layer[51].
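To put numbers on these bounds, here is a small worked example; the parameter values are illustrative choices, not taken from the thesis.

```python
from math import factorial

# Illustrative parameters: b bits per weight, W weights, N units per layer.
b, W, N = 5, 16, 4
B = b * W                                   # total bits in the network: B = bW
upper = 2 ** B                              # naive bound: at most 2^B functions
refined = upper // (factorial(N) * 2 ** N)  # ~ 2^B / (N! 2^N), as in [51]
print(B, upper, refined)
```

Even for this tiny network the permutation and sign symmetries shave a factor of N! 2^N = 384 off the naive 2^80 count.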

4.2.2 Trainability with Limited Precision Weights

Several studies have examined how the number of bits per weight constrains the ability of a neural network to train properly[52][53][54]. Most of these

are speciﬁc to a particular learning algorithm. In one study, the eﬀects of the number

of bits of precision on backpropagation were investigated[52]. The weight precision

for a feedforward pass of the network, the weight precision for the backpropagation

weight update calculation, and the output neuron quantization were all tested for a

range of bits. The eﬀect of the number of bits on the ability of the network to prop-

erly converge was determined. The empirical results showed that a weight precision

of 6 bits was suﬃcient for feedforward operation. Also, neuron output quantization

of 6 bits was suﬃcient. This implies that standard analog VLSI neurons should be

suﬃcient for a VLSI neural network implementation. For the backpropagation weight

update calculation, at least 12 bits were required to obtain reasonable convergence.

This shows that a perturbative algorithm that relies only on feedforward operation

can be eﬀective in training with as few as half the number of bits required for a back-

propagation implementation. Due to the increased hardware complexity necessary

to achieve the extra number of bits, the perturbative approaches will be desirable in


analog VLSI implementations.

4.2.3 Analog Weight Storage

Due to the inherent size requirements to store and update digital weights with A/D

and D/A converters, some analog VLSI implementations concentrate on reducing

the space necessary for weight storage by using analog weights. Usually, this involves

storing weights on a capacitor. Since the capacitors slowly discharge, it is necessary to

implement some sort of scheme to continuously refresh the weight. One such scheme,

which showed long term 8 bit precision, involves quantization and refresh of the

weights by use of A/D/A converters which can be shared by multiple weights[60][61].

Other schemes involve periodic refresh by use of quantized levels deﬁned by input

ramp functions or by periodic repetition of input training patterns[59].

Since signiﬁcant space and circuit complexity overhead exists for capacitive stor-

age of weights, attempts have also been made at direct implementation of nonvolatile

analog memory. In one such implementation, a dense array of synapses is achieved

with only two transistors required per synapse[64]. The weights are stored perma-

nently on a ﬂoating gate transistor which can be updated under UV illumination. A

second drive transistor is used as a switch to sample the input drive voltage onto a

drive electrode in order to adjust the ﬂoating gate voltage, and, hence, weight value.

The weight multiplication with an input voltage is achieved by biasing the floating gate transistor into the triode region, which produces an output current that is roughly proportional to the product of its floating gate voltage, representing the

weight, and its drain to source voltage, representing the input. The use of UV illu-

mination also allowed a simple weight decay term to be introduced into the learning

rule if necessary. Another approach which also implements a dense array of synapses

based on ﬂoating gates utilized tunneling and injection of electrons onto the ﬂoating

gate rather than UV illumination[28][29]. Using this technique, only one ﬂoating gate

transistor per synapse was required. A learning rule was developed which balances

the eﬀects of the tunneling and injection. In another approach, a 3 transistor analog


memory cell was shown to have up to 14 bits of resolution[65]. Some of the weight

update speeds obtainable using the tunneling and injection techniques may be limited

because rapid writing of the ﬂoating gates requires large oxide currents which may

damage the oxide. Nevertheless, these techniques can be combined with dynamic

refresh techniques where weights and weight updates are quickly stored and learned

on capacitors, but eventually written into nonvolatile analog memories[66].
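The quantize-and-refresh idea behind the A/D/A schemes can be illustrated with a toy model. The bit depth matches the 8-bit precision reported in [60][61], but the droop rate and refresh interval below are assumptions chosen for illustration.

```python
LEVELS = 256    # 8-bit quantization, matching the precision reported in [60][61]

def quantize(w):
    """A/D/A refresh: snap a normalized weight voltage to the nearest level."""
    return round(w * (LEVELS - 1)) / (LEVELS - 1)

def droop(w, dt, rate=1e-4):
    """Capacitor leakage between refreshes (simple linear model, assumed)."""
    return max(0.0, w - rate * dt)

# If the droop between refreshes stays under half an LSB (~0.002 here),
# each refresh restores the stored level exactly and no drift accumulates.
w = quantize(0.7031)
for _ in range(100):                 # 100 droop/refresh cycles
    w = quantize(droop(w, dt=10))    # per-cycle droop of 1e-3 < 0.5 LSB
```

The design constraint this exposes is the one the refresh schemes must satisfy: the refresh period times the leakage rate must stay below half a quantization step.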

4.3 VLSI Neural Network Implementations

Many neural network implementations in VLSI can be found in the literature. This in-

cludes purely digital approaches, purely analog approaches and hybrid analog-digital

approaches. Although the digital implementations are more straightforward to build

using standard digital design techniques, they often scale poorly with large networks

that require many bits of precision to implement the backpropagation algorithms[57][58].

However, some size reductions can be realized by reducing the bit precision and us-

ing the stochastic learning techniques available[55][56]. Other techniques use time

multiplexed multipliers to reduce the number of multipliers necessary.

An early attempt at implementation of an analog neural network involved making

separate VLSI chip modules for the neurons, synapses and routing switches[62]. The

modules could all be interconnected and controlled by a computer. The weights

were digital and were controlled by a serial weight bus from the computer. Since

the components were modular, it could be scaled to accommodate larger networks.

However, each of the individual circuit blocks was quite large and hence was not well

suited for dense integration on a single neural network chip. For example, the circuits

for the neuron blocks consisted of three multistage operational ampliﬁers and three

100 kΩ resistors. The learning algorithms were based on output information from the

neurons and no circuitry for local weight updates was provided.


4.3.1 Contrastive Neural Network Implementations

Some of the earliest VLSI implementations of neural networks were not based on

the stochastic weight update algorithms, but on contrastive weight updates[67]. The

basic idea behind a contrastive technique is to run the network in two phases. In the

ﬁrst phase, a teacher signal is sent, and the outputs of the network are clamped. In

the second phase, the network outputs are unclamped and free to change. The weight

updates are based on a contrast function that measures the diﬀerence between the

clamped, teacher phase, and the unclamped, free phase.

One of the ﬁrst fully integrated VLSI neural networks used one of these contrastive

techniques and showed rapid training on the XOR function[33]. The chip incorporated

a local learning rule, feedback connections and stochastic elements. The learning rule

was based on a Boltzmann machine algorithm, which uses bidirectional, symmetric weights[63]. Briefly, the algorithm works by checking correlations between the input

and output neuron of a synapse. Correlations are checked between a teacher phase,

where the output neurons are clamped to the correct output, and a student phase,

where the outputs are unclamped. If the teacher and student phases share the same

correlation, then no change is made to the weight. If the neurons are correlated in

the teacher phase, but not the student phase, then the weight is incremented. If

it is reversed, then the weight is decremented. The weights were stored as 5 bit

digital words. On each pattern presentation, the network weights are perturbed

with noise and the correlations counted. Although the system actually needs noise

in order to learn, the weight update rule is not based on the previously mentioned

stochastic weight update rules, but is an example of contrastive learning that requires

randomness in order to learn. Analog noise sources were used to produce the random

perturbations. Unfortunately, the gains required to amplify thermal noise caused the

system to be unstable, and highly correlated oscillations were present between all of

the noise sources. Thus, it was necessary to intermittently apply the noise sources in

order to facilitate learning. The network showed good success with a 2:2:2:1 network

learning XOR. The learning rate was shown to be over 100,000 times faster than that


achieved with simulations on a digital computer.
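The correlation rule described above can be written out directly. This is an illustrative sketch for ±1-valued neuron states with a unit step size; both of those choices are assumptions of the sketch, not details from the cited chip.

```python
def boltzmann_weight_update(w, pre, post_teacher, post_student):
    """Contrastive correlation update for one synapse (sketch).

    pre, post_teacher, post_student are +/-1 neuron states. The weight is
    incremented when the pre/post pair is correlated in the teacher phase
    but not the student phase, decremented in the reverse case, and left
    unchanged when the two phases agree.
    """
    teacher = pre * post_teacher    # +1 if correlated in the teacher phase
    student = pre * post_student    # +1 if correlated in the student phase
    if teacher == student:
        return w                    # phases agree: no change
    return w + 1 if teacher > student else w - 1
```

With 5-bit digital weights as in the chip, the returned value would additionally be clipped to the representable range.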

Another implementation showed a contrastive version of backpropagation[43]. The

basic idea was to also use both a clamped phase and a free phase. The weight updates

are calculated in both phases using the standard backpropagation weight update rules.

The actual weight update was taken to be the diﬀerence between the clamped weight

change and the free weight change. The use of the contrastive technique allowed a

form of backpropagation despite circuit oﬀsets. However, this implementation was

not model free, and a derivative generator was also necessary in order to compute

the weight updates. The weights were stored as analog voltages on capacitors with

the eventual goal of using nonvolatile analog weights. Due to the size of the circuitry,

only one layer of the network was integrated on a single chip. The multilayer network

was obtained by cascading several chips together. Successful learning of the XOR

function was shown.

4.3.2 Perturbative Neural Network Implementations

A low power chip implementing the summed weight neuron perturbation algorithm

has been demonstrated[22]. The chip was designed for implantable applications, which necessitated extremely low power consumption and small area. The weights were stored

digitally and a serial weight bus was used to set the weights and weight updates. The

chip was trained in loop with a computer applying the input patterns, controlling

the serial weight bus and reading the output values for the error computation. To

reduce power, subthreshold synapses and neurons were used. The synapses consisted

of transconductance ampliﬁers with the weights encoded as a digital word which

set the bias current. Neurons consisted of current to voltage converters using a diode

connected transistor. They also included a few extra transistors to deal with common

mode cancelation. A 10:6:3 network was trained to distinguish two cardiac arrhythmia patterns. The chip achieved a success rate of over 95% on patient data. With a 3 V supply, only 200 nW was consumed. This is a good example of a situation where an

analog VLSI neural network is very suitable as compared to a digital solution.


Another perturbative implementation was used for training a recurrent neural

network on a continuous time trajectory[39]. A parallel perturbative method was

used with analog volatile weights and local weight refresh circuitry. Two counter-

propagating linear feedback shift registers, which were combined with a network of

XORs, were used to digitally generate the random perturbations. The weight updates

were computed locally. The global functions of error accumulation and comparison

were performed oﬀ chip. Some of the weight refresh circuitry was also implemented

oﬀ chip. The network was successfully trained to generate two quadrature phase

sinusoidal outputs.

The chain rule perturbation algorithm has also been successfully implemented on

chip[66]. The analog weights were stored dynamically and provisions were made on

the chip for nonvolatile analog storage using tunneling and injection, but its use was

not demonstrated. The weight updates were computed locally. Many of the global

functions such as error computation were also performed on the chip. Only a few

control signals from a computer were necessary for its operation. The size of the per-

turbations was externally settable. However, all perturbations seemed to be of the

same sign, thus not requiring any specialized pseudorandom number generation. The

network was shown to successfully train on XOR and other functions. However, the

error function during learning showed some very peculiar behavior. Rather than grad-

ually decreasing to a small value, the error actually increased steadily, then entered

large oscillations between the high error and near zero error. After the oscillations,

the error was close to zero and stayed there. The authors attribute this strange and

erratic behavior to the fact that a large learning rate and a large perturbation size

were used.


Chapter 5 VLSI Neural Network with

Analog Multipliers and a Serial Digital

Weight Bus

Training an analog neural network directly on a VLSI chip provides additional beneﬁts

over using a computer for the initial training and then downloading the weights. The

analog hardware is prone to have oﬀsets and device mismatches. By training with

the chip in the loop, the neural network will also learn these oﬀsets and adjust the

weights appropriately to account for them. A VLSI neural network can be applied in

many situations requiring fast, low power operation such as handwriting recognition

for PDAs or pattern detection for implantable medical devices[22].

There are several issues that must be addressed to implement an analog VLSI

neural network chip. First, an appropriate algorithm suitable for VLSI implementa-

tion must be found. Traditional error backpropagation approaches for neural network

training require too many bits of ﬂoating point precision to implement eﬃciently in

an analog VLSI chip. Techniques that are more suitable involve stochastic weight

perturbation[21],[23],[24],[25],[26],[27], where a weight is perturbed in a random di-

rection, its eﬀect on the error is determined and the perturbation is kept if the error

was reduced; otherwise, the old weight is restored. In this approach, the network

observes the gradient rather than actually computing it.

Serial weight perturbation[23] involves perturbing each weight sequentially. This

requires a number of iterations that is directly proportional to the number of weights.

A significant speed-up can be obtained by perturbing all weights randomly in parallel, measuring the effect on the error, and keeping all of the perturbations if the error is reduced.

Both the parallel and serial methods can potentially beneﬁt from the use of anneal-

ing the perturbation. Initially, large perturbations are applied to move the weights


quickly towards a minimum. Then, the perturbation sizes are occasionally decreased

to achieve ﬁner selection of the weights and a smaller error. In general, however,

optimized gradient descent techniques converge more rapidly than the perturbative

techniques.
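The parallel perturbation with annealing described above can be sketched as follows; the step counts, perturbation sizes, and annealing schedule are illustrative assumptions, not values from the chip.

```python
import numpy as np

def parallel_perturb_train(error_fn, weights, steps=300,
                           pert0=0.2, anneal_every=100, seed=0):
    """Parallel stochastic weight perturbation with annealing (sketch)."""
    rng = np.random.default_rng(seed)
    pert = pert0
    E = error_fn(weights)
    for step in range(steps):
        # Perturb all weights at once in random directions.
        delta = pert * rng.choice([-1.0, 1.0], size=weights.shape)
        E_new = error_fn(weights + delta)
        if E_new < E:                  # keep the perturbation only if it helps
            weights += delta
            E = E_new
        if (step + 1) % anneal_every == 0:
            pert /= 2                  # anneal: occasionally shrink the steps
    return E
```

Because a perturbation is kept only when it reduces the error, the observed error is monotonically nonincreasing; the annealing steps trade early exploration speed for final resolution, exactly the schedule described above.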

Next, the issue of how to appropriately store the weights on-chip in a non-volatile

manner must be addressed. If the weights are simply stored as charge on a capacitor,

they will ultimately decay due to parasitic conductance paths. One method would

be to use an analog memory cell [28],[29]. This would allow directly storing the

analog voltage value. However, this technique requires using large voltages to obtain

tunneling and/or injection through the gate oxide and is still being investigated.

Another approach would be to use traditional digital storage with EEPROMs. This

would then require having A/D/A (one A/D and one D/A) converters for the weights.

A single A/D/A converter would only allow a serial weight perturbation scheme that

would be slow. A parallel scheme, which would perturb all weights at once, would

require one A/D/A per weight. This would be faster, but would require more area.

One alternative would remove the A/D requirement by replacing it with a digital

counter to adjust the weight values. This would then require one digital counter and

one D/A per weight.

5.1 Synapse

A small synapse with one D/A per weight can be achieved by ﬁrst making a binary

weighted current source (Figure 5.1) and then feeding the binary weighted currents

into diode connected transistors to encode them as voltages. These voltages are then

fed to transistors on the synapse to convert them back to currents. Thus, many D/A

converters are achieved with only one binary weighted array of transistors. It is clear

that the linearity of the D/A will be poor because of matching errors between the

current source array and synapses which may be located on opposite sides of the chip.

This is not a concern because the network will be able to learn around these oﬀsets.
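The binary weighted source amounts to a 5-bit current DAC. A minimal behavioral sketch follows, assuming an ideal 100 nA input current and perfect mirror ratios, which the text notes will not hold exactly on chip:

```python
def dac_current(bits, i_in=100e-9):
    """Ideal binary weighted current source: mirrors of ratio 1, 2, 4, 8, 16.

    bits is (b0, ..., b4); the output is i_in * sum(b_n * 2^n), i.e. the
    weight magnitude W in units of the LSB current.
    """
    return i_in * sum(b << n for n, b in enumerate(bits))
```

With all bits set, the ideal full-scale current is 31 × 100 nA = 3.1 µA, the value referenced later in the discussion of figure 5.3.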

The synapse[26],[22] is shown in ﬁgure 5.2. The synapse performs the weighting


Figure 5.1: Binary weighted current source circuit

Figure 5.2: Synapse circuit


of the inputs by multiplying the input voltages by a weight stored in a digital word

denoted by b0 through b5. The sign bit, b5, changes the direction of current to

achieve the appropriate sign.

In the subthreshold region of operation, the transistor equation is given by[30]

I_d = I_d0 e^(κV_gs/U_t)

and the output of the synapse is given by[30],[22]

ΔI_out = I_out+ − I_out− = +I_0 W tanh(κ(V_in+ − V_in−)/(2U_t)) for b5 = 1

ΔI_out = I_out+ − I_out− = −I_0 W tanh(κ(V_in+ − V_in−)/(2U_t)) for b5 = 0

where W is the weight of the synapse encoded by the digital word and I_0 is the least significant bit (LSB) current.

Thus, in the subthreshold linear region, the output is approximately given by

ΔI_out ≈ g_m ΔV_in = (κI_0/(2U_t)) W ΔV_in

In the above threshold regime, the transistor equation in saturation is approximately given by

I_D ≈ K(V_gs − V_t)^2

The synapse output is no longer described by a simple tanh function, but is

nevertheless still sigmoidal with a wider “linear” range.

In the above threshold linear region, the output is approximately given by

ΔI_out ≈ g_m ΔV_in = 2√(KI_0 W) ΔV_in


Figure 5.3: Synapse differential output current as a function of differential input voltage for various digital weight settings (curves for W = 0, ±4, ±8, ±16, ±31)

It is clear that above threshold, the synapse is not doing a pure weighting of the

input voltage. However, since the weights are learned on chip, they will be adjusted

accordingly to the necessary value. Furthermore, it is possible that some synapses will

operate below threshold while others above, depending on the choice of LSB current.

Again, on-chip learning will be able to set the weights to account for these diﬀerent

modes of operation.

Figure 5.3 shows the diﬀerential output current of the synapse as a function of

diﬀerential input voltage for various digital weight settings. The input current of the

binary weighted current source was set to 100 nA. The output currents range from

the subthreshold region for the smaller weights to the above threshold region for the


large weights. All of the curves show their sigmoidal characteristics. Furthermore, it is clear that the width of the linear region increases as the current moves from subthreshold

to above threshold. For the smaller weights, W = 4, the linear region spans only

approximately 0.2V - 0.4V. For the largest weights, W = 31, the linear range has

expanded to roughly 0.6V - 0.8V. As was discussed above, when the current range

moves above threshold, the synapse does not perform a pure linear weighting. The

largest synapse output current is not 3.1µA as would be expected from a linear

weighting of 31 × 100nA, but a smaller number. Notice that the zero crossing of

∆I

out

occurs slightly positive of ∆V

in

= 0. This is a circuit oﬀset that is primarily

due to slight W/L diﬀerences of the diﬀerential input pair of the synapse, and it is

caused by minor fabrication variations.
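The synapse behavior described in this section can be captured in a small behavioral model. The following sketch assumes illustrative values for I_0, κ, and U_t (100 nA, 0.7, and 25 mV) rather than measured chip parameters, and models only the subthreshold tanh regime:

```python
import numpy as np

I0 = 100e-9    # LSB current from the binary weighted source (assumed 100 nA)
KAPPA = 0.7    # subthreshold slope factor (assumed)
UT = 0.025     # thermal voltage, ~25 mV at room temperature

def synapse_current(dvin, bits):
    """Differential output current of the subthreshold synapse model.

    bits is (b0, ..., b5): b0..b4 encode the magnitude W = 0..31 and the
    sign bit b5 selects the current direction, as in the synapse equation.
    """
    W = sum(b << n for n, b in enumerate(bits[:5]))
    sign = 1.0 if bits[5] else -1.0
    return sign * I0 * W * np.tanh(KAPPA * dvin / (2 * UT))
```

For small ΔV_in the model reduces to g_m ΔV_in with g_m = κI_0W/(2U_t), and for large ΔV_in it saturates at ±I_0W, consistent with the subthreshold curves in figure 5.3; the wider above-threshold linear region is not captured by this sketch.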

5.2 Neuron

The synapse circuit outputs a diﬀerential current that will be summed in the neuron

circuit shown in ﬁgure 5.4. The neuron circuit performs the summation from all of the

input synapses. The neuron circuit then converts the currents back into a diﬀerential

voltage feeding into the next layer of synapses. Since the outputs of the synapse will

all have a common mode component, it is important for the neuron to have common

mode cancelation[22]. Since one side of the diﬀerential current inputs may have a

larger share of the common mode current, it is important to distribute this common

mode to keep both diﬀerential currents within a reasonable operating range.

I_in+ = I^d_in+ + (I^d_in+ + I^d_in−)/2 = I^d_in+ + I^d_cm

I_in− = I^d_in− + (I^d_in+ + I^d_in−)/2 = I^d_in− + I^d_cm

ΔI = I_in+ − I_in− = I^d_in+ − I^d_in−


Figure 5.4: Neuron circuit


I_cm = (I_in+ + I_in−)/2 = (I^d_in+ + I^d_in− + 2I^d_cm)/2 = 2I^d_cm

⇒ I^d_in+ = I_in+ − I_cm/2 = ΔI/2 + I_cm/2

⇒ I^d_in− = I_in− − I_cm/2 = −ΔI/2 + I_cm/2

If ΔI is of equal size or larger than I_cm, the transistor carrying I^d_in− may begin to cut off and the above equations would not exactly hold; however, the current cutoff

is graceful and should not normally aﬀect performance. With the common mode

signal properly equalized, the diﬀerential currents are then mirrored into the current

to voltage transformation stage. This stage eﬀectively takes the diﬀerential input

currents and uses a transistor in the triode region to provide a diﬀerential output.

This stage will usually be operating above threshold, because the V_offset and V_cm controls are used to ensure that the output voltages are approximately mid-rail. This

is done by simply adding additional current to the diode connected transistor stack.

Having the outputs mid-rail is important for proper biasing of the next stage of

synapses. The above threshold transistor equation in the triode region is given by

I_d = 2K(V_gs − V_t − V_ds/2)V_ds ≈ 2K(V_gs − V_t)V_ds

for small enough V_ds. If K_1 denotes the prefactor with W/L of the cascode transistor and K_2 denotes the same for the transistor with gate V_out, the voltage output of the neuron will then be given by

I_in = K_1(V_gain − V_t − V_1)^2 ≈ K_2 · 2(V_out − V_t)V_1

V_1 = V_gain − V_t − √(I_in/K_1)

I_in = 2K_2(V_out − V_t)(V_gain − V_t − √(I_in/K_1))

V_out = I_in/(2K_2(V_gain − V_t) − 2√((K_2^2/K_1)I_in)) + V_t

If W_1/L_1 = W_2/L_2, then K_1 = K_2 = K, so

V_out = I_in/(2K(V_gain − V_t) − 2√(K I_in)) + V_t

and, for small I_in,

R ≈ 1/(2K(V_gain − V_t))

Thus, it is clear that V_gain can be used to adjust the effective gain of the stage.
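The closed-form expression can be checked numerically. The device parameters below (K, V_gain, V_t) are illustrative assumptions, not values from the fabricated chip:

```python
from math import sqrt

K = 1e-4         # transistor prefactor, A/V^2 (assumed)
VGAIN = 1.65     # gain control voltage (assumed)
VT = 0.7         # threshold voltage (assumed)

def vout(iin):
    """Closed-form V_out for matched devices (K1 = K2 = K), as derived above."""
    return iin / (2 * K * (VGAIN - VT) - 2 * sqrt(K * iin)) + VT

def transresistance():
    """Small-signal gain R ~ 1/(2K(Vgain - Vt)) of the I-to-V stage."""
    return 1.0 / (2 * K * (VGAIN - VT))
```

Substituting vout back into the implicit relation I_in = 2K(V_out − V_t)(V_gain − V_t − √(I_in/K)) recovers I_in, confirming the algebra, and for tiny currents V_out − V_t approaches R·I_in.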

Figure 5.5 shows how the neuron differential output voltage, ΔV_out, varies as a function of differential input current for several values of V_gain. The neuron shows fairly linear performance with a sharp bend on either side of the linear region. This sharp bend occurs when one of the two linearized, diode connected transistors with gate attached to V_out turns off.

Figure 5.6 displays only the positive output, V_out+, of the neuron. The diode connected transistor with gate attached to V_out+ turns off where the output goes flat on the left side of the curves. This corresponds to the left bend point in figure 5.5. Furthermore, the baseline output voltage corresponds roughly to V_offset. However, as V_offset increases in value, its ability to increase the baseline voltage is reduced because of the cascode transistor on its drain. At some point, especially for small values of V_gain, the V_cm transistor becomes necessary to provide additional offset. Overall, the neuron shows very good linear current to voltage conversion with separate gain and


Figure 5.5: Neuron differential output voltage, ΔV_out, as a function of ΔI_in, for various values of V_gain (curves for V_gain = 1.5 V, 1.65 V, and 2 V)


Figure 5.6: Neuron positive output, V_out+, as a function of ΔI_in, for various values of V_offset (curves for V_offset = 2 V, 2.2 V, and 2.4 V)


oﬀset controls.

5.3 Serial Weight Bus

A serial weight bus is used to apply weights to the synapses. The weight bus merely

consists of a long shift register to cover all of the possible weight and threshold bits.

Figure 5.7 shows the circuit schematic of the shift register used to implement the

weight bus. The input to the serial weight bus comes from a host computer which

implements the learning algorithm. The last output bit of the shift register, Q5,

feeds into the data input, D, of the next shift register to form a long compound shift

register.

The shift register storage cell is accomplished by using two staticizer or jamb

latches[34] in series. The latch consists of two interlocked inverters. The weak in-

verter, labeled with a W in the ﬁgure, is made suﬃciently weak such that a signal

coming from the main inverter and passing through the pass gate is suﬃciently strong

to overpower the weak inverter and set the state. Once the pass gate is turned oﬀ,

the weak inverter keeps the current bit state. The shift register is controlled by a dual

phase nonoverlapping clock. During the read phase of operation, when φ_1 is high, the data from the previous register is read into the first latch. Then, during the write phase of operation, when φ_2 is high, the bit is written to the second latch for storage.

Figure 5.8 shows the nonoverlapping clock generator[35]. A single CLK input

comes from an oﬀ chip clock. The clock generator ensures that the two clock phases

do not overlap in order that no race conditions exist between reading and writing.

The delay inverters are made weak by making the transistors long. They set the

appropriate delay such that after one clock transitions from high to low, the next

one does not transition from low to high until after the small delay. Since the clock

lines go throughout the chip and drive long shift registers, it is necessary to make

the buﬀer transistors with large W/L ratios to drive the large capacitances on the

output.
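The two-latch, two-phase behavior can be modeled at the register-transfer level. This is a behavioral sketch of the shift action, not the transistor circuit:

```python
class ShiftStage:
    """One shift register stage: two latches in series, as described above."""
    def __init__(self):
        self.first = 0   # latch written during the read phase (phi1 high)
        self.q = 0       # output latch written during the write phase (phi2 high)

def clock_bit(stages, d):
    """Apply one full clock: phi1 samples, then (nonoverlapping) phi2 commits.

    Because no stage's output changes until every stage has sampled, there
    is no race between reading and writing along the chain.
    """
    inputs = [d] + [s.q for s in stages[:-1]]
    for stage, x in zip(stages, inputs):   # phi1: read the previous stage's Q
        stage.first = x
    for stage in stages:                   # phi2: commit to the output latch
        stage.q = stage.first
```

Shifting a 6-bit word through six stages takes six clocks, after which the first bit presented sits in the last stage, mirroring how a weight word propagates down the compound register.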


Figure 5.7: Serial weight bus shift register


Figure 5.8: Nonoverlapping clock generator


5.4 Feedforward Network

Using the synapse and neuron circuit building blocks it is possible to construct a

multilayer feed-forward neural network. Figure 5.9 shows a 2 input, 2 hidden unit, 1 output feedforward network for training on XOR.

Note that the nonlinear squashing function is actually performed in the next layer

of synapse circuits rather than in the neuron as in a traditional neural network.

However, this is equivalent as long as the inputs to the ﬁrst layer are kept within the

linear range of the synapses. For digital functions, the inputs need not be constrained

as the synapses will pass roughly the same current regardless of whether the digital

inputs are at the ﬂat part of the synapse curve near the linear region or all the way at

the end of the ﬂat part of the curve. Furthermore, for non-diﬀerential digital signals,

it is possible to simply tie the negative input to mid-rail and apply the standard digital

signal to the positive synapse input. The biases, or thresholds, for each neuron are

simply implemented as synapses tied to ﬁxed bias voltages. The biases are learned in

the same way as the weights.

Also, depending on the type of network outputs desired, additional circuitry may

be needed for the ﬁnal squashing function. For example, if a roughly linear output

is desired, the diﬀerential output can be taken directly from the neuron outputs.

In ﬁgure 5.9, a diﬀerential to single ended converter is shown on the output. The

gain of this converter determines the size of the linear region for the ﬁnal output.

Normally, during training, a somewhat linear output with low gain is desired to have

a reasonable slope to learn the function on. However, after training, it is possible to

take the output after a dual inverter digital buﬀer to get a strong standard digital

signal to send oﬀ-chip or to other sections of the chip.

Figure 5.10 shows a schematic of a differential to single ended converter and a digital buffer. The differential to single ended converter is based on the same transconductance amplifier design used for the synapse. The V_b knob is used to set the bias

current of the ampliﬁer. This bias current then controls the gain of the converter.

The gain is also controlled by sizing of the transistors.


b

5

b

4

b

3

b

2

b

1

b

0

V

in

+

V

in

−

i0 i1 i2 i3 i4 D

1

D D

0

V

in

1

+

V

in

1

−

C

L

K

V

b

ia

s

+

V

b

ia

s

−

b

5

b

4

b

3

b

2

b

1

b

0

V

in

+

V

in

−

i0 i1 i2 i3 i4 D

2

D D

1

C

L

K

V

in

2

+

V

in

2

−

b

5

b

4

b

3

b

2

b

1

b

0

V

in

+

V

in

−

i0 i1 i2 i3 i4 D

3

D D

2

Q

5

Q

4

Q

3

Q

2

Q

1

Q

0

D

C

L

K

S

h

if

t

R

e

g

.

Q

5

Q

4

Q

3

Q

2

Q

1

Q

0

D

C

L

K

S

h

if

t

R

e

g

.

Q

5

Q

4

Q

3

Q

2

Q

1

Q

0

D

C

L

K

S

h

if

t

R

e

g

.

b

5

b

4

b

3

b

2

b

1

b

0

i4 i3 i2 i1 i0

Io

u

t

+

Io

u

t

−

V

in

+

V

in

−

S

y

n

a

p

s

e

b

5

b

4

b

3

b

2

b

1

b

0

i4 i3 i2 i1 i0

Io

u

t

+

Io

u

t

−

V

in

+

V

in

−

S

y

n

a

p

s

e

b

5

b

4

b

3

b

2

b

1

b

0

i4 i3 i2 i1 i0

Io

u

t

+

Io

u

t

−

V

in

+

V

in

−

S

y

n

a

p

s

e

Q

5

Q

4

Q

3

Q

2

Q

1

Q

0

D

C

L

K

S

h

if

t

R

e

g

.

Q

5

Q

4

Q

3

Q

2

Q

1

Q

0

D

C

L

K

S

h

if

t

R

e

g

.

Q

5

Q

4

Q

3

Q

2

Q

1

Q

0

D

C

L

K

S

h

if

t

R

e

g

.

b

5

b

4

b

3

b

2

b

1

b

0

i4 i3 i2 i1 i0

Io

u

t

+

Io

u

t

−

V

in

+

V

in

−

S

y

n

a

p

s

e

b

5

b

4

b

3

b

2

b

1

b

0

i4 i3 i2 i1 i0

Io

u

t

+

Io

u

t

−

V

in

+

V

in

−

S

y

n

a

p

s

e

b

5

b

4

b

3

b

2

b

1

b

0

i4 i3 i2 i1 i0

Io

u

t

+

Io

u

t

−

V

in

+

V

in

−

S

y

n

a

p

s

e

b

5

b

4

b

3

b

2

b

1

b

0

V

in

+

V

in

−

i0 i1 i2 i3 i4 D

4

D D

3

V

in

1

+

V

in

1

−

C

L

K

V

b

ia

s

+

V

b

ia

s

−

b

5

b

4

b

3

b

2

b

1

b

0

V

in

+

V

in

−

i0 i1 i2 i3 i4 D

5

D D

4

C

L

K

V

in

2

+

V

in

2

−

b

5

b

4

b

3

b

2

b

1

b

0

V

in

+

V

in

−

i0 i1 i2 i3 i4 D

6

D D

5

Q

5

Q

4

Q

3

Q

2

Q

1

Q

0

D

C

L

K

S

h

if

t

R

e

g

.

Q

5

Q

4

Q

3

Q

2

Q

1

Q

0

D

C

L

K

S

h

if

t

R

e

g

.

Q

5

Q

4

Q

3

Q

2

Q

1

Q

0

D

C

L

K

S

h

if

t

R

e

g

.

b

5

b

4

b

3

b

2

b

1

b

0

i4 i3 i2 i1 i0

Io

u

t

+

Io

u

t

−

V

in

+

V

in

−

S

y

n

a

p

s

e

b

5

b

4

b

3

b

2

b

1

b

0

i4 i3 i2 i1 i0

Io

u

t

+

Io

u

t

−

V

in

+

V

in

−

S

y

n

a

p

s

e

b

5

b

4

b

3

b

2

b

1

b

0

i4 i3 i2 i1 i0

Io

u

t

+

Io

u

t

−

V

in

+

V

in

−

S

y

n

a

p

s

e

b

5

b

4

b

3

b

2

b

1

b

0

V

in

+

V

in

−

i0 i1 i2 i3 i4 D

7

D D

6

C

L

K

V

b

ia

s

+

V

b

ia

s

−

b

5

b

4

b

3

b

2

b

1

b

0

V

in

+

V

in

−

i0 i1 i2 i3 i4 D

8

D D

7

C

L

K

b

5

b

4

b

3

b

2

b

1

b

0

V

in

+

V

in

−

i0 i1 i2 i3 i4

D D

8

V

c

m

V

g

a

in

V

o

f

f

s

e

t

V

o

f

f

s

e

t

V

g

a

in

V

c

m

V

o

f

f

s

e

t

V

g

a

in

V

c

m

V

c

m

V

g

a

in

V

o

f

f

s

e

t

V

o

u

t

+

V

o

u

t

−

Iin

+

Iin

−

N

e

u

r

o

n

V

c

m

V

g

a

in

V

o

f

f

s

e

t

V

o

u

t

+

V

o

u

t

−

Iin

+

Iin

−

N

e

u

r

o

n

V

c

m

V

g

a

in

V

o

f

f

s

e

t

V

o

u

t

+

V

o

u

t

−

Iin

+

Iin

−

N

e

u

r

o

n

T

r

a

n

s

.

A

m

p

V

+

V

− D

if

f

e

r

e

n

t

ia

l t

o

S

in

g

le

e

n

d

e

d

C

o

n

v

e

r

t

e

r

D

ig

it

a

l B

u

f

f

e

r

V

o

u

t

D

ig

O

u

t

Figure 5.9: 2 input, 2 hidden unit, 1 output neural network

59

Vb

Digital

Vin− Vin+

Out

Differential to Single Ended

Digital Buffer

Out

Converter

Figure 5.10: Diﬀerential to single ended converter and digital buﬀer

60

Initialize Weights;

Get Error;

while(Error > Error Goal);

Perturb Weights;

Get New Error;

if (New Error < Error),

Weights = New Weights;

Error = New Error;

else

Restore Old Weights;

end

end

Figure 5.11: Parallel Perturbative algorithm

5.5 Training Algorithm

The neural network is trained by using a parallel perturbative weight update rule[21].

The perturbative technique requires generating random weight increments to adjust

the weights during each iteration. These random perturbations are then applied to

all of the weights in parallel. In batch mode, all input training patterns are applied

and the error is accumulated. This error is then checked to see if it was higher

or lower than the unperturbed iteration. If the error is lower, the perturbations are

kept, otherwise they are discarded. This process repeats until a suﬃciently low error is

achieved. An outline of the algorithm is given in ﬁgure 5.11. Since the weight updates

are calculated oﬄine, other suitable algorithms may also be used. For example, it

is possible to apply an annealing schedule wherein large perturbations are initially

applied and gradually reduced as the network settles.

5.6 Test Results

A chip implementing the above circuits was fabricated in a 1.2µm CMOS process. All

synapse and neuron transistors were 3.6µm/3.6µm to keep the layout small. The unit

size current source transistors were also 3.6µm/3.6µm. An LSB current of 100nA was

61

chosen for the current source. The above neural network circuits were trained with

some simple digital functions such as 2 input AND and 2 input XOR. The results of

some training runs are shown in ﬁgures 5.12-5.13. As can be seen from the ﬁgures, the

network weights slowly converge to a correct solution. Since the training was done on

digital functions, a diﬀerential to single ended converter was placed on the output of

the ﬁnal neuron. This was simply a 5 transistor transconductance ampliﬁer. The error

voltages were calculated as a total sum voltage error over all input patterns at the

output of the transconductance ampliﬁer. Since V

dd

was 5V, the output would only

easily move to within about 0.5V from V

dd

because the transconductance ampliﬁer

had low gain. Thus, when the error gets to around 2V, it means that all of the outputs

are within about 0.5V from their respective rail and functionally correct. A double

inverter buﬀer can be placed at the ﬁnal output to obtain good digital signals. At the

beginning of each of the training runs, the error voltage starts around or over 10V

indicating that at least 2 of the input patterns give an incorrect output.

Figure 5.12 shows the results from a 2 input, 1 output network learning an AND

function. This network has only 2 synapses and 1 bias for a total of 3 weights. The

weight values can go from -31 to +31 because of the 6 bit D/A converters used on

the synapses.

Figure 5.13 shows the results of training a 2 input, 2 hidden unit, 1 output network

with the XOR function. The weights are initialized as small random numbers. The

weights slowly diverge and the error monotonically decreases until the function is

learned. As with gradient techniques, occasional training runs resulted in the network

getting stuck in a local minimum and the error would not go all the way down.

An ideal neural network with weights appropriately chosen to implement the XOR

function is shown in ﬁgure 5.14. The inputs for the network and neuron outputs are

chosen to be (-1,+1), as opposed to (0,1). This choice is made because the actual

synapses which implement the squashing function give maximum outputs of +/- I

out

.

The neurons are assumed to be hard thresholds. The function computed by each of

the neurons is given by

62

0 10 20 30 40 50 60 70 80

−20

−10

0

10

20

30

W

e

i

g

h

t

Weight and Error vs. Iteration for training AND network

0 10 20 30 40 50 60 70 80

0

2

4

6

8

10

12

14

iteration

E

r

r

o

r

V

o

l

t

a

g

e

Figure 5.12: Training of a 2:1 network with AND function

63

0 50 100 150 200 250 300 350 400 450 500

−40

−20

0

20

40

W

e

i

g

h

t

Weight and Error vs. Iteration for training XOR network

0 50 100 150 200 250 300 350 400 450 500

2

4

6

8

10

12

iteration

E

r

r

o

r

V

o

l

t

a

g

e

Figure 5.13: Training of a 2:2:1 network with XOR function starting with small

random weights

64

t=20 t=−20

t=−20

10 −20

20 20

20 20

A B

C

Output

X X

1 2

Figure 5.14: Network with “ideal” weights chosen for implementing XOR

65

X

1

X

2

A B C

-1 -1 −60 ⇒−1 −60 ⇒−1 −10 ⇒−1

-1 +1 +20 ⇒+1 −20 ⇒−1 +10 ⇒+1

+1 -1 +20 ⇒+1 −20 ⇒−1 +10 ⇒+1

+1 +1 +60 ⇒+1 +20 ⇒+1 −30 ⇒−1

Table 5.1: Computations for network with “ideal” weights chosen for implementing

XOR

Out =

+1, if ((

¸

i

W

i

X

i

) +t) ≥ 0

−1, if ((

¸

i

W

i

X

i

) +t) < 0

Table 5.1 shows the inputs, intermediate computations, and outputs of the ideal-

ized network. It is clear that the XOR function is properly implemented. The weights

were chosen to be within the range of possible weight values, -31 to +31, of the actual

network. This ideal network would perform perfectly well with smaller weights. For

example, all of the input weights could be set to 1 as opposed to 20, and the A and

B thresholds would then be set to 1 and -1 respectively, without any change in the

network function. However, weights of large absolute value within the possible range

were chosen (20 as opposed to 1), because small weights would be within the linear

region of the synapses. Also, small weights such as +/- 1 might tend to get ﬂipped

due to circuit oﬀsets. A SPICE simulation was done on the actual circuit using these

ideal weights and the outputs were seen to be the correct XOR outputs.

Figure 5.15 shows the 2:2:1 network trained with XOR, but with the initial weights

chosen as the mathematically correct weights for the ideal synapses and neurons.

Although the ideal weights should, both in theory and based on simulations, start oﬀ

with correct outputs, the oﬀsets and mismatches of the circuit cause the outputs to be

incorrect. However, since the weights start near where they should be, the error goes

down rapidly to the correct solution. This is an example of how a more complicated

network could be trained on computer ﬁrst to obtain good initial weights and then

the training could be completed with the chip in the loop. Also, for more complicated

66

0 50 100 150 200 250 300 350 400 450 500

−40

−20

0

20

40

W

e

i

g

h

t

Weight and Error vs. Iteration for training XOR network

0 50 100 150 200 250 300 350 400 450 500

2

4

6

8

10

12

iteration

E

r

r

o

r

V

o

l

t

a

g

e

Figure 5.15: Training of a 2:2:1 network with XOR function starting with “ideal”

weights

67

networks, using a more sophisticated model of the synapses and neurons that more

closely approximates the actual circuit implementation would be advantageous for

computer pretraining.

5.7 Conclusions

A VLSI implementation of a neural network has been demonstrated. Digital weights

are used to provide stable weight storage. Analog multipliers are used because full

digital multipliers would occupy considerable space for large networks. Although the

functions learned were digital, the network is able to accept analog inputs and provide

analog outputs for learning other functions. A parallel perturbation technique was

used to train the network successfully on the 2-input AND and XOR functions.

68

Chapter 6 A Parallel Perturbative VLSI

Neural Network

A fully parallel perturbative algorithm cannot truly be realized with a serial weight

bus, because the act of changing the weights is performed by a serial operation. Thus,

it is desirable to add circuitry to allow for parallel weight updates.

First, a method for applying random perturbation is necessary. The randomness

is necessary because it deﬁnes the direction of search for ﬁnding the gradient. Since

the gradient is not actually calculated, but observed, it is necessary to search for

the downward gradient. It is possible to use a technique which does a nonrandom

search. However, since no prior information about the error surface is known, in the

worst case a nonrandom technique would spend much more time investigating upward

gradients which the network would not follow.

A conceptually simple technique of generating random perturbations would be

to amplify the thermal noise of a diode or resistor (ﬁgure 6.1). Unfortunately, the

extremely large value of gain required for the ampliﬁer makes the ampliﬁer susceptible

to crosstalk. Any noise generated from neighboring circuits would also get ampliﬁed.

Since some of this noise may come from clocked digital sections, the noise would

become very regular, and would likely lead to oscillations rather than the uncorrelated

noise sources that are desired.

Such a scheme was attempted with separate thermal noise generators for each

neuron[33]. The gain required for the ampliﬁer was nearly 1 million and highly

correlated oscillations of a few MHz were observed among all the noise generators.

Therefore, another technique is required.

Instead, the random weight increments can be generated digitally with linear

feedback shift registers that produce a long pseudorandom sequence. These random

bits are used as inputs to a counter that stores and updates the weights. The counter

69

A

V

noise

Figure 6.1: Ampliﬁed thermal noise of a resistor

1 2 3 n m ... ...

out

Figure 6.2: 2-tap, m-bit linear feedback shift register

outputs go directly to the D/A converter inputs of the synapses. If the weight updates

led to a reduction in error, the update is kept. Otherwise, an inverter block is activated

which inverts the counter inputs coming from the linear feedback shift registers. This

has the eﬀect of restoring the original weights. A block diagram of the full neural

network circuit function is provided in ﬁgure 6.8.

6.1 Linear Feedback Shift Registers

The use of linear feedback shift registers for generating pseudorandom sequences has

been known for some time[31]. A simple 2 tap linear feedback shift register is shown

in ﬁgure 6.2. A shift register with m bits is used. In the ﬁgure shown, an XOR is

performed from bit locations m and n and its output is fed back to the input of the

shift register. In general, however, the XOR can be replaced with a multiple input

parity function from any number of input taps. If one of the feedback taps is not

70

1 0 0 0

0 1 0 0

0 0 1 0

1 0 0 1

1 1 0 0

0 1 1 0

1 0 1 1

0 1 0 1

1 0 1 0

1 1 0 1

1 1 1 0

1 1 1 1

0 1 1 1

0 0 1 1

0 0 0 1

1 0 0 0

Table 6.1: LFSR with m=4, n=3 yielding 1 maximal length sequence

1 0 0 0 1 1 0 0 0 1 1 0

0 1 0 0 1 1 1 0 1 0 1 1

1 0 1 0 1 1 1 1 1 1 0 1

0 1 0 1 0 1 1 1 0 1 1 0

0 0 1 0 0 0 1 1

0 0 0 1 1 0 0 1

1 0 0 0 1 1 0 0

Table 6.2: LFSR with m=4, n=2, yielding 3 possible subsequences

taken from the ﬁnal bit, m, but instead the largest feedback tap number is from some

earlier bit, l, then this is equivalent to an l-bit linear feedback shift register and the

output is merely delayed by the extra m−l shift register bits.

A maximal length sequence from such a linear feedback shift register would be of

length 2

M

−1 and includes all 2

M

possible patterns of 0’s and 1’s with the exception

of the all 0’s sequence. The all 0’s sequence is considered a dead state, because

XOR(0, 0) = 0 and the sequence never changes. It is important to note that not all

feedback tap selections will lead to maximal length sequences.

Table 6.1 shows an example of a maximal length sequence by choosing feedback

taps of positions 4 and 3 from a 4 bit LFSR. It is clear that as long as the shift register

71

does not begin in the all 0’s state, it will progress through all 2

M

−1 = 15 sequences

and then repeat. Table 6.2 shows an example where non-maximal length sequences

are possible by utilizing feedback taps 4 and 2 from a 4 bit LFSR. In this situation it

is possible to enter one of three possible subsequences. The initial state will determine

which subsequence is entered. There is no overlap of the subsequences so it is not

possible to transition from one to another. In this situation, the subsequence lengths

are of size 6, 6, and 3. The sum of the sizes of the subsequences equal the size of a

maximal length sequence.

It is desirable to use the maximal length sequences because they allow the longest

possible pseudorandom sequence for the given complexity of the LFSR. Thus, it is

important to choose the correct feedback taps. Tables have been tabulated which

describe the lengths of possible sequences and subsequences based on LFSR lengths

and feedback tap locations[31].

LFSRs of certain lengths cannot achieve maximal length sequences with only 2

feedback tap locations. For example, for m = 8, it is required to have an XOR(4,5,6)

to get a length 255 sequence, and, for m = 16, it is required to have XOR(4,13,15)

to obtain the length 65536 sequence[32]. Table 6.3 provides a list of LFSRs that can

obtain maximal length sequences with only 2 feedback taps[32].

It is clear that it is possible to make very long pseudorandom sequences with

moderately sized shift registers and a single XOR gate.

In applications where analog noise is desired, it is possible to low pass ﬁlter the

output of the digital pseudorandom noise sequence. This is done by using a ﬁlter

corner frequency which is much less than the clock frequency of the shift register.

6.2 Multiple Pseudorandom Noise Generators

From the previous discussion it is clear that linear feedback shift registers are a

useful technique for generating pseudorandom noise. However, a parallel perturbative

neural network requires as many uncorrelated noise sources as there are weights.

Unfortunately, an LFSR only provides one such noise source. It is not possible to use

72

m n Length

3 2 7

4 3 15

5 3 31

6 5 63

7 6 127

9 5 511

10 7 1023

11 9 2047

15 14 32767

17 14 131071

18 11 262143

20 17 1048575

21 19 2097151

22 21 4194303

23 18 8388607

25 22 33554431

28 25 268435455

29 27 536870911

31 28 2147483647

33 20 8589934591

35 33 34359738367

36 25 68719476735

39 35 549755813887

Figure 6.3: Maximal length LFSRs with only 2 feedback taps

73

the diﬀerent taps of a single LFSR as separate noise sources because these taps are

merely the same noise pattern oﬀset in time and thus highly correlated. One solution

would be to use one LFSR with diﬀerent feedback taps and/or initial states for every

noise source required. For large networks with long training times, this approach

becomes prohibitively expensive in terms of area and possibly power required to

implement the scheme. Because of this, several approaches have been taken to resolve

the problem, many of which utilize cellular automata.

In one approach[36], they build from a standard LFSR with the addition of an

XOR network with inputs coming from the taps of an LFSR and with outputs pro-

viding the multiple noise sources. First, the analysis begins with an LFSR consisting

of shift register cells numbering a

0

to a

N−1

, with the cell marked a

0

receiving the

feedback. Each of the N units in the LFSR except for the a

0

unit is given by

a

t+1

i

= a

t

i

The ﬁrst unit is described by

a

t

0

=

N−1

i=0

c

i

a

t

i

where ⊕ represents the modulo 2 summation and the c

i

are the boolean feedback

coeﬃcients where a coeﬃcient of 1 represents a feedback tap at that cell.

The outputs are given by

g

t

=

N−1

i=0

d

i

a

t

i

A detailed discussion is given as to the appropriate choice of the c

i

and d

i

coeﬃ-

cients to ensure good noise properties.

In another approach a standard boolean cellular automaton is used[37][38]. The

boolean cellular automaton is comprised of N unit cells, a

i

, wherein their state at

time t + 1 depends only on the states of all of the unit cells at time t.

74

a

t+1

i

= f

a

t

0

, a

t

1

, ..., a

t

N−1

**An analysis was performed to choose a cellular automaton with good independence
**

between output bits. The following transition rule was chosen:

a

t+1

i

= (a

t

i+1

+a

n

i

) ⊕a

t

i−1

⊕a

t

i−1

Furthermore, every 4th bit is chosen as an output to ensure independent, unbiased,

bits.

6.3 Multiple Pseudorandom Bit Stream Circuit

Another simpliﬁed approach utilizes two counterpropagating LFSRs with an XOR

network to combine outputs from diﬀerent taps to obtain uncorrelated noise[39]. For

example, ﬁgure 6.4 shows two LFSRs with m = 7, n = 6 and m = 6, n = 5. Since

the two LFSRs are counterpropagating, the length of the output sequences obtained

from the XOR network are (2

6

− 1)(2

7

− 1) = 8001 bits[31]. In the ﬁgure, the bits

are combined to provide 9 pseudorandom bit sequences. It is possible to obtain more

channels and larger sequence lengths with the use of larger LFSRs. This was the

scheme that was ultimately implemented in hardware.

6.4 Up/down Counter Weight Storage Elements

The weights in the network are represented directly as the bits of an up/down digital

counter. The output bits of the counter feed directly into the digital input word

weight bits of the synapse circuit. Updating the weights becomes a simple matter

of incrementing or decrementing the counter to the desired value. The counter used

is a standard synchronous up/down counter using full adders and registers (ﬁgure

6.5)[34].

75

r

0

r

1

r

2

r

3

r

4

r

5

r

6

r

7

0

1

2

3

4

5

6

7

8

9

1

0

1

1

1

2

1

3

1

4

1

5

1

6

1

7

1

8

1

9

2

0

2

1

2

2

2

3

2

4

D

2

5

b

i

t

s

h

i

f

t

r

e

g

i

s

t

e

r

r

e

s

e

t

0

1

2

3

4

5

6

7

8

9

1

0

1

1

1

2

1

3

1

4

1

5

1

6

1

7

1

8

1

9

2

0

2

1

2

2

2

3

2

4

D f

1

2

5

b

i

t

s

h

i

f

t

r

e

g

i

s

t

e

r

r

e

s

e

t

r

e

s

e

t

r

8

f

2

f

1

f

2

f

1

f

2

f

1

f

2

f

1

f

2

f

1

f

2

Figure 6.4: Dual counterpropagating LFSR multiple random bit generator

76

A

B

Cin

Cout

S

A

B

Cin

Cout

S

b0

b1

down/up

A

B

Cin

Cout

S

D Q

D Q

D Q

b2

A

B

Cin

Cout

S

D Q

A

B

Cin

Cout

S

D Q

A

B

Cin

Cout

S

D Q

overflow

b3

b4

b5

f1

f1

f2

f2

f1

f1

f2

f2

f1

f1

f2

f2

f1

f1

f2

f2

f1

f1

f2

f2

f1

f1

f2

f2

f1

f1

f2

f2

Figure 6.5: Synchronous up/down counter

77

A

B

C

C

out

SUM

Vdd

Figure 6.6: Full adder for up/down counter

78

W j

2

j

2

Q

W

D

j

1

j

1

Figure 6.7: Counter static register

The full adder is shown in ﬁgure 6.6[34]. Since every weight requires 1 counter,

and every counter requires 1 full adder for each bit, the adders take up a considerable

amount of room. Thus, the adder is optimized for space instead of for speed. Most

of the adder transistors are made close to minimum size.

The register cell (ﬁgure 6.7) used in the counter is simply a 1 bit version of the

serial weight bus shown in ﬁgure 5.7. This register cell is both small and static, since

the output must remain valid for standard feedforward operation after learning is

complete.

6.5 Parallel Perturbative Feedforward Network

Figure 6.8 shows a block diagram of the parallel perturbative neural network circuit

operation. The synapse, binary weighted encoder and neuron circuits are the same as

those used for the serial weight bus neural network. However, instead of the synapses

interfacing with a serial weight bus, counters with their respective registers are used

to store and update the weights.

6.6 Inverter Block

The counter up/down inputs originate in the pseudorandom bit generator and pass

through an inverter block (ﬁgure 6.9). The inverter block is essentially composed of

pass gates and inverters. If the invert signal is low, the bit passes unchanged. If the

invert bit is high, then the inversion of the pseudorandom bit gets passed. Control of

the inverter block is what allows weight updates to either be kept or discarded.

79

Pseudorandom bitgenerator

Inverterblock

Counter

up/down

b0..b5

Toother

counters

Synapse

input

otherlayers

Neuron

fromothersynapses

Binaryweightedencoder

ErrorComparison

Neuron

output

Figure 6.8: Block diagram of parallel perturbative neural network circuit

80

b

0

in

b

0

in

v

e

r

t

b

1

in

b

2

in

b

3

in

b

4

in

b

5

in

b

6

in

b

7

in

b

8

in

b

1

b

2

b

3

b

4

b

5

b

6

b

7

b

8

Figure 6.9: Inverter block

81

6.7 Training Algorithm

The algorithm implemented by the network is a parallel perturbative method[21][25].

The basic idea of the algorithm is to perform a modiﬁed gradient descent search of

the error surface without calculating derivatives or using explicit knowledge of the

functional form of the neurons. The way that this is done is by applying a set of

small perturbations on the weights and measuring the eﬀect on the network error.

In a typical perturbative algorithm, after perturbations are applied to the weights, a

weight update rule of the following form is used:

∆

− →

w = −η

E

− →

w +

−−→

pert

−E (

− →

w)

pert

where

− →

w is a weight vector of all the weights in the network,

−−→

pert is a perturbation

vector applied to the weights which is normally of ﬁxed size, pert, but with random

signs, and η is a small parameter used to set the learning rate. Thus, the weight

update rule can be seen as based on an approximate derivative of the error functions

with respect to the weights.

∂E

∂

− →

w

≈

∆E

∆

− →

w

=

E

− →

w +

−−→

pert

−E (

− →

w)

(

− →

w + pert) −

− →

w

=

E

− →

w +

−−→

pert

−E (

− →

w)

pert

However, this type of weight update rule requires weight changes which are pro-

portional to the diﬀerence in error terms. A simpler, but functionally equivalent,

weight update rule can be achieved as follows. First, the network error, E (

− →

w), is

measured with the current weight vector,

− →

w. Next, a perturbation vector,

−−→

pert, of

ﬁxed magnitude, but random sign is applied to the weight vector yielding a new weight

vector,

− →

w +

−−→

pert. Afterwards, the eﬀect on the error, ∆E = E

− →

w +

−−→

pert

−E (

− →

w),

is measured. If the error decreases then the perturbations are kept and the next

iteration is performed. If the error increases, the original weight vector is restored.

Thus, the weight update rule is of the following form:

82

− →

w

t+1

=

− →

w

t

+

−−→

pert

t

, if E

− →

w

t

+

−−→

pert

t

< E (

− →

w

t

)

− →

w

t

, if E

− →

w

t

+

−−→

pert

t

> E (

− →

w

t

)

The use of this rule may require more iterations since it does not perform an

actual weight change every iteration and since the weight updates are not scaled

with the resulting changes in error. Nevertheless, it signiﬁcantly simpliﬁes the weight

update circuitry. For either update rule, some means must be available to apply

the weight perturbations; however, this rule does not require additional circuitry to

change the weight values proportionately with the error diﬀerence, and, instead, relies

on the same circuitry for the weight update as for applying the random perturbations.

Some extra circuitry is required to remove the perturbations when the error doesn’t

decrease, but this merely involves inverting the signs of all of the perturbations and

reapplying in order to cancel out the initial perturbation.

A pseudocode version of the algorithm is presented in ﬁgure 6.10.

6.8 Error Comparison

The error comparison section is responsible for calculating the error of the current

iteration and interfacing with a control section to implement the algorithm. Both

sections could be performed oﬀ-chip by using a computer for chip-in-loop training, as

was chosen for the current implementation. This allows ﬂexibility in performing the

global functions necessary for implementing the training algorithm, while the local

functions are performed on-chip. However, the control section could be implemented

on chip as a standard digital section such as a ﬁnite state machine. Also, there are

several alternatives for implementing the error comparison on chip. First, the error

comparison could simply consist of analog to digital converters which would then pass

the digital information to the control block. Another approach would be to have the

error comparison section compare the analog errors directly and then output digital

control signals.

83

Initialize Weights;

Get Error;

while(Error > Error Goal);

Perturb Weights;

Get New Error;

if (New Error < Error),

Error = New Error;

else

Unperturb Weights;

end

end

function Perturb Weights;

Set Invert Bit Low;

Clock Pseudorandom bit generator;

Clock counters;

function Unperturb Weights;

Set Invert Bit High;

Clock counters;

Figure 6.10: Pseudocode for parallel perturbative learning network

E

in+

E

in-

E

p+

E

p-

E

o-

E

o+

StoreE

o

StoreE

p

CompareE

C C

E

out

in+

in-

V

ref

Figure 6.11: Error comparison section

84

E

in+

E

in-

E

p+

E

p-

E

o-

E

o+

C

E

out

in+

in-

V

ref

a)StoreE high

o

b)StoreE high

p

c)CompareEhigh

C

E

in+

E

in-

E

p+

E

p-

E

o-

E

o+

C C

E

p+

E

p-

E

o-

E

o+

C C

Figure 6.12: Error comparison operation

85

A sample error comparison section is shown in ﬁgure 6.11. The section uses digital

control signals to pass error voltages onto capacitor terminals and then combines the

capacitors in such a way to subtract the error voltages. This error diﬀerence is then

compared with a reference voltage to determine if the error decreased or increased.

The output consists of a digital signal which shows the appropriate error change.

The error input terminals, E

in+

and E

in−

, are connected, in sequence, to two stor-

age capacitors which are then compared. StoreE

o

is asserted when the unperturbed,

initial error is stored. StoreE

p

is asserted in order to store the perturbed error after

the weight perturbations have been applied. CompareE is asserted to compute the

error diﬀerence, and E

out

shows the sign of the diﬀerence computation.

Figure 6.12 shows the operation of the error comparison section in greater detail.

In ﬁgure 6.12a, StoreE

o

is high and the other signals are low. The capacitor where

E

p

is stored is left ﬂoating, while the inputs E

in+

and E

in−

are connected to E

o+

and E

o−

respectively. This stores the unperturbed error voltages on the capacitor.

Next, in ﬁgure 6.12b, StoreE

p

is held high while the other signals are low. In the

same manner as before, the inputs E

in+

and E

in−

are connected to E

p+

and E

p−

respectively. This stores the perturbed error voltages on the capacitor. Note that the

polarities of the two storage capacitors are drawn reversed. This will facilitate the

subtraction operation in the ﬁnal step. In ﬁgure 6.12c, CompareE is asserted and

the other input signals are low. The two storage capacitors are connected together

and their charges combine. The total charge, Q

T

, across the parallel capacitors is

given as follows:

Q

T

= Q

+

−Q

−

= C (E

p+

+E

o−

) −C (E

p−

+E

o+

) = C ((E

p+

−E

p−

) −(E

o+

−E

o−

))

If both capacitors are of size C, the parallel capacitance will be of size 2C. The

voltage across the parallel capacitors can be obtained from Q

T

= 2CV

T

.

Thus,

V

T

=

Q

T

2C

=

1

2

((E

p+

−E

p−

) −(E

o+

−E

o−

)) =

1

2

(E

p

−E

o

)

86

Since these voltages are ﬂoating and must be referenced to some other voltage,

the combined capacitor is placed onto a voltage, V

ref

, which is usually chosen to be

approximately half the supply voltage to allow for the greatest dynamic range.

Also, during this step, this error diﬀerence feeds into a high gain ampliﬁer to

perform the comparison. The high gain ampliﬁer is shown as a diﬀerential transcon-

ductance ampliﬁer which feeds into a double inverter buﬀer to output a strong digital

signal. The ﬁnal output, E

out

, is given as follows:

E

out

=

1, if E

p

> E

o

0, if E

p

< E

o

where the 1 represents a digital high signal, and the 0 represents a digital low

signal.

This output can then be sent to a control section which will use it to determine

the next step in the algorithm.

The error inputs, which are computed as a difference between the actual output and a target output, can be obtained with a similar analog switched capacitor technique. However, it might also be desirable to insert an absolute value computation stage. Simple circuits for obtaining voltage absolute values can be found elsewhere [30]. Furthermore, this sample error computation stage can easily be expanded to accommodate a multiple output neural network.

6.9 Test Results

In this section several sample training run results from the chip are shown. A chip

implementing the parallel perturbative neural network was fabricated in a 1.2µm

CMOS process. All synapse and neuron transistors were 3.6µm/3.6µm to keep the

layout small. The unit size current source transistors were also 3.6µm/3.6µm. An

LSB current of 100nA was chosen for the current source. The above neural network

circuits were trained with some simple digital functions such as 2 input AND and 2

input XOR. The results of some training runs are shown in figures 6.13-6.15. As can be seen from the figures, the network weights slowly converge to a correct solution. The

error voltages were taken directly from the voltage output of the output neuron. The

error voltages were calculated as a total sum voltage error over all input patterns. The

actual output voltage error is arbitrary and depends on the circuit parameters. What

is important is that the error goes down. A differential to single-ended converter and a double inverter buffer can be placed at the final output to obtain good digital

signals. Also shown is a digital error signal that shows the number of input patterns

where the network gives the incorrect answer for the output. The network analog

output voltage for each pattern is also displayed.

The actual weight values were not made accessible and, thus, are not shown;

however, another implementation might also add a serial weight bus to read the

weights and to set initial weight values. This would also be useful when pretraining

with a computer to initialize the weights in a good location.

Figure 6.13 shows the results from a 2 input, 1 output network learning an AND

function. This network has only 2 synapses and 1 bias for a total of 3 weights. The

network starts with 2 incorrect outputs which is to be expected with initial random

weights. Since the AND function is a very simple function to learn, after relatively

few iterations, all of the outputs are digitally correct. However, the network continues

to train and moves the weight vector in order to better match the training outputs.

Figures 6.14 and 6.15 show the results of training a 2 input, 2 hidden unit, 1

output network with the XOR function. In both ﬁgures, although the error voltage is

monotonically decreasing, the digital error occasionally increases. This is because the

network weights occasionally transition through a region of reduced analog output

error that is used for training, but which actually increases the digital error. This

seems to be occasionally necessary for the network to ultimately reach a reduced

digital error. Both ﬁgures also show that the function is essentially learned after only

several hundred iterations. In ﬁgure 6.15, the network is allowed to continue learning.

After being stuck in a local minimum for quite some time, the network ﬁnally ﬁnds

another location where it is able to go slightly closer towards the minimum.

Figure 6.13: Training of a 2:1 network with the AND function (panels: Error (Volts), Digital Error, and Analog Outputs (Volts) vs. iteration)

Figure 6.14: Training of a 2:2:1 network with the XOR function (panels: Error (Volts), Digital Error, and Analog Outputs (Volts) vs. iteration)

Figure 6.15: Training of a 2:2:1 network with the XOR function with more iterations (panels: Error (Volts), Digital Error, and Analog Outputs (Volts) vs. iteration)

Some of the analog output values occasionally show a large jump from one iteration to another. This occurs when a weight value which is at maximum magnitude

overflows and resets to zero. The weights are merely stored in counters, and no special circuitry was added to deal with these overflow conditions. A simple logic block could ensure that a weight at maximum magnitude does not overflow and reset to zero when incremented, but this circuitry would need to be added to every weight counter and would be an unnecessary increase in size. In practice, these overflow conditions should rarely be a problem. Since the algorithm only accepts weight changes that decrease the error, if an overflow and reset on a weight is undesirable, the weight reset will simply be discarded. In fact, the weight reset may occasionally be useful for breaking out of local minima, where a weight value has been pushed up against an edge that leads to a local minimum, but a sign flip or significant weight reduction is necessary to reach the global minimum. In this type of situation, the network will be unwilling to increase the error as necessary to obtain the global minimum.
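The omitted guard logic amounts to making the counters saturate rather than wrap. A hedged Python sketch contrasting the two behaviors (function names are illustrative; the fabricated counters correspond to the wrapping case):

```python
# Contrast of the fabricated wrap-around counters with the saturating guard
# logic described above. Function names are illustrative.

def wrapping_increment(count, bits=6):
    # Current behavior: a weight at maximum magnitude overflows to zero,
    # producing the large jumps seen in the analog outputs.
    return (count + 1) % (1 << bits)

def saturating_increment(count, bits=6):
    # The "simple logic block" alternative: hold at the maximum count.
    # This guard would have to be replicated at every weight counter.
    maximum = (1 << bits) - 1
    return count if count == maximum else count + 1

print(wrapping_increment(63), saturating_increment(63))  # → 0 63
```

For counts below the maximum the two behave identically; they differ only at the saturation boundary, which is exactly where the accept/reject rule already filters out harmful resets.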

6.10 Conclusions

A VLSI neural network implementing a parallel perturbative algorithm has been

demonstrated. Digital weights are used to provide stable weight storage. They are

stored and updated via up/down counters and registers. Analog multipliers are used

to decrease the space that a full parallel array of digital multipliers would occupy in

a large network. The network was successfully trained on the 2-input AND and XOR

functions. Although the functions learned were digital, the network is able to accept

analog inputs and provide analog outputs for learning other functions.

The chip was trained using a computer in a chip-in-loop fashion. All of the

local weight update computations and random noise perturbations were generated on

chip. The computer performed the global operations including applying the inputs,

measuring the outputs, calculating the error function and applying algorithm control

signals. This chip-in-loop conﬁguration is the most likely to be used if the chip were

used as a fast, low power component in a larger digital system. For example, in a


handheld PDA, it may be desirable to use a low power analog neural network to assist

in the handwriting recognition task with the microprocessor performing the global

functions.


Chapter 7 Conclusions

7.1 Contributions

7.1.1 Dynamic Subthreshold MOS Translinear Circuits

The main contribution of this section involved the transformation of a set of ﬂoating

gate circuits useful for analog computation into an equivalent set of circuits using

dynamic techniques. This makes the circuits more robust, and easier to work with

since they do not require using ultraviolet illumination to initialize the circuits. Furthermore, the simplicity with which these circuits can be combined and cascaded to

implement more complicated functions was demonstrated.

7.1.2 VLSI Neural Networks

The design of the neuron circuit, although similar to another neuron design [22], is new. The main distinction is a change in the current-to-voltage conversion element used. The previous element was a diode-connected transistor, which proved to have too high a gain; in particular, learning XOR was difficult with such high neuron gain. Thus, a diode-connected cascoded transistor with linearized gain and a gain control stage was used instead.

The parallel perturbative training algorithm with simpliﬁed weight update rule

was used in the serial weight bus based neural network. Although many diﬀerent

algorithms could have been used, such as serial weight perturbation, this was done

with an eye toward the eventual fully parallel perturbative implementation. Excellent training results were shown for training of a 2 input, 1

output network on the AND function and for training a 2 input, 2 hidden unit, 1

output network on the XOR function. Furthermore, an XOR network was trained

which was initialized with “ideal” weights. This was done in order to show the eﬀects


of circuit oﬀsets in making the outputs initially incorrect, but also the ability to

signiﬁcantly decrease the training time required by using good initial weights.

The overall system design and implementation of the fully parallel perturbative

neural network with local computation of the simpliﬁed weight update rule was also

demonstrated. Although many of the subcircuits and techniques were borrowed from

previous designs, the overall system architecture is novel. Excellent training results

were again shown for a 2 input, 1 output network trained on the AND function and

for a 2 input, 2 hidden unit, 1 output network trained on the XOR function.

Thus, the ability of a neural network using analog synapses and neurons with 6 bit digital weights to learn has been successfully demonstrated. Six bits was chosen for the digital weights to demonstrate that learning is possible with limited bit precision. The circuits can easily be extended to use 8 bit weights. Using more than 8 bits may not be desirable, since the analog circuitry itself may not have sufficient precision to take advantage of the extra bits per weight. Significant strides can be taken to improve the matching characteristics of the analog circuits, but then the inherent benefits of using a compact, parallel, analog implementation may be lost.

The size of the neuron cell in dimensionless units was 300λ x 96λ, the synapse was

80λ x 150λ, and the weight counter/register was 340λ x 380λ. In the 1.2µm process

used to make the test chips, λ was equal to 0.6µm. In a modern process, such as a

0.35µm process, it would be possible to make a network with over 100 neurons and

over 10,000 weights in a 1cm x 1cm chip.
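This density estimate can be checked with rough arithmetic. A sketch, assuming λ equals half the minimum drawn feature (so λ = 0.175 µm in a 0.35 µm process), one counter/register per synapse, and no allowance for wiring or routing overhead:

```python
# Cell areas in lambda^2, from the layout dimensions quoted above.
neuron_area = 300 * 96            # neuron cell
synapse_area = 80 * 150           # synapse cell
counter_area = 340 * 380          # weight counter/register
weight_cell = synapse_area + counter_area   # one synapse plus its storage

lam_um = 0.35 / 2                 # assumed lambda in a 0.35 um process
chip_area_um2 = 1e4 * 1e4         # 1 cm x 1 cm expressed in um^2

weights_per_chip = chip_area_um2 / (weight_cell * lam_um ** 2)
neuron_fraction = 100 * neuron_area * lam_um ** 2 / chip_area_um2

print(int(weights_per_chip))      # roughly 23,000: comfortably over 10,000
print(neuron_fraction < 0.001)    # True: 100 neurons need under 0.1% of the die
```

Even with generous overhead for routing and peripheral circuitry, the estimate leaves a wide margin above the figures quoted in the text.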

7.2 Future Directions

The serial weight bus neural network and the parallel perturbative neural network can be combined. Adding a serial weight bus to the parallel perturbative network would make it possible to initialize the network with weights obtained from computer pretraining. In a typical design cycle, significant computer simulation would

be carried out prior to circuit fabrication. Part of this simulation may involve detailed


training runs of the network using either idealized models or full mixed-signal training

simulation with the actual circuit. With the serial weight bus, it would then be possible to initialize the network with the results of this simulation. Additionally,

for certain applications such as handwriting recognition, it may be desirable to train

one chip with a standard set of inputs, then read out those weights and copy them

to other chips. The next steps would also involve building larger networks for the

purposes of performing speciﬁc tasks.


Bibliography

[1] B. Gilbert, “Translinear Circuits: A Proposed Classification”, Electronics Letters, vol. 11, no. 1, pp. 14-16, 1975.
[2] B. Gilbert, “Current-Mode Circuits From A Translinear Viewpoint: A Tutorial”, Analogue IC Design: The Current-Mode Approach, C. Toumazou, F. J. Lidgey, and D. G. Haigh, Eds., London: Peter Peregrinus Ltd., pp. 11-91, 1990.
[3] B. Gilbert, “A Monolithic Microsystem for Analog Synthesis of Trigonometric Functions and Their Inverses”, IEEE Journal of Solid-State Circuits, vol. SC-17, no. 6, pp. 1179-1191, 1982.
[4] A. G. Andreou and K. A. Boahen, “Translinear Circuits in Subthreshold MOS”, Analog Integrated Circuits and Signal Processing, vol. 9, pp. 141-166, 1996.
[5] T. Serrano-Gotarredona, B. Linares-Barranco, and A. G. Andreou, “A General Translinear Principle for Subthreshold MOS Transistors”, IEEE Transactions on Circuits and Systems - I: Fundamental Theory and Applications, vol. 46, no. 5, May 1999.
[6] K. Bult and H. Wallinga, “A Class of Analog CMOS Circuits Based on the Square-Law Characteristic of an MOS Transistor in Saturation”, IEEE Journal of Solid-State Circuits, vol. SC-22, no. 3, June 1987.
[7] E. Seevinck and R. J. Wiegerink, “Generalized Translinear Circuit Principle”, IEEE Journal of Solid-State Circuits, vol. 26, no. 8, Aug. 1991.
[8] P. R. Gray and R. G. Meyer, Analysis and Design of Analog Integrated Circuits, 3rd Ed., New York: John Wiley and Sons, 1993.
[9] R. J. Wiegerink, “Computer Aided Analysis and Design of MOS Translinear Circuits Operating in Strong Inversion”, Analog Integrated Circuits and Signal Processing, vol. 9, pp. 181-187, 1996.
[10] W. Gai, H. Chen, and E. Seevinck, “Quadratic-Translinear CMOS Multiplier-Divider Circuit”, Electronics Letters, vol. 33, no. 10, May 1997.
[11] S. I. Liu and C. C. Chang, “A CMOS Square-Law Vector Summation Circuit”, IEEE Transactions on Circuits and Systems - II: Analog and Digital Signal Processing, vol. 43, no. 7, July 1996.
[12] C. Dualibe, M. Verleysen, and P. Jespers, “Two-Quadrant CMOS Analogue Divider”, Electronics Letters, vol. 34, no. 12, June 1998.
[13] X. Arreguit, E. A. Vittoz, and M. Merz, “Precision Compressor Gain Controller in CMOS Technology”, IEEE Journal of Solid-State Circuits, vol. SC-22, no. 3, pp. 442-445, 1987.
[14] E. A. Vittoz, “Analog VLSI Signal Processing: Why, Where, and How?”, Analog Integrated Circuits and Signal Processing, vol. 6, no. 1, pp. 27-44, July 1994.
[15] B. A. Minch, C. Diorio, P. Hasler, and C. A. Mead, “Translinear Circuits Using Subthreshold Floating-Gate MOS Transistors”, Analog Integrated Circuits and Signal Processing, vol. 9, no. 2, pp. 167-179, 1996.
[16] B. A. Minch, “Analysis, Synthesis, and Implementation of Networks of Multiple-Input Translinear Elements”, Ph.D. Thesis, California Institute of Technology, 1997.
[17] K. W. Yang and A. G. Andreou, “A Multiple-Input Differential-Amplifier Based on Charge Sharing on a Floating-Gate MOSFET”, Analog Integrated Circuits and Signal Processing, vol. 6, no. 3, pp. 167-179, 1994.
[18] E. Vittoz, “Dynamic Analog Techniques”, Design of Analog-Digital VLSI Circuits for Telecommunications and Signal Processing, J. Franca and Y. Tsividis, Eds., Englewood Cliffs, New Jersey: Prentice Hall, 1994.
[19] S. Aritome, R. Shirota, G. Hemink, T. Endoh, and F. Masuoka, “Reliability Issues of Flash Memory Cells”, Proceedings of the IEEE, vol. 81, no. 5, pp. 776-788, May 1993.
[20] R. W. Keyes, The Physics of VLSI Systems, New York: Addison-Wesley, 1987.
[21] J. Alspector, R. Meir, B. Yuhas, A. Jayakumar, and D. Lippe, “A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks”, Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufman Publishers, vol. 5, pp. 836-844, 1993.
[22] R. Coggins, M. Jabri, B. Flower, and S. Pickard, “A Hybrid Analog and Digital VLSI Neural Network for Intracardiac Morphology Classification”, IEEE Journal of Solid-State Circuits, vol. 30, no. 5, pp. 542-550, May 1995.
[23] M. Jabri and B. Flower, “Weight Perturbation: An Optimal Architecture and Learning Technique for Analog VLSI Feedforward and Recurrent Multilayer Networks”, IEEE Transactions on Neural Networks, vol. 3, no. 1, pp. 154-157, Jan. 1992.
[24] B. Flower and M. Jabri, “Summed Weight Neuron Perturbation: An O(N) Improvement over Weight Perturbation”, Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufman Publishers, vol. 5, pp. 212-219, 1993.
[25] G. Cauwenberghs, “A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization”, Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufman Publishers, vol. 5, pp. 244-251, 1993.
[26] P. W. Hollis and J. J. Paulos, “A Neural Network Learning Algorithm Tailored for VLSI Implementation”, IEEE Transactions on Neural Networks, vol. 5, no. 5, pp. 784-791, Sept. 1994.
[27] G. Cauwenberghs, “Analog VLSI Stochastic Perturbative Learning Architectures”, Analog Integrated Circuits and Signal Processing, vol. 13, pp. 195-209, 1997.
[28] C. Diorio, P. Hasler, B. A. Minch, and C. A. Mead, “A Single-Transistor Silicon Synapse”, IEEE Transactions on Electron Devices, vol. 43, no. 11, pp. 1972-1980, Nov. 1996.
[29] C. Diorio, P. Hasler, B. A. Minch, and C. A. Mead, “A Complementary Pair of Four-Terminal Silicon Synapses”, Analog Integrated Circuits and Signal Processing, vol. 13, no. 1-2, pp. 153-166, May-June 1997.
[30] C. Mead, Analog VLSI and Neural Systems, New York: Addison-Wesley, 1989.
[31] S. W. Golomb, Shift Register Sequences, Revised Ed., Laguna Hills, CA: Aegean Park Press, 1982.
[32] P. Horowitz and W. Hill, The Art of Electronics, Second Ed., Cambridge, UK: Cambridge University Press, pp. 655-664, 1989.
[33] J. Alspector, B. Gupta, and R. B. Allen, “Performance of a Stochastic Learning Microchip”, Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed., San Mateo, CA: Morgan Kaufman, pp. 748-760, 1989.
[34] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Second Ed., Reading, Mass.: Addison-Wesley, 1993.
[35] R. Gregorian and G. C. Temes, Analog MOS Integrated Circuits for Signal Processing, New York: John Wiley & Sons, 1986.
[36] J. Alspector, J. W. Gannett, S. Haber, M. B. Parker, and R. Chu, “A VLSI-Efficient Technique for Generating Multiple Uncorrelated Noise Sources and Its Application to Stochastic Neural Networks”, IEEE Transactions on Circuits and Systems, vol. 38, no. 1, pp. 109-123, Jan. 1991.
[37] S. Wolfram, Theory and Applications of Cellular Automata, World Scientific Publishing Co. Pte. Ltd., 1986.
[38] A. Dupret, E. Belhaire, and P. Garda, “Scalable Array of Gaussian White Noise Sources for Analogue VLSI Implementation”, Electronics Letters, vol. 31, no. 17, pp. 1457-1458, Aug. 1995.
[39] G. Cauwenberghs, “An Analog VLSI Recurrent Neural Network Learning a Continuous-Time Trajectory”, IEEE Transactions on Neural Networks, vol. 7, no. 2, pp. 346-361, March 1996.
[40] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation, New York: Addison-Wesley, 1991.
[41] M. L. Minsky and S. A. Papert, Perceptrons, Cambridge: MIT Press, 1969.
[42] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning Representations by Back-Propagating Errors”, Nature, vol. 323, pp. 533-536, Oct. 9, 1986.
[43] T. Morie and Y. Amemiya, “An All-Analog Expandable Neural Network LSI with On-Chip Backpropagation Learning”, IEEE Journal of Solid-State Circuits, vol. 29, no. 9, pp. 1086-1093, Sept. 1994.
[44] A. Dembo and T. Kailath, “Model-Free Distributed Learning”, IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 58-70, March 1990.
[45] B. Widrow and M. A. Lehr, “30 Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation”, Proceedings of the IEEE, vol. 78, no. 9, pp. 1415-1442, Sept. 1990.
[46] Y. Xie and M. Jabri, “Analysis of the Effects of Quantization in Multilayer Neural Networks Using a Statistical Model”, IEEE Transactions on Neural Networks, vol. 3, no. 2, pp. 334-338, 1992.
[47] A. F. Murray, “Multilayer Perceptron Learning Optimized for On-Chip Implementation: A Noise-Robust System”, Neural Computation, vol. 4, pp. 367-381, 1992.
[48] Y. Maeda, J. Hirano, and Y. Kanata, “A Learning Rule of Neural Networks via Simultaneous Perturbation and Its Hardware Implementation”, Neural Networks, vol. 8, no. 2, pp. 251-259, 1995.
[49] V. Petridis and K. Paraschidis, “On the Properties of the Feedforward Method: A Simple Training Law for On-Chip Learning”, IEEE Transactions on Neural Networks, vol. 6, no. 6, Nov. 1995.
[50] D. B. Kirk, D. Kerns, K. Fleischer, and A. H. Barr, “Analog VLSI Implementation of Multi-dimensional Gradient Descent”, Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufman Publishers, vol. 5, pp. 789-796, 1993.
[51] J. S. Denker and B. S. Wittner, “Network Generality, Training Required, and Precision Required”, Neural Information Processing Systems, D. Z. Anderson, Ed., New York: American Institute of Physics, pp. 219-222, 1988.
[52] P. W. Hollis, J. S. Harper, and J. J. Paulos, “The Effects of Precision Constraints in a Backpropagation Learning Network”, Neural Computation, vol. 2, pp. 363-373, 1990.
[53] M. Stevenson, R. Winter, and B. Widrow, “Sensitivity of Feedforward Neural Networks to Weight Errors”, IEEE Transactions on Neural Networks, vol. 1, no. 1, March 1990.
[54] A. J. Montalvo, P. W. Hollis, and J. J. Paulos, “On-Chip Learning in the Analog Domain with Limited Precision Circuits”, International Joint Conference on Neural Networks, vol. 1, pp. 196-201, June 1992.
[55] T. G. Clarkson, C. K. Ng, and Y. Guan, “The pRAM: An Adaptive VLSI Chip”, IEEE Transactions on Neural Networks, vol. 4, no. 3, May 1993.
[56] A. Torralba, F. Colodro, E. Ibáñez, and L. G. Franquelo, “Two Digital Circuits for a Fully Parallel Stochastic Neural Network”, IEEE Transactions on Neural Networks, vol. 6, no. 5, Sept. 1995.
[57] H. Card, “Digital VLSI Backpropagation Networks”, Can. J. Elect. & Comp. Eng., vol. 20, no. 1, 1995.
[58] H. C. Card, S. Kamarsu, and D. K. McNeill, “Competitive Learning and Vector Quantization in Digital VLSI Systems”, Neurocomputing, vol. 18, pp. 195-227, 1998.
[59] H. C. Card, C. R. Schneider, and R. S. Schneider, “Learning Capacitive Weights in Analog CMOS Neural Networks”, Journal of VLSI Signal Processing, vol. 8, pp. 209-225, 1994.
[60] G. Cauwenberghs and A. Yariv, “Fault-Tolerant Dynamic Multilevel Storage in Analog VLSI”, IEEE Transactions on Circuits and Systems - II: Analog and Digital Signal Processing, vol. 41, no. 12, Dec. 1994.
[61] G. Cauwenberghs, “A Micropower CMOS Algorithmic A/D/A Converter”, IEEE Transactions on Circuits and Systems - I: Fundamental Theory and Applications, vol. 42, no. 11, Nov. 1995.
[62] P. Mueller, J. Van Der Spiegel, D. Blackman, T. Chiu, T. Clare, C. Donham, T. P. Hsieh, and M. Loinaz, “Design and Fabrication of VLSI Components for a General Purpose Analog Neural Computer”, Analog VLSI Implementation of Neural Systems, C. Mead and M. Ismail, Eds., Norwell, Mass.: Kluwer, pp. 135-169, 1989.
[63] R. B. Allen and J. Alspector, “Learning of Stable States in Stochastic Asymmetric Networks”, IEEE Transactions on Neural Networks, vol. 1, no. 2, June 1990.
[64] G. Cauwenberghs, C. F. Neugebauer, and A. Yariv, “Analysis and Verification of an Analog VLSI Incremental Outer-Product Learning System”, IEEE Transactions on Neural Networks, vol. 3, no. 3, May 1992.
[65] C. Diorio, S. Mahajan, P. Hasler, B. Minch, and C. Mead, “A High-Resolution Nonvolatile Analog Memory Cell”, Proceedings of the 1995 IEEE International Symposium on Circuits and Systems, vol. 3, pp. 2233-2236, 1995.
[66] A. J. Montalvo, R. S. Gyurcsik, and J. J. Paulos, “An Analog VLSI Neural Network with On-Chip Perturbation Learning”, IEEE Journal of Solid-State Circuits, vol. 32, no. 4, April 1997.
[67] P. Baldi and F. Pineda, “Contrastive Learning and Neural Oscillations”, Neural Computation, vol. 3, no. 4, pp. 526-545, Winter 1991.


Acknowledgements

First, I would like to thank my advisor, Rod Goodman, for providing me with the resources and tools necessary to produce this work. Next, I would like to thank the members of my thesis defense committee, Jehoshua Bruck, Christof Koch, Alain Martin, and Chris Diorio. I would also like to thank Petr Vogel for being on my candidacy committee. Thanks also goes out to some of my undergraduate professors, Andreas Andreou and Gert Cauwenberghs, for starting me oﬀ on this analog VLSI journey and for providing occasional guidance along the way. Many thanks go to the members of my lab and the Caltech VLSI community for fruitful conversations and useful insights. This includes Jeﬀ Dickson, Bhusan Gupta, Samuel Tang, Theron Stanford, Reid Harrison, Alyssa Apsel, John Cortese, Art Zirger, Tom Olick, Bob Freeman, Cindy Daniell, Shih-Chii Liu, Brad Minch, Paul Hasler, Dave Babcock, Adam Hayes, William Agassounon, Phillip Tsao, Joseph Chen, Alcherio Martinoli, and many others. I thank Karin Knott for providing administrative support for our lab. I must also thank my parents, Rosalyn and Charles Soﬀar, for making all things possible for me, and I thank my brothers, Victor Koosh and Jonathan Soﬀar. Thanks also goes to Lavonne Martin for always lending a helpful hand and a kind ear. I also thank Grace Esteban for providing encouragement for many years. Furthermore, I would like to thank many of my other friends and colleagues, too numerous to name, for the many joyous times.

Abstract

Nature has evolved highly advanced systems capable of performing complex computations, adaptation, and learning using analog components. Although digital systems have significantly surpassed analog systems in terms of performing precise, high speed, mathematical computations, digital systems cannot outperform analog systems in terms of power. Furthermore, nature has evolved techniques to deal with imprecise analog components by using redundancy and massive connectivity. In this thesis, analog VLSI circuits are presented for performing arithmetic functions and for implementing neural networks. These circuits draw on the power of the analog building blocks to perform low power and parallel computations. The arithmetic function circuits presented are based on MOS transistors operating in the subthreshold region with capacitive dividers as inputs to the gates. Because the inputs to the gates of the transistors are floating, digital switches are used to dynamically reset the charges on the floating gates to perform the computations. Circuits for performing squaring, square root, and multiplication/division are shown. A circuit that performs a vector normalization, based on cascading the preceding circuits, is shown to display the ease with which simpler circuits may be combined to obtain more complicated functions. Test results are shown for all of the circuits. Two feedforward neural network implementations are also presented. The first uses analog synapses and neurons with a digital serial weight bus. The chip is trained in loop with the computer performing control and weight updates. The second neural network also uses a computer for the global control operations, but all of the local operations are performed on chip. The weights are implemented digitally, and counters are used to adjust them. A parallel perturbative weight update algorithm is used. The chip uses multiple, locally generated, pseudorandom bit streams to perturb all of the weights in parallel. If the perturbation causes the error function to decrease, the weight change is kept; otherwise it is discarded. By training with the chip in the loop, it is possible to learn around circuit offsets. Test results are shown of both networks successfully learning digital functions such as AND and XOR.

Contents

Acknowledgements
Abstract
1 Introduction
2 Analog Computation with Translinear Circuits
3 Analog Computation with Dynamic Subthreshold Translinear Circuits
4 VLSI Neural Network Algorithms, Limitations, and Implementations
5 VLSI Neural Network with Analog Multipliers and a Serial Digital Weight Bus
6 A Parallel Perturbative VLSI Neural Network
7 Conclusions
Bibliography
. . .9 Serial weight bus shift register . . 3. .

15 Training of a 2:2:1 network with the XOR function with more iterations 90 . . . . . .15 Training of a 2:2:1 network with XOR function starting with “ideal” weights . . . . . . 6. . . . . . . . . . . . . . . . .12 Training of a 2:1 network with AND function . . . . . . . . . .11 Parallel Perturbative algorithm . . . .4 6. . . . . . . . . . .7 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . .8 6. . . . . . . . . . . . . . . . .10 Pseudocode for parallel perturbative learning network . .2 6. . . . . . . . . . . . . . . . . . . . . . . . . .13 Training of a 2:1 network with the AND function . . . 6. . . . . . . . . . . . . . . . . . . . . . . Full adder for up/down counter . . . 6. . 5. . . 66 69 69 72 75 76 77 78 79 80 83 83 84 88 89 63 64 6. .11 Error comparison section . . . . . . . . . . . . . . . . . . . . . Block diagram of parallel perturbative neural network circuit . . . . . . . . . 2-tap. . . . . . . . . . . . .12 Error comparison operation . . Maximal length LFSRs with only 2 feedback taps . . . . . . . . . . Synchronous up/down counter . . . . . . 6. . . . . . . . . 5. . . . . .14 Network with “ideal” weights chosen for implementing XOR . .9 Ampliﬁed thermal noise of a resistor . . . . . . . .14 Training of a 2:2:1 network with the XOR function . . . . Dual counterpropagating LFSR multiple random bit generator . m-bit linear feedback shift register . 60 62 5. . . Counter static register . . . . . . . . . . . . .3 6. .13 Training of a 2:2:1 network with XOR function starting with small random weights . . . 6. 5. . . . . . . . . Inverter block . . . . . . .6 6. . . . . . . .x 5. . . . . . . . . 6. . . . . . . . . . . . . . . . .5 6. . . . . . . . . . . .1 6. . . . . . . . . . . . . . . . . .

6. . . . . . . . . . . . n=2. . . . . . . .1 6. . . . . . . . . . . . LFSR with m=4. . . . 65 70 70 . . . . .1 Computations for network with “ideal” weights chosen for implementing XOR . .2 LFSR with m=4. . n=3 yielding 1 maximal length sequence . . . . yielding 3 possible subsequences . . . . .xi List of Tables 5.

Chapter 1

Introduction

Although many applications are well suited for digital solutions, analog circuit design continues to have an important place. The places where analog circuits are required and excel are the places where digital equivalents cannot live up to the task. For example, ultra high speed circuits, such as in wireless applications, require the use of analog design techniques. Another important area where digital circuits may not be sufficient is that of low power applications, such as in biomedical, implantable devices or portable, handheld devices. Furthermore, the large growth in portable devices with limited battery life increases the need of finding custom low power solutions. Finally, the importance of analog circuitry in interfacing with the world will never be displaced, because the world itself consists of analog signals.

In this thesis, several circuit designs are described that fit well into the analog arena. First, a set of circuits for performing basic analog calculations such as squaring, square rooting, multiplication and division are described. These circuits work with extremely low current levels and would fit well into applications requiring ultra low power designs. Next, several circuits are presented for implementing neural network architectures. Neural networks have proven useful in areas requiring man-machine interactions such as handwriting or speech recognition. Although these neural networks can be implemented with digital microprocessors, the inherent parallelism of neural networks allows a compact high speed solution in analog VLSI. Furthermore, the area of operation of the neural network circuits can be modified from low power to high speed to meet the needs of the specific application.

1.1 Overview

1.1.1 Analog Computation

In chapter 2, a brief review of analog computational circuits is given with an emphasis on translinear circuits. First, traditional translinear circuits are discussed, which are implemented with devices showing an exponential characteristic, such as bipolar transistors. Next, MOS transistor versions of translinear circuits are discussed. In subthreshold MOS translinear circuits the exponential characteristic is exploited similarly to bipolar transistors, whereas above threshold MOS translinear circuits use the square law characteristic and require a different design style. Also mentioned are translinear circuits that utilize networks of multiple, linear input elements. These form the basis of the circuits described in the next chapter.

In chapter 3, a class of analog CMOS circuits that can be used to perform many analog computational tasks is presented. The circuits utilize MOSFETs in their subthreshold region, as well as capacitors and switches, to produce the computations. Each of these circuits has switches at the inputs of its floating gates which are used to dynamically set and restore the charges at the floating gates to proceed with the computation. A few basic current-mode building blocks that perform squaring, square root, and multiplication/division are shown. These circuit building blocks are then combined into a more complicated circuit that normalizes a current by the square root of the sum of the squares (vector sum) of the currents. This should be sufficient to gain understanding of how to implement other power law circuits.

1.1.2 VLSI Neural Networks

In chapter 4, a brief review of VLSI neural networks is given. First, the various learning rules which are appropriate for analog VLSI implementation are discussed. Next, some of the issues associated with analog VLSI networks are presented, such as limited precision weights and weight storage issues. Also, a brief discussion of several VLSI neural network implementations is presented.

In chapter 5, a VLSI feedforward neural network is presented that makes use of digital weights and analog synaptic multipliers and neurons. The chip uses a serial digital weight bus implemented by a long shift register to input the weights. The inputs and outputs of the network are provided directly at pins on the chip. The network is trained in a chip-in-loop fashion with a host computer implementing the training algorithm. Training results are shown for a 2 input, 1 output network trained with an AND function, and for a 2 input, 2 hidden unit, 1 output network trained with an XOR function.

In chapter 6, a VLSI neural network that uses a parallel perturbative weight update technique is presented. The network uses the same synapses and neurons as the previous network, but all of the local, parallel weight update computations are performed on-chip. This includes the generation of random perturbations and counters for updating the digital words where the weights are stored. The training algorithm used is a parallel weight perturbation technique. Training results are also shown for a 2 input, 1 output network trained with an AND function, and for a 2 input, 2 hidden unit, 1 output network trained with an XOR function.
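The perturbative weight update approach described above can be sketched in a few lines of host-computer code. This is an illustrative software version only (the network, error function, and perturbation size below are hypothetical stand-ins), not the on-chip implementation described in chapters 5 and 6:

```python
import random

def parallel_perturb_step(weights, loss, step=0.05):
    """One parallel weight perturbation step: perturb every weight at once
    by a random +/-step and keep the perturbation only if the error improves."""
    base = loss(weights)
    trial = [w + random.choice((-step, step)) for w in weights]
    trial_err = loss(trial)
    if trial_err < base:
        return trial, trial_err
    return weights, base

# Toy example: minimize a quadratic "error" over two weights.
random.seed(0)
w = [1.0, -1.0]
loss = lambda ws: sum(x * x for x in ws)
err = loss(w)
for _ in range(200):
    w, err = parallel_perturb_step(w, loss)
```

The accept/reject form shown here is one simple variant; the hardware versions in later chapters decide the update direction from the measured error difference rather than rejecting perturbations outright.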

Chapter 2

Analog Computation with Translinear Circuits

There are many forms of analog computation. For example, analog circuits can perform many types of signal filtering, including lowpass, bandpass, and highpass filters. Operational amplifier circuits are commonly used for these tasks as well as for generic derivative, integrator, and gain circuits. The emphasis of this chapter, however, shall be on circuits with the ability to implement arithmetic functions such as signal multiplication, division, power law implementations, and trigonometric functions. A well known class of circuits, called translinear circuits, exists to perform such tasks.

2.1 BJT Translinear Circuits

The earliest forms of these types of circuits were based on exploiting the exponential current-voltage relationships of diodes or bipolar junction transistors[1][2]. The basic idea behind using these types of circuits is that of using logarithms for converting multiplications and divisions into additions and subtractions. This type of operation was well known to users of slide rules in the past. For a bipolar junction transistor, the collector current, IC, as a function of its base-emitter voltage, Vbe, is given as

    IC = IS exp(Vbe / Ut)

where IS is the saturation current, and Ut = kT/q is the thermal voltage, which is approximately 26mV at room temperature. In using this formula, the base-emitter voltage is considered the input and the collector current is the output, which performs the exponentiation or antilog operation. It is also possible to think of the collector current as the input and the base-emitter voltage as the output, such as in a diode connected transistor, where the collector has a wire, or possibly a voltage buffer to remove base current offsets, connecting it to the base. In this case, the equation can be seen as follows:

    Vbe = Ut ln(IC / IS)

From here it is possible to connect the transistors, using log and antilog techniques, to obtain various multiplication, division and power law circuits. It is also possible to obtain trigonometric functions by using other power law functional approximations[3].

2.2 Subthreshold MOS Translinear Circuits

For a subthreshold MOS transistor, the drain current, Id, as a function of its gate-source voltage, Vgs, is given as[30]

    Id = Io exp(κ Vgs / Ut)

where Io is the preexponential current factor, and κ is a process dependent parameter which is normally around 0.7. Because of the similarities of the subthreshold MOS transistor equation and the bipolar transistor equation, it is sometimes possible to directly translate BJT translinear circuits into subthreshold MOS translinear circuits. However, care must be taken because of the nonidealities introduced in subthreshold MOS. Comparing the equations, it is clear that the subthreshold MOS transistor equation differs from the BJT equation by the introduction of the parameter κ. Due to this extra parameter, it is not always trivial to obtain the best circuit configuration for a given application, and direct copies of the functional form of BJT translinear circuits into subthreshold MOS translinear circuits do not always lead to the same implemented arithmetic function. For example, a particular circuit that performs a square rooting operation on a current using BJTs, when converted to the subthreshold MOS equivalent, will actually raise the current to the κ/(κ+1) power[30]. When κ = 1, this reduces to the appropriate 1/2, but since normally κ ≈ 0.7, the power becomes ≈ 0.41. Nevertheless, several types of subthreshold MOS circuits which display κ independence do exist[4][5].

Subthreshold MOS translinear circuits are more limited in their dynamic range than BJT translinear circuits because of entrance into strong inversion. However, the characteristics slowly decay in the transition region from subthreshold to above threshold. Furthermore, easier integration with standard digital circuits in a CMOS process is possible, and, due to the lack of a required base current, lower power operation may be possible with the subthreshold MOS circuits.

2.3 Above Threshold MOS Translinear Circuits

The ability of above threshold MOS transistors to exhibit translinear behavior has also been demonstrated[6]. For an above threshold MOS transistor in saturation, the drain current, Id, as a function of its gate-source voltage, Vgs, is given as

    Id = K (Vgs − Vt)²

Because of the square law characteristic used, these circuits are sometimes called quadratic-translinear or MOS translinear to distinguish them from translinear circuits based on exponential characteristics. Several interesting circuits have been demonstrated, such as those for performing squaring, multiplication, division, power, and vector sum computation[6][10][11]. These involve a few added restrictions on the types of translinear loop topologies allowed. However, because of the inflexibility of the square law compared to an exponential law, general function synthesis is usually not as straightforward as for exponential translinear circuits[7][9]. Although many of the circuits rely upon transistors in saturation, some also take advantage of transistors biased in the triode or linear region[12].

These circuits are usually more restricted in their dynamic range than bipolar translinear circuits. They are limited at the low end by weak inversion and at the high end by mobility reduction[7]. Furthermore, for decreasingly small submicron channel lengths, the square-law current-voltage equation is usually not as good an approximation to the actual circuit behavior as the exponential law is for both bipolar circuits and MOS circuits that are deep within the subthreshold region, and the current-voltage equation begins to approximate a linear function more than a quadratic[8].

2.4 Translinear Circuits With Multiple Linear Input Elements

Another class of translinear circuits exists that involves the addition of a linear network of elements used in conjunction with exponential translinear devices. The linear elements are used to add, subtract and multiply variables by a constant. This class of circuits can be viewed as a generalization of the translinear principle. From the point of view of log-antilog mathematics, it is clear that multiplying values by a constant, n, can be transformed into raising a value to the n-th power. Thus, the addition of a linear network provides greater flexibility in function synthesis. When the exponential translinear devices used are bipolar transistors, the linear networks are usually defined by resistor networks[13][14], and when the exponential translinear devices used are subthreshold MOS transistors, the linear networks are usually defined by capacitor networks[15][16].
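The log-antilog idea behind the translinear principle can be checked numerically. The sketch below models each junction with the ideal exponential law from section 2.1 (the device parameters and current values are illustrative only) and shows that summing log-domain voltages around a loop yields a multiply/divide on currents:

```python
import math

Ut = 0.026   # thermal voltage at room temperature, volts (illustrative)
Is = 1e-15   # saturation current, amps (illustrative)

def vbe(ic):
    # diode connected transistor: current in, log-domain voltage out
    return Ut * math.log(ic / Is)

def ic(v):
    # transistor driven by a base-emitter voltage: the antilog operation
    return Is * math.exp(v / Ut)

# Summing two log terms and subtracting a third, then taking the antilog,
# implements Iout = I1 * I2 / I3 independently of Is and Ut.
I1, I2, I3 = 2e-9, 5e-9, 4e-9
Iout = ic(vbe(I1) + vbe(I2) - vbe(I3))   # expected: 2.5e-9 A
```

Note how Is and Ut cancel exactly; this cancellation is what makes translinear loops insensitive to the absolute device scale, provided the devices match.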

Chapter 3

Analog Computation with Dynamic Subthreshold Translinear Circuits

A class of analog CMOS circuits has been presented which made use of MOS transistors operating in their subthreshold region[15][16]. These circuits use capacitively coupled inputs to the gate of the MOSFET in a capacitive voltage divider configuration. Since the gate has no DC path to ground, it is floating, and some means must be used to initially set the gate charge and, hence, voltage value. The gate voltage is initially set at the foundry by a process which puts different amounts of charge into the oxides. Thus, when one gets a chip back from the foundry, the gate voltages are somewhat random and must be equalized for proper circuit operation.

One technique is to expose the circuit to ultraviolet light while grounding all of the pins. This has the effect of reducing the effective resistance of the oxide and allowing a conduction path. This technique works in certain cases[17], such as when the floating gate transistors are in a fully differential configuration, because the actual gate charge is not critical so long as the gate charges are equalized. Although this technique ensures that all the gate charges are equalized, it does not always constrain the actual value of the gate voltage to a particular value. Furthermore, if the floating gate circuits are operated over a sufficient length of time, stray charge may again begin to accumulate, requiring another ultraviolet exposure. For the circuits presented here it is imperative that the initial voltages on the gates are set precisely and effectively to the same value near the circuit ground. Therefore, a dynamic restoration technique is utilized that makes it possible to operate the circuits indefinitely.

Figure 3.1: Capacitive voltage divider

3.1 Floating Gate Translinear Circuits

The analysis begins by describing the math that governs the implementation of these circuits. A few simple circuit building blocks that should give the main idea of how to implement other designs will be presented, as well as a capacitive voltage divider. A more thorough analysis for circuit synthesis is given elsewhere[15][16].

For the capacitive voltage divider shown in fig. 3.1, if all of the voltages are initially set to zero volts and then V1 and V2 are applied, the voltage at the node VT becomes

    VT = (C1 V1 + C2 V2) / (C1 + C2)

If C1 = C2 then this equation reduces to

    VT = V1/2 + V2/2

The current-voltage relation for a MOSFET transistor operating in subthreshold and in saturation (Vds > 4Ut, where Ut = kT/q) is given by[30]:

    Ids = Io exp(κ Vgs / Ut)

Using the above equation for a transistor operating in subthreshold, as well as the capacitive voltage divider, the necessary equations of the computations desired can be produced. In the following formulations, all of the transistors are assumed to be identical.
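The two relations above combine as follows: averaging two gate voltages with equal capacitors corresponds, through the exponential subthreshold law, to taking a geometric mean of currents. A small numerical sketch (device parameters are illustrative assumptions, with the capacitor value taken from the fabricated circuits described later):

```python
import math

def divider_vt(c1, v1, c2, v2):
    """Floating-gate node voltage of a two-input capacitive divider,
    assuming zero initial charge as in the reset phase."""
    return (c1 * v1 + c2 * v2) / (c1 + c2)

Ut, kappa, Io = 0.026, 0.7, 1e-18   # illustrative subthreshold parameters

def ids(vgs):
    # subthreshold saturation current
    return Io * math.exp(kappa * vgs / Ut)

# Equal capacitors average the two inputs:
vt = divider_vt(2.475e-12, 0.30, 2.475e-12, 0.10)   # expected: 0.20 V

# Halving a gate voltage takes the square root of the normalized current:
# ln(ids(v/2)/Io) == (1/2) * ln(ids(v)/Io)
```

This is the mechanism exploited by all of the building blocks in this section: capacitive averaging of voltages performs geometric averaging of currents.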

3.1.1 Squaring Circuit

Figure 3.2: Squaring circuit

For the squaring circuit shown in figure 3.2, all of the capacitors are of the same size. Also, the I-V equations for each of the transistors can be written as follows:

    (Ut/κ) ln(Iout/Io) = Vin

    (Ut/κ) ln(Iref/Io) = Vref

    (Ut/κ) ln(Iin/Io) = Vin/2 + Vref/2

Then, these results can be combined to obtain the following:

    (Ut/κ) ln(Iin/Io) = (1/2) (Ut/κ) ln(Iout/Io) + (1/2) (Ut/κ) ln(Iref/Io)

    Iout/Io = (Iin/Io)² / (Iref/Io)

    Iout = Iin² / Iref
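The squaring relation can be verified by solving the gate equations numerically. This sketch uses illustrative device parameters and input currents, modeling each transistor with the ideal subthreshold law:

```python
import math

Ut, kappa, Io = 0.026, 0.7, 1e-18   # illustrative device parameters

def gate_v(i):
    # diode connected transistor: V = (Ut/kappa) * ln(I/Io)
    return (Ut / kappa) * math.log(i / Io)

def drain_i(v):
    # transistor driven at its gate: I = Io * exp(kappa*V/Ut)
    return Io * math.exp(kappa * v / Ut)

# The input transistor's floating gate sits at (Vin + Vref)/2, so forcing
# Iin through it solves for Vin; the output transistor sees Vin directly.
Iin, Iref = 3e-9, 10e-9
Vref = gate_v(Iref)
Vin = 2 * gate_v(Iin) - Vref          # from (Ut/k) ln(Iin/Io) = (Vin+Vref)/2
Iout = drain_i(Vin)                   # expected: Iin**2 / Iref = 0.9e-9 A
```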

3.1.2 Square Root Circuit

Figure 3.3: Square root circuit

The equations for the square root circuit shown in figure 3.3 are derived similarly as follows:

    (Ut/κ) ln(Iref/Io) = Vref

    (Ut/κ) ln(Iin/Io) = Vin

    (Ut/κ) ln(Iout/Io) = Vin/2 + Vref/2

Combining the equations then gives

    (Ut/κ) ln(Iout/Io) = (1/2) (Ut/κ) ln(Iin/Io) + (1/2) (Ut/κ) ln(Iref/Io)

    Iout = √(Iref · Iin)
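Here the roles are reversed: the output transistor's floating gate is the capacitive average of the two diode connected inputs, so the output is the geometric mean of the two input currents. A short numeric check with illustrative device parameters:

```python
import math

Ut, kappa, Io = 0.026, 0.7, 1e-18   # illustrative device parameters

def gate_v(i):
    return (Ut / kappa) * math.log(i / Io)

def drain_i(v):
    return Io * math.exp(kappa * v / Ut)

# Square root circuit: Iin and Iref are diode connected, and the output
# transistor's floating gate is the capacitive average (Vin + Vref)/2.
Iin, Iref = 4e-9, 100e-9
Iout = drain_i((gate_v(Iin) + gate_v(Iref)) / 2)   # expected: 20e-9 A
```

Note that κ cancels in this configuration, which is why the subthreshold version reproduces the BJT result exactly rather than the κ/(κ+1) power discussed in chapter 2.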

Figure 3.4: Multiplier/divider circuit

3.1.3 Multiplier/Divider Circuit

For the multiplier/divider circuit shown in figure 3.4, the calculations are again performed as follows:

    (Ut/κ) ln(Iin1/Io) = Vin1/2

    (Ut/κ) ln(Iin2/Io) = Vin2/2

    (Ut/κ) ln(Iref/Io) = Vin2/2 + Vref/2

    Vref/2 = (Ut/κ) ln(Iref/Io) − (Ut/κ) ln(Iin2/Io)

    (Ut/κ) ln(Iout/Io) = Vin1/2 + Vref/2

    (Ut/κ) ln(Iout/Io) = (Ut/κ) [ln(Iin1/Io) + ln(Iref/Io) − ln(Iin2/Io)]

    Iout = Iref · Iin1 / Iin2
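The derivation above can be replayed numerically with the same two helper functions used for the other building blocks (device parameters and currents are illustrative):

```python
import math

Ut, kappa, Io = 0.026, 0.7, 1e-18   # illustrative device parameters

def gate_v(i):
    return (Ut / kappa) * math.log(i / Io)

def drain_i(v):
    return Io * math.exp(kappa * v / Ut)

# Multiplier/divider: the Iref transistor is diode connected through a
# divider with Vin2, so Vref/2 = gate_v(Iref) - gate_v(Iin2); the output
# transistor's gate is (Vin1 + Vref)/2, with Vin1/2 = gate_v(Iin1).
Iin1, Iin2, Iref = 6e-9, 3e-9, 10e-9
half_vref = gate_v(Iref) - gate_v(Iin2)
half_vin1 = gate_v(Iin1)
Iout = drain_i(half_vin1 + half_vref)   # expected: Iref*Iin1/Iin2 = 20e-9 A
```

This sketch assumes the ideal operating region; the limitation discussed next (Iref must exceed Iin2) comes from the charge constraints at the floating gates, which this idealized model does not capture.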

Although the input currents are labeled in such a way as to emphasize the dividing nature of this circuit, it is possible to rename the inputs such that the divisor is the reference current and the two currents in the numerator would be the multiplying inputs. Note that the divider circuit output is only valid when Iref is larger than Iin2. This is because the gate of the transistor with current Iref is limited to the voltage Vfgref at the gate by the current source driving Iref. Since this gate is part of a capacitive voltage divider between Vref and Vin2, when Vin2 > Vfgref, the voltage at the node Vref is zero and cannot go lower. Thus, no extra charge is coupled onto the output transistors, and when Iin2 > Iref, Iout ≈ Iin1. Also, from the input/output relation, it would seem that Iin1 and Iref are interchangeable inputs. Unfortunately, based on the previous discussion, whenever Iin2 > Iin1, Iout ≈ Iref. This would mean that no divisions with fractional outputs of Iin1/Iin2 < 1 could be performed. When such a fractional output is desired, the use of the circuit in this configuration would not perform properly.

The preceding discussion indicates that the final simplified input/output relation of the circuit is slightly deceptive. Care must be taken to ensure that the circuit is operated in the proper operating region with appropriately sized inputs and reference currents. Furthermore, when doing the analysis for similar power law circuits, the circuit designer must be aware of the charge limitations at the individual floating gates. A more accurate input/output relation for the multiplier/divider circuit is thus given as follows:

    Iout = Iref · Iin1 / Iin2   if Iin2 < Iref
    Iout ≈ Iin1                 if Iin2 > Iref

Figure 3.5: Dynamically restored squaring circuit

3.2 Dynamic Gate Charge Restoration

All of the above circuits assume that some means is available to initially set the gate charge level so that when all currents are set to zero, the gate voltages are also zero. One method of doing so which lends itself well to actual circuit implementation is that of using switches to dynamically set the charge during one phase of operation, and then to allow the circuit to perform computations during a second phase of operation. The squaring circuit in figure 3.5 is shown with switches now added to dynamically equalize the charge. A non-overlapping clock generator generates the two clock signals. During the first phase of operation, φ1 is high and φ2 is low. Thus, the input currents do not affect the circuit, all sides of the capacitors are discharged, and the floating gates of the transistors are also grounded and discharged. This establishes an initial condition with no current through the transistors corresponding to zero gate voltage. This is the reset phase of operation. Then, during the second phase of operation, φ1 goes low and φ2 goes high. The transistor gates are allowed to float, and the input currents are reapplied. The circuit now exactly resembles the aforementioned floating gate squaring circuit. Thus, it is able to perform the necessary computation. This is the compute phase of operation.

Figure 3.6: Dynamically restored square root circuit

The square root (figure 3.6) and multiplier/divider (figure 3.7) circuits are also constructed in the same manner by adding switches connected to φ2 at the drains of each of the transistors and switches connected to φ1 at each of the floating gate and capacitor terminals.

Figure 3.7: Dynamically restored multiplier/divider circuit

3.3 Root Mean Square (Vector Sum) Normalization Circuit

The above circuits can be used as building blocks and combined with current mirrors to perform a number of useful computations. For example, a normalization stage can be made which normalizes a current by the square root of the sum of the squares of other currents. Such a stage is useful in many signal processing tasks. This normalization stage is seen in figure 3.8. The reference current, Iref, is mirrored to all of the reference inputs of the individual stages. The reference current is doubled with the 1:2 current mirror into the divider stage. This is necessary because the reference current must be larger than the largest current that will be divided by. Since the current that is being divided will be the square root of a sum of squares of two currents, it is necessary to make sure that the reference

current for the divider section is greater than √2 Imax. For example, when Iin1 = Iin2 = Imax, using the 1:2 current mirror and setting Iref = Imax, the reference current in the divider section will be 2Imax, which is sufficient to enforce the condition.

Figure 3.8: Root mean square (vector sum) normalization circuit

The first two stages that read the input currents are the squaring circuit stages. The outputs of these stages are summed and then fed back into the square root stage with a current mirror. The input to the square root stage is thus given by

    Iin1²/Iref + Iin2²/Iref

The square root stage then performs the computation

    √((Iin1²/Iref + Iin2²/Iref) · Iref) = √(Iin1² + Iin2²)

The output of the square root stage is then fed into the multiplier/divider stage as the divisor current. The other input to the divider stage will be a mirrored copy of one of the input currents. The reference current for the divider stage is twice the reference current for the other stages, as discussed before. Thus, the output of the divider stage is given by

    2 Iref · Iin1 / √(Iin1² + Iin2²)

The output can then be fed back through another 2:1 current mirror (not shown) to remove the factor of 2. Thus, the overall transfer function computed would be

    Iout = Iref · Iin1 / √(Iin1² + Iin2²)

The addition of other divider stages can be used to normalize other input currents. This circuit easily extends to more variables by adding more input squaring stages and connecting them all to the input of the current mirror that outputs to the square root stage.
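The stage-by-stage chain can be checked by composing the building-block transfer functions in software. The current values below are illustrative and are chosen to keep every stage in its valid operating region (divisor smaller than its reference current):

```python
import math

def square(i, iref):
    return i * i / iref

def sqroot(i, iref):
    return math.sqrt(i * iref)

def divide(i1, i2, iref):
    # valid-region behavior of the multiplier/divider building block
    return iref * i1 / i2 if i2 < iref else i1

Iref = 10e-9
Iin1, Iin2 = 6e-9, 8e-9

s = square(Iin1, Iref) + square(Iin2, Iref)   # (Iin1^2 + Iin2^2)/Iref
mag = sqroot(s, Iref)                         # sqrt(Iin1^2 + Iin2^2)
Iout = divide(Iin1, mag, 2 * Iref) / 2        # Iref*Iin1/sqrt(Iin1^2+Iin2^2)
# expected: 10e-9 * 6e-9 / 10e-9 = 6e-9 A, since sqrt(36 + 64) = 10
```

With Iin1 = Iin2 = Imax = 10nA the divisor would reach √2·10nA ≈ 14nA, still below the doubled reference of 20nA, illustrating the √2·Imax condition derived above.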

3.4 Experimental Results

The above circuits were fabricated in a 1.2µm double poly CMOS process. The data gathered from the current squaring circuit, square root circuit and multiplier/divider circuit are shown in figures 3.9, 3.10, and 3.11, respectively. Figure 3.12 shows the data from the vector sum normalization circuit. The solid line in the figures represents the ideal fit. The circle markers represent the actual data points. The floating gate nfets were all W=30µ, L=30µ. The nfet switches were all W=3.6µ, L=3.6µ. All of the pfet transistors used for the mirrors were W=16.8µ, L=6µ. The capacitors were all 2.475 pF.

The circuits show good performance over several orders of magnitude in current range. Leakage currents limit the low end of the dynamic range. At the high end of current, the circuit deviates from the ideal when one or more transistors leaves the subthreshold region. The subthreshold region for these transistors is below approximately 100nA. The current W/L is 1. Extra current range at the high end can be achieved by increasing the W/L of the transistors so that they remain subthreshold at higher current levels. For example, increasing W/L to 10 would change the subthreshold current range of these circuits to be below approximately 1µA and hence increase the high end of the dynamic range appropriately.

For example, in figure 3.10, an offset from the theoretical fit for the Iref = 100nA case is seen. This is because in above threshold, the current goes as the square of the voltage, whereas in subthreshold it follows the exponential characteristic. Since the reference transistor is leaving the subthreshold region, more gate voltage is necessary per amount of current. Since the diode connected reference transistor sets up a slightly higher gate voltage, more charge and, hence, more voltage is coupled onto the output transistor. This causes the output current to be slightly larger than expected. Some of the anomalous points, such as those seen in figures 3.9 and 3.11, are believed to be caused by autoranging errors in the instrumentation used to collect the data.

The reference current for the normalization circuit, Iref, at the input of the reference current mirror array was set to 10nA. This is because the reference current and the maximum input currents were chosen to keep the divider and all circuits within the proper subthreshold operating regions of the building block circuits. The normalization circuit shows very good performance. This is helped by the compressive nature of the function. However, the ideal fit required a value of 14nA to be used. This was not seen in the other circuits, thus it is assumed that this is due to the Early effect of the current mirrors. In fact, the SPICE simulations of the circuit also predict the value of 14nA. Since this is merely a multiplicative effect, it is possible to simply scale the W/L of the final output mirror stage to correct it. Alternatively, it is possible to use the SPICE simulations to change the mirror transistor ratios to obtain the desired output. It is also possible to increase the length of the mirror transistors to reduce Early effect or to use a more complicated mirror structure such as a cascoded mirror. The divider circuit, as previously discussed, does not perform the division after Iin2 > Iref, and instead outputs Iin1, as is seen in the figure.

3.5 Discussion

3.5.1 Early Effects

Although setting the floating gates near the circuit ground has been the goal of the dynamic technique discussed above, in the circuits previously discussed, the switches will have a small voltage drop across them and the floating gates are actually set slightly above ground. In fact, it is possible to allow the floating gate values to be reset to something other than the circuit ground. The main effect this has is to change the dynamic range. If the floating gates are reset to a value higher than ground, the transistors will enter the above threshold region with smaller input currents. It may also be possible to use a reference voltage smaller than ground in order to set the floating gates in such a way as to increase the dynamic range.

21 10 −5 Iout = Iin / Iref) 2 10 −6 10 −7 Iout (Amps) 10 −8 Iref=100pA −9 Iref=1nA Iref=10nA Iref=100nA 10 10 −10 10 −11 10 −10 10 −9 10 Iin (Amps) −8 10 −7 Figure 3.9: Squaring circuit results .

10: Square root circuit results .22 10 −6 Iout = sqrt(Iin * Iref) 10 −7 Iout (Amps) Iref=100nA 10 −8 Iref=10nA 10 −9 Iref=1nA Iref=100pA 10 −10 10 −10 10 −9 10 Iin (Amps) −8 10 −7 Figure 3.

23 10 −6 Iout = Iref * I1 / I2. Iref=10nA 10 −7 I1=10nA 10 Iout (Amps) −8 I1=1nA 10 −9 I1=100pA 10 −10 10 −11 10 −10 10 −9 10 I2 (Amps) −8 10 −7 Figure 3.11: Multiplier/divider circuit results .

24 10 −7 Iout = Iref * I1 / sqrt(I1 + I2 ) 2 2 I2=100pA Iout (Amps) 10 −8 I2=1nA 10 −9 I2=10nA −10 10 10 −10 10 I1 (Amps) Iout = Iref * I1 / sqrt(I1 + I2 ) 2 2 −9 10 −8 10 −7 I1=10nA Iout (Amps) 10 −8 I1=1nA 10 −9 I1=100pA −10 10 10 −10 10 I2 (Amps) −9 10 −8 Figure 3.12: RMS normalization circuit results .

One problem with these circuits is the presence of a large Early effect. This effect comes from two sources. First, the standard Early effect comes from an increase in channel current due to channel length modulation[30]. This can be modeled by an extra linear term that is dependent on the drain-source voltage, Vds:

Ids = Io e^(κVgs/Ut) (1 + Vds/Vo)

The voltage Vo, called the Early voltage, depends on process parameters and is directly proportional to the channel length, L. Another aspect of the Early effect which is important for circuits with floating gates is due to the overlap capacitance of the gate to the drain/source region of the transistor[15]. In these circuits, the main overlap capacitance of interest is the floating gate to drain capacitance, Cfg-d, which also depends on process parameters and is directly proportional to the width, W. This parasitic capacitance acts as another input capacitor from the drain voltage, Vd, so the drain voltage couples onto the floating gate by an amount Vd·Cfg-d/CT, where CT is the total input capacitance. Since this is coupled directly onto the gate, it has an exponential effect on the current level.

The method chosen to overcome these combined Early effects is to first make the transistors long. If L is sufficiently long, then the Early voltage, Vo, will be large and this will reduce the contribution of the linear Early effect. This also requires increasing the width, W, by the same amount as the length, L, to keep the same subthreshold current levels and not lose any dynamic range. However, this increase in W increases Cfg-d, which worsens the exponential Early effect. Thus, it is necessary to further increase the size of the input capacitors to make the total input capacitance large compared to the floating gate to drain parasitic capacitance and reduce the impact of the exponential Early effect. In general, it is important to make the input capacitors large enough that other parasitic capacitances are very small compared to them, and it is also possible to set the floating gates in such a way as to increase the dynamic range. In this way, it is possible to trade circuit size for accuracy.

It may be possible to use some of the switches as cascode transistors to reduce the Early effect in the output transistors of the various stages. This would involve not setting φ2 all the way to Vdd during the computation and instead setting it to some lower cascode voltage. The input and reference transistors already have their drain voltages fixed due to the diode connection and would not benefit from this, but the output transistor drains are allowed to go near Vdd. Other techniques are also available which can improve matching characteristics and reduce the size of the circuits. One such technique would involve using a single transistor with a multiphase clock that can be used as a replacement for all the input and output transistors in the building blocks[18]. This may allow a reduction in the size of the transistors and capacitors.

3.5.2 Clock Effects

Unlike switched capacitor circuits that require a very high clock rate compared to the input frequencies, the clock rate for these circuits is determined solely by the leakage rate of the switch transistors. Thus, it is possible to make the clock as slow as 1 Hz or slower. The input can change faster than the clock rate and the output will be valid during most of the computation phase. The output does require a short settling period due to the presence of glitches in the output current from charge injection by the switches. The amount of charge injection is determined by the area of the switches and is minimized with small switches. Also, making the input capacitors large with respect to the size of the switches will reduce the effects of the glitches by making the injected charge less effective at inducing a voltage on the gate. It may be possible to use a current mode filter or use two transistors in complementary phase at the output to compensate for the glitches[18].
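The combined Early effect model above can be sketched numerically (an illustrative sketch; Io, κ, Ut, the Early voltage, and the capacitance values are arbitrary example parameters, not extracted device values):

```python
import math

def subthreshold_ids(v_g, v_d, io=1e-15, kappa=0.7, ut=0.025,
                     v_early=20.0, c_fgd=1e-15, c_total=1e-12):
    """Subthreshold drain current with both Early-effect terms.

    The linear term (1 + Vd/Vo) models channel length modulation;
    the drain-to-floating-gate coupling Vd*Cfgd/CT adds to the
    effective gate voltage and acts exponentially on the current.
    """
    v_fg = v_g + v_d * c_fgd / c_total     # drain couples onto floating gate
    return io * math.exp(kappa * v_fg / ut) * (1.0 + v_d / v_early)

# Longer L -> larger Vo -> weaker linear Early effect:
i_short = subthreshold_ids(0.5, 1.0, v_early=10.0)
i_long  = subthreshold_ids(0.5, 1.0, v_early=40.0)

# A larger total input capacitance suppresses the exponential term:
i_small_ct = subthreshold_ids(0.5, 1.0, c_total=1e-13)
i_big_ct   = subthreshold_ids(0.5, 1.0, c_total=1e-11)
print(i_short, i_long, i_small_ct, i_big_ct)
```

This makes the tradeoff described above concrete: lengthening the channel shrinks the linear term, while only a larger CT relative to Cfg-d can tame the exponential term.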

3.5.3 Speed, Accuracy Tradeoffs

It is necessary to increase the size of the input capacitors and reduce the size of the switches in order to improve the accuracy and decrease the effect of glitches. However, this also has an effect on the speeds at which the circuits may be operated and the settling times of the signals. Subthreshold circuits are already limited to low currents, which in turn limits their speed of operation since it takes longer to charge up a capacitor or gate capacitance with small currents. Thus, the increases in input capacitance necessary to increase accuracy further limit the speed of operation. Furthermore, the addition of a switch along the current path reduces the speed of the circuits because of the effective on-resistance of the switch. The switches feeding into the capacitor inputs can be viewed as a lowpass R-C filter at the inputs of the circuits. It is possible to make the switches larger in order to decrease the effective resistance, but this will increase the parasitic capacitance and require an increase in the size of the input capacitors to obtain the same accuracy. Thus, there is an inherent tradeoff between the speed of operation and the accuracy of the circuits.

3.6 Floating Gate Versus Dynamic Comparison

As previously discussed, it is possible to use these circuits in either the purely floating gate version or in the dynamic version. In this section, a brief comparison is made between the advantages and disadvantages of the two design styles.

Floating gate advantages

• Larger dynamic range - The switches of the dynamic version introduce leakage currents and offsets which reduce dynamic range. It is possible to go to smaller current values in the floating gate version.
• No glitches - The floating gate version does not suffer from the appearance of glitches on the output and, thus, does not have the same settling time requirements.
• Constant operation - The output is always valid.
• Faster - There are fewer parasitics because of the lack of switches, and there is no added switch resistance along the current path, so it is possible to operate the circuits at higher speeds.
• Less power - Power is saved by not operating any switches. Furthermore, since there are fewer stacked transistors, smaller operating voltages are possible.
• Less area - There are fewer transistors because of the lack of switches.

Dynamic advantages

• No UV radiation required - In many cases it is desirable not to introduce an extra post processing step in order to operate the circuits. In practice, in some instances, ultraviolet irradiation fails to place these circuits into a proper subthreshold operating range and instead places the floating gate in a state where the transistors are biased above threshold.
• Decreased packaging costs - Packaging costs are increased when it is necessary to accommodate a UV transparent window[19].
• Better matching with smaller input capacitors - It is possible to reduce the size of the input capacitors and still have good matching; thus, smaller input capacitors may be used.
• Robust against long term drift - Transistor threshold drift is a known problem in CMOS circuits[20] and Flash EEPROMs[19]. Much of the threshold shift is attributed to hot electron injection. These electrons get stuck in trap states in the oxide, thereby causing the shift. The electrons can then exit the oxide and normally contribute to a very small gate current. Much effort is placed into reducing the number of oxide traps to minimize this effect. However, in floating gate circuits, the electrons are not removed and remain as excess charge, which leads to increased threshold drift. Furthermore, as process scales continue to decrease, direct gate tunneling effects will further exacerbate the problem. The dynamically restored circuits remove these excess charges during the reset phase of operation and provide greater immunity against long term drift.

3.7 Conclusion

A set of analog circuits that may be useful for analog computation and neural network circuits has been presented. It is hoped that it is clear from the derivations how to obtain other power law circuits that may be necessary and how to combine them to perform useful complex calculations. Furthermore, the dynamic charge restoration technique may be applied to other floating gate computational circuits that may otherwise require initial ultraviolet illumination or other methods to set the initial conditions. Although the advantages that the pure floating gate circuits provide outnumber those of the dynamic circuits, in some instances the advantages afforded by the dynamic technique are of such significant importance that many circuit designers would choose to utilize it. In practice, a full consideration of each style's constraints with respect to the overall system design criteria would favor one or the other of the two design styles. The dynamic charge restoration technique is shown to be a useful implementation of this class of analog circuits.

Chapter 4

VLSI Neural Network Algorithms, Limitations, and Implementations

4.1 VLSI Neural Network Algorithms

4.1.1 Backpropagation

Early work in neural networks was hindered by the problem that single layer networks could only represent linearly separable functions[41]. Although it was known that multilayer networks could learn any boolean function, the lack of a learning rule to program multilayered networks reduced interest in neural networks for some time[40]. The backpropagation algorithm[42], based on gradient descent, provided a viable means for programming multilayer networks and created a resurgence of interest in neural networks. A typical sum-squared error function used for neural networks is defined as follows:

E(w) = (1/2) Σi (Ti − Oi)²

where Ti is the target output for input pattern i from the training set, and Oi is the respective network output. The network output is a composite function of the network weights and of the underlying neuron sigmoidal function, s(.). In backpropagation, weight updates, ∆wij ∝ ∂E/∂wij, are calculated for each of the weights. The functional form of the update involves the weights themselves propagating the errors backwards to more internal nodes, which leads to the name backpropagation. Furthermore, the functional forms of the weight update rules are composed of derivatives of the neuron sigmoidal function. Thus, if the functional form of the neuron is, for example, s(x) = tanh(ax), then the weight update rule involves functions of the form s′(x) = a(1 − tanh²(ax)). For software implementations of the neural network, the ease of computing this type of function allows an effective means to implement the backpropagation algorithm. However, if the hardware implementation of the sigmoidal function tends to deviate from an ideal function such as the tanh function, it becomes necessary to compute the derivative of the function actually being implemented, which may not have a simple functional form. Analog implementations of the derivative can prove difficult, and implementations which avoid these difficulties are desirable. Otherwise, unwanted errors would get backpropagated and accumulate, which would lead to seriously degraded operation. Furthermore, deviations in the multiplier circuits can add offsets which would also accumulate and significantly reduce the ability of the network to learn. For example, in one implementation, it was discovered that offsets of 1% degraded the learning success rate of the XOR problem by 50%[43]. In typical analog VLSI implementations, offsets much larger than 1% are quite common.

4.1.2 Neuron Perturbation

Because of the problems associated with analog VLSI backpropagation implementations, efforts have been made to find suitable learning algorithms which are independent of the model of the neuron[44]. In a model-free learning algorithm, the functional form of the neuron, as well as the functional form of its derivative, does not enter into the weight update rule. Most of the algorithms developed for analog VLSI implementation utilize stochastic techniques to approximate the gradients rather than to directly compute them. One such implementation[45], called the Madaline Rule III (MR III), approximates a gradient by applying a perturbation to the input of each neuron and then measuring the difference in error, ∆E = E(with perturbation) − E(without perturbation). This error is then used for the weight update rule:

∆wij = −η (∆E/∆neti) xj

where neti is the weighted, summed input to the neuron i that is perturbed, and xj is either a network input or a neuron output from a previous layer which feeds into neuron i. This weight update rule requires access to every neuron input and every neuron output. Furthermore, multiplication/division circuitry would be required to implement the learning rule. However, no explicit derivative hardware is required, and no assumptions are made about the neuron model. For small enough perturbations, a reasonable approximation is achieved. Smaller perturbations provide better approximations to the gradient at the expense of very small steps in the gradient direction, which then requires many extra iterations to find the minimum. Thus, a natural tradeoff between speed and accuracy exists. The method showed good robustness against noise, but studies of the effects of inaccurate derivative approximations were not done. Another variant of this method, called the virtual targets method, which is not model-free, required the use of the neuron function derivatives in the weight update rule[47].

4.1.3 Serial Weight Perturbation

Another implementation which is more optimal for analog implementation involves serially perturbing the weights rather than perturbing the neurons[23]. The weight updates are based on a finite difference approximation to the gradient of the mean squared error with respect to a weight, using a perturbation pertij. The weight update rule is given as follows:

∆wij = −η [E(wij + pertij) − E(wij)] / pertij

This is the forward difference approximation to the gradient. The gradient approximation can also be improved with the central difference approximation:

∆wij = −η [E(wij + pertij/2) − E(wij − pertij/2)] / pertij

However, this requires more feedforward passes of the network to obtain the error terms. Since pert and η are both small constants, a usual implementation would use a weight update rule of the following form:

∆wij = −(η/pert) [E(wij + pertij) − E(wij)]

so that the weight update merely involves changing the weights by the error difference scaled by a small constant. The multiplication/division step is replaced by scaling with a constant. It is clear that this implementation should require simpler circuitry than the neuron perturbation algorithm. Also, there is no need to provide access to the computations of internal nodes. Studies have also shown that this algorithm performs better than neuron perturbation in networks with limited precision weights[46]. However, since the serial weight perturbation technique applies perturbations to the weights one at a time, its computational complexity was shown to be of higher order than that of the neuron perturbation technique.

4.1.4 Summed Weight Neuron Perturbation

The summed weight neuron perturbation approach[24] combines elements from both the neuron perturbation technique and the serial weight perturbation technique. In summed weight neuron perturbation, weight perturbations are applied for each particular neuron. Applying a weight perturbation to each of the weights connected to a particular neuron is equivalent to perturbing the input of that neuron as is done in neuron perturbation. The error is then calculated and weight updates are made using the same weight update rule as for serial weight perturbation. This allows for a reduction in the number of feedforward passes, yet still does not require access to the outputs of internal nodes of the network. However, this speedup is at the cost of increased hardware to store weight perturbations. The weight perturbations are kept at a fixed small size, but the sign of each perturbation is random; thus, the weight perturbation storage can be achieved with 1 bit per weight.

4.1.5 Parallel Weight Perturbation

The parallel weight perturbation techniques[21][25][48] extend the above methods by simultaneously applying perturbations to all weights. The perturbed error is then measured and compared to the unperturbed error. The same weight update rule is then used as in the serial weight perturbation method, and all of the weights are updated in parallel. If there are W total weights in a network, a serial weight perturbation approach gives a factor W reduction in speed over a straight gradient descent approach because the overall gradient is calculated based on W local approximations to the gradient. With parallel weight perturbation, the multiple perturbations are applied locally, and, on average, there is only a factor W^(1/2) reduction in speed over direct gradient descent[25]. Thus, a speedup of W^(1/2) is achieved over serial weight perturbation. This increase in speed may not always be seen in practice because it depends upon certain conditions being met, such as the perturbation size being sufficiently small. This can be better understood if the parallel perturbative method is thought of in analogy to Brownian motion, where there is considerable movement of the network through weight space, but with a general guiding force in the proper direction. It is also possible to improve the convergence properties of the network by averaging over multiple perturbations for each input pattern presentation[21]. This is feasible only if the speed of averaging over multiple perturbations is greater than the speed of applying input patterns, which usually holds true because the input patterns must be obtained externally. Furthermore, convergence rates can be improved by using an annealing schedule where large perturbation sizes are used initially and then reduced.

In this way, large weight changes are discovered first and gradually decreased to get finer and finer resolution in approaching the minimum.

4.1.6 Chain Rule Perturbation

The chain rule perturbation method is a semi-parallel method that mixes ideas from both serial weight perturbation and neuron perturbation[26]. In a 1 hidden layer network, the output layer weights are first updated serially using the weight update formula from the serial weight perturbation method. Next, a hidden layer neuron output, ni, is perturbed by an amount nperti, and a partial error gradient term, ∆E/nperti, is stored. This neuron output perturbation is repeated sequentially and a partial error gradient term stored for every neuron in the hidden layer. Then, each of the weights, wij, feeding into the hidden layer neurons is perturbed in parallel by an amount wpertij, stored for each weight/neuron pair. The resultant change in hidden layer neuron outputs, ∆ni, from this parallel weight perturbation is measured and then used to calculate and store another partial gradient term. Finally, the hidden layer weights are updated in parallel according to the following rule:

∆wij = −η (∆E/nperti)(∆ni/wpertij)

Hence, the name of the method comes from the analogy of the weight update rule to the chain rule in derivative calculus. For more hidden layers, the same hidden layer weight update procedure is repeated. In a network with i inputs, j hidden nodes, and k outputs, serial perturbation requires k·j + j·i perturbations, whereas the chain rule perturbation method requires k·j + j + i perturbations. This method allows a speedup over serial weight perturbation by a factor approaching the number of hidden layer neurons, at the expense of the extra hardware necessary to store the partial gradient terms.
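These perturbation counts can be checked with plain arithmetic for a concrete network size (an illustrative calculation; the 4:3:2 dimensions are arbitrary):

```python
def serial_pert_count(i, j, k):
    """Serial weight perturbation: one perturbation per weight."""
    return k * j + j * i

def chain_rule_pert_count(i, j, k):
    """Chain rule: k*j serial output-weight perturbations, j neuron
    perturbations, then i parallel hidden-weight perturbations."""
    return k * j + j + i

# A 4-input, 3-hidden, 2-output network:
print(serial_pert_count(4, 3, 2))      # 18 perturbations
print(chain_rule_pert_count(4, 3, 2))  # 13 perturbations
```

As the hidden layer grows, the j·i term dominates the serial count while the chain rule count grows only additively, which is the source of the speedup noted above.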

4.1.7 Feedforward Method

The feedforward method is a variant of serial weight perturbation that incorporates a direct search technique[49]. First, in the forward phase, for a weight, wij, the direction of search is found by applying a fraction of the perturbation. For example, if 0.1·pert is applied and the error goes down, then the direction of search is positive; otherwise it is negative. Then, perturbations of size pert continue to be applied in the direction of search until the error does not decrease or some predetermined maximum number of forward perturbations is reached. Next, in the backward phase, a step of size pert/2 in the reverse direction is taken. Successive backwards steps are taken, with each step decreasing by another factor of 2, until the error again goes up or another predetermined maximum number of backward perturbations is reached. These steps are repeated for every weight. This method is effectively a sophisticated annealing schedule for the serial weight perturbation method. Many other variations and combinations of these stochastic learning rules are possible. Most of the stochastic algorithms and implementations concentrate on iterative approaches to weight updates as discussed above, but continuous-time implementations are also possible[50].

4.2 VLSI Neural Network Limitations

4.2.1 Number of Learnable Functions with Limited Precision Weights

Clearly, in a neural network with a small, fixed number of bits per weight, the ability of the network to classify multiple functions is limited. Assume that each weight is comprised of b bits of information and that there are W total weights. The total number of bits in the network is given as B = bW. Since each bit can take on only 1 of 2 values, the total number of different functions that the network can implement is no more than 2^B. In computer simulations of neural networks where floating point precision is used and b is on the order of 32 or 64, this is usually not a limitation to learning. However, if b is a small number like 5, then the ability of the network to learn a specific function may be compromised. Also, 2^B is merely an upper bound. There are many different settings of the weights that will lead to the same function. For example, the hidden neuron outputs can be permuted since the final output of the network does not depend upon the order of the hidden layer neurons. Also, the signs of the outputs of the hidden layer neurons can be flipped by adjusting the weights coming into that neuron and by also flipping the sign of the weight on the output of the neuron to ensure the same output. Taking these symmetries into account, the total number of boolean functions which can be implemented is given approximately by 2^B/(N!·2^N) for a fully connected network with N units in each layer[51].

4.2.2 Trainability with Limited Precision Weights

Several studies have been done on the effect of the number of bits per weight constraining the ability of a neural network to properly train[52][53][54]. Most of these are specific to a particular learning algorithm. In one study, the effects of the number of bits of precision on backpropagation were investigated[52]. The weight precision for a feedforward pass of the network, the weight precision for the backpropagation weight update calculation, and the output neuron quantization were all tested for a range of bits, and the effect of the number of bits on the ability of the network to properly converge was determined. The empirical results showed that a weight precision of 6 bits was sufficient for feedforward operation. For the backpropagation weight update calculation, at least 12 bits were required to obtain reasonable convergence. Also, in one result, neuron output quantization of 6 bits was sufficient. This implies that standard analog VLSI neurons should be sufficient for a VLSI neural network implementation. This also shows that a perturbative algorithm that relies only on feedforward operation can be effective in training with as few as half the number of bits required for a backpropagation implementation. Due to the increased hardware complexity necessary to achieve the extra number of bits, the perturbative approaches will be desirable in analog VLSI implementations.

4.2.3 Analog Weight Storage

Due to the inherent size requirements to store and update digital weights with A/D and D/A converters, some analog VLSI implementations concentrate on reducing the space necessary for weight storage by using analog weights. Usually, this involves storing weights on a capacitor. Since the capacitors slowly discharge, it is necessary to implement some sort of scheme to continuously refresh the weight. One such scheme involves quantization and refresh of the weights by use of A/D/A converters which can be shared by multiple weights[60][61]. Other schemes involve periodic refresh by use of quantized levels defined by input ramp functions or by periodic repetition of input training patterns[59]. Since significant space and circuit complexity overhead exists for capacitive storage of weights, attempts have also been made at direct implementation of nonvolatile analog memory. In one such implementation, the weights are stored permanently on a floating gate transistor which can be updated under UV illumination. The weight multiplication with an input voltage is achieved by biasing the floating gate transistor into the triode region, which produces an output current roughly proportional to the product of its floating gate voltage, representing the weight, and its drain to source voltage, representing the input. Hence, only one floating gate transistor per synapse was required; with a second drive transistor used as a switch to sample the input drive voltage onto a drive electrode in order to adjust the floating gate voltage, a dense array of synapses is achieved with only two transistors required per synapse[64]. The use of UV illumination also allowed a simple weight decay term to be introduced into the learning rule if necessary. Another approach which also implements a dense array of synapses based on floating gates utilized tunneling and injection of electrons onto the floating gate rather than UV illumination[28][29]. A learning rule was developed which balances the effects of the tunneling and injection. In another approach, a 3 transistor analog memory cell was shown to have up to 14 bits of resolution[65] and showed long term 8 bit precision. Some of the weight update speeds obtainable using the tunneling and injection techniques may be limited because rapid writing of the floating gates requires large oxide currents which may damage the oxide. However, these techniques can be combined with dynamic refresh techniques where weights and weight updates are quickly stored and learned on capacitors, but eventually written into nonvolatile analog memories[66].

4.3 VLSI Neural Network Implementations

Many neural network implementations in VLSI can be found in the literature. This includes purely digital approaches, purely analog approaches and hybrid analog-digital approaches. Although the digital implementations are more straightforward to build using standard digital design techniques, they often scale poorly with large networks that require many bits of precision to implement the backpropagation algorithms[57][58]. Nevertheless, some size reductions can be realized by reducing the bit precision and using the stochastic learning techniques available[55][56]. Other techniques use time multiplexed multipliers to reduce the number of multipliers necessary.

An early attempt at implementation of an analog neural network involved making separate VLSI chip modules for the neurons, synapses and routing switches[62]. The modules could all be interconnected and controlled by a computer. The weights were digital and were controlled by a serial weight bus from the computer. The learning algorithms were based on output information from the neurons, and no circuitry for local weight updates was provided. Since the components were modular, the system could be scaled to accommodate larger networks. However, each of the individual circuit blocks was quite large; for example, the circuits for the neuron blocks consisted of three multistage operational amplifiers and three 100k resistors. Hence, this approach was not well suited for dense integration on a single neural network chip.
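The limited-precision counting argument of section 4.2.1 can be made concrete (plain arithmetic; the example sizes are arbitrary, and N!·2^N counts the hidden-unit permutation and sign-flip symmetries):

```python
import math

def max_functions(b, w_count):
    """Upper bound on distinct functions: 2^(b*W) weight settings."""
    return 2 ** (b * w_count)

def approx_distinct_boolean(b, w_count, n):
    """Symmetry-corrected estimate 2^B / (N! * 2^N) for a fully
    connected network with N units per layer."""
    return max_functions(b, w_count) // (math.factorial(n) * 2 ** n)

# 5-bit weights, 9 weights (e.g. a tiny 2:2:1 network with biases):
print(max_functions(5, 9))               # 2^45 weight settings
print(approx_distinct_boolean(5, 9, 2))  # 2^42 distinct functions, roughly
```

Even this small symmetry correction removes a factor of 8, showing why 2^B is only a loose upper bound on learnable functions.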

4.3.1 Contrastive Neural Network Implementations

Some of the earliest VLSI implementations of neural networks were not based on the stochastic weight update algorithms, but on contrastive weight updates[67]. The basic idea behind a contrastive technique is to run the network in two phases. In the first phase, a teacher signal is sent and the outputs of the network are clamped. In the second phase, the network outputs are unclamped and free to change. The weight updates are based on a contrast function that measures the difference between the clamped, teacher phase and the unclamped, free phase.

One of the first fully integrated VLSI neural networks used one of these contrastive techniques and showed rapid training on the XOR function[33]. The learning rule was based on a Boltzmann machine algorithm, which uses bidirectional, feedback connections and stochastic elements; learning has also been demonstrated with asymmetric weights[63]. Thus, the weight update rule is not based on the previously mentioned stochastic weight update rules, but is an example of contrastive learning that requires randomness in order to learn. The chip incorporated a local learning rule. Briefly, the algorithm works by checking correlations between the input and output neuron of a synapse. On each pattern presentation, the network weights are perturbed with noise and the correlations counted. Correlations are checked between a teacher phase, where the output neurons are clamped to the correct output, and a student phase, where the outputs are unclamped. If the teacher and student phases share the same correlation, then no change is made to the weight. If the neurons are correlated in the teacher phase, but not the student phase, then the weight is incremented. If it is reversed, then the weight is decremented. The weights were stored as 5 bit digital words. The network showed good success with a 2:2:2:1 network learning XOR. Analog noise sources were used to produce the random perturbations. Unfortunately, the gains required to amplify thermal noise caused the system to be unstable, and highly correlated oscillations were present between all of the noise sources. Although the system actually needs noise in order to learn, it was necessary to intermittently apply the noise sources in order to facilitate learning. The learning rate was shown to be over 100,000 times faster than that achieved with simulations on a digital computer. This is a good example of a situation where an analog VLSI neural network is very suitable as compared to a digital solution.

Another implementation showed a contrastive version of backpropagation[43]. The basic idea was to also use both a clamped phase and a free phase. The weight updates are calculated in both phases using the standard backpropagation weight update rules. The actual weight update was taken to be the difference between the clamped weight change and the free weight change. The use of the contrastive technique allowed a form of backpropagation despite circuit offsets. However, this implementation was not model free, and a derivative generator was also necessary in order to compute the weight updates. Due to the size of the circuitry, only one layer of the network was integrated on a single chip. The multilayer network was obtained by cascading several chips together. The weights were stored digitally and a serial weight bus was used to set the weights and weight updates. The chip was trained in loop with a computer applying the input patterns, controlling the serial weight bus and reading the output values for the error computation. Successful learning of the XOR function was shown.

4.3.2 Perturbative Neural Network Implementations

A low power chip implementing the summed weight neuron perturbation algorithm has been demonstrated[22]. The chip was designed for implantable applications, which necessitated extremely low power and small area. To reduce power, subthreshold synapses and neurons were used. The synapses consisted of transconductance amplifiers with the weights encoded as a digital word which set the bias current. They also included a few extra transistors to deal with common mode cancelation. Neurons consisted of current to voltage converters using a diode connected transistor. With a 3V supply, only 200 nW was consumed. A 10:6:3 network was trained to distinguish two cardiac arrythmia patterns. The chips showed a success rate with over 95% of patient data.
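The clamped/free correlation rule used by the Boltzmann-style chip described earlier can be expressed as a small pure function (a simplified sign-based sketch; the +/-1 state encoding and unit step size are illustrative assumptions, not the chip's circuit):

```python
def boltzmann_weight_delta(pre, post_teacher, post_student, step=1):
    """Contrastive correlation update for one synapse.

    pre, post_*: +/-1 neuron states. If the pre/post correlation is the
    same in the teacher (clamped) and student (free) phases, the weight
    is unchanged; if correlated only in the teacher phase it is
    incremented; if correlated only in the student phase, decremented.
    """
    corr_teacher = pre * post_teacher
    corr_student = pre * post_student
    if corr_teacher == corr_student:
        return 0
    return step if corr_teacher > corr_student else -step

print(boltzmann_weight_delta(+1, +1, -1))  # teacher-only correlation -> +1
print(boltzmann_weight_delta(+1, -1, +1))  # student-only correlation -> -1
print(boltzmann_weight_delta(+1, +1, +1))  # same correlation -> 0
```

The update depends only on locally available pre- and post-synaptic states in the two phases, which is what made a fully local on-chip learning rule possible.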

Another perturbative implementation was used for training a recurrent neural network on a continuous time trajectory[39]. A parallel perturbative method was used with analog volatile weights and local weight refresh circuitry. The analog weights were stored dynamically, and provisions were made on the chip for nonvolatile analog storage using tunneling and injection, but its use was not demonstrated. Some of the weight refresh circuitry was also implemented off chip. The weight updates were computed locally. Two counterpropagating linear feedback shift registers, which were combined with a network of XORs, were used to digitally generate the random perturbations. The size of the perturbations was externally settable. The global functions of error accumulation and comparison were performed off chip. The network was successfully trained to generate two quadrature phase sinusoidal outputs. However, the error function during learning showed some very peculiar behavior. Rather than gradually decreasing to a small value, the error actually increased steadily, then entered large oscillations between the high error and near zero error. Furthermore, all perturbations seemed to be of the same sign. After the oscillations, the error was close to zero and stayed there. The authors attribute this strange and erratic behavior to the fact that a large learning rate and a large perturbation size were used.

The chain rule perturbation algorithm has also been successfully implemented on chip[66]. The weight updates were computed locally, thus not requiring any specialized pseudorandom number generation. Many of the global functions such as error computation were also performed on the chip, and only a few control signals from a computer were necessary for its operation. The network was shown to successfully train on XOR and other functions.
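Perturbation signs of the kind generated by the LFSR scheme just described can be sketched in a few lines (a generic Fibonacci LFSR; the register lengths and tap polynomials actually used on the chip are not specified here):

```python
def lfsr_bits(seed, taps, nbits, count):
    """Generate pseudorandom bits from a Fibonacci linear feedback
    shift register; taps are 1-indexed bit positions from the LSB."""
    state = seed
    for _ in range(count):
        yield state & 1
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = (state >> 1) | (fb << (nbits - 1))

# A maximal-length 4-bit LFSR (taps 1 and 2) steps through all 15
# nonzero states, giving a balanced, repeatable stream of signs:
bits = list(lfsr_bits(0b1000, (1, 2), 4, 15))
signs = [+1 if b else -1 for b in bits]
print(signs)
```

Combining two such counterpropagating registers through XORs, as the chip does, decorrelates the sign streams delivered to neighboring weights while keeping the hardware purely digital.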

Both the parallel and serial methods can potentially benefit from the use of annealing the perturbation. Initially, large perturbations are applied to move the weights quickly towards a minimum. Then, the perturbation sizes are occasionally decreased to achieve finer selection of the weights and a smaller error.

Chapter 5

VLSI Neural Network with Analog Multipliers and a Serial Digital Weight Bus

Training an analog neural network directly on a VLSI chip provides additional benefits over using a computer for the initial training and then downloading the weights. The analog hardware is prone to have offsets and device mismatches. By training with the chip in the loop, the neural network will also learn these offsets and adjust the weights appropriately to account for them. A VLSI neural network can be applied in many situations requiring fast, low power operation, such as handwriting recognition for PDAs or pattern detection for implantable medical devices[22].

There are several issues that must be addressed to implement an analog VLSI neural network chip. First, an appropriate algorithm suitable for VLSI implementation must be found. Traditional error backpropagation approaches for neural network training require too many bits of floating point precision to implement efficiently in an analog VLSI chip. Techniques that are more suitable involve stochastic weight perturbation[21],[22],[23],[24],[25],[26],[27], where a weight is perturbed in a random direction, its effect on the error is determined, and the perturbation is kept if the error was reduced; otherwise, the old weight is restored. In this approach, the network observes the gradient rather than actually computing it. In general, optimized gradient descent techniques converge more rapidly than the perturbative techniques. Serial weight perturbation[23] involves perturbing each weight sequentially. This requires a number of iterations that is directly proportional to the number of weights. A significant speed-up can be obtained by perturbing all weights randomly in parallel, measuring the effect on the error, and keeping them all if the error reduces.

Next, the issue of how to appropriately store the weights on-chip in a non-volatile manner must be addressed. If the weights are simply stored as charge on a capacitor, they will ultimately decay due to parasitic conductance paths. One method would be to use an analog memory cell[28],[29]. This would allow directly storing the analog voltage value. However, this technique requires using large voltages to obtain tunneling and/or injection through the gate oxide and is still being investigated. Another approach would be to use traditional digital storage with EEPROMs. This would then require having A/D/A (one A/D and one D/A) converters for the weights. A single A/D/A converter would only allow a serial weight perturbation scheme that would be slow. A parallel scheme, which would perturb all weights at once, would require one A/D/A per weight. One alternative would remove the A/D requirement by replacing it with a digital counter to adjust the weight values. This would then require one digital counter and one D/A per weight. This would be faster, but would require more area.

5.1 Synapse

A small synapse with one D/A per weight can be achieved by first making a binary weighted current source (Figure 5.1) and then feeding the binary weighted currents into diode connected transistors to encode them as voltages. These voltages are then fed to transistors on the synapse to convert them back to currents. Thus, many D/A converters are achieved with only one binary weighted array of transistors. It is clear that the linearity of the D/A will be poor because of matching errors between the current source array and synapses, which may be located on opposite sides of the chip. This is not a concern because the network will be able to learn around these offsets.

Figure 5.1: Binary weighted current source circuit

Figure 5.2: Synapse circuit

The synapse[26],[22] is shown in figure 5.2. The synapse performs the weighting of the inputs by multiplying the input voltages by a weight stored in a digital word denoted by b0 through b5. The sign bit, b5, changes the direction of current to achieve the appropriate sign. In the subthreshold region of operation, the transistor equation is given by[30]

$$I_d = I_{d0} e^{\kappa V_{gs}/U_t}$$

and the output of the synapse is given by[30],[22]

$$\Delta I_{out} = I_{out+} - I_{out-} = \begin{cases} +I_0 W \tanh\left(\dfrac{\kappa (V_{in+} - V_{in-})}{2U_t}\right), & b5 = 1 \\[1ex] -I_0 W \tanh\left(\dfrac{\kappa (V_{in+} - V_{in-})}{2U_t}\right), & b5 = 0 \end{cases}$$

where W is the weight of the synapse encoded by the digital word and I_0 is the least significant bit (LSB) current. In the subthreshold linear region, the output is approximately given by

$$\Delta I_{out} \approx g_m \Delta V_{in} = \frac{\kappa I_0 W}{2U_t} \Delta V_{in}$$

In the above threshold regime, the transistor equation in saturation is approximately given by

$$I_D \approx K(V_{gs} - V_t)^2$$

The synapse output is no longer described by a simple tanh function, but is nevertheless still sigmoidal with a wider "linear" range. In the above threshold linear region, the output is approximately given by

$$\Delta I_{out} \approx g_m \Delta V_{in} = 2\sqrt{K I_0 W}\,\Delta V_{in}$$

Figure 5.3 shows the differential output current of the synapse as a function of differential input voltage for various digital weight settings. The input current of the binary weighted current source was set to 100 nA. The output currents range from the subthreshold region for the smaller weights to the above threshold region for the large weights.

Figure 5.3: Synapse differential output current as a function of differential input voltage for various digital weight settings

All of the curves show their sigmoidal characteristics, and it is clear the width of the linear region increases as the current moves from subthreshold to above threshold. For the smaller weights, W = 4, the linear region spans only approximately 0.2V, while for the largest weights, W = 31, when the current range moves above threshold, the linear range has expanded to roughly 0.8V. As was discussed above, the synapse does not perform a pure linear weighting of the input voltage. The largest synapse output current is not 3.1µA, as would be expected from a linear weighting of 31 × 100nA, but a smaller number. Notice that the zero crossing of ΔIout occurs slightly positive of ΔVin = 0. This is a circuit offset that is primarily due to slight W/L differences of the differential input pair of the synapse, and it is caused by minor fabrication variations. Again, since the weights are learned on chip, they will be adjusted accordingly to the necessary value. Furthermore, it is possible that some synapses will operate below threshold while others above, depending on the choice of LSB current. However, on-chip learning will be able to set the weights to account for these different modes of operation.

5.2 Neuron

The synapse circuit outputs a differential current that will be summed in the neuron circuit shown in figure 5.4. The neuron circuit performs the summation from all of the input synapses. The neuron circuit then converts the currents back into a differential voltage feeding into the next layer of synapses. Since the outputs of the synapse will all have a common mode component, it is important for the neuron to have common mode cancelation[22]. Since one side of the differential current inputs may have a larger share of the common mode current, it is important to distribute this common mode to keep both differential currents within a reasonable operating range.

Figure 5.4: Neuron circuit

$$I_{in+} = I_{in+d} + \frac{I_{in+d} + I_{in-d}}{2} = I_{in+d} + I_{cmd}$$

$$I_{in-} = I_{in-d} + \frac{I_{in+d} + I_{in-d}}{2} = I_{in-d} + I_{cmd}$$

$$\Delta I = I_{in+} - I_{in-} = I_{in+d} - I_{in-d}$$

$$I_{cm} = \frac{I_{in+} + I_{in-}}{2} = \frac{I_{in+d} + I_{in-d} + 2I_{cmd}}{2} = 2I_{cmd}$$

$$\Rightarrow I_{in+d} = I_{in+} - \frac{I_{cm}}{2} = +\frac{\Delta I}{2} + \frac{I_{cm}}{2}$$

$$\Rightarrow I_{in-d} = I_{in-} - \frac{I_{cm}}{2} = -\frac{\Delta I}{2} + \frac{I_{cm}}{2}$$

If ΔI is of equal size or larger than I_cm, the transistor with I_in-d may begin to cut off and the above equations would not exactly hold. However, the current cutoff is graceful and should not normally affect performance.

With the common mode signal properly equalized, the differential currents are then mirrored into the current to voltage transformation stage. This stage effectively takes the differential input currents and uses a transistor in the triode region to provide a differential output. This stage will usually be operating above threshold, because the Voffset and Vcm controls are used to ensure that the output voltages are approximately mid-rail. Having the outputs mid-rail is important for proper biasing of the next stage of synapses. This is done by simply adding additional current to the diode connected transistor stack. The above threshold transistor equation in the triode region is given by

$$I_d = 2K\left(V_{gs} - V_t - \frac{V_{ds}}{2}\right)V_{ds} \approx 2K(V_{gs} - V_t)V_{ds}$$

for small enough V_ds. If K_1 denotes the prefactor with W/L of the cascode transistor and K_2 denotes the same for the transistor with gate V_out, the voltage output of the neuron will then be given by

$$I_{in} = K_1(V_{gain} - V_t - V_1)^2 \;\Rightarrow\; V_1 = V_{gain} - V_t - \sqrt{\frac{I_{in}}{K_1}}$$

$$I_{in} \approx 2K_2(V_{out} - V_t)V_1 = 2K_2(V_{out} - V_t)\left(V_{gain} - V_t - \sqrt{\frac{I_{in}}{K_1}}\right)$$

$$\Rightarrow V_{out} = \frac{I_{in}}{2K_2(V_{gain} - V_t) - 2\sqrt{\frac{K_2^2}{K_1}I_{in}}} + V_t$$

If W_1/L_1 = W_2/L_2, then K_1 = K_2 = K and

$$V_{out} = \frac{I_{in}}{2K(V_{gain} - V_t) - 2\sqrt{K I_{in}}} + V_t$$

$$R \approx \frac{1}{2K(V_{gain} - V_t)}$$

for small I_in. Thus, it is clear that V_gain can be used to adjust the effective gain of the stage.

Figure 5.5 shows how the neuron differential output voltage, ΔVout, varies as a function of differential input current for several values of V_gain. The neuron shows fairly linear performance with a sharp bend on either side of the linear region. This sharp bend occurs when one of the two linearized, diode connected transistors with gate attached to V_out turns off. The diode connected transistor with gate attached to V_out+ turns off where the output goes flat on the left side of the curves. This corresponds to the left bend point in figure 5.5. Figure 5.6 displays only the positive output, V_out+, of the neuron, for various values of V_offset. The baseline output voltage corresponds roughly to V_offset. However, as V_offset increases in value, its ability to increase the baseline voltage is reduced because of the cascode transistor on its drain. At some point, especially for small values of V_gain, the V_cm transistor becomes necessary to provide additional offset. Overall, the neuron shows very good linear current to voltage conversion with separate gain and offset controls.
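The current-to-voltage expression derived above can be evaluated numerically. The values of K, Vt, and Vgain below are assumed for illustration only; the check confirms that for small input currents the effective resistance approaches 1/(2K(Vgain − Vt)):

```python
import math

K, VT = 50e-6, 0.8      # assumed prefactor (A/V^2) and threshold voltage (V)

def neuron_vout(Iin, Vgain):
    """Vout = Iin / (2K(Vgain - Vt) - 2*sqrt(K*Iin)) + Vt, for K1 = K2 = K."""
    return Iin / (2 * K * (Vgain - VT) - 2 * math.sqrt(K * Iin)) + VT

Vgain = 2.0
R_small = 1.0 / (2 * K * (Vgain - VT))   # predicted small-signal resistance

# Numerical slope dVout/dIin at small input current
slope = (neuron_vout(2e-9, Vgain) - neuron_vout(1e-9, Vgain)) / 1e-9
```

The square-root term in the denominator grows with Iin, so the conversion gain rises slightly at larger currents, which is consistent with the gentle curvature of figure 5.5.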

Figure 5.5: Neuron differential output voltage, ΔVout, as a function of ΔIin, for various values of Vgain

Figure 5.6: Neuron positive output, Vout+, as a function of ΔIin, for various values of Voffset

5.3 Serial Weight Bus

A serial weight bus is used to apply weights to the synapses. The weight bus merely consists of a long shift register to cover all of the possible weight and threshold bits. The input to the serial weight bus comes from a host computer which implements the learning algorithm. Figure 5.7 shows the circuit schematic of the shift register used to implement the weight bus. The shift register is controlled by a dual phase nonoverlapping clock. During the read phase of operation, when φ1 is high, the data from the previous register is read into the first latch. Then, during the write phase of operation, when φ2 is high, the bit is written to the second latch for storage. The last output bit of the shift register, Q5, feeds into the data input, D, of the next shift register to form a long compound shift register.

The shift register storage cell is accomplished by using two staticizer or jamb latches[34] in series. The latch consists of two interlocked inverters. The weak inverter, labeled with a W in the figure, is made sufficiently weak such that a signal coming from the main inverter and passing through the pass gate is sufficiently strong to overpower the weak inverter and set the state. Once the pass gate is turned off, the weak inverter keeps the current bit state.

Figure 5.8 shows the nonoverlapping clock generator[35]. A single CLK input comes from an off chip clock. The clock generator ensures that the two clock phases do not overlap in order that no race conditions exist between reading and writing. The delay inverters are made weak by making the transistors long. They set the appropriate delay such that after one clock transitions from high to low, the next one does not transition from low to high until after the small delay. Since the clock lines go throughout the chip and drive long shift registers, it is necessary to make the buffer transistors with large W/L ratios to drive the large capacitances on the output.
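The compound shift register can be sketched at the bit level. The register count and weight words below are arbitrary examples, not the actual chip configuration; one call to `shift_in` stands for one two-phase clock cycle:

```python
from collections import deque

def shift_in(chain, bit):
    """One two-phase clock cycle: a bit enters at D, the whole chain shifts."""
    chain.appendleft(bit)
    return chain.pop()          # bit leaving Q5 of the last register

N_SYNAPSES = 3
bus = deque([0] * (6 * N_SYNAPSES))   # three chained 6-bit registers

# Shift three 6-bit weight words through the bus, LSB first; the first
# word shifted in ends up in the register farthest down the chain.
words = [0b100001, 0b010101, 0b111111]
for word in words:
    for k in range(6):
        shift_in(bus, (word >> k) & 1)

# Read back each register's contents as an integer (MSB first)
loaded = [int("".join(str(b) for b in list(bus)[6 * i:6 * i + 6]), 2)
          for i in range(N_SYNAPSES)]
```

Loading all weights takes one clock cycle per bit (18 cycles here), which is the serial bottleneck motivating the parallel update circuitry of chapter 6.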

Figure 5.7: Serial weight bus shift register

Figure 5.8: Nonoverlapping clock generator

5.4 Feedforward Network

Using the synapse and neuron circuit building blocks it is possible to construct a multilayer feed-forward neural network. Figure 5.9 shows a 2 input, 2 hidden unit, 1 output feedforward network for training of XOR. Note that the nonlinear squashing function is actually performed in the next layer of synapse circuits rather than in the neuron as in a traditional neural network. The biases, or thresholds, for each neuron are simply implemented as synapses tied to fixed bias voltages. The biases are learned in the same way as the weights.

For digital functions, the inputs need not be constrained, as the synapses will pass roughly the same current regardless of whether the digital inputs are at the flat part of the synapse curve near the linear region or all the way at the end of the flat part of the curve. Furthermore, for non-differential digital signals, it is possible to simply tie the negative input to mid-rail and apply the standard digital signal to the positive synapse input. This is equivalent as long as the inputs to the first layer are kept within the linear range of the synapses.

Also, depending on the type of network outputs desired, additional circuitry may be needed for the final squashing function. In figure 5.9, a differential to single ended converter is shown on the output. Figure 5.10 shows a schematic of a differential to single ended converter and a digital buffer. The differential to single ended converter is based on the same transconductance amplifier design used for the synapse. The Vb knob is used to set the bias current of the amplifier. This bias current then controls the gain of the converter. The gain is also controlled by sizing of the transistors. The gain of this converter determines the size of the linear region for the final output. Normally, during training, a somewhat linear output with low gain is desired to have a reasonable slope to learn the function on. Then, after training, it is possible to take the output after a dual inverter digital buffer to get a strong standard digital signal to send off-chip or to other sections of the chip. However, if a roughly linear output is desired, the differential output can be taken directly from the neuron outputs.

Figure 5.9: 2 input, 2 hidden unit, 1 output neural network

Figure 5.10: Differential to single ended converter and digital buffer

5.5 Training Algorithm

The neural network is trained by using a parallel perturbative weight update rule[21]. The perturbative technique requires generating random weight increments to adjust the weights during each iteration. These random perturbations are then applied to all of the weights in parallel. In batch mode, all input training patterns are applied and the error is accumulated. This error is then checked to see if it was higher or lower than the unperturbed iteration. If the error is lower, the perturbations are kept; otherwise they are discarded. This process repeats until a sufficiently low error is achieved. An outline of the algorithm is given in figure 5.11. Since the weight updates are calculated offline, other suitable algorithms may also be used. For example, it is possible to apply an annealing schedule wherein large perturbations are initially applied and gradually reduced as the network settles.

    Initialize Weights.
    Get Error.
    while (Error > Error Goal)
        Perturb Weights.
        Get New Error.
        if (New Error < Error)
            Weights = New Weights.
            Error = New Error.
        else
            Restore Old Weights.
        end
    end

Figure 5.11: Parallel perturbative algorithm

5.6 Test Results

A chip implementing the above circuits was fabricated in a 1.2µm CMOS process. All synapse and neuron transistors were 3.6µm/3.6µm to keep the layout small. The unit size current source transistors were also 3.6µm/3.6µm. An LSB current of 100nA was chosen for the current source.

The above neural network circuits were trained with some simple digital functions such as 2 input AND and 2 input XOR. Since the training was done on digital functions, a differential to single ended converter was placed on the output of the final neuron. This was simply a 5 transistor transconductance amplifier. The error voltages were calculated as a total sum voltage error over all input patterns at the output of the transconductance amplifier. Since Vdd was 5V, the output would only easily move to within about 0.5V from Vdd because the transconductance amplifier had low gain. Thus, when the error gets to around 2V, it means that all of the outputs are within about 0.5V from their respective rail and functionally correct. A double inverter buffer can be placed at the final output to obtain good digital signals.

The results of some training runs are shown in figures 5.12-5.14. The weights are initialized as small random numbers. The weight values can go from -31 to +31 because of the 6 bit D/A converters used on the synapses. At the beginning of each of the training runs, the error voltage starts around or over 10V, indicating that at least 2 of the input patterns give an incorrect output. As can be seen from the figures, the network weights slowly converge to a correct solution. As with gradient techniques, occasional training runs resulted in the network getting stuck in a local minimum and the error would not go all the way down.

Figure 5.12 shows the results from a 2 input, 1 output network learning an AND function. This network has only 2 synapses and 1 bias for a total of 3 weights. The weights slowly diverge and the error monotonically decreases until the function is learned. Figure 5.13 shows the results of training a 2 input, 2 hidden unit, 1 output network with the XOR function.

Figure 5.12: Training of a 2:1 network with AND function

Figure 5.13: Training of a 2:2:1 network with XOR function starting with small random weights

An ideal neural network with weights appropriately chosen to implement the XOR function is shown in figure 5.14. The inputs for the network and neuron outputs are chosen to be (-1, +1), as opposed to (0, 1). The neurons are assumed to be hard thresholds. This choice is made because the actual synapses which implement the squashing function give maximum outputs of ±I0W. The function computed by each of the neurons is given by

$$Out = \begin{cases} +1, & \text{if } \left(\sum_i W_i X_i\right) + t \ge 0 \\ -1, & \text{if } \left(\sum_i W_i X_i\right) + t < 0 \end{cases}$$

Figure 5.14: Network with "ideal" weights chosen for implementing XOR

Table 5.1 shows the inputs, intermediate computations, and outputs of the idealized network.

    X1   X2 |     A     |     B     |     C
    -1   -1 | -20 => -1 | -60 => -1 | -10 => -1
    -1   +1 | +20 => +1 | -20 => -1 | +10 => +1
    +1   -1 | +20 => +1 | -20 => -1 | +10 => +1
    +1   +1 | +60 => +1 | +20 => +1 | -30 => -1

Table 5.1: Computations for network with "ideal" weights chosen for implementing XOR

It is clear that the XOR function is properly implemented. The weights were chosen to be within the range of possible weight values, -31 to +31. Also, weights of large absolute value within the possible range were chosen (20 as opposed to 1), because small weights would be within the linear region of the synapses, and small weights such as ±1 might tend to get flipped due to circuit offsets. This ideal network would perform perfectly well with smaller weights, without any change in the network function. For example, all of the input weights could be set to 1 as opposed to 20, and the A and B thresholds would then be set to 1 and -1 respectively.

Figure 5.15 shows the 2:2:1 network trained with XOR, but with the initial weights chosen as the mathematically correct weights for the ideal synapses and neurons. A SPICE simulation was done on the actual circuit using these ideal weights and the outputs were seen to be the correct XOR outputs. Although the ideal weights should, both in theory and based on simulations, start off with correct outputs, the offsets and mismatches of the circuit cause the outputs of the actual network to be incorrect. However, since the weights start near where they should be, the error goes down rapidly to the correct solution. This is an example of how a more complicated network could be trained on computer first to obtain good initial weights and then the training could be completed with the chip in the loop.

Figure 5.15: Training of a 2:2:1 network with XOR function starting with "ideal" weights

However, for more complicated networks, using a more sophisticated model of the synapses and neurons that more closely approximates the actual circuit implementation would be advantageous for computer pretraining.
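The behavior described in this chapter can be sketched in software. The first half below checks the idealized hard-threshold network of figure 5.14 against Table 5.1; the second half runs the figure 5.11 perturbative loop on a small tanh network standing in for the chip. The perturbation size, iteration count, and seed are arbitrary choices, and the tanh model is only a stand-in for the actual synapse and neuron circuits:

```python
import math
import random

# --- Idealized hard-threshold XOR network (figure 5.14 / table 5.1) ---
def threshold_unit(weights, inputs, t):
    s = sum(w * x for w, x in zip(weights, inputs)) + t
    return s, (1 if s >= 0 else -1)

xor_outputs = []
for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    _, a = threshold_unit((20, 20), (x1, x2), 20)     # hidden unit A
    _, b = threshold_unit((20, 20), (x1, x2), -20)    # hidden unit B
    _, c = threshold_unit((10, -20), (a, b), -20)     # output unit C
    xor_outputs.append(c)

# --- Parallel perturbative training (figure 5.11) on a 2-2-1 tanh net ---
random.seed(7)
PATTERNS = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

def forward(w, x):
    a = math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])
    b = math.tanh(w[3] * x[0] + w[4] * x[1] + w[5])
    return math.tanh(w[6] * a + w[7] * b + w[8])

def batch_error(w):
    # batch mode: accumulate error over all input patterns
    return sum(abs(t - forward(w, x)) for x, t in PATTERNS)

w = [random.uniform(-1, 1) for _ in range(9)]   # small random initial weights
err0 = err = batch_error(w)
for _ in range(20000):
    trial = [wi + random.uniform(-0.1, 0.1) for wi in w]  # perturb in parallel
    trial_err = batch_error(trial)
    if trial_err < err:             # keep only error-reducing perturbations;
        w, err = trial, trial_err   # otherwise the old weights are restored
```

As in the chip experiments, the loop only compares total batch error before and after each parallel perturbation, so no gradient is ever computed explicitly.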

5.7

Conclusions

A VLSI implementation of a neural network has been demonstrated. Digital weights are used to provide stable weight storage. Analog multipliers are used because full digital multipliers would occupy considerable space for large networks. Although the functions learned were digital, the network is able to accept analog inputs and provide analog outputs for learning other functions. A parallel perturbation technique was used to train the network successfully on the 2-input AND and XOR functions.


Chapter 6

A Parallel Perturbative VLSI Neural Network

A fully parallel perturbative algorithm cannot truly be realized with a serial weight bus, because the act of changing the weights is performed by a serial operation. Thus, it is desirable to add circuitry to allow for parallel weight updates. First, a method for applying random perturbation is necessary. The randomness is necessary because it deﬁnes the direction of search for ﬁnding the gradient. Since the gradient is not actually calculated, but observed, it is necessary to search for the downward gradient. It is possible to use a technique which does a nonrandom search. However, since no prior information about the error surface is known, in the worst case a nonrandom technique would spend much more time investigating upward gradients which the network would not follow. A conceptually simple technique of generating random perturbations would be to amplify the thermal noise of a diode or resistor (ﬁgure 6.1). Unfortunately, the extremely large value of gain required for the ampliﬁer makes the ampliﬁer susceptible to crosstalk. Any noise generated from neighboring circuits would also get ampliﬁed. Since some of this noise may come from clocked digital sections, the noise would become very regular, and would likely lead to oscillations rather than the uncorrelated noise sources that are desired. Such a scheme was attempted with separate thermal noise generators for each neuron[33]. The gain required for the ampliﬁer was nearly 1 million and highly correlated oscillations of a few MHz were observed among all the noise generators. Therefore, another technique is required. Instead, the random weight increments can be generated digitally with linear feedback shift registers that produce a long pseudorandom sequence. These random bits are used as inputs to a counter that stores and updates the weights. The counter

69


Figure 6.1: Ampliﬁed thermal noise of a resistor


Figure 6.2: 2-tap, m-bit linear feedback shift register

outputs go directly to the D/A converter inputs of the synapses. If the weight updates led to a reduction in error, the update is kept. Otherwise, an inverter block is activated which inverts the counter inputs coming from the linear feedback shift registers. This has the effect of restoring the original weights. A block diagram of the full neural network circuit function is provided in figure 6.8.

6.1

Linear Feedback Shift Registers

The use of linear feedback shift registers for generating pseudorandom sequences has been known for some time[31]. A simple 2 tap linear feedback shift register is shown in ﬁgure 6.2. A shift register with m bits is used. In the ﬁgure shown, an XOR is performed from bit locations m and n and its output is fed back to the input of the shift register. In general, however, the XOR can be replaced with a multiple input parity function from any number of input taps. If one of the feedback taps is not

from the last bit, m, but instead the largest feedback tap number is from some earlier bit, l, then this is equivalent to an l-bit linear feedback shift register and the output is merely delayed by the extra m − l shift register bits.

It is important to note that not all feedback tap selections will lead to maximal length sequences. A maximal length sequence from such a linear feedback shift register would be of length 2^m − 1 and includes all 2^m possible patterns of 0's and 1's with the exception of the all 0's sequence. The all 0's sequence is considered a dead state, because XOR(0, 0) = 0 and the sequence never changes. Table 6.1 shows an example of a maximal length sequence by choosing feedback taps of positions 4 and 3 from a 4 bit LFSR. It is clear that as long as the shift register does not begin in the all 0's state, it will progress through all 2^m − 1 = 15 sequences and then repeat.

Table 6.1: LFSR with m=4, n=3 yielding 1 maximal length sequence

Table 6.2 shows an example where non-maximal length sequences are possible by utilizing feedback taps 4 and 2 from a 4 bit LFSR, yielding 3 possible subsequences taken from the final bit. In this situation it is possible to enter one of three possible subsequences. The initial state will determine which subsequence is entered. The subsequence lengths are of size 6, 6, and 3. The sum of the sizes of the subsequences equals the size of a maximal length sequence. There is no overlap of the subsequences, so it is not possible to transition from one to another.

Table 6.2: LFSR with m=4, n=2

Thus, it is important to choose the correct feedback taps. It is desirable to use the maximal length sequences because they allow the longest possible pseudorandom sequence for the given complexity of the LFSR. Tables have been tabulated which describe the lengths of possible sequences and subsequences based on LFSR lengths and feedback tap locations[31]. Unfortunately, LFSRs of certain lengths cannot achieve maximal length sequences with only 2 feedback tap locations. For example, for m = 8, it is required to have XOR(4, 5, 6) to get a length 255 sequence, and, for m = 16, it is required to have an XOR(4, 13, 15) to obtain the length 65535 sequence[32]. Table 6.3 provides a list of LFSRs that can obtain maximal length sequences with only 2 feedback taps[32]. It is clear that it is possible to make very long pseudorandom sequences with moderately sized shift registers and a single XOR gate.

    m    n    Length
    3    2    7
    4    3    15
    5    3    31
    6    5    63
    7    6    127
    9    5    511
    10   7    1023
    11   9    2047
    15   14   32767
    17   14   131071
    18   11   262143
    20   17   1048575
    21   19   2097151
    22   21   4194303
    23   18   8388607
    25   22   33554431
    28   25   268435455
    29   27   536870911
    31   28   2147483647
    33   20   8589934591
    35   33   34359738367
    36   25   68719476735
    39   35   549755813887

Table 6.3: Maximal length LFSRs with only 2 feedback taps

6.2 Multiple Pseudorandom Noise Generators

From the previous discussion it is clear that linear feedback shift registers are a useful technique for generating pseudorandom noise. In applications where analog noise is desired, it is possible to low pass filter the output of the digital pseudorandom noise sequence. This is done by using a filter corner frequency which is much less than the clock frequency of the shift register. However, a parallel perturbative neural network requires as many uncorrelated noise sources as there are weights, and an LFSR only provides one such noise source. It is not possible to use the different taps of a single LFSR as separate noise sources, because these taps are merely the same noise pattern offset in time and thus highly correlated. One solution would be to use one LFSR with different feedback taps and/or initial states for every noise source required. For large networks with long training times, this approach becomes prohibitively expensive in terms of area and possibly power required to implement the scheme. Because of this, several approaches have been taken to resolve the problem, many of which utilize cellular automata.

In one approach[36], they build from a standard LFSR with the addition of an XOR network with inputs coming from the taps of an LFSR and with outputs providing the multiple noise sources. First, the analysis begins with an LFSR consisting of shift register cells numbering a_0 to a_{N−1}, with the cell marked a_0 receiving the feedback. Each of the N units in the LFSR except for the a_0 unit is given by

$$a_i^{t+1} = a_{i-1}^t$$

The first unit is described by

$$a_0^{t+1} = \bigoplus_{i=0}^{N-1} c_i a_i^t$$

where ⊕ represents the modulo 2 summation and the c_i are the boolean feedback coefficients, where a coefficient of 1 represents a feedback tap at that cell. The outputs are given by

$$g^t = \bigoplus_{i=0}^{N-1} d_i a_i^t$$

A detailed discussion is given as to the appropriate choice of the c_i and d_i coefficients to ensure good noise properties.

In another approach a standard boolean cellular automaton is used[37][38]. The boolean cellular automaton is comprised of N unit cells, a_i, wherein their state at time t + 1 depends only on the states of all of the unit cells at time t.

n = 6 and m = 6.. at −1 i 0 1 N An analysis was performed to choose a cellular automaton with good independence between output bits.4 Up/down Counter Weight Storage Elements The weights in the network are represented directly as the bits of an up/down digital counter.3 Multiple Pseudorandom Bit Stream Circuit Another simpliﬁed approach utilizes two counterpropagating LFSRs with an XOR network to combine outputs from diﬀerent taps to obtain uncorrelated noise[39]. For example.4 shows two LFSRs with m = 7. at . The output bits of the counter feed directly into the digital input word weight bits of the synapse circuit.5)[34]. It is possible to obtain more channels and larger sequence lengths with the use of larger LFSRs. Since the two LFSRs are counterpropagating. Updating the weights becomes a simple matter of incrementing or decrementing the counter to the desired value. . n = 5. unbiased... . the bits are combined to provide 9 pseudorandom bit sequences.74 at+1 = f at . bits. The counter used is a standard synchronous up/down counter using full adders and registers (ﬁgure 6. ﬁgure 6. 6. every 4th bit is chosen as an output to ensure independent. This was the scheme that was ultimately implemented in hardware. the length of the output sequences obtained from the XOR network are (26 − 1)(27 − 1) = 8001 bits[31]. In the ﬁgure. The following transition rule was chosen: at+1 = (at + an ) ⊕ at ⊕ at i i+1 i i−1 i−1 Furthermore. 6.
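The quoted sequence length can be checked numerically with small registers (a sketch assuming toy sizes n = 3 and m = 4 rather than the fabricated hardware): XORing two maximal-length streams whose periods 2^n − 1 and 2^m − 1 are coprime gives a combined period equal to their product.

```python
def lfsr_bits(m, n, count, seed=1):
    # LSB-per-clock output of a two-tap Fibonacci LFSR with taps m and n
    mask = (1 << m) - 1
    state = seed & mask
    out = []
    for _ in range(count):
        out.append(state & 1)
        fb = ((state >> (m - 1)) ^ (state >> (n - 1))) & 1
        state = ((state << 1) | fb) & mask
    return out

def seq_period(seq):
    # smallest p such that the whole sequence repeats with period p
    for p in range(1, len(seq)):
        if all(seq[i] == seq[i % p] for i in range(len(seq))):
            return p
    return len(seq)

N = 3 * 105                       # several repetitions of the expected period
a = lfsr_bits(3, 2, N)            # period 2^3 - 1 = 7
b = lfsr_bits(4, 3, N)            # period 2^4 - 1 = 15
combined = [x ^ y for x, y in zip(a, b)]
# expected combined period: (2^3 - 1)(2^4 - 1) = 7 * 15 = 105
```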

Figure 6.4: Dual counterpropagating LFSR multiple random bit generator (two 25 bit shift registers producing outputs r0-r8).

Figure 6.5: Synchronous up/down counter (6 bit chain of full adder and register stages with down/up control and overflow output, bits b0-b5).

Figure 6.6: Full adder for up/down counter.

The full adder is shown in figure 6.6[34]. Since every weight requires 1 counter, and every counter requires 1 full adder for each bit, the adders take up a considerable amount of room. Thus, the adder is optimized for space instead of for speed. Most of the adder transistors are made close to minimum size. The register cell (figure 6.7) used in the counter is simply a 1 bit version of the serial weight bus shown in figure 5.7. This register cell is both small and static, since the output must remain valid for standard feedforward operation after learning is complete.

Figure 6.7: Counter static register

6.5 Parallel Perturbative Feedforward Network

Figure 6.8 shows a block diagram of the parallel perturbative neural network circuit operation. The synapse, binary weighted encoder and neuron circuits are the same as those used for the serial weight bus neural network. However, instead of the synapses interfacing with a serial weight bus, counters with their respective registers are used to store and update the weights.

6.6 Inverter Block

The counter up/down inputs originate in the pseudorandom bit generator and pass through an inverter block (figure 6.9). The inverter block is essentially composed of pass gates and inverters. If the invert signal is low, the bit passes unchanged. If the invert bit is high, then the inversion of the pseudorandom bit gets passed. Control of the inverter block is what allows weight updates to either be kept or discarded.
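The counter-plus-inverter mechanism can be modeled in a few lines (a behavioral sketch, not the transistor-level circuit; `WeightCounter` and `clock_all` are invented names). Clocking the counters once with the pseudorandom bits applies a perturbation; reclocking with the invert signal high applies the complementary bits and cancels it, wraparound included.

```python
class WeightCounter:
    """Behavioral model of one synchronous up/down weight counter."""
    def __init__(self, bits=6, value=0):
        self.modulus = 1 << bits
        self.value = value % self.modulus
    def clock(self, up):
        # increment when up is 1, decrement when 0; wraps like the hardware
        self.value = (self.value + (1 if up else -1)) % self.modulus

def clock_all(counters, rand_bits, invert):
    # inverter block: pass each pseudorandom bit unchanged (invert=0)
    # or inverted (invert=1) to the counter's up/down input
    for c, b in zip(counters, rand_bits):
        c.clock(b ^ invert)

weights = [WeightCounter(value=v) for v in (10, 0, 63)]
bits = [1, 0, 1]                      # one draw from the pseudorandom generator
clock_all(weights, bits, invert=0)    # perturb
perturbed = [w.value for w in weights]
clock_all(weights, bits, invert=1)    # unperturb: same bits, inverted
restored = [w.value for w in weights]
```

Note that the third counter wraps from 63 to 0 and back, so even an overflowed perturbation is cancelled exactly.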

Figure 6.8: Block diagram of parallel perturbative neural network circuit.

Figure 6.9: Inverter block.

6.7 Training Algorithm

The algorithm implemented by the network is a parallel perturbative method[21][25]. The basic idea of the algorithm is to perform a modified gradient descent search of the error surface without calculating derivatives or using explicit knowledge of the functional form of the neurons. The way that this is done is by applying a set of small perturbations on the weights and measuring the effect on the network error. In a typical perturbative algorithm, a weight update rule of the following form is used:

    Δw = −η (E(w + pert) − E(w)) / pert

where w is a weight vector of all the weights in the network, pert is a perturbation vector applied to the weights which is normally of fixed size, but with random signs, and η is a small parameter used to set the learning rate. Thus, the weight update rule can be seen as based on an approximate derivative of the error function with respect to the weights:

    ∂E/∂w ≈ ΔE/Δw = (E(w + pert) − E(w)) / ((w + pert) − w) = (E(w + pert) − E(w)) / pert

However, this type of weight update rule requires weight changes which are proportional to the difference in error terms. A simpler, but functionally equivalent, weight update rule can be achieved as follows. First, the network error, E(w), is measured with the current weight vector. Next, a perturbation vector, pert, of fixed magnitude, but random sign, is applied to the weight vector, yielding a new weight vector, w + pert. Afterwards, the effect on the error, ΔE = E(w + pert) − E(w), is measured. If the error decreases, then the perturbations are kept and the next iteration is performed. If the error increases, the original weight vector is restored. Thus, the weight update rule is of the following form:

    w_{t+1} = w_t + pert_t,   if E(w_t + pert_t) < E(w_t)
    w_{t+1} = w_t,            if E(w_t + pert_t) > E(w_t)

The use of this rule may require more iterations, since it does not perform an actual weight change every iteration and since the weight updates are not scaled with the resulting changes in error. Nevertheless, it significantly simplifies the weight update circuitry: this rule does not require additional circuitry to change the weight values proportionately with the error difference, and, as was chosen for the current implementation, it relies on the same circuitry for the weight update as for applying the random perturbations. Some extra circuitry is required to remove the perturbations when the error doesn't decrease, but this merely involves inverting the signs of all of the perturbations and reapplying them in order to cancel out the initial perturbation.

6.8 Error Comparison

The error comparison section is responsible for calculating the error of the current iteration and interfacing with a control section to implement the algorithm. For either update rule, some means must be available to apply the weight perturbations. Both sections could be performed off-chip by using a computer for chip-in-loop training, while the local functions are performed on-chip. This allows flexibility in performing the global functions necessary for implementing the training algorithm. A pseudocode version of the algorithm is presented in figure 6.10. However, there are several alternatives for implementing the error comparison on chip. First, the error comparison could simply consist of analog to digital converters which would then pass the digital information to the control block. Another approach would be to have the error comparison section compare the analog errors directly and then output digital control signals. Also, the control section could be implemented on chip as a standard digital section such as a finite state machine.
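In software, the accept/reject rule reduces to a few lines. The sketch below runs it on a toy quadratic error surface standing in for the measured network error (the target vector and constants are invented for illustration):

```python
import random

random.seed(0)                      # reproducible toy run

def error(w, target=(0.5, -0.25, 0.75)):
    # toy quadratic error surface standing in for the measured network error
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

w = [random.uniform(-1.0, 1.0) for _ in range(3)]
delta = 0.02                        # fixed perturbation magnitude
E = E0 = error(w)
for _ in range(500):
    pert = [random.choice((-delta, delta)) for _ in w]
    w_trial = [wi + pi for wi, pi in zip(w, pert)]
    E_trial = error(w_trial)
    if E_trial < E:                 # error decreased: keep the perturbation
        w, E = w_trial, E_trial
    # otherwise the perturbation is discarded and the weights are restored
```

Because updates are only accepted when the error decreases, the error is non-increasing by construction, and the weights settle within roughly a perturbation magnitude of the minimum.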

Initialize Weights.
Get Error.
while (Error > Error Goal)
    Perturb Weights.
    Get New Error.
    if (New Error < Error)
        Error = New Error.
    else
        Unperturb Weights.
    end
end

function Perturb Weights:
    Clock Pseudorandom Bit Generator.
    Set Invert Bit Low.
    Clock Counters.

function Unperturb Weights:
    Set Invert Bit High.
    Clock Counters.

Figure 6.10: Pseudocode for parallel perturbative learning network

Figure 6.11: Error comparison section

Figure 6.12: Error comparison operation: a) StoreEo high; b) StoreEp high; c) CompareE high.

A sample error comparison section is shown in figure 6.11. The section uses digital control signals to pass error voltages onto capacitor terminals and then combines the capacitors in such a way as to subtract the error voltages. The error input terminals, Ein+ and Ein−, are connected, in sequence, to two storage capacitors which are then compared. This error difference is then compared with a reference voltage to determine if the error decreased or increased. The output consists of a digital signal which shows the appropriate error change.

Figure 6.12 shows the operation of the error comparison section in greater detail. In figure 6.12a, StoreEo is asserted when the unperturbed, initial error is stored. StoreEo is high and the other signals are low, while the inputs Ein+ and Ein− are connected to Eo+ and Eo− respectively. This stores the unperturbed error voltages on the capacitor. Next, in figure 6.12b, StoreEp is asserted in order to store the perturbed error after the weight perturbations have been applied. In the same manner as before, StoreEp is held high while the other signals are low, and the inputs Ein+ and Ein− are connected to Ep+ and Ep− respectively. This stores the perturbed error voltages on the capacitor. The capacitor where Ep is stored is left floating. Note that the polarities of the two storage capacitors are drawn reversed. This will facilitate the subtraction operation in the final step. Finally, in figure 6.12c, CompareE is asserted to compute the error difference, and the other input signals are low. The two storage capacitors are connected together and their charges combine. The total charge, QT, across the parallel capacitors is given as follows:

    QT = Q+ − Q− = C(Ep+ + Eo−) − C(Ep− + Eo+) = C((Ep+ − Ep−) − (Eo+ − Eo−))

If both capacitors are of size C, the parallel capacitance will be of size 2C. The voltage across the parallel capacitors can be obtained from QT = 2C·VT. Thus,

    VT = QT/(2C) = (1/2)((Ep+ − Ep−) − (Eo+ − Eo−)) = (1/2)(Ep − Eo)
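The charge-sharing algebra can be sanity-checked numerically (the voltages below are arbitrary example values, not measured chip data):

```python
C = 1e-12                      # both storage capacitors, 1 pF (arbitrary)
Ep_plus, Ep_minus = 2.1, 1.4   # sampled perturbed-error terminals
Eo_plus, Eo_minus = 1.9, 1.6   # sampled unperturbed-error terminals

# total charge after the reversed-polarity capacitors are tied together
Q_T = C * (Ep_plus + Eo_minus) - C * (Ep_minus + Eo_plus)
V_T = Q_T / (2 * C)            # parallel combination has capacitance 2C

Ep = Ep_plus - Ep_minus        # differential perturbed error
Eo = Eo_plus - Eo_minus        # differential unperturbed error
```

With these example values V_T is positive, meaning the perturbed error is larger than the stored unperturbed error.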

Since these voltages are floating and must be referenced to some other voltage, during this step the combined capacitor is placed onto a voltage, Vref, which is usually chosen to be approximately half the supply voltage to allow for the greatest dynamic range. Also, this error difference feeds into a high gain amplifier to perform the comparison. The high gain amplifier is shown as a differential transconductance amplifier which feeds into a double inverter buffer to output a strong digital signal. The final output, Eout, is given as follows:

    Eout = 1, if Ep > Eo
    Eout = 0, if Ep < Eo

where the 1 represents a digital high signal, and the 0 represents a digital low signal. This output can then be sent to a control section which will use it to determine the next step in the algorithm. The error inputs, which are computed by a difference of the actual output and a target output, can be computed with a similar analog switched capacitor technique. However, it might also be desirable to insert an absolute value computation stage. Simple circuits for obtaining voltage absolute values can be found elsewhere[30]. Furthermore, this sample error computation stage can easily be expanded to accommodate a multiple output neural network.

6.9 Test Results

In this section several sample training run results from the chip are shown. A chip implementing the parallel perturbative neural network was fabricated in a 1.2µm CMOS process. All synapse and neuron transistors were 3.6µm/3.6µm to keep the layout small. The unit size current source transistors were also 3.6µm/3.6µm. An LSB current of 100nA was chosen for the current source. The above neural network circuits were trained with some simple digital functions such as 2 input AND and 2 input XOR. The results of some training runs are shown in figures 6.13-6.15.

As can be seen from the figures, after relatively few iterations, all of the outputs are digitally correct. The error voltages were calculated as a total sum voltage error over all input patterns. The actual output voltage error is arbitrary and depends on the circuit parameters; what is important is that the error goes down. Also shown is a digital error signal that shows the number of input patterns where the network gives the incorrect answer for the output. The network analog output voltage for each pattern is also displayed. The actual weight values were not made accessible and, thus, are not shown; however, another implementation might also add a serial weight bus to read the weights and to set initial weight values. This would also be useful when pretraining with a computer to initialize the weights in a good location.

Figure 6.13 shows the results from a 2 input, 1 output network learning an AND function. This network has only 2 synapses and 1 bias for a total of 3 weights. The error voltages were taken directly from the voltage output of the output neuron. The differential to single-ended converter and a double inverter buffer can be placed at the final output to obtain good digital signals. The network starts with 2 incorrect outputs, which is to be expected with initial random weights. Since the AND function is a very simple function to learn, the network converges quickly.

Figures 6.14 and 6.15 show the results of training a 2 input, 2 hidden unit, 1 output network with the XOR function. Both figures show that the function is essentially learned after only several hundred iterations. In figure 6.15, the network is allowed to continue learning. In both figures, although the error voltage is monotonically decreasing, the digital error occasionally increases. This is because the network weights occasionally transition through a region of reduced analog output error, which is used for training, but which actually increases the digital error. This seems to be occasionally necessary for the network to ultimately reach a reduced digital error. After being stuck in a local minimum for quite some time, the network finally finds another location where it is able to go slightly closer towards the minimum, and the network weights slowly converge to a correct solution. Some of the analog output values occasionally show a large jump from one iteration to another.

Figure 6.13: Training of a 2:1 network with the AND function (error voltage, digital error, and analog outputs vs. iteration).

Figure 6.14: Training of a 2:2:1 network with the XOR function (error voltage, digital error, and analog outputs vs. iteration).

Figure 6.15: Training of a 2:2:1 network with the XOR function with more iterations (error voltage, digital error, and analog outputs vs. iteration).

This occurs when a weight value which is at maximum magnitude overflows and resets to zero. The weights are merely stored in counters, and no special circuitry was added to deal with these overflow conditions. Since the algorithm only accepts weight changes that decrease the error, the weight reset will simply be discarded, and these overflow conditions should normally not be a problem. In fact, the weight reset may occasionally be useful for breaking out of local minima, for example, where a weight value has been pushed up against an edge which leads to a local minimum, but a sign flip or significant weight reduction is necessary to reach the global minimum. In this type of situation, the network will be unwilling to increase the error as necessary to obtain the global minimum. However, if an overflow and reset on a weight is undesirable, it would require a simple logic block to ensure that if a weight is at maximum magnitude and was incremented, it would not overflow and reset to zero. However, this circuitry would need to be added to every weight counter and would be an unnecessary increase in size.

6.10 Conclusions

A VLSI neural network implementing a parallel perturbative algorithm has been demonstrated. Analog multipliers are used to decrease the space that a full parallel array of digital multipliers would occupy in a large network. Digital weights are used to provide stable weight storage; they are stored and updated via up/down counters and registers. All of the local weight update computations and random noise perturbations were generated on chip. The chip was trained using a computer in a chip-in-loop fashion. The computer performed the global operations including applying the inputs, measuring the outputs, calculating the error function and applying algorithm control signals. The network was successfully trained on the 2-input AND and XOR functions. Although the functions learned were digital, the network is able to accept analog inputs and provide analog outputs for learning other functions. This chip-in-loop configuration is the most likely to be used if the chip were used as a fast, low power component in a larger digital system.

For example, in a handheld PDA, it may be desirable to use a low power analog neural network to assist in the handwriting recognition task with the microprocessor performing the global functions.
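The wraparound-versus-clamp tradeoff discussed in the test results can be captured in a toy model (`clock_weight` is a hypothetical helper, not the thesis circuitry):

```python
def clock_weight(value, up, bits=6, clamp=False):
    """One clock of an up/down weight counter.

    clamp=False reproduces the fabricated chip's wraparound;
    clamp=True models the optional per-counter logic block that would
    hold a maxed-out weight instead of resetting it to zero."""
    hi = (1 << bits) - 1
    nxt = value + (1 if up else -1)
    if clamp:
        return min(max(nxt, 0), hi)
    return nxt % (1 << bits)

wrap = clock_weight(63, up=True)              # overflows and resets to 0
held = clock_weight(63, up=True, clamp=True)  # stays at the maximum
```

The clamp costs one extra comparison per counter, which is the per-weight area overhead the text argues against.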

Chapter 7

Conclusions

7.1 Contributions

7.1.1 Dynamic Subthreshold MOS Translinear Circuits

The main contribution of this section involved the transformation of a set of floating gate circuits useful for analog computation into an equivalent set of circuits using dynamic techniques. This makes the circuits more robust, and easier to work with since they do not require using ultraviolet illumination to initialize the circuits. Furthermore, the simplicity with which these circuits can be combined and cascaded to implement more complicated functions was demonstrated.

7.1.2 VLSI Neural Networks

The design of the neuron circuit, although similar to another neuron design[22], is new. The main distinction is a change in the current to voltage conversion element used. The previous element was a diode connected transistor. This element proved to have too large of a gain, and learning XOR was difficult with such high neuron gain. Thus, a diode connected cascoded transistor with linearized gain and a gain control stage was used instead.

The parallel perturbative training algorithm with simplified weight update rule was used in the serial weight bus based neural network. Although many different algorithms could have been used, such as serial weight perturbation, this was done with an eye toward the eventual implementation of the fully parallel perturbative network. Excellent training results were shown for training of a 2 input, 1 output network on the AND function and for training a 2 input, 2 hidden unit, 1 output network on the XOR function. Furthermore, an XOR network was trained which was initialized with "ideal" weights. This was done in order to show the effects

of circuit offsets in making the outputs initially incorrect, but also the ability to significantly decrease the training time required by using good initial weights.

The overall system design and implementation of the fully parallel perturbative neural network with local computation of the simplified weight update rule was also demonstrated. Although many of the subcircuits and techniques were borrowed from previous designs, the overall system architecture is novel. Thus, the ability to learn of a neural network using analog components for implementation of the synapses and neurons and 6 bit digital weights has been successfully demonstrated. The choice of 6 bits for the digital weights was made to demonstrate that learning was possible with limited bit precision. The circuits can easily be extended to use 8 bit weights. Using more than 8 bits may not be desirable, since the analog circuitry itself may not have significant precision to take advantage of the extra bits per weight.

In the 1.2µm process used to make the test chips, λ was equal to 0.6µm. The size of the neuron cell in dimensionless units was 300λ x 96λ, the synapse was 80λ x 150λ, and the weight counter/register was 340λ x 380λ. In a modern process, such as a 0.35µm process, it would be possible to make a network with over 100 neurons and over 10,000 weights in a 1cm x 1cm chip.

Significant strides can be taken to improve the matching characteristics of the analog circuits, but then the inherent benefits of using a compact, parallel, analog implementation may be lost.

7.2 Future Directions

The serial weight bus neural network and parallel perturbative neural network can be combined. By adding a serial weight bus to the parallel perturbative neural network, it would also make it possible to initialize the neural network with weights obtained from computer pretraining. In a typical design cycle, significant computer simulation would be carried out prior to circuit fabrication. Part of this simulation may involve detailed

training runs of the network using either idealized models or full mixed-signal training simulation with the actual circuit. It would then make it possible to initialize the network with the results of this simulation. The next steps would also involve building larger networks for the purposes of performing specific tasks, for certain applications such as handwriting recognition. Additionally, with the serial weight bus, it may be desirable to train one chip with a standard set of inputs, then read out those weights and copy them to other chips.

Bibliography

[1] B. Gilbert, "Translinear Circuits: A Proposed Classification", Electronics Letters, vol. 11, no. 1, pp. 14-16, 1975.

[2] B. Gilbert, "A Monolithic Microsystem for Analog Synthesis of Trigonometric Functions and Their Inverses", IEEE Journal of Solid-State Circuits, vol. SC-17, no. 6, pp. 1179-1191, 1982.

[3] B. Gilbert, "Current-Mode Circuits From A Translinear Viewpoint: A Tutorial", in Analogue IC Design: The Current-Mode Approach, C. Toumazou, F. J. Lidgey, and D. G. Haigh, Eds., London: Peter Peregrinus Ltd., 1990, pp. 11-91.

[4] A. G. Andreou and K. Boahen, "Translinear Circuits in Subthreshold MOS", Analog Integrated Circuits and Signal Processing, vol. 9, pp. 141-166, 1996.

[5] T. Serrano-Gotarredona, B. Linares-Barranco, and A. G. Andreou, "A General Translinear Principle for Subthreshold MOS Transistors", IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 46, no. 5, May 1999.

[6] K. Bult and H. Wallinga, "A Class of Analog CMOS Circuits Based on the Square-Law Characteristic of an MOS Transistor in Saturation", IEEE Journal of Solid-State Circuits, vol. SC-22, no. 3, June 1987.

[7] E. Seevinck and R. J. Wiegerink, "Generalized Translinear Circuit Principle", IEEE Journal of Solid-State Circuits, vol. 26, no. 8, Aug. 1991.

[8] P. R. Gray and R. G. Meyer, Analysis and Design of Analog Integrated Circuits, 3rd Ed., New York: John Wiley and Sons, 1993.

[9] R. J. Wiegerink, "Computer Aided Analysis and Design of MOS Translinear Circuits Operating in Strong Inversion", Analog Integrated Circuits and Signal Processing, vol. 9, pp. 181-187, 1996.

[10] W. Gai, H. Chen, and E. Seevinck, "Quadratic-Translinear CMOS Multiplier-Divider Circuit", Electronics Letters, vol. 33, no. 10, 1997.

[11] S. Liu and C. Chang, "A CMOS Square-Law Vector Summation Circuit", IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 43, no. 7, July 1996.

[12] C. Dualibe, M. Verleysen, and P. Jespers, "Two-Quadrant CMOS Analogue Divider", Electronics Letters, vol. 34, no. 12, June 1998.

[13] X. Arreguit, E. A. Vittoz, and M. Merz, "Precision Compressor Gain Controller in CMOS Technology", IEEE Journal of Solid-State Circuits, vol. SC-22, no. 3, pp. 442-445, June 1987.

[14] E. A. Vittoz, "Analog VLSI Signal Processing: Why, Where, and How?", Analog Integrated Circuits and Signal Processing, vol. 6, no. 1, pp. 27-44, July 1994.

[15] B. A. Minch, C. Diorio, P. Hasler, and C. A. Mead, "Translinear Circuits Using Subthreshold Floating-Gate MOS Transistors", Analog Integrated Circuits and Signal Processing, vol. 9, no. 2, pp. 167-179, 1996.

[16] B. A. Minch, "Analysis, Synthesis, and Implementation of Networks of Multiple-Input Translinear Elements", Ph.D. Thesis, California Institute of Technology, 1997.

[17] K. Yang and A. G. Andreou, "A Multiple-Input Differential-Amplifier Based on Charge Sharing on a Floating-Gate MOSFET", Analog Integrated Circuits and Signal Processing, vol. 6, no. 3, 1994.

[18] E. A. Vittoz, "Dynamic Analog Techniques", in Design of Analog-Digital VLSI Circuits for Telecommunications and Signal Processing, J. Franca and Y. Tsividis, Eds., Englewood Cliffs, New Jersey: Prentice Hall, 1994.

[19] S. Aritome, R. Shirota, G. Hemink, T. Endoh, and F. Masuoka, "Reliability Issues of Flash Memory Cells", Proceedings of the IEEE, vol. 81, no. 5, pp. 776-788, May 1993.

[20] R. W. Keyes, The Physics of VLSI Systems, New York: Addison-Wesley, 1987.

[21] J. Alspector, R. Meir, B. Yuhas, A. Jayakumar, and D. Lippe, "A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks", Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA: Morgan Kaufman Publishers, pp. 836-844, 1993.

[22] R. Coggins, M. Jabri, B. Flower, and S. Pickard, "A Hybrid Analog and Digital VLSI Neural Network for Intracardiac Morphology Classification", IEEE Journal of Solid-State Circuits, vol. 30, no. 5, pp. 542-550, May 1995.

[23] M. Jabri and B. Flower, "Weight Perturbation: An Optimal Architecture and Learning Technique for Analog VLSI Feedforward and Recurrent Multilayer Networks", IEEE Transactions on Neural Networks, vol. 3, no. 1, pp. 154-157, Jan. 1992.

[24] B. Flower and M. Jabri, "Summed Weight Neuron Perturbation: An O(N) Improvement over Weight Perturbation", Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA: Morgan Kaufman Publishers, pp. 212-219, 1993.

[25] G. Cauwenberghs, "A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization", Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA: Morgan Kaufman Publishers, pp. 244-251, 1993.

[26] C. Diorio, P. Hasler, B. A. Minch, and C. A. Mead, "A Single-Transistor Silicon Synapse", IEEE Transactions on Electron Devices, vol. 43, no. 11, pp. 1972-1980, Nov. 1996.

[27] C. Diorio, P. Hasler, B. A. Minch, and C. A. Mead, "A Complementary Pair of Four-Terminal Silicon Synapses", Analog Integrated Circuits and Signal Processing, vol. 13, no. 1-2, pp. 153-166, May-June 1997.

[28] C. Mead, Analog VLSI and Neural Systems, Reading, Mass.: Addison-Wesley, 1989.

[29] R. Gregorian and G. C. Temes, Analog MOS Integrated Circuits for Signal Processing, New York: John Wiley & Sons, 1986.

[30] P. Horowitz and W. Hill, The Art of Electronics, Second Ed., Cambridge, UK: Cambridge University Press, 1989.

[31] S. W. Golomb, Shift Register Sequences, Revised Ed., Laguna Hills, CA: Aegean Park Press, 1982.

[32] P. W. Hollis and J. J. Paulos, "A Neural Network Learning Algorithm Tailored for VLSI Implementation", IEEE Transactions on Neural Networks, vol. 5, no. 5, pp. 784-791, Sept. 1994.

[33] J. Alspector, B. Gupta, and R. Allen, "Performance of a stochastic learning microchip", in Advances in Neural Information Processing Systems 1, D. Touretzky, Ed., San Mateo, CA: Morgan Kaufman, 1989, pp. 748-760.

[34] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Second Ed., New York: Addison-Wesley, 1993.

[35] G. Cauwenberghs, "Analog VLSI Stochastic Perturbative Learning Architectures", Analog Integrated Circuits and Signal Processing, vol. 13, no. 1-2, pp. 195-209, 1997.

[36] J. Alspector, J. W. Gannett, S. Haber, M. B. Parker, and R. Chu, "A VLSI-Efficient Technique for Generating Multiple Uncorrelated Noise Sources and Its

Application to Stochastic Neural Networks", IEEE Transactions on Circuits and Systems, vol. 38, no. 1, pp. 109-123, Jan. 1991.

[37] S. Wolfram, Theory and Applications of Cellular Automata, World Scientific Publishing Co. Pte. Ltd., 1986.

[38] A. Dupret, E. Belhaire, and P. Garda, "Scalable Array of Gaussian White Noise Sources for Analogue VLSI Implementation", Electronics Letters, vol. 31, no. 17, pp. 1457-1458, Aug. 1995.

[39] G. Cauwenberghs, "An Analog VLSI Recurrent Neural Network Learning a Continuous-Time Trajectory", IEEE Transactions on Neural Networks, vol. 7, no. 2, pp. 346-361, March 1996.

[40] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation, New York: Addison-Wesley, 1991.

[41] M. Minsky and S. Papert, Perceptrons, Cambridge: MIT Press, 1969.

[42] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Representations by Back-Propagating Errors", Nature, vol. 323, pp. 533-536, 1986.

[43] T. Morie and Y. Amemiya, "An All-Analog Expandable Neural Network LSI with On-Chip Backpropagation Learning", IEEE Journal of Solid-State Circuits, vol. 29, no. 9, pp. 1086-1093, Sept. 1994.

[44] A. Dembo and T. Kailath, "Model-Free Distributed Learning", IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 58-70, March 1990.

[45] B. Widrow and M. Lehr, "30 Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation", Proceedings of the IEEE, vol. 78, no. 9, pp. 1415-1442, Sept. 1990.

[46] Y. Xie and M. Jabri, "Analysis of the effects of quantization in multilayer neural networks using a statistical model", IEEE Transactions on Neural Networks, vol. 3, no. 2, pp. 334-338, March 1992.

[47] A. F. Murray, "Multilayer Perceptron Learning Optimized for On-Chip Implementation: A Noise-Robust System", Neural Computation, vol. 4, pp. 367-381, 1992.

[48] Y. Maeda, H. Hirano, and Y. Kanata, "A Learning Rule of Neural Networks via Simultaneous Perturbation and Its Hardware Implementation", Neural Networks, vol. 8, no. 2, pp. 251-259, 1995.

[49] V. Petridis and K. Paraschidis, "On the Properties of the Feedforward Method: A Simple Training Law for On-Chip Learning", IEEE Transactions on Neural Networks, vol. 6, Nov. 1995.

[50] D. Kirk, D. Kerns, K. Fleischer, and A. Barr, "Analog VLSI Implementation of Multi-dimensional Gradient Descent", Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA: Morgan Kaufman Publishers, pp. 789-796, 1993.

[51] J. S. Denker and B. S. Wittner, "Network Generality, Training Required, and Precision Required", in Neural Information Processing Systems, D. Z. Anderson, Ed., New York: American Institute of Physics, 1988, pp. 219-222.

[52] P. W. Hollis, J. S. Harper, and J. J. Paulos, "The Effects of Precision Constraints in a Backpropagation Learning Network", Neural Computation, vol. 2, no. 3, pp. 363-373, 1990.

[53] M. Stevenson, R. Winter, and B. Widrow, "Sensitivity of Feedforward Neural Networks to Weight Errors", IEEE Transactions on Neural Networks, vol. 1, no. 1, March 1990.

[54] A. J. Montalvo, P. W. Hollis, and J. J. Paulos, "On-Chip Learning in the Analog Domain with Limited Precision Circuits", International Joint Conference on Neural Networks, 1992, pp. 196-201.

[55] T. G. Clarkson, C. K. Ng, and Y. Guan, "The pRAM: An Adaptive VLSI Chip", IEEE Transactions on Neural Networks, vol. 4, no. 3, May 1993.

[56] A. Torralba, F. Colodro, E. Ibáñez, and L. G. Franquelo, “Two Digital Circuits for a Fully Parallel Stochastic Neural Network”, IEEE Transactions on Neural Networks, vol. 6, no. 5, 1995.

[57] H. C. Card, C. R. Schneider, and R. S. Schneider, “Learning Capacitive Weights in Analog CMOS Neural Networks”, Journal of VLSI Signal Processing, vol. 8, no. 3, pp. 209-225, 1994.

[58] H. C. Card, C. R. Schneider, and D. K. McNeill, “Digital VLSI Backpropagation Networks”, Can. J. Elect. & Comp. Eng., vol. 20, no. 1, 1995.

[59] H. C. Card and S. Kamarsu, “Competitive Learning and Vector Quantization in Digital VLSI Systems”, Neurocomputing, vol. 18, pp. 195-227, 1998.

[60] G. Cauwenberghs and A. Yariv, “Fault-Tolerant Dynamic Multilevel Storage in Analog VLSI”, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 41, no. 12, Dec. 1994.

[61] G. Cauwenberghs, “A Micropower CMOS Algorithmic A/D/A Converter”, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 42, no. 11, Nov. 1995.

[62] P. Mueller, J. Van Der Spiegel, D. Blackman, T. Chiu, T. Clare, C. Donham, J. Hsieh, and M. Loinaz, “Design and Fabrication of VLSI Components for a General Purpose Analog Neural Computer”, in Analog VLSI Implementation of Neural Systems, C. Mead and M. Ismail, Eds., Norwell, Mass.: Kluwer, pp. 135-169, 1989.

[63] R. B. Allen and J. Alspector, “Learning of Stable States in Stochastic Asymmetric Networks”, IEEE Transactions on Neural Networks, vol. 1, no. 2, June 1990.

[64] G. Cauwenberghs, C. F. Neugebauer, and A. Yariv, “Analysis and Verification of an Analog VLSI Incremental Outer-Product Learning System”, IEEE Transactions on Neural Networks, vol. 3, no. 3, May 1992.

[65] C. Diorio, S. Mahajan, P. Hasler, B. Minch, and C. Mead, “A High-Resolution Nonvolatile Analog Memory Cell”, Proceedings of the 1995 IEEE International Symposium on Circuits and Systems, vol. 3, pp. 2233-2236, 1995.

[66] A. J. Montalvo, R. S. Gyurcsik, and J. J. Paulos, “An Analog VLSI Neural Network with On-Chip Perturbation Learning”, IEEE Journal of Solid-State Circuits, vol. 32, no. 4, April 1997.

[67] P. Baldi and F. Pineda, “Contrastive Learning and Neural Oscillations”, Neural Computation, vol. 3, no. 4, pp. 526-545, Winter 1991.
