The Lutonium: A Sub-Nanojoule Asynchronous 8051 Microcontroller

Alain J. Martin, Mika Nyström, Karl Papadantonakis, Paul I. Pénzes, Piyush Prakash, Catherine G. Wong, Jonathan Chang, Kevin S. Ko, Benjamin Lee, Elaine Ou, James Pugh, Eino-Ville Talvala, James T. Tong, Ahmet Tura

Department of Computer Science
California Institute of Technology
Pasadena, CA 91125
Abstract

We describe the Lutonium, an asynchronous 8051 microcontroller designed for low $Et^2$. In 0.18-µm CMOS, at nominal 1.8 V, we expect a performance of 0.5 nJ per instruction at 200 MIPS. At 0.5 V, we expect 4 MIPS and 40 pJ/instruction, corresponding to 25,000 MIPS/Watt. We describe the structure of a fine-grain pipeline optimized for $Et^2$ efficiency, the implementation of some of the peripherals, and the advantages of an asynchronous implementation of a deep-sleep mechanism.
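As a quick consistency check on the figures quoted in the abstract (a worked unit conversion, not a new measurement), the energy per instruction at 0.5 V can be inverted to give instructions per joule, which is the same quantity as instructions per second per watt:

```latex
\[
\frac{1}{40\ \mathrm{pJ/instruction}}
  = \frac{1}{40 \times 10^{-12}\ \mathrm{J}}
  = 2.5 \times 10^{10}\ \mathrm{instructions/J}
  = 25{,}000\ \mathrm{MIPS/W}.
\]
```

The same conversion applied to the nominal 1.8 V figure of 0.5 nJ per instruction gives 2,000 MIPS/W.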
I. INTRODUCTION
The Lutonium is a quasi delay-insensitive (QDI) asynchronous 8051-architecture microcontroller designed for energy efficiency. It is the demonstration vehicle of a DARPA-sponsored research project investigating the energy advantages of asynchronous design. At the time of writing, the design is not complete, but all units are described at the transistor level, and we can already estimate the performance. In TSMC's 0.18-µm CMOS process offered via MOSIS, we expect an energy consumption of 500 pJ per instruction and an instruction rate of 200 MIPS at the nominal 1.8 V. At 0.9 V, we expect 140 pJ and 66 MIPS.

The energy efficiency of the Caltech asynchronous QDI design approach was already apparent in the performance of the 1989 Caltech Asynchronous Microprocessor [2] and of the 1997 Caltech MiniMIPS [4], an asynchronous MIPS R3000 microprocessor. In addition to increased energy efficiency at the nominal operating voltage, experimental data from these chips revealed that the robustness of QDI circuits to delay variation allows these circuits to run at very low voltages, even slightly below the transistor threshold voltage. In designing the Lutonium, we have made full use of our ability to adjust $E$ and $t$ through voltage scaling: the voltage-independent $Et^2$ is the metric we are striving to minimize, where $E$ is the average energy per instruction, and $t$ is the cycle time. The Lutonium is therefore not, strictly speaking, designed for low power but for the best trade-off between energy and cycle time. This metric has been introduced and justified in several papers, and we shall recapitulate the argument in the next section.

We cannot justify the choice of the 8051 ISA entirely on the basis of energy efficiency. It is a complex and irregular instruction set, which is bound to increase the energy cost of fetching and decoding the instructions. Also, the use of registers is highly irregular, another source of energy consumption. The choice of the 8051 is justified by the fact that it is the most popular microcontroller, hence it is often found in applications where energy efficiency is important.

The paper is organized as follows. We first briefly recapitulate the arguments in favor of the $Et^2$ metric. We then describe the general design style and the design methodology, both very similar to the ones introduced for the MiniMIPS, and explain the main changes and refinements introduced for the Lutonium. We then present the instruction set and the general organization of the pipeline. We describe several parts of the architecture in more detail: the fetch loop and the decode, the deep-sleep protocol, the tree buses, and the interrupt mechanism. Finally, we discuss the performance of the prototype and compare it with other implementations of the 8051.

II. THE $Et^2$ METRIC AND ENERGY-EFFICIENT DESIGN
To a first approximation, the (dynamic) energy dissipated during a transition in a CMOS system is proportional to $CV^2$, and the transition delay is inversely proportional to $V$. Therefore energy and delay can be traded against each other simply by changing the supply voltage, and it would be foolish not to take advantage of this freedom; this means that when we can adjust the supply voltage, we should not simply optimize one of the two metrics and ignore the other. The best way to combine energy and delay in a single figure of merit is to use the product $Et^2$, since it is independent of the voltage to first approximation (with $E \propto CV^2$ and $t \propto 1/V$, the factor $V^2$ cancels). Given two designs $A$ and $B$, if the $Et^2$ of $A$ is lower than that of $B$, then $A$ is indeed a better design: for equal cycle time $t$, the energy of $A$ is lower than that of $B$, and for equal energy, the cycle time of $A$ is lower than that of $B$. In several papers, we have given experimental evidence of the range and limits of the assumptions under which the $Et^2$ metric is valid [8]. SPICE simulations also show that under normal operating conditions an $Et^2$-better circuit will give better energy performance and better cycle-time performance.

More sophisticated formulas than
$Et^2$ are possible, but they do not necessarily work better in practice: the main practical deviations from our simple theory are the effect of the transistor threshold and the effect of velocity saturation on the speed of operation. Since these two effects largely counteract each other, $Et^2$ works very well around the design voltage of most modern CMOS technologies. Furthermore, it is a much easier metric to handle than one that tries to incorporate the threshold or velocity saturation directly.

We have developed a theory of $Et^2$-optimal designs, the results of which have influenced the design of the Lutonium, particularly in the areas of high throughput, slack matching, transistor sizing, and conditional communication. (However, we do not claim that the Lutonium design is $Et^2$-optimal, as such a claim is beyond the current state of the art.)

One of the theoretical results is that an $Et^2$-optimal pipeline is short. For the Lutonium, we chose the highest throughput possible given nonspeculative execution, and a minimally speculative fetch loop. We realized that the common-case critical path in such a design would be the "fetch loop" (see below), and we chose our throughput target to be the fastest possible cycle time of this loop.

Optimizing the throughput of an asynchronous system involves
slack-matching it. The simplest way of slack-matching a system is to treat it like a synchronous retiming problem: have the same number of pipeline stages on all paths. However, because the slack-matching buffers are almost always faster than the elements that perform computations, they can absorb timing variations, and it is not always necessary for all paths to go through an absolutely equal number of pipeline stages in order to optimize throughput. This has repercussions for $Et^2$-optimal design: whereas a designer who is only interested in $t$ does not mind adding more buffers than strictly necessary, the designer who wants to optimize $Et^2$ must take into account the energy cost of the extra slack-matching buffers. Therefore, slack-matching has more subtleties for an $Et^2$-optimal system than for a $t$-optimal system; the optimized system has fewer slack-matching buffers in it. The MiniMIPS was over-slack-matched in this regard.

Transistor sizing for optimal
$Et^2$ is achieved when the sum of all gate capacitances is approximately twice the total parasitic capacitance [8]. Sizing for optimal energy consumption is not minimal sizing! (Optimal energy consumption is not minimal energy consumption.)

Extensive simulations of the MiniMIPS gave us the opportunity to gather invaluable information about the energy budget of an asynchronous microprocessor. In particular, it appears that only 10% of the total energy consumed in the processing of an instruction goes into the actual execution of the instruction, for instance the actual energy spent in the ALU during an add instruction. The remaining 90% of the energy is spent in communication: fetching the instruction and decoding it consumes 45% of the energy (this happens even for a NOP), and the final 45% goes into moving parameters and the result between execution units and registers. These results are relevant because the MiniMIPS design style is similar to the Lutonium design style.

We concluded that we should reduce communication at the expense of adding local computation. For example, we used a Huffman code based on instruction frequencies to control the main buses, which were segmented according to the corresponding Huffman tree.

III. LOGIC FAMILY AND DESIGN METHOD
The logic family of the Lutonium is essentially the same as that of the MiniMIPS. Why a logic family chosen for high throughput should be optimal for $Et^2$ has not been established rigorously, and may not be true! That it would give good $Et^2$ performance compared to other asynchronous logic families can be argued and is supported by experimental evidence.
Figure 1. PCHB circuit template. (Labels in the diagram: Computation (pulldown), L?, R!, en, le, re, lv, rv, completion, validity.)
The basis of the logic family is the buffer described in CHP as:

    *[L?x; R!f(x)]
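A minimal behavioral sketch of this buffer, in Python rather than CHP, may help readers unfamiliar with the notation. It assumes that a channel can be modeled as an iterator and that each loop iteration performs one receive on L followed by one send on R. The names `buffer` and `f` are illustrative only, with `f` standing for whatever function a stage computes (the identity for a plain buffer). The four-phase handshake, the data encoding, and all timing are deliberately not modeled.

```python
from typing import Callable, Iterable, Iterator


def buffer(L: Iterable[int],
           f: Callable[[int], int] = lambda x: x) -> Iterator[int]:
    """Behavioral model of one stage: repeat (receive x on L; send f(x) on R)."""
    for x in L:       # L?x : receive the next token from the input channel
        yield f(x)    # R!f(x) : send the computed value on the output channel


# Chain two stages (hypothetical functions): the first increments, the second doubles.
tokens = [1, 2, 3]
stage1 = buffer(tokens, lambda v: v + 1)
stage2 = buffer(stage1, lambda v: 2 * v)
result = list(stage2)
print(result)
```

Chaining stages in this way only mirrors the dataflow of a pipeline of buffers; it says nothing about cycle time, latency, or energy.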
The buffer may have several inputs and outputs, including conditional ones. The above instance is the simplest one in the family. In traditional QDI design, the standard implementation consists of separating the control part from the datapath. This design style is simple and general and was used for the Caltech Asynchronous Microprocessor. But it puts high lower bounds on the cycle time, forward latency, and energy per cycle. First, in the control part the four-phase handshake on $L$ and the four-phase handshake on $R$ are totally ordered, putting all 8 transitions in sequence. Secondly, the completion-tree delay is included twice in the handshake cycle between two adjacent buffer stages, and it is proportional to the logarithm of the number of bits in the datapath. Finally, the explicit storing of the variable $x$ in a register adds considerable overhead in terms of both energy and forward latency.

A solution to the problem was proposed for the design of the MiniMIPS, where we introduced a logic family based on very fine-grain pipeline stages implemented as "precharge half-buffers," or PCHB. First, in order to reduce the completion-tree delay and keep the design QDI, the datapath has to be partitioned into independent portions, each with its own completion tree. The size of the portions is chosen in such a way that the completion-tree delays fit within the allotted cycle-time delays. Secondly, each partial buffer processes a portion of data small enough that control and datapath do not have to be separated. Thirdly, the explicit storing of the input data in a local register is eliminated by "reshuffling" the handshake sequence on $L$ and $R$ in such a way that the data can be processed directly from the input wires of $L$ to the output wires of $R$. The term PCHB refers to the specific reshuffling chosen; another advantage of this reshuffling is that it has only two transitions on the forward latency, i.e., between the data's being valid on $L$ and the result's being valid on $R$. The basic CMOS implementation of a PCHB stage is shown in figure 1. Observe that all computation is done in the pulldown network with $L$ as input.

The choice of such a fine-grain pipeline stage has drastic consequences on the organization of the whole system. Because each stage can do only a modest amount of computation (essentially limited by the size of the pull-down circuitry), any non-trivial computation must be decomposed into a network of PCHB stages, with important repercussions on latency, throughput, and energy.

While the small size of a PCHB stage helps keep the cycle time low, it may also increase the global latency on computation cycles with feedback, in particular the so-called
fetch loop, which is a critical part of the Lutonium pipeline. The relationship between a stage period $p$, a stage latency $l$, and a pipeline cycle-length $n$ (in terms of the number of stages) is given by the equation $p = n \cdot l$. A crucial step in the design process is choosing the individual stages and the length $n$ of a pipeline in such a way that the equation is satisfied for $p$ and $l$ corresponding to the optimal cycle time and latency of each stage.

In the design of the Lutonium, we chose $p$ to be equal to 22 elementary transitions (compared to 18 in the MiniMIPS). This choice was dictated by the complexity of the fetch loop, which we expected to have a length of 11 stages (with the two-transition forward latency of a PCHB stage, $p = 11 \cdot 2 = 22$). Once this choice was made, all pipeline stages that we required to be able to run at the full throughput of the processor (i.e., at least all pipeline stages that have to operate once per instruction, and some others) had to be designed for a cycle period close to 22 transitions, and all cycles had to be
slack-matched to be approximately 11 stages long.

IV. THE 8051 ISA

The 8051 microcontroller has 255 variable-length instructions, from one to three bytes. We have added a 256th instruction for writing instruction memory during bootstrapping. The opcode of an instruction is always encoded in the first byte. The second and third bytes are operands. They might specify a relative or absolute branch target, an indirect address, or an immediate operand. The 8051 is a Harvard architecture: instruction memory and data memory are separate.

The instruction set provides six addressing modes: (1) in direct addressing, the operand is specified by an 8-bit address field in the instruction representing an address in the internal data RAM or a special-function register (SFR); (2) in indirect addressing, the instruction specifies a register containing the address of the operand in either internal or external RAM; (3) in banked addressing, the register banks containing registers R0 through R7 can be addressed by a 3-bit field in the opcode; (4) some instructions operate on specific registers; (5) in immediate-constant mode, the constant operand value is part of the instruction; (6) the indexed addressing mode is used to read the program memory (the address is obtained by adding the accumulator to a base pointer).

The 8051 peripherals include logic ports, timers and counters, four I/O ports, and an interrupt controller.

V. THE LUTONIUM DESIGN
Instructions in the 8051 architecture may have implicit operands as well as explicit operands. An example of an 8051 instruction that involves many implicit operands is ADDC A, @Ri, or "add with carry accumulator to indirect Ri." This instruction requests that the contents of the memory address pointed to by register Ri be added to the accumulator, with a carry-in, and stored in the accumulator. In the 8051 ISA, register Ri is itself indirect because the "register bank" is under software control. Therefore, ADDC A, @Ri involves the following actions: read the carry-in PSW.C out of the processor status word (PSW); read the register-bank selector PSW.RS out of the PSW; combine PSW.RS with Ri to make the memory address pointed to by Ri; read the contents of Ri; read the contents of the memory address pointed to by Ri (i.e., @Ri); read the accumulator; add the accumulator and @Ri; store the eight low-order bits of the result in the accumulator; store the carry-out in PSW.C; store the overflow bit of the add in PSW.OV; and finally store the carry-out of bit 3 in PSW.AC. (PSW.AC is involved in implementing base-ten arithmetic.)

The obvious approach to the nightmare of executing such an instruction is to microcode the instruction set and execute it on a conceptually simpler machine. Unfortunately, this would lead to a very slow implementation with many machine cycles per instruction execution. Therefore, rather than microcoding instructions with many implicit operands, we decided to have special-purpose channels for implicit operands. This allows such an instruction to execute in one cycle, and reduces dependence on general-purpose buses, which have large fanouts and hence high energy cost.

Most of the design effort in the Lutonium has gone into assuring good
$Et^2$-efficiency. For instance, the Lutonium implementation is highly pipelined for speed, but it is still nonspeculative: the instruction-fetch unit only keeps filling the pipeline as long as it knows that the instructions are definitely going to be executed, and although branches are executed in one cycle, that cycle is "stretched." We took great pains to minimize switching activity: no register or execution unit receives control unless it will process data for a given instruction (hence it has no switching circuit nodes unless it is used on a particular instruction), and interrupts and pins only cause switching when accessed by software or when an input pin switches. We also localize activity as much as possible: special registers (SP, PSW, B, DPTR) have their own channels and function units (instead of using the main buses and units) for energy and time savings; e.g., the 16-bit DPTR can be incremented in one cycle without using the main buses at all. Also, infrequently used units do not add to the fanin and fanout of the main buses (see section VIII), for energy and time savings. The time savings would be lost if a clock were used, since we have improved the average-case energy and delay at the expense of somewhat worsening the worst-case delays.

Now let us consider all communications caused by the one-byte instruction
ADDC A, @Ri. RuptArb first sends 0 to Fetch, indicating the absence of an interrupt request. Then Fetch sends the instruction byte to Decode. The following actions are specifically required by the ADDC A, @Ri instruction: Decode sends control messages to PSW, ALU, A, and RegFile. Then
