on the forward latency, i.e., between the data’s being valid on L and the result’s being valid on R. The basic CMOS implementation of a PCHB stage is shown in figure 1. Observe that all computation is done in the pull-down network with L as input.

The choice of such a fine-grain pipeline stage has drastic consequences on the organization of the whole system. Because each stage can do only a modest amount of computation (essentially limited by the size of the pull-down circuitry), any non-trivial computation must be decomposed into a network of PCHB stages, with important repercussions on latency, throughput, and energy.

While the small size of a PCHB stage helps keep the cycle time low, it may also increase the global latency on computation cycles with feedback, in particular the so-called
fetch loop, which is a critical part of the Lutonium pipeline. The relationship between a stage period $p$, a stage latency $l$, and a pipeline cycle length $n$ (in terms of the number of stages) is given by the equation $p = n \cdot l$. A crucial step in the design process is choosing the individual stages and the length $n$ of a pipeline in such a way that the equation is satisfied for $p$ and $l$ corresponding to the optimal cycle time and latency of each stage.

In the design of the Lutonium, we chose
$p$ to be equal to 22 elementary transitions (compared to 18 in the MiniMIPS). This choice was dictated by the complexity of the fetch loop, which we expected to have a length of 11 stages (with $p = n \cdot l$, this corresponds to a stage latency of $l = 2$ transitions). Once this choice was made, all pipeline stages that we required to be able to run at the full throughput of the processor (i.e., at least all pipeline stages that have to operate once per instruction, and some others) had to be designed for a cycle period close to 22 transitions, and all cycles had to be slack-matched to be approximately 11 stages long.

IV. THE 8051 ISA

The 8051 microcontroller has 255 variable-length instructions, from one to three bytes. We have added a 256th instruction for writing instruction memory during bootstrapping. The opcode of an instruction is always encoded in the first byte. The second and third bytes are operands. They might specify a relative or absolute branch target, an indirect address, or an immediate operand. The 8051 is a Harvard architecture: instruction memory and data memory are separate.

The instruction set provides six addressing modes: (1) in direct addressing, the operand is specified by an 8-bit address field in the instruction representing an address in the internal data RAM or a special-function register (SFR); (2) in indirect addressing, the instruction specifies a register containing the address of the operand in either internal or external RAM; (3) in banked addressing, the register banks containing registers
R0 through R7 can be addressed by a 3-bit field in the opcode; (4) some instructions operate on specific registers; (5) in immediate-constant mode, the constant operand value is part of the instruction; (6) the indexed addressing mode is used to read the program memory (the address is obtained by adding the accumulator to a base pointer).

The 8051 peripherals include logic ports, timers and counters, four I/O ports, and an interrupt controller.

V. THE LUTONIUM DESIGN

Instructions in the 8051 architecture may have implicit operands as well as explicit operands. An example of an 8051 instruction that involves many implicit operands is
ADDC A, @Ri, or add with carry accumulator to indirect Ri. This instruction requests that the contents of the memory address pointed to by register Ri be added to the accumulator, with a carry-in, and stored in the accumulator. In the 8051 ISA, register Ri is itself indirect because the “register bank” is under software control. Therefore, ADDC A, @Ri involves the following actions: read the carry-in PSW.C out of the processor status word (PSW); read the register-bank selector PSW.RS out of the PSW; combine PSW.RS with Ri to make the memory address pointed to by Ri; read the contents of Ri; read the contents of the memory address pointed to by Ri (i.e., @Ri); read the accumulator; add the accumulator and @Ri; store the eight low-order bits of the result in the accumulator; store the carry-out in PSW.C; store the overflow bit of the add in PSW.OV; and finally store the carry-out of bit 3 in PSW.AC. (PSW.AC is involved in implementing base-ten arithmetic.)

The obvious approach to the nightmare of executing such an instruction is to microcode the instruction set and execute it on a conceptually simpler machine. Unfortunately, this would lead to a very slow implementation with many machine cycles per instruction execution. Therefore, rather than microcoding instructions with many implicit operands, we decided to have special-purpose channels for implicit operands. This allows such an instruction to execute in one cycle, and reduces dependence on general-purpose buses, which have large fanouts and hence high energy cost.

Most of the design effort in the Lutonium has gone into assuring good
$Et^2$-efficiency. For instance, the Lutonium implementation is highly pipelined for speed, but it is still nonspeculative: the instruction-fetch unit only keeps filling the pipeline as long as it knows that the instructions are definitely going to be executed, and although branches are executed in one cycle, that cycle is “stretched.” We took great pains to minimize switching activity: no register or execution unit receives control unless it will process data for a given instruction (hence it has no switching circuit nodes unless it is used on a particular instruction), and interrupts and pins only cause switching when accessed by software or an input pin switches. We also localize activity as much as possible: special registers (SP, PSW, B, DPTR) have their own channels and function units (instead of using the main buses and units) for energy and time savings; e.g., the 16-bit DPTR can be incremented in one cycle without using the main buses at all. Also, infrequently used units do not add to the fanin and fanout of the main buses (see section VIII), for energy and time savings. The time savings would be lost if a clock were used, since we have improved the average-case energy and delay at the expense of somewhat worsening the worst-case delays.

Now let us consider all communications caused by the one-byte instruction ADDC A, @Ri.
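As a reference point for the communications traced here, the architectural effect of ADDC A, @Ri (the list of actions given earlier in this section) can be sketched in Python. This is only a functional model under stated assumptions, not the Lutonium implementation: the function name `addc_a_indirect`, its argument layout, and the 256-entry list standing in for internal RAM are hypothetical. The PSW bit positions are those of the standard 8051. The parity flag, which the 8051 also derives from the accumulator, is left out because the text does not list it.

```python
# PSW bit positions of the standard 8051:
# C = bit 7, AC = bit 6, RS = bits 4..3, OV = bit 2.
PSW_C, PSW_AC, PSW_OV = 0x80, 0x40, 0x04
PSW_RS_SHIFT, PSW_RS_MASK = 3, 0x03


def addc_a_indirect(acc, psw, iram, i):
    """Return (new_acc, new_psw) for ADDC A,@Ri, with i in {0, 1}.

    Hypothetical helper: `iram` is a list of byte values modelling
    internal data RAM; `acc` and `psw` are 8-bit integers.
    """
    carry_in = 1 if psw & PSW_C else 0             # read PSW.C
    bank = (psw >> PSW_RS_SHIFT) & PSW_RS_MASK     # read PSW.RS
    ri_addr = bank * 8 + i                         # combine PSW.RS with Ri
    pointer = iram[ri_addr]                        # read the contents of Ri
    operand = iram[pointer]                        # read @Ri

    total = acc + operand + carry_in
    carry_out = total > 0xFF
    # Carry out of bit 3 (used for base-ten arithmetic).
    half_carry = (acc & 0x0F) + (operand & 0x0F) + carry_in > 0x0F
    # Signed overflow: carry into bit 7 differs from carry out of bit 7.
    carry_into_msb = (acc & 0x7F) + (operand & 0x7F) + carry_in > 0x7F
    overflow = carry_into_msb != carry_out

    new_psw = psw & ~(PSW_C | PSW_AC | PSW_OV) & 0xFF
    if carry_out:
        new_psw |= PSW_C
    if half_carry:
        new_psw |= PSW_AC
    if overflow:
        new_psw |= PSW_OV
    return total & 0xFF, new_psw


# Usage: register bank 1 selected (RS = 01) and carry set, so PSW = 0x88.
iram = [0] * 256
iram[8] = 0x30        # R0 of bank 1 lives at address 8 and points to 0x30
iram[0x30] = 0x7F     # the byte read through @R0
acc, psw = addc_a_indirect(0x01, 0x88, iram, 0)
print(hex(acc), hex(psw))  # prints: 0x81 0x4c
```

In the usage example the sum 0x01 + 0x7F + 1 = 0x81 produces no carry out of bit 7 but does set AC and OV, while the bank-select bits in the PSW are left unchanged. The model touches every implicit operand in a single function call, which is the property the special-purpose channels are meant to provide in hardware.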
RuptArb first sends 0 to Fetch, indicating the absence of an interrupt request. Then Fetch sends the instruction byte to Decode. The following actions are specifically required by the ADDC A, @Ri instruction: Decode sends control messages to PSW, ALU, A, and RegFile. Then