
COPING WITH LATENCY IN SOC DESIGN

LATENCY-INSENSITIVE DESIGN IS THE FOUNDATION OF A CORRECT-BY-CONSTRUCTION METHODOLOGY FOR SOC DESIGN. THIS APPROACH CAN HANDLE LATENCY’S INCREASING IMPACT ON DEEP-SUBMICRON TECHNOLOGIES AND FACILITATE THE REUSE OF INTELLECTUAL-PROPERTY CORES FOR BUILDING COMPLEX SYSTEMS ON CHIPS, REDUCING THE NUMBER OF COSTLY ITERATIONS IN THE DESIGN PROCESS.

Luca P. Carloni
Alberto L. Sangiovanni-Vincentelli
University of California, Berkeley

As commercial demand for system-on-a-chip (SOC)-based products grows, the effective reuse of existing intellectual-property design modules (also known as IP cores) is essential to meet the challenges posed by deep-submicron (DSM) technologies and to complete a reliable design within time-to-market constraints. Originally, IP cores were mostly functional blocks built for previous design generations within the same company. Frequently, today’s IP cores are optimized modules marketed as off-the-shelf components by specialized vendors.

An IP core must be both flexible—to collaborate with other modules within different environments—and independent from the particular details of one-among-many possible implementations. The prerequisite for an easy trade, reuse, and assembly of IP cores is the ability to assemble predesigned components with little or no effort. The consequent challenge is addressing the communication and synchronization issues that naturally arise while assembling predesigned components.

We believe that the semiconductor industry will experience a paradigm shift from computation- to communication-bound design: The number of transistors that a signal can reach in a clock cycle—not the number that designers can integrate on a chip—will drive the design process. The “From computation- to communication-bound design” sidebar discusses this trend. The strategic importance of developing a communication-based design methodology naturally follows. Researchers have proposed the principle of orthogonalization of concerns as essential to dealing with the complexity of systems-on-a-chip design.1 Here, we address the orthogonalization of communication versus computation—in particular, separating the design of the block functionalities from communication architecture development.

An essential element of communication-based design is the encapsulation of predesigned functional modules within automatically generated interface structures. Such a strategy ensures a correct-by-construction composition of the system. Latency-insensitive design and the recycling paradigm are a step in this direction.

Coping with volatile latency

High-end-microprocessor designers have traditionally anticipated the challenges that ASIC designers are going to face in working with the

0272-1732/02/$17.00 © 2002 IEEE


From computation- to communication-bound design

According to the “2001 Technology Roadmap for Semiconductors,” the historical half-pitch technology gap between microprocessors and ASICs on the one hand and DRAM on the other will disappear by 2005.1 As we proceed into deep-submicron (DSM) technologies, the 130-nm microprocessor half-pitch in 2002 will shrink to the projected 32-nm length in 2013, allowing the integration of more than 1 billion transistors on a single die. At the same time, the ITRS predicts that the on-chip local-clock frequency will rise from today’s 1 to 2 GHz, to 19 to 20 GHz. The so-called interconnect problem, however, threatens the outstanding pace of technological progress that has shaped the semiconductor industry. Despite the increase in metal layers and in aspect ratio, the resistance-capacitance delay of an average metal line with constant length is becoming worse with each process generation.2-3 The current migration from aluminum to copper metallization is compensating for this trend by reducing the interconnect resistivity. The introduction of low-k dielectric insulators may also alleviate the problem, but these one-time improvements will not suffice in the long run as feature size continues to shrink.4

The increasing resistance-capacitance delays combined with the increases in operating frequency, die size, and average interconnect length cause interconnect delay to become the largest fraction of the clock cycle time. In 1997, Computer magazine published a study by D. Matzke5 containing a gloomy forecast of how on-chip interconnect latency (predicted to soon measure in the tens of clock cycles) will hamper Moore’s law. Since then, researchers have fiercely debated the real magnitude of the wire-delay impact and the consequent need for revolutionizing established design methodologies already challenged by the timing-closure problem.

The debate’s core has been about the scaling properties of interconnect wires relative to gate scaling. On one side, some researchers share an optimistic view based on the fact that wires that scale in length together with gate lengths offer approximately a constant resistance and a falling capacitance. Hence, as long as designers adopt a modular design approach and treat functional modules of up to 50,000 gates as the main components, these researchers’ position is that current design flows can sustain the challenges of DSM design.6

On the other side of the debate, researchers argue that the previous argument does not account for the presence in SOCs of many global wires that cannot scale in length. As Figure A7 shows, such wires must span multiple modules to connect distant gates, so aren’t scalable. As Table A3 illustrates, the intrinsic interconnect delay of a 1-mm length wire for a 35-nm technology will be longer than the transistor delay by two orders of magnitude. Some researchers argue that, even under the best
(continued below)

Table A. Interconnect delay of 1-mm line vs. transistor delay for various process technologies.3

Technology (type of    MOSFET* switching      Minimum, scaled, 1-mm       Reverse, scaled, 1-mm       Ratio of the wire size to the
wire and substrate)    delay, approx. (ps)    interconnect intrinsic      interconnect intrinsic      minimum lithographic size
                                              delay, approx. (ps)         delay, approx. (ps)
10 µm (Al, SiO2)       20                     5                           5                           1
0.1 µm (Al, SiO2)      5                      30                          5                           1.5
35 nm (Cu, low-k)      2.5                    250                         5                           4.5

*Metal-oxide-semiconductor field-effect transistor
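The trend in Table A can be checked directly. The short Python sketch below (delay values transcribed from the table) computes the ratio of minimum-scaled 1-mm interconnect delay to transistor switching delay for each generation; for the 35-nm node, the ratio is 100—the two orders of magnitude cited in the sidebar text.

```python
# Delay figures (in ps) transcribed from Table A: transistor switching delay
# and minimum-scaled 1-mm interconnect intrinsic delay per technology node.
table_a = {
    "10 um (Al, SiO2)": {"transistor_ps": 20.0, "wire_1mm_ps": 5.0},
    "0.1 um (Al, SiO2)": {"transistor_ps": 5.0, "wire_1mm_ps": 30.0},
    "35 nm (Cu, low-k)": {"transistor_ps": 2.5, "wire_1mm_ps": 250.0},
}

for tech, d in table_a.items():
    ratio = d["wire_1mm_ps"] / d["transistor_ps"]
    print(f"{tech}: interconnect/transistor delay ratio = {ratio:g}")
# The 35-nm ratio is 100, i.e., two orders of magnitude.
```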

next process generation. Latency is increasingly affecting the design of state-of-the-art microprocessors. (Latency, for example, drove the design of so-called drive stages in the new hyperpipelined Netburst microarchitecture2 of Intel’s Pentium 4). The “Impact of latency in the one-billion-transistor microprocessor design” sidebar discusses latency as a force shaping this future architecture, which is expected within the next 10 years.

We have argued that interconnect latency will have a significant impact on the design of the communication architecture among the modules of a system on a chip. However, it is difficult to estimate the actual interconnect latency early in the design cycle because several phenomena affect it. Process variations, crosstalk, and power supply drop variations all affect interconnect latency. Furthermore, their combined effect can vary across chip regions and periods of chip operation. Hence, finding the exact delay value for a global wire is often impossible. On the other hand, relying on conservative estimates to establish value intervals can lead to suboptimal designs.

We believe that recently proposed design flows advocating new CAD tools that couple logic synthesis and physical design will suffer from the impracticality of accurately estimating the latency of global wires. Indeed, logic

SEPTEMBER–OCTOBER 2002

From computation- to communication-bound design (continued)

…conditions, the latency to transmit a signal across the chip in a top-level metal wire will vary between 12 and 32 cycles, depending on the clock rate.8 In fact, although the number of gates reachable in a cycle will not change significantly and the on-chip bandwidth will continue to grow, the percentage of the die reachable within one clock cycle is inexorably and dramatically decreasing. Designs will soon reach a point where more gates can fit on a chip than can communicate in one cycle.5,7 Hence, instead of being limited by the number of transistors integratable on a single die (computation bound), designs will be limited by the amount of state and logic reachable within the required number of clock cycles (communication bound).
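A toy calculation makes the shrinking-reach trend concrete. All constants below—die size, signal propagation speed, and the disk-shaped reach model—are illustrative assumptions for this sketch, not ITRS figures:

```python
import math

def reachable_fraction(die_side_mm, clock_ghz, signal_mm_per_ns):
    """Fraction of a square die's area reachable from a central point in one
    clock cycle, assuming a fixed signal propagation speed (a deliberate
    simplification: real wire delay is not a constant velocity)."""
    radius_mm = signal_mm_per_ns / clock_ghz      # distance covered per cycle
    reach_mm2 = math.pi * radius_mm ** 2          # reachable disk area
    return min(reach_mm2 / die_side_mm ** 2, 1.0)

# Same die, same wires, 10x the clock: far less of the chip in one-cycle reach.
print(reachable_fraction(20, 2, 30))     # 1.0 (whole die reachable)
print(reachable_fraction(20, 20, 30))    # ~0.018 (under 2% of the die)
```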

Figure A. Local wires scale in length; global wires do not. The figure shows the different impact of technology scaling on devices, local wires, and global wires. The newer technology process in SOC (2) allows a higher level of integration than SOC (1) as more (and smaller) devices and modules are accommodated on the chip. Local wires connect devices within a module and shrink with the module. Global wires connect devices located in different modules and do not shrink because they need to span significant proportions of the die.7

References
1. A. Allan et al., “2001 Technology Roadmap for Semiconductors,” Computer, vol. 35, no. 1, Jan. 2002, pp. 42-53.
2. M.T. Bohr, “Interconnect Scaling: The Real Limiter to High Performance ULSI,” Proc. 1995 IEEE Int’l Electron Devices Meeting, IEEE Press, Piscataway, N.J., 1995, pp. 241-244.
3. J.A. Davis et al., “Interconnect Limits on Gigascale Integration (GSI) in the 21st Century,” Proc. IEEE, vol. 89, no. 3, Mar. 2001, pp. 305-324.
4. M.J. Flynn, P. Hung, and K.W. Rudd, “Deep-Submicron Microprocessor Design Issues,” IEEE Micro, vol. 19, no. 4, July 1999, pp. 11-13.
5. D. Matzke, “Will Physical Scalability Sabotage Performance Gains?” Computer, vol. 30, no. 9, Sept. 1997, pp. 37-39.
6. D. Sylvester and K. Keutzer, “Rethinking Deep-Submicron Circuit Design,” Computer, vol. 32, no. 11, Nov. 1999, pp. 25-33.
7. R. Ho, K. Mai, and M. Horowitz, “The Future of Wires,” Proc. IEEE, vol. 89, no. 4, Apr. 2001, pp. 490-504.
8. V. Agarwal et al., “Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures,” Proc. 27th Ann. Int’l Symp. Computer Architecture (ISCA 00), ACM Press, New York, 2000, pp. 248-259.

synthesis tools presently suffer from several drawbacks. They are

• inherently unstable, and
• based on the synchronous design methodology.

Small variations to the hardware description language’s input specification—like those a designer might make to fix a slow path—can lead to major variations in the output netlist and, consequently, in the final layout.

The synchronous design methodology is the foundation of the design flows for the majority of commercial chips today, but, if left unchanged, will lead to an exacerbation of the timing closure problem for tomorrow’s design flows. In fact, the main assumption in synchronous design is that the delay of each combinational path is smaller than the system clock period. By combinational path, we mean each signal path leaving a latch and traversing only combinational logic and wires to reach another latch. The slowest combinational path (critical path) dictates the maximum operating frequency for the system. However, it is often the case that the desired operating frequency represents a fixed design constraint. Then, once designers derive the final layout, every path with a delay longer than the desired clock period simply represents a design exception to be fixed. To fix these exceptions, designers use techniques ranging from wire buffering and transistor resizing to rerouting wires, replacing modules, and even redesigning entire portions of the system.

Replacing, rerouting, and redesigning clearly do not alleviate the timing closure problem. Buffering is an efficient technique but carries precise limitations, because there is a limit to the number of buffers that designers can insert

IEEE MICRO
The impact of latency in the one-billion-transistor microprocessor design

The evolution toward communication-bound design in microprocessors implies that the amount of state reachable in a clock cycle, and not the number of transistors, becomes the major factor limiting the growth of instruction throughput. Furthermore, the increasing interconnect latency will particularly penalize current memory-oriented microprocessor architectures that strongly rely on the assumption of low-latency communication with structures such as caches, register files, and rename/reorder tables. Recent studies employ cache-delay analysis tools that account for cache parameters as well as technology generation figures. These studies predict that in a 35-nm design running at 10 GHz, accessing even a 4-Kbyte level-one cache will require three clock cycles.1 In fact, this trend has already started to affect design: The architects of the Alpha 21264 adopted clustered functional units and a partitioned register file to continue offering high computational bandwidth despite longer wire delays. Similarly, Intel’s Pentium 4 microprocessor presents two so-called drive stages in its new hyperpipelined Netburst microarchitecture; designers dedicated these stages purely to instruction distribution and data movement.2 Exposing interconnect latency to the microarchitecture and possibly to the instruction set will be key in controlling system performance. Several research teams have already started investigating this idea.3-8

In particular, Bill Dally proposes a model for a hypothetical 2009 multicomputer chip. He divides the proposed chip into 64 tiles (each containing a processor and a memory), where the processor-memory round-trip intratile communication latency is two cycles (1 ns). The worst-case latency for a corner-to-corner communication with the most distant memory is 56 cycles. More importantly, Dally points out the following features of his approach:

• On-chip communication bandwidth is not an issue because of the many wiring tracks.
• Unlike modern multiprocessors with their all-or-nothing locality, latency on the 2009 multicomputer varies continuously with distance.
• This approach controls latency by placing data near their point of use, not just at their point of use.

References
1. V. Agarwal et al., “Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures,” Proc. 27th Ann. Int’l Symp. Computer Architecture (ISCA 00), ACM Press, New York, 2000, pp. 248-259.
2. P. Glaskowski, “Pentium 4 (Partially) Previewed,” Microprocessor Report, vol. 14, no. 8, Aug. 2000, pp. 10-13.
3. W.J. Dally and S. Lacy, “VLSI Architecture: Past, Present and Future,” Proc. Advanced Research in VLSI Conf., IEEE Press, Piscataway, N.J., 1999, pp. 232-241.
4. K. Mai et al., “Smart Memories: A Modular Reconfigurable Architecture,” Proc. 27th Ann. Int’l Symp. Computer Architecture (ISCA 00), ACM Press, New York, 2000, pp. 161-171.
5. C.E. Kozyrakis et al., “Scalable Processors in the Billion-Transistor Era: IRAM,” Computer, vol. 30, no. 9, Sept. 1997, pp. 75-78.
6. R. Ho, K. Mai, and M. Horowitz, “The Future of Wires,” Proc. IEEE, vol. 89, no. 4, Apr. 2001, pp. 490-504.
7. E. Waingold et al., “Baring It All to Software: Raw Machines,” Computer, vol. 30, no. 9, Sept. 1997, pp. 86-93.
8. R. Nagarajan et al., “A Design Space Evaluation of Grid Processor Architectures,” Proc. 34th Ann. Int’l Symp. Microarchitecture (Micro-34), IEEE CS Press, Los Alamitos, Calif., 2001, pp. 40-51.
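Dally’s two quoted figures can be reproduced with a toy distance-based latency model. The sidebar gives no per-hop cost, so the 4-cycles-per-hop constant below is an assumption chosen only so that the model matches both endpoints (2-cycle intratile round trip, 56-cycle corner-to-corner worst case); it illustrates latency varying continuously with distance rather than Dally’s actual design.

```python
# Toy latency model for a hypothetical 64-tile (8 x 8) multicomputer chip.
GRID = 8                  # 8 x 8 = 64 tiles
INTRATILE_RT = 2          # round-trip latency within one tile (cycles)
CYCLES_PER_HOP_RT = 4     # ASSUMED round-trip cost per tile hop (cycles)

def round_trip_latency(src, dst):
    """Round-trip latency between two tiles; grows with Manhattan distance."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return INTRATILE_RT if hops == 0 else hops * CYCLES_PER_HOP_RT

print(round_trip_latency((0, 0), (0, 0)))   # intratile access: 2 cycles
print(round_trip_latency((0, 0), (7, 7)))   # corner to corner: 56 cycles
```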

on a given wire and still reduce delay. In fact, as a last resort, designers often must break long wires by inserting latches, which is similar to the insertion of new stages in a pipeline.

This operation, which trades off fixing a wire exception with increasing its latency by one or more clock cycles, will become pervasive in deep-submicron design, where most global wires will be heavily pipelined anyway. In fact, latency (measured in clock cycles) between SOC components will vary considerably based on the components’ reciprocal distances—even without considering the need for increasing latency to further pipeline long global wires. Inserting latches (stateful repeaters) has a different impact on the surrounding control logic with respect to inserting buffers (stateless repeaters). If the interface logic design of two communicating components assumes a certain latency, then designers must rework it to account for additional pipeline stages. Such rework has serious consequences on design productivity.

On the one hand, the increasing impact of latency variations will drive architectures toward modular designs with an explicit global latency mechanism. In the case of multiprocessor architectures, latency variation will lead the designers to expose computation/communication tradeoffs to the software compilers. At the same time, the focus on latency will open the way to new synthesis tools that can automatically generate the hardware interfaces responsible for implementing the appropriate synchronization and communication strategies, such as channel multiplexing, data coherency, and so on. All these considerations lead us to propose a design methodology that


guarantees the robustness of the system’s functionality and performance with respect to arbitrary latency variations.

Latency-insensitive design

The foundation of latency-insensitive design is the theory of latency-insensitive protocols.3 A latency-insensitive protocol controls communication among components of a patient system—a synchronous system whose functionality depends only on the order of each signal’s events and not on their exact timing. Designers can model a synchronous system as a set of modules. These modules communicate by exchanging signals on a set of point-to-point channels. The protocol guarantees that a system, if composed of functionally correct modules, behaves correctly, independent from delays in the channels connecting the modules. Consequently, it’s possible to automatically synthesize a hardware implementation of the system such that its functional behavior is robust with respect to large variations in communication latency.4 In practice, the channel implementation can vary and does not necessarily follow the point-to-point structure. Still, this methodology orthogonalizes computation and communication because it separates the module design from the communication architecture options, while enabling automatic synthesis of the interface logic. This separation is useful in two ways:

• It simplifies module design because designers can assume the synchronous hypothesis. That is, intermodule communication will take no time (or, in other words, it’s completed within one virtual clock cycle).
• It permits the exploration of tradeoffs in deriving the communication architecture up to the design process’ late stages, because the protocol guarantees that the interface logic can absorb arbitrary latency variations.

In an earlier work, we discussed the application of the latency-insensitive methodology to the case where the channels are simply implemented as sets of point-to-point metal wires.3 Here, our discussion addresses an immediate advantage of this methodology: After physical design, it lets designers pipeline any long wire having a delay larger than the desired clock period into shorter segments by inserting special memory elements called relay stations. Hence, we divide the latency-insensitive design flow into four basic steps:

1. Specification of synchronous components. Designers must first specify the system as a collection of synchronous components, relying on the synchronous hypothesis. These components can be custom modules or IP cores; we refer to them as pearls.
2. Encapsulation. Next, it’s necessary to encapsulate each pearl within an automatically generated shell. A shell is a collection of buffering queues—one for each port—plus the control logic that interfaces the pearl with the latency-insensitive protocol.
3. Physical layout. In this traditional step, designers use logic synthesis and place-and-route tools to derive the layout of the chip implementing the system.
4. Relay station insertion. Designers segment every wire whose latency is greater than the clock period by distributing the necessary relay stations. Relay stations are similar to regular pipeline latches and, therefore, inherently different from traditional buffering repeaters.

The only precondition this methodology requires is that the pearls are stallable, that is, they can freeze their operation for an arbitrary time without losing their internal state. This is a weak requirement because most hardware systems can be made stallable by, for instance, implementing a gated clock mechanism. But this precondition is sufficient to permit the automatic synthesis of a shell around the pearl such that the shell-pearl combination satisfies the necessary, and much stronger, patience property.3

Figure 1 illustrates a system where five automatically generated shells encapsulate five pearls to make them patient. The resulting shell-pearl components communicate by means of eight point-to-point communication channels that implement the back-pressure mechanism described in the next section. Finally, the insertion of six relay stations enables channel pipelining to meet the system clock period.
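Step 4 above can be sketched with a small calculation. This is a simplifying model, not the article’s tool flow: it assumes a wire of delay d is cut into ceil(d / T) equal segments for clock period T, needing one fewer relay station than segments, and it ignores the stations’ own setup delay.

```python
import math

def relay_stations_needed(wire_delay_ns, clock_period_ns):
    """Relay stations required so every wire segment fits in one clock period
    (equal-segmentation model; station insertion delay ignored)."""
    segments = math.ceil(wire_delay_ns / clock_period_ns)
    return max(segments - 1, 0)

print(relay_stations_needed(0.8, 1.0))   # short wire: 0 stations
print(relay_stations_needed(3.2, 1.0))   # needs 4 segments -> 3 stations
```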

Channels and back-pressure

Channels are point-to-point unidirectional links between a source and a sink module. Data are transmitted on a channel by means of packets that consist of a variable number of fields. Here, we consider only two basic fields: payload, which contains the transmitted data; and void, a one-bit flag that, if set to 1, denotes that no data are present in the packet. If a packet does contain meaningful payload data (that is, it has void set to 0), we call it a true packet.

Figure 1. Shell encapsulation, relay station insertion, and channel back-pressure.
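The two basic fields can be captured in a minimal Python sketch (the class name and its helper property are illustrative, not part of the protocol definition):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Packet:
    """A packet with the two basic fields described above."""
    payload: Any = None   # the transmitted data
    void: int = 1         # one-bit flag: 1 = no data present in the packet

    @property
    def is_true(self) -> bool:
        """A true packet carries meaningful payload data (void set to 0)."""
        return self.void == 0

stall = Packet()                    # what a source emits with no data ready
data = Packet(payload=42, void=0)   # a true packet
print(data.is_true, stall.is_true)  # True False
```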
A channel consists of wires and relay stations. The number of relay stations in a channel is finite and represents the channel’s buffering capability. At each clock cycle, the source module can either put a new true packet on the channel or, when no output data are available, put a void packet on it. Conversely, at each clock cycle the sink module retrieves the incoming packet from the channel. It then discards or stores the packet on the input channel queue for later use, basing its decision on the void field value.

A source module might not be ready to send a true packet, and a sink module might not be ready to receive it if, for instance, its input queue is full. However, the latency-insensitive protocol demands fully reliable communication among the modules, and does not allow lossy communication links; it requires the proper delivery of all packets. Consequently, the sink module must interact with the channel (and ultimately with the corresponding source module) to momentarily stop the communication flow and avoid the loss of any packet. Therefore, we slightly relax our definition of a channel as unidirectional to allow a bit of information (the channel stop flag) to move in the opposite direction. The dashed wires in Figure 1 represent this signal, which is similar to the NACK signal in request/acknowledge protocols of asynchronous design. See the “Latency-insensitive versus asynchronous design” sidebar for a discussion of the two approaches.

By setting the stop flag equal to one during a certain clock cycle, the sink module informs the channel that it can’t receive the next packet, and the channel must hold the packet until the sink module resets the stop flag. Because the sink module and channel have limited buffering resources, a channel dealing with a sink module that requires a long stall might fill up all its relay stations and have to send a stop flag to the source module to stall its packet production. This back-pressure mechanism controls the flow of information on a channel while guaranteeing that no packets are lost.

Latency-insensitive versus asynchronous design

The latency-insensitive design methodology is clearly reminiscent of many ideas that the asynchronous-design community has proposed during the past three decades.1-2 Informally, asynchronous design’s underlying assumption is that the delay between two subsequent events on a communication channel is completely arbitrary. In the case of a latency-insensitive system, designers constrain this arbitrary delay to be a multiple of the clock period. The key point is that this type of discretization lets designers leverage well-accepted design tools and methodologies for the design and validation of synchronous circuits. In fact, the basic distinction between the two approaches is the specification of a latency-insensitive system as a synchronous system. Notice that we say specified because, from an implementation point of view, you could realize a latency-insensitive protocol using handshake-driven signaling techniques (for example, request-acknowledge mechanisms), which are typically asynchronous. On the other hand, synchronous systems specify a complex system as a collection of modules whose state is updated collectively in one zero-time step. This update is naturally simpler than specifying the same system as the interaction of many components whose state is updated following an intricate set of interdependency relations. IC designers could legitimately see latency-insensitive design as a compromise that accommodates important properties of asynchronous circuits within the familiar synchronous design framework.

References
1. W.A. Clark and C.E. Molnar, “The Promise of Macromodular Systems,” Digest of Papers 6th Ann. IEEE Computer Soc. Int’l Conf., IEEE Press, Piscataway, N.J., 1972, pp. 309-312.
2. I.E. Sutherland, “Micropipelines,” Comm. ACM, vol. 32, no. 6, June 1989, pp. 720-738.

Shell encapsulation

Given a particular module M, we can devise a method to automatically synthesize an instance of a shell as a wrapper to encapsulate M. We can further interface the shell with the channels so that M becomes a patient system. To do so, the only necessary condition is that M be stallable. At each clock cycle, the module’s internal computation must fire only if all inputs have arrived. The module shell’s first task is to guarantee this input synchronization. Its second task is output propagation: At each clock cycle, if M has produced new output values and no output channel has previously raised a stop flag, then the shell can transmit these output values by generating new true packets. If the shell does not verify either of these two conditions, then it must retransmit the packet transmitted in the previous cycle as a void packet. In summary, a shell for module M cyclically performs the following actions:

1. It obtains incoming packets from the input channels, filters away the void packets, and extracts the input values for M from the payload fields of the true packets.
2. When all input values are available for the next computation, it passes them to M and initiates the computation.
3. It obtains the computation results from M.
4. If no output channel has previously raised a stop flag, it routes the result into the output channels.

Figure 2 illustrates a simple unoptimized implementation of a shell wrapping a module with three input and two output channels. It’s also possible to insert queues between the shell I/Os and M’s I/Os to increase buffering and avoid fragmented stalling. Using queues lets designers optimize implementations. Because such implementations follow the protocol outlined previously, the shell-pearl combination operates in a way similar to that of an actor in a static data flow. Hence, designers can take advantage of useful properties, such as the ability to statically size the shell queues and to statically compute the performance of the overall system.

Figure 2. Shell encapsulation: Making an IP core patient.
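The shell’s cyclic behavior can be sketched in Python. This is a deliberately reduced model—one output channel, dict-based packets, the four steps collapsed into a single function, and the retransmission subtlety omitted—so it illustrates input synchronization and stalling, not a full shell implementation.

```python
from collections import deque

def shell_cycle(in_packets, queues, pearl, stopped):
    """One clock cycle of a toy shell around a stallable pearl.

    in_packets: one {"payload": ..., "void": 0 or 1} dict per input channel
    queues:     one deque per input channel (the shell's buffering queues)
    pearl:      a function computing the pearl's output from input payloads
    stopped:    True if an output channel has previously raised its stop flag
    """
    # Step 1: filter away void packets, keeping true payloads in the queues.
    for q, pkt in zip(queues, in_packets):
        if pkt["void"] == 0:
            q.append(pkt["payload"])
    # Steps 2-4: fire only when every input is available and no stop flag is
    # raised; otherwise the pearl stalls and the shell emits a void packet.
    if all(queues) and not stopped:
        return {"payload": pearl(*(q.popleft() for q in queues)), "void": 0}
    return {"payload": None, "void": 1}

queues = [deque(), deque()]
add = lambda a, b: a + b
# The two operands arrive skewed by one cycle; the shell synchronizes them.
print(shell_cycle([{"payload": 1, "void": 0}, {"payload": None, "void": 1}],
                  queues, add, False))   # void packet: second input missing
print(shell_cycle([{"payload": None, "void": 1}, {"payload": 2, "void": 0}],
                  queues, add, False))   # true packet carrying 1 + 2
```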

Relay stations

A relay station is a patient process communicating with two channels, ci and co. Let si and so be the signals associated with the channels. Further, let I(l, k, si), l ≤ k, denote the number of informative events (true packets) of si between the lth and kth clock cycle. The following conditions must hold: si and so are latency equivalent and, for all k,

I(1, k − 1, si) − I(1, k, so) ≥ 0
I(1, k, si) − I(1, k − 1, so) ≤ 2

The following is an example of relay station behavior, where τ denotes a stalling event (void packet) and ιi a generic informative event:

si = ι1 ι2 ι3 τ τ ι4 ι5 ι6 τ τ τ ι7 τ ι8 ι9 ι10 …
so = τ ι1 ι2 ι3 τ τ τ ι4 τ τ τ ι5 ι6 ι7 τ ι8 ι9 ι10 …

Notice that we don’t further specify signals si and so; we don’t even say, for example, that si is the input and so is the output. The definition of a relay station simply involves a set of relations—a protocol—between si and so, without any implementation detail. Still, it’s clear that each informative event received on channel ci is later emitted on co, while the presence of a stalling event on co might induce a stalling event on ci in a later cycle. In fact, an informative event takes at least one clock cycle to pass through a relay station (minimum forward latency equal to one), and at most two informative events can arrive on ci while no informative events are emitted on co (internal storage capacity equal to two). Finally, one extra stalling event on co will move into ci in at least one cycle (minimum backward latency equal to one).

Figure 3. Example relay station implementation.

Figure 4. Finite state machine abstracting the relay station logic from Figure 3.
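One way to mimic this protocol in software is the sketch below—an assumption-laden model, not the Figure 3 circuit: a main slot plus an auxiliary slot give the storage capacity of two, `None` stands for a void packet, and the stop flag is raised whenever the auxiliary slot is occupied.

```python
class RelayStation:
    """Software sketch of a relay station: capacity two, forward latency one."""

    def __init__(self):
        self.main = None   # packet currently being forwarded
        self.aux = None    # overflow slot filled while downstream is stalled

    def cycle(self, pkt_in, stop_in):
        """Advance one clock cycle; returns (pkt_out, stop_out)."""
        if stop_in:
            # Downstream stalled: emit a void packet, park any arrival.
            out = None
            if pkt_in is not None:
                if self.main is None:
                    self.main = pkt_in
                else:
                    self.aux = pkt_in   # second in-flight packet lands here
        else:
            out = self.main
            if self.aux is not None:
                self.main, self.aux = self.aux, pkt_in
            else:
                self.main = pkt_in
        # Back-pressure: ask upstream to stall while the overflow slot is full.
        return out, self.aux is not None

rs = RelayStation()
print(rs.cycle("p1", False))   # (None, False)  p1 needs one cycle to pass
print(rs.cycle("p2", False))   # ('p1', False)
print(rs.cycle("p3", True))    # (None, True)   stalled; p3 parked, stop raised
print(rs.cycle(None, False))   # ('p2', False)  flow resumes, no packet lost
```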
The double storage capacity of a relay station permits, in the best case, communication with maximum throughput (equal to one). Because relay stations are patient processes, and patience is a compositional property, the insertion of a relay station in a patient system guarantees that the system remains patient.3 Further, because relay stations have minimum latencies equal to one, designers can repetitively insert them into a channel to segment it, thereby increasing the channel’s latency. Figure 3 illustrates a possible relay station implementation. Researchers have recently proposed other interesting relay station implementations.5

Figure 4 shows a finite state machine representing the control logic in the relay station of Figure 3. Basically, the relay station switches between two main states: processing, when


a flow of true packets traverses the relay station; and stalling, when communication is interrupted. Switching through states WriteAux and ReadAux governs the transition between these two states. These states ensure proper control of the auxiliary register to avoid losing a packet, because the stop flag takes a cycle to propagate backward through the relay station.

Recycling
In general, an automatic tool should complete the insertion of relay stations as part of the physical design process (similarly to the buffer insertion techniques available in current design flows). In fact, the latency-insensitive methodology's main advantage is the freedom it offers designers in moving the latency after deriving the final implementation. Designers can fix problematic layouts without changing the design of individual modules, and can also explore and optimize latency-throughput tradeoffs up to the late stages of the design process. The recycling paradigm formally captures the latency variations of the communication channels as you add (and move) relay stations. It lets you exactly compute the system's final throughput.6

For recycling, we model a latency-insensitive system as a directed (possibly cyclic) graph G that associates each vertex with a shell-pearl pair and in which each arc corresponds to a channel. We annotate each arc a with a length l(a) and a weight w(a). The length denotes the smallest multiple of the desired clock period that is larger than the delay of the corresponding channel. The weight represents the number of relay stations inserted on the channel. We call arc a illegal if w(a) < l(a) − 1. The presence of illegal wires in the layout implies that the final implementation is incorrect. Thanks to the latency-insensitive methodology, we can correct the final layout by introducing relay stations to ensure that each wire's delay is less than the desired clock period. This insertion corresponds to incrementing the weight of each illegal arc a by the quantity ∆w(a) = l(a) − 1 − w(a).

Although recycling is an easy way to correct the system's final implementation and satisfy the timing constraints imposed by the clock, it does have a cost. In fact, augmenting the weights of some arcs of G could increase the graph's maximum cycle mean λ(G), defined as maxC ∈ G {[w(C) + |C|] / |C|}, where C is a cycle of G and w(C) is the sum of the weights of the arcs of C. Increasing λ(G) corresponds to increasing the cycle time of the system modeled by G and, symmetrically, decreasing its throughput ϑ(G).6 Throughput always decreases if any arc a with augmented weight w(a) belongs to the set of critical cycles of G, that is, those cycles whose mean coincides with the maximum cycle mean. Increasing the weights of some arcs might make a noncritical cycle of G become a critical cycle of the recycled graph G′. In any case, after completing the recycling transformation, we can exactly compute the consequent throughput degradation ∆ϑ(G, G′) = ϑ(G) − ϑ(G′) using one of the following methods:

• Solve the maximum cycle mean problem for graph G′, and simply set ϑ(G′) = 1/[λ(G′)].
• After establishing the set A of cycles having at least one arc with an augmented weight, increment the cycle mean of each element C of A by the quantity 1/|C| × ∑i ∆w(ai), where the ai are the arcs of C that have been corrected. Let λ* be the maximum among these incremented cycle means; then ∆ϑ(G, G′) = 0 if λ* ≤ λ(G), or [λ* − λ(G)] / [λ(G) × λ*] otherwise.

The critical cycle dictates the overall throughput because the rest of the system must slow down to wait for it and avoid packet loss. In general, if G contains more than one strongly connected component, you must decompose the recycling transformation into two steps:

1. Legalization. Legalize G by augmenting the weights of the wires by the appropriate quantity.
2. Equalization. Compute the maximum throughput ϑ(Sk) = ak/bk ∈ ]0, 1] that is sustainable by each strongly connected component Sk ∈ G′; recall that ϑ(Sk) is equal to the inverse of the maximum cycle mean λ(Sk). Equalize the throughputs by adding a quantity nk ∈ Z* to the denominator of each ϑ(Sk). This corresponds to augmenting by quantity nk the weight of the critical cycle Ck ∈ Sk, that is, to distributing nk extra relay stations among the corresponding paths. Solving an optimization problem gives the quantities nk.6
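To make these computations concrete, the following sketch enumerates the simple cycles of a small graph, computes the maximum cycle mean λ(G) = maxC {[w(C) + |C|] / |C|} and the throughput ϑ(G) = 1/λ(G), and applies the legalization increment ∆w(a) = l(a) − 1 − w(a). This is an illustrative implementation, not the authors' tool; the graph encoding (integer vertices, maps from arcs to weights and lengths) and all helper names are assumptions.

```python
def simple_cycles(adj):
    """Enumerate the simple cycles of a small directed graph by DFS.
    adj maps each integer vertex to a list of successor vertices."""
    cycles = []
    def dfs(start, node, path):
        for nxt in adj.get(node, []):
            if nxt == start:
                cycles.append(path[:])          # closed a cycle back to start
            elif nxt > start and nxt not in path:
                dfs(start, nxt, path + [nxt])   # visit only vertices > start,
    for v in sorted(adj):                       # so each cycle is found once
        dfs(v, v, [v])
    return cycles

def max_cycle_mean(adj, w):
    """lambda(G) = max over cycles C of [w(C) + |C|] / |C|,
    where w maps each arc (u, v) to its relay-station count."""
    means = []
    for cyc in simple_cycles(adj):
        arcs = list(zip(cyc, cyc[1:] + cyc[:1]))
        means.append((sum(w[a] for a in arcs) + len(arcs)) / len(arcs))
    return max(means)

def throughput(adj, w):
    """theta(G) is the inverse of the maximum cycle mean."""
    return 1.0 / max_cycle_mean(adj, w)

def legalize(w, l):
    """Raise every illegal arc (w(a) < l(a) - 1) to w(a) = l(a) - 1,
    that is, insert delta_w(a) = l(a) - 1 - w(a) relay stations."""
    return {a: max(w[a], l[a] - 1) for a in w}

# Two cycles sharing vertex 1; one relay station already on arc (1, 0).
adj = {0: [1], 1: [0, 2], 2: [1]}
w = {(0, 1): 0, (1, 0): 1, (1, 2): 0, (2, 1): 0}
print(max_cycle_mean(adj, w))   # (1 + 2)/2 = 1.5
print(throughput(adj, w))       # 1/1.5 = 0.666...
```

Legalizing a layout in which arc (0, 1) has length l = 3 raises its weight to 2, so the cycle through vertices 0 and 1 gets mean (3 + 2)/2 = 2.5 and the throughput drops to 0.4, consistent with the degradation formulas above.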

32 IEEE MICRO
Figure 5. Block diagram of an MPEG-2 video encoder. (Blocks include input, preprocessing, discrete cosine transform, quantizer (Q), variable length code encoder, buffer, regulator, inverse quantizer (Q−1), inverse discrete cosine transform, frame memory, motion compensation, and motion estimation, with predictive frames and motion vectors among the signals, and output.)
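The equalization step described above can be sketched as follows. This is a simplified illustration, not the paper's formulation (which solves an optimization problem6): it assumes each strongly connected component's sustainable throughput is given as an exact fraction ak/bk and equalizes down to the slowest component with integer nk.

```python
from fractions import Fraction

def equalize(thetas):
    """Given the sustainable throughput theta(S_k) = a_k / b_k of each
    strongly connected component, return the integers n_k to add to the
    denominators (extra relay stations on each component's critical
    cycle) so that no component runs faster than the slowest one."""
    target = min(thetas)            # the slowest component sets the pace
    ns = []
    for t in thetas:
        a, b = t.numerator, t.denominator
        # smallest n with a / (b + n) <= target, via ceiling division
        n = -(-(a * target.denominator) // target.numerator) - b
        ns.append(max(0, n))
    return ns

# Components at 1/2 and 2/3: one extra relay station on the second
# component's critical cycle gives 2/(3 + 1) = 1/2.
print(equalize([Fraction(1, 2), Fraction(2, 3)]))   # [0, 1]
```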

One key to avoiding large performance losses during recycling is to avoid augmenting the weights of those arcs belonging to a critical cycle C of G. Furthermore, from the definition of maximum cycle mean, we have that for the same w(C), the smaller the cycle's cardinality |C|, the greater the loss in throughput for G. The worst case is clearly represented by self-loops. Designers must keep these considerations in mind while partitioning the system functionality into tasks assigned to different IP cores. It's true that the latency-insensitive methodology guarantees that, no matter how poor the final system implementation (in terms of wire lengths in the communication architecture), it's always possible to fix the system by adding relay stations. Still, to achieve acceptable performance, designers should adopt a design strategy based on the following guidelines:

• All modules should put comparable timing constraints on the global clock (that is, delays of the longest combinatorial paths inside each module should be similar).
• Modules whose corresponding vertices belong to the same cycle should be kept close to each other while deriving the final implementation.

Figure 5 illustrates the functional diagram of an MPEG-2 video encoder whose corresponding graph, reported in Figure 6 (next page), contains six distinct cycles:

• C1 = {a9, a10, a12}
• C2 = {a9, a11, a14, a12}
• C3 = {a16, a17, a18, a19, a20}
• C4 = {a4, a5, a6, a7, a8, a9, a10, a13}
• C5 = {a4, a5, a6, a7, a8, a9, a11, a14, a13}
• C6 = {a6, a7, a8, a9, a11, a15, a17, a18, a19, a20}

Most arcs are common to more than one cycle; for example, a9 is part of C1, C2, C4, C5, and C6, but others are contained only in one


cycle; for example, a15 is only part of C6. Finally, some arcs, such as a3, are not contained in any cycle. We already know that increasing the weight of these arcs does not affect the system performance. However, can we compute the performance degradation in advance for the arcs belonging to one or more cycles? Figure 7 shows the results of an analysis based on the recycling approach. The six curves in the chart represent the previously mentioned cycles. Each point of curve Ci shows the degradation in system throughput detected after setting the total sum w(Ci) of the weights of Ci's arcs equal to integer x, with x ∈ [0, 20]. Obviously, the underlying assumption is that Ci is a critical cycle of G, limiting the choice of those arcs of Ci with augmentable weights. For example, assume that w(C2) = 5, as a result of summing w(a9) = 4, w(a11) = 1, w(a14) = 0, and w(a12) = 0. In this case, C2 is definitely not a critical cycle. In fact, its cycle mean is λ(C2) = (5 + 4)/4 = 9/4 = 2.250. Even if w(a10) = w(a12) = 0, cycle C1—with w(C1) = w(a9) = 4—has a larger cycle mean: exactly λ(C1) = (4 + 3)/3 = 7/3 = 2.333.

Figure 6. Latency-insensitive design graph of the MPEG-2 video encoder shown in Figure 5. The vertices of the graph correspond to blocks in the MPEG diagram. Source and Target represent respectively the input and output of the MPEG encoder.

As Figure 7 confirms, the best way to avoid losing performance is to increase the weights of those arcs that belong to longer cycles. Performing the recycling transformation this way might or might not be possible; for example, if the length of arc a10 is large, cycle C1 will ultimately dictate the system throughput. However, in general the latency-insensitive methodology lets us move relay stations around without redesigning any module. This capability could be useful, for example, in reducing the length of an arc such as a9, which belongs to both large and small cycles, while in exchange increasing the length of a15, which is only part of C6. This example assumes that, before rebalancing, l(a9) = 3 and l(a15) = 1, whereas, after rebalancing, they become respectively 1 and 3. It also assumes that all other arcs have unit lengths, giving a final throughput of 10/(10 + 2) = 0.833 instead of 3/(3 + 2) = 0.6, a 38 percent improvement.

Figure 7. Analysis of throughput degradation for the MPEG-2 encoder. (The chart plots system throughput ϑ(G), from 0.1 to 1.0, against w(Ci), from 0 to 20, with one curve per cycle C1 through C6.)

Deep-submicron technologies suffer—and in the foreseeable future will continue to suffer—from delays on long wires that often force costly redesigns. Latency among various SOC components will vary considerably and in a manner that is difficult to predict. Latency-insensitive design is the foundation of a correct-by-construction methodology that lets us increase the robustness of a design implementation. This approach enables recovering from the arbitrary delay variations of the wires by changing their latency, leaving overall system functionality unaffected.
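The arithmetic in the MPEG-2 example above can be double-checked with a short script. The cycle memberships and weights are taken from the example; the Fraction-based helpers are purely illustrative.

```python
from fractions import Fraction

def cycle_mean(arc_weights):
    """lambda(C) = [w(C) + |C|] / |C| for a cycle given as its arc weights."""
    return Fraction(sum(arc_weights) + len(arc_weights), len(arc_weights))

def cycle_throughput(arc_weights):
    """Throughput dictated by a critical cycle: |C| / [w(C) + |C|]."""
    return 1 / cycle_mean(arc_weights)

# C2 = {a9, a11, a14, a12} with w = 4, 1, 0, 0 and C1 = {a9, a10, a12}
# with w = 4, 0, 0: C1 has the larger mean, so C2 is not critical.
print(cycle_mean([4, 1, 0, 0]))   # 9/4 = 2.250
print(cycle_mean([4, 0, 0]))      # 7/3 = 2.333...

# Rebalancing: with l(a9) = 3, legalization puts w = 2 on 3-arc cycle C1;
# shifting the delay to a15 (l = 3) puts w = 2 on 10-arc cycle C6 instead.
before = cycle_throughput([2, 0, 0])       # 3/(3 + 2) = 0.6
after = cycle_throughput([2] + [0] * 9)    # 10/(10 + 2) = 0.833...
print(float(after / before - 1))           # about 0.39, the quoted gain
```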

A promising avenue for further research is the application of these concepts to the optimization of the computation/communication tradeoffs that arise while designing software compilers for machines having a communication architecture with variable latency. MICRO

References
1. K. Keutzer et al., "System Level Design: Orthogonalization of Concerns and Platform-Based Design," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12, Dec. 2000, pp. 1523-1543.
2. P. Glaskowski, "Pentium 4 (Partially) Previewed," Microprocessor Report, vol. 14, no. 8, Aug. 2000, pp. 10-13.
3. L.P. Carloni, K.L. McMillan, and A.L. Sangiovanni-Vincentelli, "Theory of Latency-Insensitive Design," IEEE Trans. Computer-Aided Design, vol. 20, no. 9, Sept. 2001, pp. 1059-1076.
4. L.P. Carloni et al., "A Methodology for 'Correct-by-Construction' Latency Insensitive Design," Proc. Int'l Conf. Computer-Aided Design (ICCAD 99), IEEE Press, Piscataway, N.J., 1999, pp. 309-315.
5. T. Chelcea and S. Novick, "Robust Interfaces for Mixed-Timing Systems with Application to Latency-Insensitive Protocols," Proc. 38th Design Automation Conf. (DAC 01), ACM Press, New York, 2001, pp. 21-26.
6. L.P. Carloni and A.L. Sangiovanni-Vincentelli, "Performance Analysis and Optimization of Latency Insensitive Systems," Proc. 37th Design Automation Conf. (DAC 00), IEEE Press, Piscataway, N.J., 2000, pp. 361-367.

Luca P. Carloni is a PhD candidate in the Electrical Engineering and Computer Science Department of the University of California, Berkeley. His research interests include computer-aided design and design methodologies for electronic systems, embedded system design, logic synthesis, and combinatorial optimization. He has a Laurea in electrical engineering from the University of Bologna, Italy, and an MS in electrical engineering and computer science from the University of California, Berkeley.

Alberto L. Sangiovanni-Vincentelli holds the Buttner Chair in electrical engineering and computer science at the University of California, Berkeley. His research interests include all aspects of computer-aided design of electronic systems, hybrid systems and control, and embedded system design. He cofounded Cadence and Synopsys and received the Kaufman Award from the Electronic Design Automation Consortium for pioneering contributions to EDA. He is a fellow of the IEEE and a member of the National Academy of Engineering.

Direct questions and comments about this article to Luca P. Carloni, 211-10 Cory Hall, University of California at Berkeley, Berkeley CA 94720; lcarloni@ic.eecs.berkeley.edu.

For further information on this or any other computing topic, visit our Digital Library at http://computer.org/publications/dlib.