Rethinking Timing Optimization to Target Clocks and Logic at the Same Time
Paul Cunningham, Steev Wilcox, Marc Swinnen
Ten years ago, the EDA industry faced a crippling divergence in timing between RTL synthesis and placement
caused by rapidly rising wire capacitances relative to gate capacitances. Without some reasonable level of
placement knowledge to give credible estimates of wire length it was becoming impossible to measure design
timing with any accuracy during RTL synthesis. At this time, placement tools were not directly aware of
timing and focused instead on metrics indirectly related to timing such as total wire length. As chip designs
scaled to “deep sub-micron” geometries (180nm and 130nm), the change in timing around placement became
so significant and unpredictable that even manual iterations between synthesis and placement no longer
converged. The solution was to re-invent placement, making it both directly aware of timing and also weaving
in many of the logic optimization techniques exploited during RTL synthesis, for example gate sizing and net
buffering. This process was not easy, and ultimately saw a major turnover in the backend design tool
landscape as a new generation of “physical optimization” tools was developed, released and proliferated
throughout the chip design community.
Today timing is diverging once again, but for a different set of reasons (on-chip variation, low power, and
design complexity) and at a different point in the design flow (CTS). While this divergence has so far received
little media attention, this paper shows that the divergence is severe, so much so that we believe it is having
a critical impact on the economic viability of migration to the 32nm process node. Clock concurrent
optimization is a revolutionary new approach to timing optimization which comprehensively addresses this
divergence by merging physical optimization into CTS and simultaneously optimizing both clock delays and
logic delays using a single unified cost metric.
This paper begins with a brief overview of some basic concepts in clock based design, and a brief overview of
the traditional role of CTS within digital design flows. It then explains why and by how much design timing is
diverging around CTS. The concept of clock concurrent optimization is then introduced and its key defining
features outlined. The paper concludes with a summary of the key benefits of clock concurrent optimization
and an explanation of why it comprehensively addresses the divergence in design timing around CTS.
A Brief Overview of Clock-Based Design
Setup and Hold Constraints
Clocking was one of the great innovations which enabled the semiconductor industry to progress to where it
is today. Clocking elegantly quantizes time, enabling transistors to be abstracted to sequential state machines
and from state machines into a simple and intuitive programming paradigm for chip design, the register
transfer language (RTL).
© 2009 Azuro, Inc.
The fundamental assumption made by sequential state machines, and hence by any RTL specification, is the
assumption of sequential execution, i.e. that all parts of a state machine stay “in-step” with respect to each
other. This assumption translates to a set of constraints on delay which must be met by any clock-based
design if it is to function correctly. These constraints split into two classes:
• Constraints which ensure that every flip-flop always makes a forward step from state n to state n+1
whenever the clock ticks. These constraints are typically referred to as setup constraints.
• Constraints which ensure that no flip-flop ever makes more than one forward step, i.e. from state n to
state n+2, on a single clock tick. These constraints are typically referred to as hold constraints.
Setup Constraint: $L + G_{max} < T + C$
Hold Constraint: $L + G_{min} > C$
Figure 1: Setup and hold constraints in clock based design.
One setup and one hold constraint are required for every pair of flip-flops in a design which have at least one
functional logic path between them. Figure 1 summarizes the setup and hold constraints for a pair of
flip-flops, A and B, triggered by a clock with period T. The clock delay to A is denoted by L for “launching” clock
and the clock delay to B is denoted by C for “capturing” clock. $G_{min}$ and $G_{max}$
denote the minimum and maximum
logic path delays between the two flip-flops. For simplicity, and because it makes no difference to the
arguments we make in this paper, we have assumed the setup time, hold time, and clock-to-Q delay for the
flip-flops are all zero.
The setup constraint is read as follows: the worst-case time taken for a clock tick to reach A, and propagate a
new value to the input of B, must be less than the time taken for the next clock tick to reach B. If this isn’t true
then it is possible that B may be clocked when its input does not yet hold the correct next-state value.
The hold constraint is read as follows: the best-case time taken for a clock tick to reach A, and propagate a
new value to the input of B, must be greater than the time taken for that same clock tick to reach B. If this isn’t
true then it is possible that a next state value on the input to A may propagate all the way through to the
output of B in one clock tick, in essence causing B to skip erroneously from state n to state n+2 in one clock
tick.
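The two constraints can be checked mechanically. The following is a minimal sketch, assuming zero setup time, hold time, and clock-to-Q delay as in the text; the class name `PathTiming`, the function names, and all delay values (in picoseconds) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PathTiming:
    launch: int   # L: clock delay to the launching flip-flop A
    capture: int  # C: clock delay to the capturing flip-flop B
    g_min: int    # fastest logic path delay from A to B
    g_max: int    # slowest logic path delay from A to B


def setup_ok(p: PathTiming, period: int) -> bool:
    # Setup: L + G_max < T + C
    return p.launch + p.g_max < period + p.capture


def hold_ok(p: PathTiming) -> bool:
    # Hold: L + G_min > C
    return p.launch + p.g_min > p.capture


T = 1000  # hypothetical clock period in ps

balanced = PathTiming(launch=200, capture=200, g_min=50, g_max=900)
# A later capture clock relaxes setup but tightens hold on the same pair.
skewed = PathTiming(launch=200, capture=300, g_min=50, g_max=1050)

print(setup_ok(balanced, T), hold_ok(balanced))  # True True
print(setup_ok(skewed, T), hold_ok(skewed))      # True False
```

In the second case the logic path is longer than the clock period yet still meets setup, because the capture clock arrives later than the launch clock; the price is a hold violation on the same pair of flip-flops.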
Ideal and Propagated Clocks Timing
In the context of modern digital chip design flows, the setup and hold constraints outlined above are referred
to as a propagated clocks model of timing since the constraints start from the root of the clock and include
the time taken for the clock edge to propagate through the clock tree to each flip-flop. The propagated clocks
model of timing is the definitive criterion for correct chip function, and is the one used by timing sign-off tools
in design flows.
An ideal clocks model of timing simplifies the propagated clocks model of timing by assuming that the launch
and capture clock paths have the same delay, i.e. that L=C. In this case the setup and hold constraints simplify
as follows:
          Propagated Clocks          Ideal Clocks (assume L = C)
Setup:    $L + G_{max} < T + C$      $G_{max} < T$
Hold:     $L + G_{min} > C$          $G_{min} > 0$
Since $G_{min} > 0$ is to a first approximation always true, assuming that L=C simplifies the entire problem of
ensuring that a clock based design will function correctly to $G_{max} < T$. In this universe there is no need to
worry about clock delays or about minimum logic delays. All that matters is making sure that the maximum
logic path delay in the design, typically referred to as the “critical path”, is faster than the clock period. In
essence, clocks have been canceled out of the timing optimization problem.
The concept of ideal clocking is so dramatic and powerful as to have enabled an entire ecosystem of
“frontend” engineers and design tools living in a world of ideal clocks, and the origins of ideal clocking are so
deep rooted in the history books of the semiconductor industry that clock based design is itself often referred
to as “synchronous design” even though there is nothing fundamentally synchronous about clock based
design.
Clock Skew and Clock Tree Synthesis
If chip design begins in a world where clocks are ideal but ends in a world where clocks are propagated it
follows that at some point in the design flow a transition must be made between these two worlds. This
transition happens at the clock tree synthesis (CTS) step in the flow where clocks are physically built and
inserted into a design: see Figure 2.
Since ideal clocks timing assumes L=C for all setup and hold constraints it follows that the traditional purpose of CTS
is to build clocks such that L=C. If this can be achieved then propagated clocks timing will match ideal clocks
timing and the design flow will be “convergent”.
If a clock tree has n sinks and a set of paths P[1] to P[n] from its source to each sink then the “skew” of that
clock is defined as the difference between the shortest and longest of these paths: see Figure 3.
Mainstream CTS tools are architected primarily to build highly efficient “balanced” buffer trees to a very large
number of sinks with a small skew. Twenty years ago, the motivation for and benefits of building balanced clocks
were clear: clock skew was an upper bound on the worst difference between L and C for any pair of flip-flops,
i.e. an upper bound on |L-C| for any possible setup or hold constraint which could apply to a design. If a small
skew could be achieved relative to the clock period then a high degree of similarity between ideal and
propagated clocks timing was guaranteed. But it is important to remember that clock skew and the worst
difference between L and C are not the same thing and that for a modern SoC design at nanometer process
nodes it is entirely possible (in fact very common) for |L-C| to be significantly greater than the clock skew.
Figure 2: Traditional balanced clocks design flow
$\text{Skew} = \max(P[1], P[2], \ldots, P[n]) - \min(P[1], P[2], \ldots, P[n])$
Figure 3: Skew of a Clock Tree
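To make the difference between skew and |L-C| concrete, the sketch below computes the skew of two clock trees and then |L-C| for one logic path that crosses between them. The sink names and delays (in picoseconds) are hypothetical, and a path between two clock trees is only one of several ways in which |L-C| can exceed the skew.

```python
def skew(sink_delays):
    """Skew of one clock tree: longest minus shortest source-to-sink delay."""
    delays = list(sink_delays.values())
    return max(delays) - min(delays)


# Hypothetical source-to-sink delays (ps) for two separately balanced clock trees.
clk_a = {"ff1": 500, "ff2": 510}
clk_b = {"ff3": 800, "ff4": 805}

skew_a = skew(clk_a)
skew_b = skew(clk_b)

# A logic path launched at ff1 (on clk_a) and captured at ff3 (on clk_b).
launch, capture = clk_a["ff1"], clk_b["ff3"]
l_minus_c = abs(launch - capture)

print(skew_a, skew_b, l_minus_c)  # 10 5 300
```

Both trees meet a tight skew target, yet the launch and capture clock delays for this one constraint differ by far more than either skew.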
The distinction between clock skew and |L-C| is a critical foundation stone for this paper. Clock skew is a
concept defined in terms of worst differences in delay between source-to-sink paths in buffer trees. L and C
are delay variables in the setup and hold constraints of a propagated clocks model of timing and are not really
different in this context from the other delay variables, $G_{min}$
and $G_{max}$. The somewhat slippery nature of this
distinction and the ease with which a discussion can begin in the context of timing and then migrate
mistakenly into a context of skew is one of the primary reasons why we believe the divergence in design
timing around CTS has for so long gone relatively unnoticed by the chip design and EDA communities.
The purpose of this paper is not to argue that tight skews cannot be achieved for modern nanometer designs.
Nor is it to argue that the skew minimization techniques used by mainstream CTS tools no longer work for
modern nanometer designs. The purpose of this paper is to argue that the ability of tight clock skews to bind
ideal clocks timing to propagated clocks timing is broken; we estimate it broke in a commercial sense around
the 65nm node. No tweak or refinement to the definition of skew can fix this. The only solution is to give up
entirely on the concept of skew and focus CTS instead on the fundamental propagated clocks timing
constraints that will matter post-CTS in the flow. But in this context there is no longer any material distinction
between clock paths (L and C) and logic paths ($G_{min}$
and $G_{max}$). Constructively exploiting this observation is the
inspiration behind the clock concurrent approach to optimization.
The Clock Timing Gap
There is no question that clock based design, ideal clocks timing, the use of RTL to specify a chip, and the
concepts of frontend vs. backend design are all vital foundation stones for the continued success of the
semiconductor industry: their collective ability to enable design automation and streamline engineering
productivity would almost certainly be impossible to achieve by any other means.
However, the traditional role of CTS to build buffer trees with tight skew only makes sense if achieving these
tight skews reasonably binds ideal clocks timing to propagated clocks timing. If this is not the case then
timing decisions made using ideal clocks have only limited value. Accommodating a change in timing
landscape after CTS requires either accepting degradation in chip speed or accepting delay in time to market
due to increased iterations back to RTL synthesis and physical optimization. If it can be argued that the
divergence between ideal clocks and propagated clocks is both significant and fundamental, i.e. one where
there could never be any formula or metric which could bind them, then the only solution becomes to rethink
CTS as a timing optimization step in the flow which directly targets propagated clocks timing as its objective.
In this section we attempt to define and measure the magnitude of the gap between ideal and propagated
clocks timing. For a particular timing constraint i with launch clock delay L[i], capture clock delay C[i],
minimum and maximum logic path delays $G_{min}[i]$
and $G_{max}[i]$, the difference between ideal and propagated
clocks timing for that constraint can easily be seen to be the magnitude of C[i]-L[i]. For example if i were a
setup constraint then we have:
Propagated clocks timing: $L[i] + G_{max}[i] < T + C[i]$, i.e. $G_{max}[i] + (L[i] - C[i]) < T$
Ideal clocks timing: $G_{max}[i] < T$
Difference = $L[i] - C[i]$
A similar reasoning gives the opposite, $C[i] - L[i]$, for hold constraints. We define the clock timing gap for a
particular set of timing constraints (either setup or hold or a mixture of both) as:
Clock Timing Gap = $\sigma_i\left(\dfrac{L[i] - C[i]}{T}\right)$
Our choice of standard deviation, σ, on L[i]-C[i] rather than average or worst |L[i]-C[i]| is important: we do
not want to measure a large clock timing gap if the delta between ideal and propagated clocks timing is
systematic or applies only to a very small number of timing constraints. If this were the case then the gap
would not be a fundamental gap and could easily be worked around by applying a global safety margin (aka
global uncertainty) to the ideal clocks timing model or by manually applying a few individual sink pin offsets
to CTS. What we want to measure are true unsystematic divergences between ideal and propagated clocks
timing which apply to a significant proportion of the timing constraints. These divergences will never be
resolvable with a small amount of manual effort or with any generalizations to the concept of skew.
We divide L[i]-C[i] by the clock period T to normalize our metric so that it is expressed as a percentage of the
clock period. This enables us to meaningfully compare the average clock timing gap across a large number of
designs across a range of clock frequencies and process nodes.
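The metric is easy to compute from a list of (L, C) pairs. The sketch below is illustrative only: the function name `clock_timing_gap` and the sample delays are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which is used.

```python
from statistics import pstdev


def clock_timing_gap(constraints, period):
    """Standard deviation of (L - C) / T over a set of timing constraints.

    constraints: list of (launch_delay, capture_delay) pairs.
    Assumption: population standard deviation.
    """
    normalized = [(launch - capture) / period for launch, capture in constraints]
    return pstdev(normalized)


T = 1000  # hypothetical clock period in ps

# Every constraint is off by the same 100 ps: a systematic offset that a global
# margin could absorb, so the gap is zero.
systematic = [(600, 500), (700, 600), (900, 800), (400, 300)]

# Offsets of +/-300 ps with no common pattern: the gap is 30% of the period.
unsystematic = [(200, 500), (800, 500), (100, 400), (700, 400)]

gap_systematic = clock_timing_gap(systematic, T)
gap_unsystematic = clock_timing_gap(unsystematic, T)
print(gap_systematic, round(gap_unsystematic, 6))
```

The first data set has a large average |L-C| but a zero gap, which is exactly the behavior the choice of standard deviation is meant to give.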
Figure 4 below summarizes the average clock timing gap for the top 10% worst violated setup constraints
across a portfolio of over 60 real world commercial chip designs from 180nm to 40/45nm and from 200k to
1.26M placeable instances. It shows that while at 180nm the clock timing gap is small at around 7% of the
clock period, at 40/45nm the gap has widened to around 50% of the clock period. A gap of this magnitude is
sufficient to completely transform the timing landscape of a design beyond recognition between before and
after CTS. Since our measure is one of standard deviation and not average or worst difference this gap truly is
a fundamental divergence which can only be addressed by a fundamental rethink of the role of CTS in the
design flow. Building clocks to meet a tight skew target no longer achieves its purpose nor will any other
indirect metric ever bind ideal clocks timing to propagated clocks timing since the divergence is unsystematic
and large for a significant number of worst violating timing endpoints. The only solution is to directly target
the propagated clocks timing constraints and treat the launch and capture clock paths (L and C) as
optimization variables with the same significance and similar degrees of freedom to logic path variables ($G_{min}$
and $G_{max}$). This is what clock concurrent optimization is all about.
Figure 4: Clock Timing Gap across a portfolio of over 60 commercial designs (process nodes shown: 180nm, 130nm, 65nm, 40/45nm)
Explaining the Clock Timing Gap
There are three key underlying industry trends which are causing ideal and propagated clocks timing to
diverge, and it is the relatively simultaneous onset of all three trends that has caused the clock timing gap to
open up so dramatically at the 65nm node and below. These three trends are on-chip variation, clock gating,
and clock complexity.
On Chip Variation
On chip variation (OCV) is a manufacturing driven phenomenon. Two wires or two transistors which are
designed to be identical almost certainly won’t be once printed in silicon due to the lithographic challenges of
printing features smaller than the wavelength of light used to draw them. As a result the performance of two
supposedly identical transistors can differ by an unpredictable amount. This problem is a significant and
growing one, and at 45nm these random manufacturing variations can impact logic path delays by up to 20%.
OCV is particularly relevant for clock paths since the length of clock paths (i.e. the insertion delay of clock
trees) is rising exponentially with respect to clock periods. This is in part because the number of flip-flops in a
design continues to rise exponentially but also because resistances are rising so fast with successive process
shrinks that buffering across long distances, as is typically necessary in the clock, requires more and more
buffering per unit length. At 45nm it is not uncommon to see 3-5 times the clock period worth of delay in
launch and capture clock paths. Even if the impact of OCV is only 10% of path delay this still amounts to a
potential change in timing pictures of 30-50% of the clock period between ideal and propagated clock
models. The only reason why OCV has not already ground chip design completely to a halt is the fact that it
can be ignored on the common portion of the launch and capture clock paths using a technique known as
common path pessimism removal (CPPR) or clock reconvergence pessimism removal (CRPR): see Figure 5.
Figure 5: Propagated clocks timing with ±10% OCV derates and CPPR (setup and hold checks with 1.1 and 0.9 derate factors applied outside the common portion of the launch and capture clock paths)
CPPR is highly constraint dependent, impacting one pair of flip-flops completely differently from another
since it depends crucially on where in the clock tree the launch and capture clock paths converge for a
particular pair of flip-flops.
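The constraint dependence of CPPR can be illustrated with a small calculation. The sketch below is our own simplified model, not a reproduction of Figure 5: it assumes a late derate of 1.1 on the launch side (clock branch plus logic), an early derate of 0.9 on the capture side, and no derating of the shared clock segment when CPPR is enabled. The function name and all delays (in picoseconds) are hypothetical.

```python
LATE, EARLY = 1.1, 0.9  # assumed +/-10% OCV derates


def setup_slack_ocv(common, launch_branch, capture_branch, g_max, period, cppr=True):
    """Setup slack with OCV derates applied to a launch/capture pair.

    common: delay of the clock segment shared by launch and capture paths.
    With CPPR the shared segment contributes the same delay to both sides.
    """
    if cppr:
        launch_common = capture_common = common
    else:
        launch_common, capture_common = LATE * common, EARLY * common
    arrival = launch_common + LATE * (launch_branch + g_max)
    required = period + capture_common + EARLY * capture_branch
    return required - arrival


T = 1000
# Both pairs have nominal L = C = 3500 ps, i.e. zero skew, and the same logic delay.
late_split = setup_slack_ocv(3000, 500, 500, 700, T)    # paths share 3000 ps
early_split = setup_slack_ocv(500, 3000, 3000, 700, T)  # paths share only 500 ps
no_cppr = setup_slack_ocv(3000, 500, 500, 700, T, cppr=False)

print(round(late_split), round(early_split), round(no_cppr))  # 130 -370 -470
```

Under ideal clocks both pairs would show the same 300 ps of slack; with derates and CPPR one passes and the other fails, depending only on where the two clock paths separate.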
A traditional measure of clock skew ignores OCV, so even if clock skew is zero, once OCV derates and CPPR
are applied to a design the magnitude of L-C can be large for a significant number of logic paths. Also, since
CPPR makes the impact of OCV on launch and capture clock paths constraint dependent, there is no
meaningful way to predict or model this impact prior to CTS. In this sense OCV modeling, most specifically the
use of CPPR, is a direct and fundamental contributor to the clock timing gap.
Clock Gating
Power has for many modern chip design teams become as significant an economic driver as chip speed and
area, and the clock network is often the biggest single source of dynamic power dissipation in a design. It has
become standard practice for almost all modern designs to manage clock power aggressively through the
extensive use of clock gating to shut down the clock to portions of a design which do not need to be clocked in a
particular clock cycle. Clock gating can be at a very high level, for example shutting down the entire
applications processor on a mobile phone when it is not needed, or at a very fine grained level, for example
shutting down the top 8 bits of a 16-bit counter when they are not counting. Modern systems on a chip can
contain tens of thousands of clock gating elements.
From a timing perspective, a clock gate is just like a flip-flop, bringing with it its own setup and hold
constraint. But, unlike a flip-flop, clock gates exist inside a clock tree and not at its sinks: see Figure 6. In an
ideal clocks timing model a clock gate typically looks exactly the same as a flip-flop since the clock arrives
instantaneously everywhere (i.e. L=C=0), but in a propagated clocks timing model there can be a massive
difference between the launch and capture clock path delays for a clock gate, especially for architectural clock
gates high up in the clock tree.
Setup Constraint: $L + G_{max} < T + C$
Hold Constraint: $L + G_{min} > C$
Figure 6: Clock gate enable timing using a propagated clocks model of timing
Every clock gate added to a design adds to the clock timing gap on that design, and the more aggressive the
design team is in managing power, the more this gap is felt. This is because the most common workaround for
the pain caused by clock gating timing diverging post-CTS is to force all clock gates to the bottom of the clock
tree (by “cloning” them) so that the capture clock path delay is as close as possible to the launching clock
delay, thereby restoring convergence between ideal and propagated clocks timing on clock gates. This is,
however, the worst possible strategy for saving power.
Without insisting that all clock gates lie at the very bottom of the clock tree there is no meaningful way to
predict the relationship between L and C for clock gates, and therefore clock gating is also a direct and
fundamental contributor to the clock timing gap.
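A small numeric sketch shows why the position of a clock gate in the tree matters so much for its enable timing. It reuses the setup constraint $L + G_{max} < T + C$, where C is now the clock delay to the clock gate rather than to a flip-flop. The function name and all delay values (in picoseconds) are hypothetical.

```python
def setup_slack(launch, capture, g_max, period):
    """Margin by which L + G_max < T + C is met (negative means violated)."""
    return (period + capture) - (launch + g_max)


T = 1000      # hypothetical clock period
G_MAX = 400   # hypothetical worst logic delay on the enable path

# Ideal clocks: the clock arrives everywhere at once, L = C = 0.
ideal = setup_slack(0, 0, G_MAX, T)

# Propagated clocks: the enable is launched by a flip-flop at the bottom of the
# tree (L = 3000) but captured by an architectural gate near the root (C = 500).
gate_high_in_tree = setup_slack(3000, 500, G_MAX, T)

# Cloning the gate down to the bottom of the tree brings C close to L again.
gate_cloned_to_bottom = setup_slack(3000, 2900, G_MAX, T)

print(ideal, gate_high_in_tree, gate_cloned_to_bottom)  # 600 -1900 500
```

The enable path that looked comfortable under ideal clocks fails by almost two clock periods once clocks are propagated, and recovers only when the gate is pushed down the tree, which is the power-hungry workaround described above.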
Clock Complexity
Modern system-on-a-chip (SoC) designs are big: tens of millions of gates. They also exploit extensive IP
reuse, where pseudo-generic modules such as ARM cores, USB interfaces, PCI interfaces, memory controllers,
DSPs, baseband modems, and graphics processors are instantiated on a single die, configured to run in
particular modes, and stitched together to deliver some form of integrated system capability.
The clock network in these SoCs is not simple. In fact it has become phenomenally complex, often with well over a
hundred interlinked clock signals that can be branched and merged thousands of times. Part of the
complexity is simply a result of stitching together the many IP blocks, but much of the complexity is also
inherent to each of these IP blocks: the same low-power imperatives which drive clock gating are also driving
the deployment of a wide variety of clocks and clocking schemes at a variety of frequencies and voltages to
further control and manage clock activity. Power consumption during built-in system test and time on the
tester during production further impact clock complexity as scan chains continue to be sliced and diced and
ever more intricate clocking schemes are used to capture and shift quickly and power efficiently.
The end result is a dense spaghetti network of clock muxes, clock xors, and clock generators, entwined with
clock gating elements from the highest levels in the clock tree, where they shut down entire sub-chips, to the
lowest levels in the clock tree, where they may shut down only a handful of flip-flops: see Figure 7.
Figure 7: Clock network on a modern nanometer SoC
In this world of vast clock complexity, the definition of clock skew is itself non-obvious: rather than a tree or
set of trees we have a network with hundreds of sources and hundreds of thousands of sinks. In such a world
it is easy, and indeed even common, to find oneself constructing scenarios where making L=C for all flip-flops
is mathematically impossible. And even in the cases where it is theoretically possible to achieve this objective
the sheer size of the clock delays that would be required to achieve this would be so large, e.g. 10+ times the
clock period, as to make timing impossible to meet even with the tiniest of OCV margins in place. The
resulting dynamic power consumption and IR-drop would also almost certainly render the entire design
unusable.
The only way to implement designs of this complexity using a traditional design flow is to spend months of
manual effort carefully crafting a complex set of overlapping balancing constraints, typically referred to as
“skew groups”, which require that CTS balance certain sets of clock paths with other sets of clock paths. It can
take hundreds of such skew groups to be carefully crafted before any reasonable propagated clocks timing
can be achieved and such timing is never in practice achieved by ensuring that L=C for all pairs of flip-flops.
In essence, the impact of clock complexity on the clock timing gap has already broken the traditional role of
CTS in design flows, and design teams are already manually working around it at great cost by carefully
crafting highly complex sets of balancing constraints which are designed to achieve acceptable propagated
clocks timing and not L=C for all flip-flops.
Detailed Breakdown of the Clock Timing Gap
Figure 7 below shows the same clock timing gap graph as was shown in Figure 4 but further breaks out the
relative contributions of OCV, clock gating and inter-clock timing to this gap. The figure makes it clear that
each of the three trends contributes materially to the clock timing gap, and also highlights that clock
complexity has the most significant impact on this gap. The graph also highlights our observation that clock
skew, as traditionally defined only between registers on a single clock tree, is not broken. The clock timing
gap restricted only to setup and hold constraints between pairs of flip-flops in the same clock tree is at most
10% of the clock period, even at 40/45nm. But there are other factors ignored by the concept of clock skew
which are progressively eroding its ability to bind ideal to propagated clocks timing in the design flow.
Figure 7: Breakdown of trends contributing to the Clock Timing Gap (process nodes shown: 180nm, 130nm, 65nm, 40/45nm)
The message is simple: the role of CTS in the design flow must change. It can no longer be about minimizing
clock skews: it must somehow be about directly transitioning a design from ideal to propagated clocks timing
and using every possible trick there is to counteract the surprises that occur as the true propagated clocks
timing picture emerges, including clock gates, inter-clock paths and OCV margins.
Clock Concurrent Optimization
Before running CTS in the design flow there are no real launch and capture clock paths and timing
optimization focuses exclusively on the slowest logic path, $G_{max}$,
using an ideal clocks model of timing.
Provided clocks can be implemented such that the launch and capture clock paths are almost the same for all
setup and hold constraints, L=C, this focus on $G_{max}$
makes sense. But, as this paper has shown, OCV, clock
gating, and clock complexity have made balancing L and C, or indeed determining any systematic
relationship between L and C, an impossible goal.
The only meaningful goal for clock construction must therefore be to directly target the propagated clocks
timing constraints, selecting L and C specifically for the purpose of delivering the best possible propagated
clocks timing picture post-CTS. But the “best possible” L and C depend on the values of $G_{min}$ and $G_{max}$.
We have a chicken-and-egg problem: which comes first? Do we pick $G_{min}$ and $G_{max}$
then set L and C, or vice-versa?
Clock concurrent optimization, or CC-Opt for short, is the term we use to describe a new class of timing
optimization tools that merge physical optimization with CTS and that directly control all four variables in the
propagated clocks timing constraint equations (L, C, $G_{min}$,
and $G_{max}$)
at the same time.
Figure 8 visualizes the conceptual distinction between a traditional approach to timing optimization and a
clock concurrent approach to timing optimization. We use the term “clock concurrent design” or “clock
concurrent flow” to refer to any design methodology or design flow employing the use of clock concurrent
optimization.
Figure 8: Illustration of the difference between Physical Optimization and CCOpt (panels: Traditional Physical Optimization, targeting $G_{max} < T$ and skew; Clock Concurrent Optimization, targeting $L + G_{max} < T + C$)
Since CC-Opt treats both clock delays and logic delays as flexible parameters, the maximum possible speed
that a chip can be clocked at is no longer limited by the slowest logic path in a design. CC-Opt allows the
capture clock path to be longer than the launch clock path, in which case the logic path may have more than
the clock period to compute its result. But this extra time is not a free lunch: if C is bigger than L then time has
been borrowed from either the preceding or subsequent pipeline stages: see Figure 9.
Such time borrowing is iterative across multiple logic stages: if time can be borrowed from logic stage n+1 to
logic stage n, then time can also be borrowed from logic stage n+2 to logic stage n+1 and then again from logic
stage n+1 to logic stage n, and so on both forwards and backwards from logic stage n: see Figure 10. However,
the time borrowing is not unlimited, and must stop either when the chain of logic stages loops back on itself
or when it reaches an IO to the chip: see Figure 11.
Figure 9: Time borrowing (stage time budgets labelled T – Δ, T + Δ, T – Δ)
Figure 10: Multistage time borrowing (stage time budgets labelled T, T + Δ, T – Δ)
Figure 11: Different types of logic chain (labels include Input Tadpole Chain and Output Tadpole Chain)
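Time borrowing can be seen directly from the clock arrival times at consecutive registers: the time available to the logic stage between two registers is T + (C - L). The sketch below uses a hypothetical four-register pipeline with Δ = 150 ps; the function name and values are ours and are not taken from the figures.

```python
def stage_budgets(arrivals, period):
    """Time available to each logic stage between consecutive registers.

    arrivals[i] is the clock delay to register i; the stage from register i to
    register i+1 has launch delay arrivals[i] and capture delay arrivals[i+1].
    """
    return [period + arrivals[i + 1] - arrivals[i] for i in range(len(arrivals) - 1)]


T = 1000
DELTA = 150

# The clock to registers 0 and 2 is delayed by DELTA relative to registers 1 and 3.
arrivals = [DELTA, 0, DELTA, 0]
budgets = stage_budgets(arrivals, T)

print(budgets)       # [850, 1150, 850]
print(sum(budgets))  # 2850
```

The middle stage gains Δ only because its neighbours each give up Δ. The total across the pipeline is fixed at 3T plus the difference between the last and first clock arrival times, so the borrowing always has to be paid back somewhere along the chain.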
In a world where launch and capture clock paths are flexible optimization parameters, it is these chains of
logic functions which most influence the maximum possible clock speed: a chain with n logic stages has at
most n clock periods of total time available irrespective of the clock delays to each register in the chain.
Provided the worst total logic delay through the entire chain, $\sum G_{max}$,
is less than nT, it will be possible to
come up with a set of clock arrival times for each register on the chain that meets propagated clocks setup
constraints. We refer to this relationship as a setup chain constraint:
Setup chain constraint: $\sum_{i=1}^{n} G_{max}[i] < nT$
Only in the highly unusual situation where the most efficient distribution of delay along the chain is one
where each stage has exactly the same delay will the optimum clock network be a balanced clock network.
The traditional assumption of forcing each stage on the loop to have exactly the same amount of time by
balancing clocks is fundamentally unnecessary and furthermore, as this paper has shown, it is also
impossible to achieve on modern chips due to clock complexity, on-chip variation, and clock gating.
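For a chain that loops back on itself, one way to construct clock arrival times that satisfy the setup chain constraint is to give every stage the same share of the total spare time. The sketch below illustrates that idea only; it is not a description of how any CC-Opt tool works. It ignores hold constraints, OCV, and the cost of building the clock delays, and the function names and delays (in picoseconds) are hypothetical.

```python
def schedule_loop(stage_delays, period):
    """Clock arrival times for the registers of a looped chain.

    stage_delays[i] is the worst logic delay from register i to register
    (i + 1) mod n. Returns None when the setup chain constraint
    sum(stage_delays) < n * period is not met.
    """
    n = len(stage_delays)
    margin = (n * period - sum(stage_delays)) / n  # equal spare time per stage
    if margin <= 0:
        return None
    arrivals = [0.0]
    for g in stage_delays[:-1]:
        # Choose C so that (T + C) - (L + g) equals the per-stage margin.
        arrivals.append(arrivals[-1] + g - period + margin)
    shift = min(arrivals)  # only differences matter, so make all delays >= 0
    return [a - shift for a in arrivals]


def setup_slacks(stage_delays, arrivals, period):
    n = len(stage_delays)
    return [period + arrivals[(i + 1) % n] - arrivals[i] - stage_delays[i]
            for i in range(n)]


T = 1000
delays = [1200, 600, 900]  # the first stage alone is longer than the period

balanced_slacks = setup_slacks(delays, [0, 0, 0], T)
arrivals = schedule_loop(delays, T)
scheduled_slacks = setup_slacks(delays, arrivals, T)

print(balanced_slacks)   # [-200, 400, 100]
print(arrivals)          # [0.0, 300.0, 0.0]
print(scheduled_slacks)  # [100.0, 100.0, 100.0]
```

With balanced clocks the first stage fails by 200 ps even though the chain as a whole has 300 ps to spare; delaying the clock to the second register by 300 ps spreads that spare time evenly across all three stages.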
Setup Slack and Sequential Slack
The slowest logic function in a design is typically determined by computing the “setup slack” for each gate in a
design and finding the logic path which comprises those gates with the lowest slack value. If we use the term
Paths[g] to mean the set of all logic paths which pass through a logic gate g, and for each path p in Paths[g] we
use the terms L[p], C[p], and G[p] to refer to the launch clock delay, capture clock delay, and logic delay for
path p respectively, then:
Setup Constraint for a path p: $L[p] + G[p] < T + C[p]$
Setup Slack for gate g: $\min_{p \in Paths[g]} \big( (T + C[p]) - (L[p] + G[p]) \big)$
Setup slack is in essence the worst case margin by which all setup constraints which pass through g have been
met. If the setup slack at a gate is negative then a setup constraint must have been violated, and the
magnitude of the negativity denotes the amount by which that setup constraint has been violated. The
sequence of gates with the smallest setup slacks denotes the logic path which is most limiting chip speed and
is typically referred to as the worst negative path, worst violated path, or critical path. The slack value of the
gates on the critical path is typically referred to as the Worst Negative Slack (WNS).
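The setup slack definition translates directly into code. In the sketch below each path is a dictionary holding L[p], C[p], and G[p]; the gate names, the path data, and the helper name `setup_slack` are hypothetical, and delays are in picoseconds.

```python
def setup_slack(paths, period):
    """Setup slack for a gate: min over its paths of (T + C[p]) - (L[p] + G[p])."""
    return min((period + p["C"]) - (p["L"] + p["G"]) for p in paths)


T = 1000

# Hypothetical Paths[g] for two gates.
paths_through = {
    "g1": [{"L": 200, "C": 250, "G": 900}, {"L": 300, "C": 200, "G": 950}],
    "g2": [{"L": 200, "C": 250, "G": 900}],
}

slacks = {gate: setup_slack(paths, T) for gate, paths in paths_through.items()}
wns = min(slacks.values())

print(slacks)  # {'g1': -50, 'g2': 150}
print(wns)     # -50
```

Gate g1 lies on a path that misses its setup constraint by 50 ps, so that path is the critical path in this toy example and the WNS is -50 ps.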
It is possible to generalize the concept of setup slack to setup chain constraints, giving the concept of sequential
slack [Pan98, Cong00], where the term “sequential” is used to emphasize the notion that these slacks can cross
register boundaries. If we use the term Chains[g] to mean the set of all logic chains passing through a gate g,
and for each chain c we use the term n[c] to refer to the number of logic stages in chain c and G[c,i] to refer to
the worst logic delay at stage i in chain c, then:
Setup Chain Constraint for a chain c: $\sum_{i=1}^{n[c]} G[c,i] < n[c]\,T$
Sequential slack for gate g: $\min_{c \in Chains[g]} \Big( n[c]\,T - \sum_{i=1}^{n[c]} G[c,i] \Big)$
It is also helpful to normalize the sequential slack relative to the length of the chain by dividing by n[c] so that
the margin can be thought of as an average margin per logic stage which is therefore independent of chain
length. It also means that sequential slacks are reported on the same scale as traditional setup slacks – if the
normalized sequential slack is 100ps it means that the average setup slack along that chain will also be 100ps,
irrespective of the clock arrival times at each register on the chain.
Normalized Sequential slack for gate g: $\min_{c \in \mathrm{Chains}[g]} \Big( T - \frac{1}{n[c]} \sum_{i=1}^{n[c]} G[c,i] \Big)$
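As a minimal sketch of these definitions, the following Python functions compute the sequential slack of a chain (n[c] times the clock period, minus the sum of the stage delays) and its normalized form. Representing a chain as a plain list of worst stage delays is an assumption made for this illustration, as are the function names and the numbers.

```python
def sequential_slack(stage_delays, period):
    """n[c] * T minus the sum of the worst logic delays of each stage."""
    return len(stage_delays) * period - sum(stage_delays)


def normalized_sequential_slack(stage_delays, period):
    """Sequential slack divided by n[c]: the average margin per stage."""
    return sequential_slack(stage_delays, period) / len(stage_delays)


def gate_sequential_slack(chains, period):
    """Normalized sequential slack of a gate: minimum over its chains."""
    return min(normalized_sequential_slack(c, period) for c in chains)


# Delays in ps, clock period 1000ps. The first stage alone misses the
# period by 100ps, yet the two-stage chain has 100ps of margin per stage.
chain = [1100, 700]
total = sequential_slack(chain, period=1000)
per_stage = normalized_sequential_slack(chain, period=1000)
worst = gate_sequential_slack([chain, [900, 950, 1000]], period=1000)
```

The second chain in the last call has only 50ps of margin per stage, so it determines the value reported for a gate that both chains pass through.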
If the smallest sequential slack in a design is negative, then the amount by which the sequential slack is
negative, referred to as the Worst Negative Sequential Slack (WNSS), denotes how far off the circuit is from
achieving its desired clock speed. We use the term critical chain to describe the logic chain with the WNSS,
although the term critical cycle is also used in the literature [Hurst04] to describe the critical chain. We prefer
chain to cycle since it emphasizes that the sequence of gates with the WNSS need not form a loop and may
equally, and indeed often does, terminate in a design IO.
If sequential slack is negative then traditional setup slacks will never be positive irrespective of how the clocks
are implemented. If sequential slack is positive then it will be possible to build a clock network which delivers
clocks to each flip‐flop such that traditional setup slacks will also be positive – although this network almost
certainly will not be a balanced network. In this sense, using CC‐Opt, the maximum speed at which a chip can
be clocked is limited by the critical chain and not the critical path. This denotes a fundamental new degree of
freedom which can be exploited by CC‐Opt above and beyond physical optimization to make chips faster or
smaller or lower power.
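To make this concrete, the sketch below takes a single chain whose first and last registers are assumed to have the same clock arrival time (taken as zero) and picks arrival times for the registers in between so that every stage ends up with the same setup slack, namely the normalized sequential slack. This is an illustrative calculation under idealized assumptions (no hold constraints, no OCV, no clock network cost), not a description of how CC‐Opt works; all names and numbers are hypothetical.

```python
def stage_setup_slacks(arrivals, delays, period):
    """Setup slack of stage i, launched at arrivals[i], captured at arrivals[i + 1]."""
    return [period + arrivals[i + 1] - arrivals[i] - delays[i]
            for i in range(len(delays))]


def equalizing_schedule(delays, period):
    """Arrival times that give every stage the normalized sequential slack.

    Assumes the first register of the chain has arrival time 0; the last
    register then also comes out at 0 (up to floating point rounding).
    """
    target = period - sum(delays) / len(delays)
    arrivals = [0.0]
    for g in delays:
        # Choose the capture arrival so that this stage has slack == target.
        arrivals.append(arrivals[-1] + g - period + target)
    return arrivals


delays = [11.0, 7.0]   # stage 0 is slower than the clock period
period = 10.0

balanced = stage_setup_slacks([0.0, 0.0, 0.0], delays, period)
schedule = equalizing_schedule(delays, period)
skewed = stage_setup_slacks(schedule, delays, period)
```

With balanced clocks the first stage violates setup by one time unit; delaying the clock of the middle register by two units leaves both stages with one unit of positive slack, which is exactly the normalized sequential slack of the chain.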
It is however important to note that if sequential slacks are positive this does not imply that setup slacks can
also be made positive: firstly, once clocks are built the sequential slacks themselves will change due to OCV,
clock gating, and inter‐clock timing, and it is entirely possible that this change will make the sequential slacks
negative again. Secondly, the clock arrival times necessary to achieve positive setup slacks may not be
achievable with a feasibly sized clock tree in terms of area and power.
Using CCOpt the maximum possible clock speed is limited by the critical chain not the critical path.
Sequential Optimization and Useful Skew
Design optimization methods which exploit sequential slacks are usually termed sequential optimization
methods. These broadly fall into two camps: retiming approaches, which physically move logic across register
boundaries, and clock scheduling approaches, which intelligently apply delays to the clock tree to improve
setup slacks. Retiming was introduced over twenty years ago [Leiserson83, Leiserson91], but automatic
retiming approaches are flow‐invasive because of their impact on formal verification and testability, and have
not gained widespread acceptance. Scheduling approaches have also been around for almost twenty years
[Fishburn90], and are more applicable in today’s flows.
Academic papers on clock scheduling tend to split the problem into schedule calculation [e.g. Kourtev99b,
Ravindran03] and schedule implementation [e.g. Kourtev99, Xi97], although some papers [such as Held03]
tackle both halves of the problem separately in the same paper. Commercial EDA tools exploiting clock
scheduling typically favor more robust and direct algorithms which incrementally add buffers to an already
balanced clock tree to borrow time from positive‐slack stages to adjacent negative‐slack logic stages without
ever pre‐computing a desired schedule. Although the term useful skew was originally applied to the two part
calculate‐then‐implement approach [Xi97, Xi99], it is generally used today to mean any CTS approach that
results in an unbalanced tree for timing reasons, even if the approach used doesn’t ever explicitly calculate a
desired clock schedule. In any case, the key feature in common is that timing ultimately drives the clock
arrival times at registers and not a set of CTS balancing constraints.
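The incremental style of useful skew described above can be sketched as a simple greedy loop. This is a toy illustration and not the algorithm of any commercial tool: it assumes a single chain of registers, a fixed delay per added buffer (`buffer_delay`), clocks that can only be delayed, and it ignores hold constraints and OCV. All names are hypothetical.

```python
def borrow_time(delays, period, buffer_delay, max_passes=100):
    """Greedily delay register clocks to move slack into failing stages.

    delays[i] is the worst logic delay of the stage between register i and
    register i + 1. All clocks start balanced (arrival time 0) and the
    first and last registers are left untouched.
    """
    n = len(delays)
    arrivals = [0.0] * (n + 1)

    def slack(i):
        return period + arrivals[i + 1] - arrivals[i] - delays[i]

    for _ in range(max_passes):
        changed = False
        for k in range(1, n):
            # Stage k - 1 is captured at register k; stage k is launched from it.
            # One more buffer helps the failing incoming stage as long as the
            # outgoing stage can give up that much slack without failing.
            if slack(k - 1) < 0 and slack(k) - buffer_delay >= 0:
                arrivals[k] += buffer_delay
                changed = True
        if not changed:
            break
    return arrivals, [slack(i) for i in range(n)]


arrivals, slacks = borrow_time([12.0, 7.0, 9.0], period=10.0, buffer_delay=1.0)
```

No target schedule is ever computed here: the loop only reacts to the current slacks, which is the distinction the text draws between this style and the calculate‐then‐implement approach.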
The best way to think about how CTS can be generalized to be driven by timing and not by a set of balancing
constraints is to think of each flip‐flop as having four constraints on the arrival time of its clock which are a
simple re‐arrangement of the propagated clocks timing constraints on its D‐pin and on its Q‐pin: see Figure
12. These four constraints constrain the permissible arrival time to be within a window which depends on the
arrival times of flops in the logical fan‐in and fan‐out.
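Here P[i] denotes the clock arrival time at flip‐flop i, and Gmin[i] and Gmax[i] denote the minimum and maximum logic delays from flip‐flop i to flip‐flop i+1.
Setup on the D‐pin: $P[i-1] + G_{\max}[i-1] - T < P[i]$
Hold on the D‐pin: $P[i] < P[i-1] + G_{\min}[i-1]$
Hold on the Q‐pin: $P[i] > P[i+1] - G_{\min}[i]$
Setup on the Q‐pin: $P[i] < P[i+1] + T - G_{\max}[i]$
Combined window: $\max\big(P[i-1] + G_{\max}[i-1] - T,\; P[i+1] - G_{\min}[i]\big) < P[i] < \min\big(P[i+1] + T - G_{\max}[i],\; P[i-1] + G_{\min}[i-1]\big)$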
Figure 12: Timing driven clock arrival time windows
The dependency on the neighboring flops makes it easy to see that these clock arrival time windows are
globally intertwined. If a clock network can be built which delivers the clock to all flip‐flops within their
permissible arrival time windows then setup and hold time constraints will be met. This concept of windows
can also easily be generalized to include OCV derates and CPPR, and also can be extended to apply to other
timing endpoints such as clock gates, clock muxes, and clock generator blocks. It can also be applied to
internal nodes in the clock tree by a simple intersection of the windows of sub‐nodes.
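A window of this kind can be computed directly by re‐arranging the setup constraint (launch delay plus maximum logic delay less than clock period plus capture delay) and the hold constraint (launch delay plus minimum logic delay greater than capture delay) for the stage feeding a flip‐flop and the stage it feeds. The Python sketch below does this for one flip‐flop with a single fan‐in and a single fan‐out neighbor; the function and parameter names are assumptions made for illustration, and OCV derates and CPPR are left out.

```python
def arrival_window(p_prev, p_next, gmax_in, gmin_in, gmax_out, gmin_out, period):
    """Open interval (lower, upper) of permissible clock arrival times.

    p_prev, p_next     : clock arrival times at the fan-in and fan-out flops
    gmax_in, gmin_in   : max/min logic delay of the stage arriving at the D-pin
    gmax_out, gmin_out : max/min logic delay of the stage leaving the Q-pin
    """
    lower = max(p_prev + gmax_in - period,   # setup on the incoming stage
                p_next - gmin_out)           # hold on the outgoing stage
    upper = min(p_next + period - gmax_out,  # setup on the outgoing stage
                p_prev + gmin_in)            # hold on the incoming stage
    return lower, upper


def window_is_feasible(window):
    """A window admits some arrival time only if it is non-empty."""
    lower, upper = window
    return lower < upper


window = arrival_window(p_prev=0.0, p_next=0.0,
                        gmax_in=12.0, gmin_in=4.0,
                        gmax_out=7.0, gmin_out=1.0,
                        period=10.0)
```

In this example the incoming stage is two units slower than the clock period, so the clock must arrive more than two units late, while setup on the outgoing stage caps the delay at three units. Because the bounds depend on `p_prev` and `p_next`, moving one flip‐flop's clock shifts the windows of its neighbors, which is the global intertwining the text describes.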
Although the concept of windowing sounds very powerful, it has one key limitation which we alluded to in the
previous section: it is isolated from the physical optimization of logic paths. Separating the steps of
optimizing the logic and building the clock tree causes two key problems on real world commercial designs.
The first is the clock timing gap, which as we have shown requires that the desired schedule be based on a
true propagated clocks model of timing in order to properly account for OCV, inter‐clock paths and clock gate
enable timings. If schedule calculation happens before clocks are built then it cannot be based on a
propagated clocks model of timing!
The second problem is that clocks are not free. They cost area and power, and any increase in insertion delay
causes more setup and hold timing degradation due to OCV on clock paths. While positive sequential slacks
imply that a clock network can theoretically be built to make traditional setup slacks positive, this does not
mean that such a network would in practice have acceptable area and power, and nor does it mean that setup
slacks could be met if OCV derates are being applied to clock paths as well as logic paths. In fact, when OCV
derates are considered, it is entirely possible to get into a vicious spiral of increasing insertion delays causing
increasingly tight windows, which cause further increases in insertion delay in order to meet the windows,
culminating in the situation where it is impossible to make setup slacks positive.
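The cost of insertion delay under OCV can be seen in a small numerical sketch. It assumes ±10% derates in the style of the paper's Figure 5 (late derate on the launch clock branch and logic, early derate on the capture clock branch), assumes the common part of the two clock paths has already been removed by CPPR, and places the clock period on the capture side as in the basic setup constraint. The function names and numbers are hypothetical.

```python
def ocv_setup_slack(period, launch_branch, capture_branch, gmax, derate=0.10):
    """Setup slack with OCV derates applied to the non-common clock branches.

    launch_branch / capture_branch are the clock delays after the point where
    the launch and capture clock paths diverge.
    """
    late = 1.0 + derate
    early = 1.0 - derate
    return (period + early * capture_branch) - late * (launch_branch + gmax)


def slack_vs_insertion_delay(period, gmax, branch_delays, derate=0.10):
    """Slack for perfectly balanced branches of increasing length."""
    return [ocv_setup_slack(period, d, d, gmax, derate) for d in branch_delays]


# Balanced branches of 0, 5 and 10 time units with a clock period of 10.
slacks = slack_vs_insertion_delay(period=10.0, gmax=8.0,
                                  branch_delays=[0.0, 5.0, 10.0])
```

Even though the two branches are perfectly balanced, each extra unit of non‐common insertion delay costs roughly 0.2 units of setup slack at ±10% derates, so the slack here falls from about 1.2 to about 0.2 and then to about −0.8. Adding clock delay to fix one constraint therefore tightens others, which is the mechanism behind the spiral described above.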
CC‐Opt differentiates itself from traditional approaches to useful skew by bringing both clock scheduling and
physical optimization together under a unified architecture and basing all decisions on a true propagated
clocks model of timing, including inter‐clock paths, OCV and clock gate timing. CC‐Opt treats both clocks and
logic as equally important classes of citizen and understands that in practice either can be the limiting factor
on achievable chip speed. CC‐Opt must somehow enter the propagated clocks world as soon as possible and
then globally optimize both the clock and logic delays according to some coherent optimization objective
which can be bounded by either logic considerations or clock considerations.
The close marriage of physical optimization and clock construction, together with knowledge of the
side‐effects of various decisions in each domain, is the most difficult component of the CC‐Opt problem to solve
well. But it is also the key enabler for mainstream commercial adoption of CC‐Opt. The relaxation of the
requirement to balance clocks unleashes significant freedom, but this freedom is commercially useless if it is
not exploited wisely and in the context of a full propagated clocks model of timing. Key signs of the failure to
exploit this freedom properly are clock trees that are too large, insertion delays that are too long, and
significant hold timing problems which result in an unreasonable increase in design area once hold fix buffers
have been inserted.
CCOpt’s decisions are always based on a true propagated clocks measure of timing
including clock gates, interclock paths, OCV derates, and CPPR.
CCOpt will never close timing at the expense of creating an unreasonably large clock
network or an unreasonable number of hold time violations.
Clock Concurrent Design Flow
Inserting CC‐Opt into the design flow does not require any changes other than to replace physical
optimization and CTS with CC‐Opt, and to skip the traditional post‐CTS optimization step which becomes
redundant: see Figure 13. What is done before clock concurrent optimization and what is done after remain
the same. There is no change in timing sign‐off or in formal verification or in gate level simulation. No special
data structures or file formats are needed, and complex CTS configuration scripts become redundant as there
is no longer any need to specify any balancing constraints. There is a potential impact on the magnitude of
scan chain hold violations but this can easily be managed by enhancing scan chain stitching algorithms to
directly consider hold slacks and not just scan chain wire length. For example, Azuro’s Rubix™ CC‐Opt tool
already includes such a scan chain re‐stitching capability.
[Figure: flow diagrams for the Traditional Design Flow and the Clock Concurrent Design Flow]
Figure 13: Clock Concurrent Design Flow
Key Benefits of Clock Concurrent Optimization
This section overviews the key benefits that CC‐Opt can bring to the digital chip design community.
1. Increased chip speed or reduced chip area and power
Using CC‐Opt the maximum possible clock speed is limited by the critical chain and not the critical path in a
design. This is a fundamental new degree of freedom to help close timing above and beyond traditional design
flows. If the desired chip speed is already achievable without CC‐Opt then this same additional degree of
freedom can be exploited to reduce chip area or power. At 65nm and below the achievable increases in clock
speed can be as much as 20%.
2. Reduced IR‐drop
Since CC‐Opt does not balance clocks the peak current drawn by the clock network is significantly reduced. In
fact it is entirely feasible to extend CC‐Opt to directly consider peak current (or some reasonable estimate of
peak current) as an optimization parameter and specifically skew clocks and adjust logic path delays to
ensure that peak current is controlled to be within a pre‐specified limit. At advanced process nodes IR‐drop
can have a critical impact on timing sign‐off and chip packaging cost. CC‐Opt unshackles chip designers from
the traditional conflict of interest between tight skew being good for timing but terrible for IR‐drop.
3. Increased productivity and accelerated time to market
There are two distinct ways in which CC‐Opt increases designer productivity and accelerates time to market.
The first is due to a lack of any requirement to configure clock tree balancing constraints or manually set
insertion delay offsets for timing critical sink pins. For complex SoC designs composing a complete set of
balancing constraints can take more than a month, and much of this effort often needs repeating every time a
new netlist is provided by the frontend design team.
The second way in which CC‐Opt increases designer productivity and accelerates time to market is due to a
significant reduction in the number of iterations between the frontend and backend design teams. Many of
these iterations are for the sole purpose of asking the frontend design team to manually move logic across
register boundaries by changing the RTL. This manual moving of logic is in essence a form of sequential
optimization being performed manually and very inefficiently. Since the entire flow must be re‐run to
incorporate the RTL changes, there is no guarantee that the satisfactory aspects of the post‐placement timing
picture will persist. Therefore the requested changes may not have the intended benefit. Most of the need to
do this manual logic moving is completely eliminated using CC‐Opt since it can simply skew the clocks
instead. The time saving from these reduced iterations can be many months.
4. Accelerated migration to 45nm and below
The ability of CC‐Opt to perform timing optimization is not degraded by the growing clock timing gap. This is
because all decisions it makes are directly based on a propagated clocks model of timing. If architected
correctly, the motto for CC‐Opt is “if I can time it then I can optimize it”. Clock gates, complex clock muxing
configurations, OCV derates, CPPR, multi‐corner, and multi‐mode should all fall out in the wash so long as the
timing analysis engine is able to consider them. Without the use of a clock concurrent flow the clock timing
gap increasingly cripples timing closure pre to post‐CTS, and timing optimization steps downstream from CTS
just don’t have the horsepower to recover from the damage. Using CC‐Opt, migration to advanced process
nodes can happen faster and with significantly less pain.
Clocking lies at the heart of commercial chip design flows and is almost as central to the digital chip design
community as the transistor itself. But the traditional assumption that if clocks are balanced then propagated
clocks timing will mirror ideal clocks timing is fundamentally broken. Clock gating, clock complexity and
on‐chip variation are the key industry trends causing this divergence, which we refer to as the clock timing gap.
At 40/45nm the clock timing gap can be as much as 50% of the clock period resulting in an almost complete
rewrite of the timing landscape between ideal and propagated clocks timing.
Clock concurrent optimization gives up on the idea of clock balancing as both restrictive and unhelpful at
advanced process nodes. It merges CTS with physical optimization, both building clocks and optimizing logic
delays at the same time based directly on a propagated clocks model of timing. This unleashes a fundamental
new degree of freedom to borrow time across register boundaries resulting in chip speed becoming limited
by critical chains not critical paths.
Under the hood, the key challenge which CC‐Opt must tackle is the potential for clock networks to become
unreasonably large, and addressing this challenge requires the clock construction algorithms to become very
tightly bound with the logic optimization algorithms. The intimate relationship between clock construction
and logic optimization is what differentiates CC‐Opt from the traditional techniques of sequential
optimization and useful skew.
CC‐Opt delivers four key types of benefit to the digital chip design community: increased chip speed or
reduced chip area and power; reduced IR‐drop; increased productivity and accelerated time‐to‐market; and
accelerated migration to 45nm and below.
[Cong00] J. Cong and S. K. Lim, “Physical planning with retiming,” in Digest of Technical Papers of the
IEEE/ACM International Conference on Computer‐Aided Design, (San Jose, CA), pp. 1–7, November 2000.
[Fishburn90] J. P. Fishburn, “Clock Skew Optimization”, IEEE Trans. on Computers, vol. 39, pp. 945–951, 1990.
[Held03] S. Held, B. Korte, J. Maßberg, M. Ringe and J. Vygen, “Clock scheduling and clocktree construction for
high performance ASICs”, Proceedings of the 2003 IEEE/ACM international conference on Computer‐aided
[Hurst04] A. P. Hurst, P. Chong, A. Kuehlmann, “Physical placement driven by sequential timing analysis.”
Proc. ICCAD '04, pp. 379‐386.
[Kourtev99] I. S. Kourtev and E. G. Friedman, “Synthesis of clock tree topologies to implement nonzero clock
skew schedule,” in IEE Proceedings on Circuits, Devices, Systems, vol. 146, pp. 321–326, December 1999.
[Kourtev99b] I. S. Kourtev and E. G. Friedman, “Clock Skew Scheduling for Improved Reliability via Quadratic
Programming”, Proc. ICCAD 1999.
[Leiserson83] C. Leiserson and J. Saxe, “Optimizing synchronous systems,” Journal of VLSI and Computer
Systems, vol. 1, pp. 41–67, January 1983.
[Leiserson91] C. Leiserson and J. Saxe, “Retiming synchronous circuitry,” Algorithmica, vol. 6, pp. 5–35, 1991.
[Pan98] P. Pan, A. K. Karandikar, and C. L. Liu, “Optimal clock period clustering for sequential circuits with
retiming”, in IEEE Trans. on CAD, pp. 489‐498, 1998.
[Ravindran03] A. K. K. Ravindran and E. Sentovich, “Multi‐domain clock skew scheduling,” in Proceedings of
the 21st International Conference on Computer Aided Design, ACM, 2003.
[Xi97] J. G. Xi and W. W.‐M. Dai, “Useful‐skew clock routing with gate sizing for low power design,” J. VLSI
Signal Process. Syst., vol. 16, no. 2‐3, pp. 163–179, 1997.
[Xi99] J. G. Xi and D. Staepelaere, “Using Clock Skew as a Tool to Achieve Optimal Timing,” Integrated System
Design, April 1999.
other. This assumption translates to a set of constraints on delay which must be met by any clock‐based design if it is to function correctly. These constraints split into two classes:
Constraints which ensure that every flip‐flop always makes a forward step from state n to state n+1 whenever the clock ticks. These constraints are typically referred to as setup constraints.
Constraints which ensure that no flip‐flop ever makes more than one forward step from state n to state n+2 on a single clock tick. These constraints are typically referred to as hold constraints.
Setup Constraint: L + Gmax < T + C
Hold Constraint: L + Gmin > C
Figure 1: Setup and hold constraints in clock based design.
One setup and one hold constraint are required for every pair of flip‐flops in a design which have at least one functional logic path between them. Figure 1 summarizes the setup and hold constraints for a pair of flip‐ flops, A and B, triggered by a clock with period T. The clock delay to A is denoted by L for “launching” clock and the clock delay to B is denoted by C for “capturing” clock. Gmin, Gmax denote the minimum and maximum logic path delays between the two flip‐flops. For simplicity, and because it makes no difference to the arguments we make in this paper, we have assumed the setup time, hold time, and clock‐to‐Q delay for the flip‐flops are all zero. The setup constraint is read as follows: the worst‐case time taken for a clock tick to reach A, and propagate a new value to the input of B, must be less than the time taken for the next clock tick to reach B. If this isn’t true then it is possible that B be clocked when its input does not yet hold the correct next‐state value. The hold constraint is read as follows: the best‐case time taken for a clock tick to reach A, and propagate a new value to the input of B, must be greater than the time taken for that same clock tick to reach B. If this isn’t true then it is possible that a next state value on the input to A may propagate all the way through to the output of B in one clock tick, in essence causing B to skip erroneously from state n to state n+2 in one clock cycle.
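The two constraints of Figure 1 translate directly into code. The following Python sketch checks them for one pair of flip‐flops, with setup time, hold time, and clock‐to‐Q delay taken as zero as in the text; the function names and example values are illustrative only.

```python
def setup_met(launch, capture, gmax, period):
    """Setup constraint: L + Gmax < T + C."""
    return launch + gmax < period + capture


def hold_met(launch, capture, gmin):
    """Hold constraint: L + Gmin > C."""
    return launch + gmin > capture


# Flip-flop B is clocked 0.5 time units later than flip-flop A.
L, C, T = 1.0, 1.5, 10.0
gmax, gmin = 9.0, 0.3

setup_ok = setup_met(L, C, gmax, T)  # 10.0 < 11.5
hold_ok = hold_met(L, C, gmin)       # 1.3 > 1.5 is false
```

The example shows the tension between the two constraints: clocking B later than A gives the slow path more time to meet setup, but lets the fastest path race through to B before B has captured its old value.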
Ideal and Propagated Clocks Timing
In the context of modern digital chip design flows, the setup and hold constraints outlined above are referred to as a propagated clocks model of timing since the constraints start from the root of the clock and include the time taken for the clock edge to propagate through the clock tree to each flip‐flop. The propagated clocks
© 2009 Azuro, Inc.
typically referred to as the “critical path”. This transition happens at the clock tree synthesis (CTS) step in the flow where clocks are physically built and inserted into a design: see Figure 2. If a small skew could be achieved relative to the clock period then a high degree of similarity between ideal and propagated clocks timing was guaranteed. An ideal clocks model of timing simplifies the propagated clocks model of timing by assuming that the launch and capture clock paths have the same delay. and the origins of ideal clocking are so deep rooted in the history books of the semiconductor industry that clock based design is itself often referred to as “synchronous design” even though there is nothing fundamentally synchronous about clock based design itself! Clock Skew and Clock Tree Synthesis If chip design begins in a world where clocks are ideal but ends in a world where clocks are propagated it follows that at some point in the design flow a transition must be made between these two worlds. clocks have been canceled out of the timing optimization problem. In this case the setup and hold constraints simplify significantly: Propagated Clocks Setup: L + Gmax < T + C Hold: L + Gmin > C ( assume L = C ) ( assume L = C ) Ideal Clocks Gmax < T Gmin > 0 Since Gmin>0 is to a first approximation always true. In essence. an upper bound on |L‐C| for any possible setup or hold constraint which could apply to a design. If a clock tree has n sinks and a set of paths P to P[n] from its source to each sink then the “skew” of that clock is defined as the difference between the shortest and longest of these paths: see Figure 3. assuming that L=C simplifies the entire problem of ensuring that a clock based design will function correctly to Gmax < T. Inc.e. Mainstream CTS tools are architected primarily to build highly efficient “balanced” buffer trees to a very large number of sinks with a small skew. 3 . © 2009 Azuro. 
All that matters is making sure that the maximum logic path delay in the design.e. Since ideal clocks assumes L=C for all setup and hold constraints it follows that the traditional purpose of CTS is to build clocks such that L=C. But it is important to remember that clock skew and the worst difference between L and C are not the same thing and that for a modern SoC design at nanometer process nodes it is entirely possible (in fact very common) for |L‐C| to be significantly greater than the clock skew. i. the motivation and benefits of building balanced clocks was clear: clock skew was an upper bound on the worst difference between L and C for any pair of flip‐flops. is faster than the clock period. If this can be achieved then propagated clocks timing will match ideal clocks timing and the design flow will be “convergent”. that L=C. The concept of ideal clocking is so dramatic and powerful as to have enabled an entire ecosystem of “frontend” engineers and design tools living in a world of ideal clocks. i. In this universe there is no need to worry about clock delays or about minimum logic delays. Twenty years ago. model of timing is the definitive criteria for correct chip function. and is the one used by timing sign‐off tools in design flows.
. RTL Synthesis Ideal Clocks Timing Floorplan Initial Placement Physical Optimization Clock Tree Synthesis PostCTS Optimization Routing PostRoute Optimization Signoff Verification Final Layout Propagated Clocks Timing Figure 2: Traditional balanced clocks design flow clock P[n] P P Skew = max(P.….P[n]) Figure 3: Skew of a Clock Tree 4 © 2009 Azuro. Inc.P.….P.P[n]) – min(P.
The purpose of this paper is to argue that the ability of tight clock skews to bind ideal clock timing to propagated clocks timing is broken – we estimate it broke in a commercial sense around the 65nm node. and the concepts of frontend vs. backend design are all vital foundation stones for the continued success of the semiconductor industry – their collective ability to enable design automation and streamline engineering productivity would almost certainly be impossible to achieve by any other means. The purpose of this paper is not to argue that tight skews cannot be achieved for modern nanometer designs. The somewhat slippery nature of this distinction and the ease with which a discussion can begin in the context of timing and then migrate mistakenly into a context of skew is one of the primary reasons why we believe the divergence in design timing around CTS has for so long gone relatively unnoticed by the chip design and EDA communities. one where there could never be any formula or metric which could bind them. For example if i were a setup constraint then we have: Propagated clocks timing: L[i] + G[i]max < T + C[i] = G[i]max < T ‐ (L[i]‐C[i]) Ideal clocks timing: G[i]max < T Difference = L[i]‐C[i] A similar reasoning gives the opposite. L[i]‐C[i]. capture clock delay C[i]. Gmin and Gmax. 5 . The only solution is to give up entirely on the concept of skew and focus CTS instead on the fundamental propagated clocks timing constraints that will matter post‐CTS in the flow. No tweak or refinement to the definition of skew can fix this.e. But in this context there is no longer any material distinction between clock paths (L and C) and logic paths (Gmin and Gmax). i. the difference between ideal and propagated clocks timing for that constraint can easily be seen to be the magnitude of C[i]‐L[i]. Constructively exploiting this observation is the inspiration behind the clock concurrent approach optimization. 
For a particular timing constraint i with launch clock delay L[i]. L and C are delay variables in the setup and hold constraints of a propagated clocks model of timing and are not really different in this context from the other delay variables. for hold constraints. The Clock Timing Gap There is no question that clock based design. ideal clocks timing. If it can be argued that the divergence between ideal clocks and propagated clocks is both significant and fundamental. We define the clock timing gap for a particular set of timing constraints (either setup or hold or a mixture of both) as: Clock Timing Gap = Li C i T © 2009 Azuro. If this is not the case then timing decisions made using ideal clocks have only limited value. Accommodating a change in timing landscape after CTS requires either accepting degradation in chip speed or accepting delay in time to market due to increased iterations back to RTL synthesis and physical optimization. minimum and maximum logic path delays G[i]min and G[i]max. the use of RTL to specify a chip. Inc. In this section we attempt to define and measure the magnitude of the gap between ideal and propagated clocks timing. However. the traditional role of CTS to build buffer trees with tight skew only makes sense if achieving these tight skews reasonably binds ideal clocks timing to propagated clocks timing. Nor is it to argue that the skew minimization techniques used by mainstream CTS tools no longer work for modern nanometer designs. The distinction between clock skew and |L‐C| is a critical foundation stone for this paper. Clock skew is a concept defined in terms of worst differences in delay between source‐to‐sink paths in buffer trees. then the only solution becomes to rethink CTS as a timing optimization step in the flow which directly targets propagated clocks timing as its objective.
Inc. on L[i] – C[i] rather than average or worst |L[i] – C[i]| is important: we do not want to measure a large clock timing gap if the delta between ideal and propagated clocks timing is systematic or applies only to a very small number of timing constraints. What we want to measure are true unsystematic divergences between ideal and propagated clocks timing which apply to a significant proportion of the timing constraints. This enables us to meaningfully compare the average clock timing gap across a large number of designs across a range of clock frequencies and process nodes.26M placeable instances. σ. Since our measure is one of standard deviation and not average or worst difference this gap truly is a fundamental divergence which can only be addressed by a fundamental rethink of the role of CTS in the design flow. 60% Average Clock Timing Gap for 10% worst violated setup constraints 50% 40% 30% 20% 10% 0% 180nm 130nm Process Node 65nm 40/45nm Figure 4: Clock Timing Gap across a portfolio of over 60 commercial designs 6 © 2009 Azuro. . We divide L[i]‐C[i] by the clock period T to normalize our metric so that it is expressed as a percentage of the clock period. If this were the case then the gap would not be a fundamental gap and could easily be worked around by applying a global safety margin (aka global uncertainty) to the ideal clocks timing model or by manually applying a few individual sink pin offsets to CTS. The only solution is to directly target the propagated clocks timing constraints and treat the launch and capture clock paths (L and C) as optimization variables with the same significance and similar degrees of freedom to logic path variables (Gmin and Gmax). Building clocks to meet a tight skew target no longer achieves its purpose nor will any other indirect metric ever bind ideal clocks timing to propagated clocks timing since the divergence is unsystematic and large for a significant number of worst violating timing endpoints. 
Figure 4 below summarizes the average clock timing gap for the top 10% worst violated setup constraints across a portfolio of over 60 real world commercial chip designs from 180nm to 40/45nm and from 200k to 1. It shows that while at 180nm the clock timing gap is small at around 7% of the clock period. These divergences will never be resolvable with a small amount of manual effort or with any generalizations to the concept of skew. Our choice of standard deviation. A gap of this magnitude is sufficient to completely transform the timing landscape of a design beyond recognition between before and after CTS. This is what clock concurrent optimization is all about. at 40/45nm the gap has widened to around 50% of the clock period.
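Reading the definition in this section as the standard deviation of L[i] − C[i] over a set of timing constraints, divided by the clock period T, the metric can be sketched in a few lines of Python. The use of the population standard deviation, the function name, and the example data are assumptions made for illustration; the paper does not say which estimator was used, and the selection of the 10% worst violated setup constraints reported in Figure 4 is not reproduced here.

```python
from statistics import pstdev


def clock_timing_gap(launch_delays, capture_delays, period):
    """Standard deviation of L[i] - C[i], as a fraction of the clock period."""
    diffs = [l - c for l, c in zip(launch_delays, capture_delays)]
    return pstdev(diffs) / period


# A systematic offset (every launch clock 2 units later than its capture
# clock) gives no gap: it could be absorbed by a global margin.
systematic = clock_timing_gap([3, 3, 3, 3], [1, 1, 1, 1], period=10)

# Unsystematic differences of the same magnitude give a gap of 20%.
unsystematic = clock_timing_gap([1, 3, 1, 3], [3, 1, 3, 1], period=10)
```

The two cases illustrate why a standard deviation is used rather than an average or worst‐case difference: only divergence that varies from constraint to constraint counts toward the gap.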
OCV is particularly relevant for clock paths since the length of clock paths (i. the insertion delay of clock trees) is rising exponentially with respect to clock periods. and at 45nm these random manufacturing variations can impact logic path delays by up to 20%. Even if the impact of OCV is only 10% of path delay this still amounts to a potential change in timing pictures of 30‐50% of the clock period between ideal and propagated clock models. clock P PQ PR Q QC QB RA A R RB B logic BCmax C logic AB max Setup: PR+1. This is in part because the number of flip‐flops in a design continues to rise exponentially but also because resistances are rising so fast with successive process shrinks that buffering across long distances.9(RB) Hold: PR+0. and it is the relatively simultaneous onset of all three trends that has caused the clock timing gap to open up so dramatically at the 65nm node and below. Inc.e.1(RA+ABmax) < PR+0. © 2009 Azuro. Explaining the Clock Timing Gap There are three key underlying industry trends which are causing ideal and propagated clocks timing to diverge.1(RB) Setup: PQ+1. At 45nm it is not uncommon to see 3‐5 times the clock period worth of delay in launch and capture clock paths. These three trends are on‐chip variation. 7 .1(QB+BCmax) < PQ+0. The only reason why OCV has not already ground chip design completely to a halt is the fact that it can be ignored on the common portion of the launch and capture clock paths using a technique known as common path pessimism removal (CPPR) or clock reconvergence pessimism removal (CRPR): see Figure 5.9(RA+ABmax) > PR+1. On Chip Variation On chip variation (OCV) is a manufacturing driven phenomenon. This problem is a significant and growing one. as is typically necessary in the clock. impacting one pair of flip‐flops completely differently from another since it depends crucially on where in the clock tree the launch and capture clock paths converge for a particular pair of flip‐flops. 
requires more and more buffering per unit length. clock gating.1(QC) Figure 5: Propagated clocks timing with ±10% OCV derates and CPPR CPPR is highly constraint dependent. and clock complexity. Two wires or two transistors which are designed to be identical almost certainly won’t be once printed in silicon due to the lithographic challenges of printing features smaller than the wavelength of light used to draw them.9(QC) Hold: PQ+0. As a result the performance of two supposedly identical transistors can differ by an unpredictable amount.9(QB+BCmax) > PQ+1.
A traditional measure of clock skew ignores OCV, so even if clock skew is zero, once OCV derates and CPPR are applied to a design the magnitude of L - C can be large for a significant number of logic paths. Also, since CPPR makes the impact of OCV on launch and capture clock paths constraint dependent, there is no meaningful way to predict or model this impact prior to CTS. In this sense OCV modeling, most specifically the use of CPPR, is a direct and fundamental contributor to the clock timing gap.

Clock Gating

Power has for many modern chip design teams become as significant an economic driver as chip speed and area, and the clock network is often the biggest single source of dynamic power dissipation in a design. It has become standard practice for almost all modern designs to manage clock power aggressively through the extensive use of clock gating to shut down the clock to portions of a design which do not need to be clocked in a particular clock cycle. Clock gating can be at a very high level, for example shutting down the entire applications processor on a mobile phone when it is not needed, or at a very fine grained level, for example shutting down the top 8 bits of a 16-bit counter when they are not counting. Modern systems on a chip can contain tens of thousands of clock gating elements.

From a timing perspective, a clock gate is just like a flip-flop, bringing with it its own setup and hold constraint. But, unlike a flip-flop, clock gates exist inside a clock tree and not at its sinks: see Figure 6. In an ideal clocks timing model a clock gate typically looks exactly the same as a flip-flop since the clock arrives instantaneously everywhere (i.e. L=C=0), but in a propagated clocks timing model there can be a massive difference between the launch and capture clock path delays for a clock gate, especially for architectural clock gates high up in the clock tree.

[Figure 6 diagram labels: clock; launch path L; capture path C, with L≠C; logic delays Gmin and Gmax; enable pin en of clock gate CG]
Setup Constraint: L + Gmax < T + C
Hold Constraint: L + Gmin > C
Figure 6: Clock gate enable timing using a propagated clocks model of timing

Every clock gate added to a design adds to the clock timing gap on that design, and the more aggressive the design team is in managing power, the more this gap is felt. This is because the most common workaround for the pain caused by clock gating timing diverging post-CTS is to force all clock gates to the bottom of the clock tree (by "cloning" them) so that the capture clock path delay is as close as possible to the launching clock delay, thereby restoring convergence between ideal and propagated clocks timing on clock gates. This is, however, the worst possible strategy for saving power.
Without insisting that all clock gates lie at the very bottom of the clock tree there is no meaningful way to predict the relationship between L and C for clock gates, and therefore clock gating is also a direct and fundamental contributor to the clock timing gap.

Clock Complexity

Modern system-on-a-chip (SoC) designs are big: tens of millions of gates. They also exploit extensive IP reuse, where pseudo-generic modules such as ARM cores, DSPs, USB interfaces, PCI interfaces, memory controllers, baseband modems, and graphics processors are instantiated on a single die, configured to run in particular modes, and stitched together to deliver some form of integrated system capability. The clock network in these SoCs is not simple. In fact it has become phenomenally complex, often with well over a hundred interlinked clock signals that can be branched and merged thousands of times: see Figure 7.

Part of the complexity is simply a result of stitching together the many IP blocks, but much of the complexity is also inherent to each of these IP blocks: the same low-power imperatives which drive clock gating are also driving the deployment of a wide variety of clocks and clocking schemes at a variety of frequencies and voltages to further control and manage clock activity. Power consumption during built-in system test and time on the tester during production further impact clock complexity as scan chains continue to be sliced and diced and ever more intricate clocking schemes are used to capture and shift quickly and power efficiently. The end result is a dense spaghetti network of clock muxes, clock xors, and clock generators, entwined with clock gating elements from the highest levels in the clock tree, where they shut down entire sub-chips, to the lowest levels in the clock tree, where they may shut down only a handful of flip-flops.

[Figure 7 diagram labels: clocks Clk-A, Clk-B, Clk-C, Clk-D; an AOI2 cell; flip-flop groups of 500 FFs up to 10,000 FFs]
Figure 7: Clock network on a modern nanometer SoC

In this world of vast clock complexity, the definition of clock skew is itself non-obvious: rather than a tree or set of trees we have a network with hundreds of sources and hundreds of thousands of sinks. In such a world it is easy, and indeed even common, to find oneself constructing scenarios where making L=C for all flip-flops is mathematically impossible. And even in the cases where it is theoretically possible to achieve this objective the sheer size of the clock delays that would be required to achieve this would be so large, e.g. 10+ times the
clock period, as to make timing impossible to meet even with the tiniest of OCV margins in place. The resulting dynamic power consumption and IR-drop would also almost certainly render the entire design useless.

The only way to implement designs of this complexity using a traditional design flow is to spend months of manual effort carefully crafting a complex set of overlapping balancing constraints, typically referred to as "skew groups", which require that CTS balance certain sets of clock paths with other sets of clock paths. It can take hundreds of such skew groups to be carefully crafted before any reasonable propagated clocks timing can be achieved and such timing is never in practice achieved by ensuring that L=C for all pairs of flip-flops. In essence, the impact of clock complexity on the clock timing gap has already broken the traditional role of CTS in design flows, and design teams are already manually working around it at great cost by carefully crafting highly complex sets of balancing constraints which are designed to achieve acceptable propagated clocks timing and not L=C for all flip-flops.

Detailed Breakdown of the Clock Timing Gap

Figure 7 below shows the same clock timing gap graph as was shown in Figure 3 but further breaks out the relative contributions of OCV, clock gating and inter-clock timing to this gap. The figure makes it clear that each of the three trends contributes materially to the clock timing gap, and also highlights that clock complexity has the most significant impact on this gap.

[Figure 7 chart: vertical axis "Average Clock Timing Gap for 10% worst violated setup constraints", 0% to 60%; horizontal axis "Process Node" with 180nm, 130nm, 65nm, 40/45nm; series: ocv, interclock, clock gates, reg-to-reg]
Figure 7: Breakdown of trends contributing to the Clock Timing Gap

The graph also highlights our observation that clock skew, as traditionally defined only between registers on a single clock tree, is not broken. The clock timing gap restricted only to setup and hold constraints between pairs of flip-flops in the same clock tree is at most 10% of the clock period, even at 40/45nm. But there are other factors ignored by the concept of clock skew which are progressively eroding its ability to bind ideal to propagated clocks timing in the design flow, including clock gates, inter-clock paths and OCV margins.

The message is simple: the role of CTS in the design flow must change. It can no longer be about minimizing clock skews; it must somehow be about directly transitioning a design from ideal to propagated clocks timing and using every possible trick there is to counteract the surprises that occur as the true propagated clocks timing picture emerges.
Clock Concurrent Optimization

Before running CTS in the design flow there are no real launch and capture clock paths and timing optimization focuses exclusively on the slowest logic path, Gmax, using an ideal clocks model of timing. Provided clocks can be implemented such that the launch and capture clock paths are almost the same for all setup and hold constraints, L=C, this focus on Gmax makes sense. But, as this paper has shown, OCV, clock gating, and clock complexity have made balancing L and C, or indeed determining any systematic relationship between L and C, an impossible goal. The only meaningful goal for clock construction must therefore be to directly target the propagated clocks timing constraints, selecting L and C specifically for the purpose of delivering the best possible propagated clocks timing picture post-CTS. But the "best possible" L and C depend on the values of Gmax and Gmin so we have a chicken-and-egg problem: which comes first? Do we pick Gmax and Gmin then set L and C, or vice-versa?

Clock concurrent optimization, or CC-Opt for short, is the term we use to describe a new class of timing optimization tools that merge physical optimization with CTS and that directly control all four variables in the propagated clocks timing constraint equations (L, C, Gmin, and Gmax) at the same time. We use the term "clock concurrent design" or "clock concurrent flow" to refer to any design methodology or design flow employing the use of clock concurrent optimization. Figure 8 visualizes the conceptual distinction between a traditional approach to timing optimization and a clock concurrent approach to timing optimization.

[Figure 8 diagram, left panel "Traditional Physical Optimization": constraint Gmax < T, with Gmax variable, T fixed, Skew fixed. Right panel "Clock Concurrent Optimization": constraint L + Gmax < T + C, with L variable, Gmax variable, T fixed, C variable]
Figure 8: Illustration of the difference between Physical Optimization and CCOpt
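The coupling between the four variables can be seen in a small sketch of the propagated clocks constraints L + Gmax < T + C and L + Gmin > C. The function names and the delay values below are hypothetical, and the example ignores OCV derates.

```python
def setup_slack(launch, capture, g_max, period):
    """Margin on the setup constraint L + Gmax < T + C."""
    return (period + capture) - (launch + g_max)


def hold_slack(launch, capture, g_min):
    """Margin on the hold constraint L + Gmin > C."""
    return (launch + g_min) - capture


T = 1.0                 # clock period, held fixed
G_MAX, G_MIN = 1.1, 0.3  # slowest and fastest logic delay on the path

# Balanced clocks (L = C): the setup constraint fails because Gmax > T.
balanced = (setup_slack(2.0, 2.0, G_MAX, T), hold_slack(2.0, 2.0, G_MIN))

# Delaying the capture clock by 0.2 fixes setup while hold still passes.
skewed = (setup_slack(2.0, 2.2, G_MAX, T), hold_slack(2.0, 2.2, G_MIN))

# Delaying it by 0.4 over-corrects: setup passes but hold now fails.
over_skewed = (setup_slack(2.0, 2.4, G_MAX, T), hold_slack(2.0, 2.4, G_MIN))

for name, (s, h) in [("balanced", balanced), ("skewed", skewed),
                     ("over-skewed", over_skewed)]:
    print(f"{name:12s} setup {s:+.2f}  hold {h:+.2f}")
```

The acceptable range for C depends on both Gmax and Gmin, and the acceptable logic delays depend on L and C, which is the chicken-and-egg relationship described above.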
Logic Chains

Since CC-Opt treats both clock delays and logic delays as flexible parameters, the maximum possible speed that a chip can be clocked at is no longer limited by the slowest logic path in a design. CC-Opt allows the capture clock path to be longer than the launch clock path, in which case the logic path may have more than the clock period to compute its result. But this extra time is not a free lunch: if C is bigger than L then time has been borrowed from either the preceding or subsequent pipeline stages: see Figure 9.

[Figure 9 diagram labels: clock with period T; clock offsets Δ1 and Δ2; stage times T - Δ1, T + Δ1 + Δ2, T - Δ2]
Figure 9: Time borrowing

Such time borrowing is iterative across multiple logic stages: if time can be borrowed from logic stage n+1 to logic stage n, then time can also be borrowed from logic stage n+2 to logic stage n+1 and then again from logic stage n+1 to logic stage n, and so on both forwards and backwards from logic stage n: see Figure 10.

[Figure 10 diagram labels: clock with period T; clock offsets Δ1 and Δ2; stage times T, T - Δ1, T + Δ1 + Δ2, T - Δ2, T]
Figure 10: Multistage time borrowing

However, the time borrowing is not unlimited, and must stop either when the chain of logic stages loops back on itself or when it reaches an IO to the chip: see Figure 11.
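As a rough illustration of multistage time borrowing, the sketch below takes a chain of logic stages whose first and last clock arrival times are pinned to the same value (as at a chip IO, or where a loop returns to its starting register) and spreads the available time so that every stage gets the same margin. The equal-margin policy, the function name, and the delays are assumptions made for illustration only. Hold constraints and OCV are ignored.

```python
def borrow_schedule(stage_delays, period):
    """Return (margin per stage, clock arrival offsets) for a pinned chain.

    arrivals[i] is the clock offset at the register in front of stage i.
    arrivals[0] is fixed at 0 and the schedule is built so that the final
    register also lands on 0, so no time is borrowed from outside the chain.
    """
    n = len(stage_delays)
    margin = (n * period - sum(stage_delays)) / n
    arrivals = [0.0]
    for delay in stage_delays:
        # Stage i is given (period + arrivals[i+1] - arrivals[i]) of time;
        # choose arrivals[i+1] so that this equals delay + margin.
        arrivals.append(arrivals[-1] + delay + margin - period)
    return margin, arrivals


T = 1.0
delays = [1.2, 0.7, 0.8]  # the first stage is slower than the clock period

margin, arrivals = borrow_schedule(delays, T)
stage_times = [T + arrivals[i + 1] - arrivals[i] for i in range(len(delays))]

print("margin per stage:", round(margin, 3))
print("clock offsets:   ", [round(a, 3) for a in arrivals])
print("time per stage:  ", [round(t, 3) for t in stage_times])
```

In this example the slow first stage receives more than one clock period, paid for by the two faster stages behind it, and the offsets return to zero at the end of the chain, which is why borrowing cannot continue past a loop or an IO.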
[Figure 11 diagram labels: Looping Chain; Input Tadpole Chain (Input); Output Tadpole Chain (Output); IO Chain (Input, Output)]
Figure 11: Different types of logic chain

In a world where launch and capture clock paths are flexible optimization parameters, it is these chains of logic functions which most influence the maximum possible clock speed: a chain with n logic stages has at most n clock periods of total time available irrespective of the clock delays to each register in the chain. Provided the worst total logic delay through the entire chain, Σi(G[i]max), is less than nT, it will be possible to come up with a set of clock arrival times for each register on the chain that meets propagated clocks setup constraints. We refer to this relationship as a setup chain constraint:

Setup chain constraint: $\sum_{i=1}^{n} G[i]_{max} < nT$

The traditional assumption of forcing each stage on the loop to have exactly the same amount of time by balancing clocks is fundamentally unnecessary and furthermore, as this paper has shown, it is also impossible to achieve on modern chips due to clock complexity, on-chip variation, and clock gating. Only in the highly unusual situation where the most efficient distribution of delay along the chain is one where each stage has exactly the same delay will the optimum clock network be a balanced clock network.

Setup Slack and Sequential Slack

The slowest logic function in a design is typically determined by computing the "setup slack" for each gate in a design and finding the logic path which comprises those gates with the lowest slack value. If we use the term Paths[g] to mean the set of all logic paths which pass through a logic gate g, and for each path p in Paths[g] we use the terms L[p], C[p], and G[p] to refer to the launch clock delay, capture clock delay, and logic delay for path p respectively, then:

Setup Constraint for a path p: $L[p] + G[p] < T + C[p]$
Setup Slack for gate g: $\min_{p \in Paths[g]} \left( T + C[p] - L[p] - G[p] \right)$

Setup slack is in essence the worst case margin by which all setup constraints which pass through g have been met. If the setup slack at a gate is negative then a setup constraint must have been violated, and the magnitude of the negativity denotes the amount by which that setup constraint has been violated. The sequence of gates with the smallest setup slacks denotes the logic path which is most limiting chip speed and
is typically referred to as the worst negative path, worst violated path, or critical path. The slack value of the gates on the critical path is typically referred to as the Worst Negative Slack (WNS).

It is possible to generalize the concept of setup slack to setup chain constraints, giving the concept of sequential slack [Pan98, Cong00], where the term "sequential" is used to emphasize the notion that these slacks can cross register boundaries. If we use the term Chains[g] to mean the set of all logic chains passing through a gate g, and for each chain c we use the term n[c] to refer to the number of logic stages in chain c and G[c,i] to refer to the worst logic delay at stage i in chain c, then:

Setup Chain Constraint for a chain c: $\sum_{i=1}^{n[c]} G[c,i] < n[c]\,T$
Sequential slack for gate g: $\min_{c \in Chains[g]} \left( n[c]\,T - \sum_{i=1}^{n[c]} G[c,i] \right)$

It is also helpful to normalize the sequential slack relative to the length of the chain by dividing by n[c] so that the margin can be thought of as an average margin per logic stage which is therefore independent of chain length. It also means that sequential slacks are reported on the same scale as traditional setup slacks: if the normalized sequential slack is 100ps it means that the average setup slack along that chain will also be 100ps.

Normalized Sequential slack for gate g: $\min_{c \in Chains[g]} \left( T - \frac{1}{n[c]} \sum_{i=1}^{n[c]} G[c,i] \right)$

If the smallest sequential slack in a design is negative, then the amount by which the sequential slack is negative, referred to as the Worst Negative Sequential Slack (WNSS), denotes how far off the circuit is from achieving its desired clock speed, irrespective of the clock arrival times at each register on the chain. We use the term critical chain to describe the logic chain with the WNSS, although the term critical cycle is also used in the literature [Hurst04] to describe the critical chain. We prefer chain to cycle since it emphasizes that the sequence of gates with the WNSS need not form a loop and may equally, and indeed often does, terminate in a design IO.

If sequential slack is negative then traditional setup slacks will never be positive irrespective of how the clocks are implemented. If sequential slack is positive then it will be possible to build a clock network which delivers clocks to each flip-flop such that traditional setup slacks will also be positive, although this network almost certainly will not be a balanced network. In this sense, using CC-Opt, the maximum speed at which a chip can be clocked is limited by the critical chain and not the critical path. This denotes a fundamental new degree of freedom which can be exploited by CC-Opt above and beyond physical optimization to make chips faster or smaller or lower power.

It is however important to note that if sequential slacks are positive this does not imply that setup slacks can also be made positive in practice: firstly, the clock arrival times necessary to achieve positive setup slacks may not be achievable with a feasibly sized clock tree in terms of area and power. Secondly, once clocks are built the sequential slacks themselves will change due to OCV, clock gating, and inter-clock timing, and it is entirely possible that this change will make the sequential slacks negative again.

Sequential Optimization and Useful Skew

Design optimization methods which exploit sequential slacks are usually termed sequential optimization methods. These broadly fall into two camps: retiming approaches, which physically move logic across register
boundaries, and clock scheduling approaches, which intelligently apply delays to the clock tree to improve setup slacks. Retiming was introduced over twenty years ago [Leiserson83, Leiserson91], but automatic retiming approaches are flow-invasive because of their impact on formal verification and testability, and have not gained widespread acceptance. Scheduling approaches have also been around for almost twenty years [Fishburn90], and are more applicable in today's flows.

Academic papers on clock scheduling tend to split the problem into schedule calculation [e.g. Kourtev99b, Ravindran03] and schedule implementation [e.g. Kourtev99, Xi97], although some papers [such as Held03] tackle both halves of the problem separately in the same paper. Commercial EDA tools exploiting clock scheduling typically favor more robust and direct algorithms which incrementally add buffers to an already balanced clock tree to borrow time from positive-slack stages to adjacent negative-slack logic stages without ever pre-computing a desired schedule. Although the term useful skew was originally applied to the two part calculate-then-implement approach [Xi97, Xi99], it is generally used today to mean any CTS approach that results in an unbalanced tree for timing reasons, even if the approach used doesn't ever explicitly calculate a desired clock schedule. In any case, the key feature in common is that timing ultimately drives the clock arrival times at registers and not a set of CTS balancing constraints.

The best way to think about how CTS can be generalized to be driven by timing and not by a set of balancing constraints is to think of each flip-flop as having four constraints on the arrival time of its clock which are a simple re-arrangement of the propagated clocks timing constraints on its D-pin and on its Q-pin: see Figure 12. These four constraints constrain the permissible arrival time to be within a window which depends on the arrival times of flops in the logical fan-in and fan-out.

[Figure 12 diagram labels: clock with period T; clock arrival times P[i-1], P[i], P[i+1]; logic delays G[i-1]min, G[i-1]max, G[i]min, G[i]max]
Setup (D-pin): P[i-1] + G[i-1]max - T < P[i]
Hold (D-pin): P[i-1] + G[i-1]min > P[i]
Setup (Q-pin): P[i] < P[i+1] - G[i]max + T
Hold (Q-pin): P[i] > P[i+1] - G[i]min
Window: max((P[i-1] + G[i-1]max - T), (P[i+1] - G[i]min)) < P[i] < min((P[i+1] - G[i]max + T), (P[i-1] + G[i-1]min))
Figure 12: Timing driven clock arrival time windows

The dependency on the neighboring flops makes it easy to see that these clock arrival time windows are globally intertwined. If a clock network can be built which delivers the clock to all flip-flops within their permissible arrival time windows then setup and hold time constraints will be met. This concept of windows
can also easily be generalized to include OCV derates and CPPR, and also can be extended to apply to other timing endpoints such as clock gates, clock muxes, and clock generator blocks. It can also be applied to internal nodes in the clock tree by a simple intersection of the windows of sub-nodes.

Although the concept of windowing sounds very powerful, it has one key limitation which we alluded to in the previous section: it is isolated from the physical optimization of logic paths. Separating the steps of optimizing the logic and building the clock tree causes two key problems on real world commercial designs. The first is the clock timing gap, which as we have shown requires that the desired schedule be based on a true propagated clocks model of timing in order to properly account for OCV, inter-clock paths and clock gate enable timings. If schedule calculation happens before clocks are built then it cannot be based on a propagated clocks model of timing! The second problem is that clocks are not free. They cost area and power, and any increase in insertion delay causes more setup and hold timing degradation due to OCV on clock paths. While positive sequential slacks imply that a clock network can theoretically be built to make traditional setup slacks positive, this does not mean that such a network would in practice have acceptable area and power, and nor does it mean that setup slacks could be met if OCV derates are being applied to clock paths as well as logic paths. In fact, when OCV derates are considered, it is entirely possible to get into a vicious spiral of increasing insertion delays causing increasingly tight windows, which cause further increases in insertion delay in order to meet the windows, culminating in the situation where it is impossible to make setup slacks positive.

The relaxation of the requirement to balance clocks unleashes significant freedom, but this freedom is commercially useless if it is not exploited wisely and in the context of a full propagated clocks model of timing, including inter-clock paths, OCV and clock gate timing. Key signs of the failure to exploit this freedom properly are clock trees that are too large, insertion delays that are too long, and significant hold timing problems which result in an unreasonable increase in design area once hold fix buffers have been inserted.

CC-Opt differentiates itself from traditional approaches to useful skew by bringing both clock scheduling and physical optimization together under a unified architecture and basing all decisions on a true propagated clocks model of timing. CC-Opt's decisions are always based on a true propagated clocks measure of timing including clock gates, inter-clock paths, OCV derates, and CPPR. CC-Opt treats both clocks and logic as equally important classes of citizen and understands that in practice either can be the limiting factor on achievable chip speed. CC-Opt must somehow enter the propagated clocks world as soon as possible and then globally optimize both the clock and logic delays according to some coherent optimization objective which can be bounded by either logic considerations or clock considerations. CC-Opt will never close timing at the expense of creating an unreasonably large clock network or an unreasonable number of hold time violations.

The close marriage of physical optimization and clock construction, together with knowledge of the side-effects of various decisions in each domain, is the most difficult component of the CC-Opt problem to solve well. But it is also the key enabler for mainstream commercial adoption of CC-Opt.

Clock Concurrent Design Flow

Inserting CC-Opt into the design flow does not require any changes other than to replace physical optimization and CTS with CC-Opt, and skipping the traditional post-CTS optimization step which becomes
redundant: see Figure 13. What is done before clock concurrent optimization and what is done after remain the same. No special data structures or file formats are needed, and complex CTS configuration scripts become redundant as there is no longer any need to specify any balancing constraints. There is no change in timing sign-off or in formal verification or in gate level simulation. There is a potential impact on the magnitude of scan chain hold violations but this can easily be managed by enhancing scan chain stitching algorithms to directly consider hold slacks and not just scan chain wire length. For example, Azuro's Rubix™ CC-Opt tool already includes such a scan chain re-stitching capability.

Traditional Design Flow: RTL Synthesis, Floorplan, Initial Placement, Physical Optimization (Ideal Clocks Timing up to this point), Clock Tree Synthesis, Post-CTS Optimization, Routing, Post-Route Optimization, Signoff Verification, Final Layout (Propagated Clocks Timing from CTS onwards).
Clock Concurrent Design Flow: RTL Synthesis, Floorplan, Initial Placement (Ideal Clocks Timing up to this point), Clock Concurrent Optimization, Routing, Post-Route Optimization, Signoff Verification, Final Layout (Propagated Clocks Timing from Clock Concurrent Optimization onwards).
Figure 13: Clock Concurrent Design Flow
Key Benefits of Clock Concurrent Optimization

This section overviews the key benefits that CC-Opt can bring to the digital chip design community.

1. Increased chip speed or reduced chip area and power
Using CC-Opt the maximum possible clock speed is limited by the critical chain and not the critical path in a design. This is a fundamental new degree of freedom to help close timing above and beyond traditional design flows. At 65nm and below the achievable increases in clock speed can be as much as 20%. If the desired chip speed is already achievable without CC-Opt then this same additional degree of freedom can be exploited to reduce chip area or power.

2. Reduced IR-drop
Since CC-Opt does not balance clocks the peak current drawn by the clock network is significantly reduced. At advanced process nodes IR-drop can have a critical impact on timing sign-off and chip packaging cost. CC-Opt unshackles chip designers from the traditional conflict of interest between tight skew being good for timing but terrible for IR-drop. In fact it is entirely feasible to extend CC-Opt to directly consider peak current (or some reasonable estimate of peak current) as an optimization parameter and specifically skew clocks and adjust logic path delays to ensure that peak current is controlled to be within a pre-specified limit.

3. Increased productivity and accelerated time to market
There are two distinct ways in which CC-Opt increases designer productivity and accelerates time to market. The first is due to a lack of any requirement to configure clock tree balancing constraints or manually set insertion delay offsets for timing critical sink pins. For complex SoC designs composing a complete set of balancing constraints can take more than a month, and much of this effort often needs repeating every time a new netlist is provided by the frontend design team.
The second way in which CC-Opt increases designer productivity and accelerates time to market is due to a significant reduction in the number of iterations between the frontend and backend design teams. Many of these iterations are for the sole purpose of asking the frontend design team to manually move logic across register boundaries by changing the RTL. This manual moving of logic is in essence a form of sequential optimization being performed manually and very inefficiently. Since the entire flow must be re-run to incorporate the RTL changes, there is no guarantee that the satisfactory aspects of the post-placement timing picture will persist. Therefore the requested changes may not have the intended benefit. Most of the need to do this manual logic moving is completely eliminated using CC-Opt since it can simply skew the clocks instead. The time saving from these reduced iterations can be many months.

4. Accelerated migration to 45nm and below
The ability of CC-Opt to perform timing optimization is not degraded by the growing clock timing gap. This is because all decisions it makes are directly based on a propagated clocks model of timing. If architected correctly, the motto for CC-Opt is "if I can time it then I can optimize it". Clock gates, complex clock muxing configurations, OCV derates, CPPR, multi-corner, and multi-mode should all fall out in the wash so long as the timing analysis engine is able to consider them. Without the use of a clock concurrent flow the clock timing gap increasingly cripples timing closure pre to post-CTS, and timing optimization steps downstream from CTS just don't have the horsepower to recover from the damage. Using CC-Opt, migration to advanced process nodes can happen faster and with significantly less pain.
Conclusions

Clocking lies at the heart of commercial chip design flows and is almost as central to the digital chip design community as the transistor itself. But the traditional assumption that if clocks are balanced then propagated clocks timing will mirror ideal clocks timing is fundamentally broken. At 40/45nm the clock timing gap can be as much as 50% of the clock period resulting in an almost complete rewrite of the timing landscape between ideal and propagated clocks timing. Clock gating, clock complexity and on-chip variation are the key industry trends causing this divergence, which we refer to as the clock timing gap.

Clock concurrent optimization gives up on the idea of clock balancing as both restrictive and unhelpful at advanced process nodes. It merges CTS with physical optimization, building clocks and optimizing logic delays at the same time based directly on a propagated clocks model of timing. This unleashes a fundamental new degree of freedom to borrow time across register boundaries resulting in chip speed becoming limited by critical chains not critical paths.

CC-Opt delivers four key types of benefit to the digital chip design community: increased chip speed or reduced chip area and power, reduced IR-drop, increased productivity and accelerated time-to-market, and accelerated migration to 45nm and below.

Under the hood, the key challenge which CC-Opt must tackle is the potential for clock networks to become unreasonably large, and addressing this challenge requires the clock construction algorithms to become very tightly bound with the logic optimization algorithms. The intimate relationship between clock construction and logic optimization is what differentiates CC-Opt from the traditional techniques of sequential optimization and useful skew.
References

[Cong00] J. Cong and S. K. Lim, "Physical planning with retiming," in Digest of Technical Papers of the IEEE/ACM International Conference on Computer-Aided Design, (San Jose, CA), pp. 1-7, November 2000.
[Fishburn90] J. P. Fishburn, "Clock Skew Optimization", IEEE Trans. on Computers, vol. 39, pp. 945-951, 1990.
[Held03] S. Held, B. Korte, J. Maßberg, M. Ringe and J. Vygen, "Clock scheduling and clocktree construction for high performance ASICs", Proceedings of the 2003 IEEE/ACM International Conference on Computer-Aided Design.
[Hurst04] A. P. Hurst, P. Chong, and A. Kuehlmann, "Physical placement driven by sequential timing analysis," Proc. ICCAD '04, pp. 379-386.
[Kourtev99] I. S. Kourtev and E. G. Friedman, "Synthesis of clock tree topologies to implement nonzero clock skew schedule," in IEE Proceedings on Circuits, Devices, Systems, vol. 146, pp. 321-326, December 1999.
[Kourtev99b] I. S. Kourtev and E. G. Friedman, "Clock Skew Scheduling for Improved Reliability via Quadratic Programming", Proc. ICCAD 1999.
[Leiserson83] C. Leiserson and J. Saxe, "Optimizing synchronous systems," Journal of VLSI and Computer Systems, vol. 1, pp. 41-67, January 1983.
[Leiserson91] C. Leiserson and J. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, pp. 5-35, 1991.
[Pan98] P. Pan, A. K. Karandikar, and C. L. Liu, "Optimal clock period clustering for sequential circuits with retiming", in IEEE Trans. on CAD, pp. 489-498, 1998.
[Ravindran03] K. Ravindran, A. Kuehlmann and E. Sentovich, "Multi-domain clock skew scheduling," in Proceedings of the 21st International Conference on Computer Aided Design, ACM, 2003.
[Xi97] J. G. Xi and W. W.-M. Dai, "Useful-skew clock routing with gate sizing for low power design," J. VLSI Signal Process. Syst., vol. 16, no. 2-3, pp. 163-179, 1997.
[Xi99] J. G. Xi and D. Staepelaere, "Using Clock Skew as a Tool to Achieve Optimal Timing," Integrated System Design, April 1999.