Clock Domain Crossing (CDC) Design & Verification Techniques Using SystemVerilog

SNUG-2008
Boston, MA
Voted Best Paper
1st Place
Clifford E. Cummings
Sunburst Design, Inc.
cliffc@sunburst-design.com
ABSTRACT
Important design considerations require that multi-clock designs be carefully constructed at
Clock Domain Crossing (CDC) boundaries. This paper details some of the latest strategies and
best known methods to address passing of one and multiple signals across a CDC boundary.
Included in the paper are techniques related to CDC verification and an interesting 2-deep FIFO
design for passing multiple control signals between clock domains. Although the design methods
described in the paper can be generally implemented using any HDL, the examples are shown
using efficient SystemVerilog techniques.
Table of Contents
1.0 Introduction
2.0 Metastability
2.1 Why is metastability a problem?
3.0 Synchronizers
3.1 Two synchronization scenarios
3.2 Two flip-flop synchronizer
3.3 MTBF - mean time before failure
3.4 Three flip-flop synchronizer
3.5 Synchronizing signals from the sending clock domain
3.6 Synchronizing signals into the receiving clock domain
4.0 Synchronizing fast signals into slow clock domains
4.1 Requirement for reliable signal passing between clock domains
4.1.1 The "three edge" requirement
4.2 Problem - passing a fast CDC pulse
4.3 Problem - sampling a long CDC pulse - but not long enough!
4.4 Open-loop solution - sampling signals with synchronizers
4.5 Closed loop solution - sampling signals with synchronizers
5.0 Passing multiple signals between clock domains
5.1 Multi-bit CDC strategies
5.2 Multi-bit signal consolidation
5.3 Problem - Two simultaneously required control signals
5.3.1 Solution - Consolidation
5.4 Problem - Two phase-shifted sequencing control signals
5.4.1 Solution - consolidation and an extra flip-flop
5.5 Problem - Multiple CDC signals
5.5.1 Solutions for passing multiple CDC signals
5.6 Multi-Cycle Path (MCP) formulation
5.6.1 MCP formulation using a synchronized enable pulse
5.6.2 Closed-loop - MCP formulation with feedback
5.6.3 Closed-loop - MCP formulation with acknowledge feedback
5.7 Synchronizing counters
5.7.1 Binary counters
5.7.2 Gray codes
5.7.3 Gray-to-binary conversion
5.7.4 Binary-to-gray conversion
5.7.5 Gray code counter style #1
5.7.6 Gray code counter style #2
5.8 Additional multi-bit CDC techniques
5.8.1 Multi-bit CDC signal passing using asynchronous FIFOS
5.8.2 Multi-bit CDC signal passing using 1-deep / 2-register FIFO synchronizer
6.0 Naming conventions & design partitioning
6.1 Clock & signal naming conventions
SNUG Boston 2008 2 Clock Domain Crossing (CDC) Design & Verification
Rev 1.0 Techniques Using SystemVerilog
6.1.1 Multi-clock / multi-source modules with no naming convention
6.2 Timing verification for each clock domain
6.3 Clock oriented design partitioning
6.3.1 Timing analysis of clock-partitioned modules
6.4 Partitioning with MCP formulations
7.0 Multi-clock gate-level simulation issues
7.1 Synchronizer gate-level CDC simulation issue
7.2 Strategies to remove X-propagation from gate-level simulations
7.2.1 Simulator command to turn off timing checks
7.2.2 Change flip-flop setup and hold times to 0
7.2.3 Copy and modify new flip-flop models
7.2.4 Synopsys set_annotated_check command
7.3 Additional strategies to remove X-propagation
7.3.1 Use multiple SDF files
7.3.2 Vendor synchronizer cell with supporting SDF generation tools
7.3.3 Vendors with built-in synchronizer support
7.4 Multiple SDF files for gate-level CDC simulations
7.5 Force synchronizer notifier inputs to a fixed value
7.6 ASIC & FPGA library cell synchronizers
7.7 Simulation model with random delay insertion
8.0 Summary & conclusions
8.1 Recommended 1-bit CDC techniques
8.2 Recommended multi-bit CDC techniques
8.3 Recommended naming conventions and design partitioning
8.4 Recommended solutions to multi-clock gate-level CDC simulations
9.0 Acknowledgements
10.0 References
11.0 Author & Contact Information
12.0 Appendix
12.1 Common sync2 model - used by MCP formulation and FIFO synchronizer
12.2 MCP formulation with ready-acknowledge source code
12.3 Multi-bit 1-deep / 2-register FIFO synchronizer source code
Table of Figures
Figure 1 - Asynchronous clocks and synchronization failure
Figure 2 - Metastable bdat1 output propagating invalid data throughout the design
Figure 3 - Two flip-flop synchronizer
Figure 4 - Primary contributing factors to short MTBF values
Figure 5 - Three flip-flop synchronizer used in higher speed designs
Figure 6 - Unregistered signals sent across a CDC boundary
Figure 7 - Registered signals sent across a CDC boundary
Figure 8 - Short CDC signal pulse missed during synchronization
Figure 9 - Marginal CDC pulse that violates the destination setup and hold times
Figure 10 - Lengthened pulse to guarantee that the control signal will be sampled
Figure 11 - Signal with feedback to acknowledge receipt
Figure 12 - Problem - Passing multiple control signals between clock domains
Figure 13 - Solution - Consolidating control signals before passing between clock domains
Figure 14 - Problem - Passing sequential control signals between clock domains
Figure 15 - Solution - Logic to generate proper sequencing signals in the new clock domains
Figure 16 - Problem - Encoded control signals passed between clock domains
Figure 17 - Logic to pass a synchronized enable pulse between clock domains
Figure 18 - Synchronized pulse generation logic
Figure 19 - Synchronized enable pulse generation logic and equivalent symbol
Figure 20 - Multi-Cycle Path (MCP) formulation toggle-pulse generation
Figure 21 - Multi-Cycle Path (MCP) formulation toggle-pulse generation with acknowledge
Figure 22 - Multi-Cycle Path (MCP) formulation toggle-pulse generation with ready-ack
Figure 23 - Binary count values sampled in mid-transition
Figure 24 - 4-bit gray-to-binary conversion equations
Figure 25 - 4-bit gray-to-binary conversion equations - 2nd method
Figure 26 - 4-bit binary-to-gray conversion equations
Figure 27 - Gray code counter style #1 - only one gray code register
Figure 28 - Gray code counter style #2 - binary register and gray code register
Figure 29 - 1-deep / 2-register FIFO synchronizer block diagram
Figure 30 - Design partitioned on clock boundaries
Figure 31 - Partitioned design with MCP formulation
Figure 32 - Synchronizer gate-level CDC simulation waveforms
Figure 33 - Sample ASIC & FPGA synchronizer cell for synthesis and simulation
Table of Examples
Example 1 - Non-working but conceptually correct gray-to-binary SystemVerilog model
Example 2 - Parameterized and correct gray-to-binary SystemVerilog model
Example 3 - Parameterized binary-to-gray SystemVerilog model
Example 5 - Parameterized gray-code counter SystemVerilog model
Example 6 - Parameterized gray-code counter with binary counter
Example 7 - SystemVerilog model for ASIC & FPGA synchronizer cell
Example 8 - sync2.sv code
Example 9 - plsgen.sv code
Example 10 - asend_fsm.sv code
Example 11 - back_fsm.sv code
Example 12 - bmcp_recv.sv code
Example 13 - mcp_blk.sv code
Example 14 - acmp_send.sv code
Example 15 - wctl.sv code
Example 16 - cdc_syncfifo.sv code
Example 17 - Dual Port Ram code - dp_ram2.sv
Example 18 - rctl.sv code
1.0 Introduction
In 2001, I presented my first paper on multi-asynchronous clock design. At that time, I had not
found any good sources to describe the design and synthesis techniques required to do proper
multi-clock design. The 2001 paper was a collection of techniques that I had gathered over years
from actual ASIC and FPGA design experiences. At the conclusion of the 2001 conference
presentation, dozens of engineers and colleagues came forward and shared with me enough
additional interesting ideas and techniques to write a sequel on the topic. Over the past eight
years, I have included instruction on multi-clock design techniques in my Advanced and Expert
Verilog and SystemVerilog training courses, and over that same period of time, more colleagues
and students have shared with me additional interesting multi-clock design techniques. Since the
release of the first multi-clock paper in 2001, the industry has largely identified these types of
design methodologies as Clock Domain Crossing (CDC) techniques. I will use this common
nomenclature in this paper.
This paper includes the best techniques described in the 2001 paper along with an updated
collection of interesting and efficient multi-clock design techniques that have been shared with
me over the past decade. The actual conference presentation slides will be mostly a collection of
the new techniques incorporated since the original 2001 presentation, retaining only enough of
the original slides to introduce the fundamental CDC design concepts and issues.
2.0 Metastability
Metastability refers to signals that do not assume stable 0 or 1 states for some duration of time at
some point during normal operation of a design. In a multi-clock design, metastability cannot be
avoided but the detrimental effects of metastability can be neutralized.
Figure 1 - Asynchronous clocks and synchronization failure
Quoting from Dally and Poulton's book[9] concerning metastability:
"When sampling a changing data signal with a clock ... the order of the events
determines the outcome. The smaller the time difference between the events, the
longer it takes to determine which came first. When two events occur very close
together, the decision process can take longer than the time allotted, and a
synchronization failure occurs."
Figure 1 shows a synchronization failure that occurs when a signal generated in one clock
domain is sampled too close to the rising edge of a clock signal from a second clock domain.
Synchronization failure is caused by an output going metastable and not converging to a legal
stable state by the time the output must be sampled again.
2.1 Why is metastability a problem?

So why is metastability a problem? Figure 2 shows that a metastable output that traverses
additional logic in the receiving clock domain can cause illegal signal values to be propagated
throughout the rest of the design. Since the CDC signal can fluctuate for some period of time, the
input logic in the receiving clock domain might recognize the logic level of the fluctuating signal
to be different values and hence propagate erroneous signals into the receiving clock domain.
Figure 2 - Metastable bdat1 output propagating invalid data throughout the design
Every flip-flop that is used in any design has a specified setup and hold time, or the time in which
the data input is not legally permitted to change before and after a rising clock edge. This time
window is specified as a design parameter precisely to keep a data signal from changing too close
to another synchronizing signal that could cause the output to go metastable.
3.0 Synchronizers
When passing signals between clock domains, an important question to ask is, do I need to
sample every value of a signal that is passed from one clock domain to another?
3.1 Two synchronization scenarios

There are two scenarios that are possible when passing signals across CDC boundaries, and it is
important to determine which scenario applies to your design:
(1) It is permitted to miss samples that are passed between clock domains.
(2) Every signal passed between clock domains must be sampled.
First scenario: sometimes it is not necessary to sample every value, but it is important that the
sampled values are accurate. One example is the set of gray code counters used in a standard
asynchronous FIFO design. In a properly designed asynchronous FIFO model, synchronized gray
code counters do not need to capture every legal value from the opposite clock domain, but it is
critical that sampled values be accurate to recognize when full and empty conditions have
occurred.
Second scenario: a CDC signal must be properly recognized or recognized and acknowledged
before a change is permitted on the CDC signal.
In both of these scenarios, the CDC signals will require some form of synchronization into the
receiving clock domain.
3.2 Two flip-flop synchronizer

Quoting again from Dally and Poulton[9] concerning synchronizers:
"A synchronizer is a device that samples an asynchronous signal and outputs a version
of the signal that has transitions synchronized to a local or sample clock."
The simplest and most common synchronizer used by digital designers is a two-flip-flop
synchronizer as shown in Figure 3.
The first flip-flop samples the asynchronous input signal into the new clock domain and waits for
a full clock cycle to permit any metastability on the stage-1 output signal to decay. The stage-1
signal is then sampled by the same clock into a second stage flip-flop, with the intended goal that
the stage-2 signal is now a stable and valid signal synchronized and ready for distribution within
the new clock domain.
Figure 3 - Two flip-flop synchronizer
It is theoretically possible for the stage-1 signal to still be sufficiently metastable by the time the
signal is clocked into the second stage to cause the stage-2 output signal to also go metastable.
The calculated time between synchronization failures (MTBF) is a function of multiple
variables, including the clock frequencies used to generate the input signal and to clock the
synchronizing flip-flops. One description of the MTBF calculation can be found in Dally and
Poulton[9].
For most synchronization applications, the two flip-flop synchronizer is sufficient to remove all
likely metastability.
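As a minimal sketch of the two flip-flop synchronizer described above, the structure could be modeled as follows. The module name sync2_sketch, the port names, and the asynchronous active-low reset are assumptions made for illustration; Figure 3 does not specify them, and this is not necessarily the same code as the sync2 model listed in the Appendix.

```systemverilog
// Illustrative two flip-flop synchronizer (all names are assumptions).
module sync2_sketch (
  output logic q,      // synchronized output (stage 2)
  input  logic d,      // asynchronous input from the other clock domain
  input  logic clk,    // receiving-domain clock
  input  logic rst_n   // asynchronous active-low reset (assumed for the example)
);
  logic q1;            // stage-1 output; this is the node that may go metastable

  // Stage 1 samples d; stage 2 samples stage 1 one clock cycle later.
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) {q, q1} <= 2'b00;
    else        {q, q1} <= {q1, d};
endmodule
```

No logic is placed between the two stages, so the stage-1 output has the whole clock period available to settle before it is sampled again.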
3.3 MTBF - mean time before failure

For most applications, it is important to run a calculation of the Mean Time Before Failure
(MTBF) for any signal crossing a CDC boundary. Failure in this sense means a signal that is
passed to a synchronizing flip-flop, goes metastable on the first stage synchronizer flip-flop, and
continues to be metastable one cycle later when it is sampled into the second stage synchronizer
flip-flop. Since the signal did not settle to a known value after one clock cycle, the signal could
still be metastable when sampled and passed to the receiving clock domain, causing potential
failures to the corresponding logic.
When calculating MTBF numbers, larger numbers are preferred over smaller numbers. Larger
MTBF numbers indicate longer periods of time between potential failures, while smaller MTBF
numbers indicate that metastability could happen frequently, similarly causing failures within the
design.
Dally and Poulton[9] give a good equation with very thorough analysis of the calculation that can
be performed to calculate the MTBF of a synchronizer circuit. Without repeating the equation
and analysis, it should be pointed out that two of the most important factors that directly impact
the MTBF of a synchronizer circuit are the sample clock frequency (how fast signals are being
sampled into the receiving clock domain) and the data change frequency (how fast the data that
crosses the CDC boundary is changing).
Figure 4 - Primary contributing factors to short MTBF values
From the above partial equation, it can be seen that failures occur more frequently (shorter
MTBF) in higher speed designs, or when the sampled data changes more frequently.
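The inverse relationship stated here can be written compactly as a sketch. The symbols are notation assumed for this illustration only: f_clk is the sample clock frequency, f_data is the data change frequency, and X stands for all remaining device- and circuit-dependent terms, which are not derived here. The full expression should be taken from Dally and Poulton[9].

```latex
\mathrm{MTBF} \;=\; \frac{1}{f_{clk} \cdot f_{data} \cdot X}
```

With X held fixed, doubling either frequency halves the MTBF in this form.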
3.4 Three flip-flop synchronizer
For some very high speed designs, the MTBF of a two-flop synchronizer is too short and a third
flop is added to increase the MTBF to a satisfactory duration of time. Of course, satisfactory is
determined by the architect of the design.
Figure 5 - Three flip-flop synchronizer used in higher speed designs
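One way to cover both the two-flop and three-flop cases is to parameterize the number of stages. The following is a sketch under stated assumptions: the module name syncn_sketch, the STAGES parameter, and the reset style are hypothetical and are not taken from Figure 5.

```systemverilog
// Illustrative N-stage synchronizer; STAGES = 3 gives the three flip-flop form.
// All names and the reset style are assumptions made for this sketch.
module syncn_sketch #(
  parameter int STAGES = 3          // number of flip-flops (assumed >= 2)
) (
  output logic q,                   // synchronized output (last stage)
  input  logic d,                   // asynchronous input
  input  logic clk,                 // receiving-domain clock
  input  logic rst_n                // asynchronous active-low reset
);
  logic [STAGES-1:0] sync_ff;       // sync_ff[0] is the first (sampling) stage

  // Shift the sampled value one stage further on every clock edge.
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) sync_ff <= '0;
    else        sync_ff <= {sync_ff[STAGES-2:0], d};

  assign q = sync_ff[STAGES-1];
endmodule
```

Each added stage gives the signal one more clock period to settle, at the cost of one more cycle of latency.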
3.5 Synchronizing signals from the sending clock domain

A frequently asked question regarding CDC design: Is it a good idea to register signals from
the sending clock domain before passing the signals to the receiving clock domain? Implied in
the question is the assumption that CDC signals will be synchronized into the receiving clock
domain; therefore, they do not require synchronization in the sending clock domain. This
rationalization is incorrect and registering signals in the sending clock domain should generally
be required.
Consider an example where the signals in the sending clock domain are not registered before
being passed into the receiving clock domain, as shown in Figure 6.
Figure 6 - Unregistered signals sent across a CDC boundary
In this example, the combinational output from the sending clock domain could experience
combinational settling at the CDC boundary. This combinational settling effectively increases the
data-change frequency potentially creating small bursts of oscillating data and thereby increasing
the number of edges that could be sampled while changing, with a corresponding increase in the
potential for sampling changing data and generating metastable signals.
3.6 Synchronizing signals into the receiving clock domain
Signals in the sending clock domain should be synchronized before being passed to a CDC
boundary. The synchronization of signals from the sending clock domain reduces the number of
edges that can be sampled in the receiving clock domain, effectively reducing the data-change
frequency in the MTBF equation and hence increasing the time between calculated failures (see
section 3.3 for a description of the impact of data change frequencies on MTBF).
Figure 7 - Registered signals sent across a CDC boundary
In Figure 7, the aclk logic settles and sets up on the adat flip-flop before being passed into the
bclk domain. The adat flip-flop filters out the combinational settling on the flip-flop input (a)
and passes a clean signal to the bclk logic.
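The arrangement in Figure 7 might be modeled along the following lines. This is a sketch, not the paper's code: aclk, bclk and adat follow the names used in the text, while a_comb (standing in for the output of the aclk combinational logic), bdat, bq1 and the reset inputs are hypothetical names introduced for the example.

```systemverilog
// Illustrative sketch of Figure 7: register in the sending domain, then
// synchronize in the receiving domain. a_comb, bdat, bq1 and the resets
// are assumed names, not taken from the figure.
module cdc_registered_sketch (
  output logic bdat,     // synchronized signal, usable by bclk logic
  input  logic a_comb,   // combinational result in the aclk domain (may glitch)
  input  logic aclk,     // sending-domain clock
  input  logic bclk,     // receiving-domain clock
  input  logic arst_n,   // aclk-domain asynchronous active-low reset
  input  logic brst_n    // bclk-domain asynchronous active-low reset
);
  logic adat;            // registered CDC signal; changes at most once per aclk cycle
  logic bq1;             // first synchronizer stage in the bclk domain

  // Sending domain: the flip-flop filters out combinational settling.
  always_ff @(posedge aclk or negedge arst_n)
    if (!arst_n) adat <= 1'b0;
    else         adat <= a_comb;

  // Receiving domain: two flip-flop synchronizer on the clean adat signal.
  always_ff @(posedge bclk or negedge brst_n)
    if (!brst_n) {bdat, bq1} <= 2'b00;
    else         {bdat, bq1} <= {bq1, adat};
endmodule
```

Only adat crosses the boundary, so the bclk domain never samples the glitches present on a_comb.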
4.0 Synchronizing fast signals into slow clock domains
As discussed in section 3.1, if a CDC signal must not be missed when it is passed between clock
domains, it is important to consider the width of the signal and the synchronization technique
used to pass it.
One issue associated with synchronizers is the possibility that a signal from a sending clock
domain might change values twice before it can be sampled, or might be too close to the
sampling edges of a slower clock domain. This possibility must be considered any time signals
are sent from one clock domain to another and a determination must be made whether missed
signals are or are not a problem for the design in question.
When missed samples are not allowed, there are two general approaches to the problem:
(1) An open-loop solution to ensure that signals are captured without acknowledgment.
(2) A closed-loop solution that requires acknowledgement of receipt of the signal that crosses a
CDC boundary.
Both solutions are discussed in this section.
4.1 Requirement for reliable signal passing between clock domains

Synchronizing slower control signals into a faster clock domain is generally not a problem if the
faster clock domain is 1.5X the frequency (or more) of the slower clock domain, since the faster
clock signal will sample the slower CDC signal one or more times. Recognizing that sampling
slower signals into faster clock domains causes fewer potential problems than sampling faster
signals into slower clock domains, a designer might take advantage of this fact by using simple
two flip-flop synchronizers to pass single CDC signals between clock domains.
4.1.1 The "three edge" requirement

Mark Litterick[4] noted that when passing one CDC signal between clock domains through a
two-flip-flop synchronizer, the CDC signal must be wider than 1-1/2 times the period of the
receiving domain clock. Litterick described this requirement as "input data values must
be stable for three destination clock edges."
For exceptionally low source and destination clock frequencies (long clock periods), this
requirement could probably be safely relaxed to 1-1/4 times the cycle time of the receiving clock
domain or less, but the "three edge" guideline is the safest initial design condition, and checking
it with SystemVerilog assertions is easier than dynamically measuring a fractional width of a CDC
signal during simulation.
The "three edge" requirement actually applies to both open-loop and closed-loop solutions, but
implementations of the closed-loop solution automatically ensure that at least three edges are
detected for all CDC signals.
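A possible assertion for the "three edge" guideline is sketched below. It counts both rising and falling edges of the receiving clock, which is one reading of the 1-1/2 period figure; counting only rising edges would be a stricter alternative. The module and signal names are assumptions, and the check is a sampled approximation of the width requirement rather than an exact measurement.

```systemverilog
// Illustrative checker for the "three edge" guideline (names are assumptions).
module cdc_three_edge_chk (
  input logic bclk,    // receiving-domain clock
  input logic brst_n,  // active-low reset; checking is disabled while low
  input logic adat     // CDC signal driven from the sending domain
);
  // adat is sampled on every bclk edge (rising and falling). If the sampled
  // value differs from the previous sample, it must then stay unchanged for
  // the next two edges, so each new value is seen on three consecutive edges.
  property p_three_edges;
    @(posedge bclk or negedge bclk) disable iff (!brst_n)
      !$stable(adat) |=> $stable(adat) [*2];
  endproperty

  a_three_edges: assert property (p_three_edges)
    else $error("adat changed before being stable for three bclk edges");
endmodule
```

Such a checker could be instantiated or bound next to each single-bit synchronizer so that a violation is reported during simulation instead of silently dropping a pulse.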
4.2 Problem - passing a fast CDC pulse
Consider the severely flawed condition where the sending clock domain has a higher frequency
than the receiving clock domain and that a CDC pulse is only one cycle wide in the sending clock
domain. If the CDC signal is only pulsed for one fast-clock cycle, the CDC signal could go high
and low between the rising edges of a slower clock and not be captured into the slower clock
domain as shown in Figure 8.
Figure 8 - Short CDC signal pulse missed during synchronization
4.3 Problem - sampling a long CDC pulse - but not long enough!
Consider the somewhat non-intuitive and flawed condition where the sending clock domain
sends a pulse to the receiving clock domain that is only slightly wider than the period of the
receiving clock. Under most conditions, the signal will be sampled and passed, but there is a
small but real chance that the CDC pulse will change too close to two successive rising clock
edges of the receiving clock domain, thereby violating the setup time on the first clock edge and
the hold time on the second clock edge, and not form the anticipated pulse. This possible failure
is shown in Figure 9.
Figure 9 - Marginal CDC pulse that violates the destination setup and hold times
4.4 Open-loop solution - sampling signals with synchronizers
One potential solution to this problem is to assert CDC signals for a period of time that exceeds
the cycle time of the sampling clock as shown in Figure 10. As discussed in section 4.1.1, the
minimum pulse width is 1.5X the period of the receiving clock. The assumption is that
the CDC signal will be sampled at least once and possibly twice by the receiver clock.
Open-loop sampling can be used when relative clock frequencies are fixed and properly
analyzed.
Advantage: the open-loop solution is the fastest way to pass signals across CDC boundaries that
does not require acknowledgement of the received signal.
Disadvantage: the largest potential problem related to an open-loop solution is that another
engineer might mistake the solution for a general purpose solution, or the design requirements
might change and an engineer might fail to re-analyze the original open-loop solution. This
problem can be minimized by adding a SystemVerilog Assertion to the model to detect if the
input pulse ever fails to meet the "three edge" design requirement.
Figure 10 - Lengthened pulse to guarantee that the control signal will be sampled
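The lengthened pulse of Figure 10 could be produced in the sending domain by a small counter, as in the following sketch. The module name, the apulse input, and the STRETCH parameter are hypothetical; STRETCH has to be chosen from the known clock ratio so that STRETCH aclk periods exceed 1.5 bclk periods.

```systemverilog
// Illustrative open-loop pulse stretcher in the sending (aclk) domain.
// All names are assumptions. STRETCH must be chosen so that STRETCH aclk
// periods exceed 1.5 bclk periods for the clock frequencies in use.
module pulse_stretch_sketch #(
  parameter int STRETCH = 4            // output width in aclk cycles (assumed >= 2)
) (
  output logic adat,                   // lengthened, registered CDC signal
  input  logic apulse,                 // one-cycle pulse in the aclk domain
  input  logic aclk,                   // sending-domain clock
  input  logic arst_n                  // asynchronous active-low reset
);
  logic [$clog2(STRETCH)-1:0] cnt;     // counts down the remaining hold cycles

  always_ff @(posedge aclk or negedge arst_n)
    if (!arst_n) begin
      adat <= 1'b0;
      cnt  <= '0;
    end
    else if (apulse) begin             // start (or restart) a stretched pulse
      adat <= 1'b1;
      cnt  <= STRETCH - 1;
    end
    else if (cnt != 0) cnt  <= cnt - 1'b1;   // keep adat high while counting down
    else               adat <= 1'b0;         // count expired: end the pulse
endmodule
```

The sketch does not enforce a minimum low time between pulses, so the spacing of apulse events still has to be analyzed against the receiving clock.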
4.5 Closed loop solution - sampling signals with synchronizers
A second potential solution to this problem is to send an enabling control signal, synchronize it
into the new clock domain and then pass the synchronized signal back through another
synchronizer to the sending clock domain as an acknowledge signal.
Advantage: synchronizing a feedback signal is a very safe technique to acknowledge that the
first control signal was recognized and sampled into the new clock domain.
Disadvantage: there is potentially considerable delay associated with synchronizing control
signals in both directions before allowing the control signal to change.
Figure 11 - Signal with feedback to acknowledge receipt
5.0 Passing multiple signals between clock domains
When passing multiple signals between clock domains, simple synchronizers do not guarantee
safe delivery of the data.
A frequent mistake made by engineers when working on multi-clock designs is passing multiple
CDC bits required in the same transaction from one clock domain to another and overlooking the
importance of the synchronized sampling of the CDC bits.
The problem is that multiple signals that are synchronized to one clock will experience small
data changing skews that can occasionally be sampled on different rising clock edges in a second
clock domain. Even if we could perfectly control and match the trace lengths of the multiple
signals, differences in rise and fall times as well as process variations across a die could
introduce enough skew to cause sampling failures on otherwise carefully matched traces.
Multi-bit CDC strategies must be employed to avoid skewed sampling of the multi-bit value.
5.1 Multi-bit CDC strategies

To avoid multi-bit CDC skewed sampling scenarios, I have classified multi-bit CDC strategies
into three main categories:
(1) Multi-bit signal consolidation. Where possible, consolidate multiple CDC bits into 1-bit CDC
signals.
(2) Multi-cycle path formulations. Use a synchronized load signal to safely pass multiple CDC
bits.
(3) Pass multiple CDC bits using gray codes.
Each of these strategies is detailed in the remainder of this section.
5.2 Multi-bit signal consolidation

Where possible, consolidate multiple CDC signals into a 1-bit CDC signal. Ask yourself the
question, do I really need multiple bits to control logic across a CDC boundary?
Simply using synchronizers on all of the CDC bits is not always good enough as will be shown in
the following examples.
If the order or alignment of the control signals is significant, care must be taken to correctly pass
the signals into the new clock domain. All of the examples shown in this section are overly
simplistic but they closely mimic situations that often arise in real designs.
5.3 Problem - Two simultaneously required control signals.
In the simple example shown in Figure 12, a register in the receiving clock domain requires both
a load signal and an enable signal in order to load a data value into the register. If both the load
and enable signals are driven on the same sending clock edge, there is a chance that a small skew
between the control signals could cause the two signals to be synchronized into different clock
cycles within the receiving clock domain. Under these conditions, the data would not be loaded
into the register.
Figure 12 - Problem - Passing multiple control signals between clock domains
5.3.1 Solution - Consolidation
The solution to the problem in section 5.3 is simple: consolidate the control signals. As shown in
Figure 13, drive both the load and enable register input signals in the receiving clock domain
from just one load-enable signal. Consolidation will remove the potential of two control signals
arriving shifted in time.
Figure 13 - Solution - Consolidating control signals before passing between clock domains
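A minimal sketch of the consolidated approach is shown below. The module and signal names (ld_en_consolidated, a_lden, b_lden, bdata, bq) are assumptions made for illustration and are not taken from Figure 13. One load-enable signal is synchronized into the receiving clock domain and that single synchronized signal performs both the load and the enable functions.

```systemverilog
// Hypothetical names; illustrative sketch only.
module ld_en_consolidated #(parameter SIZE = 8)
 (output logic [SIZE-1:0] bq,
  input  logic [SIZE-1:0] bdata,
  input  logic            a_lden,   // one load-enable from the sending domain
  input  logic            bclk, brst_n);

  logic bq1_lden, b_lden;

  // Two flip-flop synchronizer for the single consolidated control signal
  always_ff @(posedge bclk or negedge brst_n)
    if (!brst_n) {b_lden, bq1_lden} <= '0;
    else         {b_lden, bq1_lden} <= {bq1_lden, a_lden};

  // Load and enable are the same signal, so they cannot arrive skewed
  always_ff @(posedge bclk or negedge brst_n)
    if      (!brst_n) bq <= '0;
    else if (b_lden)  bq <= bdata;
endmodule
```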
5.4 Problem - Two phase-shifted sequencing control signals.
The diagram in Figure 14 shows two enable signals, aen1 and aen2, that are sequentially driven
from a sending clock domain into the receiving clock domain to control the enable inputs of
pipelined data registers. The problem is that in the first clock domain, the aen1 control signal
might terminate slightly before the aen2 control signal is generated, and the rising edge of the
receiving clock might occur in the slight gap between the aen1 and aen2 control signal pulses,
causing a one-cycle gap to form in the enable control-signal chain in the receiving clock domain.
This would cause the a2 data value to be missed by the second register.
Figure 14 - Problem - Passing sequential control signals between clock domains
5.4.1 Solution - consolidation and an extra flip-flop
The solution to this problem, as shown in Figure 15, is to send only one control signal into the
receiving clock domain and generate the second phase-shifted pipelined enable signal within the
receiving clock domain.
Figure 15 - Solution - Logic to generate proper sequencing signals in the new clock domains
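The idea can be sketched as follows. This is an illustrative model under assumed names (seq_en_gen, ben1, ben2, bq1_en1), not the exact logic of Figure 15. Only aen1 crosses the CDC boundary; the second enable is a one-cycle delayed copy of the synchronized first enable, so no gap can form between them.

```systemverilog
// Hypothetical names; illustrative sketch only.
module seq_en_gen
 (output logic ben1, ben2,
  input  logic aen1, bclk, brst_n);

  logic bq1_en1;

  // bq1_en1 and ben1 form a two flip-flop synchronizer for aen1.
  // ben2 is generated inside the receiving domain, one bclk cycle after ben1.
  always_ff @(posedge bclk or negedge brst_n)
    if (!brst_n) {ben2, ben1, bq1_en1} <= '0;
    else         {ben2, ben1, bq1_en1} <= {ben1, bq1_en1, aen1};
endmodule
```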
5.5 Problem - Multiple CDC signals
The diagram in Figure 16 shows two encoded control signals being passed between clock
domains. If the two encoded signals are slightly skewed when sampled, an erroneous decoded
output could be generated for one clock period in the receiving clock domain.
Figure 16 - Problem - Encoded control signals passed between clock domains
5.5.1 Solutions for passing multiple CDC signals

Multi-Cycle Path (MCP) formulations and FIFO techniques can be used to address problems
related to passing multiple CDC signals. A description and definition of an MCP formulation is
given in section 5.6.
There are at least two Multi-Cycle Path (MCP) formulations that can be used to fix this problem:
(1) Closed-loop - MCP formulation with feedback.
(2) Closed-loop - MCP formulation with acknowledge feedback.
The MCP formulation implementation techniques are described starting in the next section.
There are also at least two FIFO strategies that act as closed loop solutions to this problem:
(1) Asynchronous FIFO implementation.
(2) 2-deep FIFO implementation.
The FIFO implementation techniques are described starting in section 5.8.
Figure 17 - Logic to pass a synchronized enable pulse between clock domains
5.6 Multi-Cycle Path (MCP) formulation

Using an MCP formulation is a common technique for safely passing multiple CDC signals.
An MCP formulation refers to sending unsynchronized data to a receiving clock domain paired
with a synchronized control signal. The data and control signals are sent simultaneously allowing
the data to setup on the inputs of the destination register while the control signal is synchronized
for two receiving clock cycles before it arrives at the load input of the destination register.
Advantages:
(1) The sending clock domain is not required to calculate the appropriate pulse width to send
between clock domains.
(2) The sending clock domain is only required to toggle an enable into the receiving clock
domain to indicate that data has been passed and is ready to be loaded. The enable signal is
not required to return to its initial logic level.
This strategy passes multiple CDC signals without synchronization, and simultaneously passes a
synchronized enable signal to the receiving clock domain. The receiving clock domain is not
allowed to sample the multi-bit CDC signals until the synchronized enable passes through
synchronization and arrives at the receiving register.
This strategy is called a Multi-Cycle Path Formulation[8] due to the fact that the unsynchronized
data word is passed directly to the receiving clock domain and held for multiple receiving clock
cycles, allowing an enable signal to be synchronized and recognized into the receiving clock
domain before permitting the unsynchronized data word to change.
Because the unsynchronized data is passed and held stable for multiple clock cycles before being
sampled, there is no danger that the sampled value will go metastable.
5.6.1 MCP formulation using a synchronized enable pulse

Perhaps the most common method to pass a synchronized enable signal between clock domains
is to employ a toggling enable signal that is passed to a synchronized pulse generator to indicate
that the unsynchronized multi-cycle data word can be captured on the next receiving clock edge
as shown in Figure 18.
A key feature of this synchronized enable pulse generation is that the polarity of the input signal
does not matter. In Figure 18, the d-input is toggled high in cycle 1 and by cycle 4 a high signal
has propagated through the three synchronizing flip-flops. In cycle 3 the outputs of the q2 and q3
flip-flops have a different polarity causing the synchronized enable pulse to form on the output of
the exclusive-or gate in that same cycle. Similarly, the d-input is toggled low in cycle 7 and by
cycle 10 a low signal has propagated through the three synchronizing flip-flops. And again in
cycle 9 the outputs of the q2 and q3 flip-flops have a different polarity causing the synchronized
enable pulse to form on the output of the exclusive-or gate.
Figure 18 - Synchronized pulse generation logic
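The circuit described above could be modeled as shown in the following sketch. The module name sync_pulse_gen and the output name p are assumptions; the q1, q2 and q3 stages and the exclusive-or of q2 and q3 follow the description in the text.

```systemverilog
// Hypothetical module and port names; illustrative sketch only.
module sync_pulse_gen
 (output logic p,      // one-cycle pulse on any change of d
  output logic q,      // d delayed by three clk cycles
  input  logic d, clk, rst_n);

  logic q1, q2, q3;

  // Three flip-flop chain: q1 and q2 synchronize, q3 delays for edge detect
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) {q3, q2, q1} <= '0;
    else        {q3, q2, q1} <= {q2, q1, d};

  assign p = q2 ^ q3;  // q2 and q3 differ for one cycle after each toggle
  assign q = q3;
endmodule
```

Because the pulse is formed from a difference between q2 and q3, a rising toggle and a falling toggle on d each produce exactly one pulse.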

Since all of the MCP formulations described in section 5.0 use the synchronized enable pulse
generation circuit, it was deemed useful to create and use a smaller equivalent symbol to
represent the synchronized enable pulse generation circuit. The equivalent symbol is shown in
Figure 19.
Figure 19 - Synchronized enable pulse generation logic and equivalent symbol
In addition to generating a pulse off of any d-input polarity, the synchronized enable pulse
generation circuit also has a q-output that follows the d-input delayed by three clock cycles. The
q-output is frequently used as a feedback signal and passed as an acknowledge signal through
another synchronized enable pulse generation circuit in the sending clock domain.
Figure 20 shows a typical send-receive toggle-pulse generation design.
Figure 20 - Multi-Cycle Path (MCP) formulation toggle-pulse generation
Using this technique, it is required that the receiving clock domain have logic in place to capture
the data when the pulse is detected, because the pulse will only be valid for one receiving clock
cycle per multi-cycle data word.
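A minimal sketch of an open-loop toggle-pulse MCP formulation is shown below. It is flattened into one two-clock module purely to keep the illustration short; the module name mcp_open_loop and the internal names a_en, b_en, adata and bq1 to bq3 are assumptions, and a real design would normally separate the clock domains into different modules.

```systemverilog
// Hypothetical names; flattened two-clock sketch for illustration only.
module mcp_open_loop #(parameter SIZE = 8)
 (output logic [SIZE-1:0] bdata,
  output logic            b_en,      // one-bclk-cycle load pulse
  input  logic [SIZE-1:0] adatain,
  input  logic            asend, aclk, arst_n, bclk, brst_n);

  logic [SIZE-1:0] adata;
  logic            a_en, bq1, bq2, bq3;

  // Sending domain: hold the data word and toggle the enable on each send
  always_ff @(posedge aclk or negedge arst_n)
    if (!arst_n) begin
      adata <= '0;
      a_en  <= '0;
    end
    else if (asend) begin
      adata <= adatain;
      a_en  <= ~a_en;
    end

  // Receiving domain: synchronize the toggle and form the enable pulse
  always_ff @(posedge bclk or negedge brst_n)
    if (!brst_n) {bq3, bq2, bq1} <= '0;
    else         {bq3, bq2, bq1} <= {bq2, bq1, a_en};

  assign b_en = bq2 ^ bq3;

  // The unsynchronized data word is captured only when the pulse is present
  always_ff @(posedge bclk or negedge brst_n)
    if      (!brst_n) bdata <= '0;
    else if (b_en)    bdata <= adata;
endmodule
```

In this open-loop form the sender must not assert asend again until the previous word has been captured; nothing in the sketch enforces that.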
5.6.2 Closed-loop - MCP formulation with feedback
An important technique when using an MCP formulation is to pass the enable signal back to the
sending clock domain as an acknowledge signal as shown in Figure 21.
Figure 21 - Multi-Cycle Path (MCP) formulation toggle-pulse generation with acknowledge
For the example in Figure 21, the acknowledge feedback signal (b_ack) generates an
acknowledge pulse (aack) that is used as an input to a small READY-BUSY, 1-state FSM block
that generates a ready signal (aready) to indicate that it is now safe to change the data input
(adatain) value again. Once the aready signal goes high, the sender is free to send new data
(adatain) and the accompanying asend control signal.
This is an automatic feedback path that assumes that the receiving clock domain will always be
ready for the next data word synchronized through an MCP formulation.
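One possible form of the READY-BUSY block is sketched below. The exact FSM of Figure 21 is not reproduced here; this version assumes that the block resets to READY, that asend is only asserted while aready is high, and that aack is a one-cycle pulse in the sending clock domain. The module name ready_busy_fsm is also an assumption.

```systemverilog
// Hypothetical sketch of a READY-BUSY block with a single state bit.
module ready_busy_fsm
 (output logic aready,
  input  logic asend, aack, aclk, arst_n);

  // aready = 1 is READY, aready = 0 is BUSY
  always_ff @(posedge aclk or negedge arst_n)
    if      (!arst_n) aready <= 1'b1;  // assumed reset state: READY
    else if (asend)   aready <= 1'b0;  // word launched, wait for acknowledge
    else if (aack)    aready <= 1'b1;  // acknowledge pulse received
endmodule
```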
5.6.3 Closed-loop - MCP formulation with acknowledge feedback
A fully responsive variation of the technique described in section 5.6.2 is to pass the enable
signal back to the sending clock domain as an acknowledge signal only after the receiving clock
domain acknowledges receipt of the data with a bload pulse, as shown in Figure 22.
Figure 22 - Multi-Cycle Path (MCP) formulation toggle-pulse generation with ready-ack
For the example in Figure 22, the receiving clock domain has a small WAIT-READY, 1-state FSM
that sends a valid signal (bvalid) to the receiving logic when data is valid on the input to the
data register. The data is not actually loaded until the receiving logic acknowledges that the data
should be loaded by asserting the bload signal. There is no feedback to the sending clock
domain until the data has been loaded, then the b_ack signal is sent back the same as the MCP
formulation with automatic feedback.
This feedback path requires action on the part of the receiving clock domain before data is
captured and feedback is sent.
5.7 Synchronizing counters
As mentioned earlier, when passing multiple signals between clock domains, an important
question to ask is, do I need to sample every value of a signal that is passed from one clock
domain to another? With counters, the answer is frequently, no!
Reference [1] details FIFO design techniques where gray code counters are sampled between
clock domains and intermediate gray count values are often missed. For this FIFO design, the
greater consideration is to make sure that the counters cannot overrun their boundaries, which
could cause missed full and empty flag detection. Even though the sampled gray count values
between clock domains are often missed, the design is robust and all important gray count values
are appropriately sampled. See [1] for details.
Since a valid design might be allowed to skip some count value samples, can any counter be used
to pass count values across a CDC boundary? The answer is no.
5.7.1 Binary counters

One characteristic of binary counters is that half of all sequential binary incrementing operations
require that two or more counter bits must change. Trying to synchronize a binary counter across
a CDC boundary is the same as trying to synchronize multiple CDC signals into a new clock
domain. If a simple 4-bit binary counter changes from address 7 (binary 0111) to address 8
(binary 1000), all four counter bits will change at the same time. If a synchronizing clock edge
comes in the middle of this transition, it is possible that any 4-bit binary pattern could be sampled
and synchronized into the new clock domain as shown in Figure 23.
Binary count 07 -> 08 (0 1 1 1 -> 1 0 0 0): possible sampled values

Value  Binary    Possible binary transition
00     0 0 0 0   0 1 1 1 -> 0 0 0 0 (07->00)
01     0 0 0 1   0 1 1 1 -> 0 0 0 1 (07->01)
02     0 0 1 0   0 1 1 1 -> 0 0 1 0 (07->02)
03     0 0 1 1   0 1 1 1 -> 0 0 1 1 (07->03)
04     0 1 0 0   0 1 1 1 -> 0 1 0 0 (07->04)
05     0 1 0 1   0 1 1 1 -> 0 1 0 1 (07->05)
06     0 1 1 0   0 1 1 1 -> 0 1 1 0 (07->06)
07     0 1 1 1   0 1 1 1 -> 0 1 1 1 (07->07)
08     1 0 0 0   0 1 1 1 -> 1 0 0 0 (07->08)
09     1 0 0 1   0 1 1 1 -> 1 0 0 1 (07->09)
10     1 0 1 0   0 1 1 1 -> 1 0 1 0 (07->10)
11     1 0 1 1   0 1 1 1 -> 1 0 1 1 (07->11)
12     1 1 0 0   0 1 1 1 -> 1 1 0 0 (07->12)
13     1 1 0 1   0 1 1 1 -> 1 1 0 1 (07->13)
14     1 1 1 0   0 1 1 1 -> 1 1 1 0 (07->14)
15     1 1 1 1   0 1 1 1 -> 1 1 1 1 (07->15)
Figure 23 - Binary count values sampled in mid-transition
In a FIFO design, the new synchronized binary value might trigger a false full or empty flag, or
even worse, it might not trigger a real full or empty flag causing data to be lost due to FIFO
overflow or causing invalid data to be read from the FIFO due to an attempt to read data when
the FIFO is really empty.
5.7.2 Gray codes

Gray codes are named after Frank Gray[4] and the safest counters that can be used in multi-clock
designs are Gray code counters. Gray codes only allow one bit to change for each clock
transition, eliminating the problem associated with trying to synchronize multiple changing CDC
bits across a clock domain.
Standard gray codes have very nice translation properties to convert gray-to-binary and back
again. Using these conversions, it is simple to design efficient gray code counters.
5.7.3 Gray-to-binary conversion

To convert a gray-code value to an equivalent binary-code value, using an n-bit gray code value
as an example, binary bit 0 is equal to gray code bit 0 exclusive-ored with all other gray code
bits from 1 to n-1. Binary bit 1 is equal to gray code bit 1 exclusive-ored with all other gray
code bits from 2 to n-1, etc. The most significant binary bit is just equal to the most
significant gray code bit.
The equations for a sample 4-bit gray-to-binary conversion are shown in Figure 24.
bin[0] = gray[3] ^ gray[2] ^ gray[1] ^ gray[0];
bin[1] = gray[3] ^ gray[2] ^ gray[1];
bin[2] = gray[3] ^ gray[2];
bin[3] = gray[3];
Figure 24 - 4-bit gray-to-binary conversion equations
The easiest way to code a gray-to-binary converter is to code a for-loop and do an exclusive-or
reduction on a gray code vector with variable index range, where each time through the loop the
LSB of the index range increases until we are left with a simple assignment of bin[MSB] =
^gray[MSB:MSB] (just the 1-bit MSB of the gray code vector), as shown in Example 1.
module gray2bin_bad #(parameter SIZE = 4)
(output logic [SIZE-1:0] bin,
input logic [SIZE-1:0] gray);
// Syntax Error - variable index range

always_comb
for (int i=0; i<SIZE; i++)
bin[i] = ^(gray[SIZE-1:i]);
endmodule
Example 1 - Non-working but conceptually correct gray-to-binary SystemVerilog model
Unfortunately, Verilog and SystemVerilog do not permit part selects using a variable index range
so the code in Example 1, although conceptually correct, will not compile.
To address this issue, remember that an exclusive-or gate is really a programmable inverter. If
one input is tied high, the other input is inverted and passed to the output. Similarly, if one input
is tied low, the other input is passed to the output without inversion (no change from input to
output).
Taking advantage of the fact that any added exclusive-or operation that involves a 0-input does
not change the outcome of the operation, the way to approach a gray-to-binary conversion is to
exclusive-or the significant gray-code bits with padded 0's as shown in Figure 25.
bin[0] = gray[3] ^ gray[2] ^ gray[1] ^ gray[0] ; // gray>>0
bin[1] = 1'b0 ^ gray[3] ^ gray[2] ^ gray[1] ; // gray>>1
bin[2] = 1'b0 ^ 1'b0 ^ gray[3] ^ gray[2] ; // gray>>2
bin[3] = 1'b0 ^ 1'b0 ^ 1'b0 ^ gray[3] ; // gray>>3
Figure 25 - 4-bit gray-to-binary conversion equations - 2nd method
The corresponding parameterized SystemVerilog model for this simplified algorithm is shown in
Example 2. This example is syntactically correct, will compile and does work.
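module gray2bin #(parameter SIZE = 4)
 (output logic [SIZE-1:0] bin,
  input  logic [SIZE-1:0] gray);
  // Each binary bit is the XOR reduction of the gray vector shifted right by i
  always_comb
    for (int i=0; i<SIZE; i++)
      bin[i] = ^(gray>>i);
endmodule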
Example 2 - Parameterized and correct gray-to-binary SystemVerilog model
What happens to all of the extra exclusive-or operations with inputs tied to 0? Synthesis tools
recognize that exclusive-or gates with a constant-0 on one input can be optimized away to infer a
very efficient implementation of the design.
5.7.4 Binary-to-gray conversion

To convert a binary value to an equivalent gray-code value, using an n-bit binary value as an
example, gray-code bit 0 is equal to the exclusive-or of binary bits 0 and 1. Gray-code bit 1 is
equal to the exclusive-or of binary bits 1 and 2, etc. The most significant gray-code bit is just
equal to the most significant binary bit.
The equations for a sample 4-bit binary-to-gray conversion are shown in Figure 26.
gray[0] = bin[0] ^ bin[1];
gray[1] = bin[1] ^ bin[2];
gray[2] = bin[2] ^ bin[3];
gray[3] = bin[3] ^ 1'b0 ; // same as gray[3] = bin[3];
Figure 26 - 4-bit binary-to-gray conversion equations
The easiest way to code a binary-to-gray converter is to code a simple continuous assignment that
performs a bit-wise exclusive-or operation between the binary vector and a right-shifted version
of the same binary vector as shown in Example 3. This example is syntactically correct, will
compile and does work.
module bin2gray #(parameter SIZE = 4)
(output logic [SIZE-1:0] gray,
input logic [SIZE-1:0] bin);
assign gray = (bin>>1) ^ bin;

endmodule
Example 3 - Parameterized binary-to-gray SystemVerilog model
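As a sanity check on the two conversions, the following self-checking sketch restates them as functions and verifies two properties for every 4-bit value: converting binary-to-gray and back returns the original value, and consecutive count values (including the wrap from 15 to 0) differ in exactly one gray-code bit. The module name gray_conv_chk and the function names b2g and g2b are assumptions introduced for this illustration.

```systemverilog
// Hypothetical self-checking module; names are illustrative assumptions.
module gray_conv_chk;
  localparam int SIZE = 4;
  typedef logic [SIZE-1:0] vec_t;

  // Binary-to-gray: XOR the value with itself shifted right by one
  function automatic vec_t b2g (input vec_t bin);
    return (bin >> 1) ^ bin;
  endfunction

  // Gray-to-binary: XOR reduction of the gray value shifted right by i
  function automatic vec_t g2b (input vec_t gray);
    vec_t bin;
    for (int i = 0; i < SIZE; i++) bin[i] = ^(gray >> i);
    return bin;
  endfunction

  initial begin
    for (int i = 0; i < 2**SIZE; i++) begin
      vec_t cur, nxt;
      cur = vec_t'(i);
      nxt = vec_t'(i + 1);   // wraps from 15 back to 0
      assert (g2b(b2g(cur)) == cur)
        else $error("round trip failed for %0d", i);
      assert ($countones(b2g(cur) ^ b2g(nxt)) == 1)
        else $error("more than one gray bit changed after %0d", i);
    end
    $display("gray conversion checks done");
  end
endmodule
```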
5.7.5 Gray code counter style #1

We can build a gray code counter by using the conversions that were shown in sections 5.7.3 and
5.7.4. For any gray code counter, it is important to remember that the gray-output must be
registered to eliminate any combinational settling in the design.
The SystemVerilog code for gray-code counter style #1 incorporates a gray-to-binary converter, a
binary-to-gray converter and increments the binary value between conversions as shown in
Figure 27.
Figure 27 - Gray code counter style #1 - only one gray code register
The corresponding parameterized SystemVerilog model for the gray-code counter style #1 is
shown in Example 4.
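module graycntr #(parameter SIZE = 5)
 (output logic [SIZE-1:0] gray,
  input  logic clk, inc, rst_n);
  logic [SIZE-1:0] gnext, bnext, bin;
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) gray <= '0;
    else        gray <= gnext;
  always_comb begin
    // gray-to-binary conversion of the registered gray value
    for (int i=0; i<SIZE; i++)
      bin[i] = ^(gray>>i);
    bnext = bin + inc;            // increment the binary value
    gnext = (bnext>>1) ^ bnext;   // binary-to-gray conversion
  end
endmodule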
Example 4 - Parameterized gray-code counter SystemVerilog model
5.7.6 Gray code counter style #2

We can build the second style of gray code counter by using just the binary-to-gray conversions
that were shown in section 5.7.4. This gray code counter actually uses both a binary count
register and a gray code count register.
Figure 28 - Gray code counter style #2 - binary register and gray code register
The SystemVerilog code for gray-code counter style #2 incorporates a binary counter to eliminate
the need for the gray-to-binary conversion, and uses the next binary count value to do the binary-
to-gray conversion that is then registered into the gray code register. This style uses twice as
many flip-flops but a shorter combinational logic path to generate the next gray code value,
which makes this implementation faster than gray code counter style #1. The block diagram for
gray code counter style #2 is shown in Figure 28.
The corresponding parameterized SystemVerilog model for the gray-code counter style #2 is
shown in Example 5.
module graycntr #(parameter SIZE = 5)
 (output logic [SIZE-1:0] gray,
  input  logic clk, full, inc, rst_n);
  logic [SIZE-1:0] gnext, bnext, bin;
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) {bin, gray} <= '0;
    else        {bin, gray} <= {bnext, gnext};
  assign bnext = !full ? bin + inc : bin;   // hold the count when full
  assign gnext = (bnext>>1) ^ bnext;        // binary-to-gray conversion
endmodule
Example 5 - Parameterized gray-code counter with binary counter
5.8 Additional multi-bit CDC techniques

In addition to the MCP formulation techniques described in earlier sections, I have found a
number of engineers that use standard FIFOs to pass data and control signals between clock
domains.
There are at least two interesting FIFO implementation strategies that can be used to address
multi-bit CDC signal integrity:
(1) Asynchronous FIFO implementations.
(2) 2-deep FIFO implementation.
5.8.1 Multi-bit CDC signal passing using asynchronous FIFOs

Passing multiple bits, whether data bits or control bits, can be done through an asynchronous
FIFO. An asynchronous FIFO is a shared memory or register buffer where data is inserted from
the write clock domain and data is removed from the read clock domain. Since both sender and
receiver operate within their own respective clock domains, using a dual-port buffer, such as a
FIFO, is a safe way to pass multi-bit values between clock domains.
A standard asynchronous FIFO device allows multiple data or control words to be inserted as
long as the FIFO is not full, and the receiver can then extract multiple data or control words
when convenient as long as the FIFO is not empty.
Most of the hard work in a FIFO design is done through the synchronization of gray code
counters and a proven FIFO design technique is described in [1].
5.8.2 Multi-bit CDC signal passing using 1-deep / 2-register FIFO synchronizer
Another interesting variation on passing multiple control and data bits across CDC boundaries
involves the use of a 1-deep two register FIFO as shown in Figure 29.
Figure 29 - 1-deep / 2-register FIFO synchronizer block diagram
This 1-deep two register FIFO has a number of interesting characteristics. Since the FIFO is built
using only two registers or a 2-deep dual port RAM, the gray code counters used to detect full
and empty are simple toggle flip-flops, which are really nothing more than 1-bit binary counters
(remember, the MSB of a standard gray code is the same as the MSB of a binary code).
On reset, both pointers are cleared and the FIFO is empty and hence the FIFO is not full. We use
the not-full condition (the inverted full flag) to indicate that the FIFO is ready to receive a data or control word
(wrdy is high). After a data or control word is put into the FIFO (using wput), the wptr toggles
and the FIFO becomes full, or in other words, the wrdy signal goes low, which also disables the
ability to toggle the wptr and therefore also disables the ability to put another word into the 2-
register FIFO until the first word is removed from the FIFO by the receiving clock-domain logic.
What is especially interesting about this design is that the wptr is now pointing to the second
location in the 2-register FIFO, so when the FIFO does again become ready (when wrdy is high),
the wptr is already pointing to the next location to write.
The same concept is replicated on the receiving side of the FIFO. When a data or control word is
written into the FIFO, the FIFO becomes not empty. We use the not-empty condition (the inverted
empty flag) to indicate that the FIFO has a data or control word that is ready to be received (rrdy is high).
By using two registers to store the multi-bit CDC values, we are able to remove one clock cycle
from the send MCP formulation and another cycle from the acknowledge feedback path.
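The behavior described above can be sketched as follows. This is an illustrative model and not the implementation behind Figure 29: the module name cdc_fifo2, the rget input, the synchronizer names and the flattened two-clock structure are all assumptions. The signals wput, wptr, wrdy, rptr and rrdy follow the names used in the text.

```systemverilog
// Hypothetical flattened sketch of a 1-deep / 2-register FIFO synchronizer.
module cdc_fifo2 #(parameter WIDTH = 8)
 (output logic             wrdy,
  input  logic [WIDTH-1:0] wdata,
  input  logic             wput, wclk, wrst_n,
  output logic             rrdy,
  output logic [WIDTH-1:0] rdata,
  input  logic             rget, rclk, rrst_n);

  logic [WIDTH-1:0] mem [2];     // the two storage registers
  logic wptr, rptr;              // 1-bit toggle pointers
  logic wq1_rptr, wq2_rptr;      // rptr synchronized into wclk
  logic rq1_wptr, rq2_wptr;      // wptr synchronized into rclk
  logic we, re;

  // Write side: full when the pointers differ, so wrdy is the inverse
  assign wrdy = ~(wptr ^ wq2_rptr);
  assign we   = wput & wrdy;
  always_ff @(posedge wclk or negedge wrst_n)
    if (!wrst_n) {wptr, wq2_rptr, wq1_rptr} <= '0;
    else         {wptr, wq2_rptr, wq1_rptr} <= {wptr ^ we, wq1_rptr, rptr};
  always_ff @(posedge wclk)
    if (we) mem[wptr] <= wdata;

  // Read side: not empty when the pointers differ
  assign rrdy  = rptr ^ rq2_wptr;
  assign re    = rget & rrdy;
  assign rdata = mem[rptr];
  always_ff @(posedge rclk or negedge rrst_n)
    if (!rrst_n) {rptr, rq2_wptr, rq1_wptr} <= '0;
    else         {rptr, rq2_wptr, rq1_wptr} <= {rptr ^ re, rq1_wptr, wptr};
endmodule
```

After each write the wptr already selects the other register, so the next word can be stored as soon as wrdy returns high.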
6.0 Naming conventions & design partitioning
Naming conventions help to ensure good team communication and also facilitate the use of
scripting languages to gather and group all signals in a design that are associated with a particular
clock. Good design partitioning can significantly reduce the effort to synthesize and verify the
timing of a multi-clock design. Recommended naming conventions and design partitioning are
discussed in this section.
There are two approaches to address potential CDC problems: (1) verify that the design meets
qualified CDC rules, (2) avoid the problem. Both approaches are valuable and should be used to
ensure an error-free design.
The first approach, verification of CDC design rules, typically requires the use of special tools to
check the design for possible CDC violations. When I wrote my first paper on multi-clock design
in 2001, I was unaware of any tool on the market that performed checking of CDC rules. Today
there are a number of companies that provide such tools (see [11] for a list of companies and
tools in the CDC verification space).
The second approach, avoid the problem, can be done by employing a few good coding
guidelines as outlined below.
6.1 Clock & signal naming conventions

A number of useful clock and signal naming conventions have been used by various design
teams.
Guideline: Use a clock naming convention to identify the clock source of every signal in a
design.
Reason: A naming convention helps all team members to identify the clock domain for every
signal in a design and also makes grouping of signals for timing analysis easier to do using
regular expression "wild-carding" from within a synthesis script.
One proven naming convention requires that a leading prefix character be used to identify the
various asynchronous clock domains. Examples include: uClk for the microprocessor clock,
vClk for the video clock and dClk for the display clock.
Each signal is then synchronized to one of the clock domains in the design and each signal-name
is labeled with a prefix character to identify the clock domain used to generate that signal. For
example, any signal that is generated by the uClk is labeled with a u-prefix in the signal name,
such as uaddr, udata, uwrite, etc. Any signal that is generated by the vClk is similarly
labeled with a v-prefix in the signal name, such as vdata, vhsync, vframe, etc. The same
signal naming convention is used for all signals generated by any of the other clocks in the
design.
Using this technique, any engineer on the design team can easily identify the clock-domain
source of any signal in the design and either use the signals directly or pass the signals through
proper synchronization so that the signals can be used within a new clock domain.
The exact naming convention is not important, but it is vital that every engineer on the project
agrees to adhere to the naming convention chosen by the team. A naming convention will
significantly contribute to the productivity of the design team.
6.1.1 Multi-clock / multi-source modules with no naming convention

If your team is not using any particular clock-oriented signal naming convention and if
modules are allowed to have multiple clock inputs, there is always the danger that the CDC
analysis tool might not be set up correctly, making it easy to miss bad CDC design practices.
Even if your team has access to good CDC analysis tools, I strongly recommend that you take a
few simple steps to make analysis and recognition of potential CDC design problems easier to
identify and debug.
6.2 Timing verification for each clock domain

To verify the timing of any design, one must verify that the timing is met for each clock domain
in a design. Although tools have improved over the past decade to help automate the analysis and
verification of signals in separate clock domains, it is still a good practice to approach multi-
clock design using good partitioning and naming conventions.
By partitioning a design to permit only one clock per module, static timing analysis becomes a
significantly easier task for each domain in the design.
6.3 Clock oriented design partitioning

Some of the simplest and best design partitioning methodologies are implemented using design
partitioning at clock boundaries.
Guideline: Only allow one clock per module[9].
Reason: Static timing analysis and synthesis script creation are more easily accomplished on
single-clock modules or groups of single-clock modules.
Exception: The top-level module that connects together the signals from all of the different
clock domains will naturally have all of the clocks as inputs to this module. Minimize your
multi-clock verification effort and only allow the top-module to have multiple clock inputs.
Guideline: Partition the design blocks into one-clock modules.
Reason: The timing of completely synchronous sub-blocks can be easily verified using STA
(Static Timing Analysis) tools, and partitioning the design blocks into multiple one-clock
domain sub-blocks turns a large complicated timing analysis task into multiple, completely
synchronous, one-clock designs.
Guideline: Create synchronizer modules to pass signals from one clock domain into another
clock domain and only allow one clock per synchronizer module.
Reason: It is given that any signal passing from one clock domain to another clock domain will
eventually experience setup and hold time problems. Isolating the CDC boundary logic can
significantly reduce the design and verification effort of multi-clock designs.
Under most conditions, the synchronizer modules will be the only blocks in the design that will
experience intentional setup and hold time violations. When passing signals between
asynchronous clock domains, it is given that timing violations will occur; that is the whole
reason why synchronizers must be added to a design.
Figure 30 - Design partitioned on clock boundaries
Consider an example design with three clock domains, labeled aClk, bClk and cClk and shown
in Figure 30. In this design, all of the aClk design blocks have been grouped into a single aClk
Logic block. All of the bClk design blocks have been grouped into a single bClk Logic block
and similarly we have created a cClk Logic block. Any signal that originates in an asynchronous
clock domain passes through a synchronizer module before it is permitted to drive an input of
another Logic block.
6.3.1 Timing analysis of clock-partitioned modules

Using the clock-oriented design partitioning strategy, all of the inputs and outputs of each design
block are completely synchronous to just one clock. This is the easiest type of design to verify
using static timing analysis (STA) tools because there are no false paths in the design.
Group together all design modules that are clocked within each clock domain. One group should
be formed for each clock domain in the design. These groups will be timing verified as if each
were a separate, completely synchronous design. For each clock domain, we have a single design
block and we can easily perform worst-case (max-time / setup time checking) timing analysis,
and best-case (min-time / hold time checking) timing analysis.
Also using this clock-oriented partitioning strategy, each of the CDC boundaries has been
isolated using a synchronizer module. Each synchronizer module only includes either
synchronizer cells provided by the ASIC or FPGA vendor (preferred), or is built using flip-flops
connected as pairs to form synchronizer-equivalent cells.
If synchronizer cells are available from the ASIC or FPGA vendor, and instantiated into the
design, there will be no need to verify setup and hold times on these modules, since the vendor
should have already created a cell layout that does not violate setup or hold times between the
flip-flop stages.
If the synchronizers are synthesized from RTL code, it is most important to perform best-case
timing analysis to make sure that the flip-flops are not placed too close together in such a way
that the output from the first stage might change too quickly to satisfy the hold time requirement
of the input of the second stage. A colleague recently pointed out that worst-case timing analysis
should also be performed just in case the layout tools happen to place the two synchronizer flip-
flops far apart on the ASIC or FPGA die. I agree with this updated recommendation.
Because of the partitioning of the separate synchronizers, gate-level simulations can more easily
be configured to ignore setup and hold time violations on the first stage of each synchronizer.
The static timing analysis of the RTL synchronizers requires simple set_false_path commands to
remove the inputs from the STA. We know that there are timing problems at the inputs of the
synchronizers; that is why the synchronizers are used.
By partitioning design and synchronizer blocks to permit only one clock per module, static
timing analysis becomes a significantly easier task to perform. Synthesis script commands used
to address multiple clock domain issues now become a matter of grouping, identifying false paths
and performing min-max timing analysis.
6.4 Partitioning with MCP formulations
Partitioning a design at clock boundaries into separate design blocks and synchronizer blocks
works well most of the time, but if multiple signals need to be passed between clock domains
using an MCP formulation, then some of the signals that are passed to a design block may come
from a different clock domain as shown in Figure 31.
Figure 31 - Partitioned design with MCP formulation

Design blocks with asynchronous inputs can still be easily timed if a clock based naming
convention has been used for the signals in the design. Before performing STA on the design
block in question, simply exclude the asynchronous inputs from the analysis.
In general, only the inputs to the synchronizers and MCP formulation data paths require
"set_false_path" commands. If a clock-prefix naming scheme is used, then wild-cards can be
used to easily identify all asynchronous inputs. In Figure 31, to exclude the adata bus from STA
within the bClk Logic block, first execute the command:
set_false_path -from { a* }
This command should be sufficient to eliminate all asynchronous inputs from bClk STA.
7.0 Multi-clock gate-level simulation issues
Digital simulation models typically generate X's when synchronizers recognize setup and hold
time violations on CDC signals. This can frequently cause gate-level simulations to fail. What
techniques exist to address this problem?
As mentioned in section 6.3.1, signals crossing clock boundaries through a synchronizer will
experience setup and hold violations. That is why synchronizers are added to a design, to filter
out the metastability effects of a signal that changes too close to the rising edge of a receiving
clock domain clock signal.
7.1 Synchronizer gate-level CDC simulation issue

When doing gate-level simulations on a multi-clock design, the ASIC library models of flip-flops
are modeled with setup and hold time expressions to match the timing specifications of the actual
flip-flops. ASIC libraries typically model flip-flops to drive X's (unknowns) on the flip-flop
outputs when a timing violation occurs. When simulating gate-level synchronizers, setup and
hold time violations might cause ASIC libraries to issue setup and hold time error messages and
the offending signals are frequently driven to an X value. These X-values propagate to the rest of
the design, causing problems when trying to verify the functionality of the entire gate-level design.
Figure 32 - Synchronizer gate-level CDC simulation waveforms

7.2 Strategies to remove X-propagation from gate-level simulations
There are a number of strategies that have been shared with me by esteemed colleagues over the
past 10 years to address issues related to unwanted propagation of X's every time a signal violates
a setup or hold time on the first stage of the synchronizer.
Since X-propagation happens when a setup or hold time is violated, almost all of the approaches
to address this issue involve changing the setup and hold times to 0 so that there can be no setup
or hold time violation, and hence, no X-propagation.
Some of the approaches are bad and others are good. Below are some of the strategies that have
been considered to address the X-propagation problem.
7.2.1 Simulator command to turn off timing checks

Most SystemVerilog simulators have a command option to ignore all timing checks, but this
would also ignore the desired timing checks for the rest of the design.
7.2.2 Change flip-flop setup and hold times to 0

It is possible to change the setup and hold time setting to zero for any ASIC library flip-flop that
is used in a synchronizer, but that would cause all setup and hold time checks of all instances of
that same type of flip-flop to be set to zero, including the flip-flops that you might want to use to
test the rest of the design.
7.2.3 Copy and modify new flip-flop models

You could make copies of flip-flops from an ASIC library and store them into a new
SystemVerilog library with different names, set to zero all setup and hold times, then modify the
design gate-level netlist, replacing all first stage synchronizer ASIC library flip-flops with the
modified library flip-flops without timing checks, but this could be an error prone and tedious
process that might have to be repeated each time a new netlist is generated or it might require the
creation of a makefile and scripts to automatically make the modifications each time a new netlist
is generated.
7.2.4 Synopsys set_annotated_check command

A useful approach to this problem suggested by Bhatnagar[5] is to use Synopsys commands to
modify the SDF backannotation of the setup and hold time on just the first stage flip-flop cells in
the design. Bhatnagar points out that the SDF file is instance based and therefore targeting the
setup and hold times for the offending cells is more easily accomplished. Bhatnagar notes:
Instead of manually removing the setup and hold-time constructs from the SDF file, a
better way is to zero out the setup and hold-times in the SDF file, only for the
violating flops, i.e., replace the existing setup and hold-time numbers with zero's.
Bhatnagar further points out that setup and hold times of zero mean that there can be no timing
violation, and therefore no unknowns are propagated to the rest of the design. The following dc_shell-t
command, given by Bhatnagar, is used to make setup and hold times zero:
set_annotated_check 0 -setup -hold -from REG1/CLK -to REG1/D
Using a creative naming convention for the output of the first stage flip-flop of a synchronizer
might make wild card expressions possible to easily backannotate all first stage flip-flop SDF
setup and hold time values to zero using very few dc_shell-t commands.
This technique works if the design is being done with Synopsys DesignCompiler tools, but what
about non-Synopsys flows?
7.3 Additional strategies to remove X-propagation

All of the strategies described in sections 7.2 through 7.2.4 were shared in my first multi-clock
design paper given in 2001. After the initial presentation a number of engineers came forward to
share additional techniques to remove X-propagation from gate level simulations. Engineers
from at least three companies described a technique very similar to the description in section
### (kudos to the engineers who attended San Jose SNUG-2001).
Since then, other engineers from many companies have shared additional techniques. Those
techniques are described in this section and I am very grateful to all the engineers who continue
to share interesting techniques with me each year. Kudos to you all!
7.3.1 Use multiple SDF files

Remember, a key to removing unwanted X-propagation is to force the setup and hold times of
the synchronizer inputs to 0, thereby removing all possible setup and hold time violations on the
synchronizer inputs.
Many engineers have told me that they actually generate two SDF files. The first SDF file has all
of the actual delays, including accurate setup and hold times, for the entire design. Then
engineers generate a second SDF file with only the first stage flip-flops included in the file. In
this file, the setup and hold times are set to 0. Some engineers build this file by hand and others
generate this file using scripts.
Engineers then read in the first SDF file using the $sdf_annotate command. Then they read in the
second SDF file, which overwrites the setup and hold times for the data inputs of the first stage
synchronizers. When reading in the two SDF files, the last SDF file read for each instance wins. All
timing is accurately annotated first, and then the timing checks for the first stage synchronizers
are overwritten.
This is a clever technique that can be used with any tools flow that generates SDF files.
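The two-file annotation order described above could be sketched in a testbench as shown below. This is only an illustration under stated assumptions: the chip module is a stand-in for a real gate-level netlist, and the SDF file names chip_full.sdf and chip_sync_stage1_zero.sdf are hypothetical. A simulator will typically warn if the files do not exist.

```systemverilog
`timescale 1ns/1ps

// Stand-in for the gate-level netlist top. A real flow would compile the
// synthesized netlist and the vendor cell library here instead.
module chip (output logic q, input logic d, clk);
  always_ff @(posedge clk) q <= d;
endmodule

module tb;
  logic d = 0, clk = 0;
  logic q;

  chip dut (.*);

  always #5 clk = ~clk;

  initial begin
    // First file (hypothetical name): actual delays and timing checks
    // for the entire design.
    $sdf_annotate("chip_full.sdf", dut);
    // Second file (hypothetical name): only the first-stage synchronizer
    // flip-flops, with setup and hold times set to 0. It is read last so
    // that its values replace the ones annotated by the first file.
    $sdf_annotate("chip_sync_stage1_zero.sdf", dut);
    #100 $finish;
  end
endmodule
```

The order of the two calls matters, because the second annotation is the one that remains in effect for the first-stage flip-flops.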
7.3.2 Vendor synchronizer cell with supporting SDF generation tools

Other engineers have told of a great way to approach the X-propagation problem, but the method
requires either (a) control over the cell library, or (b) a good working relationship with your
ASIC vendor.
This technique requires that a separate synchronizer cell be created with proper placement
relationship between the two flip-flop stages. To make this method work, the vendor must
provide:
(1) the actual synchronizer cell - these will be instantiated into the design.
(2) the SystemVerilog model for the synchronizer cell for simulation.
(3) SDF file generation tools that will generate an SDF file with 0-setup and 0-hold for the
synchronizer cells.
If a vendor can provide this cell and these capabilities, it will only be necessary to generate a
single SDF file with proper timing checks for the synchronizer cells.
Any ASIC or FPGA vendor who provides this capability is doing a huge favor for their customer
base. I have heard that some ASIC vendors provide this capability. I do not know of any FPGA
vendor who provides this capability. Recognizing that most modern designs are multi-clock
designs, I strongly urge all ASIC and even all FPGA vendors to provide synchronizer cells with
appropriate simulation and SDF file tool support.
7.3.3 Vendors with built-in synchronizer support

If anyone knows of a vendor who offers this support, please let me know who the vendor is with
appropriate contact information and I will periodically update this paper to honor vendors who
offer this capability to us, the design community.
Vendor list:
(No vendors listed as of this release of this paper)
7.4 Multiple SDF files for gate-level CDC simulations

Immediately after giving my first multi-clock presentation in 2001, engineers from at least three
different companies came up after my presentation to share the following great technique to
address X-propagation in gate-level simulations.
The technique involves writing out the full SDF timing file and then, either manually or by using
a script, generating a second SDF file for just the first-stage flip-flops of all synchronizer modules.
The second SDF file sets all setup and hold times to 0, and then the two SDF files are applied to
the design using $sdf_annotate commands. The first SDF file annotates all of the actual timing to
the entire design and then the second SDF file is read to overwrite the setup and hold times for
the first stage synchronizers.
The advantage of this technique is that it can be used for all designs using all tools, not just
Synopsys ASIC designs. This is a highly recommended technique.
7.5 Force synchronizer notifier inputs to a fixed value

The built-in timing checks for Verilog and SystemVerilog setup and hold time checks ($setup,
$hold, and $setuphold) have an optional notifier output. This notifier output toggles from 0-1-X-
Z whenever a timing violation is detected.
Most ASIC and FPGA flip-flop models are built from Verilog User Defined Primitives (UDPs)
and the notifier signal is typically listed as one of the inputs to the UDP table. Whenever the
notifier input toggles (caused by a timing violation), the flip-flop output goes unknown and that
unknown is what is visible on the output of the gate-level flip-flop models. The notifier on these
first-stage flip-flop models can be forced to a logic level to prevent them from toggling and
causing the flip-flop outputs to go unknown during simulation.
One clever technique used by at least one company forces the timing violation notifiers of the
first-stage synchronizer flip-flops to one logic level so they can never toggle and
trigger X's in the flip-flop models.
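A minimal sketch of the notifier-forcing idea is shown below. Everything in it is a hypothetical stand-in: the UDP udp_dff, the cell lib_dff, the netlist sync2_gate, the timing values and the hierarchical path sync.ff1.notifier are assumptions made for illustration. Real library cells use vendor-specific names for the notifier register, so the path in the force statement has to be taken from the actual library model.

```systemverilog
`timescale 1ns/1ps

// Hypothetical library-style D flip-flop UDP with a notifier input.
primitive udp_dff (q, d, clk, notifier);
  output q; reg q;
  input  d, clk, notifier;
  table
  // d   clk  ntf : q : q+
     0  (01)   ?  : ? : 0;   // rising clk captures 0
     1  (01)   ?  : ? : 1;   // rising clk captures 1
     ?  (?0)   ?  : ? : -;   // falling clk: hold
     *   ?     ?  : ? : -;   // data change with steady clk: hold
     ?   ?     *  : ? : x;   // notifier change (timing violation): go X
  endtable
endprimitive

// Hypothetical library cell: the timing check toggles the notifier.
module lib_dff (output q, input d, clk);
  reg notifier;
  udp_dff u0 (q, d, clk, notifier);
  specify
    specparam tsu = 0.2, th = 0.1;   // assumed setup and hold limits
    $setuphold(posedge clk, d, tsu, th, notifier);
  endspecify
endmodule

// Hypothetical gate-level two flip-flop synchronizer.
module sync2_gate (output q2, input d, clk);
  wire q1;
  lib_dff ff1 (.q(q1), .d(d),  .clk(clk));
  lib_dff ff2 (.q(q2), .d(q1), .clk(clk));
endmodule

module tb;
  reg  d = 0, clk = 0;
  wire q2;

  sync2_gate sync (.q2(q2), .d(d), .clk(clk));

  always #5 clk = ~clk;              // rising edges at 5 ns, 15 ns, ...

  // Hold the first-stage notifier at one logic level so a setup or hold
  // violation on ff1 can no longer toggle it and drive q1 to X.
  initial force sync.ff1.notifier = 1'b0;

  initial begin
    #14.9 d = 1;                     // changes inside the assumed setup window
    #50 $finish;
  end
endmodule
```

Only the first stage is forced in this sketch, so the timing check on the second stage, and on every other flip-flop in a real design, still behaves normally. The simulator may still print the violation message for ff1; only the X on its output is suppressed.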
7.6 ASIC & FPGA library cell synchronizers

ASIC and FPGA providers could make CDC design much easier to do if they would provide
fully characterized synchronizer cells that could be instantiated into a design. Advanced ASIC
vendors provide:
(1) the characterized synchronizer cell.
(2) the Verilog model to simulate the synchronizer cell.
(3) the SDF generator to generate SDF files that annotate the setup and hold time on a
synchronizer cell to 0 to avoid the X-generation when a signal violates setup or hold times
while crossing CDC boundaries.
I know of no FPGA provider that provides this capability, but a forward thinking FPGA provider
would provide such cells for their advanced multi-clock design customers.
7.7 Simulation model with random delay insertion
An interesting model that synthesizes to the correct synchronizer for design, but simulates with
random cycle delays has been suggested by multiple colleagues.
The block diagram for the model is shown in Figure 33, and the SystemVerilog code to support
this model is shown in Example 6.
Figure 33 - Sample ASIC & FPGA synchronizer cell for synthesis and simulation
As can be seen in the block diagram, the model is designed either to produce a synthesizable
synchronizer model or to be used as a simulation model with selectable delays.
The IEEE Std 1364.1-2002 Verilog RTL Synthesis Standard[6] requires that a compliant
synthesis tool set the SYNTHESIS macro before reading in any Verilog models. Although most
synthesis tools have largely ignored many of the requirements of the IEEE Verilog synthesis
standard, most tools have implemented this nice SYNTHESIS macro requirement.
Tools that set the SYNTHESIS macro before reading this sync2 SystemVerilog code will select
the code to infer the two flip-flop synchronizer.
Simulators, which do not set the SYNTHESIS macro, will read the sync2 model, ignore the code
intended for a synthesizable model and will simulate the model in the `else portion of the code.
The model is parameterized so the same model can be used with the default parameter SIZE of
1-bit in width for simple 1-bit CDC signals, or the model can be instantiated with the SIZE
parameter set to a multi-bit width so that the synchronizer can be used to capture and synchronize
multi-bit buses such as gray code counters.
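module sync2 #(parameter SIZE=1)
(output logic [SIZE-1:0] q2,
input logic [SIZE-1:0] d,
input logic clk, rst_n);
`ifdef SYNTHESIS
// synthesis model: two flip-flop synchronizer
logic [SIZE-1:0] q1;
always_ff @(posedge clk or negedge rst_n)
if (!rst_n) {q2,q1} <= '0;
else {q2,q1} <= {q1,d};
`else
// simulation model: each DLY bit selects two or three flip-flop stages
logic [SIZE-1:0] y1, q1a, q1b;
logic [SIZE-1:0] DLY = '0;
assign y1 = (~DLY & q1a) | (DLY & q1b);
always_ff @(posedge clk or negedge rst_n)
if (!rst_n) {q2,q1b,q1a} <= '0;
else {q2,q1b,q1a} <= {y1,q1a,d};
`endif
endmodule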
Example 6 - SystemVerilog model for ASIC & FPGA synchronizer cell
The simulation portion of the model includes a default declaration for a SIZE-ed variable called
DLY. By default the DLY variable is initialized to 0, which causes the entire sync2 model to
simulate with the default of two flip-flop delays, but the DLY variable can be hierarchically set
from the testbench to a reproducible 1's and 0's random value to cause some of the bits in the bus
to pass through three flip-flop stages while others pass through only two flip-flop stages. This can
model the behavior of a set of synchronizers where some bits are captured on an earlier clock
edge than others and allow the simulation to observe how well the design behaves with small
skews in the multi-bit data path.
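One way the DLY variable could be driven from a testbench is sketched below. It assumes the sync2 model of Example 6 is compiled alongside it without the SYNTHESIS macro defined. The testbench name tb_sync2_dly, the seed value and the stimulus are hypothetical and chosen only for illustration.

```systemverilog
module tb_sync2_dly;
  localparam int SIZE = 4;
  localparam int SEED = 2008;        // fixed seed for reproducible runs

  logic [SIZE-1:0] d = '0, q2;
  logic            clk = 0, rst_n = 0;

  // sync2 is the model from Example 6, compiled without SYNTHESIS defined.
  sync2 #(.SIZE(SIZE)) dut (.*);

  always #5 clk = ~clk;

  initial begin
    void'($urandom(SEED));           // seed the random stream once
    repeat (2) @(posedge clk);
    rst_n = 1;
    repeat (16) begin
      @(negedge clk);
      // A 1 in a DLY bit sends that bit through three flip-flop stages,
      // a 0 sends it through two.
      dut.DLY = SIZE'($urandom());
      d       = d + 1'b1;            // placeholder stimulus
    end
    $finish;
  end
endmodule
```

Because the seed is fixed, the same sequence of DLY patterns is applied on every run with a given simulator, which makes a failure caused by a particular skew pattern repeatable.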
8.0 Summary & conclusions

Clock Domain Crossing (CDC) errors can cause serious design failures. These expensive failures
can be avoided by following a few critical guidelines and using well established verification
techniques.
8.1 Recommended 1-bit CDC techniques

When passing one bit between clock domains:
• register the signal in the sending clock domain to remove combinational settling.
• synchronize the signal into the receiving clock domain. A Multi-Cycle Path (MCP)
formulation may be necessary.
8.2 Recommended multi-bit CDC techniques
When passing multiple control or data signals between clock domains, use one of the following
strategies:
• Consolidate - first attempt to combine multiple signals into a 1-bit representation in the
sending clock domain before synchronizing the signal into the receiving domain.
• Use Multi-Cycle Path (MCP) formulations to pass multiple signals across clock domains.
• Use FIFOs to pass multi-bit buses, either data or control buses.
• Use gray code counters.
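As a small illustration of the last strategy, the sketch below converts a binary count to gray code and checks that consecutive codes differ in exactly one bit, which is the property that makes a gray code counter safe to pass through per-bit synchronizers. The module names bin2gray and tb_bin2gray are hypothetical and are not taken from this paper.

```systemverilog
// Binary-to-gray conversion: each output bit is the XOR of two adjacent
// binary bits, and the MSB is passed through unchanged.
module bin2gray #(parameter SIZE = 4) (
  output logic [SIZE-1:0] gray,
  input  logic [SIZE-1:0] bin);
  assign gray = bin ^ (bin >> 1);
endmodule

module tb_bin2gray;
  localparam SIZE = 3;
  logic [SIZE-1:0] bin, gray, prev;

  bin2gray #(SIZE) u1 (.*);

  initial begin
    bin = '0;
    #1 prev = gray;
    for (int i = 1; i < 2**SIZE; i++) begin
      bin = SIZE'(i);
      #1;
      // Consecutive gray codes must differ in exactly one bit.
      assert ($countones(gray ^ prev) == 1)
        else $error("more than one bit changed at count %0d", i);
      prev = gray;
    end
    $finish;
  end
endmodule
```

In a real design the gray value would be registered in the sending clock domain before it is synchronized, in line with the 1-bit recommendation in section 8.1.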
8.3 Recommended naming conventions and design partitioning

Use a clock-based naming convention.
As much as possible, partition the design sub-blocks into completely synchronous 1-clock
designs.
8.4 Recommended solutions to multi-clock gate-level CDC simulations

There are multiple useful solutions to the CDC X-propagation simulation issues during gate-level
simulation:
• Use a Synopsys switch to generate 0-setup and 0-hold times for first stage flip-flops on
synchronizers. Works okay with Synopsys tools only.
• Use multiple SDF files - good technique described in sections 7.3.1 and 7.4.
• Vendor provides a synchronizer cell and appropriate SDF tools - great solution if your ASIC
or FPGA vendor provides the models and tools (very few do - ask your ASIC & FPGA vendors
to support this feature).
• Use creative SystemVerilog models to model synchronization problems.
The techniques described in this paper were designed to facilitate robust development and
verification of multi-clock designs.
9.0 Acknowledgements
My thanks to the hundreds of colleagues and students who have shared interesting multi-
asynchronous clock CDC design techniques over the past eight years.
10.0 References
[1] Clifford E. Cummings, “Simulation and Synthesis Techniques for Asynchronous FIFO Design,”
SNUG 2002 - www.sunburst-design.com/papers/CummingsSNUG2002SJ_FIFO1.pdf
[2] Clifford E. Cummings, “Synthesis and Scripting Techniques for Designing Multi-Asynchronous
Clock Designs,” SNUG 2001 -
www.sunburst-design.com/papers/CummingsSNUG2001SJ_AsyncClk.pdf
[3] Don Mills & Clifford E. Cummings, “RTL Coding Styles That Yield Simulation and Synthesis
Mismatches” SNUG 1999 -
www.sunburst-design.com/papers/CummingsSNUG1999SJ_SynthMismatch.pdf
[4] Frank Gray, "Pulse Code Communication." United States Patent Number 2,632,058. March 17,
1953.
[5] Himanshu Bhatnagar, Advanced ASIC Chip Synthesis, Second Edition, Kluwer Academic
Publishers, 2002.
[6] "IEEE Std. 1364.1 - 2002 IEEE Standard for Verilog Register Transfer Level Synthesis," IEEE
Computer Society, IEEE, New York, NY, IEEE Std 1364.1-2002
[7] Mark Litterick, “Pragmatic Simulation-Based Verification of Clock Domain Crossing Signals and
Jitter Using SystemVerilog Assertions,” DVCon 2006
www.verilab.com/files/sva_cdc_paper_dvcon2006.pdf
[8] Real Intent, Inc. (white paper), “Clock Domain Crossing Demystified: The Second Generation
Solution for CDC Verification,” February 2008 - www.realintent.com
[9] Steve Golson, personal communication
[10] William J. Dally and John W. Poulton, Digital Systems Engineering, Cambridge University Press,
1998
[11] Wikipedia: http://en.wikipedia.org/wiki/Clock_Domain_Crossing_Verification
11.0 Author & Contact Information

Cliff Cummings, President of Sunburst Design, Inc., is an independent EDA consultant and
trainer with 26 years of ASIC, FPGA and system design experience and 16 years of
SystemVerilog, synthesis and methodology training experience.
Mr. Cummings has presented more than 80 SystemVerilog seminars and training classes in the
past five years and was the featured speaker at the world-wide SystemVerilog NOW! seminars.
Mr. Cummings has participated on every IEEE & Accellera SystemVerilog, SystemVerilog
Synthesis, SystemVerilog committee, and has presented more than 40 papers on SystemVerilog
& SystemVerilog related design, synthesis and verification techniques.
Mr. Cummings holds a BSEE from Brigham Young University and an MSEE from Oregon State
University.
Sunburst Design, Inc. offers World Class Verilog & SystemVerilog training courses. For more
information, visit the www.sunburst-design.com web site.
Email address: cliffc@sunburst-design.com
An updated version of this paper can be downloaded from the web site:
www.sunburst-design.com/papers
(Last updated September 26, 2008)
12.0 Appendix
This appendix includes the source code for the MCP formulation with acknowledge feedback
and the 1-deep, 2-register FIFO synchronizer.
12.1 Common sync2 model - used by MCP formulation and FIFO synchronizer
The sync2 model is common to both the MCP formulation with ready-acknowledge design
(source code in section 12.2) and the multi-bit 1-deep / 2-register FIFO synchronizer (source
code in section 12.3).
// sync signal to different clock domain
module sync2 (
output logic q,
input logic d, clk, rst_n);
logic q1; // 1st stage ff output
always_ff @(posedge clk or negedge rst_n)
if (!rst_n) {q,q1} <= '0;
else {q,q1} <= {q1,d};
endmodule
Example 7 - sync2.sv code
12.2 MCP formulation with ready-acknowledge source code

This model requires the sync2 model shown in section 12.1.
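// Pulse Generator
module plsgen (
output logic pulse, q,
input logic d,
input logic clk, rst_n);
// register d so that a toggle on d can be detected
always_ff @(posedge clk or negedge rst_n)
if (!rst_n) q <= '0;
else q <= d;
// one-cycle pulse whenever d differs from its registered value
assign pulse = q ^ d;
endmodule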
Example 8 - plsgen.sv code
module asend_fsm (
output logic aready, // ready to send next data
input logic asend, // send adata
input logic aack, // acknowledge receipt of adata
input logic aclk, arst_n);
enum logic {READY = '1,

BUSY = '0} state, next;
always_ff @(posedge aclk or negedge arst_n)

if (!arst_n) state <= READY;
else state <= next;
always_comb begin
case (state)
READY: if (asend) next = BUSY;
else next = READY;
BUSY : if (aack) next = READY;
else next = BUSY;
endcase
end
assign aready = state;

endmodule
Example 9 - asend_fsm.sv code
module back_fsm (
output logic bvalid, // data valid / ready to load
input logic bload, // load data / send acknowledge
input logic b_en, // enable receipt of adata
input logic bclk, brst_n);
enum logic {READY = '1,

WAIT = '0} state, next;
always_ff @(posedge bclk or negedge brst_n)

if (!brst_n) state <= WAIT;
else state <= next;
always_comb begin
case (state)
READY: if (bload) next = WAIT;
else next = READY;
WAIT : if (b_en) next = READY;
else next = WAIT;
endcase
end
assign bvalid = state;

endmodule
Example 10 - back_fsm.sv code
module bmcp_recv (
output logic [7:0] bdata,
output logic bvalid, // bdata valid
output logic b_ack, // acknowledge signal
input logic [7:0] adata, // unsynchronized adata
input logic bload, // load data and acknowledge receipt
input logic bq2_en, // synchronized enable input
input logic bclk, brst_n);
logic b_en; // enable pulse from pulse generator
logic bload_data; // load strobe: data valid and load requested
// Pulse Generator
plsgen pg1 (.pulse(b_en), .q(), .d(bq2_en),
.clk(bclk), .rst_n(brst_n));
// data ready/acknowledge FSM
back_fsm fsm (.*);
// load next data word
assign bload_data = bvalid & bload;
// toggle-flop controlled by bload_data
always_ff @(posedge bclk or negedge brst_n)
if ( !brst_n) b_ack <= '0;
else if (bload_data) b_ack <= ~b_ack;
// data register loaded by bload_data
always_ff @(posedge bclk or negedge brst_n)
if ( !brst_n) bdata <= '0;
else if (bload_data) bdata <= adata;
endmodule
Example 11 - bmcp_recv.sv code
module mcp_blk #(parameter type dat_t = logic [7:0]) (
output logic aready, // ready to receive next data
input logic [7:0] adatain,
input logic asend,
input logic aclk, arst_n,
output logic [7:0] bdata,
output logic bvalid, // bdata valid (ready)
input logic bload,
input logic bclk, brst_n);
logic [7:0] adata; // internal data bus
logic b_ack; // acknowledge enable signal
logic a_en; // control enable signal
logic bq2_en; // control - sync output
logic aq2_ack; // feedback - sync output
sync2 async (.q(aq2_ack), .d(b_ack), .clk(aclk), .rst_n(arst_n));
sync2 bsync (.q(bq2_en), .d(a_en), .clk(bclk), .rst_n(brst_n));
amcp_send alogic (.*);
bmcp_recv blogic (.*);
endmodule
Example 12 - mcp_blk.sv code
module amcp_send (
output logic [7:0] adata,
output logic a_en, aready,
input logic [7:0] adatain,
input logic asend,
input logic aq2_ack,
input logic aclk, arst_n);
logic aack; // acknowledge pulse from pulse generator
logic anxt_data; // send strobe: ready and send requested
// Pulse Generator
plsgen pg1 (.pulse(aack), .q(), .d(aq2_ack),
.clk(aclk), .rst_n(arst_n));
// data ready/acknowledge FSM
asend_fsm fsm (.*);
// send next data word
assign anxt_data = aready & asend;
// toggle-flop controlled by anxt_data
always_ff @(posedge aclk or negedge arst_n)
if ( !arst_n) a_en <= '0;
else if (anxt_data) a_en <= ~a_en;
// data register loaded by anxt_data
always_ff @(posedge aclk or negedge arst_n)
if ( !arst_n) adata <= '0;
else if (anxt_data) adata <= adatain;
endmodule
Example 13 - amcp_send.sv code
12.3 Multi-bit 1-deep / 2-register FIFO synchronizer source code
This model requires the sync2 model shown in section 12.1.
module wctl (
output logic wrdy, wptr, we,
input logic wput, wq2_rptr,
input logic wclk, wrst_n);
assign we = wrdy & wput;

assign wrdy = ~(wq2_rptr ^ wptr);
always_ff @(posedge wclk or negedge wrst_n)

if (!wrst_n) wptr <= '0;
else wptr <= wptr ^ we;
endmodule
Example 14 - wctl.sv code
`timescale 1ns/1ns
module cdc_syncfifo #(parameter type dat_t = logic [7:0]) (
// Write clk interface
input dat_t wdata,
output logic wrdy,
input logic wput,
input logic wclk, wrst_n,
// Read clk interface
output dat_t rdata,
output logic rrdy,
input logic rget,
input logic rclk, rrst_n);
logic wptr, we, wq2_rptr;

logic rptr, rq2_wptr;
wctl wctl (.*);
rctl rctl (.*);
sync2 w2r_sync (.q(rq2_wptr), .d(wptr), .clk(rclk), .rst_n(rrst_n));
sync2 r2w_sync (.q(wq2_rptr), .d(rptr), .clk(wclk), .rst_n(wrst_n));
// dual-port 2-deep ram

dp_ram2 #(dat_t) dpram (.q(rdata), .d(wdata),
.waddr(wptr), .raddr(rptr),
.we(we), .clk(wclk), .*);
endmodule
Example 15 - cdc_syncfifo.sv code
// dual-port 2-deep ram
module dp_ram2 #(parameter type dat_t = logic [7:0])
(output dat_t q,
input dat_t d,
input logic waddr, raddr, we, clk);
dat_t mem [0:1];
always_ff @(posedge clk)

if (we) mem[waddr] <= d;
assign q = mem[raddr];
endmodule
Example 16 - Dual Port Ram code - dp_ram2.sv
module rctl (
output logic rrdy, rptr,
input logic rget,rq2_wptr,
input logic rclk, rrst_n);
typedef enum {xxx, VALID} status_e;

status_e status;
assign status = status_e'(rrdy);
assign rinc = rrdy & rget;

assign rrdy = (rq2_wptr ^ rptr);
always_ff @(posedge rclk or negedge rrst_n)

if (!rrst_n) rptr <= '0;
else rptr <= rptr ^ rinc;
endmodule
Example 17 - rctl.sv code
Expert Verilog, SystemVerilog & Synthesis Training
Synthesis and Scripting Techniques for Designing Multi-Asynchronous Clock Designs
SNUG-2001
San Jose, CA
Voted Best Paper
3rd Place
Clifford E. Cummings, Sunburst Design, Inc.

ABSTRACT
Designing a pure, one-clock synchronous design is a luxury that few ASIC designers will ever know. Most of the
ASICs that are ever designed are driven by multiple asynchronous clocks and require special data, control-signal
and verification handling to insure the timely completion of a robust working design.
1.0 Introduction
Most college courses teach engineering students prescribed techniques for designing completely synchronous
(single clock) logic. In the real ASIC design world, there are very few single clock designs. This paper will detail
some of the hardware design, timing analysis, synthesis and simulation methodologies to address multi-clock
designs.
This paper is not intended to provide exhaustive coverage of this topic, but is presented to share techniques learned
from experience.
2.0 Metastability
Quoting from Dally and Poulton's book[6] concerning metastability:
"When sampling a changing data signal with a clock ... the order of the events determines the outcome.
The smaller the time difference between the events, the longer it takes to determine which came first.
When two events occur very close together, the decision process can take longer than the time allotted,
and a synchronization failure occurs."
[Figure 1 diagram: adat, launched by aclk, is asynchronous to bclk and is sampled by only one synchronizing flip-flop. bclk samples adat while it is changing; the clocked signal bdat1 is initially metastable and might still be metastable at the next rising edge of bclk.]
Figure 1 - Asynchronous clocks and synchronization failure
Figure 1 shows a synchronization failure that occurs when a signal generated in one clock domain is sampled too
close to the rising edge of a clock signal from another clock domain.
Synchronization failure is caused by an output going metastable and not converging to a legal stable state by the
time the output must be sampled again. Figure 2 shows that a metastable output can cause illegal signal values to be
propagated throughout the rest of the design.
SNUG San Jose 2001 2 Synthesis and Scripting Techniques for

Rev 1.2 Designing Multi-Asynchronous Clock Designs
[Figure 2 diagram: the sampling clock bclk captures adat while it is changing. The clocked signal bdat1 is initially metastable and is still metastable on the next active clock edge, so invalid data is propagated throughout the design and other logic output values are indeterminate.]
Figure 2 - Metastable bdat1 output propagating invalid data throughout the design
Every flip-flop that is used in any design has a specified setup and hold time, or the time in which the data input is
not legally permitted to change before and after a rising clock edge. This time window is specified as a design
parameter precisely to keep a data signal from changing too close to another synchronizing signal that could cause
the output to go metastable.
The metastable output problem shown in Figure 2 is sometimes known as the John Cooley ESNUG effect, or in
other words, the propagation of unwanted information!
(Just kidding, John! ☺)
3.0 Synchronizers
Quoting again from Dally and Poulton[7] concerning synchronizers:
"A synchronizer is a device that samples an asynchronous signal and outputs a version of the signal
that has transitions synchronized to a local or sample clock."
The most common synchronizer used by digital designers is a two-flip-flop synchronizer as shown in Figure 3.

[Figure 3 diagram: adat is sampled by two flip-flops clocked by bclk. The clocked signal bdat1 is initially metastable but goes "high" before the next active clock edge, so bdat2 is synchronized and valid.]
Figure 3 - Two flip-flop synchronizer

The first flip-flop samples the asynchronous input signal into the new clock domain and waits for a full clock cycle
to permit any metastability on the stage-1 output signal to decay, then the stage-1 signal is sampled by the same
clock into a second stage flip-flop, with the intended goal that the stage-2 signal is now a stable and valid signal
synchronized into the new clock domain.
It is theoretically possible for the stage-1 signal to still be sufficiently metastable by the time the signal is clocked
into the second stage to cause the stage-2 signal to also go metastable. The mean time between synchronization
failures (MTBF) is a function of multiple variables, including the clock frequencies used to
generate the input signal and to clock the synchronizing flip-flops. One description of the MTBF calculation can be
found in Dally and Poulton[8].
For most synchronization applications, the two flip-flop synchronizer is sufficient to remove all likely metastability.
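As an illustration only, the two flip-flop synchronizer described above might be coded as follows. The signal names
adat, bdat1, bdat2 and bclk follow the labels in Figure 3; the module name sync2 and the active-low asynchronous
reset brst_n are assumptions that do not appear in the figure.

```verilog
// Minimal sketch of a two flip-flop synchronizer (not taken from the paper).
// Assumed names: sync2 (module), brst_n (active-low asynchronous reset).
module sync2 (
  output reg bdat2,   // stage-2 output: stable and valid in the bclk domain
  input      adat,    // asynchronous input, launched from the aclk domain
  input      bclk,    // receiving (sampling) clock
  input      brst_n   // assumed active-low asynchronous reset
);
  reg bdat1;          // stage-1 output: may go metastable

  // Both stages are clocked by bclk; stage 1 gets one full bclk cycle
  // to settle before stage 2 samples it.
  always @(posedge bclk or negedge brst_n)
    if (!brst_n) {bdat2, bdat1} <= 2'b00;
    else         {bdat2, bdat1} <= {bdat1, adat};
endmodule
```

Only bdat2 is brought out of the module; bdat1 should not be used by any other logic, because it is the signal
that is allowed to be metastable.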
4.0 Static Timing Analysis

Performing static timing analysis is the process of verifying that every signal path in a design meets required clock-
cycle timing, whether or not all of the signal paths are even possible. Static timing analysis is not used to verify the
functionality of the design, only that the design meets timing goals. In theory, timing verification could be
accomplished by running exhaustive gate-level simulations with SDF backannotation of actual timing values after a
design is placed and routed. This is often referred to as dynamic timing verification.
Static timing analysis has three principal advantages over dynamic timing verification: (1) static timing analysis
tools verify every single path between any two sequential elements, (2) static timing analysis does not require the
generation of any test vectors, and (3) static timing analysis tools are orders of magnitude faster than trying to do
timing verification running exhaustive gate-level simulations[4].
Timing analysis using Synopsys tools on a completely synchronous design is relatively easy to perform using either
DesignTime within the Synopsys Design Compiler or Design Analyzer environments, or by using PrimeTime.

Timing analysis on modules with two or more asynchronous clocks is error prone, more difficult and can be time
consuming. Static timing analysis on signals generated from one clock domain and latched into sequential elements
within a second, asynchronous clock domain is inaccurate and for the most part worthless. The timing information
for a signal latched by a clock that is asynchronous to the latched signal is inaccurate because the phase relationship
between the signal and the asynchronous clock is always changing; therefore, the static timing analysis tool would
have to check an infinite number of phase relationships between the signal and asynchronous clock. The fact is, one
must assume that signals that pass from one clock domain to another at some point will violate either setup or hold
times on the destination sequential element.
There is no good reason to perform timing analysis on signals that are generated in one clock domain and registered
in another asynchronous clock domain. It is a given that these signals DO violate setup and hold times on the
destination register. This is why synchronizers (see section 3.0) are needed, to alleviate the problems that can occur
when a signal is passed from one clock domain to another.
For RTL modules that have two or more asynchronous clocks as inputs, a designer will be required to indicate to the
static timing analysis tool which signal paths should be ignored. This is accomplished by "setting false paths" on
signals that cross from one clock domain to another. This can be a tedious and error prone job unless the guidelines
in the next two sections are followed.
5.0 Clock Naming Conventions

Guideline: Use a clock naming convention to identify the clock source of every signal in a design.
Reason: A naming convention helps all team members to identify the clock domain for every signal in a design and
also makes grouping of signals for timing analysis easier to do using regular expression "wild-carding" from within
a synthesis script.
A number of useful clock naming conventions have been used by various design teams. One that was used by
design engineers in 1995 while designing video ASICs for In Focus projectors required that a leading prefix
character be used to identify the various asynchronous clock domains. Examples included: uClk for the
microprocessor clock, vClk for the video clock and dClk for the display clock.
Each signal was synchronized to one of the clock domains in the design and each signal-name had to include a
prefix character identifying the clock domain for that signal. Any signal that was clocked by the uClk would have a
u-prefix in the signal name, such as uaddr, udata, uwrite, etc. Any signal that was clocked by the vClk would
similarly have a v-prefix in the signal name, such as vdata, vhsync, vframe, etc. The same signal naming convention
was used for all signals generated by any of the other clocks in the design.
Using this technique, any engineer on the ASIC design team could easily identify the clock-domain source of any
signal in the design and either use the signals directly or pass the signals through a synchronizer so that they could
be used within a new clock domain.
The naming convention alone contributed significantly to the productivity of the design team. How do we know
there was a productivity gain? One of the design engineers started his part of the ASIC design using his own naming
convention, ignoring the convention in use by the other design team members. After much confusion about the
signals entering and leaving his design partition, a team meeting was called and the non-compliant designer was
"strongly encouraged" to rename the signals in his part of the design to conform to the team naming convention.
After the signal names were changed, it became easier to interface to the partition in question. Fewer questions and
less confusion occurred after the change.

6.0 Design Partitioning
Guideline: Only allow one clock per module.
Reason: Static timing analysis and creating synthesis scripts are more easily accomplished on single-clock modules
or groups of single-clock modules.
Guideline: Create a synchronizer module for each set of signals that pass from just one clock domain into another
clock domain.
Reason: It is given that any signal passing from one clock domain to another clock domain is going to have setup
and hold time problems. No worst-case (max time) timing analysis is required for synchronizer modules. Only best
case (min time) timing analysis is required between first and second stage flip-flops to ensure that all hold times are
met. Also, gate-level simulations can more easily be configured to ignore setup and hold time violations on the first
stage of each synchronizer.
[Figure 4 shows three blocks of single-clock logic (aClk Logic, bClk Logic and cClk Logic) that exchange signals
only through six synchronizer modules: sync_a2b, sync_b2a, sync_a2c, sync_c2a, sync_b2c and sync_c2b. Each
non-synchronizer module is now completely synchronous to just one clock, so it is simple to perform static timing
analysis for each clock.]
Figure 4 - Design partitioned on clock boundaries

In 1995, while working on a multi-asynchronous-clock ASIC design to be used in In Focus projectors, I received an
e-mail message from Steve Golson in which he gave me the strong recommendation to only allow one clock per
module for each module in the ASIC design[5]. At that time we were permitting multiple clocks per module and
trying to handle timing analysis by including a large number of set_false_path commands in our synthesis scripts to
eliminate invalid timing-error messages.
After giving consideration to Steve's recommendation, I decided to completely re-partition the ASIC design I was
working on and to adhere to the recommendation to only permit one clock per module. I took a two-week hit to my
schedule to re-partition the entire ASIC. After repartitioning the design, many of the timing analysis and synthesis
tasks became trivial.
By partitioning a design to permit only one clock per module, static timing analysis becomes a significantly easier
task.

The next logical step was to partition the design so that every input module signal was already synchronized to the
same clock domain before entering the module. Why is this significant? If all signals entering and leaving the
module are synchronous to the clock used in the module, the design is now completely synchronous! Now the entire
module can be static timing analyzed without any "false paths" and Design Compiler can be used to "group" all of
the same-clock synchronous modules to perform complete, sequential static timing analysis within each clock
domain.
There is one exception to the above recommendation. Multi-clock designs require at least some RTL modules to
pass signals from one clock domain to modules that are clocked within a different clock domain. For the In Focus
ASIC designs, we created separate synchronizer modules that permitted signals from one and only one clock
domain to be passed into a module that synchronized the signals into a new clock domain.
Using the naming convention described in section 5.0, all processor-clock generated signals (u-signals) would be
used as inputs to a module that might be clocked by the video clock. This module was called the "sync_u2v" module
and the RTL code did nothing more than take each u-signal input and run it through a pair of flip-flops clocked by
vClk. Aside from the vClk and reset inputs, every other input signal to the "sync_u2v" module had a "u" prefix and
every output signal from that same module had a "v" prefix.
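A sketch of what such a module might look like is shown here. Only the module name sync_u2v, the vClk clock and
the u/v prefix rule come from the description above; the particular signals (uwrite, uread), the reset name rst_n
and the "_m" suffix on the first-stage flip-flops are assumptions made for illustration. The two signals are also
assumed to be unrelated to each other; related control signals need the extra care described in section 9.0.

```verilog
// Hypothetical sync_u2v module: every data input has a "u" prefix and
// every output has a "v" prefix. Signal list and rst_n are assumptions.
module sync_u2v (
  output reg vwrite, vread,   // synchronized to vClk
  input      uwrite, uread,   // generated in the uClk domain
  input      vClk,            // the only clock in this module
  input      rst_n            // assumed active-low asynchronous reset
);
  reg vwrite_m, vread_m;      // first-stage flip-flops (may go metastable)

  // Each u-signal passes through a pair of flip-flops clocked by vClk.
  always @(posedge vClk or negedge rst_n)
    if (!rst_n) begin
      {vwrite, vwrite_m} <= 2'b00;
      {vread,  vread_m } <= 2'b00;
    end
    else begin
      {vwrite, vwrite_m} <= {vwrite_m, uwrite};
      {vread,  vread_m } <= {vread_m,  uread };
    end
endmodule
```

Because every asynchronous input starts with "u", the inputs of a module written this way can be matched with a
single wild-card pattern in a synthesis script.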
No worst-case timing analysis is required on the "sync" modules because we know that every input signal to these
modules will have timing problems; otherwise, we would not have to pass the signals through synchronizers. The
only timing analysis that we need to perform within synchronizer modules is min-time (hold time) analysis between
the first and second flip-flop stages for each signal.
In general, if there are n asynchronous clock domains, the design will require n(n-1) synchronizer modules, two for
each pair of clock signals (example: using the uClk and vClk signals, the two synchronizer modules required would
be sync_u2v and sync_v2u; the three clock domains of Figure 4 require 3 x 2 = 6 synchronizer modules). Only if
there are no signals that pass between two specific clock domains will a pair of synchronizer modules not be
required.
By the way, what happened to that repartitioned In Focus ASIC design? After modifying all of the RTL files to
create either completely synchronous modules or synchronizer modules, the task of generating synthesis scripts
became trivial. All of the script files which previously included "set_false_path" commands were either deleted or
significantly simplified. All timing problems were easily identified and fixed (because they were all within single-
clock domain groupings) and the final synthesis runs completed two weeks earlier than anticipated, putting the
project back on schedule and completely justifying the decision to repartition the design.
7.0 Synthesis Scripts & Timing Analysis

Following the guidelines of section 6.0, to only permit one clock per module, to require that all signals entering
non-synchronizer modules are also in the same clock domain that is used to clock that module and to require that
synchronizer modules only permit input signals from one other clock domain, helps to simplify the timing analysis
and synthesis scripting tasks associated with a multi-clock design.
Synthesis script commands used to address multiple clock domain issues now become a matter of grouping,
identifying false paths and performing min-max timing analysis.
7.1 Grouping
Group together all non-synchronizer modules that are clocked within each clock domain. One group should be
formed for each clock domain in the design. These groups will be timing verified as if each were a separate,
completely synchronous design.
7.2 Identifying False Paths

In general, only the inputs to the synchronizer modules require "set_false_path" commands. If a clock-prefix
naming scheme is used (see section 5.0), then wild-cards can be used to easily identify all asynchronous inputs. For
example, the sync_u2v module should have inputs that all start with the letter "u". The following dc_shell command
should be sufficient to eliminate all asynchronous inputs from timing analysis:
set_false_path -from { u* }

7.3 Performing Min-Max Timing Analysis
Each grouped set of modules for each clock domain is now a completely synchronous sub-design and tools such as
DesignTime or PrimeTime can be used to verify worst case timing (including setup time checks) and best case
timing (including hold time checks).
The synchronizer blocks are timing verified separately. Worst case timing checks are not required because these
modules are just composed of flip-flops to synchronize asynchronous input signals; therefore, there are no long path
delays and the outputs are fully registered. After setting false paths on all of the asynchronous inputs, best case
(minimum) timing verification is conducted to ensure that hold times are met on all signals that are passed from the
first to second stage synchronizing flip-flops.
8.0 Synchronizing Fast Signals Into Slow Clock Domains

A general problem associated with synchronizers is that a signal from a sending clock domain might change values
twice before it can be sampled into a slower clock domain. This problem must be considered any time signals are
sent from one clock domain to another.
Synchronizing slower control signals into a faster clock domain is generally not a problem since the faster clock
signal will sample the slower control signal one or more times. Recognizing that sampling slower signals into faster
clock domains causes fewer potential problems than sampling faster signals into slower clock domains, a designer
might want to take advantage of this fact and try to steer control signals towards faster clock domains.
8.1 Passing A Slow Control Signal

When passing one control signal between clock domains, a simple two-flip-flop synchronizer is typically sufficient
if other rules are followed (described below).
An exception to this rule occurs when passing a control signal from a faster clock domain to a slower clock
domain: the control signal must be wider than the cycle time of the slower clock. If the control signal is only
asserted for one fast-clock cycle, the control signal could go high and low between the rising edges of the slower
clock and not be captured into the slower clock domain, as shown in Figure 5.
[Figure 5 is a timing diagram of aclk, adat, bclk, bdat1 and bdat2: the adat signal is asserted and de-asserted
between the two rising edges of bclk, so bdat1 and bdat2 are never asserted. This will cause problems!]
Figure 5 - Short control signal pulse missed during synchronization
One potential solution to this problem is to assert control signals for a period of time that exceeds the cycle time of
the sampling clock as shown in Figure 6. The assumption is that the control signal will be sampled at least once and
possibly twice by the receiver clock.

[Figure 6 is a timing diagram of aclk, adat, bclk, bdat1 and bdat2: the adat pulse must be wider than one bclk
period, which ensures that adat is propagated to bdat1 and bdat2.]
Figure 6 - Lengthened pulse to guarantee that the control signal will be sampled
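One way to lengthen a single-cycle pulse in the sending clock domain is sketched below. This is an illustration
under stated assumptions, not a circuit from the paper: the module name pulse_stretch, the input a_pulse, the
reset arst_n and the parameter N are all hypothetical. N must be chosen by the designer so that N aclk periods
exceed one bclk period with some margin, and N is assumed to be between 1 and 256.

```verilog
// Hypothetical pulse stretcher for the sending (aclk) domain.
// adat is held high for N aclk cycles after each a_pulse.
module pulse_stretch #(parameter N = 3) (
  output reg adat,      // lengthened control signal sent to the bclk domain
  input      a_pulse,   // one-aclk-cycle request pulse (assumed input)
  input      aclk,
  input      arst_n     // assumed active-low asynchronous reset
);
  reg [7:0] cnt;        // remaining aclk cycles to hold adat high

  always @(posedge aclk or negedge arst_n)
    if (!arst_n) begin
      adat <= 1'b0;
      cnt  <= 8'd0;
    end
    else if (a_pulse) begin
      adat <= 1'b1;             // start (or restart) the stretched pulse
      cnt  <= N - 1;
    end
    else if (cnt != 0)
      cnt  <= cnt - 1'b1;       // keep adat high while counting down
    else
      adat <= 1'b0;             // N cycles have elapsed
endmodule
```

The stretched adat output would then be passed through a two flip-flop synchronizer clocked by bclk. This is an
open-loop technique: nothing in the circuit checks that the chosen N is still large enough if the clock
frequencies change.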
A second potential solution to this problem is to assert a control signal, synchronize it into the new clock domain
and then pass the synchronized signal back through another synchronizer into the sending clock domain as an
acknowledge signal. Although synchronizing a feedback signal is a very safe technique to acknowledge that the first
control signal was recognized and sampled into the new clock domain, there is considerable delay associated with
synchronizing control signals in both directions before releasing the control signal[2].
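[Figure 7 is a schematic of two clock domains (aclk domain and bclk domain): the control signal (labels adat, adat1)
is synchronized into the bclk domain as bdat1 and bdat2, and bdat2 is synchronized back into the aclk domain as
abdat1 and abdat2.]
Figure 7 - Feedback synchronization of a control signal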
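The feedback scheme might be sketched as follows. The names adat, bdat1, bdat2, abdat1 and abdat2 follow the labels
in Figure 7; the module name fb_sync, the a_send request, the a_busy status output and both resets are assumptions.
For brevity the sketch places both clocks in one module, which goes against the one-clock-per-module guideline of
section 6.0; in a partitioned design each synchronizer would sit in its own sync module.

```verilog
// Hypothetical closed-loop (feedback) control signal synchronizer.
module fb_sync (
  output reg adat,      // control signal, held until acknowledged
  output     a_busy,    // high while a transfer is still in flight
  output reg bdat2,     // control signal synchronized into bclk domain
  input      a_send,    // assumed request to assert the control signal
  input      aclk, arst_n,
  input      bclk, brst_n
);
  reg bdat1;            // aclk -> bclk synchronizer, first stage
  reg abdat1, abdat2;   // bclk -> aclk feedback synchronizer

  assign a_busy = adat | abdat2;

  // Sending domain: assert adat on request, release it only after the
  // synchronized acknowledge (abdat2) has been seen. A new request is
  // ignored until the acknowledge has gone low again.
  always @(posedge aclk or negedge arst_n)
    if (!arst_n)     adat <= 1'b0;
    else if (abdat2) adat <= 1'b0;
    else if (a_send) adat <= 1'b1;

  // Receiving domain: two flip-flop synchronizer.
  always @(posedge bclk or negedge brst_n)
    if (!brst_n) {bdat2, bdat1} <= 2'b00;
    else         {bdat2, bdat1} <= {bdat1, adat};

  // Feedback path: bdat2 synchronized back as the acknowledge.
  always @(posedge aclk or negedge arst_n)
    if (!arst_n) {abdat2, abdat1} <= 2'b00;
    else         {abdat2, abdat1} <= {abdat1, bdat2};
endmodule
```

The sender should wait for a_busy to go low before issuing another a_send. The round trip through both
synchronizers is the delay penalty mentioned above.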
9.0 Passing Multiple Control Signals

A frequent mistake made by engineers when working on multi-clock designs is passing multiple control signals
from one clock domain to another and overlooking the importance of the sequencing of the control signals. Simply
using synchronizers on all control signals is not always good enough as will be shown in the following examples.
If the order or alignment of the control signals is significant, care must be taken to correctly pass the signals into the
new clock domain. All of the examples shown in this section are overly simplistic but they closely mimic situations
that often arise in real designs.
9.1 Problem - Two simultaneously required control signals.

In the simple example shown in Figure 8, a register in the new clock domain requires both a load signal and an
enable signal in order to load a data value into the register. If both the load and enable signals are being sent from
one clock domain, there is a chance that a small skew between the control signals could cause the two signals to be
synchronized into different clock cycles within the new clock domain. In this example, the data value would then
not be loaded into the register.
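[Figure 8 shows a schematic and timing diagram: b_load and b_en are sent from the bClk domain with a small skew
between the control signals and pass through separate synchronizers (ab_load, ab_en) to become a_load and a_en,
which drive the ld and en inputs of the register that loads adata onto abus in the aClk domain. The synchronizing
aclk edge falls inside the skew, producing one cycle with "load" but no "enable" and one cycle with "enable" but no
"load". adata changes from 00 to FF but abus stays at 00: adata was not loaded.]
Figure 8 - Problem - Passing multiple control signals between clock domains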

[Figure 9 shows a schematic and timing diagram: only one control signal, b_lden, is sent from the bClk domain
through one synchronizer (ab_lden) to become a_lden, which drives both the ld and en inputs of the register in the
aClk domain ("load" and "enable"). adata changes from 00 to FF and abus follows from 00 to FF: adata is loaded.]
Figure 9 - Solution - Consolidating control signals before passing them between clock domains

The solution to the problem in this simple example is easy. As shown in Figure 9, drive both the load and enable
register input signals in the new clock domain from just one control signal. This will remove the potential for the
control signals arriving shifted in time.
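A minimal sketch of the consolidated approach follows. The names b_lden, ab_lden, a_lden, adata and abus follow the
labels in Figure 9; the module name lden_reg, the width parameter W and the reset arst_n are assumptions. The
b_lden signal is assumed to be asserted long enough to be sampled by aClk.

```verilog
// Hypothetical receiving-side logic: one synchronized control signal
// drives both the load and the enable inputs of the register.
module lden_reg #(parameter W = 8) (
  output reg [W-1:0] abus,
  input      [W-1:0] adata,
  input              b_lden,   // single combined control from bClk domain
  input              aClk,
  input              arst_n    // assumed active-low asynchronous reset
);
  reg  ab_lden, a_lden;        // two flip-flop synchronizer stages
  wire ld = a_lden;            // load and enable can no longer arrive
  wire en = a_lden;            // in different aClk cycles

  always @(posedge aClk or negedge arst_n)
    if (!arst_n) {a_lden, ab_lden} <= 2'b00;
    else         {a_lden, ab_lden} <= {ab_lden, b_lden};

  always @(posedge aClk or negedge arst_n)
    if (!arst_n)       abus <= {W{1'b0}};
    else if (ld && en) abus <= adata;
endmodule
```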
9.2 Problem - Two phase-shifted sequencing control signals.

The diagram in Figure 10 shows two enable signals, aen1 and aen2, that are used to enable the sequential passing
of a data signal through a short pipeline design. The problem is that in the first clock domain, the aen1 control signal
might terminate slightly before the aen2 control signal is asserted, and the second clock domain might try to sample
the aen1 and aen2 control signals in the middle of this slight time gap, causing a one-cycle gap to form in the enable
control-signal chain in the second clock domain. This would cause the a2 output signal to be missed by the second
flip-flop.
[Figure 10 shows a schematic and timing diagram: ben1 and ben2 are sent from the bClk domain with a small skew
between the control signals and pass through separate synchronizers (ab_en1, ab_en2) to become aen1 and aen2, which
enable the two pipeline flip-flops (a1 to a2, a2 to a3) in the aClk domain. The 2nd enable signal is too late, so a3
was not loaded.]
Figure 10 - Problem - Passing sequential control signals between clock domains
The solution to the problem, as shown in Figure 11, is to send only one control signal into the new clock domain
and generate the second phase-shifted sequential control signal within the new clock domain.

[Figure 11 shows a schematic and timing diagram: only one control signal, ben1, is sent from the bClk domain
through a synchronizer (ab_en1) to become aen1; aen2 is generated from aen1 inside the aClk domain. The two enables
are now correctly sequenced and a3 is loaded.]
Figure 11 - Solution - Logic to generate the proper sequencing signals in the new clock domains
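The idea can be sketched as follows, assuming that the second enable is simply the first enable delayed by one
receiving-clock cycle. The names ben1, ab_en1, aen1, aen2, a1, a2 and a3 follow the labels in Figure 11; the module
name en_pipe, the width parameter W and the reset arst_n are assumptions.

```verilog
// Hypothetical receiving-side pipeline: only ben1 crosses the clock
// boundary; aen2 is generated locally from aen1.
module en_pipe #(parameter W = 8) (
  output reg [W-1:0] a3,
  input      [W-1:0] a1,
  input              ben1,    // the only enable sent from the bClk domain
  input              aClk,
  input              arst_n   // assumed active-low asynchronous reset
);
  reg [W-1:0] a2;
  reg         ab_en1, aen1;   // two flip-flop synchronizer for ben1
  reg         aen2;           // locally generated second enable

  always @(posedge aClk or negedge arst_n)
    if (!arst_n) begin
      {aen1, ab_en1} <= 2'b00;
      aen2           <= 1'b0;
    end
    else begin
      {aen1, ab_en1} <= {ab_en1, ben1};
      aen2           <= aen1;  // always exactly one aClk cycle after aen1
    end

  always @(posedge aClk or negedge arst_n)
    if (!arst_n) begin
      a2 <= {W{1'b0}};
      a3 <= {W{1'b0}};
    end
    else begin
      if (aen1) a2 <= a1;      // first pipeline stage
      if (aen2) a3 <= a2;      // second stage, one cycle later
    end
endmodule
```

Because aen2 is derived from aen1 by a flip-flop in the same clock domain, no gap can form between the two
enables.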
9.3 Problem - Two encoded control signals.

[Figure 12 shows a schematic and timing diagram: the encoded signals bdec[1] and bdec[0] change from bdec=0 to
bdec=3 in the bClk domain and pass through separate synchronizers (ab_dec[1:0]) to become adec[1:0], which are
decoded into aen[3:0] in the aClk domain. Because the two bits are sampled one aclk cycle apart, the decoded output
sequence is aen[0], aen[0], aen[2], aen[3]. WRONG! aen[2] should not be asserted.]
Figure 12 - Problem - Encoded control signals passed between clock domains

The diagram in Figure 12 shows two encoded control signals being passed between clock domains. If the two
encoded signals are slightly skewed when sampled, an erroneous decoded output could be generated for one clock
period in the new clock domain.
One potential solution to this problem, as shown in Figure 13, is to send a shaped enable signal to act as a "ready
flag" in the new clock domain. The sending clock domain must generate an enable signal one clock cycle after
asserting the decoder inputs. The sending clock domain must also remove the enable signal one clock cycle before
de-asserting the decoder inputs. As described earlier, the enable signal must be asserted for a time period that is
longer than the cycle time of the receiving clock domain.
[Figure 13 shows a schematic and timing diagram: bdec[1:0] and a shaped enable pulse bden_n are sent from the bClk
domain through synchronizers (ab_dec[1:0], ab_den_n) to become adec[1:0] and aden_n; aden_n drives the en input of
the decoder that generates aen[3:0] in the aClk domain. While bdec changes from 0 to 3, the aen[3:0] outputs remain
at "1" (off) until the shaped enable pulse arrives, after which only aen[3] is asserted.]
Figure 13 - Solution #1 - Logic to synchronize and wave-shape an enable pulse to pass between clock domains
Under worst case conditions, the shaped enable signal will either be sampled at the same time as the encoded inputs
are sampled into the receiving clock domain, or the shaped enable signal will be de-asserted at the same time as the
encoded inputs are de-asserted in the receiving clock domain. Under best case conditions, the shaped enable pulse
will be asserted one receiving clock cycle later than the assertion of the encoded inputs and de-asserted one
receiving clock cycle before the de-assertion of the encoded inputs. This method ensures that the encoded inputs are
valid before they are enabled into the receiving clock domain.
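The receiving side of this first solution might be sketched as follows. The names bdec, bden_n, ab_dec, ab_den_n,
adec, aden_n and aen follow the labels in Figure 13, and the active-low polarity of the enable and of the decoded
outputs is read from the "_n" suffix and the "1" (off) annotations in that figure. The module name dec_en_sync and
the reset arst_n are assumptions. The shaping of bden_n in the sending domain is not shown.

```verilog
// Hypothetical receiving-side logic for the shaped-enable solution.
module dec_en_sync (
  output reg [3:0] aen,      // active-low decoded outputs, 4'b1111 = off
  input      [1:0] bdec,     // encoded control signals from bClk domain
  input            bden_n,   // active-low shaped enable from bClk domain
  input            aClk,
  input            arst_n    // assumed active-low asynchronous reset
);
  reg [1:0] ab_dec, adec;        // synchronizer stages for bdec
  reg       ab_den_n, aden_n;    // synchronizer stages for bden_n

  always @(posedge aClk or negedge arst_n)
    if (!arst_n) begin
      ab_dec   <= 2'b00;  adec   <= 2'b00;
      ab_den_n <= 1'b1;   aden_n <= 1'b1;   // enable off after reset
    end
    else begin
      ab_dec   <= bdec;      adec   <= ab_dec;
      ab_den_n <= bden_n;    aden_n <= ab_den_n;
    end

  // Decoder: all outputs stay off unless the synchronized enable is
  // asserted, so a skewed adec value is never decoded.
  always @(adec or aden_n) begin
    aen = 4'b1111;
    if (!aden_n) aen[adec] = 1'b0;
  end
endmodule
```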
A second potential solution to this problem, as shown in Figure 14, is to decode the signals back in the sending
clock domain and then send the decoded outputs (where only one of the outputs is asserted) through synchronizers
into the new clock domain. Within the new clock domain, a state machine is used to determine when a new decoded
output has been asserted. If there are no decoded outputs, it means that one decoded output has been de-asserted and
that another decoded output is about to be asserted. If there are two asserted decoded output signals, the last
decoded output signal will cause the state machine to change states and the older decoded output signal will turn off
on the next rising clock edge in the new clock domain. It is important that the sender ensure that the decoded outputs
are each asserted for a time period that is longer than the cycle time of the receiving clock domain.

[Figure 14 shows a schematic and a state diagram: bdec[1:0] is decoded in the bClk domain into ben[3:0], which
pass through synchronizers (ab_en[3:0]) to become aen[3:0] in the aClk domain. The state machine has four states
(EN0, EN1, EN2, EN3), a reset condition labeled !arst_n, state transitions labeled !aen[0] through !aen[3] and state
outputs labeled !a_sel0 through !a_sel3. Except where noted, the outputs are driven to the default, de-asserted
state: a_sel3=1, a_sel2=1, a_sel1=1, a_sel0=1.]
Figure 14 - Solution #2 - FSM logic to detect one-hot control signals passed from a different clock domain
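One possible reading of this state machine is sketched below. The exact transitions of Figure 14 cannot be
recovered here, so the sketch makes several assumptions: the enables and selects are active low, reset enters
state EN0, each state asserts its own select output, and the machine moves to state ENn when aen[n] is seen
asserted for any n other than the current state. The module name onehot_sel_fsm and the vector form of a_sel are
also assumptions.

```verilog
// Hypothetical FSM that tracks the most recently asserted one-hot enable.
module onehot_sel_fsm (
  output reg [3:0] a_sel,   // active-low selects, one per state
  input      [3:0] aen,     // active-low enables, already synchronized
  input            aClk,
  input            arst_n   // assumed active-low asynchronous reset
);
  reg [1:0] state, next;    // state value n represents state ENn
  integer   i;

  always @(posedge aClk or negedge arst_n)
    if (!arst_n) state <= 2'd0;     // assumed reset state: EN0
    else         state <= next;

  always @(state or aen) begin
    next = state;                   // no new enable asserted: hold state
    for (i = 0; i < 4; i = i + 1)   // a newly asserted enable wins over
      if (!aen[i] && (i != state))  // the older, still-asserted one
        next = i;
    a_sel        = 4'b1111;         // default: all selects de-asserted
    a_sel[state] = 1'b0;            // assert the select for this state
  end
endmodule
```

With this coding, a cycle in which no enable is asserted leaves the state unchanged, and a cycle in which two
enables are asserted moves the state to the newer one, which matches the behavior described in the text.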
Any time there are multiple control signals crossing clock boundaries, caution must be taken to ensure that the
sequencing of the control signals being passed is correct or that any potential mis-sequencing of the control signals
will not adversely impact the correct operation of the design.
10.0 Data-Path Synchronization

Passing data from one clock domain to another is an example of passing multiple randomly changing signals
between clock domains. Using synchronizers to handle the passing of data is generally unacceptable. There are far
too many opportunities for multi-bit data changes to be incorrectly sampled using synchronizers.
Two common methods for synchronizing data between clock domains are: (1) use handshake signals to pass data
between clock domains or, (2) use FIFOs (First In First Out memories) to store data using one clock domain and to
retrieve data using another clock domain.
10.1 Handshaking Data Between Clock Domains

Data can be passed between clock domains using two or three handshake control signals, depending on the
application and the paranoia of the design engineer. When it comes to handshaking, the more control signals that are
used, the longer the latency to pass data from one clock domain to another. The biggest disadvantage to using
handshaking is the latency required to pass and recognize all of the handshaking signals for each data word that is
transferred.
For many open-ended data-passing applications, a simple two-line handshaking sequence is sufficient. The sender
places data onto a data bus and then synchronizes a "data_valid" signal to the receiving clock domain. When the
"data_valid" signal is recognized in the new clock domain, the receiver clocks the data into a register in the new
clock domain (the data should have been stable for at least two rising clock edges in the sending clock domain) and
then passes an "acknowledge" signal through a synchronizer to the sender. When the sender recognizes the
synchronized "acknowledge" signal, the sender can change the value being driven onto the data bus.
Under some circumstances, it might be useful to use a third control signal, "ready", sent through a synchronizer
from the receiver to the sender to indicate that the receiver is indeed "ready" to receive data. The "ready" signal
should not be asserted while the "data_valid" signal is true. When the "data_valid" signal is de-asserted, a "ready"
signal can be passed to the sender. Of course, with the added handshake signal comes the penalty of longer latency
to synchronize and recognize the third control signal.
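The two-line handshake described above might be sketched as follows, with one single-clock module for each side.
Everything except the "data_valid" and "acknowledge" signal roles is an assumption: the module names, port names,
width parameter W and resets are hypothetical. The a_ack and b_valid inputs are assumed to have already passed
through two flip-flop synchronizers into their respective clock domains.

```verilog
// Hypothetical sender: holds the data stable until acknowledged.
module hs_sender #(parameter W = 8) (
  output reg [W-1:0] a_data,   // data bus, stable while a_valid is high
  output reg         a_valid,  // "data_valid", to be synchronized to bclk
  output             a_ready,  // sender can accept a new word
  input      [W-1:0] a_din,
  input              a_put,    // request to send a_din
  input              a_ack,    // "acknowledge", synchronized to aclk
  input              aclk, arst_n
);
  assign a_ready = !a_valid && !a_ack;

  always @(posedge aclk or negedge arst_n)
    if (!arst_n) begin
      a_valid <= 1'b0;
      a_data  <= {W{1'b0}};
    end
    else if (a_put && a_ready) begin
      a_data  <= a_din;        // data changes only when no transfer
      a_valid <= 1'b1;         // is in progress
    end
    else if (a_ack)
      a_valid <= 1'b0;         // receiver has the data: release request
endmodule

// Hypothetical receiver: captures the data bus when the synchronized
// data_valid is first seen, then acknowledges.
module hs_receiver #(parameter W = 8) (
  output reg [W-1:0] b_data,
  output reg         b_ack,    // "acknowledge", to be synchronized to aclk
  output reg         b_new,    // one-bclk pulse: b_data was updated
  input      [W-1:0] a_data,   // unsynchronized, stable while b_valid high
  input              b_valid,  // "data_valid", synchronized to bclk
  input              bclk, brst_n
);
  always @(posedge bclk or negedge brst_n)
    if (!brst_n) begin
      b_data <= {W{1'b0}};
      b_ack  <= 1'b0;
      b_new  <= 1'b0;
    end
    else begin
      if (b_valid && !b_ack) b_data <= a_data;
      b_new <= b_valid && !b_ack;
      b_ack <= b_valid;        // ack follows the synchronized request
    end
endmodule
```

The data bus itself is never synchronized. It is safe to sample only because the sender holds it unchanged from
the time a_valid is asserted until the acknowledge has returned and been released, which is also the source of the
latency noted above.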

10.2 Passing Data By FIFO Between Clock Domains
One of the most popular methods of passing data between clock domains is to use a FIFO. A dual port memory is
used for the FIFO storage. One port is controlled by the sender, which puts data into the memory as fast as one data
word (or one data bit for serial applications) per write clock. The other port is controlled by the receiver, which
pulls data out of memory one data word per read clock. Two control signals are used to indicate if the FIFO is
empty, full or partially full. Two additional control signals are frequently used to indicate if the FIFO is almost full
or almost empty.
In theory, placing data into a shared memory with one clock and removing the data from the shared memory with
another clock seems like an easy and ideal solution to passing data between clock domains. For the most part it is,
but generating accurate full and empty flags can be challenging.
10.3 FIFO Full & Empty

Determining that a FIFO is full or empty requires some type of mathematical manipulation and/or comparison of
write and read pointers. The problem is that the two pointers are generated in two different clock domains, so one or
both pointers must be synchronized into the opposite clock domain before mathematical and comparison operations
can be safely performed.
10.4 FIFO Pointers - Implemented as Binary Counters

Any FIFO pointer that must be synchronized into a different clock domain should not be implemented as a binary
counter.
One characteristic of binary counters is that half of all sequential binary incrementing operations require that two or
more counter bits must change. Trying to synchronize a binary counter into a new clock domain is more problematic
than trying to synchronize multiple control signals into a new clock domain. If a simple 4-bit binary counter
changes from address 7 (binary 0111) to address 8 (binary 1000), all four counter bits will change at the same time.
If a synchronizing clock edge comes in the middle of this transition, it is possible that any 4-bit binary pattern could
be sampled and synchronized into the new clock domain as shown in Figure 15.
Binary count values     07 -> 08 possible binary transitions
                        0 1 1 1 -> 1 0 0 0   (07->08)
00   0 0 0 0            0 1 1 1 -> 0 0 0 0   (07->00)
01   0 0 0 1            0 1 1 1 -> 0 0 0 1   (07->01)
02   0 0 1 0            0 1 1 1 -> 0 0 1 0   (07->02)
03   0 0 1 1            0 1 1 1 -> 0 0 1 1   (07->03)
04   0 1 0 0            0 1 1 1 -> 0 1 0 0   (07->04)
05   0 1 0 1            0 1 1 1 -> 0 1 0 1   (07->05)
06   0 1 1 0            0 1 1 1 -> 0 1 1 0   (07->06)
07   0 1 1 1            0 1 1 1 -> 0 1 1 1   (07->07)
08   1 0 0 0            0 1 1 1 -> 1 0 0 0   (07->08)
09   1 0 0 1            0 1 1 1 -> 1 0 0 1   (07->09)
10   1 0 1 0            0 1 1 1 -> 1 0 1 0   (07->10)
11   1 0 1 1            0 1 1 1 -> 1 0 1 1   (07->11)
12   1 1 0 0            0 1 1 1 -> 1 1 0 0   (07->12)
13   1 1 0 1            0 1 1 1 -> 1 1 0 1   (07->13)
14   1 1 1 0            0 1 1 1 -> 1 1 1 0   (07->14)
15   1 1 1 1            0 1 1 1 -> 1 1 1 1   (07->15)
Figure 15 - Binary count values sampled in mid-transition
The new, synchronized binary value might trigger a false full or empty flag, or even worse, it might not trigger a
real full or empty flag causing data to be lost due to FIFO overflow or causing bogus data to be read from the FIFO
due to attempting to read data when the FIFO is really empty.
10.5 FIFO Pointers - Implemented as Gray-Code Counters
Although binary counters work fine for addressing the memory, trying to synchronize binary counters into a new
clock domain is problematic. A better approach for passing pointers between clock domains is to use a gray-code
counter for the two FIFO pointers. Gray code counters only change one bit at a time. If a synchronizing clock signal
comes in the middle of a gray code counter transition, the synchronized value will either be the old value or the new
value because only one bit is changing at a time.
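The one-bit-at-a-time property can be checked with a small simulation-only testbench such as the sketch below. It
is an illustration, not part of the FIFO design: the module name gray_onebit_tb and the function name b2g are
assumptions, and the gray code is generated with the common shift-and-exclusive-or conversion.

```verilog
// Hypothetical self-checking testbench: every step of a gray-code
// sequence, including the wrap-around, changes exactly one bit.
module gray_onebit_tb;
  parameter SIZE = 4;

  reg [SIZE-1:0] g_prev, g_now, diff;
  integer        n, k, ones, errors;

  // Binary-to-gray conversion: bit-wise XOR with a right-shifted copy.
  function [SIZE-1:0] b2g;
    input [SIZE-1:0] b;
    b2g = (b >> 1) ^ b;
  endfunction

  initial begin
    errors = 0;
    for (n = 0; n < (1 << SIZE); n = n + 1) begin
      g_prev = b2g(n);
      g_now  = b2g(n + 1);          // n+1 wraps to 0 after the last count
      diff   = g_prev ^ g_now;      // bits that differ between the two
      ones   = 0;
      for (k = 0; k < SIZE; k = k + 1)
        ones = ones + diff[k];      // count the differing bits
      if (ones != 1) begin
        errors = errors + 1;
        $display("ERROR: count %0d changes %0d bits", n, ones);
      end
    end
    if (errors == 0)
      $display("PASS: every gray-code step changes exactly one bit");
    $finish;
  end
endmodule
```

Replacing the body of b2g with a plain binary count (b2g = b) makes the same testbench report the multi-bit
changes that make binary pointers unsafe to synchronize.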
10.6 Designing Gray Code Counters

A block diagram for a gray-code counter is shown in Figure 16. To design a gray code counter, a register is used to
store the gray code values. The register output is fed back to a gray-to-binary converter, the binary value is
incremented by one, the incremented binary value is then passed to a binary-to-gray converter that drives the inputs
to the gray-code register.
For-loop with two One line One line of code

lines of code of code with concatenations
Gray
Code
Gray to reg
Binary bin Binary
comb. bnext to Gray gnext
+ d q gray
logic comb.
logic
inc rst_n
clk
rst_n
The gray and

binary values The bnext output is
increment only the binary value +1
if inc is high (if inc is high)
Figure 16 - Gray-code counter block diagram
10.7 Gray To Binary Conversion

To convert a gray-code value to an equivalent binary-code value, using an n-bit gray code value as an example,
binary bit 0 is equal to gray code bit 0 exclusive-ored with all other gray code bits from 1 to n-1. Binary bit 1 is
equal to gray code bit 1 exclusive-ored with all other gray code bits from 2 to n-1, etc. The most significant
binary bit is just equal to the most significant gray code bit. The equations for a 4-bit gray-to-binary
conversion are shown in Figure 17.
bin[0] = gray[3] ^ gray[2] ^ gray[1] ^ gray[0];
bin[1] = gray[3] ^ gray[2] ^ gray[1];
bin[2] = gray[3] ^ gray[2];
bin[3] = gray[3];
Figure 17 - 4-bit gray-to-binary conversion equations

The easiest way to code a gray-to-binary converter is to code a for-loop and do an exclusive-or reduction on a gray
code vector with variable index range, where each time through the loop the LSB of the index range increases until
we are left with a simple assignment of bin[MSB] = ^gray[MSB:MSB] (just the 1-bit MSB of the gray code vector),
as shown in Example 1.
module gray2bin_bad (bin, gray);
  parameter SIZE = 4;
  output [SIZE-1:0] bin;
  input  [SIZE-1:0] gray;
  reg    [SIZE-1:0] bin;
  integer i;

  // Syntax Error - variable index range
  always @(gray)
    for (i=0; i<SIZE; i=i+1)
      bin[i] = ^(gray[SIZE-1:i]);
endmodule
Example 1 - Non-working but conceptually correct gray-to-binary Verilog model
Unfortunately, Verilog does not permit part selects using a variable index range so the code in Example 1, although
conceptually correct, will not compile.
Another way to think of a gray-to-binary conversion is to exclusive-or the significant gray-code bits with padded 0's,
as shown in Figure 18.
bin[0] = gray[3] ^ gray[2] ^ gray[1] ^ gray[0] ; // gray>>0
bin[1] = 1'b0    ^ gray[3] ^ gray[2] ^ gray[1] ; // gray>>1
bin[2] = 1'b0    ^ 1'b0    ^ gray[3] ^ gray[2] ; // gray>>2
bin[3] = 1'b0    ^ 1'b0    ^ 1'b0    ^ gray[3] ; // gray>>3
Figure 18 - 4-bit gray-to-binary conversion equations - 2nd method
The corresponding parameterized Verilog model for this algorithm is shown in Example 2. This example is
syntactically correct, will compile and does work.
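module gray2bin (bin, gray);
  parameter SIZE = 4;
  output [SIZE-1:0] bin;
  input  [SIZE-1:0] gray;
  reg    [SIZE-1:0] bin;
  integer i;

  // bin[i] is the XOR reduction of the gray vector shifted right by i
  always @(gray)
    for (i=0; i<SIZE; i=i+1)
      bin[i] = ^(gray>>i);
endmodule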
Example 2 - Parameterized and correct gray-to-binary Verilog model

10.8 Binary To Gray Conversion
To convert a binary value to an equivalent gray-code value, using an n-bit binary value as an example, gray-code bit
0 is equal to the exclusive-or of binary bits 0 and 1. Gray-code bit 1 is equal to the exclusive-or of binary bits 1 and
2, etc. The most significant gray-code bit is just equal to the most significant binary bit. The equations for a 4-bit
binary-to-gray conversion are shown in Figure 19.
gray[0] = bin[0] ^ bin[1];

gray[1] = bin[1] ^ bin[2];
gray[2] = bin[2] ^ bin[3];
gray[3] = bin[3];
Figure 19 - 4-bit binary-to-gray conversion equations
The easiest way to code a binary-to-gray converter is to code a simple continuous assignment that performs a bit-
wise exclusive-or operation between the binary vector and a right-shifted version of the same binary vector as
shown in Example 3. This example is syntactically correct, will compile and does work.
module bin2gray (gray, bin);

parameter SIZE = 4;
output [SIZE-1:0] gray;
input [SIZE-1:0] bin;
assign gray = (bin>>1) ^ bin;

endmodule
Example 3 - Parameterized binary-to-gray Verilog model
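As a quick sanity check on the two conversions, the following testbench sketch sweeps every 4-bit value through the shift-and-XOR binary-to-gray equation of Figure 19 and the shift-based gray-to-binary loop of Figure 18, and reports any value that does not convert back to itself. The module name tb_gray_roundtrip and the function names b2g and g2b are illustrative assumptions, not part of the original models.

```verilog
module tb_gray_roundtrip;
  parameter SIZE = 4;
  integer n, errors;
  reg [SIZE-1:0] b, g;

  // binary-to-gray: bit-wise XOR with a right-shifted copy (Figure 19)
  function [SIZE-1:0] b2g;
    input [SIZE-1:0] bin;
    b2g = (bin >> 1) ^ bin;
  endfunction

  // gray-to-binary: XOR-reduce the gray vector shifted right by i (Figure 18)
  function [SIZE-1:0] g2b;
    input [SIZE-1:0] gray;
    integer i;
    for (i = 0; i < SIZE; i = i + 1)
      g2b[i] = ^(gray >> i);
  endfunction

  initial begin
    errors = 0;
    for (n = 0; n < (1 << SIZE); n = n + 1) begin
      b = n;
      g = b2g(b);
      if (g2b(g) !== b) begin
        errors = errors + 1;
        $display("MISMATCH: bin=%b gray=%b back=%b", b, g, g2b(g));
      end
    end
    if (errors == 0) $display("All round-trip conversions match");
    $finish;
  end
endmodule
```

Changing SIZE repeats the same sweep for other widths, since both functions are written in terms of the parameter.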
10.9 Gray Code Counter

The Verilog code for a gray-code counter incorporates a gray-to-binary converter, a binary-to-gray converter and
increments the binary value between conversions. The parameterized Verilog model for the gray-code counter is
shown in Example 4.
module graycntr (gray, clk, inc, rst_n);
  parameter SIZE = 4;
  output [SIZE-1:0] gray;
  input  clk, inc, rst_n;
  reg    [SIZE-1:0] gnext, gray, bnext, bin;
  integer i;

  always @(posedge clk or negedge rst_n)
    if (!rst_n) gray <= 0;
    else        gray <= gnext;

  always @(gray or inc) begin
    for (i=0; i<SIZE; i=i+1)
      bin[i] = ^(gray>>i);        // gray-to-binary conversion
    bnext = bin + inc;            // conditional binary increment
    gnext = (bnext>>1) ^ bnext;   // binary-to-gray conversion
  end
endmodule

Example 4 - Parameterized gray-code counter Verilog model
11.0 FIFO Design

NOTE: an updated FIFO design paper is available at the Sunburst Design web site[1] and is recommended over the
FIFO design description that follows in this section. Readers should review the referenced paper.
When passing data between two different clock domains, FIFOs, or First-In, First-Out memories, are the design-
block of choice for most engineers. Figure 20 shows a block diagram for a FIFO design.
[Figure 20 block diagram: a write-clock FIFO wptr and full-flag module (winc, wclk, wrst_n, waddr, wfull), an instantiated dual-port RAM FIFO memory (wdata, rdata), a read-clock FIFO rptr and empty-flag module (rinc, rclk, rrst_n, raddr, rempty), and two synchronizer modules that pass the pointers to the write clock and to the read clock]
Figure 20 - FIFO Block Diagram - partitioned on clock boundaries
11.1 FIFO Write and Read Operations

For the purposes of this paper, a FIFO write operation is an operation that loads a data word into the FIFO. FIFO
write operations are sometimes called FIFO fill, FIFO load, etc.
For the purposes of this paper, a FIFO read operation is an operation that removes a data word from the FIFO. FIFO
read operations are sometimes called FIFO drain, etc.
Since full and empty flags are generated by pointers where at least one of the pointers must be synchronized into a
second clock domain, clock-cycle accurate assertion and de-assertion of full and empty flags is not completely
possible.
One FIFO design technique is to ensure that a full or empty flag is asserted exactly when full or empty conditions
occur, but de-asserting the flags might come a few clock cycles late. This is sometimes referred to as pessimistic full
and empty flags.

11.2 Pessimistic full and empty flags
A pessimistic full flag is a full signal that is asserted immediately when a FIFO becomes full but is de-asserted late
(it is not de-asserted until a few read-clock cycles later).
Because the write pointer does not have to be synchronized before testing for a full condition, the full flag will be
asserted immediately when the FIFO goes full. The FIFO might not actually be completely full because the read
pointer might have incremented but the new read pointer value might not have been synchronized into the write
clock domain. Using the block diagram shown in Figure 20, the read pointer synchronized into the write clock
domain is always two write clocks behind the actual read pointer value, so the full flag might be asserted for two
extra write clocks. This typically is not a problem since the full flag is simply holding off transmission of more data
from the data sending source for two extra write clock cycles. Pointers being synchronized into a new clock domain
should be gray code counters for reasons explained in sections 10.4 and 10.5.
Similarly, because the read pointer does not have to be synchronized before testing for an empty condition, the
empty flag will be asserted immediately when the FIFO goes empty. The FIFO might not actually be completely
empty because the write pointer might have incremented but the new write pointer value might not have been
synchronized into the read clock domain. Using the block diagram shown in Figure 20, the write pointer
synchronized into the read clock domain is always two read clocks behind the actual write pointer value, so the
empty flag might be asserted for two extra read clocks. This typically is not a problem since the empty flag is
merely informing the data receiver that data is not ready to be sent for another two read clock cycles. Again,
pointers being synchronized into a new clock domain should be gray code counters for reasons explained in sections
10.4 and 10.5.
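The two-clock lag described above comes from the two-stage synchronizer that each gray-code pointer passes through on its way into the opposite clock domain. A minimal sketch of such a synchronizer follows; the module name sync_ptr and the signal names ptr_meta and ptr_sync are assumptions made for illustration and are not taken from Figure 20.

```verilog
module sync_ptr (ptr_sync, ptr_gray, clk, rst_n);
  parameter ASIZE = 4;
  output [ASIZE:0] ptr_sync;   // pointer as seen in the destination domain
  input  [ASIZE:0] ptr_gray;   // gray-code pointer from the source domain
  input            clk, rst_n; // destination-domain clock and reset
  reg    [ASIZE:0] ptr_sync, ptr_meta;

  // The first stage may go metastable; the second stage gives it a
  // full clock period to settle before the value is used.
  always @(posedge clk or negedge rst_n)
    if (!rst_n) {ptr_sync, ptr_meta} <= 0;
    else        {ptr_sync, ptr_meta} <= {ptr_meta, ptr_gray};
endmodule
```

One instance clocked by the write clock would carry the read pointer, and a second instance clocked by the read clock would carry the write pointer.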
11.3 Full & Empty

A FIFO is full when both pointers are equal. A FIFO is also empty when both pointers are equal, so the FIFO
pointers should be one bit larger than is necessary to address the full memory range. The extra bit is used as a flag to
help determine if the FIFO is empty or full. If the extra pointer MSBs are equal, it means that the FIFO pointers
have wrapped back to address 0 an equal number of times, and if the rest of the pointer bits are equal, the FIFO is
empty. If the extra pointer MSBs are not equal, it means that the write pointer has wrapped back to address 0 one
more time than the read pointer, and if the rest of the pointer bits are equal, the FIFO is full.
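The comparison described above can be sketched as follows for binary pointers that are already in the same clock domain. The module name ptr_compare and its port names are hypothetical; in an asynchronous FIFO one of the two pointers would first be synchronized as a gray-code value, which this sketch deliberately leaves out.

```verilog
module ptr_compare (full, empty, wptr, rptr);
  parameter ASIZE = 4;            // address bits; pointers are ASIZE+1 bits
  output full, empty;
  input  [ASIZE:0] wptr, rptr;

  // Same wrap count (MSBs equal) and same address: nothing left to read.
  assign empty = (wptr == rptr);

  // Write pointer is one wrap ahead (MSBs differ) at the same address:
  // every location holds unread data.
  assign full = (wptr[ASIZE] != rptr[ASIZE]) &&
                (wptr[ASIZE-1:0] == rptr[ASIZE-1:0]);
endmodule
```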
12.0 Simulation Issues

As mentioned in section 4.0, signals crossing clock boundaries through a synchronizer will experience setup and
hold violations. That is why synchronizers are added to a design, to filter out the metastability effects of a signal that
changes too close to the rising edge of a new clock domain clock signal.
When doing gate-level simulations on a multi-clock design, the ASIC library models of flip-flops are modeled with
setup and hold time expressions to match the timing specifications of the actual flip-flops. ASIC libraries typically
model flip-flops to drive X's (unknowns) on the flip-flop outputs when a timing violation occurs. When simulating
gate-level synchronizers, setup and hold time violations might cause ASIC libraries to issue setup and hold time
error messages and the offending signals are frequently driven to an X value. These X-values propagate to the rest
of the design causing problems when trying to verify the functionality of the entire gate-level design.
Most Verilog simulators have a command option to ignore all timing checks, but this would also ignore the desired
timing checks for the rest of the design.
It is possible to change the setup and hold time setting to zero for any ASIC library flip-flop that is used in a
synchronizer, but that would cause all setup and hold time checks of all instances of that same type of flip-flop to be
set to zero, including the flip-flops that you might want to use to test the rest of the design.
You could make copies of flip-flops from an ASIC library and store them into a new Verilog library with different
names, set to zero all setup and hold times, then modify the design gate-level netlist, replacing all first stage
synchronizer ASIC library flip-flops with the modified library flip-flops without timing checks, but this could be an
error prone and tedious process that might have to be repeated each time a new netlist is generated or it might
require the creation of a makefile and scripts to automatically make the modifications each time a new netlist is
generated.
A clever way to approach this problem suggested by Bhatnagar[3] is to use Synopsys commands to modify the SDF
backannotation of the setup and hold time on just the first stage flip-flop cells in the design. Bhatnagar points out
that the SDF file is instance based and therefore targeting the setup and hold times for the offending cells is more
easily accomplished. Bhatnagar notes:
Instead of manually removing the setup and hold-time constructs from the SDF file, a better way is to
zero out the setup and hold-times in the SDF file, only for the violating flops, i.e., replace the existing
setup and hold-time numbers with zero's.
Bhatnagar further points out that setup and hold times of zero mean that there can be no timing violation, and therefore no
unknowns propagated to the rest of the design. The following dc_shell command, given by Bhatnagar, is used to
make setup and hold times zero:
set_annotated_check 0 -setup -hold -from REG1/CLK -to REG1/D
Using a creative naming convention for the output of the first stage flip-flop of a synchronizer might make wild card
expressions possible to easily backannotate all first stage flip-flop SDF setup and hold time values to zero using
very few dc_shell commands.
13.0 Conclusions
Completely synchronous one-clock design techniques are well known. Synthesis tools do their best work on
synchronous designs. Timing analysis tools are designed to report timing problems on one-clock synchronous
designs. Synthesis scripts are easy to create for one-clock synchronous designs. The techniques in this paper
are aimed at making the design look like multiple single clock designs!
• Partitioning non-synchronizer blocks so that there is only one clock per module permits easy verification of
correct timing by creating clock-domain sub-blocks that can be more easily verified with static timing analysis
tools.
• Partitioning synchronizer blocks to permit inputs from one and only one clock domain and clocking the
signals with only one asynchronous clock creates manageable synchronizer sub-blocks that can also be easily
timed.
• A clock-oriented naming convention can be useful to help identify signals that need to be timed within the
different asynchronous clock domains.
• Multiple control signals crossing clock domains require special attention to ensure that all control signals are
properly sequenced into a new clock domain.
The techniques described in this paper were developed to facilitate robust development and verification of multi-
clock designs.
14.0 Errata and Changes

Readers are encouraged to send email to Cliff Cummings ( cliffc@sunburst-design.com ) any time they find
potential mistakes or if they would like to suggest improvements. Cliff is always interested in other techniques that
engineers are using.
14.1 Revision 1.2 (June 2005) - What Changed?

Errata - A colleague, Zenja Chao, pointed out that the equations of Figure 19 had the binary (bin) and gray
(gray) labels swapped in the equations. The equations have been corrected.
References
[1] Clifford E. Cummings, “Simulation and Synthesis Techniques for Asynchronous FIFO Design,” SNUG 2002 (Synopsys
Users Group Conference, San Jose, CA, 2002) User Papers, March 2002, Also available at:
www.sunburst-design.com/papers
[2] ESNUG #281 - http://www.deepchip.com/posts/0281.html
[3] Himanshu Bhatnagar, Advanced ASIC Chip Synthesis, Kluwer Academic Publishers, 1999, pp. 202-203.
[4] Samir Palnitkar, Verilog HDL, A Guide to Digital Design and Synthesis, Sunsoft Press A Prentice Hall Title, 1996, pg. 193.
[5] Steve Golson, personal communication.
[6] William J. Dally and John W. Poulton, Digital Systems Engineering, Cambridge University Press, 1998, pg. 468.
[7] William J. Dally and John W. Poulton, Digital Systems Engineering, Cambridge University Press, 1998, pp. 462-513.
[8] William J. Dally and John W. Poulton, Digital Systems Engineering, Cambridge University Press, 1998, pp. 469-470.
Synopsys is a registered trademark of Synopsys, Inc.
Design Analyzer, DesignTime, PrimeTime and Synopsys Design Compiler are trademarks of Synopsys, Inc.
Author & Contact Information

Cliff Cummings, President of Sunburst Design, Inc., is an independent EDA consultant and trainer with 23 years of
ASIC, FPGA and system design experience and 13 years of Verilog, SystemVerilog, synthesis and methodology
training experience.
Mr. Cummings, a member of the IEEE 1364 Verilog Standards Group (VSG) since 1994, is the only Verilog and
SystemVerilog trainer to co-develop and co-author every IEEE 1364 Verilog Standard, the IEEE 1364.1 Verilog
RTL Synthesis Standard, every Accellera SystemVerilog Standard, and the IEEE 1800 SystemVerilog Standard.
Mr. Cummings holds a BSEE from Brigham Young University and an MSEE from Oregon State University.
Sunburst Design, Inc. offers Verilog, Verilog Synthesis and SystemVerilog training courses. For more information,
visit the www.sunburst-design.com web site.
An updated version of this paper can be downloaded from the web site: www.sunburst-design.com/papers
(Last updated June 20th, 2005)

Simulation and Synthesis Techniques for Asynchronous FIFO Design
Clifford E. Cummings, Sunburst Design, Inc.

ABSTRACT
FIFOs are often used to safely pass data from one clock domain to another asynchronous clock domain. Using a
FIFO to pass data from one clock domain to another clock domain requires multi-asynchronous clock design
techniques. There are many ways to design a FIFO wrong. There are many ways to design a FIFO right but still
make it difficult to properly synthesize and analyze the design.
This paper will detail one method that is used to design, synthesize and analyze a safe FIFO between different clock
domains using Gray code pointers that are synchronized into a different clock domain before testing for "FIFO full"
or "FIFO empty" conditions. The fully coded, synthesized and analyzed RTL Verilog model (FIFO Style #1) is
included.
Post-SNUG Editorial Comment

A second FIFO paper by the same author was voted “Best Paper - 1st Place” by SNUG attendees, is listed as
reference [3] and is also available for download.
1.0 Introduction
An asynchronous FIFO refers to a FIFO design where data values are written to a FIFO buffer from one clock
domain and the data values are read from the same FIFO buffer from another clock domain, where the two clock
domains are asynchronous to each other.
Asynchronous FIFOs are used to safely pass data from one clock domain to another clock domain.
There are many ways to do asynchronous FIFO design, including many wrong ways. Most incorrectly implemented
FIFO designs still function properly 90% of the time. Most almost-correct FIFO designs function properly 99%+ of
the time. Unfortunately, FIFOs that work properly 99%+ of the time have design flaws that are usually the most
difficult to detect and debug (if you are lucky enough to notice the bug before shipping the product), or the most
costly to diagnose and recall (if the bug is not discovered until the product is in the hands of a dissatisfied
customer).
This paper discusses one FIFO design style and important details that must be considered when doing asynchronous
FIFO design.
The rest of the paper simply refers to an “asynchronous FIFO” as just “FIFO.”
2.0 Passing multiple asynchronous signals

Attempting to synchronize multiple changing signals from one clock domain into a new clock domain and ensuring
that all changing signals are synchronized to the same clock cycle in the new clock domain has been shown to be
problematic[1]. FIFOs are used in designs to safely pass multi-bit data words from one clock domain to another.
Data words are placed into a FIFO buffer memory array by control signals in one clock domain, and the data words
are removed from another port of the same FIFO buffer memory array by control signals from a second clock
domain. Conceptually, the task of designing a FIFO with these assumptions seems to be easy.
The difficulty associated with doing FIFO design is related to generating the FIFO pointers and finding a reliable
way to determine full and empty status on the FIFO.
2.1 Synchronous FIFO pointers
For synchronous FIFO design (a FIFO where writes to, and reads from the FIFO buffer are conducted in the same
clock domain), one implementation counts the number of writes to, and reads from the FIFO buffer to increment (on
FIFO write but no read), decrement (on FIFO read but no write) or hold (no writes and reads, or simultaneous write
and read operation) the current fill value of the FIFO buffer. The FIFO is full when the FIFO counter reaches a
predetermined full value and the FIFO is empty when the FIFO counter is zero.
Unfortunately, for asynchronous FIFO design, the increment-decrement FIFO fill counter cannot be used, because
two different and asynchronous clocks would be required to control the counter. To determine full and empty status
for an asynchronous FIFO design, the write and read pointers will have to be compared.
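The synchronous increment-decrement fill counter described above can be sketched as follows. This is only an illustration of the single-clock case; the module name sync_fifo_count and the internal names do_write and do_read are assumptions, and the memory array itself is omitted.

```verilog
module sync_fifo_count (full, empty, count, winc, rinc, clk, rst_n);
  parameter ASIZE = 4;
  parameter DEPTH = 1 << ASIZE;
  output full, empty;
  output [ASIZE:0] count;       // current fill value, 0 to DEPTH
  input  winc, rinc, clk, rst_n;
  reg    [ASIZE:0] count;

  assign full  = (count == DEPTH);
  assign empty = (count == 0);

  // Qualify the requests so the counter cannot overflow or underflow.
  wire do_write = winc && !full;
  wire do_read  = rinc && !empty;

  always @(posedge clk or negedge rst_n)
    if (!rst_n) count <= 0;
    else case ({do_write, do_read})
      2'b10:   count <= count + 1;  // write but no read
      2'b01:   count <= count - 1;  // read but no write
      default: count <= count;      // neither, or both at once
    endcase
endmodule
```

Because count is updated by a single clock, the same structure has no safe equivalent when writes and reads come from asynchronous clocks, which is why the pointers must be compared instead.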
2.2 Asynchronous FIFO pointers
In order to understand FIFO design, one needs to understand how the FIFO pointers work. The write pointer always
points to the next word to be written; therefore, on reset, both pointers are set to zero, which also happens to be the
next FIFO word location to be written. On a FIFO-write operation, the memory location that is pointed to by the
write pointer is written, and then the write pointer is incremented to point to the next location to be written.
Similarly, the read pointer always points to the current FIFO word to be read. Again on reset, both pointers are reset
to zero, the FIFO is empty and the read pointer is pointing to invalid data (because the FIFO is empty and the empty
flag is asserted). As soon as the first data word is written to the FIFO, the write pointer increments, the empty flag is
cleared, and the read pointer that is still addressing the contents of the first FIFO memory word, immediately drives
that first valid word onto the FIFO data output port, to be read by the receiver logic. The fact that the read pointer is
always pointing to the next FIFO word to be read means that the receiver logic does not have to use two clock
periods to read the data word. If the receiver first had to increment the read pointer before reading a FIFO data
word, the receiver would clock once to output the data word from the FIFO, and clock a second time to capture the
data word into the receiver. That would be needlessly inefficient.
SNUG San Jose 2002, Rev 1.2 - Simulation and Synthesis Techniques for Asynchronous FIFO Design
The FIFO is empty when the read and write pointers are both equal. This condition happens when both pointers are
reset to zero during a reset operation, or when the read pointer catches up to the write pointer, having read the last
word from the FIFO.
A FIFO is full when the pointers are again equal, that is, when the write pointer has wrapped around and caught up
to the read pointer. This is a problem. The FIFO is either empty or full when the pointers are equal, but which?
One design technique used to distinguish between full and empty is to add an extra bit to each pointer. When the
write pointer increments past the final FIFO address, the write pointer will increment the unused MSB while setting
the rest of the bits back to zero as shown in Figure 1 (the FIFO has wrapped and toggled the pointer MSB). The
same is done with the read pointer. If the MSBs of the two pointers are different, it means that the write pointer has
wrapped one more time than the read pointer. If the MSBs of the two pointers are the same, it means that both
pointers have wrapped the same number of times.
Figure 1 - FIFO full and empty conditions
Using n-bit pointers where (n-1) is the number of address bits required to access the entire FIFO memory buffer, the
FIFO is empty when both pointers, including the MSBs are equal. And the FIFO is full when both pointers, except
the MSBs are equal.
The FIFO design in this paper uses n-bit pointers for a FIFO with 2^(n-1) write-able locations to help handle full and
empty conditions. More design details related to the full and empty logic are included in section 5.0.
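To make the extra-MSB idea concrete, here is a minimal sketch of a binary write pointer for an 8-location FIFO (n = 4). The module name wptr_bin and its ports are hypothetical, and a binary pointer is shown only for readability; it is not the Gray-code pointer that the FIFO in this paper passes across the clock boundary.

```verilog
module wptr_bin (waddr, wptr, winc, wfull, wclk, wrst_n);
  parameter ASIZE = 3;            // 2^3 = 8 locations, 4-bit pointer
  output [ASIZE-1:0] waddr;
  output [ASIZE:0]   wptr;
  input  winc, wfull, wclk, wrst_n;
  reg    [ASIZE:0]   wptr;

  // Only the lower bits address memory; the MSB records the wrap.
  assign waddr = wptr[ASIZE-1:0];

  // Incrementing past the last address (0111 -> 1000) clears the
  // address bits and toggles the MSB in a single add.
  always @(posedge wclk or negedge wrst_n)
    if (!wrst_n)             wptr <= 0;
    else if (winc && !wfull) wptr <= wptr + 1;
endmodule
```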

2.3 Binary FIFO pointer considerations
Trying to synchronize a binary count value from one clock domain to another is problematic because every bit of an
n-bit counter can change simultaneously (example 7->8 in binary numbers is 0111->1000, all bits changed). One
approach to the problem is to sample and hold periodic binary count values in a holding register and pass a
synchronized ready signal to the new clock domain. When the ready signal is recognized, the receiving clock
domain sends back a synchronized acknowledge signal to the sending clock domain. A sampled pointer must not
change until an acknowledge signal is received from the receiving clock domain. A count-value with multiple
changing bits can be safely transferred to a new clock domain using this technique. Upon receipt of an acknowledge
signal, the sending clock domain has permission to clear the ready signal and re-sample the binary count value.
Using this technique, the binary counter values are sampled periodically and not all of the binary counter values can
be passed to a new clock domain. The question is, do we need to be concerned about the case where a binary counter
might continue to increment and overflow or underflow the FIFO between sampled counter values? The answer is
no[8].
FIFO full occurs when the write pointer catches up to the synchronized and sampled read pointer. The synchronized
and sampled read pointer might not reflect the current value of the actual read pointer but the write pointer will not
try to count beyond the synchronized read pointer value. Overflow will not occur[8].
FIFO empty occurs when the read pointer catches up to the synchronized and sampled write pointer. The
synchronized and sampled write pointer might not reflect the current value of the actual write pointer but the read
pointer will not try to count beyond the synchronized write pointer value. Underflow will not occur[8].
More observations about this technique of sampling binary pointers with a synchronized ready-acknowledge pair of
handshaking signals are detailed in section 7.0, after the discussion of synchronized Gray[6] code pointers.
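The sending side of the ready-acknowledge sampling scheme described in this section might be sketched as follows. This is one possible four-phase arrangement under stated assumptions, not the author's implementation: the module name ptr_sample_send and all port names are hypothetical, and ack_sync is assumed to have already passed through a synchronizer into the sending clock domain.

```verilog
module ptr_sample_send (ptr_hold, ready, ptr, ack_sync, clk, rst_n);
  parameter SIZE = 5;
  output [SIZE-1:0] ptr_hold;  // stable sampled pointer sent across the boundary
  output            ready;     // to be synchronized into the receiving domain
  input  [SIZE-1:0] ptr;       // free-running binary pointer
  input             ack_sync;  // acknowledge, already synchronized to clk
  input             clk, rst_n;
  reg    [SIZE-1:0] ptr_hold;
  reg               ready;

  always @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      ptr_hold <= 0;
      ready    <= 1'b0;
    end
    else if (!ready && !ack_sync) begin
      // Idle: take a new sample and announce it.
      ptr_hold <= ptr;
      ready    <= 1'b1;
    end
    else if (ready && ack_sync)
      // Receiver has captured ptr_hold: drop ready, then wait for
      // ack_sync to fall before sampling again.
      ready <= 1'b0;
endmodule
```

The held value cannot change while ready is asserted, which is the property that lets a multi-bit binary count cross the boundary safely.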
A common approach to FIFO counter-pointers is to use Gray code counters. Gray codes only allow one bit to
change for each clock transition, eliminating the problem associated with trying to synchronize multiple changing
signals on the same clock edge.
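The difference can be seen on the 7-to-8 transition mentioned earlier in this section. The small simulation sketch below counts how many bits differ across that transition in binary and in Gray code; the module name tb_bit_changes and the helper function ones are illustrative assumptions.

```verilog
module tb_bit_changes;
  parameter SIZE = 4;
  reg [SIZE-1:0] a, b, ga, gb;

  // Count the 1s in a vector (the number of differing bits after an XOR).
  function integer ones;
    input [SIZE-1:0] v;
    integer i;
    begin
      ones = 0;
      for (i = 0; i < SIZE; i = i + 1)
        ones = ones + v[i];
    end
  endfunction

  initial begin
    a  = 4'd7;
    b  = 4'd8;                 // binary 0111 -> 1000
    ga = (a >> 1) ^ a;         // Gray code for 7
    gb = (b >> 1) ^ b;         // Gray code for 8
    $display("binary %b -> %b changes %0d bits", a, b, ones(a ^ b));
    $display("Gray   %b -> %b changes %0d bits", ga, gb, ones(ga ^ gb));
    $finish;
  end
endmodule
```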
2.4 FIFO testing troubles

Testing a FIFO design for subtle design problems is nearly impossible to do. The problem is rooted in the fact that
FIFO pointers in an RTL simulation behave ideally, even though, if incorrectly implemented, they can cause
catastrophic failures if used in a real design.
In an RTL simulation, if binary-count FIFO pointers are included in the design all of the FIFO pointer bits will
change simultaneously; there is no chance to observe synchronization and comparison problems. In a gate-level
simulation with no backannotated delays, there is only a slight chance of observing a problem if the gate transitions
are different for rising and falling edge signals, and even then, one would have to get lucky and have the correct
sequence of bits changing just prior to and just after a rising clock edge. For higher speed designs, the delay
differences between rising and falling edge signals diminish and the probability of detecting problems also
diminishes. The chance of finding actual FIFO design problems is greatest for gate-level designs with backannotated
delays, but even with this type of simulation, finding problems will be difficult and again the odds of observing the
design problems decrease as signal propagation delays diminish.
Clearly the answer is to recognize that there are potential FIFO design problems and to do the design correctly from
the start.
The behavioral model that I sometimes use for testing a FIFO design is a FIFO model that is simple to code, is
accurate for behavioral testing purposes and would be difficult to debug if it were used as an RTL synthesis model.
This FIFO model is only recommended for use in a FIFO testbench. The model accurately determines when FIFO
full and empty status bits should be set and can be used to determine the data values that should have been stored
into a working FIFO. THIS FIFO MODEL IS NOT SAFE FOR SYNTHESIS!
module beh_fifo (rdata, wfull, rempty, wdata,
                 winc, wclk, wrst_n, rinc, rclk, rrst_n);
  parameter DSIZE = 8;
  parameter ASIZE = 4;
  output [DSIZE-1:0] rdata;
  output             wfull;
  output             rempty;
  input  [DSIZE-1:0] wdata;
  input              winc, wclk, wrst_n;
  input              rinc, rclk, rrst_n;

  reg [ASIZE:0] wptr, wrptr1, wrptr2, wrptr3;
  reg [ASIZE:0] rptr, rwptr1, rwptr2, rwptr3;

  parameter MEMDEPTH = 1<<ASIZE;
  reg [DSIZE-1:0] ex_mem [0:MEMDEPTH-1];

  // Write pointer and memory write
  always @(posedge wclk or negedge wrst_n)
    if (!wrst_n) wptr <= 0;
    else if (winc && !wfull) begin
      ex_mem[wptr[ASIZE-1:0]] <= wdata;
      wptr <= wptr+1;
    end

  // Behavioral synchronization of the read pointer into the write domain
  always @(posedge wclk or negedge wrst_n)
    if (!wrst_n) {wrptr3,wrptr2,wrptr1} <= 0;
    else         {wrptr3,wrptr2,wrptr1} <= {wrptr2,wrptr1,rptr};

  // Read pointer
  always @(posedge rclk or negedge rrst_n)
    if (!rrst_n) rptr <= 0;
    else if (rinc && !rempty) rptr <= rptr+1;

  // Behavioral synchronization of the write pointer into the read domain
  always @(posedge rclk or negedge rrst_n)
    if (!rrst_n) {rwptr3,rwptr2,rwptr1} <= 0;
    else         {rwptr3,rwptr2,rwptr1} <= {rwptr2,rwptr1,wptr};

  assign rdata  = ex_mem[rptr[ASIZE-1:0]];
  assign rempty = (rptr == rwptr3);
  assign wfull  = ((wptr[ASIZE-1:0] == wrptr3[ASIZE-1:0]) &&
                   (wptr[ASIZE]     != wrptr3[ASIZE]     ));
endmodule
Example 1 - Behavioral FIFO model for testbench use only - SHOULD NOT BE USED FOR SYNTHESIS!
In the behavioral model of Example 1, it is okay to use binary-count pointers, a Verilog array to represent the FIFO
memory buffer, multi-asynchronous clocks in the same module and non-registered outputs. THIS MODEL IS NOT
INTENDED FOR SYNTHESIS! (Hopefully enough capital letters have been used in this section to discourage
anyone from trying to synthesize this model!)
Two of the always blocks in the module (the always blocks with concatenations) are included to behaviorally
represent the synchronization that will be required in the actual RTL FIFO design. They are not important to the
testing of the data transfer through the FIFO, but they are important to the testing of the correctly timed full and
empty flags in the FIFO model. The exact number of synchronization stages required in the behavioral model is
FIFO-design dependent. This model can be used to help test the FIFO design described in this paper.

3.0 Gray code counter - Style #1
Gray codes are named for the person who originally patented the code back in 1953, Frank Gray[6]. There are
multiple ways to design a Gray code counter. This section details a simple and straightforward method to do the
design. The technique described in this paper uses just one set of flip-flops for the Gray code counter. A second
method that uses two sets of flip-flops to achieve higher speeds is detailed in section 4.0.
3.1 Gray code patterns
For reasons that will be described later, it is desirable to create both an n-bit Gray code counter and an (n-1)-bit
Gray code counter. It would certainly be easy to create the two counters separately, but it is also easy and efficient
to create a common n-bit Gray code counter and then modify the 2nd MSB to form an (n-1)-bit Gray code counter
with shared LSBs. In this paper, this will be called a “dual n-bit Gray code counter.”
Figure 2 - n-bit Gray code converted to an (n-1)-bit Gray code

To better understand the problem of converting an n-bit Gray code to an (n-1)-bit Gray code, consider the example
of creating a dual 4-bit and 3-bit Gray code counter as shown in Figure 2.
The most common Gray code, as shown in Figure 2, is a reflected code where the bits in any column except the
MSB are symmetrical about the sequence mid-point[6]. This means that the second half of the 4-bit Gray code is a
mirror image of the first half with the MSB inverted.
To convert a 4-bit to a 3-bit Gray code, we do not want the LSBs of the second half of the 4-bit sequence to be a
mirror image of the LSBs of the first half, instead we want the LSBs of the second half to repeat the 4-bit LSB-
sequence of the first half.
Upon closer examination, it is obvious that inverting the second MSB of the second half of the 4-bit Gray code will
produce the desired 3-bit Gray code sequence in the three LSBs of the 4-bit sequence. The only other problem is
that the 3-bit Gray code with extra MSB is no longer a true Gray code because when the sequence changes from 7
(Gray 0100) to 8 (~Gray 1000) and again from 15 (~Gray 1100) to 0 (Gray 0000), two bits are changing instead of
just one bit. A true Gray code only changes one bit between counts.

3.2 Gray code counter basics
The first fact to remember about a Gray code is that the code distance between any two adjacent words is just 1
(only one bit can change from one Gray count to the next). The second fact to remember about a Gray code counter
is that most useful Gray code counters must have power-of-2 counts in the sequence. It is possible to make a Gray
code counter that counts an even number of sequences but conversions to and from these sequences are generally
not as simple to do as the standard Gray code. Also note that there are no odd-count-length Gray code sequences so
one cannot make a 23-deep Gray code. This means that the technique described in this paper is used to make a FIFO
that is 2^n deep.
Figure 3 is a block diagram for a style #1 dual n-bit Gray code counter. The style #1 Gray code counter assumes that
the outputs of the register bits are the Gray code value itself (ptr, either wptr or rptr). The Gray code outputs
are then passed to a Gray-to-binary converter (bin), which is passed to a conditional binary-value incrementer to
generate the next-binary-count-value (bnext), which is passed to a binary-to-Gray converter that generates the
next-Gray-count-value (gnext), which is passed to the register inputs. The top half of the Figure 3 block diagram
shows the described logic flow while the bottom half shows logic related to the second Gray code counter as
described in the next section.
Figure 3 - Dual n-bit Gray code counter block diagram - style #1

3.3 Dual n-bit Gray code counter
A dual n-bit Gray code counter is a Gray code counter that generates both an n-bit Gray code sequence (described in
section 3.2) and an (n-1)-bit Gray code sequence.
The (n-1)-bit Gray code is simply generated by doing an exclusive-or operation on the two MSBs of the n-bit Gray
code to generate the MSB for the (n-1)-bit Gray code. This is combined with the (n-2) LSBs of the n-bit Gray code
counter to form the (n-1)-bit Gray code counter[5].
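The exclusive-or of the two MSBs described above can be written as a single continuous assignment. The sketch below assumes SIZE is at least 3 so that the shared LSB slice is non-empty; the module name gray_nm1 and the port name gray_small are hypothetical.

```verilog
module gray_nm1 (gray_small, gray);
  parameter SIZE = 4;                // n
  output [SIZE-2:0] gray_small;      // (n-1)-bit Gray code
  input  [SIZE-1:0] gray;            // n-bit Gray code

  // New MSB = XOR of the two n-bit MSBs; the (n-2) LSBs are shared.
  assign gray_small = {gray[SIZE-1] ^ gray[SIZE-2], gray[SIZE-3:0]};
endmodule
```

For the 4-bit case, Gray 1100 (count 8) maps to 000 and Gray 1000 (count 15) maps to 100, so the three LSBs in the second half repeat the 3-bit sequence of the first half rather than mirroring it.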

3.4 Additional Gray code counter considerations
The binary-value incrementer is conditioned with either an “if not full” or “if not empty” test as shown in Figure 3,
to ensure that the appropriate FIFO pointer will not increment during FIFO-full or FIFO-empty conditions that could
lead to overflow or underflow of the FIFO buffer.
If the logic block that sends data to the FIFO reliably stops sending data when a FIFO full condition is asserted, the
FIFO design might be streamlined by removing the full-testing logic from the FIFO write pointer.
The FIFO pointer itself does not protect the FIFO buffer from being overwritten, but additional conditioning logic
could be added to the FIFO memory buffer to ensure that a write_enable signal could not be activated during a FIFO
full condition.
An additional “sticky” status bit, either ovf (overflow) or unf (underflow), could be added to the pointer design to
flag that a FIFO write operation was attempted during full or that a FIFO read operation was attempted during
empty. Once set, such an error flag would only be cleared by reset.
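One possible form of the sticky overflow bit is sketched below. The module name wovf_sticky is hypothetical and this logic is not part of the FIFO code in this paper; an underflow bit for the read side would follow the same pattern using rinc, rempty, rclk and rrst_n.

```verilog
// Hypothetical sticky overflow flag for the write side.
module wovf_sticky
  (output reg ovf,
   input      winc, wfull,
   input      wclk, wrst_n);

  always @(posedge wclk or negedge wrst_n)
    if (!wrst_n)            ovf <= 1'b0;
    else if (winc && wfull) ovf <= 1'b1;  // sticky: only reset clears it
endmodule
```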
A safe, general purpose FIFO design will include the above safeguards at the expense of a slightly larger and
perhaps slower implementation. This is a good idea since a future co-worker might try to copy and reuse the code in
another design without understanding all of the important details that were considered for the current design.
4.0 Gray code counter - Style #2

Starting with version 1.2 of this paper, the FIFO implementation uses the Gray code counter style #2, which actually
employs two sets of registers to eliminate the need to translate Gray pointer values to binary values. The second set
of registers (the binary registers) can also be used to address the FIFO memory directly without the need to translate
memory addresses into Gray codes. The n-bit Gray-code pointer is still required to synchronize the pointers into the
opposite clock domains, but the (n-1)-bit binary pointers can be used to address memory directly. The binary pointers
also make it easier to run calculations to generate "almost-full" and "almost-empty" bits if desired (not shown in this
paper).
Figure 4 - Dual n-bit Gray code counter block diagram - style #2

FIFO style #1
The block diagram for FIFO style #1 is shown in Figure 5.
Figure 5 - FIFO1 partitioning with synchronized pointer comparison

To facilitate static timing analysis of the style #1 FIFO design, the design has been partitioned into the following six
Verilog modules with the following functionality and clock domains:
• fifo1.v - (see Example 2 in section 6.1) - this is the top-level wrapper-module that includes all clock
domains. The top module is only used as a wrapper to instantiate all of the other FIFO modules used in the
design. If this FIFO is used as part of a larger ASIC or FPGA design, this top-level wrapper would probably be
discarded to permit grouping of the other FIFO modules into their respective clock domains for improved
synthesis and static timing analysis.
• fifomem.v - (see Example 3 in section 6.2) - this is the FIFO memory buffer that is accessed by both the
write and read clock domains. This buffer is most likely an instantiated, synchronous dual-port RAM. Other
memory styles can be adapted to function as the FIFO buffer.
• sync_r2w.v - (see Example 4 in section 6.3) - this is a synchronizer module that is used to synchronize the
read pointer into the write-clock domain. The synchronized read pointer will be used by the wptr_full
module to generate the FIFO full condition. This module only contains flip-flops that are synchronized to the
write clock. No other logic is included in this module.
• sync_w2r.v - (see Example 5 in section 6.4) - this is a synchronizer module that is used to synchronize the
write pointer into the read-clock domain. The synchronized write pointer will be used by the rptr_empty
module to generate the FIFO empty condition. This module only contains flip-flops that are synchronized to the
read clock. No other logic is included in this module.
• rptr_empty.v - (see Example 6 in section 6.5) - this module is completely synchronous to the read-clock
domain and contains the FIFO read pointer and empty-flag logic.
• wptr_full.v - (see Example 7 in section 6.6) - this module is completely synchronous to the write-clock
domain and contains the FIFO write pointer and full-flag logic.
In order to perform FIFO full and FIFO empty tests using this FIFO style, the read and write pointers must be
passed to the opposite clock domain for pointer comparison.
As with other FIFO designs, since the two pointers are generated from two different clock domains, the pointers
need to be “safely” passed to the opposite clock domain. The technique shown in this paper is to synchronize Gray
code pointers to ensure that only one pointer bit can change at a time.
5.0 Handling full & empty conditions

Exactly how FIFO full and FIFO empty are implemented is design-dependent.
The FIFO design in this paper assumes that the empty flag will be generated in the read-clock domain to ensure that
the empty flag is detected immediately when the FIFO buffer is empty, that is, the instant that the read pointer
catches up to the write pointer (including the pointer MSBs).
The FIFO design in this paper assumes that the full flag will be generated in the write-clock domain to ensure that
the full flag is detected immediately when the FIFO buffer is full, that is, the instant that the write pointer catches up
to the read pointer (except for different pointer MSBs).
5.1 Generating empty
As shown in Figure 1, the FIFO is empty when the read pointer and the synchronized write pointer are equal.
The empty comparison is simple to do. Pointers that are one bit larger than needed to address the FIFO memory
buffer are used. If the extra bits of both pointers (the MSBs of the pointers) are equal, the pointers have wrapped the
same number of times and if the rest of the read pointer equals the synchronized write pointer, the FIFO is empty.
The Gray code write pointer must be synchronized into the read-clock domain through a pair of synchronizer
registers found in the sync_w2r module. Since only one bit changes at a time using a Gray code pointer, there is
no problem synchronizing multi-bit transitions between clock domains.
In order to efficiently register the rempty output, the synchronized write pointer is actually compared against the
rgraynext (the next Gray code that will be registered into the rptr). The empty value testing and the
accompanying sequential always block has been extracted from the rptr_empty.v code of Example 6 and is
shown below:
assign rempty_val = (rgraynext == rq2_wptr);

always @(posedge rclk or negedge rrst_n)
  if (!rrst_n) rempty <= 1'b1;
  else         rempty <= rempty_val;
5.2 Generating full

Since the full flag is generated in the write-clock domain by running a comparison between the write and read
pointers, one safe technique for doing FIFO design requires that the read pointer be synchronized into the write
clock domain before doing pointer comparison.
The full comparison is not as simple to do as the empty comparison. Pointers that are one bit larger than needed to
address the FIFO memory buffer are still used for the comparison, but simply using Gray code counters with an
extra bit to do the comparison is not valid to determine the full condition. The problem is that a Gray code is a
symmetric code except for the MSBs.

Figure 6 - Problems associated with extracting a 3-bit Gray code from a 4-bit Gray code
Consider the example shown in Figure 6 of an 8-deep FIFO. In this example, a 3-bit Gray code pointer is used to
address memory and an extra bit (the MSB of a 4-bit Gray code) is added to test for full and empty conditions. If the
FIFO is allowed to fill the first seven locations (words 0-6) and then if the FIFO is emptied by reading back the
same seven words, both pointers will be equal and will point to address Gray-7 (the FIFO is empty). On the next
write operation, the write pointer will increment the 4-bit Gray code pointer (remember, only the 3 LSBs are being
used to address memory), making the MSBs different on the 4-bit pointers but the rest of the write pointer bits will
match the read pointer bits, so the FIFO full flag would be asserted. This is wrong! Not only is the FIFO not full,
but the 3 LSBs did not change, which means that the addressed memory location will over-write the last FIFO
memory location that was written. This too is wrong!
This is one reason why the dual n-bit Gray code counter of Figure 4 and Section 4.0 is used.
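The problem can be reproduced with a small self-checking simulation. The sketch below (the testbench name tb_gray_lsb_mirror is hypothetical) converts binary counts 7 and 8 to 4-bit Gray code and confirms that only the MSB differs, so the three LSBs used as the memory address would not advance.

```verilog
// Illustration only: the 3 LSBs of a 4-bit Gray code do not change
// when the count goes from 7 to 8.
module tb_gray_lsb_mirror;
  function [3:0] bin2gray (input [3:0] b);
    bin2gray = (b >> 1) ^ b;
  endfunction

  reg [3:0] g7, g8;

  initial begin
    g7 = bin2gray(4'd7);   // 4'b0100
    g8 = bin2gray(4'd8);   // 4'b1100
    if (g7[2:0] == g8[2:0] && g7[3] != g8[3])
      $display("3 LSBs unchanged: %b -> %b", g7, g8);
    else
      $display("unexpected result: %b -> %b", g7, g8);
  end
endmodule
```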
The correct method to perform the full comparison is accomplished by synchronizing the rptr into the wclk
domain and then there are three conditions that are all necessary for the FIFO to be full:
(1) The wptr and the synchronized rptr MSB's are not equal (because the wptr must have wrapped
one more time than the rptr).
(2) The wptr and the synchronized rptr 2nd MSB's are not equal (because an inverted 2nd MSB from
one pointer must be tested against the un-inverted 2nd MSB from the other pointer, which is required if the
MSB's are also inverses of each other - see Figure 6 above).
(3) All other wptr and synchronized rptr bits must be equal.
In order to efficiently register the wfull output, the synchronized read pointer is actually compared against the
wgraynext (the next Gray code that will be registered in the wptr). This is shown below in the sequential always
block that has been extracted from the wptr_full.v code of Example 7:

assign wfull_val = ((wgraynext[ADDRSIZE]     != wq2_rptr[ADDRSIZE]    ) &&
                    (wgraynext[ADDRSIZE-1]   != wq2_rptr[ADDRSIZE-1]  ) &&
                    (wgraynext[ADDRSIZE-2:0] == wq2_rptr[ADDRSIZE-2:0]));

always @(posedge wclk or negedge wrst_n)
  if (!wrst_n) wfull <= 1'b0;
  else         wfull <= wfull_val;
In the above code, the three necessary conditions to check for FIFO-full are tested and the result is assigned to the
wfull_val signal, which is then registered in the subsequent sequential always block.
The continuous assignment to wfull_val can be further simplified using concatenations as shown below:
assign wfull_val = (wgraynext=={~wq2_rptr[ADDRSIZE:ADDRSIZE-1],
                                 wq2_rptr[ADDRSIZE-2:0]});
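The equivalence of the three-condition test and the concatenation form can be checked exhaustively for a small pointer width. The testbench below is only a sketch: the name tb_full_equiv, the signals full_3cond and full_concat, and the choice of ADDRSIZE = 3 are assumptions made for this example. Here wgraynext stands for the next Gray code write pointer.

```verilog
// Exhaustive comparison of the two full-test forms for 4-bit pointers.
module tb_full_equiv;
  localparam ADDRSIZE = 3;

  reg  [ADDRSIZE:0] wgraynext, wq2_rptr;
  wire              full_3cond, full_concat;
  integer           i, j, errors;

  assign full_3cond  = ((wgraynext[ADDRSIZE]     != wq2_rptr[ADDRSIZE]    ) &&
                        (wgraynext[ADDRSIZE-1]   != wq2_rptr[ADDRSIZE-1]  ) &&
                        (wgraynext[ADDRSIZE-2:0] == wq2_rptr[ADDRSIZE-2:0]));

  assign full_concat = (wgraynext == {~wq2_rptr[ADDRSIZE:ADDRSIZE-1],
                                       wq2_rptr[ADDRSIZE-2:0]});

  initial begin
    errors = 0;
    for (i = 0; i < (1 << (ADDRSIZE+1)); i = i + 1)
      for (j = 0; j < (1 << (ADDRSIZE+1)); j = j + 1) begin
        wgraynext = i;
        wq2_rptr  = j;
        #1;                                   // let the assigns settle
        if (full_3cond !== full_concat) errors = errors + 1;
      end
    $display("mismatches: %0d", errors);
  end
endmodule
```

Since inverting the two MSBs and testing for equality is the same as testing those two bits for inequality, the simulation should report zero mismatches.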
5.3 Different clock speeds

Since asynchronous FIFOs are clocked from two different clock domains, obviously the clocks are running at
different speeds. When synchronizing a faster clock into a slower clock domain, there will be some count values
that are skipped due to the fact that the faster clock will semi-periodically increment twice between slower clock
edges. This raises discussion of the two following questions:
First question. Noting that a synchronized Gray code that increments twice but is only sampled once will
show multi-bit changes in the synchronized value, will this cause multi-bit synchronization problems?
The answer is no. Synchronizing multi-bit changes is only a problem when multiple bits are changing near the rising
edge of the synchronizing clock. The fact that a Gray code counter could increment twice (or more) between slower
synchronization clock edges means that the first Gray code change will occur well before the rising edge of the
slower clock and only the second Gray code transition could change near the rising clock edge. There is no multi-bit
synchronization problem with Gray code counters.
Second question. Again noting that a faster Gray code counter could increment more than once between the
rising edge of a slower clock signal, is it possible that the Gray code counter from the faster clock domain
could increment to a full-state and to a full+1-state before full is detected, causing the FIFO to overflow
without recognizing that the FIFO was ever full? (This question similarly applies to FIFO empty).
Again, the answer is no using the implementation described in this paper. Consider first the generation of FIFO full.
The FIFO goes full when the write pointer catches up to the synchronized read pointer and the FIFO-full state is
detected in the write clock domain. If the wclk-domain is faster than the rclk-domain, the write pointer will
eventually catch up to the synchronized read pointer, the FIFO will be full, the wfull bit will be set and the FIFO
will quit writing until the synchronized read pointer advances again. The write pointer cannot advance past the
synchronized read pointer in the wclk-domain.
A similar examination of the empty flag shows that the FIFO goes empty when the read pointer catches up to the
synchronized write pointer and the FIFO-empty state is detected in the read clock domain. If the rclk-domain is
faster than the wclk-domain, the read pointer will eventually catch up to the synchronized write pointer, the FIFO
will be empty, the rempty bit will be set and the FIFO will quit reading until the synchronized write pointer
advances again. The read pointer cannot advance past the synchronized write pointer in the rclk-domain.
Using this implementation, assertion of “full” or “empty” happens exactly when the FIFO goes full or empty.
Removal of “full” and “empty” status is pessimistic.

5.4 Pessimistic full & empty
The FIFO described in this paper has implemented full-removal and empty-removal using a “pessimistic” method.
That is, “full” and “empty” are both asserted exactly on time but removed late.
Since the write clock is used to generate the FIFO-full status and since FIFO-full occurs when the write pointer
catches up to the synchronized read pointer, full-detection is “accurate” and immediate. Removal of “full” status is
pessimistic because “full” comparison is being done with a synchronized read pointer. When the read pointer does
increment, the FIFO is no longer full, but the full-generation logic will not detect the change until two rising wclk
edges synchronize the updated rptr into the wclk domain. This is generally not a problem, since it means that the
data-sending hardware is being “held-off” or informed that the FIFO is still full for a couple of extra wclk edges.
The important detail is to ensure that the FIFO does not overflow. Signaling the data-sender to not send more data
for a couple of extra wclk edges merely gives time for the FIFO to make room to receive more data.
Similarly, since the read clock is used to generate the FIFO-empty status and since FIFO-empty occurs when the
read pointer catches up to the synchronized write pointer, empty-detection is “accurate” and immediate. Removal of
“empty” status is pessimistic because “empty” comparison is being done with a synchronized write pointer. When
the write pointer does increment, the FIFO is no longer empty, but the empty-generation logic will not detect the
change until two rising rclk edges synchronize the updated wptr into the rclk domain. This is generally not a
problem, since it means that the data-receiving logic is being “held-off” or informed that the FIFO is still empty for
a couple of extra rclk edges. The important detail is to ensure that the FIFO does not underflow. Signaling the
data-receiver to stop removing data from the FIFO for a couple of extra rclk edges merely gives time for the FIFO
to be filled with more data.
5.4.1 “Accurate” setting of full & empty
Note that setting either the full flag or empty flag might not be quite accurate if both pointers are incrementing
simultaneously. For example, if the write pointer catches up to the synchronized read pointer, the full flag will be
set, but if the read pointer had incremented at the same time as the write pointer, the full flag will have been set
early since the FIFO is not really full due to a read operation occurring simultaneous to the “write-to-full” operation,
but the read pointer had not yet been synchronized into the write-clock domain. The setting of the full flag was
slightly too early and slightly pessimistic. This is not a design problem.
5.5 Multi-bit asynchronous reset
Much attention has been paid to ensuring that the FIFO pointers only change one bit at a time. The question is, will
there be a problem associated with an asynchronous reset, which generally causes multiple pointer bits to change
simultaneously?
The answer is no. A reset indicates that the FIFO has also been reset and there is no valid data in the FIFO. On
assertion of the reset, all of the synchronizing registers, wclk-domain logic (including the registered full flag), and
rclk-domain logic are simultaneously and asynchronously reset. The registered empty flag is also set at the same
time. The more important question concerns orderly removal of the reset signals.
Note that the design included in this paper uses different reset signals for the wclk and rclk domains. The resets
used in this design are intended to be asynchronously set and synchronously removed using the techniques described
in Mills and Cummings[2].
Asynchronous reset of the FIFO pointers is not a problem.
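A commonly used way to obtain a reset that is asynchronously asserted and synchronously removed is a two flip-flop reset synchronizer, one per clock domain. The sketch below is a generic illustration; the module name rst_sync and its port names are assumptions made for this example and are not part of the FIFO code in this paper.

```verilog
// Hypothetical reset synchronizer: asynchronous assertion, synchronous removal.
module rst_sync
  (output reg rst_n,       // reset for one clock domain, e.g. wrst_n
   input      clk,         // clock of that domain, e.g. wclk
   input      arst_n);     // raw asynchronous reset, active low

  reg rff1;

  always @(posedge clk or negedge arst_n)
    if (!arst_n) {rst_n, rff1} <= 2'b00;         // assert immediately
    else         {rst_n, rff1} <= {rff1, 1'b1};  // release after two clk edges
endmodule
```

One instance clocked by wclk could generate wrst_n and a second instance clocked by rclk could generate rrst_n, so that each domain sees its reset removed synchronously to its own clock.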

5.6 Almost full and almost empty
Many designs require notification of a pending full or empty status with the generation of “almost full” and “almost
empty” status bits. There are many ways to implement these two status bits and each implementation is dependent
upon the specified design requirements.
Some FIFO designs require programmable FIFO-full and FIFO-empty difference values, such that when the
difference between the two pointers is smaller than the programmed difference, the corresponding almost full or
almost empty bit is asserted. Other FIFOs may be implemented with a fixed difference to generate almost full or
empty. Other FIFOs may be satisfied with almost full and empty being loosely generated when the MSBs of the
FIFO pointers are close. And yet other designs might only require knowing when the FIFO is more, or less than half
full.
Remembering that the FIFO is full when the wptr catches up to the synchronized rptr, the almost full condition
could be described as the condition when (wptr+4) catches up to the synchronized rptr. The (wptr+4) value
could be generated in the Gray code pointer logic shown in Figure 3 by placing a second adder after the
Gray-to-binary combinational logic to add four to the binary value and register the result. This registered value
would then be used to do subtraction against the synchronized rptr after it has been converted to a binary value
in the wclk domain, and if the difference is less than four, an almost_full bit could be set. A less-than
operation ensures that the almost_full bit is set for the full range when the wptr is within 0-4 counts of
catching up to the synchronized rptr. Similar logic could be used in the rclk-domain to generate the
almost_empty flag.
Almost full and almost empty have not been included in the Verilog RTL code shown in this paper.
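For readers who want a starting point, one possible variation is sketched below. Instead of registering (wptr+4), it converts the synchronized Gray read pointer back to binary and subtracts it from the next binary write pointer to obtain an occupancy count. The module name wptr_afull, the AFGAP parameter and the rbin_sync and used signals are all hypothetical, and AFGAP is assumed to be no larger than the FIFO depth.

```verilog
// Hypothetical almost-full generator for the wclk domain (not part of FIFO1).
module wptr_afull #(parameter ADDRSIZE = 4,
                    parameter AFGAP    = 4)  // assert when <= AFGAP words are free
  (output reg          almost_full,
   input  [ADDRSIZE:0] wbinnext,    // next binary write pointer
   input  [ADDRSIZE:0] wq2_rptr,    // synchronized Gray code read pointer
   input               wclk, wrst_n);

  localparam DEPTH = 1 << ADDRSIZE;

  reg  [ADDRSIZE:0] rbin_sync;      // Gray-to-binary of wq2_rptr
  wire [ADDRSIZE:0] used;           // number of occupied words (0 to DEPTH)
  integer i;

  always @* begin
    for (i = 0; i <= ADDRSIZE; i = i + 1)
      rbin_sync[i] = ^(wq2_rptr >> i);
  end

  // Subtraction wraps modulo 2^(ADDRSIZE+1), so pointer wrap-around is handled
  assign used = wbinnext - rbin_sync;

  always @(posedge wclk or negedge wrst_n)
    if (!wrst_n) almost_full <= 1'b0;
    else         almost_full <= (used >= (DEPTH - AFGAP));
endmodule
```

Because the comparison uses the synchronized read pointer, the occupancy count can only be over-estimated, so this flag is pessimistic in the same way as the full flag.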

6.0 RTL code for FIFO Style #1
The Verilog RTL code for the FIFO style #1 model is listed in this section.
6.1 fifo1.v - FIFO top-level module
The top-level FIFO module is a parameterized FIFO design with all sub-blocks instantiated using the recommended
practice of doing named port connections. Another common coding practice is to give the top-level module
instantiations the same name as the module name. This is done to facilitate debug, since referencing module names
in a hierarchical path will be straightforward if the instance names match the module names.
module fifo1 #(parameter DSIZE = 8,
               parameter ASIZE = 4)
  (output [DSIZE-1:0] rdata,
   output             wfull,
   output             rempty,
   input  [DSIZE-1:0] wdata,
   input              winc, wclk, wrst_n,
   input              rinc, rclk, rrst_n);

  wire [ASIZE-1:0] waddr, raddr;
  wire [ASIZE:0]   wptr, rptr, wq2_rptr, rq2_wptr;

  // ASIZE is passed to the synchronizers so that their pointer width
  // tracks the FIFO address size when ASIZE is overridden
  sync_r2w #(ASIZE) sync_r2w
    (.wq2_rptr(wq2_rptr), .rptr(rptr),
     .wclk(wclk), .wrst_n(wrst_n));

  sync_w2r #(ASIZE) sync_w2r
    (.rq2_wptr(rq2_wptr), .wptr(wptr),
     .rclk(rclk), .rrst_n(rrst_n));

  fifomem #(DSIZE, ASIZE) fifomem
    (.rdata(rdata), .wdata(wdata),
     .waddr(waddr), .raddr(raddr),
     .wclken(winc), .wfull(wfull),
     .wclk(wclk));

  rptr_empty #(ASIZE) rptr_empty
    (.rempty(rempty),
     .raddr(raddr),
     .rptr(rptr), .rq2_wptr(rq2_wptr),
     .rinc(rinc), .rclk(rclk),
     .rrst_n(rrst_n));

  wptr_full #(ASIZE) wptr_full
    (.wfull(wfull), .waddr(waddr),
     .wptr(wptr), .wq2_rptr(wq2_rptr),
     .winc(winc), .wclk(wclk),
     .wrst_n(wrst_n));
endmodule
Example 2 - Top-level Verilog code for the FIFO style #1 design

6.2 fifomem.v - FIFO memory buffer
The FIFO memory buffer is typically an instantiated ASIC or FPGA dual-port, synchronous memory device. The
memory buffer could also be synthesized to ASIC or FPGA registers using the RTL code in this module.
Regarding an instantiated vendor RAM versus a Verilog-declared RAM, the Synopsys DesignWare team did internal
analysis and found that for sizes up to 256 bits, there is no lost area or performance using the Verilog-declared
RAM compared to an instantiated vendor RAM[4].
If a vendor RAM is instantiated, it is highly recommended that the instantiation be done using named port
connections.
module fifomem #(parameter DATASIZE = 8, // Memory data word width
                 parameter ADDRSIZE = 4) // Number of mem address bits
  (output [DATASIZE-1:0] rdata,
   input  [DATASIZE-1:0] wdata,
   input  [ADDRSIZE-1:0] waddr, raddr,
   input                 wclken, wfull, wclk);

`ifdef VENDORRAM
  // instantiation of a vendor's dual-port RAM
  // (vendor_ram and its port names are placeholders for the actual RAM cell)
  vendor_ram mem (.dout(rdata), .din(wdata),
                  .waddr(waddr), .raddr(raddr),
                  .wclken(wclken),
                  .wclken_n(wfull), .clk(wclk));
`else
  // RTL Verilog memory model
  localparam DEPTH = 1<<ADDRSIZE;
  reg [DATASIZE-1:0] mem [0:DEPTH-1];

  assign rdata = mem[raddr];

  // Write only when enabled and the FIFO is not full
  always @(posedge wclk)
    if (wclken && !wfull) mem[waddr] <= wdata;
`endif
endmodule
Example 3 - Verilog RTL code for the FIFO buffer memory array

6.3 sync_r2w.v - Read-domain to write-domain synchronizer
This is a simple synchronizer module, used to pass an n-bit pointer from the read clock domain to the write clock
domain, through a pair of registers that are clocked by the FIFO write clock. Notice the simplicity of the always
block that concatenates the two registers together for reset and shifting. The synchronizer always block is only three
lines of code.
All module outputs are registered for simplified synthesis using time budgeting. All outputs of this module are
entirely synchronous to the wclk and all asynchronous inputs to this module are from the rclk domain with all
signals named using an “r” prefix, making it easy to set a false path on all “r*” signals for simplified static timing
analysis.
module sync_r2w #(parameter ADDRSIZE = 4)
  (output reg [ADDRSIZE:0] wq2_rptr,
   input      [ADDRSIZE:0] rptr,
   input                   wclk, wrst_n);

  reg [ADDRSIZE:0] wq1_rptr;

  always @(posedge wclk or negedge wrst_n)
    if (!wrst_n) {wq2_rptr,wq1_rptr} <= 0;
    else         {wq2_rptr,wq1_rptr} <= {wq1_rptr,rptr};
endmodule
Example 4 - Verilog RTL code for the read-clock domain to write-clock domain synchronizer module
6.4 sync_w2r.v - Write-domain to read-domain synchronizer

This is a simple synchronizer module, used to pass an n-bit pointer from the write clock domain to the read clock
domain, through a pair of registers that are clocked by the FIFO read clock. Notice the simplicity of the always
block that concatenates the two registers together for reset and shifting. The synchronizer always block is only three
lines of code.
All module outputs are registered for simplified synthesis using time budgeting. All outputs of this module are
entirely synchronous to the rclk and all asynchronous inputs to this module are from the wclk domain with all
signals named using a “w” prefix, making it easy to set a false path on all “w*” signals for simplified static timing
analysis.
module sync_w2r #(parameter ADDRSIZE = 4)
  (output reg [ADDRSIZE:0] rq2_wptr,
   input      [ADDRSIZE:0] wptr,
   input                   rclk, rrst_n);

  reg [ADDRSIZE:0] rq1_wptr;

  always @(posedge rclk or negedge rrst_n)
    if (!rrst_n) {rq2_wptr,rq1_wptr} <= 0;
    else         {rq2_wptr,rq1_wptr} <= {rq1_wptr,wptr};
endmodule
Example 5 - Verilog RTL code for the write-clock domain to read-clock domain synchronizer module

6.5 rptr_empty.v - Read pointer & empty generation logic
This module encloses all of the FIFO logic that is generated within the read clock domain (except synchronizers).
The read pointer is a dual n-bit Gray code counter. The n-bit pointer ( rptr ) is passed to the write clock domain
through the sync_r2w module. The (n-1)-bit pointer ( raddr ) is used to address the FIFO buffer.
The FIFO empty output is registered and is asserted on the next rising rclk edge when the next rptr value equals
the synchronized wptr value. All module outputs are registered for simplified synthesis using time budgeting. This
module is entirely synchronous to the rclk for simplified static timing analysis.
module rptr_empty #(parameter ADDRSIZE = 4)
  (output reg                rempty,
   output     [ADDRSIZE-1:0] raddr,
   output reg [ADDRSIZE  :0] rptr,
   input      [ADDRSIZE  :0] rq2_wptr,
   input                     rinc, rclk, rrst_n);

  reg  [ADDRSIZE:0] rbin;
  wire [ADDRSIZE:0] rgraynext, rbinnext;
  wire              rempty_val;

  //-------------------
  // GRAYSTYLE2 pointer
  //-------------------
  always @(posedge rclk or negedge rrst_n)
    if (!rrst_n) {rbin, rptr} <= 0;
    else         {rbin, rptr} <= {rbinnext, rgraynext};

  // Memory read-address pointer (okay to use binary to address memory)
  assign raddr     = rbin[ADDRSIZE-1:0];

  assign rbinnext  = rbin + (rinc & ~rempty);
  assign rgraynext = (rbinnext>>1) ^ rbinnext;

  //---------------------------------------------------------------
  // FIFO empty when the next rptr == synchronized wptr or on reset
  //---------------------------------------------------------------
  assign rempty_val = (rgraynext == rq2_wptr);

  always @(posedge rclk or negedge rrst_n)
    if (!rrst_n) rempty <= 1'b1;
    else         rempty <= rempty_val;
endmodule
Example 6 - Verilog RTL code for the read pointer and empty flag logic

6.6 wptr_full.v - Write pointer & full generation logic
This module encloses all of the FIFO logic that is generated within the write clock domain (except synchronizers).
The write pointer is a dual n-bit Gray code counter. The n-bit pointer ( wptr ) is passed to the read clock domain
through the sync_w2r module. The (n-1)-bit pointer ( waddr ) is used to address the FIFO buffer.
The FIFO full output is registered and is asserted on the next rising wclk edge when the wgraynext value equals
the synchronized wq2_rptr value with its two MSBs inverted. All module outputs are registered for
simplified synthesis using time budgeting. This module is entirely synchronous to the wclk for simplified static
timing analysis.
module wptr_full #(parameter ADDRSIZE = 4)
  (output reg                wfull,
   output     [ADDRSIZE-1:0] waddr,
   output reg [ADDRSIZE  :0] wptr,
   input      [ADDRSIZE  :0] wq2_rptr,
   input                     winc, wclk, wrst_n);

  reg  [ADDRSIZE:0] wbin;
  wire [ADDRSIZE:0] wgraynext, wbinnext;
  wire              wfull_val;

  // GRAYSTYLE2 pointer
  always @(posedge wclk or negedge wrst_n)
    if (!wrst_n) {wbin, wptr} <= 0;
    else         {wbin, wptr} <= {wbinnext, wgraynext};

  // Memory write-address pointer (okay to use binary to address memory)
  assign waddr     = wbin[ADDRSIZE-1:0];

  assign wbinnext  = wbin + (winc & ~wfull);
  assign wgraynext = (wbinnext>>1) ^ wbinnext;

  //------------------------------------------------------------------
  // Simplified version of the three necessary full-tests:
  // assign wfull_val=((wgraynext[ADDRSIZE]     !=wq2_rptr[ADDRSIZE]    ) &&
  //                   (wgraynext[ADDRSIZE-1]   !=wq2_rptr[ADDRSIZE-1]  ) &&
  //                   (wgraynext[ADDRSIZE-2:0] ==wq2_rptr[ADDRSIZE-2:0]));
  //------------------------------------------------------------------
  assign wfull_val = (wgraynext=={~wq2_rptr[ADDRSIZE:ADDRSIZE-1],
                                   wq2_rptr[ADDRSIZE-2:0]});

  always @(posedge wclk or negedge wrst_n)
    if (!wrst_n) wfull <= 1'b0;
    else         wfull <= wfull_val;
endmodule
Example 7 - Verilog RTL code for the write pointer and full flag logic

7.0 Comparing Gray code pointers to binary pointers
As mentioned in section 2.3, binary pointers can be used to do FIFO design if the pointers are sampled and
handshaking control signals are used between the two clock domains to safely pass the sampled binary count values.
Some advantages of using binary pointers over Gray code pointers:
• The technique of sampling a multi-bit value into a holding register and using synchronized handshaking control
signals to pass the multi-bit value into a new clock domain can be used for passing ANY arbitrary multi-bit
value across clock domains. This technique can be used to pass FIFO pointers or any multi-bit value.
• Each synchronized Gray code pointer requires 2n flip-flops (2 per pointer bit). The sampled multi-bit register
requires 2n+4 flip-flops (1 per holding register bit in each clock domain, 2 flip-flops to synchronize a ready bit
and 2 flip-flops to synchronize an acknowledge bit). There is no appreciable difference in the chance that either
pointer style would experience metastability.
• The sampled multi-bit binary register allows arbitrary pointer changes. Gray code pointers can only increment
and decrement.
• The sampled multi-bit register technique permits arbitrary FIFO depths; whereas, a Gray code pointer requires
power-of-2 FIFO depths. If a design required a FIFO depth of at least 132 words, using a standard Gray code
pointer would employ a FIFO depth of 256 words. Since most instantiated dual-port RAM blocks are power-of-
2 words deep, this may not be an issue.
• Using binary pointers makes it easy to calculate “almost-empty” and “almost-full” status bits using simple
binary arithmetic between the pointer values.
One small disadvantage to using binary pointers over Gray code pointers is:
• Sampling and holding a binary FIFO pointer and then handshaking it across a clock boundary can delay the
capture of new samples by at least two clock edges from the receiving clock domain and another two clock
edges from the sending clock domain. This latency is generally not a problem but it will typically add more
pessimism to the assertion of full and empty and might require additional FIFO depth to compensate for the
added pessimism. Since most FIFOs are typically specified with excess depth, it is not likely that extra registers
or a larger dual-port FIFO buffer size would be required.
The above comparison is worthy of consideration when selecting a method to implement a FIFO design.
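The sample-and-handshake technique discussed above might be sketched as follows. This is only one possible arrangement, using a toggle-style (two-phase) ready/acknowledge exchange; the module name bin_ptr_handshake and all signal names are hypothetical, the flip-flop count differs slightly from the 2n+4 estimate given above, and both resets are assumed to be asserted together so that the ready and acknowledge levels start out equal.

```verilog
// Hypothetical sketch: pass a sampled binary pointer between clock domains
// using a holding register and synchronized ready/acknowledge levels.
module bin_ptr_handshake #(parameter N = 5)
  (output reg [N-1:0] dptr,       // pointer copy in the destination domain
   input      [N-1:0] sptr,       // live binary pointer in the source domain
   input              sclk, srst_n,
   input              dclk, drst_n);

  reg [N-1:0] shold;              // source-domain holding register
  reg         sreq;               // "ready" level (source domain)
  reg         dack;               // "acknowledge" level (destination domain)
  reg         dq1_req, dq2_req;   // ready synchronized into dclk
  reg         sq1_ack, sq2_ack;   // acknowledge synchronized into sclk

  // Source domain: take a new sample only after the last one is acknowledged
  always @(posedge sclk or negedge srst_n)
    if (!srst_n) begin
      shold <= 0;
      sreq  <= 1'b0;
      {sq2_ack, sq1_ack} <= 2'b00;
    end
    else begin
      {sq2_ack, sq1_ack} <= {sq1_ack, dack};
      if (sreq == sq2_ack) begin  // idle: previous transfer acknowledged
        shold <= sptr;            // sample the pointer
        sreq  <= ~sreq;           // toggle to announce a new sample
      end
    end

  // Destination domain: capture shold when a new ready toggle arrives
  always @(posedge dclk or negedge drst_n)
    if (!drst_n) begin
      dptr <= 0;
      dack <= 1'b0;
      {dq2_req, dq1_req} <= 2'b00;
    end
    else begin
      {dq2_req, dq1_req} <= {dq1_req, sreq};
      if (dq2_req != dack) begin  // new sample pending
        dptr <= shold;            // shold is stable until acknowledged
        dack <= dq2_req;          // acknowledge by matching the ready level
      end
    end
endmodule
```

The holding register only changes after the acknowledge has returned to the source domain, so the multi-bit value is stable whenever the destination domain captures it. The round trip through both synchronizers is the source of the added latency and pessimism noted above.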
8.0 Conclusions
Asynchronous FIFO design requires careful attention to details from pointer generation techniques to full and empty
generation. Ignorance of important details will generally result in a design that is easily verified but is also wrong.
Finding FIFO design errors typically requires simulation of a gate-level FIFO design with backannotation of actual
delays and a whole lot of luck!
Synchronization of FIFO pointers into the opposite clock domain is safely accomplished using Gray code pointers.
Generating the FIFO-full status is perhaps the hardest part of a FIFO design. Dual n-bit Gray code counters are
valuable to synchronize an n-bit pointer into the opposite clock domain and to use an (n-1)-bit pointer to do “full”
comparison. Synchronizing binary FIFO pointers using techniques described in section 7.0 is another worthy
technique to use when doing FIFO design.
Generating the FIFO-empty status is easily accomplished by comparing-equal the n-bit read pointer to the
synchronized n-bit write pointer.
The techniques described in this paper should work with asynchronous clocks spanning small to large differences in
speed.
Careful partitioning of the FIFO modules along clock boundaries with all outputs registered can facilitate synthesis
and static timing analysis within the two asynchronous clock domains.

9.0 DesignWare FIFOs
It should be mentioned that DesignWare (DW) has a number of FIFO implementations that can be instantiated into
a design. It should also be noted that the DW FIFOs have not always been bug-free.
For additional documentation, go to SolvNet and search on "FIFO STAR" and you will find STAR 104287 and
STAR 105016 related to the FIFO DW components and the DW_16550 UART. All of these bugs had to do with the
DW FIFOs and FIFO sections of the UART. The DesignWare-110.html says that the bugs are fixed in the 1299-3
patch (December 1999).
There are too many ways to do a FIFO design wrong and I consider relying on the DW FIFO components to be
absolutely correct without more details on how they were designed to be very risky. Unless I could verify that IP
designers followed the important FIFO design guidelines outlined in this paper, I would be inclined to code my own
FIFO designs.
10.0 Acknowledgements
I am grateful to Ben Cohen for his willingness to discuss FIFO design issues with me in preparation for writing this
paper. I would also like to thank Peter Alfke of Xilinx for also discussing with me alternate interesting approaches
to FIFO design.
A special thanks to Steve Golson for doing a great review of the paper on short notice and adding the valuable
information, techniques and advantages related to using binary pointers in FIFO design in place of the Gray code
pointers. Also for finding the original patent information on Frank Gray’s “Pulse Code Communication.”
11.0 Additional Post-SNUG Editorial Comments

A second FIFO paper, voted “Best Paper - 1st Place” by SNUG attendees, is listed as reference [3] and is also
available for download.
Many of the techniques used in the second FIFO paper[3] can also be used in the FIFO1 design. In particular, the
“dual n-bit counter” of the FIFO1 design can be replaced with the quadrant detection logic described in the second
FIFO paper. The FIFO1 Gray code counter style #1 can also be replaced with the faster Gray code counter style #2
described in the second FIFO paper.

12.1 Revision 1.1 (2002) - What Changed?
Version 1.1 was the first release version of this paper on the sunburst-design.com web page and included the Post-
SNUG Editorial Comments.
Full flag detection - the first version of this paper sent the full flag back to the sending logic, which meant that the
sending logic had to use the full flag to generate the winc signal (used to enable memory writes) using
combinational logic. The updated version of this FIFO design shows that the full signal is also sent to the FIFO
memory to help determine if the memory should be written. This modification allows the full signal in the FIFO
design and the winc signal from the sending logic to both be registered, which is a good design and synthesis
coding practice, plus it simplifies the sending logic required to generate the winc signal. The updated block
diagram can be seen in Figure 5.
Full flag testing - the full flag testing and generation as described in section 5.2 has been simplified. The
simplification came from the realization that the 2nd MSB did not have to be generated using an exclusive-or
operation, but that the inverse 2nd MSB bits could be tested directly and the three conditions that are necessary to
detect full are easily described and implemented.
Errata - A colleague, Mario da Costa, pointed out that the combinational sensitivity lists contained in both the
wptr_full and rptr_empty logic listings were missing the wfull and rempty signals respectively. Mario
was correct, but the pointers in this version of the paper have been replaced with Gray code counter style #2
pointers, which also fixes the problems present in earlier versions of this paper.
Binary pointers to address memory - binary pointers can be safely used to address the FIFO memory buffer and
since the binary values are readily available in the wptr_full and rptr_empty logic blocks, they have been
used to address memory without conversion to Gray codes.
Naming convention - the naming convention for the pointers synchronized into the opposite clock domains was
somewhat confusing to many of my Advanced Verilog Students, so I changed the naming convention to make the
intent more clear. For example, the wptr synchronized into the rclk domain was rwptr, but the new name
reflects the double synchronization more clearly by using the name rq2_wptr, etc.
Verilog-2001 coding style - many examples were updated with more efficient Verilog-2001 coding styles.
Errata - A colleague, Zenja Chao, pointed out labeling errors in Figure 3. The figure referenced ptr[n:0] and
called it an n-bit pointer. The appropriate labels have been corrected. Zenja also found a typo in the second to last
paragraph of section 5.3. When the FIFO is empty, the FIFO will quit reading (instead of writing). Zenja also found
a typo in version 1.1 of this paper in the third paragraph after Figure 5 (rwptr2 should be wrptr2). This code
was fixed by replacing the full-testing functionality, which now does not require inversion of the second MSB of
either pointer.

References
[1] Clifford E. Cummings, “Synthesis and Scripting Techniques for Designing Multi-Asynchronous Clock Designs,” SNUG
2001 (Synopsys Users Group Conference, San Jose, CA, 2001) User Papers, March 2001, Section MC1, 3rd paper. Also
available at www.sunburst-design.com/papers
[2] Clifford E. Cummings and Don Mills, “Synchronous Resets? Asynchronous Resets? I am so confused! How will I ever
know which to use?,” SNUG 2002 (Synopsys Users Group Conference, San Jose, CA, 2002) User Papers, March 2002,
Section TB2, 1st paper. Also available at www.sunburst-design.com/papers
[3] Clifford E. Cummings and Peter Alfke, “Simulation and Synthesis Techniques for Asynchronous FIFO Design with
Asynchronous Pointer Comparisons,” SNUG 2002 (Synopsys Users Group Conference, San Jose, CA, 2002) User Papers,
March 2002, Section TB2, 3rd paper. Also available at www.sunburst-design.com/papers
[4] Dinesh Tyagi, former CAE Manager for Synopsys DesignWare product, personal communication
[5] Edward Paluch, personal communication
[6] Frank Gray, "Pulse Code Communication." United States Patent Number 2,632,058. March 17, 1953.
[7] John O’Malley, Introduction to the Digital Computer, Holt, Rinehart and Winston, Inc., 1972, pg. 190.
[9] Synopsys SolvNet, Doc Name: DesignWare-110.html, “Functional Bugs in DesignWare Components,” Updated:
11/30/2000


Simulation and Synthesis Techniques for Asynchronous FIFO Design with Asynchronous Pointer Comparisons
SNUG-2002
San Jose, CA
Voted Best Paper
1st Place
Clifford E. Cummings, Sunburst Design, Inc.
Peter Alfke, Xilinx, Inc.
ABSTRACT
An interesting technique for doing FIFO design is to perform asynchronous comparisons between the FIFO write
and read pointers that are generated in clock domains that are asynchronous to each other. The asynchronous FIFO
pointer comparison technique uses fewer synchronization flip-flops to build the FIFO. The asynchronous FIFO
comparison method requires additional techniques to correctly synthesize and analyze the design, which are detailed
in this paper.
To increase the speed of the FIFO, this design uses combined binary/Gray counters that take advantage of the built-
in binary ripple carry logic.
The fully coded, synthesized and analyzed RTL Verilog model (FIFO Style #2) is included.
This FIFO design paper builds on information already presented in another FIFO design paper where the FIFO
pointers are synchronized into the opposite clock domain before running "FIFO full" or "FIFO empty" tests. The
reader may benefit from first reviewing the FIFO Style #1 method before proceeding to this FIFO Style #2 method.
Post-SNUG Editorial Comment (by Cliff Cummings)

Although this paper was voted “Best Paper - 1st Place” by SNUG attendees, it builds on another FIFO paper (the
FIFO1 paper), listed as reference [1]. That paper laid the foundation for some of the content of this paper;
therefore, it is highly recommended that readers download and read the FIFO1 paper[1] to acquire background
information already assumed to be known by the reader of this paper.
1.0 Introduction
An asynchronous FIFO refers to a FIFO design where data values are written sequentially into a FIFO buffer using
one clock domain, and the data values are sequentially read from the same FIFO buffer using another clock domain,
where the two clock domains are asynchronous to each other.
One common technique for designing an asynchronous FIFO is to use Gray[4] code pointers that are synchronized
into the opposite clock domain before generating synchronous FIFO full or empty status signals[1]. An interesting
and different approach to FIFO full and empty generation is to do an asynchronous comparison of the pointers and
then asynchronously set the full or empty status bits[6].
This paper discusses the FIFO design style with asynchronous pointer comparison and asynchronous full and empty
generation. Important details relating to this style of asynchronous FIFO design are included. The FIFO style
implemented in this paper uses efficient Gray code counters, whose implementation is described in the next section.
2.0 Gray code counter - style #2

One Gray code counter style uses a single set of flip-flops as the Gray code register with accompanying Gray-to-
binary conversion, binary increment, and binary-to-Gray conversion[1].
A second Gray code counter style, the one described in this paper, uses two sets of registers, one a binary counter
and a second to capture a binary-to-Gray converted value. The intent of this Gray code counter style #2 is to utilize
the binary carry structure, simplify the Gray-to-binary conversion, reduce combinational logic, and increase the
upper frequency limit of the Gray code counter.
The binary counter conditionally increments the binary value, which is passed to both the inputs of the binary
counter as the next-binary-count value, and is also passed to the simple binary-to-Gray conversion logic, consisting
of one 2-input XOR gate per bit position. The converted binary value is the next Gray-count value and drives the
Gray code register inputs.
Figure 1 shows the block diagram for an n-bit Gray code counter (style #2).
Figure 1 - Dual n-bit Gray code counter style #2
SNUG San Jose 2002 2 Simulation and Synthesis Techniques for Asynchronous
Rev 1.2 FIFO Design with Asynchronous Pointer Comparisons
This implementation requires twice the number of flip-flops, but reduces the combinatorial logic and can operate at
a higher frequency. In FPGA designs, availability of extra flip-flops is rarely a problem since FPGAs typically
contain far more flip-flops than any design will ever use. In FPGA designs, reducing the amount of combinational
logic frequently translates into significant improvements in speed.
The ptr output of the block diagram in Figure 1 is an n-bit Gray code pointer.
Note: since the MSB of a binary sequence is equal to the MSB of a Gray code sequence, this design can be further
simplified by using the binary MSB flip-flop as the Gray code MSB flip-flop, which would save one flip-flop per
pointer. The Verilog code in this paper does not implement this additional optimization.
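The counter structure described in this section can be sketched as a small stand-alone module. This is only an illustrative sketch: the module name graycntr2 and the port names inc, bin, clk and rst_n are assumptions made for this example, and the reset style is likewise assumed. The same register-pair arrangement is used for the read pointer in section 5.4.

```verilog
// Sketch only: graycntr2 and its port names are hypothetical,
// chosen for illustration of Gray code counter style #2.
module graycntr2 (ptr, bin, inc, clk, rst_n);
  parameter SIZE = 4;
  output [SIZE-1:0] ptr;          // registered Gray code count
  output [SIZE-1:0] bin;          // registered binary count
  input             inc, clk, rst_n;
  reg    [SIZE-1:0] ptr, bin;
  wire   [SIZE-1:0] bnext, gnext;

  assign bnext = bin + inc;             // conditional binary increment
  assign gnext = (bnext >> 1) ^ bnext;  // binary-to-Gray: one XOR per bit

  // Both register sets are loaded on the same clock edge
  always @(posedge clk or negedge rst_n)
    if (!rst_n) {bin, ptr} <= 0;
    else        {bin, ptr} <= {bnext, gnext};
endmodule
```

Because the Gray value is formed from the next binary value rather than from the Gray register itself, no Gray-to-binary conversion sits in the increment path.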
3.0 Full & empty detection

As with any FIFO design, correct implementation of full and empty is the most difficult part of the design.
There are two problems with the generation of full and empty:
First, both full and empty are indicated by the fact that the read and write pointers are identical. Therefore,
something else has to distinguish between full and empty. One known solution to this problem appends an extra bit
to both pointers and then compares the extra bit for equality (for FIFO empty) or inequality (for FIFO full), along
with equality of the other read and write pointer bits[1].
Another solution, the one described in this paper, divides the address space into four quadrants and decodes the two
MSBs of the two counters to determine whether the FIFO was going full or going empty at the time the two pointers
became equal.
Figure 2 - FIFO is going full because the wptr trails the rptr by one quadrant
If the write pointer is one quadrant behind the read pointer, this indicates a "possibly going full" situation as shown
in Figure 2. When this condition occurs, the direction latch of Figure 4 is set.
Figure 3 - FIFO is going empty because the rptr trails the wptr by one quadrant
If the write pointer is one quadrant ahead of the read pointer, this indicates a "possibly going empty" situation as
shown in Figure 3. When this condition occurs, the direction latch of Figure 4 is cleared.
Figure 4 - FIFO direction quadrant detection circuitry
When the FIFO is reset the direction latch is also cleared to indicate that the FIFO “is going empty” (actually, it
is empty when both pointers are reset). Setting and resetting the direction latch is not timing-critical, and the
direction latch eliminates the ambiguity of the address identity decoder.
In a Xilinx FPGA, the decoding of the two wptr MSBs and the two rptr MSBs is easily implemented as two 4-input
look-up tables.
The second, and more difficult, problem stems from the asynchronous nature of the write and read clocks.
Comparing two counters that are clocked asynchronously can lead to unreliable decoding spikes when either or both
counters change multiple bits more or less simultaneously. The solution described in this paper uses a Gray count
sequence, where only one bit changes from any count to the next. Any decoder or comparator will then switch only
from one valid output to the next one, with no danger of spurious decoding glitches.
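The single-bit-change property that this argument relies on can be checked with a short simulation. The following is a minimal sketch, assuming a 4-bit count and the usual binary-to-Gray conversion (b>>1)^b; the testbench name and all signal names are hypothetical. It steps through every count, including the wrap from the last count back to zero, and counts how many Gray bits differ between consecutive values.

```verilog
// Sketch only: gray_onebit_tb and its variable names are hypothetical.
module gray_onebit_tb;
  parameter N = 4;
  reg [N-1:0] b, g, gprev;
  integer i, k, nbits, errors;

  initial begin
    errors = 0;
    gprev  = 0;                          // Gray code of binary 0
    for (i = 1; i <= (1<<N); i = i + 1) begin
      b = i;                             // truncates to N bits, so the last step wraps to 0
      g = (b >> 1) ^ b;                  // binary-to-Gray conversion
      nbits = 0;
      for (k = 0; k < N; k = k + 1)      // count differing bit positions
        if (g[k] ^ gprev[k]) nbits = nbits + 1;
      if (nbits != 1) errors = errors + 1;
      gprev = g;
    end
    if (errors == 0) $display("PASS: exactly one bit changes per count");
    else             $display("FAIL: %0d transitions changed more than one bit", errors);
    $finish;
  end
endmodule
```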
4.0 FIFO style #2

For the purposes of this paper, FIFO style #1 refers to a FIFO implementation style that synchronizes pointers from
one clock domain to another before generating full and empty flags [1].
The FIFO style described in this paper (FIFO style #2) does asynchronous comparison between Gray code pointers
to generate an asynchronous control signal to set and reset the full and empty flip-flops.
The block diagram for FIFO style #2 is shown in Figure 5.
Figure 5 - FIFO2 partitioning with asynchronous pointer comparison logic
To facilitate static timing analysis of the style #2 FIFO design, the design has been partitioned into the following
five Verilog modules with the following functionality and clock domains:
• fifo2.v - (see Example 1 in section 5.1) - this is the top-level wrapper-module that includes all clock
domains. The top module is only used as a wrapper to instantiate all of the other FIFO modules used in the
design. If this FIFO is used as part of a larger ASIC or FPGA design, this top-level wrapper would probably be
discarded to permit grouping of the other FIFO modules into their respective clock domains for improved
synthesis and static timing analysis.
• fifomem.v - (see Example 2 in section 5.2) - this is the FIFO memory buffer that is accessed by both the
write and read clock domains. This buffer is most likely an instantiated, synchronous dual-port RAM. Other
memory styles can be adapted to function as the FIFO buffer.
• async_cmp.v - (see Example 3 in section 5.3) - this is an asynchronous pointer-comparison module that is
used to generate signals that control assertion of the asynchronous “full” and “empty” status bits. This module
only contains combinational comparison logic. No sequential logic is included in this module.
• rptr_empty.v - (see Example 4 in section 5.4) - this module is mostly synchronous to the read-clock
domain and contains the FIFO read pointer and empty-flag logic. Assertion of the aempty_n signal (an input
to this module) is synchronous to the rclk-domain, since aempty_n can only be asserted when the rptr
incremented, but de-assertion of the aempty_n signal happens when the wptr increments, which is
asynchronous to rclk.
• wptr_full.v - (see Example 5 in section 5.5) - this module is mostly synchronous to the write-clock domain
and contains the FIFO write pointer and full-flag logic. Assertion of the afull_n signal (an input to this
module) is synchronous to the wclk-domain, since afull_n can only be asserted when the wptr
incremented (and wrst_n), but de-assertion of the afull_n signal happens when the rptr increments,
which is asynchronous to wclk.
5.0 RTL code for FIFO style #2
The Verilog RTL code for the FIFO style #2 model is listed in this section.
5.1 fifo2.v - FIFO top-level module
The fifo2 top-level module is a parameterized module with all sub-blocks instantiated following safe coding
practices using named port connections.
module fifo2 (rdata, wfull, rempty, wdata,
              winc, wclk, wrst_n, rinc, rclk, rrst_n);
  parameter DSIZE = 8;
  parameter ASIZE = 4;
  output [DSIZE-1:0] rdata;
  output             wfull;
  output             rempty;
  input  [DSIZE-1:0] wdata;
  input              winc, wclk, wrst_n;
  input              rinc, rclk, rrst_n;
  wire   [ASIZE-1:0] wptr, rptr;
  wire   [ASIZE-1:0] waddr, raddr;
  async_cmp #(ASIZE) async_cmp
    (.aempty_n(aempty_n), .afull_n(afull_n),
     .wptr(wptr), .rptr(rptr), .wrst_n(wrst_n));
  fifomem #(DSIZE, ASIZE) fifomem
    (.rdata(rdata), .wdata(wdata),
     .waddr(wptr), .raddr(rptr),
     .wclken(winc), .wclk(wclk));
  rptr_empty #(ASIZE) rptr_empty
    (.rempty(rempty), .rptr(rptr),
     .aempty_n(aempty_n), .rinc(rinc),
     .rclk(rclk), .rrst_n(rrst_n));
  wptr_full #(ASIZE) wptr_full
    (.wfull(wfull), .wptr(wptr),
     .afull_n(afull_n), .winc(winc),
     .wclk(wclk), .wrst_n(wrst_n));
endmodule
Example 1 - Top-level Verilog code for the FIFO style #2 design
5.2 fifomem.v - FIFO memory buffer
The FIFO memory buffer could be an instantiated ASIC or FPGA dual-port, synchronous memory device. The
memory buffer could also be synthesized to ASIC or FPGA registers using the RTL code in this module.
If a vendor RAM is instantiated, it is highly recommended that the instantiation be done using named port
connections.
module fifomem (rdata, wdata, waddr, raddr, wclken, wclk);
  parameter DATASIZE = 8;          // Memory data word width
  parameter ADDRSIZE = 4;          // Number of memory address bits
  parameter DEPTH = 1<<ADDRSIZE;   // DEPTH = 2**ADDRSIZE
  output [DATASIZE-1:0] rdata;
  input  [DATASIZE-1:0] wdata;
  input  [ADDRSIZE-1:0] waddr, raddr;
  input                 wclken, wclk;
  `ifdef VENDORRAM
    // instantiation of a vendor's dual-port RAM
    // (port names are placeholders for the vendor's actual ports)
    VENDOR_RAM MEM (.dout(rdata), .din(wdata),
                    .waddr(waddr), .raddr(raddr),
                    .wclken(wclken), .clk(wclk));
  `else
    reg [DATASIZE-1:0] MEM [0:DEPTH-1];
    assign rdata = MEM[raddr];
    always @(posedge wclk)
      if (wclken) MEM[waddr] <= wdata;
  `endif
endmodule
Example 2 - Verilog RTL code for the FIFO buffer memory array
5.3 async_cmp.v - Asynchronous full/empty comparison logic
The logic used to determine the full or empty status on the FIFO is the most distinctive difference between FIFO
style #1 and FIFO style #2.
Async_cmp is an asynchronous comparison module, used to compare the read and write pointers to detect full and
empty conditions.
module async_cmp (aempty_n, afull_n, wptr, rptr, wrst_n);
  parameter ADDRSIZE = 4;
  parameter N = ADDRSIZE-1;
  output       aempty_n, afull_n;
  input  [N:0] wptr, rptr;
  input        wrst_n;
  reg          direction;
  wire         high = 1'b1;
  wire dirset_n = ~( (wptr[N]^rptr[N-1]) & ~(wptr[N-1]^rptr[N]));
  wire dirclr_n = ~((~(wptr[N]^rptr[N-1]) & (wptr[N-1]^rptr[N])) |
                    ~wrst_n);
  always @(posedge high or negedge dirset_n or negedge dirclr_n)
    if      (!dirclr_n) direction <= 1'b0;
    else if (!dirset_n) direction <= 1'b1;
    else                direction <= high;
  //always @(negedge dirset_n or negedge dirclr_n)
  //if (!dirclr_n) direction <= 1'b0;
  //else           direction <= 1'b1;
  assign aempty_n = ~((wptr == rptr) && !direction);
  assign afull_n  = ~((wptr == rptr) && direction);
endmodule
Example 3 - Verilog RTL code for the asynchronous comparator module
Three of the last seven lines of the Verilog code of Example 3 have been commented out in this model. In theory, a
synthesis tool should be capable of inferring an RS-flip-flop from the comment-removed code, but the LSI_10K
library that is included with the default installation of the Synopsys tools did not infer a correct RS-flip-flop with
this code when tested, so the always block immediately preceding the commented code was added to infer an RS-
flip-flop.
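To see how the dirset_n and dirclr_n expressions of Example 3 respond to the quadrant positions of the two pointers, the decode can be tabulated in simulation. The sketch below is illustrative only: the testbench name and the signals w, r, dirset and dirclr are hypothetical, the two signals are written high-true for readability, and the wrst_n term of dirclr_n is left out. Here w and r stand for the two MSBs of wptr and rptr.

```verilog
// Sketch only: quadrant_decode_tb and its signal names are hypothetical.
// w = {wptr[N], wptr[N-1]}, r = {rptr[N], rptr[N-1]} (Gray coded MSBs).
module quadrant_decode_tb;
  reg  [1:0] w, r;
  // High-true forms of the Example 3 decode, without the reset term
  wire dirset =  (w[1]^r[0]) & ~(w[0]^r[1]);  // wptr one quadrant behind rptr
  wire dirclr = ~(w[1]^r[0]) &  (w[0]^r[1]);  // wptr one quadrant ahead of rptr
  integer i;

  initial begin
    $display("wptr rptr | dirset dirclr");
    for (i = 0; i < 16; i = i + 1) begin
      {w, r} = i;                 // step through all MSB combinations
      #1 $display(" %b   %b  |   %b      %b", w, r, dirset, dirclr);
    end
    $finish;
  end
endmodule
```

Since the Gray code quadrant order is 00, 01, 11, 10, the rows with w=00 and r=01 assert dirset, and the rows with w=01 and r=00 assert dirclr. Neither signal is asserted when the pointers are in the same quadrant or in opposite quadrants.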
5.3.1 Asynchronous generation of full and empty
In the async_cmp code of Example 3, and shown in Figure 6, aempty_n and afull_n are the asynchronously
decoded signals. The aempty_n signal is asserted on the rising edge of an rclk, but is de-asserted on the rising
edge of a wclk. Similarly, the afull_n signal is asserted on a wclk and removed on an rclk.
The empty signal will be used to stop the next read operation, and the leading edge of aempty_n is properly
synchronous with the read clock, but the trailing edge needs to be synchronized to the read clock. This is done in a
two-stage synchronizer that generates rempty.
The wfull signal is generated in the symmetrically equivalent way.
Figure 6 - Asynchronous pointer comparison to assert full and empty
5.3.2 Resetting the FIFO
The first FIFO event of interest takes place on a FIFO-reset operation. When the FIFO is reset, four important
things happen within the async_cmp module and accompanying full and empty synchronizers of the wptr_full
and rptr_empty modules (the connections between the async_cmp, wptr_full and rptr_empty modules
are shown in Figure 7):
1. The reset signal directly clears the wfull flag. The rempty flag is not cleared by a reset.
2. The reset signal clears both FIFO pointers, so the pointer comparator asserts that the pointers are equal.
3. The reset clears the direction bit.
4. With the pointers equal and the direction bit cleared, the aempty_n bit is asserted, which presets the
rempty flag.
5.3.3 FIFO-writes & FIFO full

The second FIFO operational event of interest takes place when a FIFO-write operation takes place and the wptr is
incremented. At this point, the FIFO pointers are no longer equal so the aempty_n signal is de-asserted, releasing
the preset control of the rempty flip-flops. After two rising edges on rclk, the FIFO will de-assert the rempty
signal. Because the de-assertion of aempty_n happens on a rising wclk and because the rempty signal is
clocked by the rclk, the two-flip-flop synchronizer as shown in Figure 8 is required to remove metastability that
could be generated by the first rempty flip-flop.
Also of interest is the point at which the wptr increments into the next Gray code quadrant beyond the rptr (see
section 3.0 for a discussion of Gray code quadrants). The direction bit is cleared (but it was already clear).
Figure 7 - async_cmp module connection to rptr_empty and wptr_full modules
The third FIFO operational event of interest occurs when the wptr is within one quadrant of catching up to the
rptr as described in section 3.0. When this happens, the dirset_n bit of Figure 6 is asserted low, which sets
the direction bit high. This means that the direction bit is set long before the FIFO is full and is not timing-
critical to assertion of the afull_n signal.
The fourth FIFO operational event of interest is when the wptr catches up to the rptr (and the direction bit is
set). When this happens, the afull_n signal presets the wfull flip-flops. The afull_n signal is asserted on a
FIFO-write operation and is synchronous to the rising edge of the wclk; therefore, asserting full is synchronous to
the wclk. See section 5.3.6 for a discussion of the critical timing path associated with assertion of the wfull
signal.
The fifth FIFO operational event of interest is when a FIFO-read operation takes place and the rptr is
incremented. At this point, the FIFO pointers are no longer equal so the afull_n signal is de-asserted, releasing
the preset control of the wfull flip-flops. After two rising edges on wclk, the FIFO will de-assert the wfull
signal. Because the de-assertion of afull_n happens on a rising rclk and because the wfull signal is clocked
by the wclk, the two-flip-flop synchronizer, shown in Figure 8, is required to remove metastability that could be
generated by the first wfull flip-flop capturing the inverted and asynchronously generated afull_n data input.
Figure 8 - Asynchronous empty and full generation

During operation, wfull is generated synchronous to the write clock, in much the same way that rempty is generated
synchronous to the read clock. The afull_n signal is asserted as a result of a write clock, and the leading (falling)
edge is thus naturally synchronous to the write clock. The trailing (rising) edge is, however, caused by the read
clock and must, therefore, be synchronized to the write clock. The same timing issues related to the setting of the
full flag also apply to the setting of the empty flag.
5.3.4 FIFO-reads & FIFO empty
The sixth FIFO operational event of interest takes place when the rptr increments into the next Gray code
quadrant beyond the wptr. The direction bit is again set (but it was already set).
The seventh FIFO operational event of interest occurs when the rptr is within one quadrant of catching up to the
wptr. When this happens, the dirrst bit of Figure 6 is asserted high, which clears the direction bit. This
means that the direction bit is cleared long before the FIFO is empty and is not timing-critical to assertion of the
aempty_n signal.
The eighth FIFO operational event of interest is when the rptr catches up to the wptr (and the direction bit is
zero). When this happens, the aempty_n signal presets the rempty flip-flops. The aempty_n signal is asserted
on a FIFO-read operation and is synchronous to the rising edge of the rclk; therefore, asserting empty is
synchronous to the rclk. See section 5.3.6 for a discussion of the critical timing path associated with assertion of
the rempty signal.
Finally, a FIFO-write operation takes place and the wptr is incremented. At this point, the FIFO pointers are
no longer equal so the aempty_n signal is de-asserted, releasing the preset control of the rempty flip-flops. After
two rising edges on rclk, the FIFO will de-assert the rempty signal. Because the de-assertion of aempty_n
happens on a rising wclk and because the rempty signal is clocked by the rclk, the two-flip-flop synchronizer
as shown in Figure 8 is required to remove metastability that could be generated by the first rempty flip-flop.
5.3.5 Alternate method to preset the full & empty flags
Figure 9 - Self-timed preset assertion circuit

Another method for setting the rempty or wfull flags is to use a self-timed differentiating circuit as shown in
Figure 9. In this figure, the flip-flops are shown with high-true presets, similar to what is found on Xilinx FPGAs
(equivalent circuitry could also be designed using low-true presets). When the aempty signal goes high, the
rempty output flip-flop is preset and, assuming that the signal between the flip-flops was low, this signal combined
with aempty-high will drive the output of the AND gate high and set the first flip-flop. When the first flip-flop is set,
the AND gate will quit driving the preset signal to the first flip-flop. This is a self-timed preset signal that releases
preset immediately after preset occurs, well before the aempty signal goes low.
5.3.6 Full and empty critical timing paths

Using the asynchronous comparison technique described in this paper, there are critical timing paths associated with
the generation of both the rempty and wfull signals.
The rempty critical timing path, shown in Figure 10, consists of (1) the rclk-to-q incrementing of the rptr, (2)
comparison logic of the rptr to the wptr, (3) combining the comparator output with the direction latch output to
generate the aempty_n signal, (4) presetting the rempty signal, (5) any logic that is driven by the rempty
signal, and (6) resultant signals meeting the setup time of any down-stream flip-flops clocked within the
rclk domain. This critical timing path has a symmetrically equivalent critical timing path for the generation of the
wfull signal, also shown in Figure 10.
Figure 10 - Critical timing paths for asserting rempty and wfull
5.3.7 Asynchronous concerns, questions and answers

While writing this paper, the authors asked and answered numerous questions to address concerns over the highly
asynchronous nature of the generation and removal of the full and empty bits for the FIFO style described in this
paper. This section captures a number of the questions, concerns and answers that led both authors to believe this
coding style does indeed work.
Generation of the aempty_n control signal is straightforward. Whenever the read pointer (rptr) equals the write
pointer (wptr), and the direction latch is clear, the FIFO is empty.
The empty flag is used only in the read clock domain and since the read pointer, incremented by a read clock,
causes the empty flag to be set, assertion of the empty flag is always synchronous in the read clock domain. As long
as the empty flag meets the critical empty-assertion timing path described in section 5.3.6, there are no
synchronization problems associated with asserting the empty flag.
The de-assertion of aempty_n is caused by the write clock incrementing the write pointer, and is thus unrelated to
the read clock. The de-assertion of aempty_n must, therefore, be synchronized in a dual-flip-flop synchronizer,
clocked by the read clock. The first flip-flop is subject to metastability but the second flip-flop is included to wait
for the metastability to subside, just like any other multi-clock synchronizer[2].
Since aempty_n is started by one clock and terminated by the other, it has an undefined duration, and might even
be a runt pulse. A runt pulse is a Low-High-Low signal transition where the transition to High may or may not pass
through the logic-“1” threshold level of the logic family being used.
If the aempty_n control signal is a runt pulse, there are four possible scenarios that should be addressed:
(1) The runt pulse is not recognized by the rempty flip-flops and empty is not asserted. This is not a problem.
(2) The runt pulse might preset the first synchronizer flip-flop, but not the second flip-flop. This is highly unlikely,
but would result in an unnecessary, but properly synchronized rempty output, that will show up on the output
of the second flip-flop one read clock later. This is not a problem.
(3) The runt pulse might preset the second synchronizer flip-flop, but not the first flip-flop. This is highly unlikely,
but would result in an unnecessary, but properly synchronized rempty output (as long as the empty critical
timing is met), that will be set on the output of the second flip-flop until the next read clock, when it will be
cleared by the zero from the first flip-flop. This is not a problem.
(4) The most likely case is that the runt pulse sets both flip-flops, thus creating a properly synchronized rempty
output that is two read-clock periods long. The longer duration is caused by the two-flip-flop synchronizer (to
avoid metastable problems as described below). This is not a problem.
The runt pulse cannot have any effect on the synchronizer data-input, since an aempty_n runt pulse can only
occur immediately after a read clock edge, thus long before the next read clock edge (as long as critical timing is
met).
The aempty_n signal might also stay high longer and go low at any moment, even perhaps coincident with the
next read clock edge. If it goes low well before the set-up time of the first synchronizer flip-flop, the result is like
scenario (4) above. If it goes low well after the set-up time, the synchronizer will stretch rempty by one more read
clock period.
If aempty_n goes low within the metastability-catching set-up time window, the first synchronizer flip-flop output
will be indeterminate for a few nanoseconds, but will then be either high or low. In either case, the output of the
second synchronizer flip-flop will create the appropriate synchronized rempty output.
The next question is, what happens if the write clock de-asserts the aempty_n signal coincident with the rising
rclk on the dual synchronizer? The first flip-flop could go metastable, which is why there is a second flip-flop in
the dual synchronizer.
But the removal of the setting signal on the second flip-flop will violate the recovery time of the second flip-flop.
Will this cause the second flip-flop to go metastable? The authors do not believe this can happen because the preset
to the flip-flop forced the output high and the input to the same flip-flop is already high, which we believe is not
subject to a recovery time instability on the flip-flop.
Challenge: if anyone can prove that a flip-flop that is set high, and is also driven by a high-data-input signal, can go
metastable if the preset signal is removed coincident with the rising edge of the clock to the same flip-flop, the
authors would like to be made aware of any such claim. The authors believe that recovery time parameters are with
respect to removing a preset when the data input value is zero. The authors could not find any published reference
to discount the possibility of metastability on the output of the second flip-flop but we believe that metastability in
this case is not possible.
Last question. Can a runt-preset pulse, where the trailing edge of the runt pulse is caused by the wclk, preset the
second synchronizer flip-flop in close proximity to a rising rclk, violate the preset recovery time and cause
metastability on the output of the second flip-flop? The answer is no as long as the aempty_n critical timing path
is met. Assuming that critical timing is met, the aempty_n signal going low should occur shortly after a rising
rclk and well before the rising edge of the second flip-flop, so runt pulses can only occur well before the rising
edge of an rclk.
Again, symmetrically equivalent scenarios and arguments can be made about the generation of the wfull flag.
5.4 rptr_empty.v - Read pointer & empty generation logic
This module encloses all of the FIFO logic that is generated within the read clock domain (except synchronizers).
The read pointer is an n-bit Gray code counter. The FIFO rempty output is asserted when the aempty_n signal
goes low and the rempty output is de-asserted on the second rising rclk edge after aempty_n goes high (a rare
metastable state could cause the rempty output to be de-asserted on the third rising rclk edge). This module is
completely synchronous to the rclk for simplified static timing analysis, except for the aempty_n input, which is
de-asserted asynchronously to the rclk.
module rptr_empty (rempty, rptr, aempty_n, rinc, rclk, rrst_n);
  parameter ADDRSIZE = 4;
  output                rempty;
  output [ADDRSIZE-1:0] rptr;
  input                 aempty_n;
  input                 rinc, rclk, rrst_n;
  reg    [ADDRSIZE-1:0] rptr, rbin;
  reg                   rempty, rempty2;
  wire   [ADDRSIZE-1:0] rgnext, rbnext;
  //---------------------------------------------------------------
  // binary and Gray-code read pointer registers
  //---------------------------------------------------------------
  always @(posedge rclk or negedge rrst_n)
    if (!rrst_n) begin
      rbin <= 0;
      rptr <= 0;
    end
    else begin
      rbin <= rbnext;
      rptr <= rgnext;
    end
  //---------------------------------------------------------------
  // increment the binary count if not empty
  //---------------------------------------------------------------
  assign rbnext = !rempty ? rbin + rinc : rbin;
  assign rgnext = (rbnext>>1) ^ rbnext; // binary-to-gray conversion
  // rempty is preset asynchronously by aempty_n and released through two flip-flops
  always @(posedge rclk or negedge aempty_n)
    if (!aempty_n) {rempty,rempty2} <= 2'b11;
    else           {rempty,rempty2} <= {rempty2,~aempty_n};
endmodule
Example 4 - Verilog RTL code for the read pointer and empty flag logic
The last always block in this module is the asynchronously preset rempty signal generation. The presetting signal
is the aempty_n input, which is asserted when the rptr is incremented by the rclk (synchronous to this block)
as long as the rempty critical timing path (described in section 5.3.6) is satisfied. Removal of the rempty signal
occurs when the write pointer increments, which is asynchronous to the rclk domain. Because preset removal is
asynchronous to the rclk domain, a two-flip-flop synchronizer is required to synchronize aempty_n removal to
the rclk domain.
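The assert-immediately, release-after-two-clocks behavior described above can be exercised with a small self-checking testbench. The following is only a sketch and is not part of the original design: the module name preset_sync_demo, the testbench, and all delay values are hypothetical, and only the two-flip-flop always block is taken from Example 4.

```verilog
`timescale 1ns/1ns
// Hypothetical stand-alone copy of the rempty flag logic from Example 4.
module preset_sync_demo (rempty, aempty_n, rclk);
  output rempty;
  input  aempty_n, rclk;
  reg    rempty, rempty2;

  // aempty_n low presets both stages at once; after aempty_n goes high,
  // the inactive level shifts through the two flip-flops.
  always @(posedge rclk or negedge aempty_n)
    if (!aempty_n) {rempty,rempty2} <= 2'b11;
    else           {rempty,rempty2} <= {rempty2,~aempty_n};
endmodule

module tb_preset_sync_demo;
  reg     aempty_n, rclk;
  wire    rempty;
  integer errors;

  preset_sync_demo dut (.rempty(rempty), .aempty_n(aempty_n), .rclk(rclk));

  initial begin rclk = 0; forever #5 rclk = ~rclk; end  // rising edges at 5, 15, 25, ...

  initial begin
    errors   = 0;
    aempty_n = 1;
    #2  aempty_n = 0;                              // t=2: preset asserted between clock edges
    #1  if (rempty !== 1'b1) errors = errors + 1;  // t=3: flag set without waiting for a clock
    #5  aempty_n = 1;                              // t=8: preset removed, away from any rising edge
    #8  if (rempty !== 1'b1) errors = errors + 1;  // t=16: first rising edge (t=15) has passed, still set
    #10 if (rempty !== 1'b0) errors = errors + 1;  // t=26: second rising edge (t=25) clears the flag
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d errors", errors);
    $finish;
  end
endmodule
```

The testbench deliberately removes aempty_n well away from the rising rclk edges, so it shows only the nominal two-edge release and not the rare metastable case mentioned in section 5.4.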
5.5 wptr_full.v - Write pointer & full generation logic
This module encloses all of the FIFO logic that is generated within the write clock domain (except synchronizers).
The write pointer is an n-bit Gray code counter. The FIFO wfull output is asserted when the afull_n signal
goes low and the wfull output is de-asserted on the second rising wclk edge after afull_n goes high (a rare
metastable state could cause the wfull output to be de-asserted on the third rising wclk edge). This module is
completely synchronous to the wclk for simplified static timing analysis, except for the afull_n input, which is
de-asserted asynchronously to the wclk.
module wptr_full (wfull, wptr, afull_n, winc, wclk, wrst_n);
  parameter ADDRSIZE = 4;
  output                wfull;
  output [ADDRSIZE-1:0] wptr;
  input                 afull_n;
  input                 winc, wclk, wrst_n;
  reg    [ADDRSIZE-1:0] wptr, wbin;
  reg                   wfull, wfull2;
  wire   [ADDRSIZE-1:0] wgnext, wbnext;
  //---------------------------------------------------------------
  // binary and Gray-code write pointer registers
  //---------------------------------------------------------------
  always @(posedge wclk or negedge wrst_n)
    if (!wrst_n) begin
      wbin <= 0;
      wptr <= 0;
    end
    else begin
      wbin <= wbnext;
      wptr <= wgnext;
    end
  //---------------------------------------------------------------
  // increment the binary count if not full
  //---------------------------------------------------------------
  assign wbnext = !wfull ? wbin + winc : wbin;
  assign wgnext = (wbnext>>1) ^ wbnext; // binary-to-gray conversion
  // wfull is cleared by reset, preset asynchronously by afull_n and released through two flip-flops
  always @(posedge wclk or negedge wrst_n or negedge afull_n)
    if      (!wrst_n ) {wfull,wfull2} <= 2'b00;
    else if (!afull_n) {wfull,wfull2} <= 2'b11;
    else               {wfull,wfull2} <= {wfull2,~afull_n};
endmodule
Example 5 - Verilog RTL code for the write pointer and full flag logic
The last always block in this module is the asynchronously preset wfull signal generation. The presetting signal is
the afull_n input, which is asserted when the wptr is incremented by the wclk (synchronous to this block) as
long as the wfull critical timing path (described in section 5.3.6) is satisfied. Removal of the wfull signal occurs
when the read pointer increments, which is asynchronous to the wclk domain. Because preset removal is
asynchronous to the wclk domain, a two-flip-flop synchronizer is required to synchronize afull_n removal to
the wclk domain. The wfull signal must also go low when the FIFO is reset.
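Both pointer modules rely on the binary-to-Gray conversion (bnext>>1) ^ bnext. As a quick sanity check of why this is safe to sample in another clock domain, the sketch below steps through every count and confirms that successive Gray values differ in exactly one bit, including the wrap from the last count back to zero. The testbench name tb_bin2gray and the width N are hypothetical choices made for illustration.

```verilog
// Hypothetical check of the binary-to-Gray conversion used for rgnext and wgnext.
module tb_bin2gray;
  parameter N = 4;                       // assumed pointer width
  reg [N-1:0] bin, gray, prev, diff;
  integer     i, k, ones, errors;

  initial begin
    errors = 0;
    bin    = {N{1'b1}};                  // last count, to cover the wrap-around step
    prev   = (bin>>1) ^ bin;
    for (i = 0; i < (1<<N); i = i + 1) begin
      bin  = i;
      gray = (bin>>1) ^ bin;             // same conversion as in Examples 4 and 5
      diff = gray ^ prev;                // bits that changed since the previous count
      ones = 0;
      for (k = 0; k < N; k = k + 1)
        ones = ones + diff[k];
      if (ones != 1) errors = errors + 1;
      prev = gray;
    end
    if (errors == 0) $display("PASS: one bit changes per increment");
    else             $display("FAIL: %0d multi-bit steps", errors);
    $finish;
  end
endmodule
```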
6.0 Conclusion
This paper describes an efficient technique to implement a high-speed asynchronous FIFO, using dual-port RAMs
addressed by Gray counters. This design uses an asynchronous comparator for detecting full and empty status.
The technique described implements an asynchronous assertion of the full and empty flags that requires more effort
to analyze for static timing verification.
The technique described also does not have registered full and empty status flags, so care must be taken to ensure
that the generation of these flags meets the required timing to recognize assertion of full and empty in the rest of the
system.
This efficient and interesting approach to FIFO design is worthy of consideration.

7.1 Revision 1.1 (2002) - What Changed?

Additional Post-SNUG Editorial Comments (by Cliff Cummings) - Although this paper was voted “Best Paper -
1st Place” by SNUG attendees, this paper describes a FIFO design style that is generally incompatible with static
timing analysis (STA) and design-for-test (DFT). Readers should also reference the FIFO design style described by
reference [1] if a more STA-friendly and DFT-friendly style is desired.
Many of the techniques used in this paper can also be used in the FIFO1 design[1]. In particular, the “dual n-bit
counter” of the FIFO1 design can be replaced with the quadrant detection logic described in this paper. The FIFO1
Gray code counter style #1 can also be replaced with the faster Gray code counter style #2 described in this paper.
Version 1.1 was the first release version of this paper on the sunburst-design.com web page and included the Post-
SNUG Editorial Comments.

There were some minor formatting changes to this paper, along with additional changes noted below.
Full flag detection - this paper sends the full flag back to the sending logic, which means that the sending logic has
to use the full flag to generate the winc signal (used to enable memory writes) using combinational logic. An
updated version of this FIFO style #1 design[1] shows that the full signal can also be sent to the FIFO memory to
help determine if the memory should be written. This modification allows the full signal in the FIFO design and the
winc signal from the sending logic to both be registered, which is a good design and synthesis coding practice, plus
it simplifies the sending logic required to generate the winc signal. The logic in this paper does not include this
useful update and readers are encouraged to examine the changes in the FIFO style #1 paper[1].
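To make the idea concrete, a FIFO memory that qualifies its own write with the full flag might look something like the sketch below. This is an illustration only and is not copied from this paper: the module name fifomem_sketch and the parameter values are hypothetical, and the actual code is in the FIFO style #1 paper[1].

```verilog
// Hypothetical sketch: dual-port FIFO memory whose write is qualified by wfull.
module fifomem_sketch (rdata, wdata, waddr, raddr, winc, wfull, wclk);
  parameter DATASIZE = 8;               // assumed data width
  parameter ADDRSIZE = 4;               // assumed address width
  parameter DEPTH    = 1<<ADDRSIZE;     // number of memory words
  output [DATASIZE-1:0] rdata;
  input  [DATASIZE-1:0] wdata;
  input  [ADDRSIZE-1:0] waddr, raddr;
  input                 winc, wfull, wclk;
  reg    [DATASIZE-1:0] mem [0:DEPTH-1];

  assign rdata = mem[raddr];

  // winc can come straight from a register in the sending logic;
  // the full check is done here instead of in the sender.
  always @(posedge wclk)
    if (winc && !wfull) mem[waddr] <= wdata;
endmodule
```

With the full check inside the memory, the sending logic no longer needs a combinational path from wfull to winc.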
This paper uses some very asynchronous techniques to generate full and empty status flags and Cliff Cummings
does not plan to update this version of the FIFO design.
The clever quadrant techniques described in this paper have been incorporated into a new synchronous FIFO design
style and are described in a paper using SystemVerilog coding styles (easily converted to plain Verilog coding
styles) that will also shortly be available on the www.sunburst-design.com/papers web page.
Errata - A colleague, Zenja Chao, pointed out that there was a typo at the end of section 5.5. Removal of the
wfull signal should happen when the READ POINTER (rptr) increments, not when the wptr increments.
References
[1] Clifford E. Cummings, “Simulation and Synthesis Techniques for Asynchronous FIFO Design,” SNUG 2002 (Synopsys
Users Group Conference, San Jose, CA, 2002) User Papers, March 2002, Section TB2, 2nd paper. Also available at
www.sunburst-design.com/papers
[2] Clifford E. Cummings, “Synthesis and Scripting Techniques for Designing Multi-Asynchronous Clock Designs,” SNUG
2001 (Synopsys Users Group Conference, San Jose, CA, 2001) User Papers, March 2001, Section MC1, 3rd paper. Also
available at www.sunburst-design.com/papers
[3] Clifford E. Cummings and Don Mills, “Synchronous Resets? Asynchronous Resets? I am So Confused! How Will I Ever
Know Which to Use?” SNUG 2002 (Synopsys Users Group Conference, San Jose, CA, 2002) User Papers, March 2002,
Section TB2, 1st paper. Also available at www.sunburst-design.com/papers
[4] Frank Gray, "Pulse Code Communication." United States Patent Number 2,632,058. March 17, 1953.
[5] John O’Malley, Introduction to the Digital Computer, Holt, Rinehart and Winston, Inc., 1972, pg. 190.
[6] Peter Alfke, “Asynchronous FIFO in Virtex-II™ FPGAs,” Xilinx techXclusives, downloaded from
www.xilinx.com/support/techXclusives/fifo-techX18.htm

Peter Alfke, Director, Applications Engineering, Xilinx, Inc, San Jose, CA. Email address: peter.alfke@xilinx.com
Peter Alfke came to the US in 1966, with a German MSEE degree and nine years experience in digital systems and
circuit design at LM Ericsson and Litton Industries in Sweden. He has been manager, later director of applications
engineering for 34 years, at Fairchild, Zilog, AMD, and, since 1988, at Xilinx.
He holds fifteen patents, has written many Application Notes, presented at numerous design conferences, and has
given many applications-oriented seminars in the US and in Europe. He is an active participant in the best
newsgroup for FPGA users, comp.arch.fpga.
(Data accurate as of April 19th, 2002)
Synchronous Resets? Asynchronous Resets?
I am so confused!
How will I ever know which to use?
Clifford E. Cummings Don Mills
Sunburst Design, Inc. LCDM Engineering
ABSTRACT
This paper will investigate the pros and cons of synchronous and asynchronous resets. It will then look at usage of
each type of reset followed by recommendations for proper usage of each type.
This paper will also detail an interesting synchronization technique using digital calibration to synchronize reset
removal on a multi-ASIC design.
1.0 resets, Resets, RESETS, and then there’s RESETS
One cannot begin to consider a discussion of reset usage and styles without first saluting the most common reset
usage of all. This undesired reset occurs almost daily in systems that have been tested, verified, manufactured, and
integrated into the consumer, education, government, and military environments. This reset follows what is often
called “The Blue Screen of Death” resulting from software incompatibilities between the OS from a certain software
company, the software programs the OS is servicing, and the hardware on which the OS software is executing.
Why be concerned with these annoying little resets anyway? Why devote a whole paper to such a trivial subject?
Anyone who has used a PC with a certain OS loaded knows that the hardware reset comes in quite handy. It will put
the computer back to a known working state (at least temporarily) by applying a system reset to each of the chips in
the system that have or require a reset.
For individual ASICs, the primary purpose of a reset is to force the ASIC design (either behavioral, RTL, or
structural) into a known state for simulation. Once the ASIC is built, the need for the ASIC to have reset applied is
determined by the system, the application of the ASIC, and the design of the ASIC. For instance, many data path
communication ASICs are designed to synchronize to an input data stream, process the data, and then output it. If
sync is ever lost, the ASIC goes through a routine to re-acquire sync. If this type of ASIC is designed correctly, such
that all unused states point to the “start acquiring sync” state, it can function properly in a system without ever being
reset. A system reset would be required on power up for such an ASIC if the state machines in the ASIC took
advantage of “don’t care” logic reduction during the synthesis phase.
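The difference between the two cases comes down to how the unused state encodings are coded. The following is a minimal sketch of a sync-acquisition state machine with no reset, in which the unused encoding falls back to the acquiring state. The module name, signal names, and state encodings are all hypothetical and chosen only for illustration.

```verilog
// Hypothetical sync-acquisition FSM with no reset input.
module sync_fsm (in_sync, pattern_ok, sync_lost, clk);
  output in_sync;
  input  pattern_ok, sync_lost, clk;
  reg [1:0] state, next;

  parameter ACQUIRE = 2'b00,            // "start acquiring sync"
            VERIFY  = 2'b01,
            LOCKED  = 2'b10;            // encoding 2'b11 is unused

  always @(posedge clk)
    state <= next;                      // follower-style state register, no reset

  always @(state or pattern_ok or sync_lost)
    case (state)
      ACQUIRE: next = pattern_ok ? VERIFY  : ACQUIRE;
      VERIFY : next = pattern_ok ? LOCKED  : ACQUIRE;
      LOCKED : next = sync_lost  ? ACQUIRE : LOCKED;
      default: next = ACQUIRE;          // unused state returns to ACQUIRE
    endcase

  assign in_sync = (state == LOCKED);
endmodule
```

If the default branch instead assigned 2'bx so that synthesis could treat the unused encoding as a "don't care", the machine could power up in a state with no defined exit, which is the case where a power-up reset is required.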
It is the opinion of the authors that, in general, every flip-flop in an ASIC should be resettable whether or not it is
required by the system. Furthermore, the authors prefer to use asynchronous resets following the guidelines detailed
in this paper. There are exceptions to these guidelines. In some cases, when follower flip-flops (shift register
flip-flops) are used in high speed applications, reset might be eliminated from some flip-flops to achieve higher
performance designs. This type of environment requires a number of clocks during the reset active period to put the
ASIC into a known state.
Many design issues must be considered before choosing a reset strategy for an ASIC design, such as whether to use
synchronous or asynchronous resets, whether every flip-flop will receive a reset, how the reset tree will be laid out
and buffered, how to verify timing of the reset tree, how to functionally test the reset with test scan vectors, and
how to apply the reset among multiple clock zones.
In addition, when applying resets between multiple ASICs that require a specific reset release sequence, special
techniques must be employed to adjust to variances of chip and board manufacturing. The final sections of this
paper will address this latter issue.
2.0 General flip-flop coding style notes

2.1 Synchronous reset flip-flops with non reset follower flip-flops
Each Verilog procedural block or VHDL process should model only one type of flip-flop. In other words, a designer
should not mix resettable flip-flops with follower flip-flops (flops with no resets)[12]. Follower flip-flops are
flip-flops that are simple data shift registers.
In the Verilog code of Example 1a and the VHDL code of Example 1b, a flip-flop is used to capture data and then its
output is passed through a follower flip-flop. The first stage of this design is reset with a synchronous reset. The
second stage is a follower flip-flop and is not reset, but because the two flip-flops were inferred in the same
procedural block/process, the reset signal rst_n will be used as a data enable for the second flop. This coding style
will generate extraneous logic as shown in Figure 1.
SNUG San Jose 2002 2 Synchronous Resets? Asynchronous Resets?

Rev 1.1 I am so confused! How will I ever know which to use?
module badFFstyle (q2, d, clk, rst_n);
output q2;
input d, clk, rst_n;
reg q2, q1;
always @(posedge clk)

if (!rst_n) q1 <= 1'b0;
else begin
q1 <= d;
q2 <= q1;
end
endmodule
Example 1a - Bad Verilog coding style to model dissimilar flip-flops
library ieee;
use ieee.std_logic_1164.all;
entity badFFstyle is
port (
clk : in std_logic;
rst_n : in std_logic;
d : in std_logic;
q2 : out std_logic);
end badFFstyle;
architecture rtl of badFFstyle is

signal q1 : std_logic;
begin
process (clk)
begin
if (clk'event and clk = '1') then
if (rst_n = '0') then
q1 <= '0';
else
q1 <= d;
q2 <= q1;
end if;
end if;
end process;
end rtl;
Example 1b - Bad VHDL coding style to model dissimilar flip-flops
Figure 1 - Bad coding style yields a design with an unnecessary loadable flip-flop

The correct way to model a follower flip-flop is with two Verilog procedural blocks as shown in Example 2a or two
VHDL processes as shown in Example 2b. These coding styles will generate the logic shown in Figure 2.
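module goodFFstyle (q2, d, clk, rst_n);
  output q2;
  input  d, clk, rst_n;
  reg    q2, q1;

  // first stage: flip-flop with synchronous reset
  always @(posedge clk)
    if (!rst_n) q1 <= 1'b0;
    else        q1 <= d;

  // second stage: follower flip-flop with no reset
  always @(posedge clk)
    q2 <= q1;
endmodule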
Example 2a - Good Verilog coding style to model dissimilar flip-flops
library ieee;
use ieee.std_logic_1164.all;
entity goodFFstyle is
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;
    d     : in  std_logic;
    q2    : out std_logic);
end goodFFstyle;
architecture rtl of goodFFstyle is
  signal q1 : std_logic;
begin
  process (clk)
  begin
    if (clk'event and clk = '1') then
      if (rst_n = '0') then
        q1 <= '0';
      else
        q1 <= d;
      end if;
    end if;
  end process;
  process (clk)
  begin
    if (clk'event and clk = '1') then
      q2 <= q1;
    end if;
  end process;
end rtl;
Example 2b - Good VHDL coding style to model dissimilar flip-flops

Figure 2 - Two different types of flip-flops, one with synchronous reset and one without
It should be noted that the extraneous logic generated by the code in Example 1a and Example 1b is only a result of
using a synchronous reset. If an asynchronous reset approach had been used, then both coding styles would synthesize
to the same design without any extra combinational logic. The generation of different flip-flop styles is largely a
function of the sensitivity lists and if-else statements that are used in the HDL code. The sensitivity list and
if-else coding styles are described in more detail in section 3.1.
2.2 Flip-flop inference style

Each inferred flip-flop should not be independently modeled in its own procedural block/process. As a matter of
style, all inferred flip-flops of a given function or even groups of functions should be described using a single
procedural block/process. Multiple procedural blocks/processes should be used to model macro level functional
divisions within a given module/architecture. The exception to this guideline is that of follower flip-flops as
discussed in the previous section (section 2.1) where multiple procedural blocks/processes are required to efficiently
model the function itself.
2.3 Assignment operator guideline
In Verilog, all assignments made inside the always block modeling an inferred flip-flop (sequential logic) should be
made with nonblocking assignment operators[3]. Likewise, for VHDL, inferred flip-flops should be made using
signal assignments.
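The effect of this guideline can be seen in a two-stage shift register. In the sketch below (module and testbench names are hypothetical), both assignments are nonblocking, so q2 picks up the value q1 held before the clock edge and the data takes two clocks to reach the output. Had blocking assignments been used in this order, q2 would have received the new value of q1 and the simulation would show a single stage of delay.

```verilog
`timescale 1ns/1ns
// Hypothetical two-stage follower shift register coded with nonblocking assignments.
module pipe2 (q2, d, clk);
  output q2;
  input  d, clk;
  reg    q1, q2;

  always @(posedge clk) begin
    q1 <= d;     // right-hand sides are sampled before any update,
    q2 <= q1;    // so q2 receives the OLD value of q1
  end
endmodule

module tb_pipe2;
  reg  d, clk;
  wire q2;

  pipe2 dut (.q2(q2), .d(d), .clk(clk));

  initial begin clk = 0; forever #5 clk = ~clk; end  // rising edges at 5, 15, ...

  initial begin
    d = 1'b1;
    #6  if (q2 === 1'b1) $display("FAIL: data reached q2 after one clock");
    #10 if (q2 === 1'b1) $display("PASS: data reached q2 after two clocks");
        else             $display("FAIL: data did not reach q2");
    $finish;
  end
endmodule
```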
3.0 Synchronous resets

As research was conducted for this paper, a collection of ESNUG and SOLV-IT articles was gathered and reviewed.
Around 80+% of the gathered articles focused on synchronous reset issues. Many SNUG papers have been
presented in which the presenter would claim something like, “we all know that the best way to do resets in an ASIC
is to strictly use synchronous resets”, or maybe, “asynchronous resets are bad and should be avoided.” Yet, little
evidence was offered to justify these statements. There are some advantages to using synchronous resets, but there
are also disadvantages. The same is true for asynchronous resets. The designer must use the approach that is
appropriate for the design.
Synchronous resets are based on the premise that the reset signal will only affect or reset the state of the flip-flop on
the active edge of a clock. The reset can be applied to the flip-flop as part of the combinational logic generating the
d-input to the flip-flop. If this is the case, the coding style to model the reset should be an if/else priority style
with the reset in the if condition and all other combinational logic in the else section. If this style is not strictly
observed, two possible problems can occur. First, in some simulators, based on the logic equations, the logic can
block the reset from reaching the flip-flop. This is only a simulation issue, not a hardware issue, but remember, one
of the prime objectives of a reset is to put the ASIC into a known state for simulation. Second, the reset could be a
“late arriving signal” relative to the clock period, due to the high fanout of the reset tree. Even though the reset will
be buffered from a reset buffer tree, it is wise to limit the amount of logic the reset must traverse once it reaches the
local logic. This style of synchronous reset can be used with any logic or library. Example 3 shows an
implementation of this style of synchronous reset as part of a loadable counter with carry out.

module ctr8sr ( q, co, d, ld, rst_n, clk);
  output [7:0] q;
  output       co;
  input  [7:0] d;
  input        ld, rst_n, clk;
  reg    [7:0] q;
  reg          co;

  always @(posedge clk)
    if      (!rst_n) {co,q} <= 9'b0;     // sync reset
    else if (ld)     {co,q} <= d;        // sync load
    else             {co,q} <= q + 1'b1; // sync increment
endmodule
Example 3a - Verilog code for a loadable counter with synchronous reset
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity ctr8sr is
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;
    d     : in  std_logic_vector(7 downto 0);
    ld    : in  std_logic;
    q     : out std_logic_vector(7 downto 0);
    co    : out std_logic);
end ctr8sr;
architecture rtl of ctr8sr is
  signal count : std_logic_vector(8 downto 0);
begin
  co <= count(8);
  q  <= count(7 downto 0);
  process (clk)
  begin
    if (clk'event and clk = '1') then
      if (rst_n = '0') then
        count <= (others => '0');  -- sync reset
      elsif (ld = '1') then
        count <= '0' & d;          -- sync load
      else
        count <= count + 1;        -- sync increment
      end if;
    end if;
  end process;
end rtl;
Example 3b - VHDL code for a loadable counter with synchronous reset

Figure 3 - Loadable counter with synchronous reset
A second style of synchronous resets is based on the availability of flip-flops with synchronous reset pins and the
ability of the designer and synthesis tool to make use of those pins. This is sometimes the case, but more often the
first style discussed above is the implementation used[22][26].
3.1 Coding style and example circuit
The Verilog code of Example 4a and the VHDL code of Example 4b show the correct way to model synchronous reset
flip-flops. Note that the reset is not part of the sensitivity list. For Verilog, omitting the reset from the
sensitivity list is what makes the reset synchronous. For VHDL, omitting the reset from the sensitivity list and
checking for the reset after the “if clk’event and clk = 1” statement makes the reset synchronous. Also note that
the reset is given priority over any other assignment by using the if-else coding style.
module sync_resetFFstyle (q, d, clk, rst_n);
  output q;
  input  d, clk, rst_n;
  reg    q;

  always @(posedge clk)
    if (!rst_n) q <= 1'b0;
    else        q <= d;
endmodule
Example 4a - Correct way to model a flip-flop with synchronous reset using Verilog
library ieee;
use ieee.std_logic_1164.all;
entity syncresetFFstyle is
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;
    d     : in  std_logic;
    q     : out std_logic);
end syncresetFFstyle;
architecture rtl of syncresetFFstyle is
begin
  process (clk)
  begin
    if (clk'event and clk = '1') then
      if (rst_n = '0') then
        q <= '0';
      else
        q <= d;
      end if;
    end if;
  end process;
end rtl;
Example 4b - Correct way to model a flip-flop with synchronous reset using VHDL
For flip-flops designed with synchronous reset style #1 (reset is gated with data to the d-input), Synopsys has a
switch that the designer can use to help infer flip-flops with synchronous resets.
Compiler directive: sync_set_reset
In general, the authors recommend only using Synopsys switches when they are required and make a difference;
however, our colleague Steve Golson pointed out that the sync_set_reset directive does not affect the
functionality of a design, so its omission would not be recognized until gate-level simulation, when discovery of a
failure would require re-synthesizing the design late in the project schedule. Since this directive is only required once
per module, adding it to each module with synchronous resets is recommended[19].
A few years back, another ESNUG contributor recommended adding the compile_preserve_sync_resets
= "true" compiler directive[13]. Although this directive might have been useful a few years ago, it was
discontinued starting with Synopsys version 3.4b[22].
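As an illustration of where such a directive would sit, the sketch below places it as a comment inside a module with a synchronous reset. The module name sync_resetFF_dir is hypothetical, and the exact directive syntax shown (the reset signal name in quotes) is an assumption that should be checked against the documentation for the synthesis tool version in use.

```verilog
// Hypothetical example of a synchronous-reset flip-flop carrying the directive.
module sync_resetFF_dir (q, d, clk, rst_n);
  output q;
  input  d, clk, rst_n;
  reg    q;

  // The next line is an ordinary comment to a simulator; the synthesis tool
  // is assumed to read it as a hint that rst_n is a synchronous reset.
  // synopsys sync_set_reset "rst_n"
  always @(posedge clk)
    if (!rst_n) q <= 1'b0;
    else        q <= d;
endmodule
```

Because the directive is a comment, RTL simulation is identical with or without it, which is why a missing directive would go unnoticed until gate-level simulation.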
3.2 Advantages of synchronous resets

Synchronous reset will synthesize to smaller flip-flops, particularly if the reset is gated with the logic generating the
d-input. But in such a case, the combinational logic gate count grows, so the overall gate count savings may not be
that significant. If a design is tight, the area savings of one or two gates per flip-flop may ensure the ASIC fits into
the die. However, in today’s technology of huge die sizes, the savings of a gate or two per flip-flop is generally
irrelevant and will not be a significant factor of whether a design fits into a die.
Synchronous reset can be much easier to work with when using cycle based simulators. For this very reason,
synchronous resets are recommended in section 3.2.4 (2nd edition; section 3.2.3 in the 1st edition) of the Reuse
Methodology Manual (RMM)[18].
Synchronous resets generally ensure that the circuit is 100% synchronous.
Synchronous resets ensure that reset can only occur at an active clock edge. The clock works as a filter for small
reset glitches; however, if these glitches occur near the active clock edge, the flip-flop could go metastable.
In some designs, the reset must be generated by a set of internal conditions. A synchronous reset is recommended
for these types of designs because it will filter the logic equation glitches between clocks.
By using synchronous resets and a number of clocks as part of the reset process, flip-flops can be used within the
reset buffer tree to help the timing of the buffer tree keep within a clock period.
3.3 Disadvantages of synchronous resets
Synchronous resets may need a pulse stretcher to guarantee a reset pulse width wide enough to ensure reset is present
during an active edge of the clock[14].
A designer must work with pessimistic vs. optimistic simulators. This can be an issue if the reset is generated by
combinational logic in the ASIC or if the reset must traverse many levels of local combinational logic. During
simulation, based on how the reset is generated or how the reset is applied to a functional block, the reset can be
masked by X’s. A large number of the ESNUG articles addressed this issue. Most simulators will not resolve some
X-logic conditions and therefore block out the synchronous reset[5][6][7][8][9][10][11][12][13][20].

By its very nature, a synchronous reset will require a clock in order to reset the circuit. This may not be a
disadvantage to some design styles but to others, it may be an annoyance. The requirement of a clock to cause the
reset condition is significant if the ASIC/FPGA has an internal tristate bus. In order to prevent bus contention on an
internal tristate bus when a chip is powered up, the chip must have a power-on asynchronous reset[17].
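The pulse stretcher mentioned at the start of this section can take several forms. One possible sketch, not taken from the cited reference, captures a narrow active-low reset pulse asynchronously and holds the stretched reset low until it has been seen by at least one rising clock edge. The module name rst_stretch, the testbench, and the delays are hypothetical.

```verilog
`timescale 1ns/1ns
// Hypothetical pulse stretcher: rst_n low clears both stages immediately,
// and the inactive level then shifts back in over two clocks.
module rst_stretch (srst_n, rst_n, clk);
  output      srst_n;
  input       rst_n, clk;
  reg   [1:0] sr;

  always @(posedge clk or negedge rst_n)
    if (!rst_n) sr <= 2'b00;
    else        sr <= {sr[0], 1'b1};

  assign srst_n = sr[1];   // stretched reset for the synchronous-reset logic
endmodule

module tb_rst_stretch;
  reg  rst_n, clk, seen_low;
  wire srst_n;

  rst_stretch dut (.srst_n(srst_n), .rst_n(rst_n), .clk(clk));

  initial begin clk = 0; forever #5 clk = ~clk; end  // rising edges at 5, 15, 25, ...

  // A synchronous-reset flip-flop only samples its reset on rising clock edges.
  always @(posedge clk)
    if (srst_n === 1'b0) seen_low = 1'b1;

  initial begin
    seen_low = 1'b0;
    rst_n    = 1'b1;
    #21 rst_n = 1'b0;    // t=21: narrow pulse between the edges at 15 and 25
    #2  rst_n = 1'b1;    // t=23: pulse ends before any rising clock edge
    #30;                 // t=53: stretched reset has been released again
    if (seen_low && srst_n === 1'b1) $display("PASS");
    else                             $display("FAIL");
    $finish;
  end
endmodule
```

The input pulse never overlaps a rising clock edge, yet the stretched output is low at the edges that follow it. Note that this sketch itself depends on asynchronously resettable flip-flops, and that the release of rst_n is asynchronous to clk, which is why two stages are used.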
4.0 Asynchronous resets

Asynchronous resets are the authors' preferred reset approach. However, asynchronous resets alone can be very
dangerous. Many engineers like the idea of being able to apply the reset to their circuit and have the logic go to a
known state. The biggest problem with asynchronous resets is the reset release, also called reset removal. The
subject will be elaborated in detail in section 5.0.
Asynchronous reset flip-flops incorporate a reset pin into the flip-flop design. The reset pin is typically active low
(the flip-flop goes into the reset state when the signal attached to the flip-flop reset pin goes to a logic low level.)
4.1 Coding style and example circuit
The Verilog code of Example 5a and the VHDL code of Example 5b show the correct way to model asynchronous
reset flip-flops. Note that the reset is part of the sensitivity list. For Verilog, adding the reset to the sensitivity list is
what makes the reset asynchronous. In order for the Verilog simulation model of an asynchronous flip-flop to
simulate correctly, the sensitivity list should only be active on the leading edge of the asynchronous reset signal.
Hence, in Example 5a, the always procedure block will be entered on the leading edge of the reset, then the if
condition will check for the correct reset level.
Synopsys requires that if any signal in the sensitivity list is edge-sensitive, then all signals in the sensitivity list must
be edge-sensitive. In other words, Synopsys forces the correct coding style. Verilog simulation does not have this
requirement, but if the sensitivity list were sensitive to more than just the active clock edge and the reset leading
edge, the simulation model would be incorrect[4]. Additionally, only the clock and reset signals can be in the
sensitivity list. If other signals are included (legal Verilog, illegal Verilog RTL synthesis coding style) the
simulation model would not be correct for a flip-flop and Synopsys would report an error while reading the model
for synthesis.
For VHDL, including the reset in the sensitivity list and checking for the reset before the “if clk’event and
clk = 1” statement makes the reset asynchronous. Also note that the reset is given priority over any other
assignment (including the clock) by using the if/else coding style. Because of the nature of a VHDL sensitivity
list and flip-flop coding style, additional signals can be included in the sensitivity list with no ill effects directly for
simulation and synthesis. However, good coding style recommends that only the signals that can directly change the
output of the flip-flop should be in the sensitivity list. These signals are the clock and the asynchronous reset. All
other signals will slow down simulation and be ignored by synthesis.
module async_resetFFstyle (q, d, clk, rst_n);
  output q;
  input  d, clk, rst_n;
  reg    q;

  // Verilog-2001: permits comma-separation
  // @(posedge clk, negedge rst_n)
  always @(posedge clk or negedge rst_n)
    if (!rst_n) q <= 1'b0;
    else        q <= d;
endmodule
Example 5a - Correct way to model a flip-flop with asynchronous reset using Verilog
library ieee;
use ieee.std_logic_1164.all;
entity asyncresetFFstyle is
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;
    d     : in  std_logic;
    q     : out std_logic);
end asyncresetFFstyle;
architecture rtl of asyncresetFFstyle is
begin
  process (clk, rst_n)
  begin
    if (rst_n = '0') then
      q <= '0';
    elsif (clk'event and clk = '1') then
      q <= d;
    end if;
  end process;
end rtl;
Example 5b - Correct way to model a flip-flop with asynchronous reset using VHDL
The approach to synthesizing asynchronous resets will depend on the designer's approach to the reset buffer tree. If
the reset is driven directly from an external pin, then usually doing a set_drive 0 on the reset pin and doing a
set_dont_touch_network on the reset net will protect the net from being modified by synthesis. However,
there is at least one ESNUG article that indicates this is not always the case[16].
One ESNUG contributor[15] indicates that sometimes set_resistance 0 on the reset net might also be needed.
And our colleague, Steve Golson, has pointed out that you can set_resistance 0 on the net, or create a custom
wireload model with resistance=0 and apply it to the reset input port with the command:
set_wire_load -port_list reset
A recently updated SolvNet article also notes that starting with Synopsys release 2001.08 the definition of ideal nets
has slightly changed[24] and that a set_ideal_net command can be used to create ideal nets and “get no timing
updates, get no delay optimization, and get no DRC fixing.”
Another colleague, Chris Kiegle, reported that doing a set_disable_timing on a net for pre-v2001.08 designs helped
to clean up timing reports[2], which seems to be supported by two other SolvNet articles, one related to synthesis
and another related to Physical Synthesis, that recommend usage of both a set_false_path and a
set_disable_timing command[21][25].
4.2 Modeling Verilog flip-flops with asynchronous reset and asynchronous set
One additional note should be made here with regard to modeling asynchronous resets in Verilog. The simulation
model of a flip-flop that includes both an asynchronous set and an asynchronous reset in Verilog might not simulate
correctly without a little help from the designer. In general, most synchronous designs do not have flip-flops that
contain both an asynchronous set and an asynchronous reset, but on the occasion that such a flip-flop is required,
the coding style of Example 6 can be used to correct the Verilog RTL simulations where both reset and set are asserted
simultaneously and reset is removed first.
First note that the problem is only a simulation problem and not a synthesis problem (synthesis infers the correct
flip-flop with asynchronous set/reset). The simulation problem is due to the always block that is only entered on the
active edge of the set, reset or clock signals. If the reset becomes active, followed then by the set going active, then
if the reset goes inactive, the flip-flop should first go to a reset state, followed by going to a set state. With both
these inputs being asynchronous, the set should be active as soon as the reset is removed, but that will not be the case
in Verilog since there is no way to trigger the always block until the next rising clock edge.
For those rare designs where reset and set are both permitted to be asserted simultaneously and then reset is removed
first, the fix to this simulation problem is to model the flip-flop using self-correcting code enclosed within the
translate_off/translate_on directives and force the output to the correct value for this one condition. The best
recommendation here is to avoid, as much as possible, the condition that requires a flip-flop that uses both
asynchronous set and asynchronous reset. The code in Example 6 shows the fix that will simulate correctly and
guarantee a match between pre- and post-synthesis simulations. This code uses the translate_off/translate_on
directives to force the correct output for the exception condition[4].
// Good DFF with asynchronous set and reset and self-correcting
// set-reset assignment
module dff3_aras (q, d, clk, rst_n, set_n);
  output q;
  input d, clk, rst_n, set_n;
  reg q;
  always @(posedge clk or negedge rst_n or negedge set_n)
    if (!rst_n) q <= 0; // asynchronous reset
    else if (!set_n) q <= 1; // asynchronous set
    else q <= d;
  // synopsys translate_off
  always @(rst_n or set_n)
    if (rst_n && !set_n) force q = 1;
    else release q;
  // synopsys translate_on
endmodule
Example 6 – Verilog Asynchronous SET/RESET simulation and synthesis model
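The mismatch that Example 6 corrects can be reproduced with a short simulation. The following is a minimal sketch, not part of the original example set: dff_aras_nofix and tb_nofix are hypothetical names, and the model is the flip-flop of Example 6 with the self-correcting block left out.

```verilog
// Hypothetical module: the flip-flop of Example 6 WITHOUT the
// translate_off/translate_on correction.
module dff_aras_nofix (q, d, clk, rst_n, set_n);
  output q;
  input  d, clk, rst_n, set_n;
  reg    q;
  always @(posedge clk or negedge rst_n or negedge set_n)
    if      (!rst_n) q <= 0;   // asynchronous reset
    else if (!set_n) q <= 1;   // asynchronous set
    else             q <= d;
endmodule

module tb_nofix;
  reg  d, clk, rst_n, set_n;
  wire q;
  dff_aras_nofix u1 (q, d, clk, rst_n, set_n);
  initial begin
    clk = 0; d = 0; rst_n = 1; set_n = 1;
    #10 rst_n = 0;   // reset asserted: q goes to 0
    #10 set_n = 0;   // set asserted while reset is still active: q stays 0
    #10 rst_n = 1;   // reset removed first; a rising rst_n does not trigger the block
    #1  $display("q = %b (a real flip-flop would drive 1 here)", q);
    #10 $finish;
  end
endmodule
```

With no clock edge after the reset is removed, q remains at 0 in this model even though set_n is still asserted, which is the condition the force/release block of Example 6 is there to repair.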
4.3 Advantages of asynchronous resets

The biggest advantage to using asynchronous resets is that, as long as the vendor library has asynchronously
resettable flip-flops, the data path is guaranteed to be clean. Designs that are pushing the limit for data path timing
cannot afford to have added gates and additional net delays in the data path due to logic inserted to handle synchronous
resets. Of course this argument does not hold if the vendor library has flip-flops with synchronous reset inputs and
the designer can get Synopsys to actually use those pins. Using an asynchronous reset, the designer is guaranteed not
to have the reset added to the data path. The code in Example 7 infers asynchronous resets that will not be added to
the data path.
module ctr8ar ( q, co, d, ld, rst_n, clk);
  output [7:0] q;
  output co;
  input [7:0] d;
  input ld, rst_n, clk;
  reg [7:0] q;
  reg co;
  always @(posedge clk or negedge rst_n)
    if (!rst_n) {co,q} <= 9'b0; // async reset
    else if (ld) {co,q} <= d; // sync load
    else {co,q} <= q + 1'b1; // sync increment
endmodule
Example 7a- Verilog code for a loadable counter with asynchronous reset
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity ctr8ar is
  port (
    clk : in std_logic;
    rst_n : in std_logic;
    d : in std_logic_vector(7 downto 0);
    ld : in std_logic;
    q : out std_logic_vector(7 downto 0);
    co : out std_logic);
end ctr8ar;
architecture rtl of ctr8ar is
  signal count : std_logic_vector(8 downto 0);
begin
  co <= count(8);
  q <= count(7 downto 0);
  process (clk, rst_n)
  begin
    if (rst_n = '0') then
      count <= (others => '0'); -- async reset
    elsif (clk'event and clk = '1') then
      if (ld = '1') then
        count <= '0' & d; -- sync load
      else
        count <= count + 1; -- sync increment
      end if;
    end if;
  end process;
end rtl;
Example 7b- VHDL code for a loadable counter with asynchronous reset

Figure 4 - Loadable counter with asynchronous reset
Another advantage favoring asynchronous resets is that the circuit can be reset with or without a clock present.
The experience of the authors is that by using the coding style for asynchronous resets described in this section, the
synthesis interface tends to be automatic. That is, there is generally no need to add any synthesis attributes to get the
synthesis tool to map to a flip-flop with an asynchronous reset pin.
4.4 Disadvantages of asynchronous resets
There are many reasons given by engineers as to why asynchronous resets are evil.
The Reuse Methodology Manual (RMM) suggests that asynchronous resets are not to be used because they cannot
be used with cycle based simulators. This is simply not true. The basis of a cycle based simulator is that all inputs
change on a clock edge. Since timing is not part of cycle based simulation, the asynchronous reset can simply be
applied on the inactive clock edge.
For DFT, if the asynchronous reset is not directly driven from an I/O pin, then the reset net from the reset driver must
be disabled for DFT scanning and testing. This is required for the synchronizer circuit shown in section 6.
Some designers claim that static timing analysis is very difficult to do with designs using asynchronous resets. The
reset tree must be timed for both synchronous and asynchronous resets to ensure that the release of the reset can
occur within one clock period. The timing analysis for a reset tree must be performed after layout to ensure this
timing requirement is met.
The biggest problem with asynchronous resets is that they are asynchronous, both at the assertion and at the
de-assertion of the reset. The assertion is a non-issue; the de-assertion is the issue. If the asynchronous reset is released
at or near the active clock edge of a flip-flop, the output of the flip-flop could go metastable and thus the reset state
of the ASIC could be lost.
Another problem that an asynchronous reset can have, depending on its source, is spurious resets due to noise or
glitches on the board or system reset. See section 8.0 for a possible solution to reset glitches. If this is a real
problem in a system, then one might think that using synchronous resets is the solution. A different but similar
problem exists for synchronous resets: if these spurious reset pulses occur near a clock edge, the flip-flops can still go
metastable.

5.0 Asynchronous reset problem
In discussing this paper topic with a colleague, the engineer stated first that since all he was working on was FPGAs,
they do not have the same reset problems that ASICs have (a misconception). He went on to say that he always had
an asynchronous system reset that could override everything, to put the chip into a known state. The engineer was
then asked what would happen to the FPGA or ASIC if the release of the reset occurred on or near a clock edge such
that the flip-flops went metastable.
Too many engineers just apply an asynchronous reset thinking that there are no problems. They test the reset in the
controlled simulation environment and everything works fine, but then in the system, the design fails intermittently.
The designers do not consider the idea that the release of the reset in the system (non-controlled environment) could
cause the chip to go into a metastable unknown state, thus voiding the reset altogether. Attention must be paid to
the release of the reset so as to prevent the chip from going into a metastable unknown state when reset is released.
When a synchronous reset is being used, both the leading and trailing edges of the reset must be away from the
active edge of the clock.
As shown in Figure 5, an asynchronous reset signal will be de-asserted asynchronous to the clock signal. There are
two potential problems with this scenario: (1) violation of reset recovery time and, (2) reset removal happening in
different clock cycles for different sequential elements.
Figure 5 - Asynchronous reset removal recovery time problem

5.1 Reset recovery time
Reset recovery time refers to the time between when reset is de-asserted and the time that the clock signal goes high
again. The Verilog-2001 Standard[17] has three built-in commands to model and test recovery time and signal
removal timing checks: $recovery, $removal and $recrem (the latter is a combination of recovery and removal timing
checks).
Recovery time is also referred to as a tsu setup time of the form, “PRE or CLR inactive setup time before CLK↑”[1].
Missing a recovery time can cause signal integrity or metastability problems with the registered data outputs.
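The timing checks named above can be attached to a flip-flop model inside a specify block. The following is a minimal sketch under stated assumptions: dff_ar_chk is a hypothetical cell name, and the limits t_rec and t_rem are placeholder numbers in simulation time units, not values from any vendor library. Some simulators parse timing checks without evaluating them, so whether a violation is reported depends on the tool.

```verilog
// Hypothetical cell model; limit values are placeholders, not library data.
module dff_ar_chk (q, d, clk, rst_n);
  output q;
  input  d, clk, rst_n;
  reg    q;
  reg    notifier;   // toggled by the simulator when a check is violated

  always @(posedge clk or negedge rst_n)
    if (!rst_n) q <= 1'b0;   // asynchronous reset
    else        q <= d;

  specify
    specparam t_rec = 2, t_rem = 1;
    // violation if an active clock edge arrives less than t_rec
    // after reset is released
    $recovery(posedge rst_n, posedge clk, t_rec, notifier);
    // violation if reset is released less than t_rem after an
    // active clock edge
    $removal (posedge rst_n, posedge clk, t_rem, notifier);
    // combined form of the two checks above:
    // $recrem(posedge rst_n, posedge clk, t_rec, t_rem, notifier);
  endspecify
endmodule
```

In a gate-level library model the notifier would typically feed a UDP that drives the output to X on a violation; here it is left unconnected to keep the sketch short.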
5.2 Reset removal traversing different clock cycles
When reset removal is asynchronous to the rising clock edge, slight differences in propagation delays in either or
both the reset signal and the clock signal can cause some registers or flip-flops to exit the reset state before others.

6.0 Reset synchronizer
Guideline: EVERY ASIC USING AN ASYNCHRONOUS RESET SHOULD INCLUDE A RESET
SYNCHRONIZER CIRCUIT!!
Without a reset synchronizer, the usefulness of the asynchronous reset in the final system is void even if the reset
works during simulation.
The reset synchronizer logic of Figure 6 is designed to take advantage of the best of both asynchronous and
synchronous reset styles.
Figure 6 - Reset Synchronizer block diagram

An external reset signal asynchronously resets a pair of master reset flip-flops, which in turn drive the master reset
signal asynchronously through the reset buffer tree to the rest of the flip-flops in the design. The entire design will
be asynchronously reset.
Reset removal is accomplished by de-asserting the reset signal, which then permits the d-input of the first master
reset flip-flop (which is tied high) to be clocked through a reset synchronizer. It typically takes two rising clock
edges after reset removal to synchronize removal of the master reset.
Two flip-flops are required to synchronize the reset signal to the clock pulse where the second flip-flop is used to
remove any metastability that might be caused by the reset signal being removed asynchronously and too close to the
rising clock edge. As discussed in section 4.4, these synchronization flip-flops must be kept off of the scan chain.

Figure 7 - Predictable reset removal to satisfy reset recovery time
A closer examination of the timing now shows that reset distribution timing is the sum of a clk-to-q propagation
delay and the total delay through the reset distribution tree, and that this sum must still meet the reset recovery time
of the destination registers and flip-flops, as shown in Figure 7.
The code for the reset synchronizer circuit is shown in Example 8.
module async_resetFFstyle2 (rst_n, clk, asyncrst_n);
  output rst_n;
  input clk, asyncrst_n;
  reg rst_n, rff1;
  always @(posedge clk or negedge asyncrst_n)
    if (!asyncrst_n) {rst_n,rff1} <= 2'b0;
    else {rst_n,rff1} <= {rff1,1'b1};
endmodule
Example 8a - Properly coded reset synchronizer using Verilog
library ieee;
use ieee.std_logic_1164.all;
entity asyncresetFFstyle is
  port (
    clk : in std_logic;
    asyncrst_n : in std_logic;
    rst_n : out std_logic);
end asyncresetFFstyle;
architecture rtl of asyncresetFFstyle is
  signal rff1 : std_logic;
begin
  process (clk, asyncrst_n)
  begin
    if (asyncrst_n = '0') then
      rff1 <= '0';
      rst_n <= '0';
    elsif (clk'event and clk = '1') then
      rff1 <= '1';
      rst_n <= rff1;
    end if;
  end process;
end rtl;
Example 8b - Properly coded reset synchronizer using VHDL
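The behavior described for the reset synchronizer (immediate assertion, removal two rising clock edges after release) can be observed with a small testbench. This is a sketch only: rst_sync2 and tb_rst_sync2 are hypothetical names, rst_sync2 repeats the logic of Example 8a so that the block compiles on its own, and the clock period and stimulus times are arbitrary.

```verilog
// Hypothetical module: same logic as Example 8a, repeated for self-containment.
module rst_sync2 (rst_n, clk, asyncrst_n);
  output rst_n;
  input  clk, asyncrst_n;
  reg    rst_n, rff1;
  always @(posedge clk or negedge asyncrst_n)
    if (!asyncrst_n) {rst_n,rff1} <= 2'b0;
    else             {rst_n,rff1} <= {rff1,1'b1};
endmodule

module tb_rst_sync2;
  reg  clk, asyncrst_n;
  wire rst_n;
  rst_sync2 u1 (rst_n, clk, asyncrst_n);

  // rising clock edges at 5, 15, 25, ...
  initial begin clk = 0; forever #5 clk = ~clk; end

  initial begin
    $monitor("%0t asyncrst_n=%b rst_n=%b", $time, asyncrst_n, rst_n);
    asyncrst_n = 1;
    #2  asyncrst_n = 0;   // assertion passes straight through, no clock needed
    #10 asyncrst_n = 1;   // released at time 12, between clock edges
    #30 $finish;          // rst_n rises at the second rising edge after release
  end
endmodule
```

In this stimulus the first rising edge after release (time 15) loads rff1, and the second (time 25) removes rst_n, so every flip-flop on the reset tree sees removal referenced to the clock rather than to the external reset.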
7.0 Reset distribution tree

The reset distribution tree requires almost as much attention as a clock distribution tree, because there are generally
as many reset-input loads as there are clock-input loads in a typical digital design, as shown in Figure 8. The timing
requirements for the reset tree are common to both synchronous and asynchronous reset styles.
Figure 8 - Reset distribution tree

One important difference between a clock distribution tree and a reset distribution tree is that the reset tree does not
need closely balanced skew between the distributed resets. Unlike clock signals, skew between reset signals is not critical as
long as the delay associated with any reset signal is short enough to allow propagation to all reset loads within a
clock period and still meet recovery time of all destination registers and flip-flops.
Care must be taken to analyze the clock tree timing against the clk-q-reset tree timing. The safest way to clock a
reset tree (synchronous or asynchronous reset) is to clock the internal-master-reset flip-flop from a leaf-clock of the
clock tree as shown in Figure 9. If this approach will meet timing, life is good. In most cases, there is not enough
time to have a clock pulse traverse the clock tree, clock the reset-driving flip-flop and then have the reset traverse the
reset tree, all within one clock period.

Figure 9 - Reset tree driven from a delayed, buffered clock
In order to help speed the reset arrival to all the system flip-flops, the reset-driver flip-flop is clocked with an early
clock as shown in Figure 10. Post layout timing analysis must be made to ensure that the reset release for
asynchronous resets and both the assertion and release for synchronous reset do not beat the clock to the flip-flops;
meaning the reset must not violate setup and hold on the flops. Often detailed timing adjustments like this can not be
made until the layout is done and real timing is available for the two trees.

Figure 10 - Reset synchronizer driven in parallel to the clock distribution tree
Ignoring this problem will not make it go away. Gee, and we all thought resets were such a basic topic.
8.0 Reset-glitch filtering

As stated earlier in this paper, one of the biggest issues with asynchronous resets is that they are asynchronous and
therefore carry with them some characteristics that must be dealt with depending on the source of the reset. With
asynchronous resets, any input wide enough to meet the minimum reset pulse width for a flip-flop will cause the
flip-flop to reset. If the reset line is subject to glitching, this can be a real problem. Presented here is one approach that
will work to filter out the glitches, but it is ugly! This solution requires a digital delay (meaning the delay will
vary with temperature, voltage and process) to filter out small glitches. The reset input pad should also be a
Schmitt-triggered pad to help with glitch filtering. Figure 11 shows the implementation of this approach.

Figure 11 - Reset glitch filtering
In order to add the delay, some vendors provide a delay hard macro that can be hand instantiated. If such a delay
macro is not available, the designer could manually instantiate the delay into the synthesized design after
optimization; remember not to optimize this block after the delay has been inserted or it will be removed. Of
course the elements could have don’t touch attributes applied to prevent them from being removed. A second
approach is to instantiate a slow buffer in a module and then instantiate that module multiple times to get the
desired delay. Many variations could expand on this concept.
This glitch filter is not needed in all systems. The designer must research the system requirements to determine
whether or not a delay is needed.
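For simulation purposes, the glitch filter can be sketched behaviorally. Since Figure 11 is not reproduced here, the structure below is an assumption: the pad reset and a delayed copy of it are combined so that the filtered active-low reset asserts only while both are low. The names rst_glitch_filter, pad_rst_n, frst_n and the parameter DLY are hypothetical, and DLY stands in for the vendor delay macro or buffer chain.

```verilog
// Simulation-only sketch; DLY models the delay macro or buffer chain.
module rst_glitch_filter (frst_n, pad_rst_n);
  output frst_n;
  input  pad_rst_n;
  parameter DLY = 5;   // hypothetical delay, in simulation time units

  reg dly_rst_n;
  // nonblocking assignment with intra-assignment delay gives a transport
  // delay, so short pulses travel down the delay line unchanged
  always @(pad_rst_n) dly_rst_n <= #DLY pad_rst_n;

  // active-low reset: the output goes low only while both copies are low,
  // which requires a pad pulse wider than DLY
  assign frst_n = pad_rst_n | dly_rst_n;
endmodule
```

A low pulse narrower than DLY never overlaps its delayed copy and is filtered out; a wider pulse produces a filtered reset that is shorter than the pad pulse by DLY, which should be taken into account against the minimum reset pulse width.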
9.0 DFT for asynchronous resets

Applying Design for Test (DFT) functionality to a design is a two step process. First, the flip-flops in the design are
stitched together into a scan chain accessible from external I/O pins; this is called scan insertion. The scan chain is
typically not part of the functional design. Second, a software program is run to generate a set of scan vectors that,
when applied to the scan chain, will test and verify the design. This software program is called Automatic Test
Pattern Generation or ATPG. The primary objective of the scan vectors is to provide foundry vectors for
manufacture tests of the wafers and die as well as tests for the final packaged part.
The process of applying the ATPG vectors to create a test is based on:
1. scanning a known state into all the flip-flops in the chip,
2. switching the flip-flops from scan shift mode, to functional data input mode,
3. applying one functional clock,
4. switching the flip-flops back to scan shift mode to scan out the result of the one functional clock while
scanning in the next test vector.
The DFT process usually requires two control pins. The first puts the design into “test mode.” This pin is used to
mask off non-testable logic such as internally generated asynchronous resets, asynchronous combinational feedback
loops, and many other logic conditions that require special attention. This pin is usually held constant during the
entire test. The second control pin is the shift enable pin.
In order for the ATPG vectors to work, the test program must be able to control all the inputs to the flip-flops on the
scan chain in the chip. This includes not only the clock and data, but also the reset pin (synchronous or
asynchronous). If the reset is driven directly from an I/O pin, then the reset is held in a non-reset state. If the reset is
internally generated, then the master internal reset is held in a non-reset state by the test mode signal. If the
internally generated reset were not masked off during ATPG, then the reset condition might occur during scan
causing the flip-flops in the chip to be reset, and thus lose the vector data being scanned in.
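Holding an internally generated reset in the non-reset state with the test mode signal can be sketched as a single gate in front of the reset tree. The module and signal names below (rst_testmask, sync_rst_n, test_mode) are hypothetical, and a real design might instead multiplex in a pin-controlled reset during test.

```verilog
// Hypothetical names; a sketch of test-mode masking for an active-low reset.
module rst_testmask (mstrrst_n, sync_rst_n, test_mode);
  output mstrrst_n;   // drives the reset distribution tree
  input  sync_rst_n;  // internally generated reset, e.g. a synchronizer output
  input  test_mode;   // held high for the entire scan test

  // with test_mode high the reset tree is forced to the non-reset state,
  // so scanned-in vector data cannot be cleared by the internal reset
  assign mstrrst_n = sync_rst_n | test_mode;
endmodule
```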
Even though the asynchronous reset is held to the non-reset state for ATPG, this does not mean that the reset/set
cannot be tested as part of the DFT process. Before locking out the reset with test mode and generating the ATPG
vectors, a few vectors can be manually generated to create reset/set test vectors. The process required to test
asynchronous resets for DFT is very straightforward and may be automatic with some DFT tools. If the scan tool
does not automatically test the asynchronous resets/sets, then they must be set up manually. The basic steps to manually
test the asynchronous resets/sets are as follows:
1. scan in all ones into the scan chain
2. issue and release the asynchronous reset
3. scan out the result and scan in all zeros
4. issue and release the reset
5. scan out the result
6. set the reset input to the non reset state and then apply the ATPG generated vectors.
This test approach will scan test for both asynchronous resets and sets. These manually generated vectors will be
added to the ATPG vectors to provide a higher fault coverage for the manufacture test. If the design uses flip-flops
with synchronous reset inputs, then modifying the above manual asynchronous reset test slightly will give a similar
test for the synchronous reset environment. Add to the steps above a functional clock while the reset is applied. All
other steps would remain the same.
For the reset synchronizer circuit discussed in this paper, the two synchronizer flip-flops should not be included in
the scan chain, but should be tested using the manual process discussed above.
10.0 Multi-clock reset issues

For a multi-clock design, a separate asynchronous reset synchronizer circuit and reset distribution tree should be
used for each clock domain. This is done to insure that reset signals can indeed be guaranteed to meet the reset
recovery time for each register in each clock domain.
As discussed earlier, asynchronous reset assertion is not a problem. The problem is graceful removal of reset and
synchronized startup of all logic after reset is removed.
Depending on the constraints of the design, there are two techniques that could be employed: (1) non-coordinated
reset removal, and (2) sequenced coordination of reset removal.

Figure 12 - Multi-clock reset removal
10.1 Non-coordinated reset removal

For many multi-clock designs, exactly when reset is removed within one clock domain compared to when it is
removed in another clock domain is not important. Typically in these designs, any control signals crossing clock
boundaries are passed through some type of request-acknowledge handshaking sequence and the delayed
acknowledge from one clock domain to another is not going to cause invalid execution of the hardware. For this type
of design, creating separate asynchronous reset synchronizers as shown in Figure 12 is sufficient, and the fact that
arst_n, brst_n and crst_n could be removed in any sequence is not important to the design.
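A non-coordinated arrangement of this kind can be sketched by giving each clock domain its own copy of the reset synchronizer. The module names rst_sync and multi_clk_resets are hypothetical, the clock names aclk, bclk and cclk are assumed from the reset names used above, and rst_sync repeats the logic of Example 8a so that the block compiles on its own.

```verilog
// Hypothetical module: same logic as the reset synchronizer of Example 8a.
module rst_sync (rst_n, clk, asyncrst_n);
  output rst_n;
  input  clk, asyncrst_n;
  reg    rst_n, rff1;
  always @(posedge clk or negedge asyncrst_n)
    if (!asyncrst_n) {rst_n,rff1} <= 2'b0;
    else             {rst_n,rff1} <= {rff1,1'b1};
endmodule

// One synchronizer per clock domain, all asserted by the same external reset.
module multi_clk_resets (arst_n, brst_n, crst_n, aclk, bclk, cclk, rst_n);
  output arst_n, brst_n, crst_n;
  input  aclk, bclk, cclk, rst_n;
  rst_sync ua (.rst_n(arst_n), .clk(aclk), .asyncrst_n(rst_n));
  rst_sync ub (.rst_n(brst_n), .clk(bclk), .asyncrst_n(rst_n));
  rst_sync uc (.rst_n(crst_n), .clk(cclk), .asyncrst_n(rst_n));
endmodule
```

Each output is removed two rising edges of its own clock after rst_n is released, so the order of removal depends only on the relative clock phases and frequencies.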
10.2 Sequenced coordination of reset removal

For some multi-clock designs, reset removal must be ordered and properly sequenced. For this type of design, creating
prioritized asynchronous reset synchronizers as shown in Figure 13 might be required to insure that all aclk domain
logic is activated after reset is removed before the bclk logic, which must also be activated before the cclk logic
becomes active.

Figure 13 - Multi-clock ordered reset removal
For this type of design, only the highest priority asynchronous reset synchronizer input is tied high. The other
asynchronous reset synchronizer inputs are tied to the master resets from higher priority clock domains.
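The prioritized arrangement can be sketched by exposing the d-input of the first synchronizer flip-flop and chaining the domains. The module names rst_sync_d and ordered_resets are hypothetical; the priority order aclk, then bclk, then cclk follows the description above, and the exact wiring of Figure 13 is assumed rather than reproduced.

```verilog
// Hypothetical module: the reset synchronizer with its first d-input exposed.
module rst_sync_d (rst_n, clk, asyncrst_n, d);
  output rst_n;
  input  clk, asyncrst_n, d;
  reg    rst_n, rff1;
  always @(posedge clk or negedge asyncrst_n)
    if (!asyncrst_n) {rst_n,rff1} <= 2'b0;
    else             {rst_n,rff1} <= {rff1,d};
endmodule

module ordered_resets (arst_n, brst_n, crst_n, aclk, bclk, cclk, rst_n);
  output arst_n, brst_n, crst_n;
  input  aclk, bclk, cclk, rst_n;
  // highest priority domain: d-input tied high
  rst_sync_d ua (.rst_n(arst_n), .clk(aclk), .asyncrst_n(rst_n), .d(1'b1));
  // lower priority domains: d-input driven by the next higher master reset
  rst_sync_d ub (.rst_n(brst_n), .clk(bclk), .asyncrst_n(rst_n), .d(arst_n));
  rst_sync_d uc (.rst_n(crst_n), .clk(cclk), .asyncrst_n(rst_n), .d(brst_n));
endmodule
```

Assertion is still asynchronous and simultaneous in all three domains; only removal is sequenced, because brst_n cannot go high until arst_n has been sampled high in the bclk domain, and likewise for crst_n.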
11.0 Multi-ASIC reset synchronization

There are designs with multiple ASICs that require precise synchronization of reset removal across all of the multiple
ASICs. One approach to satisfy this type of design, described in this section, is to use a different asynchronous reset
synchronization scheme, one that only requires one reset removal flip-flop instead of the two flip-flops described in
section 6.0, plus a digitally calibrated synchronization delay to properly sequence reset removal from the multiple
ASICs.
Consider the actual design of a data acquisition board on a Digital Storage Oscilloscope (DSO). In rudimentary
terms, a DSO is a test instrument that probes an analog signal, continuously does sampling and Analog-to-Digital
(A2D) conversion of the signal, and continuously stores the sampled digital data into memory as fast as it can. After
the requested trigger condition occurs, the rest of the data associated with the trigger condition is stored to memory
and then DSO control logic (typically a commercial microprocessor) accesses the data and draws a waveform of the
data values onto a screen for visual inspection.
For an actual design of this type, the data acquisition board contained four digital demultiplexer (demux) ASICs,
each of which captured one-fourth of the datain samples to send to memory, as shown in Figure 14.

Figure 14 - Multi-ASIC design with synchronized reset removal problem
For this digital acquisition system, as soon as reset is removed, the ASICs must start capturing data and generating
memory addresses to write the data to memory. Both data acquisition and address generation are continuously
running, capturing data samples and overwriting previous written memory locations until a trigger circuit causes the
address counters to stop and hold the data that has been most recently captured. Frequently, the trigger is set to hold
and show 90% of the waveform as pre-trigger data and 10% of the waveform as post-trigger data. Since it is
generally impossible to predict when the trigger will occur, it is necessary to continuously acquire data after reset
removal until a trigger signal stops the data acquisition.
The approach that was used in this design to do high-speed data acquisition was to use four demux ASICs that
capture every fourth point of the digitized waveform. Since the demux ASICs typically ran at very fast clock rates,
and since each demux ASIC also had to generate accompanying address count values to store the data samples to
memory, it was important that all four demux ASICs start their respective address counters in the correct sequence to
insure that the data samples stored in memory could be easily read-back to draw waveforms on the DSO display.
The problem with this type of design was to accurately remove the reset signal from the four ASIC devices at the
same time (in the same relative clock period) so that the four ASICs captured the correctly sequenced data samples
that corresponded to address-#0 on all four ASICs, followed by address-#1 on all four ASICs, etc., so that the data
stored to memory could be read back from memory (after triggering the DSO) in the correct sequence to display an
accurate waveform on the DSO screen.
For this type of design, there are a number of factors that work against correct reset-removal and hence correct
sequencing of the data values being written to memory.
First, for very high-speed designs (DSOs are typically very high-speed designs in order to capture an adequate
number of data samples while probing other high-speed circuits), the relative board trace length of reset signals to
the four ASICs would have to be held to a very tight tolerance; hence, board layout is an issue.
Second, process variations within or between batches of manufactured ASICs can create delays that exceed the
ultra-short ASIC clock periods. Choosing four ASICs to insert during manufacture can result in selection of four devices
with different relative delays being placed on the same data acquisition board. The relative process speeds of the four
ASICs placed on a board cannot be guaranteed (which of the four ASICs will always be the fastest? Who knows!)
Third, temperature swings in different test environments can also add to differences in delays. Relative positioning of
the ASICs inside of a DSO enclosure might account for significant differences in temperature for this high-speed
system.
Fourth, removing the covers of the DSO to troubleshoot prototypes could introduce different temperature variations
across the four ASICs than when the covers are closed.
For the actual design, a common reset signal (reset_n) was routed to all four demux ASICs to assert reset, but the
reset signal did not de-assert reset from the demux ASICs. A separate sync signal was used to flag reset removal
permission on each demux ASIC.
Figure 15 - Reset removal synchronization logic block diagram
The multi-ASIC reset removal synchronization logic is handled using the logic shown in Figure 15. This logic is
common to both master and slave ASICs.
Asserting reset (rst_n going low in Figure 15) asynchronously resets the master reset signal, mstrrst_n, which
is driven through a reset-tree to the rest of the resetable logic on all ASICs (both master and slave ASICs); therefore,
reset is asynchronous and immediate.
Each ASIC has three pins dedicated to reset-removal synchronization.
The first pin on each ASIC is a dedicated master/slave selection pin. When this pin is tied high, the ASIC is placed
into master mode. When the pin is tied low, the ASIC is placed into slave mode.
The second pin on each ASIC is the sync_out pin. On the slave ASICs, the sync_out pin is unused and left
dangling. The master ASIC generates the sync_out pulse when reset is removed (when reset_n goes high). The
sync_out signal is driven out of the master ASIC and is tied to the sync_in input on both master and slave
ASICs through board-trace connections. The sync_out pin is the pin that controls reset removal on both the
master ASIC and the slave ASICs.

The third pin on each ASIC is the sync_in pin. The sync_in pin is the input pin that is used to control reset
removal on both master and slave ASICs. The sync_in signal is connected to a programmable delay block and is
then enabled by a high-assertion on the reset input, that is then passed to a synchronous reset removal flip-flop. The
next rising clock edge on the ASIC will cause the reset to be synchronously removed, permitting the address counters
on each ASIC to start counting in a synchronized and orderly manner.
The problem, as explained earlier, is to insure that the sync_in signal removes the reset on the four ASICs in the
correct order.
Figure 16 - Programmable digital delay block diagram
The programmable digital delay block, shown in Figure 16, is a set of delay stages connected in series with each
delay-stage output driving both the next delay-stage input and an input on a multiplexer. The delay stages could be
simple buffers or they could be pairs of inverters. The number of delay stages was chosen so that the total delay
through the chain was equal to almost three ASIC clock cycles.
A processor interface is used to program the delay select register, which enables the multiplexer select lines to
choose which delayed sync_in signal (sdly0 to sdly15) would be driven to the mux output and used to remove
the reset on the ASIC.
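A behavioral sketch of the programmable delay block follows. It assumes sixteen taps (sdly0 to sdly15, as named above) and a 4-bit select value; the module name sync_dly16, the port names and the per-stage delay STG are hypothetical, and the gate delays are a simulation stand-in for the real buffer or inverter-pair stages.

```verilog
// Simulation-only sketch; STG is a hypothetical per-stage delay.
module sync_dly16 (sync_dly, sync_in, dsel);
  output       sync_dly;   // delayed sync_in, used to remove the reset
  input        sync_in;
  input  [3:0] dsel;       // value of the delay select register
  parameter    STG = 1;

  wire [15:0] sdly;        // sdly[0] .. sdly[15], one tap per delay stage

  // delay stages connected in series; each output feeds the next stage
  // and one input of the multiplexer
  buf #(STG) b0 (sdly[0], sync_in);
  genvar i;
  generate
    for (i = 1; i < 16; i = i + 1) begin : stage
      buf #(STG) b (sdly[i], sdly[i-1]);
    end
  endgenerate

  assign sync_dly = sdly[dsel];   // 16:1 multiplexer
endmodule
```

With dsel set to 11, for example, sync_dly follows sync_in after twelve stage delays, and changing dsel by one moves reset removal by one stage delay in either direction.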
In order to determine the correct delay settings for each ASIC, a software digital calibration technique was
employed.
To help calibrate the demux ASICs, as well as other analog devices on the data acquisition board, the board was
designed to capture a selectable on-board ramp signal through the data acquisition path. The ramp signal was used to
calibrate the delays on the four demux ASICs.
In Figure 17-Figure 19, the software programmable, digital calibration procedure is shown with just two of the four
demux ASICs.

Figure 17 - Two-ASIC reset-removal calibration - early data sampling on ASIC #2
ASIC #1 is given the initial delay setting of 11 (to drive the sdly11 signal to the mux output). ASIC #2 is given
another delay setting and a ramp signal is captured by the data acquisition board. If the delay setting on ASIC #2 is
too small, such as a delay value of 0-4 as shown in Figure 17, the ramp values captured by ASIC #2 will be sampled
early compared to the data points sampled by ASIC #1. This is manifest by the fact that each ramp data point
captured by ASIC #2 is larger than the next data point captured by ASIC #1.
If the delay setting on ASIC #2 is in the correct range, such as a delay value of 5-11 as shown in Figure 18, the ramp
values captured by ASIC #2 will be sampled in the correct order compared to the data points sampled by ASIC #1.
This is manifest by the fact that each ramp data point captured by ASIC #2 is larger than the previous data point
captured by ASIC #1 and smaller than the next data point captured by ASIC #1.
If the delay setting on ASIC #2 is too large, such as a delay value of 12-15 as shown in Figure 19, the ramp values
captured by ASIC #2 will be sampled late compared to the data points sampled by ASIC #1. This is manifest by the
fact that each ramp data point captured by ASIC #2 is smaller than the next data point captured by ASIC #1.
Once the correct range is determined for ASIC #2, the center point in the range is chosen to be the ASIC #2
sync_in delay setting. The center point is the safest setting in the range since this setting is approximately a
half-cycle between the previous and next rising clock edges for the reset-removal synchronization flip-flop.
After determining the correct ASIC #2 setting, the correct ASIC #1 range surrounding the initial setting (the setting
of 11 is used in Figure 17) must be determined to find the correct ASIC #1 mid-point setting. After determining the
correct ASIC #1 setting, a similar process is used to find the correct delay setting for ASIC #3, followed by finding
the correct setting for ASIC #4.

Figure 18 - Two-ASIC reset-removal calibration - correctly timed data sampling on ASIC #2
Figure 19 - Two-ASIC reset-removal calibration - late data sampling on ASIC #2
After digital calibration, there was no need to use a second reset-removal synchronization flip-flop because a
mid-clock setting was used to insure that the flip-flop recovery time was met and to insure that no metastability problems
would arise.

The full block diagram of the four-demux ASIC design with master/slave pin and sync_in/sync_out pins on
each ASIC and how they were connected is shown in Figure 20.
Figure 20 - Multi-ASIC synchronized reset removal solution
In the actual design, after determining a valid set of mid-point delay settings for the four ASICs on one of the data
acquisition prototype boards, these values were programmed into a ROM and used as initial settings for all
manufactured boards and variations from the initial settings were tracked. What was interesting was that the
calibrated delay values for each board rarely strayed more than one or two delay stages up or down from the original
settings of the initial data acquisition prototype board.
12.0 Conclusions
Using asynchronous resets is the surest way to guarantee reliable reset assertion. Although an asynchronous reset is a
safe way to reliably reset circuitry, removal of an asynchronous reset can cause significant problems if not done
properly.
The proper way to design with asynchronous resets is to add the reset synchronizer logic to allow asynchronous reset
of the design and to insure synchronous reset removal to permit safe restoration of normal design functionality.
Using DFT with asynchronous resets is still achievable as long as the asynchronous reset can be controlled during
test.

References
[1] ALS/AS Logic Data Book, Texas Instruments, 1986, pg. 2-78.
[2] Chris Kiegle, personal communication
[3] Clifford E. Cummings, “Nonblocking Assignments in Verilog Synthesis, Coding Styles That Kill!,” SNUG (Synopsys
Users Group) 2000 User Papers, section-MC1 (1st paper), March 2000.
Also available at www.sunburst-design.com/papers
[4] Don Mills and Clifford E. Cummings, “RTL Coding Styles That Yield Simulation and Synthesis Mismatches,” SNUG
(Synopsys Users Group) 1999 Proceedings, section-TA2 (2nd paper), March 1999.
Also available at www.lcdm-eng.com/papers.htm and www.sunburst-design.com/papers
[5] ESNUG #60, Item 1- http://www.deepchip.com/posts/0060.html
[6] ESNUG #240, Item 7- http://www.deepchip.com/posts/0240.html
[7] ESNUG #242, Item 6 - http://www.deepchip.com/posts/0242.html
[17] IEEE Standard Verilog Hardware Description Language, IEEE Computer Society, IEEE, New York, NY,
IEEE Std 1364-2001.
[18] Michael Keating, and Pierre Bricaud, Reuse Methodology Manual, Second Edition, Kluwer Academic Publishers, 1999,
pg. 35.
[20] Synopsys SolvNet, Doc Name: METH-933.html, “Methodology and limitations of synthesis for synchronous set and
reset,” Updated 09/07/2001.
[21] Synopsys SolvNet, Doc Name: Physical_Synthesis-231.html, “Handling High Fanout Nets in 2001.08” Updated:
11/01/2001.
[22] Synopsys SolvNet, Doc Name: Star-15.html, “Is the compile_preserve_sync_reset Switch Still Valid?,” Updated:
09/07/2001.
[23] Synopsys SolvNet, Doc Name: Synthesis-452.html, “Why can't I synthesize synchronous reset flip-flops?,” Updated:
08/16/1999.
[24] Synopsys SolvNet, Doc Name: Synthesis-780.html, “How can I use the high_fanout_net_threshold commands to simplify
the net delay calculation?” Updated: 01/25/2002.
[25] Synopsys SolvNet, Doc Name: Synthesis-482109.html, “How to Eliminate Transition Time Calculation Side Effects From
Arcs That Are Fal” Updated: 08/11/1997
[26] Synopsys SolvNet, Doc Name: Synthesis-799.html, “Data and Synchronous Reset Swapped,” Updated: 05/01/2001.

ASIC, FPGA and system design experience and nine years of Verilog, synthesis and methodology training
experience.
Mr. Cummings, a member of the IEEE 1364 Verilog Standards Group (VSG) since 1994, chaired the VSG
Behavioral Task Force, which was charged with proposing enhancements to the Verilog language. Mr. Cummings is
also a member of the IEEE Verilog Synthesis Interoperability Working Group.
E-mail Address: cliffc@sunburst-design.com
Don Mills is an independent EDA consultant, ASIC designer, and Verilog/VHDL trainer with 16 years of
experience.
Don has inflicted pain on Aart De Geuss for too many years as SNUG Technical Chair. Aart was more than happy to
see him leave! Not really, Don chaired three San Jose SNUG conferences: 1998-2000, the first Boston SNUG 1999,
and is currently chair of the Europe SNUG 2001- present.
Don holds a BSEE from Brigham Young University.
E-mail Address: mills@lcdm-eng.com
An updated version of this paper can be downloaded from the web site: www.sunburst-design.com/papers or from
www.lcdm-eng.com
(Data accurate as of April 19th, 2002)

Clock Domain Crossing Demystified:
The Second Generation Solution
for CDC Verification
Abstract: With the increasing trend towards SOCs, designs with multiple
asynchronous clock domains are becoming commonplace. The design of
asynchronous clock domain crossing (CDC) interfaces is required to follow strict
design principles to ensure reliable operation in the presence of metastability.
With the emergence of CDC verification tools, users have a way to verify their
designs. However, first generation CDC tools provide inadequate support for a
top-down, bottom-up CDC verification methodology. Also, extensive manual
setup and signoff requirements create serious deployment limitations. This
causes inconsistent usage of the tools and wasted engineering resources without
covering the CDC failure risk.
This paper describes an efficient, reliable and practical CDC verification
methodology for effective timing closure verification (TCV).
1) Introduction: The increase in SOC designs is leading to extensive usage of
asynchronous clock domains. The clock domain crossing (CDC) interfaces are
required to follow strict design principles for reliable operation. Also, verification
of proper CDC design is not possible using standard simulation and static timing
analysis techniques. As a result, CDC verification tools have found increasing
usage in the design flows.
The first generation CDC tools allowed improved verification of CDC interfaces.
However, these tools suffered from extensive manual setup and signoff requirements.
As a result, the deployment of these tools posed challenges as the design size
and complexity continued to increase. Hence, we are seeing ineffective usage of
these tools, wasting engineering resources without covering the CDC failure risk.
This paper makes the fundamental observation that the inability to accomplish
timing closure across the CDC interface is the root cause of the CDC problem.
This implies that effective data transfer across a CDC interface requires design of a
multi-cycle timing path. Also, the observation is made that traditional simulation
and formal analysis techniques are incapable of analyzing transient design
behavior, which is at the root of CDC failures. Hence, special formal analysis
techniques are required for the CDC problem. Finally, the paper identifies
practical considerations for effective CDC verification. It recommends a
hierarchical top-down, bottom-up methodology, with result inheritance and
effective use of formal analysis, to minimize the manual engineering effort in
CDC verification.
The paper is organized into seven sections, including this introduction. The
second section introduces metastability and the CDC reliability problem. The
third section presents the design principles for CDC interfaces. The fourth
section identifies the steps required for CDC verification. The fifth section
discusses practical usage of CDC tools and the engineering resource
requirements. The sixth section outlines an efficient CDC verification
methodology. Finally, the seventh section summarizes the metrics to
evaluate the quality of CDC solutions.
2) Metastability and CDC fundamentals: A good understanding of the CDC
problem requires an understanding of metastability and the associated design
challenge. In this section we describe metastability and the CDC problem.
2.1) Understanding Metastability: When the latch input transitions within the
setup and hold window around the latching clock transition, the latch output can
become metastable at an intermediate voltage between logical zero and one.
Figure 1 shows a simplified latch implementation. The metastable state is a
high-energy state, as shown in Figure 2. Because of noise in the chip
environment, this metastable voltage gets disturbed and eventually resolves to a
logical value. The resolution time is dependent upon the load on the latch output
and the gain through the feedback loop. However, it is impossible to predict this
logical value. Also, there is an inherent delay in the resolution of the metastable
output as shown in the timing diagram of Figure 3. This logical and timing
uncertainty introduces unreliable behavior in the design and, without proper
protection, can cause it to fail in unpredictable ways.
Figure 1: A Simplified Latch
Figure 2: The Metastability Energy Curve
Figure 3: Metastability Timing Diagram


For synchronous clock design, timing closure with static timing analysis ensures
that all paths meet timing. Metastability is avoided and the design operates
reliably.
2.2) Limitations of Functional Verification: Static Timing Analysis and
Timing Closure Verification are needed: The prevalent functional verification
methodology is based upon functional simulation. A simplified view of the
simulation model is that the design behavior is evaluated using zero delay
evaluation for logic, unit delay for flops and ideal clock behavior. Also, Formal
Analysis makes use of the same evaluation assumptions. Hence, both of these
techniques have an inherent limitation that they only analyze the steady state
behavior of the design. Functional verification makes a fundamental assumption
that Static Timing Analysis will account for clock behavior uncertainty caused by
jitter and skews and ensure that all hazards in the design subside before the
clock event (Timing Closure). This is the default timing rule. Functional
verification will be invalidated if this assumption is violated. Static Timing
Analysis allows users to specify exceptions to the default timing rules. These
exceptions invalidate the functional verification and default timing assumptions.
It is imperative that these exceptions be properly verified using Timing Closure
Verification (TCV) for a robust design methodology. Since static timing of CDC
interfaces is not possible and requires timing exceptions, CDC verification is a
unique and essential component of TCV.
2.3) CDC terminology: A clock domain is defined as the set of all flops that are
clocked by the associated clock. A clock domain crossing (CDC) is defined as a
flop-to-flop path where the transmitting flop is triggered by a clock that is
asynchronous to the receiving flop clock. These two clock domains are
considered to be relatively asynchronous. Figure 4 describes the CDC
terminology used in this paper. The receiving flops are referred to as CDC flops.
The signals feeding the CDC flops are referred to as CDC signals.
Figure 4: Defining CDC Terminology
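As an illustration of this terminology, the following sketch shows a single unprotected CDC path. It is not taken from Figure 4; the module and signal names (cdc_path_example, clk_a, clk_b, tx_q, rx_q) are hypothetical and chosen only to label the transmitting flop, the CDC signal and the CDC flop.

```systemverilog
// Hypothetical illustration of the CDC terminology; all names are assumptions.
module cdc_path_example (
  input  logic clk_a,   // transmitting clock domain
  input  logic clk_b,   // receiving clock domain, asynchronous to clk_a
  input  logic rst_n,   // asynchronous active-low reset
  input  logic d,
  output logic rx_q
);
  logic tx_q;           // CDC signal: launched by clk_a, sampled by clk_b

  // Transmitting flop (clk_a domain)
  always_ff @(posedge clk_a or negedge rst_n)
    if (!rst_n) tx_q <= 1'b0;
    else        tx_q <= d;

  // CDC flop (clk_b domain): tx_q can change arbitrarily close to the
  // clk_b edge, so rx_q can become metastable
  always_ff @(posedge clk_b or negedge rst_n)
    if (!rst_n) rx_q <= 1'b0;
    else        rx_q <= tx_q;
endmodule
```

In this sketch rx_q is used directly, which is exactly the unprotected situation that the synchronizers of section 3 are meant to address.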


2.4) Unavoidable Metastability and the CDC problem: Asynchronous clocks
operate without any mutual frequency and phase relationships. As a result, it is
impossible to guarantee timing on CDC paths because the launch and capture
clock edges can be arbitrarily close. Hence, metastability is unavoidable for CDC
designs. This invalidates both functional simulation and formal verification
assumptions and robust design behavior can’t be assured using simulation and
Static Timing Analysis. Without proper design, CDC errors can cause random
and unpredictable failures in the chip that are impossible to debug.
2.5) CDC Errors: Metastability introduces the following failure modes in the
design:
E1) Loss of correlation. This happens when two or more correlated CDC flops
become metastable as shown in Figure 5. Figure 6 shows the timing diagram
where these flops resolve to arbitrary logical values and hence, lose correlation
leading to a bad design state.
E2) Hazard capture. A hazard on a CDC path can get captured in the CDC flop
leading to bad design state as shown in Figure 7.
E3) Loss of signal. CDC signals that are stable for less than one clock cycle of
the receiving clock may not get captured in the receiving domain because of
clock network uncertainties, clock alignment and metastability. Figure 8 shows
the situation where functional verification view concludes signal transmission.
However, the signal transmission can actually fail leading to a bad state in the
design.
E4) Metastability propagation. Metastability may propagate to the next level of
flops in the design if it is not resolved in a timely manner. The resolution time is
dependent upon the load on the flop. Propagation of metastability may lead to
cascading of errors E1-E3.
Figure 5: Loss of Correlation


Figure 6: Loss of Correlation Timing Diagram
Figure 7: Glitch Capture


Figure 8: Loss of Signal
3) CDC Design Principles: Metastability is unavoidable in CDC designs. As a
result, robust design of CDC interfaces is required to follow some strict design
principles as covered in this section.
3.1) Containing Metastability with Synchronizers: A synchronizer is a circuit
that keeps metastability effects from propagating into the design. Figure 9
shows a double flop synchronizer. This configuration minimizes the load on the
metastable flop. The single fanout protects against loss of correlation since the
metastable signal does not fan out to multiple flops. The probability that
metastability will last longer than time t is governed by the following equation:
where τ is the resolution time constant dependent upon the latch characteristics
and ambient noise. This configuration resolves metastability with a very high
probability leading to a very large mean time between failures as governed by the
equation:


where P is the probability that metastability is not resolved within one clock cycle.
Triple or higher flop configurations may be used for very fast designs.
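In the form commonly used in the synchronizer literature, and consistent with the definitions given above, the two relations can be sketched as follows. The symbols T_W (the metastability window of the flop), f_clk (the receiving clock frequency), f_data (the rate at which the CDC signal changes) and T_clk = 1/f_clk are assumptions introduced here for illustration; the exact form and constants of the paper's own equations may differ.

```latex
\[
  P(t) = e^{-t/\tau}, \qquad P = P(T_{clk}) = e^{-T_{clk}/\tau}
\]
\[
  \mathrm{MTBF} = \frac{1}{T_W \, f_{clk} \, f_{data} \, P}
\]
```

Under this model, each additional clock period of resolution time multiplies the MTBF by a factor of e^{T_clk/τ}, which is why a third flop helps in very fast designs where T_clk is small relative to τ.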
Figure 9: Double Flop Synchronizer Contains Metastability
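A double flop synchronizer of the kind shown in Figure 9 can be sketched in SystemVerilog as follows. This is a minimal sketch rather than the authors' implementation; the module name sync2 and its port names are assumptions.

```systemverilog
// Minimal double flop synchronizer sketch; names are assumptions.
module sync2 (
  input  logic clk,      // receiving clock
  input  logic rst_n,    // asynchronous active-low reset
  input  logic d_async,  // CDC signal from the transmitting domain
  output logic q_sync    // synchronized output, safe to fan out
);
  logic meta;  // first stage: may go metastable, drives only the second flop

  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) {q_sync, meta} <= 2'b00;
    else        {q_sync, meta} <= {meta, d_async};
endmodule
```

The first-stage output meta has a fanout of one, which matches the two properties described above: minimal load on the potentially metastable flop and no fanout of a metastable value to multiple flops.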
3.2) Designing CDC Interfaces: A CDC interface is designed for reliable
transfer of correlated data across the data bus. Reliable design of a CDC
interface must follow a simple set of rules:
1) The CDC data bus must be designed for 2-cycle multi-cycle-path
operation (MCP). This means that data is captured in the CDC flops on
the second clock edge or later, following the launch of data. This also
gives one clock cycle of the receiving clock as the timing constraint on the
path. Static timing analysis should ensure that the timing constraints are
met on these paths. This rule eliminates metastability for these paths. As
data bus signals are correlated, their CDC flops cannot be allowed to
become metastable.
2) The control signals implementing the MCP protocol can become
metastable and hence must obey the following rules:
a. The controls must be properly synchronized to prevent propagation
of metastability in the design.
b. The MCP is enabled by one and only one control signal transition to
eliminate loss of correlation errors (Gray Coding).
c. The control signals should be free of hazards/glitches.
d. The control signals must be stable for more than one clock cycle of
the receiving clock.
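Rule 2b is commonly met for multi-bit control values, such as FIFO pointers, by Gray coding them so that only one bit changes per increment. The following is a minimal sketch using the well-known binary-to-Gray conversion (bin >> 1) ^ bin; the module name gray_count and its ports are assumptions for illustration.

```systemverilog
// Sketch of a Gray-coded counter whose registered output changes exactly
// one bit per increment. Module and signal names are assumptions.
module gray_count #(parameter int SIZE = 4) (
  input  logic            clk,
  input  logic            rst_n,
  input  logic            inc,    // increment enable
  output logic [SIZE-1:0] gray    // registered Gray code, sent across the CDC
);
  logic [SIZE-1:0] bin, bnext, gnext;

  assign bnext = bin + inc;              // next binary count
  assign gnext = (bnext >> 1) ^ bnext;   // binary-to-Gray conversion

  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) {bin, gray} <= '0;
    else        {bin, gray} <= {bnext, gnext};
endmodule
```

Because gray comes directly from flops, it also satisfies rule 2c: the combinational conversion logic sits before the register, so no glitches reach the receiving domain.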
These principles can be implemented using handshake protocols or FIFO based
protocols. Figure 10 shows a simple handshake CDC protocol. This interface is
transmitting data from CLK1 domain to CLK2 domain. While Data Ready is
asserted, the data on the bus Data In is transmitted across the clock domain.
The data availability is signaled by a transition on Control Signal. Transmit Data
is launched on the same clock edge. Control Signal is synchronized in the CLK2
domain and the transition is detected to signal Load Data. Since
synchronization requires at least one cycle of CLK2, Transmit Data is received at
the second edge of CLK2 or later. This creates a multi-cycle path for Transmit
Data across the interface. Feedback Signal completes the handshake.
Transition on Feedback Signal is detected to drive Next Data to the interface.
Figure 11 shows the timing diagram for the protocol. It should be noted that this
is a simplified concept of the interface. We have not incorporated the logic
initializing the interface, detecting transition in Data Ready and dealing with
stalling conditions. All these considerations, combined with latency minimization
add complexity to the design of the interface.
A good treatise on FIFO based protocols can be found in (Cummings, 2001).
Figure 10: Simple Handshake CDC Protocol


Figure 11: CDC Protocol Timing Diagram
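The handshake described above can be sketched in SystemVerilog as follows. This is a simplified model under stated assumptions, not the circuit of Figure 10: all module, port and signal names are hypothetical, the sender compares the synchronized feedback level against its own control toggle instead of detecting the feedback transition explicitly, and, as in the text, initialization details and stall handling are left out.

```systemverilog
// Simplified toggle handshake sketch. All names are assumptions; stall
// handling and latency tuning are omitted.
module cdc_handshake #(parameter int W = 8) (
  input  logic         clk1, rst1_n,   // transmitting domain
  input  logic         clk2, rst2_n,   // receiving domain
  input  logic         data_ready,     // request to send data_in (clk1)
  input  logic [W-1:0] data_in,
  output logic         busy,           // transfer in flight (clk1)
  output logic [W-1:0] data_out,       // received data (clk2)
  output logic         load_data       // enables loading of data_out (clk2)
);
  logic [W-1:0] transmit_data;
  logic         ctrl;                   // control signal: toggles once per transfer
  logic         c_meta, c_sync, c_prev; // ctrl synchronizer + edge-detect flop (clk2)
  logic         f_meta, f_sync;         // feedback synchronizer (clk1)

  // A transfer is pending until the synchronized feedback matches ctrl
  assign busy = (ctrl != f_sync);

  // clk1: launch the data and toggle ctrl on the same clock edge
  always_ff @(posedge clk1 or negedge rst1_n)
    if (!rst1_n) begin
      ctrl             <= 1'b0;
      transmit_data    <= '0;
      {f_sync, f_meta} <= 2'b00;
    end else begin
      {f_sync, f_meta} <= {f_meta, c_prev};  // c_prev doubles as the feedback signal
      if (data_ready && !busy) begin
        transmit_data <= data_in;
        ctrl          <= ~ctrl;
      end
    end

  // clk2: synchronize ctrl, detect its transition, then capture the data
  assign load_data = (c_sync != c_prev);

  always_ff @(posedge clk2 or negedge rst2_n)
    if (!rst2_n) begin
      {c_prev, c_sync, c_meta} <= 3'b000;
      data_out                 <= '0;
    end else begin
      {c_prev, c_sync, c_meta} <= {c_sync, c_meta, ctrl};
      if (load_data) data_out <= transmit_data;
    end
endmodule
```

In this sketch transmit_data is held until the feedback returns, so it is stable when data_out captures it at the second clk2 edge or later after launch. Only the single-bit ctrl and feedback signals pass through synchronizers.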
4) Verifying CDC Interfaces: A typical SOC is made up of a large number of
CDC interfaces. From the discussion above, CDC verification can be
accomplished by executing the following steps in order:
1) Identification of CDC signals
2) Classification of CDC signals as control and data
3) Hazard/Glitch robustness of control signals
4) Verification of single signal transition (Gray Coding) of control signals
5) Verification of control stability (Pulse Width requirement)
6) Verification of MCP operation (stability) of data signals
All verification processes are iterative and achieve design quality by iteratively
identifying design errors, debugging and fixing errors and re-running verification
until no more errors are detected.
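Steps 4 through 6 can be approximated with SystemVerilog assertions. The sketch below is illustrative only: the module name cdc_checks and its ports are assumptions. The checks are sampled on clock edges, so they observe only steady-state values and cannot see glitches or metastability, and they are not a substitute for the structural and formal analysis discussed in this paper.

```systemverilog
// Sketch of sampled checks for steps 4-6; all names are assumptions.
module cdc_checks #(parameter int CW = 4, parameter int DW = 8) (
  input logic          tx_clk,   // transmitting clock
  input logic          rx_clk,   // receiving clock
  input logic          rst_n,
  input logic [CW-1:0] ctrl,     // control bus, launched from tx_clk flops
  input logic          load,     // synchronized load enable (rx_clk domain)
  input logic [DW-1:0] data      // CDC data bus as seen at the CDC flops
);
  // Step 4: at most one control bit changes per transmitting clock edge
  a_single_transition: assert property (
    @(posedge tx_clk) disable iff (!rst_n)
      $onehot0(ctrl ^ $past(ctrl)));

  // Step 5: a control change seen by the receiving clock must persist
  // for at least one more rx_clk sample
  a_ctrl_stable: assert property (
    @(posedge rx_clk) disable iff (!rst_n)
      $changed(ctrl) |=> $stable(ctrl));

  // Step 6: the data bus must not have changed since the previous rx_clk
  // edge whenever it is about to be loaded (MCP stability)
  a_data_stable: assert property (
    @(posedge rx_clk) disable iff (!rst_n)
      load |-> $stable(data));
endmodule
```

A checker of this kind is typically attached to an interface with a bind statement so that the design source itself is left unchanged.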
5) Practical considerations for CDC Verification: Effective deployment of
CDC tools in the design flow requires due consideration of multiple factors. We
have discovered that the first generation CDC tools were not being used
effectively in design flows. Based upon feedback from users, we have identified
the following factors as most important considerations for CDC deployment:
1) Coverage of error sources
2) Design setup cost
3) Debugging and sign-off cost
4) Verification run-time cost
5) Template recognition vs. Report quality tradeoff
6) Top level vs. Block Level verification tradeoff
7) RTL vs. Netlist verification tradeoff
There is consistent feedback from the users that the minimization of engineering
cost for high quality verification is critical for effective deployment of the CDC
tools.
5.1) Coverage of error sources: CDC errors can creep into a design from
multiple sources. The first is inadvertent clock domain crossing where there is an
assumption mismatch or oversight at block interfaces. The second is faulty block
level design. Designers, because of oversight or because of the pressure to
design a correct and high-performing interface, can make a design error. As an
example, consider the protocol in Figure 12. Here, tapping Feedback Signal
from an earlier flop stage can reduce the latency across the interface. However,
correct operation of this interface requires that the transmitting clock frequency
be slower than the receiving clock frequency. Otherwise, it is possible to signal
New Data before Load Data is completed.
Figure 12: Reduced Latency Protocol
These two error sources are properly covered by RTL analysis. They can also be
covered by Netlist analysis. However, not all CDC error sources are covered by
RTL analysis. That is because CDC errors are dependent upon glitches and
hazards. It is a well-known phenomenon that synthesis transformations can
introduce hazards in the design. Hazards in CDC logic lead to CDC failures.
Figure 13 shows an example of a design failure caused by synthesis. Here the
multiplexor implementation created a logic hazard that violated the multi-cycle
path requirement on the data bus. We are aware of multiple design failures
because of this phenomenon.


Figure 13: Logic Hazard Caused CDC Failure
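The kind of hazard described above can be reproduced with a textbook 2:1 multiplexor. The following gate-level sketch is not the circuit of Figure 13; the module name, signal names and unit gate delays are assumptions chosen only to make the glitch visible in simulation.

```systemverilog
// Sketch of a static-1 hazard in a 2:1 mux (y = sel ? a : b).
// Names and the 1 ns gate delays are assumptions for illustration.
`timescale 1ns/1ps
module mux_hazard (
  input  wire a, b, sel,
  output wire y_hazard,   // minimal sum-of-products mux
  output wire y_safe      // same mux with the redundant consensus term a & b
);
  wire sel_n, t0, t1, t2;

  not #1 g0 (sel_n, sel);
  and #1 g1 (t0, a, sel);
  and #1 g2 (t1, b, sel_n);
  // With a = b = 1, a falling sel turns t0 off one gate delay before t1
  // turns on, so y_hazard pulses low although both data inputs are 1
  or  #1 g3 (y_hazard, t0, t1);

  // The consensus term holds the output high across the sel transition
  and #1 g4 (t2, a, b);
  or  #1 g5 (y_safe, t0, t1, t2);
endmodule
```

RTL simulation of the equivalent expression sel ? a : b shows no glitch. The hazard exists only in the gate-level implementation, which is why a check run on RTL alone can miss it.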
With the increasing complexity of SOCs and increasing number of CDC
interfaces on the chip, the contribution of this risk factor is increasing. As a
result, CDC verification must be run on both RTL and Netlist views of the design.
5.2) Design setup cost: Design setup starts with importing the design. With the
increasing complexity of SOCs, designs include RTL and netlist blocks in a
Verilog and VHDL mixed language environment. In addition, functional setup is
required for good quality of verification. A typical SOC has multiple modes of
operation characterized by clocking schemes, reset sequences and mode
controls. Functional setup requires the design to be set up in functionally valid
modes for verification, by proper identification of clocks, resets and mode select
pins. Bad setup can lead to poor quality of verification results.
Given the design management complexity for the multitude of design tasks, it is
highly desirable that there be a large overlap between setup requirements for
different flows. For example, design compilation can be accomplished by
processing the existing simulation scripts. Also, there is a large overlap between
the functional setup requirements for CDC and that for static timing analysis.
Hence, STA setup, based upon SDC, can be leveraged for cost effective
functional setup.
Correct functional setup of large designs may require setup of a very large
number of signals. This cumbersome and time-consuming drudgery can be
avoided with automatic setup generation. Also, setup has a first-order effect on
the quality of verification. Hence, early feedback on setup quality can lead to
easy and effective setup refinement for high quality of verification.
Figure 14: Design Setup Flow
5.3) Debugging and sign-off cost: The debugging cost is dependent upon the
number of errors flagged by the CDC tool. Assuming good setup, this in turn
depends upon the size and CDC complexity of the design and the maturity of the
design. Typically, debugging cost for top-level runs on immature designs will be
high. This is because the design may contain a large number of immature CDC
interfaces. This can generate a large number of failures requiring significant
debugging effort. Also, the ownership of these CDC interfaces may be
distributed between multiple designers.
Debugging cost is heavily dependent upon the reporting style of the tools.
Source-code-oriented reporting relates the errors to their real source: the HDL
functionality. Also, it produces much more compact reports. CDC verification
employs multiple technologies of increasing sophistication, like structural analysis
and formal analysis. Hence, a composite report is essential to determine the
overall quality of CDC verification.
Good clock domain, functional, structural and VCD visualization is essential for
effective debugging. Automated and advanced preprocessing of these views, to
isolate the error context, further reduces the debugging cost. Finally, debugging
support requires advanced sign-off capabilities so that the same issues are not
analyzed multiple times in the iterative verification flow.
5.4) Verification run-time cost: CDC checking is based upon multiple
technologies with varying degrees of precision. In the first step, structural
techniques are used to identify clock domain crossings and to identify possible
error sources in the design. Structural analysis tends to be relatively fast and is
very useful at detecting gross errors in the design. However, in order to
guarantee design correctness, structural analysis identifies all potential errors in
the design. This set can be very large.
As an example, consider the design in Figure 12. This reduced latency design
can operate correctly or can be erroneous depending upon the relative frequency
of the clock domains. Also, this structure can be included in a more complex
interface that handles stall and other issues making precise structural
identification difficult. If a structural technique does not compromise the quality of
checking, it has to flag this interface for manual review and signoff.
Formal analysis is an excellent technology to filter out false failures from
structural analysis and to precisely identify failures in the design. As mentioned
before, traditional formal analysis is built to analyze steady state design behavior.
Hence, these formal techniques are incapable of formally analyzing uncertain
behavior because of metastability and glitches. As a result, special formal
analysis techniques, that are capable of handling behavioral uncertainty, are
needed for CDC applications. For example, consider the failure shown in Figure
13. Here the MCP on data path is violated because of a hazard. Vanilla formal
analysis will pass the data stability check (MCP) for this structure. Data stability
for CDC interfaces can only be proven with glitch sensitive formal analysis
techniques.
Formal application needs to be seamlessly integrated into the application all the
way from invocation to reporting and debugging. This eliminates a huge
overhead of integrating external formal analysis tools into the flow and to
correlate the results from these different tools to arrive at an integrated view of
verification status.


As the computational complexity of formal analysis is very high, this can require a
large amount of computation time. However, this cost is well worth it, as it provides
significant savings in debugging and sign-off cost.
Figure 15: Verification and Debug Flow
5.5) Template recognition vs. Report quality tradeoff: The first generation
CDC tools employed structural analysis as the primary verification technology.
Given the lack of precision of this technology, users are often required to specify
structural templates for verification. Given the size and complexity of today’s
SOCs, this template specification becomes a cumbersome process where
debugging cost is traded off for setup cost. Also, the checking limitations
imposed by templates may reduce the report volume but they also increase the
risk of missing errors. In general, template based checking requires large
manual effort for effective utilization.
5.6) Top-level vs. Block Level verification tradeoff: The top-level verification
reduces the setup requirements for CDC verification but can result in higher
debugging cost as the design maturity improves iteratively. On the other hand,
block-level verification identifies errors earlier and at smaller complexity levels,
thereby creating a cleaner top-level verification. Hence, top-level debugging
cost is reduced but the overall setup and run-time cost increases.
5.7) RTL vs. Netlist verification tradeoff: As mentioned before, Netlist analysis
can cover all the CDC error sources. The debugging cost is very high for
application at the Netlist level. Also, the delay in detecting errors until much later
in the design cycle can have a serious schedule impact. However, RTL analysis
does not cover all CDC error sources. This requires that CDC verification must
also be run on Netlists.
6) A practical and efficient CDC verification methodology: After evaluating
the various considerations as mentioned above, we recommend the following
CDC verification methodology to accomplish high quality verification with minimal
engineering cost:
1) Automatically create the functional setup for the top-level design, leveraging
SDC.
2) Automatically complete the functional setup.
3) Use setup verification techniques to refine top-level functional setup.
4) Identify the sub-blocks for initial CDC verification.
5) Automatically generate block level functional setup from the top-level.
6) Run thorough block level CDC verification.
a. Examine the generated functional setup for correctness.
b. Run structural analysis.
c. Identify and fix gross design errors or refine functional setup.
d. Run formal analysis for precise error identification.
e. Debug and fix design or refine functional setup.
f. Iterate verification steps until clean.
7) Run thorough top-level CDC verification with block level result inheritance.
8) Run thorough Netlist CDC verification.


Figure 16: Top Down-Bottom Up Verification Flow


7) A framework for evaluating CDC solutions: Based upon the above
presentation, we have formulated the metrics for evaluating the quality of CDC
solutions. Our attempt is to summarize the relationship between various
attributes. We use a generic symbol f(attribute) to represent the quantification
associated with the attribute. For example, f(types_of_err) is the factor
contributed by the extensiveness of error checking conducted by the tools and
f(templ_req) is the factor measuring the complexity of setting up the checking
templates. These metrics are defined below.
[Equations 1 through 7, which define these metrics in terms of the f(attribute) factors, are not legible in this copy.]
The diagram below shows graphically the results of these equations.
Figure 17: Spiderchart for 1st and 2nd Generation CDC Tools


About the Authors:
Prakash Narain, PhD
Real Intent, Inc.

Dr. Prakash Narain is the President and CEO of Real Intent, Inc.
Prior to founding Real Intent, his career spanned IBM, AMD and
Sun where he got hands on experience with all aspects of IC
design, CAD tools design and methodology.
He was the project leader for test and verification for UltraSPARC
IIi at Sun Microsystems. He was an architect of the Mercury
Design System at AMD. He has architected and developed CAD
tools for test and verification for IBM EDA.
Dr. Narain has a Ph.D. from the University of Illinois at
Champaign-Urbana where his thesis focus was on algorithms for high level
testing and verification.
Clifford E. Cummings
Sunburst Design, Inc.

Cliff Cummings is President of Sunburst Design, Inc., a company
that specializes in world-class Verilog, SystemVerilog and synthesis
training.
Cliff has presented more than 90 SystemVerilog seminars and
training classes over the past five years and was the featured
speaker at the worldwide SystemVerilog NOW! seminars in 2003.
Cliff has participated on every IEEE & Accellera Verilog, Verilog
Synthesis, and SystemVerilog committee, and has presented some
40 papers on Verilog & SystemVerilog related design, synthesis and
verification techniques.
Cliff holds a BSEE from Brigham Young University and an MSEE from Oregon
State University.


Bibliography
Cummings, C. (2001, 3 31). Synthesis and Scripting Techniques for Designing
Multi-Asynchronous Clock Designs. Retrieved 2 1, 2008, from Sunburst Design,
Inc.: http://www.sunburst-design.com/papers/CummingsSNUG2001SJ_AsyncClk.pdf
