Digital ASIC Design

A Tutorial on the Design Flow
Digital ASIC Group
October 20, 2005
ElectroScience — 2 —
Contents
Contents 3
1 Introduction to VHDL 11
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Event-driven simulation . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Design units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 Entity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.3 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.4 Component . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.5 Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Data types and modelling . . . . . . . . . . . . . . . . . . . . . . 16
1.4.1 Type declaration . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.2 Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.3 Generics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.4 Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.5 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.6 Common operations . . . . . . . . . . . . . . . . . . . . . 18
1.5 Process statement . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5.1 Sequential vs. combinatorial logic . . . . . . . . . . . . . 19
1.6 Instantiation and port maps . . . . . . . . . . . . . . . . . . . . . 21
1.7 Generate statements . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.8 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.8.1 Testbenches . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.8.2 Modelsim . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.8.3 Non-synthesizable code . . . . . . . . . . . . . . . . . . . 25
1.9 Design example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 Design Methodology 31
2.1 Structured VHDL programming . . . . . . . . . . . . . . . . . . . 32
2.1.1 Source code disposition . . . . . . . . . . . . . . . . . . . 32
2.1.2 Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.1.3 Clock and reset signal . . . . . . . . . . . . . . . . . . . . 33
2.1.4 Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.1.5 Local variables . . . . . . . . . . . . . . . . . . . . . . . . 35
2.1.6 Subprograms . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
— 3 — Lund Institute of Technology
Contents
2.3 Technology independence . . . . . . . . . . . . . . . . . . . . . . 37
2.3.1 ASIC Memories . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.2 ASIC Pads . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.3 FPGA modules . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.4 FPGA Pads . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3.5 ALU unit . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 Arithmetic 43
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 VHDL Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.3 VHDL examples . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 DesignWare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.1 Arithmetic Operations using DesignWare . . . . . . . . . 48
3.3.2 Manual Selection of the Implementation in dc shell . . . . 51
3.4 Addition and Subtraction . . . . . . . . . . . . . . . . . . . . . . 53
3.4.1 Basic VHDL example . . . . . . . . . . . . . . . . . . . . 53
3.4.2 Increasing wordlength . . . . . . . . . . . . . . . . . . . . 54
3.4.3 Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4.4 Multioperand addition . . . . . . . . . . . . . . . . . . . . 56
3.4.5 Implementations . . . . . . . . . . . . . . . . . . . . . . . 57
3.5 Comparison operation . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5.1 Basic VHDL example . . . . . . . . . . . . . . . . . . . . 59
3.6 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.6.1 Multiplication by constants . . . . . . . . . . . . . . . . . 61
3.6.2 Multiplication of signed numbers . . . . . . . . . . . . . . 62
3.7 Datapath Manipulation . . . . . . . . . . . . . . . . . . . . . . . 63
4 Memories 67
4.1 SR example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Memory split example . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Cache example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5 Synthesis 75
5.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Setting up libraries for synthesis . . . . . . . . . . . . . . . . . . 76
5.3 Baseline synthesis flow . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.1 Sample synthesis script for a simple design . . . . . . . . 78
5.4 Advanced topics . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4.1 Wireload model . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4.2 Constraining the design for better performance . . . . . . 81
5.4.3 Compile mode concepts . . . . . . . . . . . . . . . . . . . 82
5.4.4 Compile strategy . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.5 Design Optimization and constraints . . . . . . . . . . . . 84
5.4.6 Resolving multiple instances . . . . . . . . . . . . . . . . . 85
5.4.7 Optimization flow . . . . . . . . . . . . . . . . . . . . . . 86
5.4.8 Behavioral Optimization of Arithmetic and Behavioral
Re-timing . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 Coding style for synthesis . . . . . . . . . . . . . . . . . . . . . . 89
ElectroScience — 4 —
Contents
5.5.1 Coding style if statement . . . . . . . . . . . . . . . . . . 89
5.5.2 Coding style for loop statement . . . . . . . . . . . . . . . 90
6 Place & Route 93
6.1 Introduction and Disclaimer . . . . . . . . . . . . . . . . . . . . . 93
6.2 Starting Silicon Ensemble . . . . . . . . . . . . . . . . . . . . . . 93
6.3 Import to SE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.4 Creating a mac file . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.5 Floorplanning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.6 Block and Cell Placement . . . . . . . . . . . . . . . . . . . . . . 101
6.7 Power Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.8 Connecting Rings . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.9 Clock Tree Generation . . . . . . . . . . . . . . . . . . . . . . . . 107
6.10 Filler Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.11 Clock and Signal Routing . . . . . . . . . . . . . . . . . . . . . . 110
6.12 Verification and Tapeout . . . . . . . . . . . . . . . . . . . . . . . 111
6.13 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7 Optimization strategies for power 113
7.1 Sources of power consumption . . . . . . . . . . . . . . . . . . . . 113
7.1.1 Dynamic power consumption . . . . . . . . . . . . . . . . 113
7.1.2 Static power consumption . . . . . . . . . . . . . . . . . . 114
7.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.2.1 Architecural issues—Pipelining and parallelization . . . . 115
7.2.2 Clock gating . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.2.3 Operand isolation . . . . . . . . . . . . . . . . . . . . . . 119
7.2.4 Leakage power optimization . . . . . . . . . . . . . . . . . 120
7.2.5 Beyond clock gating . . . . . . . . . . . . . . . . . . . . . 121
7.3 Power estimation with Synopsys Power Compiler . . . . . . . . . 122
7.3.1 Switching Activity Interchange Format (SAIF) . . . . . . 123
7.3.2 Information provided by the cell library . . . . . . . . . . 125
7.3.3 Calculating the power consumption . . . . . . . . . . . . . 125
7.4 Power-aware design flow at the gate level . . . . . . . . . . . . . 126
7.5 Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.5.1 Design flow . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.5.2 Patches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.6 Commands in DC . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.6.1 Some useful DC commands . . . . . . . . . . . . . . . . . 131
7.6.2 nc-verilog . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.7 The ALU example . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8 Formal Verification 135
8.1 Concepts within Formal Verification . . . . . . . . . . . . . . . . 135
8.1.1 Formal Specification . . . . . . . . . . . . . . . . . . . . . 136
8.1.2 The System Model . . . . . . . . . . . . . . . . . . . . . . 136
8.1.3 Formal Verification . . . . . . . . . . . . . . . . . . . . . . 137
8.1.4 Formal System Construction . . . . . . . . . . . . . . . . 137
8.2 Areas of Use for Formal Verification Tools . . . . . . . . . . . . . 137
8.2.1 Model Checking . . . . . . . . . . . . . . . . . . . . . . . 137
8.2.2 Equivalence Checking . . . . . . . . . . . . . . . . . . . . 137
— 5 — Lund Institute of Technology
Contents
8.2.3 Limitations in Formal Verification Tools . . . . . . . . . . 138
8.3 Other Methods to Decrease Simulation Time . . . . . . . . . . . 138
8.3.1 Running the simulation in a cluster of several computers . 138
8.3.2 ASIC-emulators . . . . . . . . . . . . . . . . . . . . . . . . 138
8.3.3 FPGA-Prototypes . . . . . . . . . . . . . . . . . . . . . . 138
8.4 Future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
9 Multiple Supply Voltages 141
9.1 Demonstration of Dual Supply Voltage in Silicon Ensemble . . . 143
9.1.1 Editing Power Connections in the Netlist . . . . . . . . . 143
9.1.2 Floorplan and Placement . . . . . . . . . . . . . . . . . . 145
9.1.3 Power and Signal Routing . . . . . . . . . . . . . . . . . . 146
9.1.4 Filler Cells . . . . . . . . . . . . . . . . . . . . . . . . . . 148
ElectroScience — 6 —
Preface
When buying a book on hardware design, the focus is often limited to one
area. It could be on signal processing, system level design, VHDL and other
programming languages or arithmetic. In this manual, we will try to describe
the design flow from developing code to chip layout, see Figure 1. The manual
is divided into the following main sections:
Function
Function
Synthesis
Layout
Tape out
RTL
Function
Timing
Prelayout
Postlayout
Function
Timing
Behavioral
Standard
Library
Constraints
Test Vectors
Timing
Information
Specification
RTL Coding
Synopsys: Design Compiler
C/C++
MATLAB
Cadence: Silicon Ensemble
Floating/Fixed
Point Modeling
VHDL/Verilog
Mentor Graphics: ModelSim
Standard
Library
Timing
Information
Figure 1: Common digital ASIC design flow.
— 7 — Lund Institute of Technology
Contents
VHDL
Chapter 1 presents an introduction to VHDL and the basic concepts of de-
scribing and simulating hardware. This is however not a complete reference to
the language itself, more an overview and a selection of the most important
parts of the language. Chapter 2 presents a strucured way of writing and using
VHDL. The design methodology described here is developed by Jiri Gaisler at
Chalmers. This chapter also describes how to develop platform independent
code, supporting several technologies such as programmable logic and various
cell libraries.
Arithmetic
The chapter on arithmetic describes how to infer a component for an arithmetic
operation in a digital design. The most commonly used arithmetic operations
are described and the connection between the VHDL description and the fi-
nal hardware structure realized during synthesis is illustrated. This chapter
gives the reader an overview on how the determine a highly effective hardware
structure by coding and controlling the synthesis tool.
Memories
One of the most important topics in digital ASIC design today is memories. Not
only the amount of memory but also the memory hierarchy, including caches
and off-chip memories, has to be considered. Each memory hierarchy should
be optimized for high speed, low power, small area or a combination of these,
depending on the application. However, this is a too large topic to cover in this
compendium. Instead three different designs are presented in this Chapter in
order to show possible optimizations to various memory structures.
Synthesis
This chapter gives a overview of the whole synthesis process, and covers the basic
synthesis concepts and guidelines for design optimization. It starts with the
basic concept of what synthesis is, followed by environment setups for synthesis,
and wireload model basics. In order to better understand the effect of setting
constraints, the synthesis process is described. At last, VHDL coding style is
mentioned as a guideline for better synthesis results.
Place & Route
Place and Route is the final step before sending the design for fabrication. The
standard cells determined during synthesis are placed automatically on the cell
core. Location of the pads and placement of the on-chip memory has to be
defined and carried out by the user. The core supply, power and ground rings,
are added. A design-rule check (DRC) verifies that the fabrication constraints
are met and finalizing your design for fabrication.
ElectroScience — 8 —
Contents
Power Estimation
This chapter gives an introduction to power estimation and optimization tech-
niques in an ASIC design flow with Synopsys Power Compiler. After a short
review of the sources of power consumption in a digital circuit, tool-independent
optimization techniques are presented for different abstraction levels. It is also
shown how the design tool interacts with information from the cell library and
external simulation tools to estimate and optimize power at gate level. For those
who like to probe further, additional information is found in the reference list
at the end of the chapter.
— 9 — Lund Institute of Technology
ElectroScience — 10 —
Chapter 1
Introduction to VHDL
This chapter is a brief introduction to hardware design using a Hardware De-
scription Language (HDL). A language describing hardware is quite different
from C, Pascal, or other software languages. A computer program is dynamic,
i.e., sharing the same resources, allocating resources when needed and not al-
ways optimized for maximum speed, optimal memory management, or lowest
resource requirements. The main focus is functionality, but it is still not un-
common that software programs can behave quite unexpected. When problems
arise, new versions of the programs are distributed by the vendor, usually with
a new version number and a higher price tag.
The demands on hardware design are high compared to software. Often it is not
possible, or at least very tricky, to patch hardware after fabrication. Clearly, the
functionality must be correct and in addition how the code is written will affect
the size and speed of the resulting hardware. Each mm
2
of a chip costs money,
lots of money. The amount of logic cells, memory blocks and input/output con-
nections will affect the size of the design and therefore also the manufacturing
cost. A software designer using a HDL has to be careful. The degrees of freedom
compared with software design have dramatically increased and must be taken
into account.
The focus of this chapter is not to present VHDL in detail. A programming
language often specifies more functionality than actually needed. This is espe-
cially true for VHDL, since it is only a small part of the program language that
can be synthesized to hardware. In this chapter, the basics parts of VHDL pro-
gramming that can be used in hardware design will be presented. In the next
chapter, a design methodology is applied and a structured way of writing VHDL
will be presented. There are numerous available books that addresses VHDL
coding and to probe further, the book written by J. Bhasker [1] is recommended
as first choice. Another book worth mentioning is written by P. J. Ashenden [2]
and is used at many Universities.
1.1 Background
VHDL is a acronym which stands for Very High Speed Integrated Circuit Hard-
ware Description Language. VHDL has been standardized by IEEE and the
most widely used version is called VHDL’87 (1079-1987). In 1994, an updated
— 11 — Lund Institute of Technology
Chapter 1. Introduction to VHDL
version called VHDL’93 introduced new functionality and a more symmetric
syntax. However, using the VHDL’87 syntax guarantees compatibility with
both old and new synthesis and simulation tools. A recommendation is to
use VHDL’87 syntax for synthesizable code, while testbenches and other non-
synthesizable parts can be written using VHDL’87/93. Although being a hard-
ware description language, VHDL still supports file access, floating point, and
complex numbers, and other features not directly associated with hardware de-
sign. This comes in handy when writing reference designs or testbenches, but
is not even close to being synthesizable, i.e., transferred into physical gates and
how they are connected which is equivalent to a gate-level netlist or simply
called a netlist.
1.2 Event-driven simulation
A computer program running on a microprocessor executes one instruction at
a time. Therefore, it is easy to debug and understand the behaviour of the
program. A compiler translates the source code into machine language and
executes the instructions sequentially. Hardware is by nature concurrent, i.e.,
statements are executed in parallel depending on the chosen architecture. In-
stead of compiling the VHDL code into a sequential program, it is executed using
an event-driven simulator. This simulator understands the concept of time and
delay. The smallest time quantum is called delta (∆), which is the time it will
take to assign a value to a signal. When an external event triggers the system,
the timescale is incremented with ∆ until all signals are stable, i.e. until there
are no more events in the system to process. The external event could be, e.g.,
the master clock, updating the system registers on a rising or falling edge. Fig-
ure 1.1 shows an example of event driven simulation. It is shown that it takes
two (∆) to propagate signals from input to output in this design.
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1 1
Stable t = T − 1 New inputs t = T t = T + ∆ t = T + 2∆
Figure 1.1: Event-driven simulation of a small design.
1.3 Design units
This chapter will present the five design units in VHDL. The entity, describing
the interface between building blocks. The architecture, describing the struc-
ture or behaviour of a block. The configuration associating an architecture
with an entity. The component that is used to include a building block as
part of a new building block. Finally, packages to store common or global
information.
ElectroScience — 12 —
1.3. Design units
1.3.1 Entity
The concept of VHDL coding can be compared to building a design using dis-
crete logical circuits and wires. The once so popular 74xx circuits are series of
TTL logic blocks with different logical functions, Boolean logic, inverters, bi-
nary counters and flip-flops. They come in small packages with a number of pins
connected to the outside world. This can be compared to an entity in VHDL.
The entity describes the interface, the input signals, and the output signals,
without any information about the functionality. In Figure 1.2 the interface of
a TTL circuit is described in VHDL. All signals except for power and ground
are declared in the entity together with signal direction, i.e., input or output.
Power and ground signals will not be part of the design flow until the design is
placed and routed, see Chapter 6 for more details.
entity TTL is
port( A1,B1 : in std_logic;
...
A4,B4 : in std_logic;
Y1 : out std_logic;
...
Y4 : out std_logic
);
end;
Figure 1.2: The entity describes the interface of the block.
1.3.2 Architecture
Every entity has one or several architectures describing the behaviour of the
block. For example, the circuit in Figure 1.2 contains 4 NAND gates. However,
using the same entity but replacing the architecture with a description of 4 NOR
gates changes the behaviour but not the interface to the block.
architecture OR of TTL is architecture NAND of TTL is
begin begin
Y1 <= A1 or B1; Y1 <= not (A1 and B1);
Y2 <= A2 or B2; Y2 <= not (A2 and B2);
Y3 <= A3 or B3; Y3 <= not (A3 and B3);
Y4 <= A4 or B4; Y4 <= not (A4 and B4);
end; end;
1.3.3 Configuration
A configuration specifies which architecture to use for a certain instantiation
of the entity. Normally, there is only one architecture for each entity, and in
this case no configuration is needed. However, it is useful to write a config-
uration for post-synthesis simulations, selecting and simulating the gate-level
netlist generated by the synthesis tool. This way, it is possible to create sev-
eral configurations to simulate before synthesis, after synthesis, and finally after
place and route. To use two architecture to describe different behaviourals of
the same entity, as shown in Section 1.3.2, is not commonly used, since it is not
always supported in the synthesis tool. However, to use it in the testbench is
okay, since it is not synthesized.
— 13 — Lund Institute of Technology
Chapter 1. Introduction to VHDL
configuration cfg_pre_syn_adder of tb_adder is -- configuration before synthesis
for behavioural -- testbench architecture name
for iadd: adder -- DUT instantiation name
use entity WORK.ADDER(STRUCTURAL); -- point to behavioural adder
end for;
end for;
end;
configuration cfg_post_syn_adder of tb_adder is -- configuration after synthesis
for behavioural -- testbench architecture name
for iadd: adder -- DUT instantiation name
use entity WORK.ADDER(SYN_STRUCTURAL); -- point to synthesized adder
end for;
end for;
end;
If no configuration exists, a default architecture will be selected. The default
architecture is selected by the simulation or synthesis tool, taking the most
recent compiled architecture with the exact same name and port description as
the entity. If no such architecture exists, the entity will be linked to an empty
black box. This could be used for inserting components in the empty space after
synthesis.
1.3.4 Component
Component declarations is used to include a pre-made design as part of a new
design. To be able to use the pre-made design in the new design it has to
be declared in the code as a component. The component declaration is either
included in a package or placed in the architecture part of the code before
begin. For example, to include the previously mention TTL design in a new
larger design, the code could look like this:
entity larger_design is
port( A : in std_logic;
Y : out std_logic);
end;
architecture my_first_arch of larger_design is
component TTL
port( A1,B1 : in std_logic;
...
A4,B4 : in std_logic;
Y1 : out std_logic;
...
Y4 : out std_logic
);
end comonent;
begin
-- use TTL here
end;
1.3.5 Package
A package can be used to store global information such as constants, types,
component declarations or subroutines/functions. A package in VHDL is like
the header files used in C. The package is separated in a declaration and a body.
The body contains implementations of the functions in the declaration part.
package MY_STUFF is
constant ZERO : std_logic := ’0’;
constant ONE : std_logic := ’1’;
function invert(v : std_logic_vector) return std_logic_vector;
ElectroScience — 14 —
1.3. Design units
end;
package body MY_STUFF is
function invert(v : std_logic_vector) return std_logic_vector is
variable result : std_logic_vector((v’length)-1 downto 0);
begin
result := not v; -- Input v is returned bit wise inverted
return result; -- from the function invert
end;
end;
The package can be included at the beginning of a VHDL file using
library MY_LIB
use MY_LIB.my_stuff.all;
— 15 — Lund Institute of Technology
Chapter 1. Introduction to VHDL
1.4 Data types and modelling
After describing the design units, the next step is to connect the building blocks.
The entity describes the interface of each block and these interfaces are con-
nected using signals. Signals are like wires on a PCB (Printed Circuit Board),
routing the traces between the individual chips. Signals are connected to the
building blocks with port maps that specifies which input/output is connected
to which signal. This chapter also presents generics, specified in the entity,
which makes it possible to pass constant values to a module and to dynamically
configure the wordlength of input and output signals in the port map. Finally,
constants and variables will be presented.
1.4.1 Type declaration
A constant, variable, signal, or generic can be of any type. To simplify this
section, only the most useful types will be described. Types and subtypes can
be declared in the architecture, or in a package if the types are going to be used
between entities. Subtypes are limited sets of a type, e.g., the range 0 to 7 is
a subtype of the type integers. There are two reasons to use subtypes; One, to
give informative names to the subtypes and secondly, the simulator will raise
a warning if values exceeds the subtype range. Integers and enumerated data
types are built-in (package standard), while std logic is defined in a package
standardized by IEEE. Records are structures containing several values of any
type, and will be further discussed in the next chapter about design methodol-
ogy.
library IEEE;
use IEEE.std_logic_1164.all; -- std_logic package
architecture ... of ... is
type state_type is (IDLE, START, STOP); -- enumeration
subtype byte is integer range 0 to 255; -- 8 bit integer
subtype data_type is std_logic_vector(7 downto 0); -- 8 bit vector
type reg_type is record -- define record
state : state_type; -- using types above
data : data_type;
end record;
begin
Type Description Assigning Representation
integer numbers i := 32 32-bit number max
std logic a single bit v := ’0’ U,X,0,1,Z,W,L,H,-
std logic vector bit vector v := ”00” U,X,0,1,Z,W,L,H,-
enumeration set of names s := IDLE any legal name
record set of types r.state := IDLE
1.4.2 Signals
Signals can be viewed as physical connections between blocks and inside mod-
ules. Signals are used as interconnects between ports and are assigned a value
using the <= operator. Inputs and outputs described in the entity are also
signals but with a predefined direction. The value of an input signal can only
be read, and an output signal can only be assigned a value. For signals declared
in the architecture body, the direction is not specified.
ElectroScience — 16 —
1.4. Data types and modelling
entity INVERTER is
port( x : in std_logic;
y : out std_logic;
);
end entity;
architecture behavioural of INVERTER is
signal internal : std_logic; -- internal signal, no direction is specified
begin
internal <= ’0’ when x = ’1’ else ’1’; -- using WHEN statement
y <= not x; -- boolean function
end;
1.4.3 Generics
One important feature in VHDL is the possibility of building generic compo-
nents, i.e., creating dynamic designs. For example, when designing a component
for adding two values, it would be inconvenient if the number of bits of the in-
put operands were static values. If the adder component is reused in several
places in the design, a static approach would force the designer to create several
versions of the same component with various wordlengths. Therefore, the entity
can be described using generic values in addition to the interface. The generic
value of a specific instantiation is given a value with the generic map command
that is similar to the port map command. The value of a generic parameter can
be used in the port specification of the entity as well as in the architecture.
...
architecture struct of ALU is
component adder -- component declaration
generic( WIDTH : integer ); -- generic parameter
port( a : in std_logic_vector(WIDTH-1 downto 0); -- signals controlled by
b : in std_logic_vector(WIDTH-1 downto 0); -- generic parameter
z : out std_logic_vector(WIDTH downto 0)
);
end component;
begin
signal c,d : std_logic_vector(8 downto 0);
signal q : std_logic_vector(9 downto 0);
adder_1 : adder generic map(WIDTH => 9) port map(a => c, b => d, z => q);
-- WIDTH is given the value 9 and the signals c, d, and q are connected to
-- the in and out ports of the adder instantiation adder_1.
...
1.4.4 Constants
It is often useful to define constants instead of hard-coding values in the VHDL
code. If the constant names are chosen with care, e.g., MEM ADDR BITS to
describe the width of an address bus to a memory, it will improve the readability
and simplify the maintenance of the code. Constants can be declared in a
package or in the architecture. Constants declared in the architecture are only
local, while constants defined in a package can be used as global values.
package my_constants is -- in package
constant GND : std_logic := ’0’; -- logical zero
constant VDD : std_logic := ’1’; -- logical one
constant MAGIC : std_logic_vector(3 downto 0) := "1010"; -- constant vector
end;
— 17 — Lund Institute of Technology
Chapter 1. Introduction to VHDL
architecture arch_counter of counter is -- in architecture
constant MAXVAL : std_logic_vector(3 downto 0) := "1111"; -- all one vector
begin
...
end;
1.4.5 Variables
Variables can be used to temporarily store values in a process. Assigning a
value to a variable has a different syntax than for a signal, using the := opera-
tor. When discussing design methodologies, we will show how variables can be
used to avoid signal assignment and to change the concurrent behaviour into
sequential execution. The advantage of variables is that they are sequentially
executed since a value is assigned immediately, as in software programs. From
a simulation point of view, it is also faster to work with variables than signals.
Normally, variables are only used inside a process, but in later versions of VHDL
it is also possible to declare shared variables directly in the architecture.
It is important to note that the hardware will be the same whether variables
or signals are used! The hardware will not be sequential, it is only the code
that is executed sequentially. Variables are used to increase the readability of
the code and to reduce simulation time. Below is an example of the differences
using variables and signals, but the hardware will be exactly the same in both
cases. The process to the right uses only signals, which all change values at
the same time. The first time this process is executed due to a change in z,
signals a will become z +1 and b will become 3 +1, the old value of a plus one.
These changes will trigger the process one more time, since a and b are in the
sensitivity list of the process. The second time correct values will be assigned
to b and q. The process to the left that uses variables, will assign correct values
to a, b, and q the first and only time the process is triggered by a change in z.
For more details of processes see Section 1.5.
process(z) -- sensitive to changes on z process(a,b,z) -- sensitive to a, b, z
variable a,b integer; begin -- assume that z=5 and a=3 then
begin -- assume that z=5 then a<=z+1; -- a=6 after 1 delta
a:=z+1; -- a=6 b<=a+1; -- b=4 after 1 delta and b=7 after 2 delta
b:=a+1; -- b=7 -- changes to a and b will trigger process again
q<=b; -- signal q=7 after one delta q<=b; -- q=4 after 1 delta and 7 after 2 delta
end process; end process;
1.4.6 Common operations
Table 1.1 shows a list of common operators and which data types that can be
used with them. All logical operations, e.g., and and or, are performed bit wise
and both input most be of the same length. The input to comparison operations
do not have to be of the same length, but the smallest input will be extended to
the same length as the widest input in the hardware. For wordlengths concerning
arithmetic operations see Chapter 3.
ElectroScience — 18 —
1.5. Process statement
1.5 Process statement
Processes are used to increase code readability, reduce simulation time, and
for the synthesis tool to recognize some common hardware blocks, e.g., flip-
flops and latches. A hardware architecture can contain an arbitrary amount
of processes. However, to increase code readability it is recommended that a
maximum of two processes are used, one containing sequential logic and one
containing combinatorial logic, e.g., flip-flops and latches.
A process can be viewed as a self-contained module, sensing changes in the
input signals and changing the output signals accordingly. A process has a
sensitivity list which reacts to changes in the input signals. If a signal in
the sensitivity list changes, all statements in the process will be executed once.
Therefore, it is very important to add all dependent signals, i.e., signals that
are being read in the process, to the sensitivity list. Otherwise, the simulation
result will not be correct.
A process without a sensitivity list is executed continuously and has to con-
tain some kind of wait statement. Otherwise, the simulation tool will lock up.
A wait statement could for example be wait until sel=’1’, continuing execution
when the condition is met. A simple MUX is presented below. The process is
sensitive to changes to the input operands a and b and the select signal sel.
entity mux is
port( a : in std_logic;
b : in std_logic;
sel : in std_logic;
q : out std_logic
);
end entity;
architecture behavioural of mux is
begin
process(a,b,sel) -- sensitive to changes on A, B, SEL
begin
if (sel = ’1’)
q <= a;
else
q <= b;
end if;
end process;
end;
1.5.1 Sequential vs. combinatorial logic
All the examples so far have demonstrated combinatorial circuits, for example
Boolean functions, adders or multiplexers. Normally, the combinational blocks
are separated by sequential logic, updated only on the rising or falling edge of
the clock signal, see Figure 1.3. By combining sequential and combinatorial
logic, it is possible to describe state machines, counters, and other sequential
units. The accumulator in Figure 1.4 uses a sequential unit to store the sum
of all previous input values. It uses a feedback signal to add the current input
to the accumulated sum. The reset signal forces the sequential unit back to
zero. The code example below shows the implementation of the accumulator,
using one process to describe the combinatorial logic and one process for the
sequential unit. The command a <= (others =>

0

) is used to assign zero to
all bits in a, independently of the wordlength of a.
entity ACCUMULATOR is
— 19 — Lund Institute of Technology
Chapter 1. Introduction to VHDL
clk clk clk
Figure 1.3: Combinatorial logic divided by sequential registers.
1
0
0
d
q
rst clk
Figure 1.4: Simple accumulator with reset. The adder and MUX are the com-
binatorial part of the circuit. The register is sequentially updated by the clock
signal.
generic( BITS : integer );
port( clk : in std_logic;
rst : in std_logic;
d : in std_logic_vector(BITS-1 downto 0);
q : out std_logic_vector(BITS-1 downto 0));
end;
architecture MY_GOD_I_HOPE_THIS_WORKS of ACCUMULATOR is
signal sum, feedback : std_logic_vector(BITS-1 downto 0);
begin
comb: process(rst, d, feedback) -- combinatorial update
begin
if (rst = ’0’) then
sum <= (others => ’0’); -- assign ’0’ to all vector elements
else
sum <= feedback + d;
end if;
end process;
seq: process(clk) -- sequential update
begin
if rising_edge(clk) then -- update on the rising clock edge
feedback <= sum;
end if;
end process;
q <= feedback;
end;
ElectroScience — 20 —
1.6. Instantiation and port maps
1.6 Instantiation and port maps
Hardware design is well suited for a hierarchical approach. Small logical cells
build arithmetic units such as adders. Adders are basic building blocks in signal
processing, and several signal processing algorithms can form a complete system.
This way of designing is called bottom-up, first specifying the smallest and
most fundamental building block, then using (instantiating) these basic blocks
to build more complex designs. Using previously designed components in a new
component is called instantiation. For example, an FIR filter instantiates a
number of adders and multipliers and connects wires between the instantiated
blocks. The procedure for instantiation is presented below. Examples of how to
instantiate components can be found in Section 1.9.
1. Create a new entity adder, as shown in Section 1.3.1.
2. Specify the adder architecture, as shown in Section 1.3.2.
3. Create a new entity and architecture to the FIR.
4. Include a component declaration of the adder, in the FIR architecture
(before begin).
5. Instantiate the adder in the FIR architecture (after begin)
6. Connect the signals in the port map to create a filter function
ent it y ADDERis
...
end;
ent it y ADDERis
...
end;
archit ect ure ... of ADDERis
...
end;
ent it y FIRis
...
end;
archit ect ure ... of FIRis
...
end;
archit ect ure ... of FIRis
component ADDER
...
end component ;
begin
...
end;
begin
myadd1: ADDER
generic map ( WIDTH => 4)
port map( A => A,
B => B,
Z => Z );
myadd2: ADDER
...
FIR FIR FIR ADD ADD
1 2 3 4 5
Figure 1.5: FIR filter instantiating a generic adder component.
1.7 Generate statements
When building highly regular structures, for example a direct map FIR filter,
instantiating components by hand would be a cumbersome procedure. In the
case of an FIR filter, the number of components depends on the actual filter size.
Changing the filter size would require manual work to instantiate a different
number of components. However, VHDL has a construct to cope with the
situation of dynamic instantiation. The generate statement, similar to the for
statement, can be used for instantiating blocks or connecting signals using the
loop counter as index.
A frequently used signal processing element is the CORDIC (Coordinate
Rotation Digital Computer). It performs a series of pseudo-rotations based
on shift and add operations, useful for evaluating trigonometrically functions.
An N bit CORDIC element can be implemented as a time multiplexed unit
performing N iterations or by pipelining N units, as shown in Figure 1.6. The
— 21 — Lund Institute of Technology
Chapter 1. Introduction to VHDL
ctrl
xyz
ctrl
xyz
0 1 N-1
Figure 1.6: A pipelined CORDIC element with N bit precision.
example in this section shows how to instantiate N units and connecting them
in a pipeline.
architecture structural of cordic is
subtype data_type is std_logic_vector(N-1 downto 0);
type cordic_data is record
ctrl : cordic_control_type;
x, y, z : data_type;
end record;
type cordic_pipeline is array(0 to N) of cordic_data;
signal pipeline : cordic_pipeline;
begin
gen_pipe: for i in 0 to N-1 generate
inst_unit: cordic_unit
generic map( BITS => N,
STAGE => i )
port map( clk => clk,
rst => rst,
ctrli => pipeline(i).ctrl,
ctrlo => pipeline(i+1).ctrl,
xi => pipeline(i).x,
yi => pipeline(i).y,
zi => pipeline(i).z,
xo => pipeline(i+1).x,
yo => pipeline(i+1).y,
zo => pipeline(i+1).z );
end generate;
end;
ElectroScience — 22 —
1.8. Simulation
1.8 Simulation
1.8.1 Testbenches
When writing a design, it is important to verify its functionality. The most
common method of doing this is to create a testbench, i.e., instantiating a
DUT (Device Under Test), generate test vectors (a set of inputs), and monitor
the output, as shown in Figure 1.7. Common testbench tasks are to generate
clock and reset signals, and read/write information to file. Writing the output
values to a file makes it possible to verify the result using, e.g., Matlab. An
example testbench is found below.
0101
0101
d
q
rst
clk
DUT read
write
Figure 1.7: Testbench providing stimuli to a device under test (DUT).
library IEEE;
use IEEE.std_logic_1164.all;
use std.textio.all;
entity tb_exmple is
end tb_exemple;
architecture behav of tb_example is
file in_data:text open read_mode is "../stimuli/input";
file out_data:text open write_mode is "../stimuli/output";
constant half_period: time :=10 ms; -- half period = half clock period
component test_design
port (clk : in std_logic;
rst : in std_logic;
input : in std_logic_vector(1 downto 0);
output: out std_logic_vector(1 downto 0));
end component;
signal clk: std_logic:=’0’; -- start value of clk is ’0’
signal rst: std_logic:=’1’;
signal input_sig, output_sig : std_logic_vector(1 downto 0);
begin
rst <= ’0’ after 20 ms;
DUT : test_design port map (
clk => clk, -- device pin (clk) connected with (=>) signal in tb (clk)
rst => rst,
input => input_sig,
output=> ouput_sig);
process(clk)
variable buffer_in,buffer_out:line;
variable d,result :integer;
begin
if (clk=’1’) and (clk’event) then -- at each positive clock edge a new input
readline(in_data,buffer_in); -- is written to the DUT and the output is
read(buffer_in,d); -- written to the output file
inp_sig <= CONV_STD_LOGIC_VECTOR(d,2) after 2 ns; -- delay input 2 ns
result:=CONV_INTEGER(output_sig);
write(buffer_out,result);
— 23 — Lund Institute of Technology
Chapter 1. Introduction to VHDL
writeline(out_data,buffer_out);
end if;
end process;
process --clock generator
begin
wait for half_period;
clk <= not clk;
end process;
end;
1.8.2 Modelsim
Modelsim is a widely used simulator for VHDL and Verilog-HDL. The screen-
shots from the program in Figure 1.8 and Figure 1.9 show the simulation of the
accumulator example from Section 1.5.1. Simulation is performed by compiling
a VHDL file using the Modelsim command vcom, loading the design with vsim
and finally starting the simulation with the run command. All commands can
either be written at the prompt or found in the GUI. The waveform window
shows how the output value is incremented with d for each clock cycle.
ModelSim> vcom -work MYLIB accumulator.vhd
ModelSim> vsim MYLIB.accumulator
ModelSim> run 1000 ns
Figure 1.8: Modelsim main window. Can be used for compiling and simulating
designs.
ElectroScience — 24 —
1.9. Design example
Figure 1.9: Modelsim waveform window. The accumulated values increases with
d for each clock cycle.
1.8.3 Non-synthesizable code
Sometimes it is convenient to include functions that are non-synthesizable dur-
ing simulation. It could be printing a message in the simulation window, or
writing important information to a file. During synthesis, these lines of code
must be discarded. This can be done by placing a special line before and after
the non-synthesizable code. Place the line
-- pragma translate_off
before, and
-- pragma translate_on
after the part you want to hide from the synthesizing tool. When using Syn-
opsys, a system variable must also be set in order for Synopsys to understand
the translate off and translate on commands. This is done by setting the
following variable in the synthesis script:
hdlin_translate_off_skip_text = true
An example of non-synthesizable code is signal delays. During simulation, it
is possible to delay the signal assignment to emulate the behaviour of gates,
memories or external interfaces. However, most synthesis tools understand this
syntax and will simply ignore it; generating a warning message. It is not neces-
sary to use translate off when adding signal delays.
1.9 Design example
A small design example of an ALU (arithmetic logical unit) will be presented in
this section. The ALU supports addition, subtraction, and some logical func-
tions. There are two input operands, input control signal, and one output for
the ALU result. The architecture for the design example is found in Figure 1.10.
— 25 — Lund Institute of Technology
Chapter 1. Introduction to VHDL
ALU_OP
S
X
A
M
B
Q
R
Mux
R_reg
Q_reg
Alu
Shift
Figure 1.10: Schematic view of the ALU design.
ALU unit
entity ALU is architecture BEHAVIOURAL of ALU is
generic (WL : integer); begin
port( ALU_OP : in std_logic_vector(2 downto 0); process (A,ALU_OP,M)
A, M : in std_logic_vector(WL-1 downto 0); begin
X : out std_logic_vector(WL-1 downto 0)); case ALU_OP is
end; when "000" => X <= A;
when "001" => X <= A+M;
when "010" => X <= A-M;
when "011" => X <= M-A;
when "100" => X <= -(A+M);
when "101" => X <= (A and M);
when "110" => X <= (A or M);
when "111" => X <= (A xor M);
when others => null;
end case;
end process;
end;
Shift unit
entity SHIFTOP is architecture BEHAVIOURAL of SHIFTOP is
generic(WL : integer); begin
port( SHIFT : in std_logic_vector(1 downto 0); process (X,SHIFT)
X : in std_logic_vector(W_L-1 downto 0); variable S_temp : std_logic_vector(WL-1 downto 0);
S : out std_logic_vector(W_L-1 downto 0)); begin
end; case SHIFT is
when "01" => S <= X(WL-1) & X(WL-1 downto 1);
when "11" => S <= X(WL-2 downto 0) & ’0’;
when "10" => S <= X(WL-3 downto 0) & "00";
when others => S <= X;
end case;
end process;
end;
MUX unit
entity MUX is architecture BEHAVIOURAL of MUX is
generic(WL : integer); begin
port( MUX : in std_logic; process (MUX,B,R)
B,R : in std_logic_vector(WL-1 downto 0); begin
M : out std_logic_vector(WL-1 downto 0)); if MUX=’0’ then
ElectroScience — 26 —
1.9. Design example
end; M <= B;
else
M <= R;
end if;
end process;
end;
Registers
entity REGISTER is architecture BEHAVIOURAL of REGISTER is
generic(WL : integer); begin
port( clk : in std_logic; process (clk,WR_R,S)
WR_R : in std_logic; begin
S : in std_logic_vector(WL-1 downto 0); if (clk=’1’) and (clk’event) then
R : out std_logic_vector(WL-1 downto 0)); if WR_R=’1’ then
end; R <= S;
end if;
end if;
end process;
end;
— 27 — Lund Institute of Technology
Chapter 1. Introduction to VHDL
ALU toplevel
The ALU toplevel instantiates the previously created designs and connects the
port maps using signals.
entity ALU_TOP is
generic(WL : nteger := 8);
port( A, B : in std_logic_vector(WL-1 downto 0);
ALU_PORT : in std_logic_vector(7 downto 0);
clk : in std_logic;
Q : out std_logic_vector(WL-1 downto 0));
end;
architecture BEHAVIOURAL of ALU_TOP is
component alu -- declare components
generic (WL : integer);
port( ALU_OP : in std_logic_vector(2 downto 0);
A, M : in std_logic_vector(WL-1 downto 0);
X : out std_logic_vector(WL-1 downto 0));
end component;
...
-- declare signals
signal OEN, WR_R, MUX : std_logic;
signal SHIFT : std_logic_vector(1 downto 0);
signal ALU_OP : std_logic_vector(2 downto 0);
signal R,S,M,X : std_logic_vector(WL-1 downto 0);
begin
OEN <= ALU_PORT(7); -- connect signals
WR_R <= ALU_PORT(6);
SHIFT <= ALU_PORT(5 downto 4);
MUX <= ALU_PORT(3);
ALU_OP <= ALU_PORT(2 downto 0);
A_OP : alu -- instatiate units
generic map (WL => WL)
port map (ALU_OP => ALU_OP, A => A, M => M, X => X);
S_OP : shiftop
generic map (WL => WL)
port map (SHIFT => SHIFT, S => S, X => X);
M_OP: mux
generic map (WL => WL)
port map (MUX => MUX, B => B, R => R, M => M);
R_OP: register
generic map (WL => WL)
port map (clk => clk, WR_R => WR_R, S => S, R => R);
Q_OP: register
generic map (WL => WL)
port map (clk => clk, WR_R => OEN, S => S, R => Q);
end;
ElectroScience — 28 —
1.9. Design example
Table 1.1: List of operations.
Operator Description Operand type Example
and logical and bit, vector a := b and c
or logical or bit, vector a := b or c
nand logical nand bit, vector a := b nand c
nor logical nor bit, vector a := b nor c
not logical not bit, vector a := not b
xor logical xor VHDL’93 bit, vector a := b xor c
xnor logical xnor VHDL’93 bit, vector a := b xnor c
= equal bit, vector, integer if (a = 0) then
/ = not equal bit, vector, integer if (a /= 0) then
< less than bit, vector, integer if (a < 2) then
<= less or equal bit, vector, integer if (a <= 2) then
> greater than bit, vector, integer if (a > 2) then
>= greater or equal bit, vector, integer if (a >= 2) then
+ addition bit, vector, integer a := b + ”0100”
− subtraction bit, vector, integer a := b - ”0100”
∗ multiplication bit, vector, integer a := b * c
∗∗ exponent (shift) bit, vector, integer a := 2 ** b
& concatenation bit, vector, string a := ”01” & ”10”
— 29 — Lund Institute of Technology
ElectroScience — 30 —
Chapter 2
Design Methodology
This chapter proposes an HDL Design Methodology (DM) that is based on uti-
lizing certain properties of the VHDL language. After having gotten acquainted
with VHDL, it seems that the language contains many useful properties. As a re-
sult, an ad-hoq coding style often arises, resulting not only in non-synthesizable
code but also in many concurrent processes, signals and different source file dis-
positions. As the design grows, i.e., the actual length and number of source files,
together with the time spent on debugging, the need for an effective DM be-
comes evident. This is especially important when several designers are involved
in the same project. In the subsequent sections, a DM is proposed, based on the
methodology proposed by Jiri Gaisler [3], together with some additional guide-
lines. The proposed DM main objectives are to improve the following important
properties of HDL design namely:
• Increase readability.
• Simplify debugging.
• Simplify maintenance.
• Decrease design time.
The readability can be increased by exploring several important properties of
HDL design and VHDL, e.g., hierarchy, i.e., abstraction level; not use more than
two processes per entity, i.e., one combinatorial and one sequential; the way a
Finite State Machine (FSM) is written, using case statements; etc. As a result,
the source code will have a uniform disposition simplifying code orientation and
the way to address a certain design problem.
As the design grows, the time spent on functionality debugging becomes
a larger part of the design time. Using a proper DM will certainly help to
minimize the time spent on debugging.
An important issue after having released an IP-block is proper maintenance
and support, which might lead to an advantage over competitive designs. An
effective DM can simplify these procedures, since code orientation and single
block functionality verification becomes easier.
— 31 — Lund Institute of Technology
Chapter 2. Design Methodology
2.1 Structured VHDL programming
Structured VHDL programming involves specific properties of the VHDL lan-
guage as well as addressing design problems in a certain way. This section will
propose some simple guidelines to achieve the goals mentioned in the previous
section. The guidelines are summarized below:
• Use a uniform source code disposition, e.g., two processes per entity.
• Utilize records.
• Use synchronous reset and clocking.
• Explore hierarchy.
• Use local variables.
• Utilize sub-programs.
The proposed DM is based on these properties and they have to be utilized
throughout the complete design process. At first glance, adopting all paragraphs
may seem excessive, but as the skill of the designer increases, the individual ben-
efits will become evident. Furthermore, the presented guidelines are applicable
to any synchronous single-clock design, but many will, with some modification,
work for all types of designs, e.g., Globally Asynchronous Locally Synchronous
(GALS). Furthermore, the guidelines should be seen as recommendations and
following them will most certainly result in synthesizable and well structured
designs. The benefits of each paragraph will be explained in detail in the sub-
sequent sections.
2.1.1 Source code disposition
Limiting the number of processes to two per entity improves readability and
results in a uniform coding style. One process contains the sequential logic
(synchronous), i.e., registers, and the other contains the combinatorial part
(asynchronous), i.e., behavioral logic. The sequential part always looks the
same but the combinatorial part looks different in every source file due to the
functionality of the block. Due to this fact, the sequential part is always put
after the combinational part. A typical source code disposition is compiled
below:
1. Included library(s).
2. Entity declaration(s).
3. Constant declaration(s).
4. Signal declaration(s).
5. Architecture declaration.
6. Component instantiation(s).
7. Combinatorial process declaration.
ElectroScience — 32 —
2.1. Structured VHDL programming
8. Sequential process declaration .
The source file disposition should always have this given order, simplifying
readability, code orientation, and maintenance.
2.1.2 Records
A typical VHDL design consists of several hundreds of signal declarations. Each
signal has to be added on several places in the code, i.e., entity, component dec-
laration, sensitivity list, component instantiation, which obviously can become
quite hard to keep track of as the number of lines of code increase. A simple way
of avoiding this is to use records since once the signal is included in a record; all
these updates are done automatically. This is due to the fact that the record is
already included in the necessary places and the actual signal is hidden within
the record.
A record in VHDL is analogous to a struct in C and is a collection of
various type declarations, e.g., standard logic vectors, integers, other records,
etc. The declarations should not contain any direction, i.e., in or out. Below is
an example how a simple record is defined VHDL:
type my_type is record
A : std_logic_vector(WIDTH-1 downto 0);
B : integer;
Z : record_type;
end record;
The record declarations should be compiled in a declaration package that
can be imported to the various files, thus enabling reuse. Using records not
only reduces many lines of code but also simplifies signal assignment since none
of the default signal assignments are forgotten, which otherwise can lead to
unnecessary latches in the synthesized design.
2.1.3 Clock and reset signal
The clock signal solely controls the sequential part, i.e., updating the registers,
of the source file and should be left out of any record declaration. This is mainly
because the clock signal is treated differently by the synthesis tools, since it is
routed from an input pad throughout the complete design. Also, many clock
tree generation and timing analysis tools do not support clock signals that are
part of a bus. The sequential part always looks the same and should only have
the clock signal in the sensitivity list, see example below:
sequential_part : process(clk)
begin
if rising_edge(clk)
r <= rin;
end if;
end process;
There are two methods of implementing the reset signal, namely synchronous
and asynchronous. There are also two places to put the reset assignment in the
— 33 — Lund Institute of Technology
Chapter 2. Design Methodology
Level 1 Level 2
register operation
mux operation
shift operation
alu operation
alu (top level)
Figure 2.1: An example of how hierarchy can be used in the ALU example.
source code, i.e., in the sequential or combinational part. Choosing between the
two methods and deciding where to put the assignment seems to be a religious
matter and many of the arguments for choosing either of them are obsolete,
since most of the commercial synthesis tools now support them. In the proposed
DM, synchronous reset assigned in the combinational part is utilized, and for
the same reasons as for the clock signal; the reset signal should not be included
in any kind of record declaration
1
. The reset assignment is put in the latter part
of the combinational block, just before the current state record is updated, and
should be triggered on a logic zero, which is a commonly used convention. The
following example illustrates where the synchronous reset assignment is placed
in the source file:
combinational_part : process (various signals and records, reset)
begin
...
if (reset = ’0’) then
v.state := IDLE;
...
end if;
...
rin <= v;
end process;
2.1.4 Hierarchy
Hierarchy is an important concept and is general for all types of modeling since
altering the abstraction level tends to simplify a given problem. In VHDL,
hierarchy can be utilized to split a given problem or model into individual mod-
ules, designs or sub programs. Each program level is controlled by component
instantiation, enabling debugging of each block individually. Clearly, there is
no general methodology to explore hierarchy rather that it can and should be
utilized in every design. A block diagram of how hierarchy can be used in the
ALU example can be seen in Figure 2.1:
1
A synchronous reset signal can be added in a record since it is then treated as any other
signal but it is not advisory since it does not support a uniform coding style.
ElectroScience — 34 —
2.1. Structured VHDL programming
2.1.5 Local variables
By utilizing local variables instead of signals in the combinational part, a given
design problem is transformed from concurrent to sequential programming, i.e.,
each line of code is executed in consecutive order during simulation. This is
advantageous since humans find it easier to think in sequential terms rather
than in parallel. Using local variables will also increase simulation speed since
simulation is often event triggered and the tool can execute events in order of
appearance and does not have to consider parallel events. Basically, there is only
one restriction when using local variables namely that a computational result
stored in a variable should not be reused within the same clock cycle but rather
stored in a register to be used in the succeeding clock cycle. This is mainly to
avoid long logic paths and combinational feedback loops. Below is an example
of how to use local variables in the ALU example:
ALU_example : process("additional signals and records", r, reset)
variable v : reg_type;
begin
v := r;
...
case r.state is
when IDLE => v.state := CALC;
...
end case;
...
rin <= v;
end;
Initially, the local variable v is set to the current values of the registers stored
in r. Then, the local variable v is used throughout the combinational part in a
sequential programming fashion. Finally, the input signal to the registers rin is
updated to the computed values of the local variable v. To see the complete code
of the ALU example, please refer to Section 2.2. Variable and signal assignments
should be placed exactly as in the example above.
2.1.6 Subprograms
Using sub-programs is often a good way of compiling commonly used func-
tions and algorithms, thus increasing readability. Subprograms can be used to
increase the hierarchy and abstraction level, hiding complexity and enabling
reuse. The subprograms are defined in packages and imported into the different
source files. A restriction in the use of subprograms is that functions should only
contain purely combinatorial logic. Below is an example of how subprograms
can be used in the ALU example:
-- package declaration
function shift_operation (A : in data_type, m : in data_type,
control : in control_type)
return data_type;
-- package body
— 35 — Lund Institute of Technology
Chapter 2. Design Methodology
function shift_operation(A : in data_type , m : in data_type,
control : in control_type)
return data_type is
begin
case control is
when "000" => return A;
when "001" => return A + m;
when "010" => return A - m;
when "011" => return m;
when "100" => return (others => ’0’);
when "101" => return A and m;
when "110" => return A or m;
when "111" => return A xor m;
when others => return null;
end case;
end function shift_operation;
Each subprogram can be debugged individually with a separate testbench
to ensure correct functionality of the block.
2.1.7 Summary
The previous sections have proposed guidelines to establish a sound DM to
be used throughout the whole design process. As a result, a uniform coding
style can be established to simplify and increase readability, debugging, and
maintenance. There are also several other benefits of adopting this DM where
most important are summarized in the table below:
• Less lines of code.
• Altering the abstraction level, hiding complexity, and enabling reuse.
• Parallel coding is transformed into sequential.
• Improved simulation and synthesis speed.
Finally, adopting proposed guidelines hopefully results in synthesizable and
well disposed designs that will help the design team on the way towards the
final goal, which is a robust design using a minimal design time.
ElectroScience — 36 —
2.2. Example
2.2 Example
The same example as in the previous chapter is now presented, using the pro-
posed design methodology. The ALU code is divided into a global package and
the actual implementation. The package contains constants and type declara-
tions to improve the readability and portability, and can be found below. The
rest of the ALU implementation is presented in Section 2.3.5.
library IEEE;
use IEEE.std_logic_1164.all; -- for std_logic
use IEEE.std_logic_signed.all; -- for signed add/sub
package alu_pkg is
constant WL : integer := 8;
constant ALU_STORE : std_logic_vector(2 downto 0) := "000";
constant ALU_ADD : std_logic_vector(2 downto 0) := "001";
constant ALU_SUB : std_logic_vector(2 downto 0) := "010";
constant ALU_FB : std_logic_vector(2 downto 0) := "011";
constant ALU_CLEAR : std_logic_vector(2 downto 0) := "100";
constant ALU_AND : std_logic_vector(2 downto 0) := "101";
constant ALU_OR : std_logic_vector(2 downto 0) := "110";
constant ALU_XOR : std_logic_vector(2 downto 0) := "111";
constant SHIFT_NONE : std_logic_vector(1 downto 0) := "00";
constant SHIFT_RIGHT : std_logic_vector(1 downto 0) := "01";
constant SHIFT_2LEFT : std_logic_vector(1 downto 0) := "10";
constant SHIFT_1LEFT : std_logic_vector(1 downto 0) := "11";
subtype data_type is std_logic_vector(WL-1 downto 0);
type control_type is record
oen : std_logic;
fb : std_logic;
mux : std_logic;
shift : std_logic_vector(1 downto 0);
alu_op : std_logic_vector(2 downto 0);
end record;
component alu
port ( clk : in std_logic;
A : in data_type;
B : in data_type;
control : in control_type;
Q : out data_type );
end component;
end;
2.3 Technology independence
During code development, the target technology may not yet be known or
changed at a later stage. An important design aspect is therefore to keep the
code free from parts that are technology specific. All parts that are not standard
cells and pure logic, for example on-chip memories and the pads connecting
the design to the outside world, are technology dependent.
Memories and pads are instantiated in the VHDL code by name. At the same
time, avoid hard coding the memory name or pad name in the design. A simple
solution is to use abstraction. Start with creating a mapping file, which selects
the proper component based on the user constraints and the chosen technology.
For example, when creating a memory, specify only the number of data bits and
address bits in the code. Then let the mapping file choose the correct memory
— 37 — Lund Institute of Technology
Chapter 2. Design Methodology
pads
logic
memory
Figure 2.2: A chip layout with standard cells (logic), memories and pads.
based on the current technology and by using your specifications about the
address and data bus.
Design.vhd
mem: gen_ram
generic map ( 6, 16 )
port map ( ... );
t ech_map.vhd
if (t arget _t ech = VIRTEX) t hen
mem: virt ex_ram ...
end;
if (t arget _t ech = UMC13) t hen
mem: umc13_ram ...
end;
t ech_virt ex.vhd
Inst ant iat e and connect
Virt ex memory by name
t ech_umc13.vhd
Inst ant iat e and connect
UMC .13um memory by name
t arget _t ech
Figure 2.3: Using a mapping file to handle target specific modules.
In your design, instantiate a memory from the mapping file as:
use work.tech_map.all;
mem: gen_ram
generic map ( ADDR_BITS => 5,
DATA_BITS => 16 );
port map ( clk, addr, d, q );
The instantiation will point to the technology mapping file, instead of the actual
memory, which selects and connects the chosen technology dependent module.
The mapping file contains the following information:
use work.tech_virtex.all; -- one library for
use work.tech_umc13.all; -- each technology
type target_type is ( VIRTEX, UMC13 ); -- available technologies
constant target_tech : target_type := VIRTEX; -- set technology here
entity gen_ram
generic map ( ADDR_BITS : integer;
DATA_BITS : integer );
port map ( clk : std_logic;
addr : std_logic_vector(ADDR_BITS-1 downto 0);
d : std_logic_vector(DATA_BITS-1 downto 0);
q : std_logic_vector(DATA_BITS-1 downto 0) );
end;
architecture structural of gen_ram is
ElectroScience — 38 —
2.3. Technology independence
begin
if (target_tech = VIRTEX) generate -- VIRTEX memory
mem: virtex_ram
generic map ( ADDR_BITS, DATA_BITS );
port map ( clk, addr, d, q );
end generate;
if (target_tech = UMC13) generate -- UMC ASIC memory
mem: umc13_ram
generic map ( ADDR_BITS, DATA_BITS );
port map ( clk, addr, d, q );
end generate;
end;
2.3.1 ASIC Memories
ASIC memories must be created using a memory generator or by contacting the
vendor. The ideal memory generator creates a VHDL simulation model, entity
and architecture, and a layout file specifying the size and connections of the
memory. The memory is instantiated from the VHDL code by using exactly the
same name and port connections.
Physical size
Verilog
net list
ent it y
GDSII
comp
phys
MEM
GEN
LEF
DB
VHDL
SYN
SIM
PaR
Figure 2.4: The memory generator creates simulation files, library files and
physical description files.
2.3.2 ASIC Pads
For smaller projects, pads are normally inserted during synthesis. The reason
is simple: it is less work. For ASIC technologies, pad insertion is only a couple
of lines in the synthesis script. For Synopsys, use:
dc_shell> set_port_is_pad all_inputs()
dc_shell> set_port_is_pad all_outputs()
dc_shell> set_pad_type -exact IBUF all_inputs();
dc_shell> set_pad_type -exact B2CR all_outputs();
dc_shell> insert_pads -respect_hierarchy
The only thing you have to select is the pad name from your target library,
in this case IBUF for input ports and B2CR for output ports, names that are
technology specific. The number associated with the output pad is usually the
driving strength in mA. However, for larger designs, several different pad types
are usually required. It could be pads with different driving strength, Schmitt-
trigger, open-drains or bi-directional pads. Assigning pads individually and at
— 39 — Lund Institute of Technology
Chapter 2. Design Methodology
the same time creating multiple version, one for each technology, will be difficult.
Adding one pad requires changing the synthesis script for all technologies.
The solution is to apply the same technique as described for memories; create
generic pads in a mapping file, and point to the current technology. Adding a
new pad will only require a pad instantiation at the top level of your design,
and the correct pads will be chosen from the mapping file.
2.3.3 FPGA modules
For FPGAs, memories and pads are only two examples of technology specific
components. Xilinx has created a tool for automatically creating small cores
that can be used in the design. The tool is called Core Generator and creates
modules (.edn) that can be inferred during the place and route step. The mod-
ules are instantiated by name (which will create a black box during synthesis),
replacing the black box when reading the .edf file. For example, core generator
can create memories and multipliers of any size, combining the resources in the
FPGA. Usually the on-chip memories are small, but can be grouped to form a
larger memory.
Component descript ion
.EDF file
Bit st ream
.UCF
user const raint file
GEN
CORE
EDN
VHDL
SYN
SIM
ISE
Figure 2.5: Core Generator can create all kinds of modules, instantiated by
name in the VHDL code.
ElectroScience — 40 —
2.3. Technology independence
Figure 2.6: Core Generator main window. Generated cores can be instantiated
in the VHDL code.
2.3.4 FPGA Pads
Memory and pad insertion is more simple for FPGAs than for ASIC. The reason
is that the synthesis tool for FPGA already knows what kinds of memories and
pads that are available for a certain FPGA. The synthesis tool can therefore
handle a lot of things automatically.
— 41 — Lund Institute of Technology
Chapter 2. Design Methodology
2.3.5 ALU unit
The ALU unit imports the package on the previous page and uses the type
and constant declarations. The ALU is implemented using the two process
methodology, separating the sequential and combinatorial logic. All sequential
elements are declared in a record type.
use work.alu_pkg.all; -- use ALU types
entity alu is
port ( clk : in std_logic;
A, B : in data_type;
control : in control_type;
Q : out data_type );
end;
architecture behavioural of alu is
type reg_type is record -- all registers
output : data_type;
feedback : data_type;
end record;
signal r, rin : reg_type;
begin
process(r, A, B, control)
variable m, x, s : data_type;
variable v : reg_type;
begin
v := r;
if control.mux = ’1’ then -- select input or feedback
m := B;
else
m := r.feedback;
end if;
case control.alu_op is -- arithmetic operation
when ALU_STORE => x := A;
when ALU_ADD => x := A + m;
when ALU_SUB => x := A - m;
...
when others => null;
end case;
case control.shift is -- shift operation
when SHIFT_NONE => s := x;
when SHIFT_RIGHT => s := x(WL-1) & x(WL-1 downto 1);
when SHIFT_2LEFT => s := x(WL-2 downto 0) & ’0’;
when SHIFT_1LEFT => s := x(WL-3 downto 0) & "00";
when others => null;
end case;
if (control.fb = ’1’) then -- feedback enable
v.feedback := s;
end if;
if (control.oen = ’1’) then -- output enable
v.output := s;
end if;
Q <= r.output; -- drive output from register
rin <= v;
end process;
process(clk) -- update registers
begin
if rising_edge(clk) then
r <= rin;
end if;
end process;
end;
ElectroScience — 42 —
Chapter 3
Arithmetic
3.1 Introduction
In the original meaning, arithmetic describes the four fundamental rules of math-
ematics: addition, subtraction, multiplication and division. In computer tech-
nology, a device that performs these operations is called Arithmetic and Logical
Unit (ALU). In addition to the four fundamental rules of arithmetic, an ALU
can perform several other operations, e.g. shifting and logical operations.
Arithmetic and logical operations are of course very important to understand
when designing hardware, as we want an efficient hardware structure as possible
to perform the operation. Different implementations of a particular operation
can be distinguished in terms of area, performance or power.
To understand the implementation of arithmetic operations we must first
consider and distinguish two important concepts: the operation and the data
type. When it comes to hardware design with VHDL, the data type and the
operation is specified in the VHDL language. The description in this chapter
will concentrate on the VHDL data types based on an array of std logic and
the operations that can be performed on this specific data type. Although the
VHDL language specify both the data type and the operations to be performed
on this type, it does not say anything about how the operations are to be im-
plemented as a hardware structure. This is decided in the step which hardware
designers refer to as synthesis, which in this chapter will be discussed in the
context of the Synopsys synthesis tool, Design Compiler or DC for short.
Hence, this chapter addresses the implementation of arithmetic operations in a
digital ASIC.
3.2 VHDL Packages
Operations and data types are specified in the VHDL language in packages and
are made accessible by use of a library clause at the beginning of the VHDL file.
-- Library clause in VHDL
library IEEE;
use IEEE.std_logic_1164.all;
— 43 — Lund Institute of Technology
Chapter 3. Arithmetic
As there may be several definitions of the same operator in a standard pack-
age, the actual operator selected, by the compiler, is controlled by operator
overloading. Operator overloading means that the compiler checks input and
output data types, to find the correct operation from the list of all operations
currently accessible through the library clauses in the file.
The most important and widely used standard packages when it comes to
ASIC design, simulation and synthesis, are:
• std logic 1164
• std logic arith
• std logic signed
• std logic unsigned
The standard package called std logic 1164 is an IEEE standard and the
other three are Synopsys de facto industry standards. We will briefly explain
the content of the packages which also are found in Appendix A.
3.2.1 Data types
The basic data type from which the data types, used for writing synthesizable
VHDL code, originate is std ulogic. The type describes a 9-level information
carrier, see Table 3.1. The information carriers are ’U’ (uninitialized), ’X’ (forc-
ing unknown), ’0’ (forcing 0), ’1’ (forcing 1), ’Z’ (high impedance), ’W’ (weak
unknown), ’L’ (weak 0), ’H’ (weak 1) and ’-’ (don’t care). These are the values
you can find on a signal when simulating the design in Modelsim.
The data type std logic is a resolved std ulogic, which means that if
several processes assign the same signal, there will always be a defined value of
that signal. This is an important concept in hardware design, to describe shared
busses. When a conflict occurs in a resolved signal, a resolution function will
be automatically invoked to solve the conflict and give the signal a well defined
value.
When writing synthesizable VHDL code, the type std logic and the types
based on an array of std logic should be used exclusively. In std logic 1164,
the important types std logic and std logic vector are defined. The stan-
dard package std logic arith defines two additional types: signed and unsigned.
These types have definitions identical to the type std logic vector. Table 3.1
summarizes the data types and the standard package from which they originate.
Table 3.1: Important VHDL types
Type Description Package
std ulogic ’U’, ’X’, ’0’, ’1’, ’Z’, ’W’, ’L’, ’H’, ’-’ std logic 1164
std logic Resolved std ulogic std logic 1164
std logic vector Array of std logic std logic 1164
unsigned Array of std logic std logic arith
signed Array of std logic std logic arith
integer −(2
31
− 1) to (2
31
− 1) standard
ElectroScience — 44 —
3.2. VHDL Packages
3.2.2 Operations
It should be familiar to the reader that a number can be coded into a binary rep-
resentation in different ways. A binary number representation must be specified
to perform a predictable operation on the type. Since hardware for unsigned
arithmetic is less complex than hardware for signed arithmetic in a given num-
ber range, these cases have been separated in the standard packages. Signed
operations are done with two’s complement number representation and unsigned
operations with natural binary number representation. In the VHDL language,
there are two ways of controlling if signed or unsigned operations should be
performed.
If std logic vector is used in the design, the operation is controlled in the
library clause by including the package std logic signed or std logic unsigned
to select signed or unsigned operations, respectively. Hence, the type is just a
container for either unsigned or signed numbers. An obvious drawback is that it
is not straightforward to mix signed and unsigned operators in the same VHDL
file.
-- Library clause for unsigned operations with std_logic_vector
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;
-- Library clause for signed operations with std_logic_vector
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_signed.all;
The other method is to use the types signed and unsigned and hence ex-
plicitly, in the type definition, specify the contents and thereby the operation.
The operators for signed and unsigned data types are all specified in the
std logic arith package. Whether a signed or an unsigned operator are se-
lected is then controlled by operator overloading.
-- Library clause for signed or unsigned operations with signed or
-- unsigned data types
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
As already stated, several functions for the same operation are defined in each
specific package. Table 3.2 summarizes the arithmetic operations in the pack-
ages. For details about operator overloading and all defined functions, we refer
the reader to Appendix A where all packages described are found.
3.2.3 VHDL examples
Here is a VHDL code example where std logic vector is used as the data type
in a multiplication. The std logic vector is synonymous to an unsigned by
including the std logic unsigned package.
library IEEE;
— 45 — Lund Institute of Technology
Chapter 3. Arithmetic
Table 3.2: Arithmetic operations in packages
Operation Description Arith Signed Unsigned
ABS Absolute value Yes Yes No
+ Unary addition Yes Yes Yes
− Negation Yes Yes No
+ Addition Yes Yes Yes
− Substraction Yes Yes Yes
∗ Multiplication Yes Yes Yes
< Less than Yes Yes Yes
<= Less than or equal to Yes Yes Yes
> Greater than Yes Yes Yes
>= Greater than or equal to Yes Yes Yes
= Equal to Yes Yes Yes
/ = Not equal to Yes Yes Yes
SHL Left shift Yes Yes Yes
SHR Right shift Yes Yes Yes
EXT Zero extension Yes No No
SXT Sign extension Yes No No
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;
entity multiplication is
generic(wordlength: integer);
port(in1, in2 : in std_logic_vector(wordlength-1 downto 0);
product : out std_logic_vector(2*wordlength-1 downto 0));
end multiplication;
architecture rtl of multiplication is
begin
process(in1, in2)
begin
product <= in1 * in2;
end process;
end;
Here is a VHDL code example where std logic vector is used as the data
type in a multiplication. The std logic vector is synonymous to a signed by
including the std logic signed package.
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_signed.all;
entity multiplication is
generic(wordlength: integer);
ElectroScience — 46 —
3.2. VHDL Packages
port(in1, in2 : in std_logic_vector(wordlength-1 downto 0);
product : out std_logic_vector(2*wordlength-1 downto 0));
end multiplication;
architecture rtl of multiplication is
begin
process(in1, in2)
begin
product <= in1 * in2;
end process;
end;
Here is a VHDL code example where signed and unsigned are used as the
data types in a multiplication. Only the std logic arith package are included.
Hence, no arithmetic operations can be performed directly on std logic vector.
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
entity multiplication is
generic(wordlength: integer);
port(in1, in2 : in std_logic_vector(wordlength-1 downto 0);
product_u : out std_logic_vector(2*wordlength-1 downto 0);
product_s : out std_logic_vector(2*wordlength-1 downto 0));
end multiplication;
architecture rtl of multiplication is
begin
process(in1, in2)
variable product_u : unsigned(2*wordlength-1 downto 0);
variable product_s : signed(2*wordlength-1 downto 0);
begin
-- Unsigned multiplication
product_u := unsigned(in1) * unsigned(in2);
-- Signed multiplication
product_s := signed(in1) * signed(in2);
end process;
product_u <= std_logic_vector(product_u);
product_s <= std_logic_vector(product_s);
end;
— 47 — Lund Institute of Technology
Chapter 3. Arithmetic
3.3 DesignWare
In a VHDL description, the operations are specified, but it is not decided how
these operations are to be implemented, or mapped, to hardware components.
That is decided in the synthesis step. The synthesis tool comes with pre-built
components that implement the built-in VHDL operators. These pre-built com-
ponents are technology independent and several components for implementing
the same operator may exist. Since, the different implementations have differ-
ent properties when it comes to power, area, and delay, the designer provides
power, area, or delay constraints to the synthesis tool, which then selects the
most suitable implementation.
3.3.1 Arithmetic Operations using DesignWare
In synposys synthesis tool Design Compiler the libraries with pre-built com-
ponents are called DesignWare libraries. In these libraries there are technology
independent implementations of the in-built VHDL operators as well as other
more specialized operations. Technology independent means that the imple-
mentations are described as a netlist of common logic elements, which could
later be mapped to standard cells in a specific cell library. The DesignWare li-
braries which will be loaded into DC and used during the synthesis process are
controlled in the file .synopsys dc.setup. You can list all libraries currently
loaded, using the dc shell command list -libraries. A DesignWare library
has the file extension .sldb.
dc_shell> list -libraries
Library File Path
------- ---- ----
dw01.sldb dw01.sldb /usr/local-tde/...
dw02.sldb dw02.sldb /usr/local-tde/...
standard.sldb standard.sldb /usr/local-tde/...
A DesignWare component can be included in a design using basically three
different methods.
Operator inference: In operator inference, which is the most commonly used
method, the synthesis tool automatically maps operators in the VHDL
code to components in the included DesignWare libraries. This is done in a
hierarchy of abstractions. The VHDL operator is associated to a Synthetic
operator, which is bound to Synthetic modules. Each Synthetic module
can have several implementations. The operator inference is performed
by the synthesis tool during the Analyze and Elaborate phase, during
which the VHDL operator are mapped to a Synthetic operator. Next, the
designer sets the design constraints and compiles. During this phase the
synthetic operator is mapped to a synthetic module and a implementation
is selected. Table 3.3 list the most commonly used synthetic modules
and implementations used for operator inference. A complete list of the
DesignWare components can be found by typing sold at the UNIX prompt,
select DesignWare Library and then select DesignWare Building
Block IP. Figure 3.3.1 illustrates the flow for operator inference using
the addition operator.
ElectroScience — 48 —
3.3. DesignWare
DW01_add
(rpl)
Synthetic operator
sum = in1 + in2
Analyse and Elaborate
Constrain and Compile
in1
in2
sum
DW01_add
(cla)
Figure 3.1: The figure illustrates the steps and abstraction levels used for oper-
ator inference of the addition operator
Function inference Including the DesignWare component through function
inference is done by first making the function available though a library
clause in the VHDL file, and then call the function inside the architecture
body.
Component instantiation Including a DesignWare component through com-
ponent instantiation is done similarly as including a component through
function inference. The component is made accessible through a library
clause, an entity declaration of the component is included and the archi-
tecture body contains a port map.
A complete description on how to include a specific DesignWare component
through function inference or component instantiation is found in DesignWare
Building Block IP under the specific component. Normally, when including
a DesignWare component through one of the above mentioned methods, the de-
signer selects only the synthetic module and leaves selection of implementation
to the synthesis tool. However, it is possible to control also the implementation.
This can be done in two different ways. One method is to use component in-
stantiation or function inference and add VHDL code that specifies also the im-
plementation. The other method, which makes the VHDL code implementation
— 49 — Lund Institute of Technology
Chapter 3. Arithmetic
Table 3.3: Basic arithmetic synthetic modules and their implementations
Synthetic Module Operator Description Implementations
DW01 absval ABS Absolute Ripple (rpl)
value Carry Look-Ahead(cla)
Fast Carry Look-Ahead (clf)
DW01 add + Adder Ripple (cla)
Carry Look-Ahead (cla)
Fast Carry Look-Ahead (clf)
Brent-Kung architecture (bk)
Conditional Sum (csm)
DW01 addsub +, − Adder and Ripple (rpl)
subtractor Carry Look-Ahead (cla)
Fast Carry Look-Ahead (clf)
Brent-Kung architecture (bk)
Conditional Sum (csm)
DW01 cmp2 <, > 2-Function Ripple (rpl)
comparator Fast Carry Look-Ahead (clf)
Brent-Kung architecture (bk)
DW01 cmp6 <, >, 6-Function Ripple (rpl)
<=, >= comparator Fast Carry Look-Ahead (clf)
Brent-Kung architecture (bk)
DW01 dec − Decrementer Ripple (rpl)
Carry Look-Ahead(cla)
Fast Carry Look-Ahead (clf)
DW01 inc + Incrementer Ripple (rpl)
Carry Look-Ahead(cla)
Fast Carry Look-Ahead (clf)
DW01 incdec +, − Incrementer Ripple (rpl)
and Carry Look-Ahead(cla)
Decrementer Fast Carry Look-Ahead (clf)
DW01 add − Subtractor Ripple (cla)
Carry Look-Ahead (cla)
Fast Carry Look-Ahead (clf)
Brent-Kung architecture (bk)
Conditional Sum (csm)
DW01 mult ∗ Multiplier Carry Save Array (csa)
Non-Both-recoded Wallace (nbw)
Wallace Tree (wall)
ElectroScience — 50 —
3.3. DesignWare
independent, is to use operator inference, function inference or component in-
stantiation in the VHDL file. Then, for a specific cell, after the elaborate stage,
select the implementation using the dc shell command set implementation.
For example:
dc_shell> set_implementation DW01_add/rpl cell_name
selects the cell cell name to be implemented as a Ripple Carry adder (rpl) from
the synthetic module DW01 add. This method is preferable since it keeps the
VHDL code free from implementation specific details, which is then handled in
the Synthesis script. With some knowledge of the dc shell script commands, it
is for example possible to map all addition operators in a design to a specific
synthetic module and implementation.
The dc shell script command report resources can be used after the com-
pile step to check which synthetic modules and implementations the synthesis
tool has selected for the design.
3.3.2 Manual Selection of the Implementation in dc shell
The Synopsys Design Compiler automatically determines the hardware imple-
mentation to meet the design constraints speed, area and power in the given
order. However, if one wants to realize hardware that meets special needs, it
can be necessary to select the implementation manually.
Before the implementation type can be chosen the following steps are re-
quired a priori:
• analyze
• elaborate
• uniquify
• compile
Information on the used resources, e.g., number and type of adders, can be
obtained by
dc_shell > report_resources
The table below shows an example of how the structure in Figure 3.14 is imple-
mented. The chosen implementation is a ripple-carry structure for all adders.
If one wants to change the implementation of one of the adders the desired type
has to be determined manually, see Table 3.4 and needs to be assigned to the
target cell as
dc_shell > set_implementation cla add_37/plus/plus
The implementation of the targeted adder cell has now been changed to a carry-
lookahead structure,see the table below.
\label{sc:report_resources}
****************************************
Report : resources
Design: adder
Version: 2003.06-SP1-3
Date : Mon Jul 5 16:58:42
— 51 — Lund Institute of Technology
Chapter 3. Arithmetic
Table 3.4: Adder Synthesis Implementation
Implementation name Function Library
rpl Ripple-carry DWB
cla Carry-look-ahead DWB
clf Fast carry-look-ahead DWF
bk Brent-Kung architecture DWF
csm Conditional-Sum DWF
rpcs Ripple-carry-select DWF
clsa Carry-look-ahead-select DWF
csa Carry-select DWF
fastcla Fast-carry-look-ahead DWF
2004****************************************
Resource Sharing Report for design adder in file
/home/toll/jrs/Synopsys/vhdl/course/adder.vhd
===============================================================================
| | | | Contained | |
| Resource | Module | Parameters | Resources | Contained Operations |
===============================================================================
| r139 | DW01_add | width=12 | | add_37/plus/plus |
| r142 | DW01_add | width=12 | | add_37/plus/plus_1 |
| r145 | DW01_add | width=12 | | add_37/plus/plus_2 |
| r148 | DW01_add | width=12 | | add_37/plus/plus_3 |
===============================================================================
Implementation Report
=============================================================================
| | | Current | Set |
| Cell | Module | Implementation | Implementation |
=============================================================================
| add_37/plus/plus | DW01_add | rpl | cla |
| add_37/plus/plus_1 | DW01_add | rpl | |
| add_37/plus/plus_2 | DW01_add | rpl | |
| add_37/plus/plus_3 | DW01_add | rpl | |
=============================================================================
If all the implementations in one design needs to be set manually a script
can be executed in the dc shell
SELECT_DW_CELLS = true /* Choose DW implementations (rpl or
cla)*/
DW_ADD_IMPL = rpl
DW_SUB_IMPL = rpl
DW_CMP_IMPL = rpl
design_list = find (design, "*", -hierarchy) design_list =
design_list + current_design
if (SELECT_DW_CELLS == true) {
foreach (de, design_list) {
current_design de
plus_list = find (cell, "*plus*")
if (plus_list != {}) {
set_implementation DW01_add/ + DW_ADD_IMPL plus_list }
if (BF == sub) {
minus_list = find (cell, "*minus*")
if (minus_list != {}) {
set_implementation DW01_sub/ + DW_SUB_IMPL minus_list
}}
else {
cmp_list = find (cell, "*lt*")
if (cmp_list != {}) {
ElectroScience — 52 —
3.4. Addition and Subtraction
set_implementation DW01_cmp2/ + DW_CMP_IMPL cmp_list
}}}}
3.4 Addition and Subtraction
Addition and substraction are two of the most frequently used arithmetic op-
erations. They are used both in data paths as well as in controllers. Addition
and subtraction are closely related operations. A substraction can be performed
by negating the subtrahend and add the result to the minuend with a carry in.
Hence, the same basic architectures used for the implementations of adders are
found in the implementations of subtractors, see Table 3.3.
When a + or − sign is used in the VHDL code, either the DW01 add,
DW01 sub, DW01 inc, DW01 dec, DW01 incdec, or DW01 addsub synthetic
module will be instantiated through operator inference during compilation. The
same synthetic modules and implementations are used, independent of whether
unsigned or signed data types are used in the addition or subtraction. Al-
though there is no difference between an adder/subtractor for unsigned or signed
operands, there are indeed differences when it comes to exceptions, e.g., overflow
conditions or increasing the wordlength of the operands.
Figure 3.4 demonstrates signed and unsigned addition for a 3-bit number
representation. From this example we see that the same basic addition operation
can be used for both unsigned and signed numbers. The same number of bits
should be used in the result as used in the input operands. Hence, we see
that adding 011 with 101 generates the result 000. This is the correct result
for the signed addition, but it is an overflow for the unsigned addition. A
similar example for unsigned or signed substraction shows that the same basic
subtractor can be used for both signed and unsigned numbers.
8
3 3
+
011
+ (−3) + 5
code
2
unsigned signed
0 (1)000
101
101
010
+
111
+ (−3)
2
−1 7
+ 5
signed
−3
−4
0
1
2
3
0
7
6
5
4
3
2
1
−2
unsigned
111
110
101
100
011
010
001
000
−1
Figure 3.2: 3-bits unsigned and signed addition
3.4.1 Basic VHDL example
The wordlength of an addition or substraction specifies the number of bits used
in the operands. An unsigned number with wordlength w bits lies in the range
0 to 2
w
− 1 and a signed number with wordlength w lies in the range −2
w−1
to 2
w−1
− 1. Generally, the wordlength of the input operands and the result
should be the same in the VHDL code. The wordlength of an addition or
— 53 — Lund Institute of Technology
Chapter 3. Arithmetic
subtraction is one of the most important parameters used to control the final
area and performance of the implementation. To be able to do wordlength
optimizations, the designer must be aware of the dynamic range of the input
signals and the precisions needed in the operation.
The following VHDL code will instantiate the synthetic module DW01 add for
the addition operator and the synthetic module DW01 sub for the substraction
operator, through operator inference during compilation. The implementation
will be selected to meet the design constraints as well as possible. The VHDL
code used in the examples will not be complete when it comes to entity decla-
ration, architecture body, etc. Instead, for the sake of readability, we show only
the signal declarations and the actual operation.
-- Unsigned addition
in1, in2, sum : unsigned(wordlength-1 downto 0);
sum <= in1 + in2;
-- Signed addition
in1, in2, sum : signed(wordlength-1 downto 0);
sum <= in1 + in2;
-- Unsigned substraction
in1, in2, diff : unsigned(wordlength-1 downto 0);
diff <= in1 - in2;
-- Signed substraction
in1, in2, diff : signed(wordlength-1 downto 0);
diff <= in1 - in2;
The synthetic module DW01 addsub can perform both addition and substrac-
tion and will be inferred into the design if it is possible to do resource sharing
between an addition and subtraction.
3.4.2 Increasing wordlength
Addition or subtraction of two operands with the dynamic range −2
w−1
to
2
w−1
− 1, gives a result with the dynamic range −2
w
to 2
w
− 1, as seen in
Figure 3.4.2. Hence, to avoid overflow in the result, the wordlength of the
operands must be increased by one bit before the operation, as the result must
have the same wordlength as the largest wordlength of the input operands in
the VHDL code.
This is handled differently depending on, whether the operands are signed
or unsigned. An unsigned number should be extended with zeroes. A signed
operand should be sign extended, i.e. extend with the most significant bit of
the operand.
-- Unsigned addition with extended wordlength
in1, in2 : std_logic_vector(wordlength-1 downto 0);
sum : std_logic_vector(wordlength downto 0);
sum <= ext(in1, wordlength+1) + ext(in2, wordlength+1);
-- Signed addition with extended wordlength
ElectroScience — 54 —
3.4. Addition and Subtraction
sum
in2
in1
w
w
w+1
Figure 3.3: The dynamic range of the result of a two-operand addition or sub-
straction is increased by one bit compared to the dynamic range of the input
operands
in1, in2 : std_logic_vector(wordlength-1 downto 0);
sum : std_logic_vector(wordlength downto 0);
sum <= sxt(in1, wordlength+1) + sxt(in2, wordlength+1);
An important condition for addition or substraction in VHDL is that the
output wordlength must be equal to the largest wordlength of the operands. The
operand with shortest wordlength will be zero or sign extended, depending on
whether it is unsigned or signed, respectively. In the following VHDL example
in1 will be automatically sign extended to 16 bits.
-- Signed addition with implicit sign extension
in1 : signed(7 downto 0);
in2 : signed(15 downto 0);
sum : signed(15 downto 0);
sum <= in1 + in2 ;
3.4.3 Counters
Counters and addressing arithmetic are used extensively in controllers. A
counter is defined as an addition or subtraction where one of the operands
is a constant. In such operations the addition or substraction can be optimized
compared to the two-operand operation. In VHDL, a counter is best described
by using an integer to represent the constant.
-- Counter adding 5
input : std_logic_vector(wordlength-1 downto 0);
counter : std_logic_vector(wordlength-1 downto 0);
counter <= input + 5;
In the special case when the constant is equal to +1, we have a incrementer.
When the constant is equal to −1, a decrementer. The incremeter/decrementer
is the only case among counters that has a dedicated synthetic module in the
DesignWare library. An incrementer will be mapped to the synthetic module
DW01 inc and a decrementer to the synthetic module DW01 dec. If it is possible
to perform resource sharing between incrementation and decrementation, the
synthetic module DW01 incdec will be used.
The area and delay through a counter depends on the constant. For example,
a logic 0 at bit position i in the constant will prevent carry propagation to the
next stage in a ripple carry structure. Firstly this means that a half adder is
— 55 — Lund Institute of Technology
Chapter 3. Arithmetic
sufficient at bit position i + 1 and secondly that the critical path through the
carry chain will be reduced.
The beneficial architecture of the incrementer/decrementer in the Design-
Ware library may also be used, with some efforts, to make a counter with a
constant that is a power of two. The following VHDL example illustrates this
procedure with the constant equal to 8.
-- Counter adding 8
input : std_logic_vector(wordlength-1downto 0);
counter : std_logic_vector(wordlength-1 downto 0);
counter <= (input(wordlength-1 downto 3) + 1) & input(2 downto 0);
This example gives the area 205 area units compared to 224 area units, if the
operation where expressed writing +8.
3.4.4 Multioperand addition
Multioperand addition means addition with more than two input operands. To
perform this in a single cycle a binary tree of adders is used. Single cycle
multioperand addition are expressed in VHDL by using the + operator between
all operands. The following example shows a single cycle 4-operand addition.
-- Multioperand addition
in1, in2, in3, in4, sum :signed(wordlength-1 downto 0);
sum <= in1 + in2 + in3 + in4 ;
The code will instantiate 3 DW01 add synthetic modules during synthesis, see
Figure 3.4.4. To avoid overflow, the wordlength must be manually controlled
sum
in4
in3
in2
in1
Figure 3.4: Multioperand addition, adding four operands, implemented as a
binary adder tree
and selected by the designer. Implementation as a binary adder tree dictates
that the wordlength of the last addition will be w+log
2
M|, where w is the
wordlength of the input operands and M is the number of input operands.
Hence, for each depth in the binary adder tree, 1 bit is added to the wordlength
of the input operands.
Multioperand addition can also be performed using the specialized synthetic
module DW01 sum, which contains optimized architectures for multioperand ad-
dition. The module is inferred into the design through function inference or
component instantiation. There is no support for operator inference.
ElectroScience — 56 —
3.4. Addition and Subtraction
A recurrent problem closely connected to multioperand addition, is that of
calculating a sum of M operands using a single time-shared adder and a register
to store the partial results. The wordlength of the adder and the register should
then be selected as the wordlength of the input operands plus log
2
(M)| bits,
to avoid overflow.
3.4.5 Implementations
The design space for an operator is a multidimensional space. First, the wordlength
of the operator is crucial for the designer to decide. Secondly, there are several
possible implementations for each operator. Finally, for each implementation
there exists an area-delay-power space, depending on the mapping of the struc-
ture to a cell library. Such a multidimensional space is not easily illustrated.
To make some benchmarks which can be illustrated, we have used a 16-
bits adder (DW01 add) and a 16-bits subtractor (DW01 sub) and plotted delay
versus area for the different implementations in a 0.13 µm CMOS process, see
Figure 3.4.5 and Figure 3.4.5. The plots are generated by running DC sev-
eral times with different delay constraints. The point most far to the right in
each implementation is the smallest area possible to achieve with that specific
implementation.
With an area-delay plot for all different implementations we can find the
optimal area-delay curve, which spans through the different implementations.
This optimal area-delay curve is also plotted in the figures and the optimal
implementation for a given delay interval is printed.
0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
Delay (ns)
A
r
e
a

(
a
.
u
.
)
Optimal
rpl
cla
clf
bk
csm
rpcs
rpl
cla
clf
bk
csm
rpcs
bk cla rpcs rpl
Infeasible
region
Figure 3.5: Area-Delay plot for 16-bits DW01 add implementations
— 57 — Lund Institute of Technology
Chapter 3. Arithmetic
0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
400
600
800
1000
1200
1400
1600
1800
2000
2200
Delay (ns)
A
r
e
a

(
a
.
u
.
)
Optimal
rpl
cla
clf
bk
csm
rpcs
rpl
cla
clf
bk
csm
rpcs
bk cla rpcs rpl
Infeasible
region
Figure 3.6: Area-Delay plot for 16-bits DW01 sub implementations
3.5 Comparison operation
Comparison operators are used frequently in controlling the data flow in data
paths and selecting operands and outputs. The result of a comparison operation
is a boolean type (true or false), which often is used to control multiplexors in
the data path.
When it comes to comparison operation it is important to distinguish be-
tween signed and unsigned operators. For example, 100 < 011 if the operands
are signed and 100 > 011 if the operands are unsigned. In VHDL there are
six comparison operators: =, <, >, <=, >= and ` =. The only two operators
where it does not matter if the operators are unsigned or signed is equal to (=)
and not equal to (` =).
A comparison operation in the VHDL code will instantiate the synthetic
module DW01 cmp6 or DW01 cmp2 through operator inference during compilation.
Figure 3.5 shows the input and outputs to the synthetic module DW01 cmp6. If
Not equal to
DW01_cmp6
signed or unsigned
in2
in1
Greater than or equal to
Less than or equal to
Equal to
Greater than
Less than
Figure 3.7: Synthetic module for the six-function comparator DW01 cmp6
ElectroScience — 58 —
3.5. Comparison operation
not all the outputs of the synthetic module for comparison operations are used,
the unused logic will be removed. Thus, adding no extra area to the design.
3.5.1 Basic VHDL example
The most basic example is to compare operands of equal size and data type, as
shown in the following VHDL example.
-- Comparison
in1 : std_logic_vector(wordlength-1 downto 0);
in2 : std_logic_vector(wordlength-1 downto 0);
result : boolean;
result <= in1 < in2;
It is also possible to make a comparison between operands with different wordlengths,
which will result in sign or zero extension of the operand with smallest wordlength.
-- Comparison with implicit sign extension
in1 : signed(wordlength-1 downto 0);
in2 : signed(wordlength-9 downto 0);
result : boolean;
result <= in1 < in2;
Comparing two operands of different types will result in an implicit type con-
version. For example, the following VHDL code will increase the wordlength of
the comparison by one bit. Then sign extend in1 and zero extend in2.
-- Comparison between different types
in1 : signed(wordlength-1 downto 0);
in2 : unsigned(wordlength-1 downto 0);
result : boolean;
result <= in1 < in2;
The counter as a special case when building an adder/substractor structure
corresponds to doing a comparison with a constant, in the context of a com-
parator structure. Hence, comparing one input operand with a constant makes
it possible to reduce the area and delay of the operation. A constant is best
described in VHDL by using an integer.
As the output of the comparison operation is very often used as an input to
a multiplexor, we demonstrate such an example as well.
-- Comparison used to control a multiplexer
in1 : signed(wordlength-1 downto 0);
in2 : signed(wordlength-1 downto 0);
diff : signed(wordlength-1 downto 0);
if (in1 < in2) then
diff <= in2 - in1;
else
diff <= in1 - in2;
end if;
The code may be translated to the hardware blocks (or synthetic modules)
shown in Figure 3.5.1.
— 59 — Lund Institute of Technology
Chapter 3. Arithmetic
in1
in2
Less than
diff
DW01_cmp2 DW01_sub
mux
in1
in2
mux
in2
in1
Figure 3.8: The output from the comparator used as input to multiplexors, to
control the data flow
Table 3.5: Various Multiplier Synthesis Implementation
Synthetic Module Inferencing Description * Implementation
DW02 mult * Multiplier csa, nbw, wall
DW02 mult stage * 2, 3, 4, 5, 6 -stage Multiplier csa, nbw, wall
DW02 prod sum component instantiation Sum of products csa, nbw, wall
3.6 Multiplication
The implementation of a multiplier in digital hardware can be done in various
ways, see Table 3.5. Dependent on the design constraints, e.g. speed, area, and
power consumption, DC chooses the most suitable implementation to meet the
design constraints. However, if an operation such as a a product-sum multiplier
is desired the instantiation has to be done in the VHDL code.
A multiplier is basically a construct the adds the partial products of the
multiplicand and multiplier. The main differences between the different imple-
mentations is the Adder implementation. The multiplication of 2 two’s comple-
ment numbers is illustrated in Figure 3.9. The width of the required Adders is
N
a
+ N
x
− 1. The width of the adders can be reduced to N
x
by carrying out
the multiplication with right-shifts, see Figure 3.12, 3.13.
a
0
x
0
a
0
x
0
p
0
x
a
1
x
1
a
2
x
2
a
3
x
3
a
4
x
4
a
1
x
0
a
2
x
0
a
3
x
0
a
0
x
1
a
1
x
1
a
2
x
1
a
3
x
1
a
0
x
2
a
1
x
2
a
2
x
2
a
0
x
3
a
1
x
3
a
3
x
2
a
2
x
3
a
3
x
3
−a
4
x
3
−a
3
x
4
a
4
x
4
p
1
p
2
p
3
p
4
p
5
p
6
p
7
p
8
−a
4
x
0
−a
4
x
1
−a
4
x
2
−a
2
x
4
−a
1
x
4
−a
0
x
4
Figure 3.9: Sequential multiplication of 2 two’s complement numbers.
ElectroScience — 60 —
3.6. Multiplication
3.6.1 Multiplication by constants
For applications where multiplications with fixed coefficients are required, DC
optimizes the multiplication with smaller multipliers. Therefore, the multipliers
must be implemented as constants rather than having them stored in a ROM.
If fixed coefficients are read from a ROM, full-sized multipliers will be imple-
mented.
However, if the multiplier, c, is a power of 2 number, 2, 4, 8, . . . , DC replaces
the multiplier automatically by a bit-shift operation. Bit-shifts are hardwired
and do not require any gates. Therefore, it is recommended to use as many power
of 2 numbers as possible when designing hardware. Moreover, if c = 2
k
it might
be beneficial to express the multipliers by bit shifts and additions/subtractions.
A multiplication such as 126 x can be written as
128 x−2 x A VHDL example on how to write a multiplication with a constant
is given below:
architecture STRUCTURAL of mult126 is
signal input : std_logic_vector(N-1 downto 0);
signal hundred26 : std_logic_vector(N+9-1 downto 0);
begin
process(input)
begin
hundred26 <= conv_std_logic_vector(126,9)*input;
end process;
A VHDL example on how to implement a multiplication as a shift-and-add
operation is given below:
architecture STRUCTURAL of adder126 is
signal input : std_logic_vector(N-1 downto 0);
signal hundred28 : std_logic_vector(N+9-1 downto 0);
signal hundred26 : std_logic_vector(N+9-1 downto 0);
begin
process(input,hundred28)
begin
hundred28 <= conv_std_logic_vector(128,9)*input(0);
hundred26 <=hundred28 - conv_std_logic_vector(2,3)*input(0);
end process;
The area and delay can be reduced significantly by expressing a multipli-
cation by a bit-shift and add operations. Synthesis results show that such a
multiplication results in least area and delay if implemented as carry-lookahead,
see Figure 3.10. There are various articles on how the find the minimum number
of adders for the implementation of a multiplier. The more interested reader is
referred to [].
— 61 — Lund Institute of Technology
Chapter 3. Arithmetic
0.5 1 1.5 2
400
600
800
1000
1200
1400
1600
Delay (ns)
A
r
e
a

(
a
.
u
.
)
nbw
wall
0.5 1 1.5 2
400
600
800
1000
1200
1400
1600
Delay (ns)
A
r
e
a

(
a
.
u
.
)
rpl
cla
clf
bk
csm
rpcs
Figure 3.10: Area/Delay comparison of a implemented multiplication. The
graphs on the left panel refer to an implementation as 126 x and on the right
panel as 128 x − 2 x
3.6.2 Multiplication of signed numbers
A multiplication can be implemented as carry-save array (csa), Non-Booth-
recoded Wallace-tree (nbw) and Booth recoded Wallace-tree (wall). Synthesis
results show that the nbw implementation is superior in terms of area and
delay compared csa and wall, see Figure 3.11. The synthesis results of the
csa implementation is not shown as it can not compete with the csa and wall
implementation.
ElectroScience — 62 —
3.7. Datapath Manipulation
3.7 Datapath Manipulation
This section shows how the construction of the datapath can be determined by
writing appropriate VHDL code. A simple arithmetic instruction as an addition
can be written as:
z <= A + B + C + D;
The instruction is mapped to hardware as a balanced tree if possible in order
to minimize the delay. The same hardware construct can be obtained as follows:
z <= (A + B) + (C + D);
If information on the arrival time of the signal is available the delay can
be shortened by rearranging the order of the adders. Assuming that signal E
arrives late, the delay time can be shortened by using parentheses as
z <= (A + B + C + D) + E;
The parser is forced to place E latest in the adder chain. The signals within
the parentheses will be arranged as a balanced adder tree, see Figure 3.15.
1 1.5 2 2.5
1500
2000
2500
3000
3500
4000
Delay (ns)
A
r
e
a

(
a
.
u
.
)
nbw
wall
1.5 2 2.5 3 3.5
7500
8000
8500
9000
9500
10000
10500
11000
11500
12000
Delay (ns)
A
r
e
a

(
a
.
u
.
)
nbw
wall
Figure 3.11: Area/Delay comparison of multipliers. On the left-hand side a
8-bit multiplier and on the right-hand side a 16-bit multiplier.
— 63 — Lund Institute of Technology
Chapter 3. Arithmetic
x 1 0 1 0 1
0 0 0 0 0
1 0 1 0 1
1 1 0 1 0 1
0 0 0 0 0
1 1 1 0 1 0 1
1 0 1 0 1
1 1 0 0 1 0 0 1
0 0 0 0 0
1 1 1 0 0 1 0 0 1
0 1 0 1 1
1 0 1 0 1
1 1 0 1 0 1
1 0 0 1 0 0 1
1 1 0 0 1 0 0 1
1 0 1 0 1
0 0 1 1 1 1 0 0 1
p(0)
+x
0
a
2p(1)
+x
1
a
2p(2)
+x
2
a
2p(3)
+x
3
a
2p(4)
+(−x
4
a)
p(5)
1
1
1
1
Figure 3.12: Negative Right-Shift Multiplier.
1 0 1 0 1
0 1 0 1 1 x
0 0 0 0 0
1 0 1 0 1
1 0 1 0 1
1 1 0 1 0 1
1 0 1 0 1
0 1 1 1 1 1
1 0 1 1 1 1 1
0 0 0 0 0
1 0 1 1 1 1 1
1 1 0 1 1 1 1 1
1 0 1 0 1
1 0 0 0 0 1 1 1
1 1 0 0 0 0 1 1 1
0 0 0 0 0
1 1 0 0 0 0 1 1 1
1 1 1 0 0 0 0 1 1 1
p(0)
+x
0
a
2p(1)
+x
1
a
2p(2)
+x
2
a
2p(3)
+x
3
a
2p(4)
+x
4
a
2p(5)
p(5)
1
1
1
1
1
Figure 3.13: Positive Right-Shift Multiplier.
Z
D C B A
Figure 3.14: Balanced Adder Tree
ElectroScience — 64 —
3.7. Datapath Manipulation
Z
E
D C B A
Figure 3.15: Unbalanced Adder Tree
— 65 — Lund Institute of Technology
ElectroScience — 66 —
Chapter 4
Memories
One of the most important topics in digital ASIC design today is memories.
Memories occupy a lot of space and consume a lot of power and the situation
becomes even worse if off-chip memories are considered. Even though each new
generation of silicon processes can include more transistors on a single chip, it
is predicted that the percentage of memory on a chip will increase. In the road
map from Japanese system-LSI industry it is predicted that in 2014 more than
90% of a chip is covered with memory, compared to todays 50%.
With these numbers in mind, it is reasonable to spend some time to optimize
the memory system when designing a new ASIC. Each saved square millimeter
of silicon corresponds to a neat pile of money and to be able to use ASICs in
mobile or extremely fast applications, low power consumption is required. In
mobile applications to maximize battery life time and in fast applications to
avoid expensive cooling equipment. Another memory issue is that the larger a
memory is the slower and more power hungry it becomes. The consequence is,
that the designer have to split large memories into smaller once in order to be
more power efficient or to meet timing requirements. This will create a more
complex memory hierarchy and introduce more design criterias.
This section does not try to cover every aspect of memory design; instead
some examples on what can be done to optimize memory systems. These ex-
amples do not serve as the final truth but rather show the reader that memory
systems have to be tailor made for each specific application. The first example
is about minimizing the area of a Shift Register (SR) [4]. The second example
deals with a memory that is not always utilized 100% [5]. A cache system to
read from a off-chip memory in a image convolution application is presented in
the last example [6]. For more information on memory designs consult [7].
4.1 SR example
In this example shift register (SRs) are considered. SRs are commonly used
to delay or align data in data paths, for example, in a pipelined Fast Fourier
Transform (FFT) processor. Here, four different ways to implement a SR are
presented together with an estimate of the required silicon area.
Since one input is read and one output is written each clock cycle, the SR
could be implemented in at least four different ways, as shown in Figure 4.1:
— 67 — Lund Institute of Technology
Chapter 4. Memories
FF
FF
FF
FF
FF
(a)
RD WR
Dual-port
Memory
(b)
FF
MUX
FF
Single
Port
Mem.
Single
Port
Mem.
(c)
FF
MUX
FF
Single-port
Memory
(d)
Figure 4.1: Four ways to realize a SR. Flip-flop based (a), dual-port memory
(b), two single-port memories (c), and one single-port memory of twice width
(d).
a. Flip-flops connected in series: This is an easy solution since there is no
control logic needed. Additionally, flip-flops have a fast access time and
thus allow high operation speed. However, flip-flops are not optimized for
area.
b. One dual-port memory: A dual-port memory can perform both a read and
write operation each clock cycle and thus suits the requirements perfectly.
However, in order to perform simultaneous read/write additional logic is
required in each memory cell, compared to a single port memory.
c. Two single-port memories of half length and alternating read/write: The
input is gathered in blocks of two and written to the memories every
other clock cycle. In a similar manner, two words are read from the mem-
ories every other clock cycle. Single-port memories have smaller memory
cells than dual-port memories, but additional logic is required outside the
memories in order to gather and separate input and output.
d. One single-port memory of half length and double width, which reads
and writes every other clock cycle. This solution works in the same way
as the previous solution, but instead of storing the two inputs in separate
memories they are stored as one word in the same memory location. Thus,
the memory has to be twice as wide.
Figure 4.2 shows the estimated area for the different implementations of SRs
in a 0.35µm CMOS technology. There are no interconnections included in the
flip-flop area. All memory SRs include control logic and a 50µm power ring on
three sides of the block. In the figure it can be seen, that flip-flops is only the
best solution, from an area perspective, if the SR is less than approximately 400
ElectroScience — 68 —
4.2. Memory split example
0 1 2 3 4 5 6 7 8 9
0
0.5
1
1.5
2
2.5
FIFO size [Kbits]
A
r
e
a
m
m
2
Flip-flops
Dual-port memory
Single-port memories
Double width memory
Figure 4.2: Area versus SR size for different implementations.
bits and that one single port memory of double width is the most area efficient
solution for SRs larger than 400 bits. It can also be seen that memory size does
not increase linearly with the number of bits such as flip-flops do. This is due
to the decreased relative overhead from address logic and buffers, these parts
do not increase in size as fast as the cell matrix does. A rough number on the
overhead can be estimated if the memory areas in Figure 4.2 are extrapolated
down to a zero bit SR. Particularly, it can be seen that the overhead for a
dual-port memory is extremely large compared to single-port memories in this
process. This is due to extra transistors in each memory cell and double address
logic.
4.2 Memory split example
The target application in this example is a memory that continuously stores
data sequences of different lengths. The length of the sequence, N, can be any
power of two between 32 and 1024. The application will write N words of data
into the memory and then read the same data in a different order from the
memory, this operation will be performed continuously.
Table 4.1: Average current drawn for 128-1024 words memories. With V
dd
=3.3
V and a wordlength of 24 bits.
Size [words] 1024 512 256 128
IDD [µA/MHz] 516 463 436 74
Since the largest N in this design is 1024, the memory has to be 1024 words
long. If the memory is implemented as one block it will consume much power
even if only a subset of the addresses is used. Hence, to reduce power con-
sumption for cases when N is less than 1024, the memory is split into smaller
blocks. As a trade-off between power reduction and the difficulty to place and
— 69 — Lund Institute of Technology
Chapter 4. Memories
Table 4.2: Average current drawn for the memory bank for different sizes of N.
The values are given for V
dd
=3.3 V and a wordlength of 24 bits.
N 1024 512 256 128 64 32
IDD [µA/MHz] 359 255 74 74 74 74
Table 4.3: Average power consumption for different sizes of N. The values are
given for V
dd
=3.3 V and a clock frequency of 50 MHz.
N 1024 512 256 128 64 32
One memory [mW] 85 85 85 85 85 85
Memory bank [mW] 59 42 12 12 12 12
Power savings [%] 30 50 86 86 86 86
route many memories, the memory is split into four blocks of size 128, 128, 256,
and 512 words. Table 4.1 shows the average current that the memory draws.
The values are given for memories in a standard 0.35 µm CMOS process with
five metal layers, provided as macrocells. Note that the 128-words memory is
a low power memory and this was the largest low power memory available for
this process. The average current drawn for the memory bank for different sizes
of N is shown in Table 4.2. For example, if N = 512-points, both 128-words
memories and the 256-words memory is used. Each 128-words memory is used
a quarter of the time and the 256-words memory is used half the time. In this
case the average current is calculated as 436/2 + 74/4 + 74/4 = 255 µA/MHz.
With N smaller than 512-points, only the low power memories are used.
Table 4.3 shows a comparison of the power consumption between the one
memory and the memory bank solution. The values are given for a clock fre-
quency of 50 MHz and V
dd
=3.3 V. The last row shows the power savings in
percent. The power savings are between 30 and 86 percent. However, the draw-
back is that a larger chip is required, since the silicon area increases with the
number of memories due to the overhead in address logic and power supply.
4.3 Cache example
The target application is a custom image convolution processor. Two dimen-
sional image convolution is one of the fundamental processing elements among
many image applications and has the following mathematical form:
y(K
1
, K
2
) =

x(K
1
−m
1
, K
2
−m
2
)h(m
1
, m
2
), (4.1)
where x and h denote the image to be filtered and kernel functions, respectively.
K
1
and K
2
is the pixel index, i.e., this operation is performed once for each
output pixel. In this example x is 256256 pixels and h is 1515 pixels large.
As a consequence of the image size, the complete image has to be stored on
off-chip memory. During the convolving operations, each kernel position requires
1515 pixel values from off-chip memory directly. When this is performed
in real-time, a very high data throughput is needed. Furthermore, accessing
large off-chip memory is expensive in terms of power consumption and so is the
signaling due to the large capacitance of package pins and PCB wires.
ElectroScience — 70 —
4.3. Cache example
Table 4.4: memory hierarchy schemes and access counts for an image of 256256
(M0=off-chip 0.18µm, C0 and C1=0.35µm)
Scheme M0 C0 C1 energy cost
A: M0 13176900 790 mJ
B: M0→ C0 65536 13176900 56.6 mJ
C: M0→ C1 929280 13176900 68.9 mJ
D: M0→C0→C1 65536 929280 13176900 20.8 mJ
By the observation that each pixel value, except the one in the extreme
corners, is used in several calculations, a two level memory hierarchy was intro-
duced. Instead of accessing off-chip memories 225 times for each pixel operation,
14 full rows of pixel values and 15 values on the 15th row are filled into 15 on-
chip line memories before any convolution starts. After the first pixel operation,
for each successive pixel position, the value in the upper left is discarded and
a new value is read from the off-chip memory. As a result, pixel values from
the off-chip memories are read only once for the whole convolving process from
upper left to lower right. Thus the input data rates are reduced from 1515
accesses/pixel to only 1 read. In Figure 7.16 a three level memory hierarchy is
shown and the number of memory access required for Equation 4.1 when using
one, two, or three levels of the hierarchy.
2
N
2
M M M N ) 1 (
Image memory
off-chip
Kernel
memories
Line memories
Scheme Image Line Kernel
Image M
2
(N-M+1)
2
Image line
N
2
M
2
(N-M+1)
2
Image kernel MN(N-M+1) M
2
(N-M+1)
2
Image line kernel
N
2
MN(N-M+1) M
2
(N-M+1)
2
Figure 4.3: Three levels of memory hierarchy and the number of reads from
each level.
Using a two level memory structure can reduce both power consumptions
and I/O bandwidth requirements, one extra cache level is introduced to further
optimize for power consumptions during memory operations, as shown in Fig-
ure 7.16. Since accessing large memory consumes more power, 15 small cache
memories are added to the two level memory hierarchy. The small cache mem-
ories are composed of 15 separate memories to provide one column pixel values
— 71 — Lund Institute of Technology
Chapter 4. Memories
during each clock cycle. Instead of reading pixel values directly from line mem-
ories 15 times for each pixel operation, one column of pixel values are read to
cache memories first from the line memories for each new pixel operation ex-
cept for the first one. During each pixel operation, direct line memory access is
replaced by reading from the cache. As a result, reading pixel values from line
memories 15 times could be replaced by only once plus 15 times small cache
accesses. Assuming that the embedded memories are implemented in a 0.35µm
process CMOS, by the calculation method in [7], it is shown in Table 4.4, that
the power consumption for the total memory operation could be reduced over
2.5 times compared to that of a two level hierarchy.
In addition to the new cache memory, one extra address calculator is syn-
thesized to handle the cache address calculation. In order to simplify address
calculation, the depth for each cache is set to 16. This allows circular operations
on the cache memories. During new pixel operations each new column of data
is filled into the cache in a cyclic manner. However, the achieved low power so-
lution has to be traded for extra clock cycles introduced during the initial filling
of the third level cache memories. In the case of an image size of 256256, this
will contribute 61696 clock cycles in total. Compared with the total clock cycles
in the magnitude of 10
7
, such a side effect is negligible.
Cache
level 3
(15x16)
Processor
core
1
Processor
core
4
Processor
core
2
Processor
core
3
Line memories with pipelined registers
level 2
(15x256)
2
APU
1
APU
3
New value feeded during
each new pixel operation
Large off-chip
memories
level 1 (256x256)
Input
buffer
New column written
to cache for each
new pixel operation
Image processor
without controller
Data out - 4x24 bits
System bus 15x8 bits
8 bits
Cyclic column
storage
4 bits
address line
Control
signals from
controller
Unfilled memory
elements
Kernel moving one
pixel to the right
Figure 4.4: Controller Architecture with incremental circuitry
In fact, alternative schemes exists by different combinations of the three
memories in the hierarchy. All these possible solutions differ substantially in
memory accesses. In Table 4.4, different memory schemes are shown where M0,
C0, C1 denote off-chip memory, line memories and smaller caches, respectively.
Although the amount of data provided to the datapath remains the same for
all four schemes, the access counts to the large off-chip memories varies. For
the two and three level hierarchy structures, the counts to the large memory
M0 are reduced by nearly 225 times compared to that of the one level solution.
Between scheme B and D, the least access counts to both external memories and
line memories are identified in three level memory structure, but this is achieved
at the cost of extra clock cycles introduced during each pixel operation. Thus,
trade off should be made when different implementation techniques are used.
For the FPGAs where power consumption is considered less significant, two
level hierarchies prevails due to its lower clock cycle counts. While for ASIC
ElectroScience — 72 —
4.3. Cache example
solutions, three level hierarchy is more preferable as it results in a reasonable
power reduction.
— 73 — Lund Institute of Technology
ElectroScience — 74 —
Chapter 5
Synthesis
5.1 Basic concepts
Synthesis is the process of taking a design written in a hardware description
language, such as VHDL, and compiling it into a netlist of interconnected gates
which are selected from a user-provided library of various gates. The design
after synthesis is a gate-level design that can then be placed and routed onto
various IC technologies.
There are three types of hardware synthesis, namely, logic synthesis, which
maps gate level designs into a logic library, RTL synthesis, creating a gate level
netlist from RTL behavior, behavioral(high-level)synthesis, that creats a RTL
description from an algorithmic representation.
Such a process is generally automated by many commercial CAD tools. How-
ever, this tutorial will only cover the synthesis tool named Design Compiler by
Synopsys, which is widely used in industry and universities. The much sim-
plified version of the design flow is shown in Figure 5.1. The Design Compiler
product is the core of the Synopsys synthesis software products. It comprises
tools that synthesize your HDL designs into optimized technology-dependent,
gate-level designs. It supports a wide range of flat and hierarchical design styles
and can optimize both combinational and sequential designs for speed, area,
and power.
Design Compiler deals with both RTL and logic synthesis work. Its process of
synthesis can be described as translation plus logic optimization plus mapping.
This is illustrated in Figure 5.2.
Translation is the process of translating a design written in behavioral hdl
into a technology independent netlists of logic gates. In design compiler, this is
done by either the command read vhdl or analyze/elaborate. It performs
syntax checks, then builds the design using generic (GTECH) components.
After translation, the design in the form of netlists of generic components
needs to be mapped into the real logic gates (contained in cell libraries provided
by IC vendors). This process is accompanied by design optimization. Accord-
ing to various user specified constraints and optimization techniques, design
compiler maps and optimizes the design with a range of algorithms, eg., trans-
formation, logic flattening, design re-timing, and selecting alternative cells, etc.
Aiming to meet the design constraints, the process makes trade-offs between
— 75 — Lund Institute of Technology
Chapter 5. Synthesis
VHDL
Simulation
Correct Function?
Synthesis
Place &
Route
Layout
Modelsim
Design Compiler
Silicon Ensemble
N
Y
Back Annotate
Figure 5.1: Simplified ASIC Design flow
speed, area and power consumption. Upon completion, the unmapped design
in Design Compiler memory is overwritten by the new, mapped design, ready
for use in later physical design of place and route.
5.2 Setting up libraries for synthesis
Before Design Compiler could synthesize a design, it is necessary to tell the tool
which libraries it should use to pick up the cells. In total, there are three types
of libraries that should be specified in Synopsys setup file ”.synopsys dc.setup”,
namely, technology libraries, symbol libraries, and DesignWare libraries.
The technology libraries, which contain information about the characteristics
and functions of each cell, are provided and maintained by the semiconductor
vendors and must be in .db format. If only library source code is provided, you
have to use Synopsys Library Compiler to generate .db format, see Library Com-
piler manual for further information. In addition to cell information, Wireload
models are also provided to calculate wire delays during optimization process.
Wireload models will be discussed in detail in the following sections.
The symbol libraries contain graphical symbols for each library cells. As
with technology libraries, it is also provided and maintained by semiconductor
vendors. Design compiler uses it to display symbols for each cell in the design
schematic.
The DesignWare libraries are composed of pre-implemented circuits for arith-
metic operations in your design, and such designware parts are inferred through
arithmetic operators in the HDL code, such as ”+−∗ <><=>=”. The libraries
come with two types: the standard DesignWare library and the Foundation De-
signWare library. The standard version provides some basic implementation of
arithmetic operations, for example, the classic implementation of multiplication
by a 2-D array of full adders. The Foundation provided some advanced im-
plementation to provide significant performance boost at the cost of a higher
gate count. For example, a Booth-coded Wallace tree multiplier is provided
ElectroScience — 76 —
5.2. Setting up libraries for synthesis
library ieee;
entity test is
port (
a : in ..
b: in ...
...
end test;
architecture ..

create_clock ...
uniquify
... ...
Translate
Mapping+optimization
constraints
VHDL
GTECH netlist
Technology netlist
Figure 5.2: Synthesis Process
for multiplications. In addition to basic arithmetic operations, several com-
plex arithmetic operations are implemented to obtain even higher performance.
These are the MAC (Multiply-Accumulate), SOP (Sum-of-Products), vector
adder(A+B+C+D), their counterpart DesignWare components are DW02 mac,
DW02 prod sum, DW02 sum respectively. One thing to note is these parts can
not be inferred through HDL operators, modifications to the HDL code has to be
done by instantiating or using function calls to utilize those parts. The standard
library comes for free, while the foundation library need a extra license.
To setup these libraries in Synopsys setup file, use four library variables,
”target library, link library, symbol library, synthetic library”.
Here is one example segment of the setup files for AMS035 technology at the
department.
/* -- .synopsys_dc.setup -- -- setup file for Synopsys synthesis
Synopsys v2002.05 --
AMS 0.35u CMOS v3.50 -- Feb 2003 / S Molund -- */
company = "TdE LUND"
cache_read = ./; cache_write = ./;
search_path = {
/usr/local-tde/cad1/amslibs/v3.51/synopsys/c35_3.3V } +
search_path
link_library = { c35_CORELIB.db c35_IOLIB_4M.db standard.sldb
dw01.sldb dw02.sldb}
target_library = { c35_CORELIB.db c35_IOLIB_4M.db }
symbol_library = { c35_CORELIB.sdb c35_IOLIB_4M.sdb }
synthetic_library = { standard.sldb dw01.sldb dw02.sldb }
From Synopsys manual, target libraries set the target technology libraries
where all the cells are mapped to, in the example above, ”c35 CORELIB.db
— 77 — Lund Institute of Technology
Chapter 5. Synthesis
c35 IOLIB 4M.db” in the directory indicated in ”search path”. The link li-
braries resolve cell references in your design, so both target library(c35 CORELIB.db
c35 IOLIB 4M.db), your own design(* by default), and DesignWare libraries(standard.sldb
dw01.sldb dw02.sldb) should be specified here. The variable ”symbol library”,
”synthetic library” set symbol libraries and DesignWare libraries respectively,
where ”stdandard.sldb” represents standard DesignWare library while ”dw01.sldb
dw02.sldb” is a foundation DesignWare library.
5.3 Baseline synthesis flow
After correctly setting up the environments, the synthesis tool is ready to start.
Synopsys Design Compiler provides two ways to synthesize the design. One is
the command line user interface, called dc shell, which works much like the way
of a unix shell, and even some common unix command is also supported in such
a shell, eg., ls, cd. The shell can be invoked by typing the command: dc shell.
If the designer is more into the graphical interface, there exists two programs
provided by Synopsys, namely, design analyzer and design vision. These two
programs are nothing but simple graphical interfaces for dc shell. Most com-
mon operations in synthesis can be done by ”pushing-button” procedure in the
graphical tools. To use any of the two graphical tools, type in design analyzer
& or desgin vision -dcsh mode &.
To synthesize a design using command line based dc shell is usually con-
fusing for beginners, however, it is a more efficient way to do synthesis for the
experienced designer. By running synthesis script (a collection of synthesis com-
mands and constraints) on a dc shell, a synthesis flow can be fully automated
without any intervention. For this reason, only synthesis commands/constraints
are given for each synthesis steps in the following sections. Beginners who uses
graphical tools can easily find these commands/constraints in the menus, refer
to Design Vision User Guide or Design Analyzer Reference manual for more
details about the graphical tools.
5.3.1 Sample synthesis script for a simple design
The following is a sample synthesis script for a simple design. Assume there is
one project containing a hierarchy of design files, which is shown in Figure 5.3.
To synthesize the design, the design files are first analyzed one by one from
bottom up in the hierarchy. During this process, each design is checked for
syntax error and loaded into Design Compiler memory if no errors are found.
The top design is then elaborated, which means the whole design is translated
into a circuit interconnected by technology independent logic gates (GTECH
netlist). Given some goals (eg. speed or area of the circuit) by constraint
specification, Design Compiler starts mapping and optimization processes at
the same time. That is, trying to pick up cells from the technology libraries
to meet the design goals. This is called optimization, which can be done by
the command compile. Whenever the design is optimized and mapped to the
technology gates, it can be saved as different format for future use, eg., .sdf and
.vhd format for back-annotate simulation, .v format(verilog) for place&route.
To check if the synthesized design meets the goals, report has to be generated.
Use the command report timing or report area.
ElectroScience — 78 —
5.4. Advanced topics
TOP
Module_1 Module_2
Cell_1 Cell_2 Cell_3
Figure 5.3: Sample hierarchical design
/* Analyze and ELaborate design */
analyze -f vhdl -lib WORK cell_1.vhd
analyze -f vhdl -lib WORK cell_2.vhd
analyze -f vhdl -lib WORK cell_3.vhd
analyze -f vhdl -lib WORK module_1.vhd
analyze -f vhdl -lib WORK modele_2.vhd
analyze -f vhdl -lib WORK top.vhd
elaborate top -lib WORK
/* constraints (design goals) */
create_clock -period CLK_PERIOD -name CLK find(port CLK)
uniquify
/* compile (mapping & optimization) */
current_design top
compile
/* outputs */
write -format db -hier -output TOP.db
write -f vhdl -hier -output TOP.vhd
write -f verilog -hier -output TOP.v
write_sdf -version 1.0 top.sdf
/* reports */
current_design top
report_timing >"timing.rpt"
report_area > "area.rpt"
5.4 Advanced topics
So far, simple synthesis steps have been given to show how to derive a hardware
(netlist) from a behavioral HDL descriptions using Design Compiler. However,
due to its inherent complexity in synthesis process, further knowledge on inner
working mechanism of synthesis process would be beneficial for a designer to
control the synthesis process in order for better synthesis results. The following
sections will try to address some of important issues that will lead to a bet-
ter understanding of the synthesis process, hopefully resulting in better script
writing to control the synthesis process.
— 79 — Lund Institute of Technology
Chapter 5. Synthesis
Figure 5.4: Statistic distributions of wire capacitance value
5.4.1 Wireload model
During optimization phase, design compiler needs to know both cell delays and
wire delays. The cell delays are relatively fixed and dependent on the fan-
ins and fan-outs of those cells. Wire delays, however, can not be determined
before post-synthesis place and route. One has to estimate such delays in certain
”pessimistic” way to guarantee the estimated delay is equal to or larger than the
real wire delays after place and route, otherwise the circuit might malfunction.
Usually, Wireload models are developed by semiconductor vendors, and this is
done based on statistical information taken from a variety of example designs.
For all nets with a particular fanout, the number of nets with a given capacitance
is plotted as a histogram, one capacitance is picked to represent the real values
for that fanout depending on how conservative criteria you want the model to
be. Figure 5.4 is one example of such histogram. If a criteria of 90 percent is
chosen, the value is 0.198 pF in this case.
Since the larger a design gets, the larger capacitance the wires might have,
Wireload models usually come with a range of area specifications. Use the
command report lib, you will see the segment describing the wireload models
as the example below.
****************************************
report : library
Library: c35_CORELIB
Version: 2003.06-SP1-3
Date : Wed Aug 18 13:06:21 2004
****************************************
Library Type : Technology
Tool Created :2000.11 Date Created : Date:2003/03/14
Library Version : c35_CORELIB 1.8 - bka Comments :
Owner: austriamicrosystems AG HIT-Kit: Digital Time Unit : 1ns
Capacitive Load Unit : 1.000000pf Pulling Resistance Unit :
1kilo-ohm Voltage Unit : 1V Current Unit :
1uA Leakage Power Unit : Not specified.
ElectroScience — 80 —
5.4. Advanced topics
Bus Naming Style : %s[%d] (default)
Wire Loading Model:
Name : 10k Location : c35_CORELIB Resistance
: 0.0014 Capacitance : 0.001633 Area : 1.8
Slope : 5 Fanout Length Points Average Cap Std
Deviation
--------------------------------------------------------------
1 5.00
Name : 30k Location : c35_CORELIB Resistance
: 0.0014 Capacitance : 0.001633 Area : 1.8
Slope : 6 Fanout Length Points Average Cap Std
Deviation
--------------------------------------------------------------
1 6.00
Name : 100k Location : c35_CORELIB Resistance
: 0.0014 Capacitance : 0.001633 Area : 1.8
Slope : 8 Fanout Length Points Average Cap Std
Deviation
--------------------------------------------------------------
1 8.00
Name : pad_wire_load Location : c35_CORELIB
Resistance : 0.0014 Capacitance : 0.001633 Area
: 1.8 Slope : 15 Fanout Length Points Average Cap
Std Deviation
--------------------------------------------------------------
1 15.00
Wire Loading Model Selection Group:
Name : sub_micron
Selection Wire load name
min area max area
-------------------------------------------
0.00 1427094.00 10k
1427094.00 4280637.00 30k
4280637.00 219533760.00 100k
Wire Loading Model Mode: enclosed.
As to how to select Wireload models for wires crossing design hierarchy,
synopsys provides three modes, namely, top, enclosed and segmented.
This is shown in Figure 5.5.
For the top mode, Design Compiler sees the whole design as if there is no
hierarchy, and use the Wireload model used for the top level for all nets.
In the enclosed mode, Design Compiler uses the Wireload model specified
for the smallest design that encloses the wires across the hierarchy.
Segmented mode is the most optimistic mode among all these three, it uses
several Wireload models for segments of the wire for each hierarchy.
You can choose which mode to use by the command set wire load model
in your constraint file. Otherwise, design compiler uses the mode specified in
the technology libraries or it’s own default mode.
5.4.2 Constraining the design for better performance
After translating process, the design in behavioral HDL is translated into tech-
nology independent gates, after which optimization phase starts. Two major
— 81 — Lund Institute of Technology
Chapter 5. Synthesis
10K 10K 30K
100K 100K
100K
10K
Mode=segmented
100K
10K
30K
10K
30K
100K
D Q
D Q
Q D
D Q
D Q
D Q
D Q
D Q
D Q
30K
100K 100K 100K
100K 100K
Mode=top
100K
100K
100K
10K 10K
100K
100K
D Q
D Q
Q D
D Q
D Q
D Q
D Q
D Q
D Q
30K
30K 30K 30K
100K 100K
Mode=enclosed
100K
100K
100K
10K
100K
100K
10K
D Q
D Q
Q D
D Q
D Q
D Q
D Q
D Q
D Q
Figure 5.5: Wireload mode
tasks take place during this process, combinational logic optimization plus map-
ping and sequential logic optimization.
For combinational logic optimization, a phase called technology-independent
optimization is performed to meet your timing and area goals. Two most com-
monly used methods to do this are logic flattening and structuring which are
described in the section about logic optimization. Combinational logic mapping
takes in the technology-independent gates optimized from an early phase and
maps those into cells selected from vendor technology libraries. It tries different
cell combinations, using only those that meet or approach the required speed
and area goals.
Sequential optimization maps sequential elements (registers, latches, flip-
flops, etc.) in the netlists into real sequential cells (standard or scan type) in
the technology libraries, and also tries to further optimize sequential elements
on the critical path by attempting to combine sequential elements with it’s
surrounding combinational logic and pack those into one complex sequential
cells, refer to Design Compiler Reference Manual: Optimization and Timing
analysis.
5.4.3 Compile mode concepts
Before going into details about optimization techniques, the basic concepts re-
garding different compile modes should be explained. It is crucial for under-
standing different compile strategies and scripts writing. If properly used, these
compile modes will significantly reduce the compile time and improve the final
ElectroScience — 82 —
5.4. Advanced topics
synthesis results.
• Full compilation
The most commonly used compile mode is the full compilation. It starts
the optimization and mapping process all over again no matter whether
the design was compiled or not before. If prior optimized gate structure
already exists, it removes it in the design and rebuild a brand new gate
structure. Usually this results in better circuit implementation after the
design has been changed, but such improvements come at the cost of
increased compilation time. Full compilation is simply invoked by the
command compile in the scripts.
• Incremental compilation
Incremental compilation is useful if only part of the design fails to meet
the constraints. It starts with an optimized gate structure and works
on the parts with violations. It tries to re-map these parts and change
the structure only if they are improved. BY working on only problem-
atic parts without first phase of logic optimization, the compile time is
reduced substantially. Use the command compile -inc for incremental
compilation.
5.4.4 Compile strategy
Most designers would not bother with design strategy if their design is within
reasonable size (for example, within 150k equivalent gates), which is the case for
the designs at the department. Just take the top level design in the hierarchy
and compile it, a better results among all alternative strategies can be achieved.
This is so called top-down compile strategy. In addition to the QOR(Quality of
Results), the script writing is much simpler and easy to understand.
If the design, however, turns into a huge project, like the ones a group of peo-
ple work on, the run time of compiling the whole design could be prohibitive. In
such cases, the synthesis process has to take place in parallel. This demands the
whole design be separated into several sub-blocks, synthesized separately, and
integrated together on the top level, a typical divide-and-conquer procedure. In
practice, such bottom-up and then top-down strategy would not be manipulated
as easily as it sounds. Problems always emerge in the sub-block interfacing. If
snake path is present due to not fully registered sub-blocks, manually setting
constraints on the sub-block borders could be laborious and apt to mistakes.
Some blocks might be over-constrained while others under-constrained. This in
turn will direct Design Compiler to work on the wrong part of a timing path.
Solution to the above problems is automatic design budgeting, one capability
that comes with Design Compiler.
Design budgeting can be invoked on dc shell with the command dc allocate budgets
at three different levels. The budgeting starts either in unmapped RTL level or
mapped gate level or the mix of these two. It utilizes top level constraints and
delay information (cell delays, Wireload models, physical data from P&R, etc.)
to create a set of constraints among all hierarchies, a designer could then apply
these constraints to the sub-blocks interested, and compile top-down for the
sub-modules separately, and do top level compile to fix interconnect violations
— 83 — Lund Institute of Technology
Chapter 5. Synthesis
for top level modules. Refer to Synopsys User Guide, Budgeting for Synthesis
for more information.
5.4.5 Design Optimization and constraints
Design optimization is performed by Design Compiler according to user specified
constraints. Two types of constraints exist for constraining your design. The
following sections will discuss these in detail.
One of the two types is the design rule constraints, which are implicit con-
straints defined by the technology libraries. These constraints are required to
guarantee the circuit to function correctly. Some examples are:
• The maximum transition time for a net, specifying a cell with input pin
connected to the net to function correctly.
• The maximum fanout of a cell that drives many load cells.
• The maximum load capacitance a net could have.
• A table listing maximum capacitance of a net as a function of the transition
times at the inputs of the cell.
When violations are found during the optimization process, measures are
taken to meet the requirements, eg. choosing cells with different size, buffer-
ing the outputs of a driving cell. Although the design will meet design rule
constraints with default values by technology libraries, the designer could still
specify more restrict design rule constraints.
The other type is optimization constraints. These constraints are defined by
the designer with the goal to get better synthesis results such as higher speed or
less chip area. It has a lower priority over design rule constraints. In other word,
design compiler can not violate design rule constraints, even if it means violat-
ing optimization constraints. Comprised of four types of constraints (namely
speed, power, area, and porosity), optimization constraints are commonly used
to optimize for the speed of your design.
To use optimization constraints to obtain better circuit performance, you
should have in mind that your design is composed of two parts, synchronous
and asynchronous paths. Synchronous paths are usually the timing paths of
combinational logics between any two sequential elements (eg. registers). To
specify the maximum delay of these timing paths, you only have to set clock
period constraints on the clock pins of the sequential elements. Use the com-
mand create clock to specify the clocks. In addition to timing paths between
sequential elements, any paths starting from an input port and end at the se-
quential elements or paths from sequential elements to output ports are regarded
as synchronous paths as well. Apart from clock period setting, input and out-
put delays have to be specified to complete timing paths constraints. Use the
set input delay and set output delay for this purpose. Asynchronous delay
is defined as point to point paths with partial or no association with sequential
elements. To set the constraints on these paths like maximum and minimum
delays, use the command set max delay and set min delay.
In the example, only clock period is set to constrain any synchronous paths
between synchronous elements under the assumptions of 0 input and output
delay.
ElectroScience — 84 —
5.4. Advanced topics
Optimize accrodingly TOP
A_1
A_0
Copy
Copy
Subdesign A
A_0
A_1
TOP
Compile
D Q
D Q
D Q
D Q
D Q
D Q
D Q
D Q
Figure 5.6: Uniquify method
5.4.6 Resolving multiple instances
Before compilation, the designer need to tell Design Compiler how multiple
instances should be optimized according to the environment around. In a hier-
archical design, a sub-design is usually referenced more than once in the upper
level, thus reducing the effort of recoding the same functionality by design reuse.
In order to obtain better synthesis results, each copy of these sub-designs has
to be optimized specifically to be tailored to the surrounding logic environment.
Design Compiler provides three different ways of resolving multiple instances.
• Uniquify
When the environment around each instance of a sub-design differs signif-
icantly, it would be better to optimize each copy independently to make
it fit to the specific environment around them. Design Compiler provides
the command Uniquify to do this. The command makes copies of the
sub-design each time it is referenced in the upper level of the hierarchy,
and optimize each copy in a unique way according to the conditions and
constraints of its surrounding logic. Each copy of the sub-design will be
renamed in certain conventions. The default is %s %d, where %s is the
sub-design name, and %d is the smallest integer count number that dis-
tinguish each instance. This is shown in Figure 5.6. The corresponding
sample script is as follows:
dc_shell> current_design top
dc_shell> uniquify
dc_shell> compile
• Don’t touch
There is one another way called Don’t touch that compile the sub-design
once and use the same compiled version across all the instances that ref-
erence the sub-design. This is shown in Figure 5.7. This is used when the
environment around all the instances of the same sub-design is quite simi-
lar. Use the command characterize to get the worst environment among
all the instances, and the derived constraints are applied when compiling
the sub-design. A dont touch attribute is then set on the sub-design
so that the following compilation in the upper level will leave the all the
instances intact. The major motivation of doing optimization in this way
is to save compilation time of the whole design. The script is as follows:
— 85 — Lund Institute of Technology
Chapter 5. Synthesis
Subdesign
TOP
A
A
A
Copy
Compile
A
D Q
D Q
D Q
D Q
D Q
Figure 5.7: Dont touch method
dc_shell> current_design top
dc_shell> characterize U2/U3
dc_shell> current_design C
dc_shell> compile
dc_shell>current_design top
dc_shell> set_dont_touch {U2/U3 U2/U4}
dc_shell>
compile
• Ungroup
If the designer wants the best synthesis results in optimizing multiple in-
stances, use the command Ungroup. Similar to Uniquify, Ungroup makes
copies of a sub-design each time it is referenced. Instead of optimizing each
instance directly, design compiler removes the hierarchy of the instance to
“melt” it into its upper level design in the hierarchy. This is shown in
Figure 5.8. In this way, further design optimization could be done at the
logics that were originally at the instance border which are impossible to
optimize with design hierarchy. The better synthesis result comes at the
cost of even longer compilation time. In addition, design hierarchy is de-
stroyed, which makes post-synthesis simulation harder to perform. The
sample script is as follows:
dc_shell> current_design B
dc_shell> ungroup {U3 U4}
dc_shell>current_design top
dc_shell> compile
5.4.7 Optimization flow
Until now, enough “peripheral” knowledge has been introduced regarding de-
sign optimization. This section will move on to give an overview of the whole
optimization process, intended for a better understanding of what is actually
happening in synthesis process. This, to the author’s view, is beneficial for
constraints settings in optimization.
ElectroScience — 86 —
5.4. Advanced topics
Compile
Copy
Copy
Subdesign A
TOP
A_0
A_1
Hierarchy
Remove
A_0
TOP
A_1
TOP
Two logic blocks merging Optimize accrodingly
D Q
D Q
D Q
D Q
D Q
D Q
D Q
D Q
D Q
D Q
D Q
D Q
Figure 5.8: Ungroup method
In fact, design optimization starts already when it reads in designs in HDL
format. Aside from performing grammatical checks, it tries to locate com-
mon sub-expressions for possible resource sharing, select appropriate Design-
Ware implementations for operators, or reorder operators for optimal hardware
implementations, etc. These high level optimization steps are based on your
constraints and HDL coding style. Synopsys refers this optimization phase as
Architectural Optimization.
After architectural optimization, the design represented as GTECH netlist
goes through logic optimization phase. Two major process happens during this
phase:
• Logic Structuring
For paths not on critical timing ranges, structuring means smaller area
at the cost of speed. It checks the logic structure of the netlist for common
logic functions, attempting to evaluate and factor out the most commonly
used factors, and put them into intermediate variables. Design resource
is reduced by hardware sharing. Introducing extra intermediate variables
increases the logic levels into the existing logic structure, which in turn
makes path delay worse. This is the default mode that design compiler
optimize a design.
• Logic Flattening
Opposite to logic structuring, logic flattening converts combinational logic
paths into a two level, sum-of-products structure by flattening logic ex-
pressions with common factors. It removes all the intermediate variables
and form low depth logic structure of only two levels. The resulting en-
— 87 — Lund Institute of Technology
Chapter 5. Synthesis
hanced speed specification is a tradeoff for area and compilation time, and
some times it is even not practical to perform such tasks due to limited
CPU capability. By default, design compiler does not flatten the design.
It is even constraints independent, meaning design compiler would not in-
voke it even timing constraints is present. Designer, however, can enable
such capability by using the command set flatten.
Gate level optimization is about selecting alternative gates from the cells in
the technology libraries, which tries to meet timing and area goals based on your
constraints. Setting constraints (both design rule and optimization) is already
covered in the previous section. Various compile mode can be used to control
the mapping process in gate level optimization.
5.4.8 Behavioral Optimization of Arithmetic and Behav-
ioral Re-timing
For designs with large number of arithmetic operations such as multimedia ap-
plications, manually setting constraints to optimize datapath is a tedious work.
Designer has to go back to HDL descriptions, manually inserting pipeline stages
to the parts on the critical ranges. Specialized arithmetic components have to
be coded or instantiated in the HDL file, making it harder to maintain. To
address this issue, Design Compiler provides two capabilities that will automate
such a process.
BOA (Behavioral Optimization of Arithmetic) is a feature that applies var-
ious optimization on arithmetic expressions. It is effective for datapaths with
large number of computations in the form of tree structures and large word-
length operands. BOA replaces arithmetic computations in the form of adder
trees with carry save architecture. This includes decomposing a multiplier into
a wallace tree and an adder. It also optimizes multiplication with a constant
into a series of shift-and-add operations. Compared to similar implementations
from DesignWare components, BOA can also work on much more complicated
arithmetic operations and does not need to change the source code manually.
Acting as a behavioral optimization techniques, it only recognizes high level
synthetic operators (+, −, ∗), and is invoked by the command transform csa
after elaboration (but before compile).
BRT (Behavioral Re-timing) is a technique to move registers around com-
binational paths to achieve faster speed or less register area. The concept of
retiming is introduced in the course “DSP-Design”. By inserting pipeline reg-
isters into timing paths, the path delays between any two registers are reduced.
As a result, the total circuit could run at a higher speed at the cost of latency.
The idea is easy to understand but hard to implement when this is done in
the mapped gate level. Use the command pipeline design provided by Design
Compiler. This command will automatically place the pipeline registers on your
design to achieve the best timing results according to your requirements. One
thing to note is that the targeted design has to be purely combinational circuits.
There is another command, optimize registers, which optimizes the number
of the registers by moving registers in a sequential design. Refer to Creating
High-speed Data-Path Components for more details.
ElectroScience — 88 —
5.5. Coding style for synthesis
Figure 5.9: synthesis for if bad
5.5 Coding style for synthesis
Although synthesis tools will optimize the design to a certain extent, only 10-20
percent of performance enhancement is usually expected. The higher the design
level you are working on, the more performance boost you will get eventually.
In this context, the most important factor for a good design lies mostly in every-
thing in the upper level, ie, how well the algorithm is, what kind of architecture
the whole system is utilized, how good the whole system is partitioned, whether
a good coding style is used in VHDL.
Since this will cover everything from algorithm down to logic gates, which
is beyond the range of the tutorial, only coding styles will be covered in the
following section. Two examples on how the synthesis results could differ in a
good and bad coding style will be shown.
5.5.1 Coding style if statement
This is a “classical” problem regarding if statement coding style in vhdl, which
comes to different synthesis results. Below is two coding styles using if state-
ment for the same functionality.
library ieee;
use ieee.std_logic_1164.all;
entity if_bad is
port (
a, b, c, d : in std_logic;
sel : in std_logic_vector(1 downto 0);
o : out std_logic);
end if_bad;
architecture behave of if_bad is
begin
process (sel, a, b, c, d)
begin -- process
if sel="00" then
o<=a;
end if;
if sel="01" then
o<=b;
end if;
if sel="10" then
o<=c;
end if;
if sel="11" then
o<=d;
end if;
end process;
end behave;
— 89 — Lund Institute of Technology
Chapter 5. Synthesis
Figure 5.10: synthesis for if good
library ieee;
use ieee.std_logic_1164.all;
entity if_good is
port (
a, b, c, d : in std_logic;
sel : in std_logic_vector(1 downto 0);
o : out std_logic);
end if_good;
architecture behave of if_good is begin
process (sel, a, b, c, d)
begin -- process
if sel="00" then
o<=a;
elsif sel="01" then
o<=b;
elsif sel="10" then
o<=c;
else
o<=d;
end if;
end process;
end behave;
Here is the synthesis results from synopsys for both designs. For if bad
design, critical path increases due to the delay propagated from input pin d
through all the way to the output o. Design Compiler is not smart enough to
take the advantage of all these exclusive if conditions, that should be translated
into a single MUX. Instead, it uses priority coding for if statements without
following else statements. This results in three cascaded MUXes.
5.5.2 Coding style for loop statement
For for loop construct in the vhdl design, Design Compiler simply unrolls the loop, making a
hardware for each iteration of the loop. So, when writing such loops, the designer should aware of
the hardware inference in such for loop, and try to use as less hardware as possible within the loop.
The following is an example of two for loop coding styles.
architecture RTL of LOOP_BAD is
type TABLE_8_x_5 is array (1 to 8) of std_logic_vector (4 downto 0);
begin
process (BASE_ADDR, IRQ)
variable INNER_ADDR: std_logic_vector(4 downto 0);
ElectroScience — 90 —
5.5. Coding style for synthesis
Figure 5.11: synthesis for loop bad
variable DONE: std_logic;
constant OFFSET: TABLE_8_x_5 :=("00001","00010", "00100",
"01000","10000","10001","10100","11000");
begin
DONE := ’0’;
INNER_ADDR := BASE_ADDR;
for I in 1 to 8 loop
if ((IRQ(I) = ’1’) and (DONE = ’0’)) then
INNER_ADDR := INNER_ADDR + OFFSET(I);
DONE := ’1’;
end if;
end loop;
ADDR <= INNER_ADDR;
end process;
end RTL;
architecture RTL of LOOP_BEST is
type TABLE_8_x_5 is array (1 to 8) of std_logic_vector (4 downto 0);
begin
process (BASE_ADDR, IRQ)
variable TEMP_OFFSET: std_logic_vector(4 downto 0);
variable DONE: std_logic;
constant OFFSET: TABLE_8_x_5 := ("00001","00010","00100","01000",
"10000","10001", "10100","11000");
begin
TEMP_OFFSET := "00000";
for I in 8 downto 1 loop if (IRQ(I) = ’1’) then
TEMP_OFFSET := OFFSET(I);
end if;
end loop;
-- Calculate Address of interrupt Vector
ADDR <= BASE_ADDR + TEMP_OFFSET;
end process;
end RTL;
— 91 — Lund Institute of Technology
Chapter 5. Synthesis
Figure 5.12: synthesis for loop good
ElectroScience — 92 —
Chapter 6
Place & Route
6.1 Introduction and Disclaimer
This manual is intended to guide an ASIC designer carrying out the designs steps
from netlist to tapeout. The reader is guided by using the Silicon Ensemble GUI
and by the use of script (mac) files. However, this manual was never intended
to be complete. The design steps considered in this manuscript are presented
in Figure 6.1. Several digital ASIC’s have been successfully produced (in our
digital ASIC group) according to the presented design flow. For a more detailed
description the reader is referred to the Silicon Ensemble manual.
Although every precaution has been taken in the preparation of this manual,
the publisher and the authors assume no responsibility for errors or omissions.
Neither is there any liability assumed for damages or financial loss resulting
from the use of the information contained herein.
6.2 Starting Silicon Ensemble
Before you start SE you should ensure that your design has been tested exten-
sively as the place and routing job can be very time consuming dependent on
the design size.
The required library elements for the design flow are:
• LEF/DEF Files (usually provided from the chip manufacturer)
• Netlist in Verilog format (generated with Synopsys)
!As SE needs a lot of disk quota to do the routing you should check that
you have enough free disk space!
— 93 — Lund Institute of Technology
Chapter 6. Place & Route
Import to Silicon
Ensemble
Block and Cell
Placement
Floorplanning
Pad Placement
Signal Routing
Clock Tree
Generation and
Routing
Design Verification
Initialization Files
LEF/DEF
Verilog
Power Routing
Figure 6.1: Silicon Ensemble Design Flow
To start SE go to the cadence work directory: cd NameofDirectory
Setup the SE environment with source ....
To view the SE manual type: openbook
Start SE graphical user interface with seultra -m=500&
and following actions will take place:
• 200 MB RAM will be allocated for the use of SE.
• A log file se.jnl will be created
• Commands in the initialization file se.ini will be run
• The SE GUI pops up and is ready to use
ElectroScience — 94 —
6.2. Starting Silicon Ensemble
Type seultra -h for more information.
— 95 — Lund Institute of Technology
Chapter 6. Place & Route
6.3 Import to SE
To create a database for your design the LEF files have to be imported to cre-
ate a physical and logical library. LEF files contain information on the target
technology, the cells and the pads. If memory cells are used the respective LEF
files have to be read as well. The import can be done by using the GUI.
⇒ Select File → Import → LEF
Import cmos035tech.lef (process parameters), MTC45000.lef, MTC45100.lef
(cells and pads) and memory.lef (custom memeory) in the given order and
terminate each import routine with ok.
The description of the memory and a netlist of the design in verilog format
needs to be imported to SE:
⇒ Select File → Import → Verilog
and browse for the memory verilog file memory.v. Repeat the same procedure
and import filter.v. Type the name of your top module under Verilog Top Mod-
ule. The name of the top module is created after the synthesis and can be quite
long as all the generics are included in the name. Set Compiled Verilog Output
Library to projdir, see Figure 6.2.
In the next step the supply pads definition file that specifies the names of
the different power pins is imported:
⇒ Select File → Import → DEF
and browse for the supplyPads.def file.
Such files contains the minimum set and a few additional supply pads. All the
files required to start the SE design flow have now been imported.
⇒ Type save design ”loaded”;
in the command line of the SE GUI.
ElectroScience — 96 —
6.4. Creating a mac file
Figure 6.2: Import of the netlist in verilog format
6.4 Creating a mac file
Instead of using the GUI to import the data, a script file can be executed to
perform the same operations. All the instructions needed to set up such a script
file can be found in se∗.jnl. The *.jnl is a screen dump of the report window.
If the structure of a script file and the SE design flow is understood, script files
can be more than a competitive alternative to the GUI. If parameters needs
to be changed it is more convenient to do the changes in the script file rather
than doing the changes via the GUI. For future designs the script files can be
easily modified and the time needed to do the place and route job is significantly
reduced. Furthermore, a complete mac file serves as documentation and makes
it possible to repeat the flow exactly.
⇒ Open a new file, e.g. 1 verilogin.mac, with an editor, e.g emacs. Scan
se.jnl for the instructions that have been carried out with the GUI to load the
LEF, Verilog and DEF files. Paste the instructions in your newly generated file.
You script file may look as follows:
— 97 — Lund Institute of Technology
Chapter 6. Place & Route
##-- Import Library Data ##--
FINPUT LEF F LEF/cmos035tech.lef;
INPUT LEF F LEF/MTC45000.lef ;
INPUT LEF F LEF/MTC45100.lef ;
INPUT LEF F LEF/memory.lef;
##-- your design.
INPUT VERILOG FILE "/../memory.v"; LIB "cds_vbin" REFLIB "MTC45000 MTC45100
cds_vbin";
INPUT VERILOG FILE "/../your_design.v"; LIB "projdir" REFLIB "MTC45000 MTC45100
cds_vbin";
DESIGN "projdir.your_design:hdl" ;
INPUT DEF F DEF/supplyPads.def ;
SAVE DESIGN "LOADED" ;
FLOAD DESIGN "LOADED";
Change the first command (in your mac file) INPUT to FINPUT. The dif-
ference between these commands is that it is possible to add data to an existing
library with INPUT but not with FINPUT. However, as SE provides a huge set
of commands, are variations in commands that result in the same design task
possible.
⇒ Select File → Execute → 1 verilogin.mac
All the necessary files have been imported as it has been equivalently done
with the GUI.
ElectroScience — 98 —
6.5. Floorplanning
6.5 Floorplanning
The floorplanning is done by using a combination of automatic functions pro-
vided with the GUI in combination with modifications done by the user, see
Figure 6.3. By Initialize Floorplan following actions will take place:
• A starting floorplan is created by an estimate of the required layout.
• Global and detailed routing grids are created.
• The core rows are created
• Sites for corner cells are created if necessary
⇒ Select Floorplan → Initialize Floorplan and Figure 6.3 pops up.
In Figure 6.3 the design statistics such as Number of Cells, IO’s, Nets is
provided by SE. The aspect ratio is set to one by default, which is square.
The distance of the IO to the core needs to be set, e.g. 150µ in both cases.
The distance has to be large enough to leave room to route power and ground
wires.
⇒ Select Flip Every other row and Abut Rows to create rows where VDD and
GND alternates. Set the value for Block Halo Per Side (distance between blocks
and rows) to 30µ for this design. The calculate button describes the floorplan
in terms of number of rows, row utilization, chip area etc., before creating the
floorplan. The floorplan, as described in Expected Results, will be created by
confirming your settings with apply or ok.
Figure 6.3: Initialize Floorplan Dialog Window
— 99 — Lund Institute of Technology
Chapter 6. Place & Route
You can try different setting, e.g. IO to core distance, and observe how the
floorplan changes. Before continuing in the design flow return to your initial
settings for the floorplan. A floorplan and a memory cell are visible now, see
Figure 6.4.
Figure 6.4: Initialized floorplan with memory cell
ElectroScience — 100 —
6.6. Block and Cell Placement
6.6 Block and Cell Placement
The memory cell, currently placed on the lower left site, needs to be placed on
the core, see Figure 6.4.
⇒ Activate cell as selectable,
⇒ Select Move on the icon panel on the left hand side.
Drag the memory cell to one of the corners on the core.
Assure that the memory supply faces to one of the core borders (the memory
pins become visible if net is selected as visible). Thus the memory power ring
has only to be drawn on two or three sides of the memory. The coordinates for
cell orientation are illustrated in Figure 6.5.
⇒ Select Floorplan → Update Core Rows → ok.
The rows underneath the memory are cut and the rows surrounding the memory
are shortened (halo per side) to give space for the memory power ring.
⇒ Scan se.jnl for the instructions and paste them a new file, e.g. 2 floorplan.mac.
Next, the power pins and signal IO’s needs to be placed.
⇒ Select Place → IO’s → Random → ok.
⇒ Select Place → IO’s → I/O constraint file and select Write.
A file ioplace.ioc is generated that specifies the location/orientation of the
pads. The order of the cell names in the generated file is the placement order:
Left → right in the top and bottom rows and bottom → top in the left and
right rows.
⇒ Select Edit if you want to do changes in the pad placement. The names of
the IO’s as specified in the vhdl code are listed with their respective net name
in the verilog file.
⇒ Finally, place the IO pads with ok.
The aim of placing the pads manually by using ioplace.ioc is to make your
ASIC fit better with surrounding blocks. However, by setting such constraints
it might be a much harder job for the router and some configurations might not
be routeable at all. As a rule of thumb: Place the supply pads evenly in the
center of each side of the chip. Interrupt a series of IO pads with pad supply to
obtain a better driving of the pads. The more supply pads the better. Avoid
core supply pads in the corners of the ASIC.
North FLPNorth South FLPSouth
Figure 6.5: Initialized floorplan with memory cell
— 101 — Lund Institute of Technology
Chapter 6. Place & Route
The floorplan is now ready to insert the cells. The placement will be done
automatically (QPlace). Dependent on the design size this can take some time.
⇒ Select Place → Cell and ok
and the cells will be placed on the rows, see Figure 6.6.
⇒ Save you design as placed.
⇒ Scan se.jnl for the instructions carried out under block and cell placement
and paste them in a new file, e.g. 3 place.mac.
Figure 6.6: Closeup of the memory and placed cells
ElectroScience — 102 —
6.7. Power Routing
6.7 Power Routing
The power ring structures surrounding the core and the memory cell has to be
planned and routed. With Plan Power the following design steps are done:
• Power paths can be planned and modified before routing them
• Generation of power rings that are surrounding all blocks and cells
• Creates stripes over standard cell rows
• Connects created ring, stripes and power pads
To start Plan Power
⇒ SelectRoute → Plan Power → Add Rings.
Only vdd and gnd is needed as core supply. Delete other supplies in the dialog
window, see Figure 6.7. The order of the rings can be defined by the order of
the net names. The first net name is the left or bottom-most ring.
⇒ Select the type of metal you want to use for the power ring. To prevent
routing violations choose metal3, metal4, as at least metal1 is already reserved
for signal routing. Specify the width for the core supply ring and channel,
located between the memory and core, to 30µ and 10µ, respectively. If the IO
to core distance (defined during initialize floorplan) is to narrow you will get a
warning.
In order to improve the power supply throughout the rows, wires crossing
the core can be added.
⇒ Select Route → Plan Power → Add Stripes.
With the dialog box that pops up, see Figure 6.9, the following can be specified:
• layer for the stripes
• width and spacing
• number of stripes (equally spaced across the core)
Figure 6.7: Dialog box for Add Rings
— 103 — Lund Institute of Technology
Chapter 6. Place & Route
Figure 6.8: Core with power rings
To create stripes across the core you may use following parameters:
layer metal4
width 5.000
spacing 1.000
mode all
Set to Set spacing 100.000
The strips are only generated to the edge of the core and will be connected
later to the power rings.
⇒ Scan se.jnl for the instructions carried out during power routing and paste
them in a new file, e.g. 4 powerroute.mac.
ElectroScience — 104 —
6.7. Power Routing
Figure 6.9: Dialog box for Add Stripes
Filling the gaps between the IO pads
The gaps between the the IO pads needs to be closed to maintain continuity in
the power supply ring. This can be done by inserting fillercells (dummy-cells)
in the IO ring, see Figure 6.10.
⇒ Select Place → FillerCells → AddCells
to open the dialog box. Fill in the Model as IOFILLER64 and the prefix as
FP64.
⇒ Select North, Flip North, South, Flip South under Placement, see Figure 6.11.
Iterate this procedure with narrower filler cells (32,16,8,4,2,1). The different
filler cells can be found in MTC45100.lef. Observe how the gaps between the
IO pads are vanishing.
⇒ Scan se.jnl for the instructions and paste them in a script file.
It is recommended to save the commands for the filler cells in a separate script
file, e.g. IOdummys.mac, as the script can easily be reused in the AMIS 0.35
CMOS technology. You only need to set the area parameter to a rather big
value to make it applicable for future designs.
— 105 — Lund Institute of Technology
Chapter 6. Place & Route
IO IO IO IO IO IO
IO IO IO IO IO IO
IOcell IOcell IOcell IOcell
IO IO IO IO IO IO
IOcell IOcell IOcell IOcell
text
text
text
text
vdd
gnd
Filler
cell
Filler
cell
text
text
vdd
gnd
Before placement
After placement: empty
cells result in interruption
of the supplies
Filler cells result in
continuous supply lines
Figure 6.10: Adding Filler Cells
6.8 Connecting Rings
To connect the rings use the Connect Rings command. The following is com-
pleted with the Connect Ring command:
• Connections between IO power pins within IO rows
• Connections between CORE IO ring wires and the IO power pins
• Connections between stripes and core rings
• Connections between block power pins and the core IO wires
⇒ Select Route → Connect Ring
Assure that Stripe, Block, All Ports, IO Pad, IO Ring, Follow Pins are selected.
All the supplies are now connected, see Figure 6.12.
⇒ Save the design as connected.
⇒ Scan se.jnl for the instructions carried out during connecting the rings
and paste them in a script file, e.g 5 connectrings.mac.
Export your design as b4CT.def.
⇒ File → Export → DEF
ElectroScience — 106 —
6.9. Clock Tree Generation
Figure 6.11: IO Filler Cell Window
6.9 Clock Tree Generation
As an optional tool to SE comes a clock-tree generator (CT-Gen). It can be
controlled with the GUI or as a stand-alone tool. We will use CT-gen as a
stand-alone tool. You need to export your design as b4CT.def before generat-
ing the clock tree.
It is recommended to route the clock net only after all the core cells are
connected to the supplies as eventual routing violations might be prevented.
With the use of data from your DEF file CT-Gen produces clock tree(s),
computes wire estimates and inserts buffers to reduce clock skew. The input
DEF file must be fully and legally placed. The output of CT-Gen is a fully
placed netlist with new components such as buffers and inverters.
CT-Gen provides following features:
• Construction of an optimized clock tree.
• Minimizes skew.
• Produces code for Silicon Assemble router.
• Controls buffer and inverter selection
Following constraints for the clock tree need to be defined before using CT-
Gen:
• Clock root.
• t
rise
and t
fall
of the waveform.
• Minimum/maximum insertion from clock root to any leaf
• Maximum skew between insertion delay and any leaf pin
— 107 — Lund Institute of Technology
Chapter 6. Place & Route
Figure 6.12: Closeup of the layout after the Connect Ring command
⇒ Open the file ctgen.constrainsts and set the constraints (see the comments
for explanation). In the ctgen.commands parameters such as the input LEF file,
supply definitions need to be set.
The clock generator tool uses the library CTGEN (defined in ctgen.commands)
as work library.
⇒ Delete all the files in CTGEN that may remain from previous designs with
`rm -r CTGEN/.
⇒ Start CT-Gen with ctgentool ctgen.commands
If the the clock generation was successful, a number of report files will be
produced in CTGEN/rpt. The report files are of the syntax <stage>.<type>
where <stage> is one of
• initial: without modifications.
• clock: after the clock tree has been added.
• final: when the clock tree components are placed correctly.
and <type> is one of
• analysis: complete description of best and worst path found.
• timing: the same without the details.
• trace: describes the clock path/tree with its components.
• violations: lists any violations against the given constraints.
The DEF file generated with CT-Gen needs to be loaded to SE. However,
the technology and memory LEF files needs to be imported first.
⇒ Open the verilogin.mac and paste the instructions in the SE command
line.
ElectroScience — 108 —
6.10. Filler Cells
6.10 Filler Cells
The row utilization as calculated during floorplan initialization is visible on the
floorplan. The red gaps between the cells need to be closed to avoid design rule
violations. By placing filler cells continuity is maintained in the rows. Further-
more, substrate biasing is improved by filler cells substrate connections.
⇒ Select File → Import → DEF and select afterCT.def
⇒ Select Place → FillerCells → AddCells
to open the dialog box. Fill in the Model as FILLERCELL16 and the prefix
as FC16. Select North, Flip North, South, Flip South under Placement, see
Figure 6.13.
⇒ Open a new file, e.g 6 coredummy.mac and scan se.jnl for the place filler
cell command. Paste the command in your newly generated file. Iterate this
procedure and change the width of the filler cells (8,4,2, ). The different filler
cells can be found in MTC45000.lef.
!With increased area in your 6 coredummy.mac file such file can
be reused in other designs!
Figure 6.13: Core Filler Cell Window
— 109 — Lund Institute of Technology
Chapter 6. Place & Route
6.11 Clock and Signal Routing
For final routing the Envisia Ultra Router (WRoute) is used. WRoute is
included in the SE package and can be used with the GUI or as stand-alone
tool. The benefits of WRoute are:
• 3 to 20 times faster than GRoute/FRoute
• Timing driven global and final routing
• Automatic search and repair in final routing
• Supports prerouted nets
• Final clean up
WRoute requires a lot of memory. If it crashes decrease memory allocation
for SE as WRoute is run outside SE and thus will get more memory. !It is
recommended to route the clock before routing the remaining signals (shortest
wiring possible)!
⇒ Select Route → Clock route. Assure that ALL is selected and proceed
with ok
To start the final routing
⇒ Select Route → WRoute to open the dialog box. Select Global and Final
Route and Auto Search and Repair.
⇒ Scan se.jnl for the instructions carried out during final routing the rings
and paste them in a script file, e.g 7 finalroute.mac.
If the final routing is done identify different parts of the ASIC.
⇒ Select nets as visible.
Choose one or several pins and find the clock signal. Trace the highlighted sig-
nal path and identify the cell where the net ends.
⇒ Select Edit → Find. Choose net as type. Specify the net name as: nxxxx∗.
The ∗ option ⇒ Selects every netname that starts with nxxx.
⇒ Select highlight and the entire clock tree becomes visible.
Scan the last lines in se.jnl for ”Total wire length” and ”Total number of
vias”.
⇒ Select Route → WRoute. Enable Incremental Final Route.
⇒ Select Options and enable Optimize Wire Length. OK in both boxes.
After the improved routing is done find the new ”Total wire length” and
”Total number of vias”. Check if the design could be improved.
ElectroScience — 110 —
6.12. Verification and Tapeout
Figure 6.14: Closeup of the final layout
6.12 Verification and Tapeout
As a last step in the SE design you may test if the place and route process went
smoothly. To check if there are any unconnected pins or routing violations
⇒ Select Verify → Connectivity.
Assure that Types and Nets are set to ALL.
If no errors are found the design is ready for tapeout.
Tapeout files
The finished design needs to be saved as DEF or as GDS II format. However,
for tapeout, the manufacturer requests the GDS II format.
⇒ Select File → Export → GDS II.
⇒ Select a name of the GDS-II file and a name of the Report file.
A map-file is used to translate the design to GDS II.
⇒ Choose the map-file cmos035gds2.map and terminate with OK.
The GDS II file is ready to send to the manufacturer. The manufacturer
replaces the cell abstracts with layout cells.
— 111 — Lund Institute of Technology
Chapter 6. Place & Route
6.13 Acknowledgements
Thanks to Stefan Molund for setting up a system for our needs. Thanks to
Shousheng He, Anders Berkemann, Fredrik Kristensen and Zhan Guo for con-
tributing with their experience and knowledge.
ElectroScience — 112 —
Chapter 7
Optimization strategies for
power
7.1 Sources of power consumption
Two major sources of power consumption are considered: Dynamic and static
power. Dynamic power is dissipated when a circuit is active, that is, the voltage
on a net changes due to changes at the input. However, transitions on a cell
input do not necessarily result in a logic transition at the output. Hence, one
further distinguishes between switching and internal power consumption. On
the other hand, static power is considered to be consumed when the circuit
is inactive, that is, not switching. As technology is scaled down, static power
consumption starts to be the major part of the power consumption even during
switching.
7.1.1 Dynamic power consumption
Switching power
The switching power is due to charging and discharging of capacitances and can
be described by superposition of the individual switching power contributed by
every node in a design. The load capacitance as seen by a cell is composed of net
and gate capacitances on the driving output. Since charging and discharging
are result of logic transitions, the switching power increases as logic transitions
increase.
Internal power
The internal power consumption is due to voltage changes on nets that are
internal to a cell. Here, short-circuit power is the major contributor and often
used synonymously. This power is consumed during a momentary short caused
by a non-perfect signal transition, that is, nonzero rise or fall time. In a static
CMOS cell, both P and N transistors are conducting simultaneously for a short
time, providing a direct path between V
dd
and ground.
Internal power consumption of a cell is said to be both state and path de-
pendent. Clearly, a complex cell can have several levels of logic and hence
— 113 — Lund Institute of Technology
Chapter 7. Optimization strategies for power
transitions on different input pins will consume a different amount of internal
power, depending on the number of logic levels that are affected by the tran-
sition, as exemplified in Figure 7.1. Input A and D can each cause an output
transition at Z. However, D only affects one level of logic, whereas A affects all
three, thus consuming more internal power.
Z
A
B
D
C
Figure 7.1: Cell with different logic depths.
A typical example for a cell with state dependent internal power is a RAM
cell. It consumes a different amount of internal power, depending on whether it
is in read or write mode.
7.1.2 Static power consumption
In digital CMOS technology, static power consumption is caused by leakage
through the transistors. In [8], six different types of leakage are described.
The current I
1
in Figure 7.2 comes from the pn-junction reverse-bias, I
2
is a
combination of subthreshold current and channel punchthrough current, I
3
is
gain-induced drain leakage and finally, I
4
is a combination of oxide tunnelling
current and gate current due to hot-carrier injection.
In CMOS circuits currently used today, the major sources of leakage are the
subthreshold current (part of I
2
) and gate leakage due to oxide tunnelling (part
of I
4
). In future CMOS technologies, the gain-induced drain leakage (I
3
), and
the pn-junction leakage (I
1
) may also become important factors.
For submicron CMOS technology, the supply voltage is decreased to reduce
electrical field strengths and power dissipation [9]. Since the ratio of supply volt-
age to threshold voltage affects the circuit delay, the threshold voltage is also
decreased to sustain an improvement in gate delay for each new technology gen-
eration [8]. Leakage power due to subthreshold current increases exponentially
with threshold voltage scaling [8], which makes leakage increasingly trouble-
some. Traditionally, leakage has been a rather small contributor to the overall
power consumption in digital circuits. However, the increase in static leakage
power cannot be ignored in CMOS circuits of today and will be an even greater
issue in the future.
7.2 Optimization
Where and when to apply power optimization strategies is simply answered,
as early as possible and at the highest possible abstraction level. Figure 7.3
shows that the attainable power savings are the greatest at system level and still
significant downto register transfer level (RTL)—before the design is committed
to a specific technology. However, the highest accuracy on a power estimate is at
gate and transistor level. Despite possible savings at higher abstraction levels,
ElectroScience — 114 —
7.2. Optimization
I
1
I
2
I
3
I
4
Gate Oxide
Gate
Drain
Source
Figure 7.2: Leakage currents in a MOS transistor.
most IC’s are still developed at RTL due to the maturity of design, synthesis,
and verification flows that come along with it.
Figure 7.3: Power savings and accuracy attainable at different abstraction levels
[10].
RTL power estimation is usually fast and gives early answers to the ques-
tions:
• Which architecture consumes least power?
• Which module in a design is consuming the most power?
• Where is power being consumed within a given block?
Doing such an estimation does not introduce much overhead to your design flow
and should be carried out if power is a serious objective in your design, which is
usually true. Based on the obtained results, design modifications can be carried
out
1
. Following are some of the options that can be thought of.
7.2.1 Architecural issues—Pipelining and parallelization
The dynamic power consumption P
d
for a CMOS design depends on the oper-
ating speed as
P
d
∝ C
L
f V
dd
V
swing
≈ C
L
f V
2
dd
, (7.1)
where C
L
is the total capacitive load on the chip that is switched, V
dd
is the
operating voltage, V
swing
is the voltage swing, and f is the clock frequency. For
simplicity, it is assumed that the voltage swing is equal to V
dd
.
1
From our experience, RTL power estimation in Synopsys involves basically the same steps
as the more accurate gate-level estimation and will hence be neglected in favour of the latter.
— 115 — Lund Institute of Technology
Chapter 7. Optimization strategies for power
Let f
req
be the frequency that is required at the output of a design to achieve
a desired throughput. Then, the architecture of the design will affect both f
req
and thus the dynamic power consumption, see Figure 7.4 [11].
FF
FF
f = f
req
(a)
FF
FF
FF
f = f
req
(b)
FF FF
FF FF
f = f
req
/2
(c)
FF
FF
f = 2 f
req
MUX
(d)
Figure 7.4: HW mapped (a), pipelined (b), parallel (c), and time shared (d)
implementation.
The hardware mapped and pipelined designs both produce one sample per
clock cycle and hence have f equal to the required frequency f
req
. Note that
the pipelined design has a shorter critical path and can therefore reach a higher
maximum clock frequency. The parallel design produces two samples per clock
cycle and f can be reduced to f
req
/2, while the time shared design has to double
f to reach f
req
.
Comparing the hardware mapped and the parallel implementation, it seems
as if there is nothing to gain from a power perspective by decreasing f since
C
L
increases accordingly. However, with a lower clock frequency, V
dd
can be
decreased since
f =
1
t
(7.2)
and
t
pd

C
crit
V
dd
K(V
dd
−V
t
)
α
, (7.3)
where t
pd
is the propagation delay, K and α are constants, C
crit
is the capacitive
load in the critical path, and V
t
is the threshold voltage. K, α, and V
t
depend
on the silicon process. Note that α accounts for velocity saturation in processes
characterized by short channel devices. Consequently, a circuit designed for high
ElectroScience — 116 —
7.2. Optimization
speed can, when operated at low speed, reduce the power consumption below
that of a circuit only designed to operate at low speed.
As an example, consider the designs from Figure 7.4(a) and Figure 7.4(c).
The latter produces two samples each clock period and thus the frequency can
be halved while the original throughput is maintained. With (7.1) and (7.3)
and, for simplicity and rather traditional, V
t
is assumed to be much smaller
than V
dd
and α = 2, the dynamic power consumption is calculated as
f
par
=
f
org
2
. (7.4)
Hence, with (7.2) and (7.4)
t
par
= 2 t
org
. (7.5)
Let the capactive load of a multiplier, an adder, a register, and a multiplexer
be 24c, 3c, 1c, and 1c, respectively. Then, the total load of the original design
is C
L,org
= 56c and for the parallel design C
L,par
= 112c. The load along the
respective critical path is then C
crit,org
= 55c and C
crit,par
= 56c.
t
par

C
crit,par
V
dd
K(V
dd
−V
t
)
2

C
crit,par
KV
par
=
56c
KV
par
(7.6)
t
org

C
crit,org
V
dd
K(V
dd
−V
t
)
2

C
crit,org
KV
org
=
55c
KV
org
(7.7)
Substituting (7.6) and (7.7) into (7.5) yields
56c
KV
par
=
2 55c
KV
org
⇒ V
par
=
V
org
1.96
, (7.8)
and finally, using (7.1),
P
par
P
org
=
C
L,par
f
par
V
2
par
C
L,org
f
org
V
2
org
=
112c
f
org
2
(
V
org
1.96
)
2
56cf
org
V
2
org
≈ 0.26. (7.9)
The parallel design has the same throughput at approximately a quarter of
the original design’s power consumption and can achieve double throughput at
double power consumption. Note that the approximation V
t
< V
dd
becomes
too rough and unreliable as silicon processes shrink since V
t
does not scale with
V
dd
. Hence, the power savings will not be as large as in the presented example
for processes below 0.18µm. Also, α approaches numbers smaller than 2, for
example, between 1.3 and 1.5 for a 0.25µm process [12].
The drawback with the parallel design is that the amount of hardware is dou-
bled. In addition, if the design should be able to switch between high through-
put/power and low throughput/power mode during run time, an active power
controller will be required. To design a device with active power control will
increase the number of constraints on the design and add additional verification
time. A cell library that is characterized for multiple voltages is needed as well
as hardware to generate the voltage levels and logic to control it. However, it
is possible and an example of a commercial processor with active power control
is found in [13].
— 117 — Lund Institute of Technology
Chapter 7. Optimization strategies for power
To insert additional pipeline stages in the original design has the same effect
as parallelizing, either the throughput is increased or the power consumption is
reduced. However, there are some differences worth noticing. First and most
important, the amount of additional hardware is normally much less than in
the parallel case since only some extra registers are needed. Secondly, whereas
parallelizing is only restricted by area, pipelining has an upper limit to the
number of pipeline stages that can be inserted. At some point there will be no
more combinational logic that can be separated by pipeline registers. Finally, for
pipelining to be really efficient the delay through each stage should be balanced
since frequency and voltage are proportional to the critical path. In addition,
pipelining increases latency, that is, the number of clock cycles from which data
is present at the input until the corresponding result is presented at the output.
7.2.2 Clock gating
When a design increases in size and functionality there is a great chance that
parts of the hardware are only used at certain times or in certain operation
modes. These blocks will consume switching power and contribute with unnec-
essary load to the clock tree, even though the result is unused. One simple way
to deal with this problem is to gate the clocks into these blocks, that is, turn
off the clocks. In addition to reducing the clock load it will prevent unneces-
sary switching activity since almost all hardware modules have registers at their
input.
gclk
clk
en
(a)
clk
en
gclk
(b)
Figure 7.5: A simple clock gate (a) and its timing diagram (b).
The simplest clock gate is implemented as an AND-gate, as shown in Fig-
ure 7.5(a). The timing diagram is shown in Figure 7.5(b). When the enable
signal en is low the gated clock is turned off. This design is simple but sensitive
to glitches, which must be avoided under all circumstances.
A safer, but larger, implementation is shown in Figure 7.6(a). Here, en is
latched in order to avoid transitions while the clock signal is high and glitches
on the output. To turn off the gated clock for one clock cycle, en is lowered at
ElectroScience — 118 —
7.2. Optimization
D
enl
G
Q
gclk
clk
en
(a)
clk
en
gclk
enl
(b)
Figure 7.6: A latched clock gate (a) and its timing diagram (b).
any time in the preceding clock cycle and then raised again in the next clock
cycle, as shown in Figure 7.6(b). This design is safer since the latch is only
transparent when the clock is zero keeping the output of the AND-gate at zero.
7.2.3 Operand isolation
Operand isolation, also called guarded evaluation [14], prevents unnecessary op-
erations to be performed. However, instead of halting the clock signal, datapath
signals are directly disabled and hence no new results are evaluated.
Traditionally, datapath operators are always active. Due to changing inputs,
they dissipate power even when the output of the operators is not used. In
a particular clock cycle, the datapath operator output is not used if it is an
unselected input to a multiplexer or if the datapath operator is an input to a
register that is currently disabled.
With the operand isolation approach, additional logic (AND or OR gates)
called isolation logic is inserted along with an activation signal to hold the inputs
of the data-path operators stable whenever their output is not used. Isolation
candidates can be identified in the HDL source by using a pragma annotation
or at the GTECH level by using commands in the synthesis script.
do_operand_isolation = true
read -f vhdl /simple.vhd
set_operand_isolation_style -logic or
set_operand_isolation_cell DW_MULT
set_operand_isolation_slack 5
/* define constraints */
— 119 — Lund Institute of Technology
Chapter 7. Optimization strategies for power
compile
In Synopsys, operand isolation automatically builds isolation logic for the
identified operators. A rollback mechanism is also provided to remove the iso-
lation logic if you determine that the delay penalty is not acceptable.
D0
D1
S0
S1
Figure 7.7: Datapath operator and selection circuitry.
Consider the example from Figure 7.7. Here, the result from the datapath
operator is not used in further steps if S0=0 or if S1=1. By creating an activation
signal (AS) based on this observation, the inputs to the operator are kept stable,
see Figure 7.8.
D1
D0
AS
S0 S1
Figure 7.8: Operands isolated from the datapath operator.
For inputs that mostly stay at logic 1, use OR gates to create the operand
isolation since this holds the inputs to the operator at logic 1 during inactive
mode, minimizing switching activity at the transition from active to inactive
mode. Similarily, AND gates are used for inputs that mostly stay at logic 0.
Ideal candidates for operand isolation are arithmetic modules such as arith-
metic logical units (ALUs), adders, and multipliers that frequently perform
redundant computations.
7.2.4 Leakage power optimization
While the active power consumption varies with the workload and can be locally
eliminated using for instance clock gating, power consumption due to leakage
remains, as long as the power supply voltage is switched on. For burst mode
applications, such as mobile phones, a large part of the system is in idle mode
most of the time. This makes the energy consumed by leakage a large contributor
to the overall energy consumption and, thus, making simple saving schemes such
as clock gating less attractive.
One efficient and often-used method to combat leakage is to use dual or
variable threshold technologies [15, 16]. In a dual threshold technology, there
ElectroScience — 120 —
7.2. Optimization
are typically two versions of each digital gate, one with high threshold voltage
transistors for low power and one with low threshold voltage transistors for low
delay. Using low threshold devices only in timing critical paths can radically
decrease leakage currents both in standby and operational mode.
Due to transistor stacking in CMOS gates, the input pattern has great im-
pact on the leakage power. The leakage current may differ several orders of
magnitude for a simple CMOS gate depending on the input pattern [17]. Re-
duced leakage power is therefore achievable during idle mode depending on the
selected patterns for internal nodes.
In [18], a comparison is made between four different leakage reduction tech-
niques involving supply voltage scaling, increased channel length, stacking tran-
sistors, and reverse body bias. Increased channel length is a static design method
while stacking transistors, supply voltage scaling and reverse body bias are suit-
able as adaptive or reconfigurable power management methods. Results in [18]
indicate that reverse body bias is an efficient method to reduce leakage.
Supply voltage scaling directly affects leakage, making it a feasible technique
for power reductions. In [8], power consumption due to leakage is described as
I
leak
V
dd
, and in [19], it is shown that the leakage current decreases when the
supply voltage is decreased.
Stacking transistors increases the resistance between supply and ground, and
thereby reduces both leakage current and maximum transistor current [8]. In
Figure 7.9(a), the potential V
m
is higher than 0V due to the leakage current. The
increased potential at V
m
makes the voltage between gate and source negative
for transistor T2, which reduces the leakage current. The voltages between bulk
and source, and between drain and source also change due to the stacking effect,
which also results in less leakage current [8].
Reverse body bias as well as increased channel length can be utilized to
increase the threshold voltage and thereby reduce leakage [8]. Reverse body bias
involves increasing the bulk potential for the PMOS transistors and decreasing
the bulk potential for the NMOS transistors, see Figure 7.9(b). A technique
using reverse body bias for standby power management is described in [20].
7.2.5 Beyond clock gating
With aggressive gate length scaling follows an excessive gate leakage current
that is due to very thin oxy-nitride gate dielectrics. Therefore, the static power
dissipation will rapidly increase with the development of the technology and
methods for power saving, such as clock gating, will cease to work.
A simple way to save power is to shut down a function block when it is not
used. This method is applicable if it can be predicted when a block is going to
be used again. This stems from the fact that the time to power up a block is in
the order of a few microseconds. This rather long time duration is due to the
risk of interference with other parts on a chip. A way to reduce the power-up
time and still save power is to just lower the voltage in a block when it is not
used. A disadvantage with these methods is that the block loses its information.
To retain this information some memory stages can be inserted in the block that
stores the state the block was in before it was shut down.
For handheld applications a low power dissipation is very important. There-
fore, special low power cell libraries has been developed for new technologies.
The difference with these cell libraries is that the oxy-nitride dielectrics are as
— 121 — Lund Institute of Technology
Chapter 7. Optimization strategies for power
0V
0V
V
dd
V
m
T1
T2
(a)
V
dd
GND −V
2
V
dd
+V
1
(b)
Figure 7.9: The stacking effect (a). Reverse body bias (b).
thick as in the previous technology. As an example, the 90 nm and the 130 nm
technology use oxy-nitride dielectrics with same thickness. The drawback with
this is that the performance with respect to speed will also be similar to the
previous technology.
Using the methods above to decrease the power dissipation of the block has
an increase of 5% to 20% in chip area of a block.
One can also predict that technologies have to be developed with high-κ gate
dielectrics to meet stringent gate leakage and performance requirements. Also,
the downscaling of technology causes the drain current in a transistor (Gain
Induced Drain Leakage) to increase to be in the same order of the gate current.
More information about future technology predictions, see [21].
7.3 Power estimation with Synopsys Power Com-
piler
This section tries to give some practical examples of how to carry out power
estimation using Synopsys PowerCompiler. First, a format that models dif-
ferent properties of a cell’s and a design’s behaviour is presented. Information
provided by a cell library should be briefly verified by inspecting the respective
files in order to be able to manually calculate the power consumption of a cell.
ElectroScience — 122 —
7.3. Power estimation with Synopsys Power Compiler
7.3.1 Switching Activity Interchange Format (SAIF)
In order to allow for power estimation and optimization there has to be input
in form of toggle rate and static probability data based on switching activity,
typically available through simulation. Here, the toggle rate is a measure of how
often a net, pin, or port went through either a 0 → 1 or a 1 → 0 transition. The
static probability is the probability that a signal is at a certain logic state. With
SAIF [22] data, DC can perform a more precise power estimation, for example,
modelling state and path dependent power consumption of cells. Generally,
there are two types of SAIF files: forward annotation and back annotation. The
former is provided as input to the simulator, the latter is generated by the
simulator as input to the following power estimation and optimization tool.
For RTL simulation, the forward annotation SAIF file contains directives
that determine which design elements are to be traced during simulation. These
directives are generated by the synthesis tool from the technology-independent
elaborated design. Furthermore, it lists synthesis invariant points (nets, ports)
and provides a mapping from RTL identifiers to synthesized gate level identi-
fiers (variables, signals). Following is the command that can be used from the
dc shell prompt.
dc_shell> rtl2saif -output rtl_fwd.saif -design <design_name>
For gate level simulation, the forward annotation SAIF file contains information
from the technology library about cells with state and/or path dependent power
models. The extra information provided results in more accurate estimates.
dc_shell> read -format db <library_name>.db
dc_shell> lib2saif -output <library_name>.saif <library_name>.db
Following is an example of a forward annotation SAIF file for gate level simula-
tion that takes state and path dependent information into account. Considered
is a 2-input XOR cell from the UMC 0.13µm process.
(MODULE "HDEXOR2DL"
(PORT
(Z
(COND !A2 RISE_FALL (IOPATH A1 )
COND !A1 RISE_FALL (IOPATH A2 )
COND A2 RISE_FALL (IOPATH A1 )
COND A1 RISE_FALL (IOPATH A2 ) )
)
)
(LEAKAGE
(COND (!A2 * !A1)
COND (A2 * !A1)
COND (!A2 * A1)
COND (A2 * A1)
COND_DEFAULT)
)
)
When a transition at output Z occurs, the simulator should determine which
path caused the transition and evaluate the associated state condition at the
input transition time. For example, if a transition at Z is caused by A1 and
the value of A2 was “0” at the time, the transition belonged to COND !A2
RISE FALL (IOPATH A1). In this example, all four possible outcomes can
— 123 — Lund Institute of Technology
Chapter 7. Optimization strategies for power
be assigned different power values in the back annotation SAIF file. Also, the
leakage power section is divided into four outcomes for valid cell states and a
default outcome for cell states not covered by the aforementioned ones.
The back annotation SAIF file from the simulator includes the captured
switching activity based on the information and directives in the appropriate
forward annotation file. Switching activity can be divided into two categories,
non-glitching and glitching. The former is the basis of information about toggle
rate and static probability. This information usually yields reasonable analysis
and optimization results. To model glitching behaviour, a full-timing simulation
is required. Transport glitches (TG), for example, are extra transitions at the
output of a gate before the output signal settles to a steady state. They consume
the same amount of power as a normal toggle transition does and are an ideal
candidate for power optimization. Unbalanced logic often give rise to these kind
of glitches.
The second category of glitches is called inertial glitches (IG). Unlike TG,
their pulsewidth is smaller than a gate delay and would be filtered out if an
inertial delay algorithm were applied in the simulator. However, the number of
transitions is not enough to accurately estimate their power dissipation. There-
fore, the simulation tool provides a de-rating factor (IK) to scale the IG count
to an effective count of normal toggle transitions.
In a back annotated SAIF file there are timing and toggle attributes. The
former group includes:
• T0: Total time design object is in “0” state
• T1: Total time design object is in “1” state
• TX: Total time design object is in unkown “X” state
• TZ: Total time design object is in floating “Z” state
• TB: Total time design object is in bus contention (multiple drivers simul-
taneously driving “0” or “1”)
The toggle attributes can be summarized as:
• TC: Number of “0” to “1” and “1” to “0” transitions
• TG: Number of transport glitches
• IG: Number of inertial glitches
• IK: Inertial glitch de-rating factor
Following is an example of how state dependent timing attributes appear in the
back annotated file and how they are interpreted.
( COND (A*B*Y) (T1 1) (T0 8)
COND (!A*B*Y) (T1 1) (T0 8)
COND (A*!(B*C)) (T1 2) (T0 7)
COND B (T1 1) (T0 8)
COND C (T1 1) (T0 8)
COND_DEFAULT (T1 3) (T0 6))
ElectroScience — 124 —
7.3. Power estimation with Synopsys Power Compiler
D
E
F B
!
A
*
B
*
Y
A
*
B
*
Y
D
E
F
D
E
F C
A
*
!
(
B
*
C
)
A
B
C
Y
Figure 7.10: Simulated waveforms result in back-annotated SAIF data as listed
before.
The respective simulated waveforms are depicted in Figure 7.10. There are
9 cycles in total and for every condition defined, the number of cycles for which
this condition is fulfilled are counted.
Extending this, we list the back annotated SAIF-file that results from a gate
level simulation of a simple XOR gate in xor back.saif. Each net is modeled
with its timing and toggle attributes. From the simulation time t
sim
, the toggle
rates TR and static probabilities Pr¦I¦, are obtained according to
TR = TC/t
sim
(7.10)
and
Pr¦I¦ = T1/t
sim
. (7.11)
This is the basic information needed to calculate switching and leakage power.
7.3.2 Information provided by the cell library
As an example, the standard cell library to UMC’s 0.13µm process is considered.
The information is bundled in [23]. Included are general definitions of base units,
operating conditions, and wire load models. Each cell’s behaviour is described
in conjunction with timing and power model templates. These templates use
weighted input transition times and output net capacitances as indeces into the
respective look-up tables that specify the energy per transition. Figure 7.11
shows an example of how the internal energy per transition is obtained using a
look-up table.
Considering a simple cell, this look-up is done for every input pin. The path
dependent power model distinguishes which state the pin is in and whether a
rising or falling transition occured. Furthermore, as indicated in the figure, an
extrapolation takes place if the exact index into the table is not available.
7.3.3 Calculating the power consumption
As a simple example, we consider a 2-input XOR cell. The information needed
to calculate the power consumption is in a file called HDEXOR2DL.rpt. Included
is the necessary wire load model and the cell description. In order to calculate
— 125 — Lund Institute of Technology
Chapter 7. Optimization strategies for power
load cap.
Output
Energy/transition
Weighted average
input transition
time
z
y
x
0.20
0.56
0.82
0.67
10.2 30.8
65.9 42.4
69.1
Figure 7.11: Look-up table based internal energy calculation.
toggle rates and static probabilities of the nets, use the back annotated SAIF
file listed earlier in this section.
The total leakage power of a cell is determined by a superposition of the
probabilities of being in a certain state times the leakage power for this state.
Here you have to make use of the state dependent leakage power information
that is provided in the cell library.
P
leak
=
N

i=1
Pr¦I = i¦ P
leak,i
, (7.12)
where N = 2
m
is the number of combinations of m binary inputs and I =
[1 . . . 2
m
].
The switching power of a cell can be expressed as
P
switch
= 1/2
TR
t.u.
(C
wire
+C
pin
) V
2
dd
. (7.13)
Here, one has to take the load capacitance into account which is usually com-
prised of wire load and pin load. The factor 1/2 results from the fact that TR
accounts for both 0 → 1 and 1 → 0 transitions, but only half of them affect the
power consumption. Also, the number of transitions is counted on a per second
basis and hence one has to normalize TR to the default time unit (t.u.) of the
cell library, usually 1ns.
Based on these facts try to calculate both switching and leakage power of
the cell. The internal power is a lot more subtle to calculate since the tool’s
extrapolation algorithm is hard to track. For verification of your results, the
power estimation report is found in xor.rpt. Surely, one gets a feeling for the
amount of information that has to be processed to get an estimate of a design’s
power consumption.
7.4 Power-aware design flow at the gate level
After having introduced the basic concepts of power estimation, it is time to
conclude with a complete design flow. Figure 7.12 shows an example of a design
ElectroScience — 126 —
7.4. Power-aware design flow at the gate level
flow that estimates and optimizes power consumption of a digital design at the
gate level.
Analyze, elaborate
clock gating
RTL, GTECH
Compile
area, timing constr.
Gate-level sim.
capture toggle info
SDPD info
netlist (.v), delay (.sdf)
BW-SAIF FW-SAIF
Back-
annotate
Report
power
Compile -incr
power constr.
Gate-level sim.
capture toggle info
SDPD info
netlist (.v), delay (.sdf)
BW-SAIF FW-SAIF
Back-
annotate
Report
power
Figure 7.12: Gate-level power estimation and optimization flow.
After RTL clock gating is completed, timing and area constraints are set
to the design. In most cases clock gates have to be inserted manually in the
design together with control logic but in some cases there are tools that can
perform automatic low level clock gating. These tools automatically find flip-
flops that have an enable control and exchange these to a clock gate and standard
flip-flops. Figure 7.13(a) shows a flip-flop with enable input. These flip-flops
are implemented with a feedback from output to input and a multiplexer that
selects if the previous output is to be kept or a new input is to be accepted,
see Figure 7.13(b). Figure 7.14 shows the result of automatic clock gating on a
bank of enable flip-flops. Unlike manual clock gating, no extra control signals
are needed since all control logic is already there.
Compiling the elaborated RTL or GTECH description results in a gate level
netlist on which you can annotate switching activity from a back annotation file
(BW-SAIF). This file is created by a gate-level simulator, preferably nc-verilog
or vcs since these simulators make use of state and path dependent information
provided by the cell library (FW-SAIF). Note that MS can not make use of
this information and hence the power estimates based on the toggle files from
this simulator will be inferior. Now, the switching activity is annotated and
power constraints are set to the design. An incremental compile will then try to
optimize the netlist based on the premises it was provided with. If the design
was fully annotated, Synopsys claims that the power estimates lie within 10-25%
of SPICE accuracy.
— 127 — Lund Institute of Technology
Chapter 7. Optimization strategies for power
clk
D Q
out
en
in
(a)
clk
en
D Q
out
in
(b)
Figure 7.13: An enable flip-flop (a) and its implementation in Synopsys (b).
D Q
out in
gclk
en
clk
clock
gate
Figure 7.14: The result of automatic clock gating on a bank of enable flip-flops.
Optimization takes place by evaluating optimization moves and their impact
on the overall cost function. There is a predefined optimization priority that
can be changed. By default, design rule constraints set by the technology li-
brary, for example, max fanout, max transition, and max capacitance, are the
most important ones. Then, timing is the next critical issue. Following are in
descending order dynamic power, leakage power, and area constraints.
Generally, a positive timing slack will leave more possibilities for optimiza-
tion whereas already tight timing constraints might not result in any power
improvement at all. Also, diverse cell libraries make life easier for the optimiza-
tion tool. Furthermore, designs dominated by sequential logic often achieve less
power reduction since different sequential cells tend to have less variation in
power than combinational cells.
7.5 Scripts
This section provides a short description on scripts for DC and Cadence nc-
verilog that deal with power estimation and optimization. These scripts can
be downloaded from [?]. A general knowledge about UNIX and DC is assumed.
The scripts are written for the UMC 0.13 µm technology. No responsibility is
taken for the consequences of using these scripts, although great care has been
taken to debug the scripts. All feedback is welcome.
7.5.1 Design flow
Figure 7.15 shows the design flow for the power optimization scripts, the names
of each script and environments they are executed in. Inputs are header.scr
ElectroScience — 128 —
7.5. Scripts
read hdl.scr
timing area.scr
Design Compiler
vhdl description header.scr
netlist (.v)
delay (.sdf)
nc-verilog
compile.ncverilog
switching activity (.saif)
Design Compiler
power opt.scr
nc-verilog
power optimized
netlist (.v)
Design Compiler
final power.scr
new switching activity (.saif)
umc013.saif
first
power report
final
power report
netlist (.vhd)
delay (.sdf)
compile pwr.ncverilog
Figure 7.15: Design flow for power optimization with DC and nc-verilog.
and VHDL code. All information about the design and the target values for the
synthesis is listed in header.scr.
The first step in the optimization takes place in DC. Here the design is be
read, elaborated, and synthesized. Timing and area constraints are set in this
stage and DC will look for places to insert clock gates. The result is an initial
VERILOG netlist together with a standard delay format (SDF) file that will be
used in nc-verilog to estimate switching activity in each node of the design.
This is carried out by running
ncverilog -f compile.ncverilog
Then, the switching activity report is used to estimate the power consumption
in DC (power opt.scr). All important results are collected in a report file.
With this results as a guideline, power constraints on both dynamic and leakage
power can be set in the header.scr. An incremental synthesis is performed
and the result, a power optimized netlist, is used in nc-verilog to collect the
new switching activity by running
— 129 — Lund Institute of Technology
Chapter 7. Optimization strategies for power
ncverilog -f compile_pwr.ncverilog
The remaining step is to estimate the final power consumption and collect
all important information in a report file. This is performed in DC with
final pwr.scr. The final output are a VHDL netlist, an SDF file, and a report
file with the final power consumption estimate. To create an SDF file for VHDL
the write.sdf script has to be executed in PrimeTime due to some naming
conventions problems between the UMC libraries and the DC SDF file. Then,
this netlist and the SDF file can be used to perform a post synthesis simulation
for verification purposes in MS.
7.5.2 Patches
The aim was to generate a fully automatic design flow that could be executed
with one script and where all parameters are set in one constraint file. However,
since more than one program is involved there are some parts of the scripts that
have to be patched manually, which files and what to change are listed below.
In addition to run an automated flow a file hierarchy as shown in Figure 7.16
must be provided. The complete structure with all script files and an example
can be downloaded from [?].
PROJECT
sim
source setup.ncverilog
compile.ncverilog
compile pwr.ncverilog
syn vhdl
vhdl files source setup.syn
all.scr
read hdl.scr
timing area.scr
power opt.scr
final pwr.scr
report.scr
write sdf.scr
header.scr
stimuli
Figure 7.16: The required file hierarchy for an automated design flow.
To initialize the project, create the hierarchy shown in Figure 7.16. Then
source setup.syn in the syn folder and source setup.ncverilog in the sim
folder. Then, open the header.scr and fill in search paths, constraints, and
design names. Finally, the parts that have to be manually patched are shown
in Table 7.1.
If the above procedure is performed correctly, the all.scr script executed
in the syn folder will perform all steps in the design flow and the result can be
read in the report files. However, it is strongly recommended that the scripts
are executed one at a time and the observed results are correct, at least once
before the all.scr script is executed.
ElectroScience — 130 —
7.6. Commands in DC
Table 7.1: Manual patches.
File Parameter Values and example
write.sdf design data base db/ALU W L8 PWR.db
write.sdf result file name netlist/ALU W L8 PWR.sdf
verilog tb file name must be tb ”MODULE”.v, e.g., tb ALU.v
verilog tb Design under test must be labeled as dut
verilog tb inputs and parameters see the example, tb ALU.v
compile.ncverilog tb name tb ”MODULE”.v, e.g., tb ALU.v
compile.ncverilog netlist name ../syn/netlists/ALU W L8.v
compile pwr.ncverilog tb name tb ”MODULE”.v, e.g., tb ALU.v
compile pwr.ncverilog netlist name ../syn/netlists/ALU W L8 PWR.v
7.6 Commands in DC
Some commands that are frequently used in DC and nc-verilog for power
optimization are briefly explained in this section. It is assumed that the reader
already has basic knowledge about common commands in DC. For a complete
explanation use the man command in dc shell, for example,
dc_shell> man <command_name>
To get a glimpse of how many commands there are, feel free to type:
dc_shell> list -commands
For a complete explanation of the commands in nc-verilog, type sold and look
in the Power Management/Power Compiler reference manual/Using Verilog.
7.6.1 Some useful DC commands
This section is an excerpt of the commands that appear in the power estimation
scripts.
Clock gating
set clock gating style -sequential cell latch -minimum bitwidth 3
Description: Tells DC to use latch based clock gates if it finds enable registers
that are at least -minimum bitwidth bits wide when elaborating.
Warning: Use clock gates with latch to avoid glitch sensitivity. Might cause
timing errors in DC, see also set false path.
elaborate ALU -lib WORK -update -gate clock
Description: Elaborates the design and looks for places to insert clock gates.
set false path -to find(-flat, -hierarchy, pin "latch/D")
Description: Tells DC to NOT perform timing checks on the enable signal in
clock gates. DC might otherwise insert one clock cycle of delays on the enable
signal.
Warning: Will only function correct if the latches in the clock gates are the
only latches in the design (this is normally the case).
— 131 — Lund Institute of Technology
Chapter 7. Optimization strategies for power
report clock gating -hier -gating elements
Description: Reports all inserted clock gates.
Power estimation and optimization
link
Description: Links all used libraries to the current design.
power preserve rtl hier names=true
Description: Preserves RTL-hierarchy in RTL design.
Warning: Must be set to true if rtl2saif is to be used.
rtl2saif
Description: Used to create a forward annotated SAIF file. This file is then
used in a simulation tool to capture switching activity.
Warning: Only necessary if power estimation is performed on RTL. The for-
ward SAIF file for gate-level simulation is provided by the technology library
(lib2saif).
set max dynamic power 8 mW
set max leakage power 20 uW
Description: Sets target values for power optimization.
compile -incremental
Description: Will only try to solve issues due to new constraints set on an
already compiled design. Used to recompile the design after power constraints
are specified.
Warning: Must be used if a second compile is performed on the same design.
report power -analysis medium -hier -hier level 3 >> REPORT FILE
Description: Reports power consumption in the design and lists it for each
module down to the third hierarchy level.
reset switching activity -all
read saif -input "backward.saif" -instance "tb ALU/dut" -unit base ns
Description: Removes previously annotated switching activity and reads new
switching information from the backward.saif file into the design. This in-
formation is used to estimate power consumption. See also the nc-verilog
command toggle report.
Miscellaneous commands
compile -boundary optimization
Description: Will optimize across all hierarchy levels.
report area > REPORT FILE
report timing >> REPORT FILE
report constraint >> REPORT FILE
report design >> REPORT FILE
report reference >> REPORT FILE
report routability >> REPORT FILE
report port >> REPORT FILE
ElectroScience — 132 —
7.6. Commands in DC
report clock gating -hier -gating elements >> REPORT FILE
report saif -flat -missing >> REPORT FILE
report power -analysis medium -hier -hier level 3 >> REPORT FILE
Description: Reports a lot of useful information and store it in REPORT FILE.
change names -rules vhdl -hierarchy
write -format vhdl -hierarchy -output "netlist/ALU syntesized.vhd"
Description: Change names to match with current db and save the netlist as
in vhdl-format. change names is used since DC sometimes change names on
nets and ports.
Warning: Will not fix all names for vhdl-format. Therefore, do not be sur-
prised about all the warnings.
write sdf -version 1.0 "ALU.sdf"
Description: Writes information about timing and delay between all pins in
the design. This file is used to perform a more accurate simulation of the netlist.
set register type -exact -flip flop HDSDFPQ1
Description: Replaces all registers with the specific register HDSFPQ1.
design list = find (design, "*", -hierarchy)
design list = design list + current design
foreach (de, design list) ¦
current design de
cell list = find (cell,"*plus*")
set implementation "DW01 add/cla" cell list
remove unconnected ports cell list
¦
Description: Finds all adders in the elaborated design and set them to be
implemented in carry-lookahead (CLA) style found in the DesignWare library
DW01.
7.6.2 nc-verilog
read lib saif("umce13h210t3 tc 120V 25C.saif");
Description: Will read the timing information specific to the technology li-
brary that is used.
sdf annotate("ALU.sdf", dut, ,"./sdf.log");
Description: Will read the timing information specific to the design dut that
is to be simulated.
set toggle region(tb ALU.dut);
Description: Specifies the top level in the design, in this case the dut module.
toggle start();
Description: Specifies at what time toggle count will start, for example,
#50*CYCLE_TIME $toggle_start()
states that toggle count will begin after 50 clock cycles (CYCLE TIME is a
constant specified in the testbench).
— 133 — Lund Institute of Technology
Chapter 7. Optimization strategies for power
toggle stop();
Description: Specifies at what time toggle count will stop, for example,
#1000*CYCLE_TIME $toggle_stop()
states that toggle count will begin after 1000 clock cycles (CYCLE TIME is a
constant specified in the testbench).
toggle report("/backward.saif",1.0e-9,"tb ALU");
Description: Writes all information about switching activity for each pin and
net in the design to the file backward.saif, the time scale is nano seconds. This
file is used in DC to perform a more accurate power consumption estimate.
7.7 The ALU example
The power optimization scripts described in the previous section includes an
example design, an ALU. Figure 7.17 shows the ALU schematics where A and B
are inputs, Q is output, and ALU CTRL are the control signals, see Chapter ??
for more details. What follows is a short description of the effects of running
the scripts on the ALU design.
ALU OP
S
X
A
M
B
Q
R
Mux
R reg
Q reg
Alu
Shift
Figure 7.17: An Arithmetic Logical Unit (ALU).
The two enable flip-flops will be identified as candidates for clock gating and
exchanged with a clock gate and basic flip-flops without enable input. The ALU
will then be synthesized according to timing constraints, that is, the required
clock frequency, and an estimate of the power consumption is performed based
on the inputs in the stimuli file. Then, power constraints are specified and an
incremental synthesis is performed. Finally, the new design is power estimated
and the result is presented in a report file, the intermediate results from synthesis
are also available in a different report file for comparison purposes.
ElectroScience — 134 —
Chapter 8
Formal Verification
With the growing complexity and reduced time-to-market the biggest bottleneck
in digital hardware design is design verification. An insufficiently verified ASIC
design may contain hardware bugs that can cause expensive project delays when
they are discovered during software or system tests on the real hardware. The
growing complexity of ASIC designs has caused the time of verification through
simulation to increasing very much. Of this reason there is a huge demand to
find verification methods that are faster without reducing the coverage of the
verification.
One method that has matured and established it self in real life design flows
is formal verification methods. The formal verification methods can be described
as a group of mathematical methods used to prove or disprove properties of a
design. It has been shown that for a design of 20 to 25 million gates the veri-
fication time was reduced to less than a tenth of the verification time using a
simulation tool. In addition to being much faster than simulation, formal verifi-
cation techniques are also inherently more robust. It is therefore not necessary
for the designer to identify all corner cases and test each one with a set of test
vectors. A formal verification tool will automatically find all cases where a de-
sign does not meet the specifications. This means that it will find the difficult
corner case even if the designer has missed it.
An example that will make the difference between verification using simula-
tion and formal verification visible is the verification of a 32 bit shifter. To fully
verify the shifter block, 256 billion test vectors are needed where as with formal
verification methods only a mathematical equivalence verification is needed.
8.1 Concepts within Formal Verification
In Figure 8.1 the most important concepts in formal verification, and how they
relate to another is described.
Requirements and needs are more or less precisely formulated requirements
and expectations of a system. By requirements formalization, one arrives at a
formal specification, which gives precis requirements. The formal specification
is validated to make sure that it correctly captures the requirements and needs.
The real system is the actual design. The system is modelled to obtain a
mathematical description of how it works. The modelled description of the real
— 135 — Lund Institute of Technology
Chapter 8. Formal Verification
Requirements & Needs
Formal Specification
System Model
Real System
Requirements Formalisation Validation
Verification Formal Construction
Modeling
Figure 8.1: The concepts in formal verification.
system is called the system model.
By formal verification one can prove or disprove that the system model
satisfies the requirements in the formal specification. Another possibility is to
use formal construction to create a system model which with certainty satisfies
the requirements.
8.1.1 Formal Specification
The formal specification gives an exact description of the system function. The
function of the formal specification is therefore to precisely describe the concepts
used when talking about the functionality of the system. It is also essential to
make certain that the formal specification correctly describes the functionality
of the system. This is done by validating the specification.
When a formal specification is generated by a tool it is important that the
formal specification correctly describes the behavior of the design. Since this is
an issue of the correctness of the tool used it is hard to have any control over it.
To increase the certainty, some designers uses two formal verification tools from
different vendors with the hope that the tools complement each other when it
comes to correctness.
8.1.2 The System Model
Since the formal specification is a mathematical description of what the sys-
tem should do, a mathematical description of what the system actually does is
needed. This description is called the system model.
Also when the system model is generated by a tool the correctness is an issue
and therefore the same procedure as for the tool generated formal specification
is used.
ElectroScience — 136 —
8.2. Areas of Use for Formal Verification Tools
8.1.3 Formal Verification
Formal verification means that using a mathematical proof it can be checked
with certainty whether the system (as described by the system model) satisfies
the requirements (as expressed by the formal specification).
Formal verification differs from testing in that it guarantees that the system
acts correctly in every case. When relying on testing there is no guarantee that
the system does not behave incorrectly in a case which has not been tested. A
precondition for meaningful verification is, of course, that the system model and
the formal specifications correctly capture the properties of the system and the
requirements, respectively.
A successful formal verification does not imply that testing is superfluous
since every step in the system development can not be treated formally. What
the formal verification will substantially reduced is the amount of testing, and
above all reduce the amount of debugging.
8.1.4 Formal System Construction
The feasibility of successful verification depends to a major extent on how the
system is constructed. The principle of designing for testability is most certainly
valid when formal methods are to be used. It is often difficult to verify an
existing hardware system already designed and constructed using traditional
methods.
Instead of developing the system with traditional methods and then verify-
ing it, the system can be developed using a formal construction process (often
referred to as synthesis of the system). In general, formal design and construc-
tion is therefore preferable to applying formal verification on top of a traditional
system development process.
8.2 Areas of Use for Formal Verification Tools
There are two main kinds of usage for formal verification tools namely, model
checking and equivalence checking, which is used in different stages in the design
flow.
8.2.1 Model Checking
When the formal verification tool is used as a model checker design specifications
are translated into a set of properties described in a formal language. The design
is then mathematically verified for consistency with these properties. This kind
of verification is often used to verify RTL code before synthesis, where the
designer does not have time to wait for lengthy simulation runs during each
design iteration.
8.2.2 Equivalence Checking
When the formal verification tool is used as an equivalence checker the tool
compares two design descriptions and verifies that they have the same behavior.
In this case the formal specification and the system model is automatically
generated by the tool. This kind of verification is often used to compare two
— 137 — Lund Institute of Technology
Chapter 8. Formal Verification
versions of RTL code to ensure that an optimization has not changed behavior
of the design, or it can compare two netlists to ensure that the insertion of a
scan chain or other test logic has not ruined the original behavior of the design.
8.2.3 Limitations in Formal Verification Tools
With formal verification methods only functionality checks of a design can be
done. To cover delay problems, timing problems and sequencing problems in a
designspecial verificatin tools has to be used.
In formal verification tools as in synthesize tools the same class of algorithms
is used. To reduce the risk of double error it is therefore important to use
independent tools. Often secured by using tools from different vendors.
In an asynchronous circuit designs formal verification methods are not usable
since formal verification methods can not handle non sequential circuit designs.
8.3 Other Methods to Decrease Simulation Time
8.3.1 Running the simulation in a cluster of several com-
puters
The simulation is divided up on several computers in a network. To be able to
do this the simulation tool used must have this possibility.
8.3.2 ASIC-emulators
The ASIC-emulators is mainly used to start software development earlier but
can also be used as a hardware simulation tool. A feature in ASIC-emulators
that makes hardware simulation easier is that the emulator can be compiled in
such a way so that nodes in the design is visible to an desirable extent.
8.3.3 FPGA-Prototypes
As for the ASIC-emulator the FPGA-prototypes is mainly used for software de-
velopment but can also be used for hardware simulation. The FPGA-prototypes
has some limitation when it comes to imitate designs on chip. The penalty of
these limitations for FPGA-prototypes is mainly that one has to adjust the de-
sign methodology to these limitations. Example of limitations is that FPGA’s
are developed for a certain type of applications which sets the architecture of the
FPGA. Therefore when implementing an ASIC-design on an FPGA it is fairly
often that the ASIC-architecture not agrees with the architecture of the FPGA.
Limitations in development tools for FPGA’s can also give some problems since
they are not so advanced as development tools for ASIC’s. Despite the limita-
tions with FPGA’s, it is one of the most preferred methods when simulating a
hardware design.
8.4 Future
Reliability of formal verification tool has increased very much since the tools
become commonly used for ASIC-verification. The reliability will properly in-
ElectroScience — 138 —
8.5. References
crease even more in the future and reduce the limitations.
A process in constant progress is to move hardware design to a higher ab-
straction level. The need of this comes from the desire to reduce time-to-market.
Formal methods has shown that they apply very well on the next level of ab-
straction that is the system level. Hence, it can be expected that in new tools
formal methods will be more common.
8.5 References
http://www.l4i.se/E_form.htm
http://www.edtnscandinavia.com/story/OEG20010403S0006
http://www.hardi.com/eda-tools/literature/Formell_verifiering_i_praktiken_1.pdf
http://www.hardi.se/eda-tools/why_formal_verification.htm
http://www.mentor.com/formalpro/css.html, click on Alcatel, Emmanuel Ligeon.
— 139 — Lund Institute of Technology
ElectroScience — 140 —
Chapter 9
Multiple Supply Voltages
Energy and power consumption are important factors in the design of integrated
hardware. To make the battery in a hand-held device, such as a PDA or a
cellular phone to last as long as possible, energy stored in the battery must
be used efficiently. Furthermore, limiting the on-chip power consumption is
important due to increased production costs for cooling arrangements.
For the past decades, the technology development has followed the predic-
tions by Gordon Moore in “Moore’s law”. The original version of “Moore’s
law” was an observation that the number of components on integrated circuits
doubles every 12 months [24]. Later, “Moore’s law” has been revised to a
factor two improvement every 24 months, and a prediction of doubled clock fre-
quency every 24 months has also been added [25,26]. If the increased computing
speed and packing density of modern technologies is to result in faster and more
complex chips, new methods must be developed to handle power consumption.
Power management has therefore gained much attention both in industry and
in universities.
Active power consumption due to capacitive switching is traditionally re-
garded as the main source of power consumption in digital design. However,
reaching deep submicron technologies also makes it necessary to take transistor
leakage into account, since it causes an increasing power consumption in each
technology generation [27]. Figure 9.1 shows the development in active and
leakage power [27]. If the trend shown in Figure 9.1 continues, the active power
consumption will soon be overtaken by the leakage power.
While the active power consumption varies with the workload, and can be
locally eliminated using for instance clock gating, power consumption due to
leakage remains, as long as the power supply voltage is switched on. For burst
mode applications, such as mobile phones, a large part of the system is in idle
mode most of the time. This makes the energy consumed by leakage a large
contributor to the overall energy consumption and, thus, making simple saving
schemes such as clock gating less attractive.
Reducing the power supply voltage is an efficient method to reduce all types
of power consumption (due to switching, short circuit, and leakage) [28]. How-
ever, reducing the power supply voltage for an entire chip might not be possible
due to performance constraints. Components of an ASIC or a microprocessor
have different requirements on power supply voltage due to different critical
paths. Therefore, running some components on a lower voltage can save power.
— 141 — Lund Institute of Technology
Chapter 9. Multiple Supply Voltages
1000
100
10
1
0.1
0.01
0.001
1960 1970 1980 1990 2000 2010
Active Power
Leakage
P
o
w
e
r
(
W
)
Source: G. E. Moore, Intel Corp [27]
Figure 9.1: Processor power, active and leakage.
Increasing the number of available supply voltages from one to two often
results in large power savings, while using more than two voltages only leads to
minor additional savings. In [29], and [30], power savings in the range of 25-
50 % are achieved for a number of design examples using dual supply voltages
compared to a single voltage approach. For the designs in [29], an increase in the
number of supply voltages from two to three results in less than 5 % additional
power savings.
Supply voltage scaling is associated with an overhead due to DC/DC-converters
for transforming the supply voltage down to lower voltage levels. The most fre-
quently used DC/DC-converter for power savings by lowering the supply voltage
is the Buck converter [28]. The Buck converter converts the main supply voltage
down to a lower voltage using large transistor switches, as shown in Figure 9.2a.
In the Buck converter, a control circuit is used for generating a square-wave
with variable duty factor that controls the switches. The switched supply is
filtered using a low-pass filter to minimize the ripple at V
out
. Besides the power
consumption in the control circuits for the Buck converter, there is always a
power loss in the switches due to resistance of the transistors. In [31] and [32],
there are examples of Buck converters reaching an efficiency as high as 90 %.
In a system with more than one supply voltage, signal level converters are
needed to ensure correct functionality when going from a low voltage region
to a higher voltage region. The signal level converter shown in Figure 9.2b is
used in [30, 33, 34]. Furthermore, if each output bit is connected to an inverter
supplied by a lower voltage, all buses and signals are run at the lower voltage.
This leads to further power savings [35].
ElectroScience — 142 —
9.1. Demonstration of Dual Supply Voltage in Silicon Ensemble
V
DD
V
ctrl
V
high
V
low
in
out
V
out
(a) (b)
Figure 9.2: Overhead components associated with multiple supply voltages.
Buck converter (a), and signal level converter (b).
9.1 Demonstration of Dual Supply Voltage in
Silicon Ensemble
This part describes a (fairly) simple method to Place & Route a design with
multiple supply voltages using the tool Silicon Ensemble. Advantages of this
method are that no other tool than Silicon Ensemble is necessary, and that
the designer has full control over the design procedure. The text is intended
for anyone having knowledge on how to use scripts in Silicon Ensemble. The
example is implemented using the UMC 0.13µm technology. A description of
the design used in this demonstration is found in [Matthias paper]. Scripts and
netlist files for the example are found in [?].
9.1.1 Editing Power Connections in the Netlist
The netlist contains information about which net to use for power and ground.
Changing an automatically generated netlist is generally avoided. However, as-
signments of power nets are easily manipulated in a netlist written in Design
Exchange Format (DEF).
It is likely that the design is initially described using a Verilog netlist (.v). To
simplify edits in the netlist, it is converted to DEF format. The hierarchy of the
design shall contain a top view where each block is intended to use a separate
supply voltage. In this example, the top view contains the two blocks “conven-
tional” and “improved”.
Start Silicon Ensemble as described in ...Joachims manual.... To import supply
pads use the DEF-file “supplyPads2Vdd.def”, which contains an extra supply
voltage pad with net name “VDP”. The contents of the file “supplyPads2Vdd.def”
is shown below. The added supply voltage pad is “pi VDP”, and it is connected
to the net “VDP”. The macro “Verilog2DEF.mac” can be executed in silicon
ensemble to import and convert the design “dummy top.v” to “dummy dop.def”.
— 143 — Lund Institute of Technology
Chapter 9. Multiple Supply Voltages
DESIGN demoDesign ;
UNITS DISTANCE MICRONS 1000 ;
COMPONENTS 9 ;
- pi_VDDC WVVDD ;
- pi_VDP WVVDD ;
- pi_VSSC WVVSS ;
- pi_VDDI WVV3IO ;
- pi_VSSI WVV0IO ;
- pi_CORNER_LT WCORNER ;
- pi_CORNER_RT WCORNER ;
- pi_CORNER_LB WCORNER ;
- pi_CORNER_RB WCORNER ;
END COMPONENTS
SPECIALNETS 5 ;
- V3IO ( pi_* V3IO ) + USE POWER ;
- V0IO ( pi_* V0IO ) + USE GROUND ;
- VDD ( pi_VDDC VDD ) + USE POWER ;
- VDP ( pi_VDP VDD ) + USE POWER ;
- VSS ( * VSS ) + USE GROUND ;
END SPECIALNETS
END DESIGN
Open the netlist “dummy dop.def” in a text editor and skip down to the line
“PINS 28 ;”. Change number of pins to 29 and add a supply voltage pin “VDP”
as shown below.
PINS 29 ;
- V0IO + NET V0IO + DIRECTION INPUT + USE GROUND ;
- VSS + NET VSS + DIRECTION INPUT + USE GROUND ;
- V3IO + NET V3IO + DIRECTION INPUT + USE POWER ;
- VDD + NET VDD + DIRECTION INPUT + USE POWER ;
- VDP + NET VDP + DIRECTION INPUT + USE POWER ;
Skip down to the line “SPECIALNETS 5 ;”. The text between “SPECIAL-
NETS” and “END SPECIALNETS” contains connections for power supply
wires. As default, both blocks use power supply “VDD”. This is changed to
“VDP” for the block named “conventional”.
Skip down to the first line that contains “conventional” and change the block
to use “VDP” as shown below.
( improved/l_trellis_instance/l_butterfly_instance_0/U70 VDD )
( improved/l_trellis_instance/l_butterfly_instance_0/U69 VDD )
ElectroScience — 144 —
9.1. Demonstration of Dual Supply Voltage in Silicon Ensemble
( pi_VDDC VDD ) + USE POWER ;
- VDP ( conventional/U16 VDD ) ( conventional/bm_sig_reg_0_0_3 VDD )
( conventional/bm_sig_reg_0_0_2 VDD ) ( conventional/bm_sig_reg_0_0_1 VDD )
( conventional/bm_sig_reg_0_0_0 VDD ) ( conventional/bm_sig_reg_0_1_3 VDD )
The line
( pi_VDDC VDD ) + USE POWER ;
sets the block “improved” to use the pad “pi VDDC”.
Skip down to the last assignment of VDD-pins for the block named “conven-
tional” and change the block to use the pad “pi VDP” as shown below.
( conventional/l_trellis_instance/l_butterfly_instance_0/U67 VDD )
( conventional/l_trellis_instance/l_butterfly_instance_0/U66 VDD )
( pi_VDP VDD ) + USE POWER ;
- V0IO ( U75 V0IO ) ( U74 V0IO ) ( U73 V0IO ) ( U72 V0IO ) ( U71 V0IO )
( U70 V0IO ) ( U69 V0IO ) ( U68 V0IO ) ( U67 V0IO ) ( U66 V0IO ) ( U65 V0IO )
( U64 V0IO ) ( U63 V0IO ) ( U62 V0IO ) ( U61 V0IO ) ( U60 V0IO ) ( U59 V0IO )
The line
- VDP ( pi_VDP VDD ) + USE POWER ;
which is found just before the line
END SPECIALNETS
should be removed to avoid warnings.
The file “dummy.def” contains all the edits described in this section.
9.1.2 Floorplan and Placement
Import the design with “1-import.mac”. This script imports the netlist
“dummy top.def”. Change this to “dummy.def” if you skipped the edits de-
scribed in chapter 9.1.1.
Floorplan and I/O placement is performed with the script “2-floorp and IO.mac”.
The I/O placement is set in the file “ioplace.ioc”, which is the default placement
except for the power pads.
Since the two parts of the design shall use separate supply voltages, they are
placed as two separate blocks. This is controlled by adding “regions” and se-
lecting one region for each block. Regions are added by running the script
“3-regions.mac”, which is shown below.
ADD REGION NAME USEVDP CELL conventional* BBOX (-100000 -45000) (-50000 46000);
ADD REGION NAME USEVDD CELL improved* BBOX (50000 -45000) (100000 25000);
SET VAR DRAW.REGION.AT "On";
REFRESH
— 145 — Lund Institute of Technology
Chapter 9. Multiple Supply Voltages
Figure 9.3: Floorplan with regions.
The command
ADD REGION NAME USEVDP CELL conventional* BBOX (-100000 -45000) (-50000 46000);
creates the region “USEVDP” at the specified coordinates, and sets the place-
ment of all components at hierarchical level “conventional” and below to use the
region “USEVDP”. Deciding size and placement of the regions is an iterative
process.
Figure 9.3 shows the floorplan with created regions.
Place cells by running the script “4-placecells.mac”. The cells are placed
within the defined regions.
9.1.3 Power and Signal Routing
In this design, all components belong to one of the blocks that are placed within
the two regions. Rows outside the regions can therefore be removed to give space
for power routing. Running the script “5-cutrows.mac” (shown below) deletes
rows outside the regions.
DELETE ROW AREA SPLIT (-50000,-50000) (50000,50000);
DELETE ROW AREA SPLIT (50000,25000) (100000,50000);
Initial power rings are created and then edited for this design. Run the script
“6-initialpower.mac” to create three power rings (VDD, VDP, and VSS).
ElectroScience — 146 —
9.1. Demonstration of Dual Supply Voltage in Silicon Ensemble
Figure 9.4: Initial power rings.
Figure 9.4 shows the initial power rings.
The script “7-modifypower.mac”, shown below contains two types of com-
mands; First, two wires are added for the net “VSS”, and then wires for the
nets “VDP”, and “VDD” are moved to optimize power routing. Adding and
moving wires is easily done using the menu commands (Edit-Wire-Move/Add),
and then saved as a script file.
ADD WIRE SPECIAL DRC NOVIASATCROSSOVER SHORTSCHECK NET VSS
LAYER MET2 WIDTH 10000 YFIRST (-41000, 55200) (-41000, -54800);
ADD WIRE SPECIAL DRC NOVIASATCROSSOVER SHORTSCHECK NET VSS
LAYER MET2 WIDTH 10000 YFIRST (40000, 55200) (40000, -54800);
MOVE WIRE NOVIASATCROSSOVER DRC SNAP NET VDP (120613,23800) (-28000, -22400);
MOVE WIRE NOVIASATCROSSOVER DRC SNAP NET VDD (-135842 25901) (27000, -14800);
MOVE WIRE NOVIASATCROSSOVER DRC SNAP NET VDD (62969,-80009) (62160, -68000);
MOVE WIRE NOVIASATCROSSOVER DRC SNAP NET VDD (62969,79721) (63120, 34800);
MOVE WIRE NOVIASATCROSSOVER DRC SNAP NET VDD (134658,16312) (124000,16312);
Power is routed by running the script “8-routepower.mac”, and finally, sig-
nals are routed by running the script “9-wroute.mac”.
There might be geometry violations on signal wires close to pads. This is due
to a grid mismatch and can (at your own risk) be ignored.
— 147 — Lund Institute of Technology
Chapter 9. Multiple Supply Voltages
9.1.4 Filler Cells
To fill out empty spaces between I/O-pads, run the script “10-fillperi.mac”.
Filler cells are also needed in empty spaces between standard cells. The two
blocks must be handled separately since they have different names on power nets.
The script “11-fillcore.mac” shown below adds filler cells to both blocks. For
each command in “11-fillcore.mac”, cells are added in a part of the routing
area, and connected to the power nets. The area “( 20000 -75000 ) ( 133000
44000 )” matches the block using supply voltage “VDD”, and the area “( -130000
-75000 ) ( -42000 55000 )” matches the block using supply voltage “VDP”.
Figure 9.5 shows the completed design with geometry violations.
#--Block using VDD
SROUTE ADDCELL MODEL HDFILL16 PREFIX FC16 NO FN SO FS
SPIN VDD NET VDD SPIN VSS NET VSS
AREA ( 20000 -75000 ) ( 133000 44000 ) ;
SROUTE ADDCELL MODEL HDFILL8 PREFIX FC8 NO FN SO FS
SPIN VDD NET VDD SPIN VSS NET VSS
AREA ( 20000 -75000 ) ( 133000 44000 ) ;
SROUTE ADDCELL MODEL HDFILL4 PREFIX FC4 NO FN SO FS
SPIN VDD NET VDD SPIN VSS NET VSS
AREA ( 20000 -75000 ) ( 133000 44000 ) ;
SROUTE ADDCELL MODEL HDFILL2 PREFIX FC2 NO FN SO FS
SPIN VDD NET VDD SPIN VSS NET VSS
AREA ( 20000 -75000 ) ( 133000 44000 ) ;
SROUTE ADDCELL MODEL HDFILL1 PREFIX FC1 NO FN SO FS
SPIN VDD NET VDD SPIN VSS NET VSS
AREA ( 20000 -75000 ) ( 133000 44000 ) ;
#--Block using VDP
SROUTE ADDCELL MODEL HDFILL16 PREFIX FC16 NO FN SO FS
SPIN VDD NET VDP SPIN VSS NET VSS
AREA ( -130000 -75000 ) ( -42000 55000 ) ;
SROUTE ADDCELL MODEL HDFILL8 PREFIX FC8 NO FN SO FS
SPIN VDD NET VDP SPIN VSS NET VSS
AREA ( -130000 -75000 ) ( -42000 55000 ) ;
SROUTE ADDCELL MODEL HDFILL4 PREFIX FC4 NO FN SO FS
SPIN VDD NET VDP SPIN VSS NET VSS
AREA ( -130000 -75000 ) ( -42000 55000 ) ;
ElectroScience — 148 —
9.1. Demonstration of Dual Supply Voltage in Silicon Ensemble
SROUTE ADDCELL MODEL HDFILL2 PREFIX FC2 NO FN SO FS
SPIN VDD NET VDP SPIN VSS NET VSS
AREA ( -130000 -75000 ) ( -42000 55000 ) ;
SROUTE ADDCELL MODEL HDFILL1 PREFIX FC1 NO FN SO FS
SPIN VDD NET VDP SPIN VSS NET VSS
AREA ( -130000 -75000 ) ( -42000 55000 ) ;
— 149 — Lund Institute of Technology
Chapter 9. Multiple Supply Voltages
Figure 9.5: Completed design.
ElectroScience — 150 —
Bibliography
[1] J. Bhasker, A VHDL Primer, 3rd ed. Prentice Hall, 1999.
[2] P. J. Ashenden, Designers Guide to VHDL, 2nd ed. Morgan Kaufman,
2001.
[3] Jiri Gaisler, “High-level design methodology,”
http://www.gaisler.com/doc/vhdl2proc.pdf.
[4] F. Kristensen, P. Nilsson, and A. Olsson, “A flexible FFT processor,” in
Proc. 20th NORCHIP Conference, Copenhagen, Denmark, Nov. 2002.
[5] ——, “A generic transmitter for wireless OFDM systems,” in Proc.
IEEE Conference on Personal, Indoor, and Mobile Radio Communication
(PIMRC), vol. 3, Bejing, China, Sept. 2003, pp. 2234–2238.
[6] H. Jiang and V. wall, “FPGA implementation of controller-datapath pair
in custom image processor design,” in Proc. IEEE Symposium on Circuits
and Systems (ISCAS), Vancouver, Canada, May 2004.
[7] F. Catthoor, Custom Memory Management Methodology, 1st ed. Kluwer,
1998.
[8] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, “Leakage Current
Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer
CMOS Circuits,” Proceedings of the IEEE, vol. 91, no. 2, pp. 305–327, Feb.
2003.
[9] A. J. Bhavnagarwala, B. Austin, A. Kapoor, and J. D. Meindl, “CMOS
System-on-a-Chip Voltage Scaling beyond 50nm,” in Proceedings of the
10th Great Lakes Symposium on VLSI, Chicago, Illinois, USA, 2000, pp.
7–12.
[10] “Managing power in ultra deep submicron ASIC/IC design,” Synopsys,
Inc., Tech. Rep., 2002.
[11] K. K. Parhi, VLSI Digital Signal Processing Systems. New York, NY:
Wiley, 1999.
[12] R. Gonzalez, B. Gordon, and M. Horowitz, “Supply and threshold voltage
scaling for low power CMOS,” IEEE Journal of Solid-State Circuits, vol. 32,
no. 8, pp. 1210–1216, Aug. 1997.
— 151 — Lund Institute of Technology
Bibliography
[13] Transmeta Corporation, “Transmeta LongRun Power Management,”
http://www.transmeta.com.
[14] V. Tiwari, R. Donnelly, S. Malik, and R. Gonzalez, “Dynamic power man-
agement for microprocessors: A case study,” in Proc. Tenth International
Conference on VLSI Design, Hyderabad, India, Jan. 1997, pp. 185–192.
[15] T. Karnik, S. Borkar, and V. De, “Sub-90 nm Technologies-Challenges and
Opportunities for CAD,” in Proceedings of the 2002 IEEE International
Conference on Computer Aided Design, Nov. 2002, pp. 203–206.
[16] G. Sery, S. Borkar, and V. De, “Life is CMOS: why chase the life after?”
in Proceedings of the 39th Design Automation Conference, DAC’02. ACM
Press, June 10-14 2002, pp. 78–83.
[17] R. Kumar and C. P. Ravikumar, “Leakage Power Estimation for Deep
Submicron Circuits in an ASIC Design Environment,” in Proceedings of
the 7th Asia and South Pacific Design Automation Conference and the
15th International Conference on VLSI Design, Jan. 2002, pp. 45–50.
[18] B. Chatterjee, M. Sachdev, S. Hsu, R. Krishnamurthy, and S. Borkar, “Ef-
fectiveness and Scaling Trends of Leakage Control Techniques for Sub-130
nm CMOS Technologies,” in Proceedings of the 2003 International Sympo-
sium on Low Power Electronics and Design, ISLPED’03, Aug. 2003, pp.
122–127.
[19] S. Mukhopadhyay and K. Roy, “Modeling and Estimation of Total Leakage
Current in Nano-scaled-CMOS Devices Considering the Effect of Parameter
Variation,” in Proceedings of the 2003 International Symposium on Low
Power Electronics and Design, ISLPED’03, Aug. 2003, pp. 172–175.
[20] L. T. Clark, S. Demmons, N. Deutscher, and F. Ricci, “Standby Power
Management for a 0.18µm Microprocessor,” in Proceedings of the 2002 In-
ternational Symposium on Low Power Electronics and Design, ISLPED’02,
Aug. 2002, pp. 7–12.
[21] ???, “International Technology Roadmap for Semiconductors,”
http://public.itrs.net.
[22] “Switching activity interchange format (SAIF),” Synopsys, Inc., Tech.
Rep., 2002.
[23] $UMC LIB/../UMCE13H210D3 1.1/lib/umce13h210t3 tc 120V 25C.lib.
[24] G. E. Moore, “Cramming More Components onto Integrated Cir-
cuits,” Electronocs, vol. 38, no. 8, Apr. 1965. [Online]. Available:
ftp://download.intel.com/research/silicon/moorespaper.pdf
[25] I. Toumi, “The Lives and Death of Moore’s Law,” First Monday, Nov.
2002. [Online]. Available: http://firstmonday.org/issues/issue7 11/tuomi/
[26] S. Borkar, “Obeying Moore’s Law Beyond 0.18 Micron,” in Proceedings
of the 13th Annual IEEE International ASIC/SOC Conference, Arlington,
VA, Sept. 2000, pp. 26–31.
ElectroScience — 152 —
Bibliography
[27] G. E. Moore, “No Exponential is Forever: But “Forever” Can Be Delayed!”
in Proceedings of the IEEE International Solid-State Circuits Conference,
ISSCC’03, Feb. 2003, pp. 20–23.
[28] A. P. Chandrakasan and R. W. Brodersen, “Minimizing Power Consump-
tion in Digital CMOS Circuits,” in Proceedings of the IEEE, vol. 83, no. 4,
Apr. 1995, pp. 498–523.
[29] M. C. Johnson and K. Roy, “Optimal Selection of Supply Voltages and
Level Conversions During Data Path Scheduling Under Resource Con-
straints,” in Proceedings of the 1996 IEEE International Conference on
Computer Design, Oct. 1996, pp. 72–77.
[30] V. Sundararajan and K. K. Parhi, “Synthesis of Low Power CMOS VLSI
Circuits using Dual Supply Voltages,” in Proceedings of the 36th Design
Automation Conference, DAC’99. New Orleans, LA, USA: ACM Press,
June 21-25 1999, pp. 72–75.
[31] G.-Y. Wei and M. Horowitz, “A Fully Digital, Energy-Efficient, Adaptive
Power-Supply Regulator,” IEEE Journal of Solid-State Circuits, vol. 34,
no. 4, pp. 520–528, Apr. 1999.
[32] T. Kuroda, K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, A. Chiba,
Y. Watanabe, K. Matsuda, T. Maeda, T. Sakurai, and T. Furuyama, “Vari-
able Supply-Voltage Scheme for Low-Power High-Speed CMOS Digital De-
sign,” IEEE Journal of Solid-State Circuits, vol. 33, no. 3, pp. 454–462,
Mar. 1998.
[33] M. Johnson and K. Roy, “Scheduling and Optimal Voltage Selection for
Low Power Multi-Voltage DSP Datapaths,” in Proceedings of IEEE Inter-
national Symposium on Circuits and Systems, ISCAS’97, June 1997, pp.
2152–2155.
[34] T. Olsson, P.
˚
Astr¨ om, and P. Nilsson, “Dual Supply–Voltage Scaling for
Reconfigurable SoCs,” in Proceedings of IEEE International Symposium
on Circuits and Systems, ISCAS 2001, vol. 2, Sydney, Australia, May 6-9
2001, pp. 61–64.
[35] T. Nakagome, K. Itoh, M. Isoda, K. Takeuchi, and M. Aoki, “Sub-1-V
Swing Internal Bus Architecture for Future Low-Power ULSI’s,” IEEE
Journal of Solid-State Circuits, vol. 28, no. 4, pp. 414–419, Apr. 1993.
— 153 — Lund Institute of Technology
ElectroScience — 154 —
Appendix A
std logic 1164
PACKAGE std_logic_1164 IS
-------------------------------------------------------------------
-- logic state system (unresolved)
-------------------------------------------------------------------
TYPE std_ulogic IS ( ’U’, -- Uninitialized
’X’, -- Forcing Unknown
’0’, -- Forcing 0
’1’, -- Forcing 1
’Z’, -- High Impedance
’W’, -- Weak Unknown
’L’, -- Weak 0
’H’, -- Weak 1
’-’ -- Don’t care
);
-------------------------------------------------------------------
-- unconstrained array of std_ulogic for use with the resolution function
-------------------------------------------------------------------
TYPE std_ulogic_vector IS ARRAY ( NATURAL RANGE <> ) OF std_ulogic;
-------------------------------------------------------------------
-- resolution function
-------------------------------------------------------------------
FUNCTION resolved ( s : std_ulogic_vector ) RETURN std_ulogic;
-------------------------------------------------------------------
-- *** industry standard logic type ***
-------------------------------------------------------------------
SUBTYPE std_logic IS resolved std_ulogic;
-------------------------------------------------------------------
-- unconstrained array of std_logic for use in declaring signal arrays
-------------------------------------------------------------------
TYPE std_logic_vector IS ARRAY ( NATURAL RANGE <>) OF std_logic;
-------------------------------------------------------------------
-- common subtypes
-------------------------------------------------------------------
SUBTYPE X01 IS resolved std_ulogic RANGE ’X’ TO ’1’; -- (’X’,’0’,’1’)
SUBTYPE X01Z IS resolved std_ulogic RANGE ’X’ TO ’Z’; -- (’X’,’0’,’1’,’Z’)
SUBTYPE UX01 IS resolved std_ulogic RANGE ’U’ TO ’1’; -- (’U’,’X’,’0’,’1’)
SUBTYPE UX01Z IS resolved std_ulogic RANGE ’U’ TO ’Z’; -- (’U’,’X’,’0’,’1’,’Z’)
-------------------------------------------------------------------
-- overloaded logical operators
-------------------------------------------------------------------
FUNCTION "and" ( l : std_ulogic; r : std_ulogic ) RETURN UX01;
FUNCTION "nand" ( l : std_ulogic; r : std_ulogic ) RETURN UX01;
FUNCTION "or" ( l : std_ulogic; r : std_ulogic ) RETURN UX01;
FUNCTION "nor" ( l : std_ulogic; r : std_ulogic ) RETURN UX01;
FUNCTION "xor" ( l : std_ulogic; r : std_ulogic ) RETURN UX01;
function "xnor" ( l : std_ulogic; r : std_ulogic ) return ux01;
FUNCTION "not" ( l : std_ulogic ) RETURN UX01;
— 155 — Lund Institute of Technology
Bibliography
-------------------------------------------------------------------
-- vectorized overloaded logical operators
-------------------------------------------------------------------
FUNCTION "and" ( l, r : std_logic_vector ) RETURN std_logic_vector;
FUNCTION "and" ( l, r : std_ulogic_vector ) RETURN std_ulogic_vector;
FUNCTION "nand" ( l, r : std_logic_vector ) RETURN std_logic_vector;
FUNCTION "nand" ( l, r : std_ulogic_vector ) RETURN std_ulogic_vector;
FUNCTION "or" ( l, r : std_logic_vector ) RETURN std_logic_vector;
FUNCTION "or" ( l, r : std_ulogic_vector ) RETURN std_ulogic_vector;
FUNCTION "nor" ( l, r : std_logic_vector ) RETURN std_logic_vector;
FUNCTION "nor" ( l, r : std_ulogic_vector ) RETURN std_ulogic_vector;
FUNCTION "xor" ( l, r : std_logic_vector ) RETURN std_logic_vector;
FUNCTION "xor" ( l, r : std_ulogic_vector ) RETURN std_ulogic_vector;
function "xnor" ( l, r : std_logic_vector ) return std_logic_vector;
function "xnor" ( l, r : std_ulogic_vector ) return std_ulogic_vector;
FUNCTION "not" ( l : std_logic_vector ) RETURN std_logic_vector;
FUNCTION "not" ( l : std_ulogic_vector ) RETURN std_ulogic_vector;
-------------------------------------------------------------------
-- conversion functions
-------------------------------------------------------------------
FUNCTION To_bit ( s : std_ulogic; xmap : BIT := ’0’) RETURN BIT;
FUNCTION To_bitvector ( s : std_logic_vector ; xmap : BIT := ’0’) RETURN BIT_VECTOR;
FUNCTION To_bitvector ( s : std_ulogic_vector; xmap : BIT := ’0’) RETURN BIT_VECTOR;
FUNCTION To_StdULogic ( b : BIT ) RETURN std_ulogic;
FUNCTION To_StdLogicVector ( b : BIT_VECTOR ) RETURN std_logic_vector;
FUNCTION To_StdLogicVector ( s : std_ulogic_vector ) RETURN std_logic_vector;
FUNCTION To_StdULogicVector ( b : BIT_VECTOR ) RETURN std_ulogic_vector;
FUNCTION To_StdULogicVector ( s : std_logic_vector ) RETURN std_ulogic_vector;
-------------------------------------------------------------------
-- strength strippers and type convertors
-------------------------------------------------------------------
FUNCTION To_X01 ( s : std_logic_vector ) RETURN std_logic_vector;
FUNCTION To_X01 ( s : std_ulogic_vector ) RETURN std_ulogic_vector;
FUNCTION To_X01 ( s : std_ulogic ) RETURN X01;
FUNCTION To_X01 ( b : BIT_VECTOR ) RETURN std_logic_vector;
FUNCTION To_X01 ( b : BIT_VECTOR ) RETURN std_ulogic_vector;
FUNCTION To_X01 ( b : BIT ) RETURN X01;
FUNCTION To_X01Z ( s : std_logic_vector ) RETURN std_logic_vector;
FUNCTION To_X01Z ( s : std_ulogic_vector ) RETURN std_ulogic_vector;
FUNCTION To_X01Z ( s : std_ulogic ) RETURN X01Z;
FUNCTION To_X01Z ( b : BIT_VECTOR ) RETURN std_logic_vector;
FUNCTION To_X01Z ( b : BIT_VECTOR ) RETURN std_ulogic_vector;
FUNCTION To_X01Z ( b : BIT ) RETURN X01Z;
FUNCTION To_UX01 ( s : std_logic_vector ) RETURN std_logic_vector;
FUNCTION To_UX01 ( s : std_ulogic_vector ) RETURN std_ulogic_vector;
FUNCTION To_UX01 ( s : std_ulogic ) RETURN UX01;
FUNCTION To_UX01 ( b : BIT_VECTOR ) RETURN std_logic_vector;
FUNCTION To_UX01 ( b : BIT_VECTOR ) RETURN std_ulogic_vector;
FUNCTION To_UX01 ( b : BIT ) RETURN UX01;
-------------------------------------------------------------------
-- edge detection
-------------------------------------------------------------------
FUNCTION rising_edge (SIGNAL s : std_ulogic) RETURN BOOLEAN;
FUNCTION falling_edge (SIGNAL s : std_ulogic) RETURN BOOLEAN;
-------------------------------------------------------------------
-- object contains an unknown
-------------------------------------------------------------------
FUNCTION Is_X ( s : std_ulogic_vector ) RETURN BOOLEAN;
FUNCTION Is_X ( s : std_logic_vector ) RETURN BOOLEAN;
ElectroScience — 156 —
Bibliography
FUNCTION Is_X ( s : std_ulogic ) RETURN BOOLEAN;
END std_logic_1164;
— 157 — Lund Institute of Technology
Bibliography
std logic arith
--------------------------------------------------------------------------
--
-- -- Copyright (c) 1990,1991,1992 by Synopsys, Inc. All rights
reserved. -- --
-- -- This source file may be used and distributed without
restriction -- -- provided that this copyright statement is
not removed from the file -- -- and that any derivative work
contains this copyright notice. -- --
--
--------------------------------------------------------------------------
library IEEE; use IEEE.std_logic_1164.all;
package std_logic_arith is
type UNSIGNED is array (NATURAL range <>) of STD_LOGIC;
type SIGNED is array (NATURAL range <>) of STD_LOGIC;
subtype SMALL_INT is INTEGER range 0 to 1;
function "+"(L: UNSIGNED; R: UNSIGNED) return UNSIGNED;
function "+"(L: SIGNED; R: SIGNED) return SIGNED;
function "+"(L: UNSIGNED; R: SIGNED) return SIGNED;
function "+"(L: SIGNED; R: UNSIGNED) return SIGNED;
function "+"(L: UNSIGNED; R: INTEGER) return UNSIGNED;
function "+"(L: INTEGER; R: UNSIGNED) return UNSIGNED;
function "+"(L: SIGNED; R: INTEGER) return SIGNED;
function "+"(L: INTEGER; R: SIGNED) return SIGNED;
function "+"(L: UNSIGNED; R: STD_ULOGIC) return UNSIGNED;
function "+"(L: STD_ULOGIC; R: UNSIGNED) return UNSIGNED;
function "+"(L: SIGNED; R: STD_ULOGIC) return SIGNED;
function "+"(L: STD_ULOGIC; R: SIGNED) return SIGNED;
function "+"(L: UNSIGNED; R: UNSIGNED) return STD_LOGIC_VECTOR;
function "+"(L: SIGNED; R: SIGNED) return STD_LOGIC_VECTOR;
function "+"(L: UNSIGNED; R: SIGNED) return STD_LOGIC_VECTOR;
function "+"(L: SIGNED; R: UNSIGNED) return STD_LOGIC_VECTOR;
function "+"(L: UNSIGNED; R: INTEGER) return STD_LOGIC_VECTOR;
function "+"(L: INTEGER; R: UNSIGNED) return STD_LOGIC_VECTOR;
function "+"(L: SIGNED; R: INTEGER) return STD_LOGIC_VECTOR;
function "+"(L: INTEGER; R: SIGNED) return STD_LOGIC_VECTOR;
function "+"(L: UNSIGNED; R: STD_ULOGIC) return STD_LOGIC_VECTOR;
function "+"(L: STD_ULOGIC; R: UNSIGNED) return STD_LOGIC_VECTOR;
function "+"(L: SIGNED; R: STD_ULOGIC) return STD_LOGIC_VECTOR;
function "+"(L: STD_ULOGIC; R: SIGNED) return STD_LOGIC_VECTOR;
function "-"(L: UNSIGNED; R: UNSIGNED) return UNSIGNED;
function "-"(L: SIGNED; R: SIGNED) return SIGNED;
function "-"(L: UNSIGNED; R: SIGNED) return SIGNED;
function "-"(L: SIGNED; R: UNSIGNED) return SIGNED;
function "-"(L: UNSIGNED; R: INTEGER) return UNSIGNED;
function "-"(L: INTEGER; R: UNSIGNED) return UNSIGNED;
function "-"(L: SIGNED; R: INTEGER) return SIGNED;
function "-"(L: INTEGER; R: SIGNED) return SIGNED;
function "-"(L: UNSIGNED; R: STD_ULOGIC) return UNSIGNED;
function "-"(L: STD_ULOGIC; R: UNSIGNED) return UNSIGNED;
function "-"(L: SIGNED; R: STD_ULOGIC) return SIGNED;
function "-"(L: STD_ULOGIC; R: SIGNED) return SIGNED;
function "-"(L: UNSIGNED; R: UNSIGNED) return STD_LOGIC_VECTOR;
function "-"(L: SIGNED; R: SIGNED) return STD_LOGIC_VECTOR;
function "-"(L: UNSIGNED; R: SIGNED) return STD_LOGIC_VECTOR;
function "-"(L: SIGNED; R: UNSIGNED) return STD_LOGIC_VECTOR;
function "-"(L: UNSIGNED; R: INTEGER) return STD_LOGIC_VECTOR;
function "-"(L: INTEGER; R: UNSIGNED) return STD_LOGIC_VECTOR;
function "-"(L: SIGNED; R: INTEGER) return STD_LOGIC_VECTOR;
function "-"(L: INTEGER; R: SIGNED) return STD_LOGIC_VECTOR;
function "-"(L: UNSIGNED; R: STD_ULOGIC) return STD_LOGIC_VECTOR;
function "-"(L: STD_ULOGIC; R: UNSIGNED) return STD_LOGIC_VECTOR;
function "-"(L: SIGNED; R: STD_ULOGIC) return STD_LOGIC_VECTOR;
function "-"(L: STD_ULOGIC; R: SIGNED) return STD_LOGIC_VECTOR;
function "+"(L: UNSIGNED) return UNSIGNED;
ElectroScience — 158 —
Bibliography
function "+"(L: SIGNED) return SIGNED;
function "-"(L: SIGNED) return SIGNED;
function "ABS"(L: SIGNED) return SIGNED;
function "+"(L: UNSIGNED) return STD_LOGIC_VECTOR;
function "+"(L: SIGNED) return STD_LOGIC_VECTOR;
function "-"(L: SIGNED) return STD_LOGIC_VECTOR;
function "ABS"(L: SIGNED) return STD_LOGIC_VECTOR;
function "*"(L: UNSIGNED; R: UNSIGNED) return UNSIGNED;
function "*"(L: SIGNED; R: SIGNED) return SIGNED;
function "*"(L: SIGNED; R: UNSIGNED) return SIGNED;
function "*"(L: UNSIGNED; R: SIGNED) return SIGNED;
function "*"(L: UNSIGNED; R: UNSIGNED) return STD_LOGIC_VECTOR;
function "*"(L: SIGNED; R: SIGNED) return STD_LOGIC_VECTOR;
function "*"(L: SIGNED; R: UNSIGNED) return STD_LOGIC_VECTOR;
function "*"(L: UNSIGNED; R: SIGNED) return STD_LOGIC_VECTOR;
function "<"(L: UNSIGNED; R: UNSIGNED) return BOOLEAN;
function "<"(L: SIGNED; R: SIGNED) return BOOLEAN;
function "<"(L: UNSIGNED; R: SIGNED) return BOOLEAN;
function "<"(L: SIGNED; R: UNSIGNED) return BOOLEAN;
function "<"(L: UNSIGNED; R: INTEGER) return BOOLEAN;
function "<"(L: INTEGER; R: UNSIGNED) return BOOLEAN;
function "<"(L: SIGNED; R: INTEGER) return BOOLEAN;
function "<"(L: INTEGER; R: SIGNED) return BOOLEAN;
function "<="(L: UNSIGNED; R: UNSIGNED) return BOOLEAN;
function "<="(L: SIGNED; R: SIGNED) return BOOLEAN;
function "<="(L: UNSIGNED; R: SIGNED) return BOOLEAN;
function "<="(L: SIGNED; R: UNSIGNED) return BOOLEAN;
function "<="(L: UNSIGNED; R: INTEGER) return BOOLEAN;
function "<="(L: INTEGER; R: UNSIGNED) return BOOLEAN;
function "<="(L: SIGNED; R: INTEGER) return BOOLEAN;
function "<="(L: INTEGER; R: SIGNED) return BOOLEAN;
function ">"(L: UNSIGNED; R: UNSIGNED) return BOOLEAN;
function ">"(L: SIGNED; R: SIGNED) return BOOLEAN;
function ">"(L: UNSIGNED; R: SIGNED) return BOOLEAN;
function ">"(L: SIGNED; R: UNSIGNED) return BOOLEAN;
function ">"(L: UNSIGNED; R: INTEGER) return BOOLEAN;
function ">"(L: INTEGER; R: UNSIGNED) return BOOLEAN;
function ">"(L: SIGNED; R: INTEGER) return BOOLEAN;
function ">"(L: INTEGER; R: SIGNED) return BOOLEAN;
function ">="(L: UNSIGNED; R: UNSIGNED) return BOOLEAN;
function ">="(L: SIGNED; R: SIGNED) return BOOLEAN;
function ">="(L: UNSIGNED; R: SIGNED) return BOOLEAN;
function ">="(L: SIGNED; R: UNSIGNED) return BOOLEAN;
function ">="(L: UNSIGNED; R: INTEGER) return BOOLEAN;
function ">="(L: INTEGER; R: UNSIGNED) return BOOLEAN;
function ">="(L: SIGNED; R: INTEGER) return BOOLEAN;
function ">="(L: INTEGER; R: SIGNED) return BOOLEAN;
function "="(L: UNSIGNED; R: UNSIGNED) return BOOLEAN;
function "="(L: SIGNED; R: SIGNED) return BOOLEAN;
function "="(L: UNSIGNED; R: SIGNED) return BOOLEAN;
function "="(L: SIGNED; R: UNSIGNED) return BOOLEAN;
function "="(L: UNSIGNED; R: INTEGER) return BOOLEAN;
function "="(L: INTEGER; R: UNSIGNED) return BOOLEAN;
function "="(L: SIGNED; R: INTEGER) return BOOLEAN;
function "="(L: INTEGER; R: SIGNED) return BOOLEAN;
function "/="(L: UNSIGNED; R: UNSIGNED) return BOOLEAN;
function "/="(L: SIGNED; R: SIGNED) return BOOLEAN;
function "/="(L: UNSIGNED; R: SIGNED) return BOOLEAN;
function "/="(L: SIGNED; R: UNSIGNED) return BOOLEAN;
function "/="(L: UNSIGNED; R: INTEGER) return BOOLEAN;
function "/="(L: INTEGER; R: UNSIGNED) return BOOLEAN;
function "/="(L: SIGNED; R: INTEGER) return BOOLEAN;
function "/="(L: INTEGER; R: SIGNED) return BOOLEAN;
function SHL(ARG: UNSIGNED; COUNT: UNSIGNED) return UNSIGNED;
— 159 — Lund Institute of Technology
Bibliography
function SHL(ARG: SIGNED; COUNT: UNSIGNED) return SIGNED;
function SHR(ARG: UNSIGNED; COUNT: UNSIGNED) return UNSIGNED;
function SHR(ARG: SIGNED; COUNT: UNSIGNED) return SIGNED;
function CONV_INTEGER(ARG: INTEGER) return INTEGER;
function CONV_INTEGER(ARG: UNSIGNED) return INTEGER;
function CONV_INTEGER(ARG: SIGNED) return INTEGER;
function CONV_INTEGER(ARG: STD_ULOGIC) return SMALL_INT;
function CONV_UNSIGNED(ARG: INTEGER; SIZE: INTEGER) return UNSIGNED;
function CONV_UNSIGNED(ARG: UNSIGNED; SIZE: INTEGER) return UNSIGNED;
function CONV_UNSIGNED(ARG: SIGNED; SIZE: INTEGER) return UNSIGNED;
function CONV_UNSIGNED(ARG: STD_ULOGIC; SIZE: INTEGER) return UNSIGNED;
function CONV_SIGNED(ARG: INTEGER; SIZE: INTEGER) return SIGNED;
function CONV_SIGNED(ARG: UNSIGNED; SIZE: INTEGER) return SIGNED;
function CONV_SIGNED(ARG: SIGNED; SIZE: INTEGER) return SIGNED;
function CONV_SIGNED(ARG: STD_ULOGIC; SIZE: INTEGER) return SIGNED;
function CONV_STD_LOGIC_VECTOR(ARG: INTEGER; SIZE: INTEGER) return STD_LOGIC_VECTOR;
function CONV_STD_LOGIC_VECTOR(ARG: UNSIGNED; SIZE: INTEGER) return STD_LOGIC_VECTOR;
function CONV_STD_LOGIC_VECTOR(ARG: SIGNED; SIZE: INTEGER) return STD_LOGIC_VECTOR;
function CONV_STD_LOGIC_VECTOR(ARG: STD_ULOGIC; SIZE: INTEGER) return STD_LOGIC_VECTOR;
-- zero extend STD_LOGIC_VECTOR (ARG) to SIZE,
-- SIZE < 0 is same as SIZE = 0
-- returns STD_LOGIC_VECTOR(SIZE-1 downto 0)
function EXT(ARG: STD_LOGIC_VECTOR; SIZE: INTEGER) return STD_LOGIC_VECTOR;
-- sign extend STD_LOGIC_VECTOR (ARG) to SIZE,
-- SIZE < 0 is same as SIZE = 0
-- return STD_LOGIC_VECTOR(SIZE-1 downto 0)
function SXT(ARG: STD_LOGIC_VECTOR; SIZE: INTEGER) return STD_LOGIC_VECTOR;
end Std_logic_arith;
ElectroScience — 160 —
Bibliography
std logic unsigned
--------------------------------------------------------------------------
--
-- -- Copyright (c) 1990, 1991, 1992 by Synopsys, Inc.
-- -- All rights
reserved. -- --
-- -- This source file may be used and distributed without
restriction -- -- provided that this copyright statement is
not removed from the file -- -- and that any derivative work
contains this copyright notice. -- --
--
--------------------------------------------------------------------------
library IEEE; use IEEE.std_logic_1164.all; use
IEEE.std_logic_arith.all;
package STD_LOGIC_UNSIGNED is
function "+"(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "+"(L: STD_LOGIC_VECTOR; R: INTEGER) return STD_LOGIC_VECTOR;
function "+"(L: INTEGER; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "+"(L: STD_LOGIC_VECTOR; R: STD_LOGIC) return STD_LOGIC_VECTOR;
function "+"(L: STD_LOGIC; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "-"(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "-"(L: STD_LOGIC_VECTOR; R: INTEGER) return STD_LOGIC_VECTOR;
function "-"(L: INTEGER; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "-"(L: STD_LOGIC_VECTOR; R: STD_LOGIC) return STD_LOGIC_VECTOR;
function "-"(L: STD_LOGIC; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "+"(L: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "*"(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "<"(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "<"(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function "<"(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "<="(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "<="(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function "<="(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function ">"(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function ">"(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function ">"(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function ">="(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function ">="(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function ">="(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "="(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "="(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function "="(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "/="(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "/="(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function "/="(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function SHL(ARG:STD_LOGIC_VECTOR;COUNT: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function SHR(ARG:STD_LOGIC_VECTOR;COUNT: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function CONV_INTEGER(ARG: STD_LOGIC_VECTOR) return INTEGER;
-- remove this since it is already in std_logic_arith
-- function CONV_STD_LOGIC_VECTOR(ARG: INTEGER; SIZE: INTEGER) return STD_LOGIC_VECTOR;
end STD_LOGIC_UNSIGNED;
— 161 — Lund Institute of Technology
Bibliography
std logic signed
--------------------------------------------------------------------------
--
-- -- Copyright (c) 1990, 1991, 1992 by Synopsys, Inc.
-- -- All rights
reserved. -- --
-- -- This source file may be used and distributed without
restriction -- -- provided that this copyright statement is
not removed from the file -- -- and that any derivative work
contains this copyright notice. -- --
--
--------------------------------------------------------------------------
library IEEE; use IEEE.std_logic_1164.all; use
IEEE.std_logic_arith.all;
package STD_LOGIC_SIGNED is
function "+"(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "+"(L: STD_LOGIC_VECTOR; R: INTEGER) return STD_LOGIC_VECTOR;
function "+"(L: INTEGER; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "+"(L: STD_LOGIC_VECTOR; R: STD_LOGIC) return STD_LOGIC_VECTOR;
function "+"(L: STD_LOGIC; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "-"(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "-"(L: STD_LOGIC_VECTOR; R: INTEGER) return STD_LOGIC_VECTOR;
function "-"(L: INTEGER; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "-"(L: STD_LOGIC_VECTOR; R: STD_LOGIC) return STD_LOGIC_VECTOR;
function "-"(L: STD_LOGIC; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "+"(L: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "-"(L: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "ABS"(L: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "*"(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function "<"(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "<"(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function "<"(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "<="(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "<="(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function "<="(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function ">"(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function ">"(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function ">"(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function ">="(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function ">="(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function ">="(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "="(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "="(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function "="(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "/="(L: STD_LOGIC_VECTOR; R: STD_LOGIC_VECTOR) return BOOLEAN;
function "/="(L: STD_LOGIC_VECTOR; R: INTEGER) return BOOLEAN;
function "/="(L: INTEGER; R: STD_LOGIC_VECTOR) return BOOLEAN;
function SHL(ARG:STD_LOGIC_VECTOR;COUNT: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function SHR(ARG:STD_LOGIC_VECTOR;COUNT: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR;
function CONV_INTEGER(ARG: STD_LOGIC_VECTOR) return INTEGER;
-- remove this since it is already in std_logic_arith --
function CONV_STD_LOGIC_VECTOR(ARG: INTEGER; SIZE: INTEGER) return
STD_LOGIC_VECTOR;
end STD_LOGIC_SIGNED;
ElectroScience — 162 —

ElectroScience

—2—

Contents
Contents 1 Introduction to VHDL 1.1 Background . . . . . . . . . . . . . . . . . 1.2 Event-driven simulation . . . . . . . . . . 1.3 Design units . . . . . . . . . . . . . . . . . 1.3.1 Entity . . . . . . . . . . . . . . . . 1.3.2 Architecture . . . . . . . . . . . . 1.3.3 Configuration . . . . . . . . . . . . 1.3.4 Component . . . . . . . . . . . . . 1.3.5 Package . . . . . . . . . . . . . . . 1.4 Data types and modelling . . . . . . . . . 1.4.1 Type declaration . . . . . . . . . . 1.4.2 Signals . . . . . . . . . . . . . . . . 1.4.3 Generics . . . . . . . . . . . . . . . 1.4.4 Constants . . . . . . . . . . . . . . 1.4.5 Variables . . . . . . . . . . . . . . 1.4.6 Common operations . . . . . . . . 1.5 Process statement . . . . . . . . . . . . . 1.5.1 Sequential vs. combinatorial logic 1.6 Instantiation and port maps . . . . . . . . 1.7 Generate statements . . . . . . . . . . . . 1.8 Simulation . . . . . . . . . . . . . . . . . . 1.8.1 Testbenches . . . . . . . . . . . . . 1.8.2 Modelsim . . . . . . . . . . . . . . 1.8.3 Non-synthesizable code . . . . . . 1.9 Design example . . . . . . . . . . . . . . . 2 Design Methodology 2.1 Structured VHDL programming . 2.1.1 Source code disposition . 2.1.2 Records . . . . . . . . . . 2.1.3 Clock and reset signal . . 2.1.4 Hierarchy . . . . . . . . . 2.1.5 Local variables . . . . . . 2.1.6 Subprograms . . . . . . . 2.1.7 Summary . . . . . . . . . 2.2 Example . . . . . . . . . . . . . . 3 11 11 12 12 13 13 13 14 14 16 16 16 17 17 18 18 19 19 21 21 23 23 24 25 25 31 32 32 33 33 34 35 35 36 37

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

—3—

Lund Institute of Technology

. . 69 4. . . . . . . . 78 5. . . . . . . .4. . . . . .3. . . 3. . . . .2 Multiplication of signed numbers . . . . . . . . 3. . . . . . . . 3. . . . . . . . . 3. . . . . . . . . . . . . .4. .2 Manual Selection of the Implementation in dc 3.5 Comparison operation . .4. . . . . . . . . . . . . . . .4 Addition and Subtraction . . 84 5. . . . . . . . . . . . . 80 5. .4.3. . .1 Wireload model . . . . . . . . . . . .2 Increasing wordlength . 3. . . . . . . . . . . . . . . . . . . . . . . . . . . 3. . . . . . . . .1 Introduction . . .4.3. . . . . . . . . . . . . . . . .3 Counters . .3 FPGA modules . . . . . . . . . . . . . . . . . . . . . . 76 5. . . . . . . 3. . . . . . . . . . . . shell . . . . . . . . . . . . 89 ElectroScience —4— . . . . . . 86 5. 2. . .3 Baseline synthesis flow . . . . . .3. .4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5. . . 3. . . . . . . . . . . . .1 Basic concepts . . . . . . . . . . . . . . . .7 Optimization flow . . . . .3 VHDL examples . . . . 3.6. . . .5 ALU unit . . . . . .3. . . . . . . . .4. . . . . . . .2. . .5 Design Optimization and constraints . . . . . . 37 39 39 40 41 42 43 43 43 44 45 45 48 48 51 53 53 54 55 56 57 58 59 60 61 62 63 3 Arithmetic 3. . . . . . . 3. . . . . . . . . .5. . . . . .1 Basic VHDL example . . . . . . . . . . . . . . . . . . . . .3. . . . . . . .3. . . . . . . 3. . . . . . . . . . . . .4. . . . . . . . .3. 3. 67 4. . . . . . .8 Behavioral Optimization of Arithmetic and Behavioral Re-timing .4. . . . . . . . 78 5. . . 2. . . . . . . . . . . . . . . . . . . .2 ASIC Pads .Contents 2. . . . . . . . . . . . . . . .3 Cache example . . . . . . . . . . .2 Constraining the design for better performance . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 Technology independence 2. . . . . . . . 88 5. . . . . . . . . . . . . . . . . . . . . . . .3 Compile mode concepts .2 VHDL Packages . .2. . . . . . . . . .6 Resolving multiple instances .1 Sample synthesis script for a simple design . . .2 Operations . . . . .1 Multiplication by constants . .4 FPGA Pads . . . . . . . . . . . 82 5. . . 3. . . . .1 SR example . . . . . . . . . . . . . . . . . . .4. . . .4 Multioperand addition . . . 75 5. . . . . . .5 Implementations . . . . . . . . . .4. . . . . . . . . . . . . . . . . . . . . . . . 3. . . . .1 ASIC Memories . 3. . . . . .7 Datapath Manipulation . . . . . . . . 4 Memories 67 4. . . . . . . . . . 2. . . .2 Memory split example .6. . . . . . . . . . .2 Setting up libraries for synthesis . . . 3. . . . . . . . . .4. . . . . . . . . .4 Compile strategy . . . . . . . . . . . . . . . . 3. .3 DesignWare . . . . . . . . . .4 Advanced topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3. . . . . . . . . . . . . . . . . 2. . . .1 Data types . . . . . . . . . . . . 83 5. . . . . . . . . . . . . . . . . . . . . . 85 5.6 Multiplication . . . 79 5. . . . . . . . . . . . . . . . . . . . . . . . . . .1 Basic VHDL example . . . . . . . . . . . . . . . . . . . . .5 Coding style for synthesis . . . . . . .2. . . . . . . . . . . . . . . . . . . . . . . 70 5 Synthesis 75 5. .1 Arithmetic Operations using DesignWare . . . . . . . . . . . . . . . . . .4. . . . .

. . . . . . . .7 Power Routing .3 Import to SE . . . . . . . . . . . . . . . . . . . . . . . .2 Patches .2. . . . . . . . . . . . . .4 Creating a mac file . . . . . . . . . . . . . . . . . . . . . . . . . . . 7. .1 Model Checking . . . . . . 6. . . . . . . . .2 Areas of Use for Formal Verification Tools 8. . . . . . . . .7 The ALU example . . .5. . 8. . 7. . . . . . . . . . .2. . . . . . . . . . 7. . . . . . . . . . . . . . .2 Optimization . . . .5. . . . . . . . . . . . . . . . . . . . . . 8. . 7. . .1 Sources of power consumption .2. 8 Formal Verification 8. . .1 Concepts within Formal Verification . . . . . . . . . . . . Coding style for loop statement . .2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 Operand isolation . . . . . . . . . . . . . . . .5 Scripts . . . . . . . 6. . . . . . 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1 Architecural issues—Pipelining and parallelization 7. . . . . . . . . . . . . . . . 6. . . . . . . . . . . 6. . . . . . . . . . . . . . . . . . . .3 Power estimation with Synopsys Power Compiler . . . 8. . 7. . . . . . . . . . . . . . . . . . . . . . . . . 7. . . . . . . . . . . . . . . . . 7. . . .1. . . .3 Formal Verification . .8 Connecting Rings . . . . 7. . 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6.1 Design flow . . . . .2 Equivalence Checking . . . . . . . . . . . . . 7. . . . . . . . . . . . .6. . . 7 Optimization strategies for power 7. . . . . . .2. . . . . . . . . . . . . . . 7. 7. . . . . . . . .5 Floorplanning .1. . . . . .2 nc-verilog . . . . . . . . . . .1. . . . . . . . . . . . . . . . . . . . 8. . .5. . .1 Formal Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7. . .5. . . . .6 Commands in DC . . . . . . . . . . . . . . 6. . . . . . . . . . . .1. . . .6 Block and Cell Placement . . . .1 5. . . . . 89 90 93 93 93 96 97 99 101 103 106 107 109 110 111 112 113 113 113 114 114 115 118 119 120 121 122 123 125 125 126 128 128 130 131 131 133 134 135 135 136 136 137 137 137 137 137 6 Place & Route 6. . . . . 7.2 Clock gating . . . . . . . . . .10 Filler Cells . . . . . . . . . . . . . . .3 Calculating the power consumption . . .12 Verification and Tapeout . . . . . . . —5— . .1 Some useful DC commands . . . . . . . . . . . . . . . . . . . . . .2. .4 Leakage power optimization . . 6. . . . . . . . . . . . . . . . . . .11 Clock and Signal Routing . . . . . . . . . . . . . . . . . . 7. . . . . . .2 Starting Silicon Ensemble . . . . . .5 Beyond clock gating . . . . . . . . . . . . . . . . . . . . . . . .13 Acknowledgements . .3. . . .1.2 Static power consumption . . . . . . . . . . . . . . . . Lund Institute of Technology . . . 6.3. . . . . . . . .1. . . . . . . . . .2 Information provided by the cell library . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6. . . . . .Contents 5. . . . . . . .2. .1 Switching Activity Interchange Format (SAIF) . . . .1 Introduction and Disclaimer 6. . .1 Dynamic power consumption . . . .2 The System Model . . . . 7. . . .2 Coding style if statement .4 Power-aware design flow at the gate level . . . . . . . . . . . . . . 6. . . . . . . . . . . . . . . . . .4 Formal System Construction . 8. . . . . . . . . . . . . . . . . .9 Clock Tree Generation . 6. . . . . . . . . . . . . . . . . . . . . . . 7. . . . . . . .3. . . . . . . . . . . 8. . . . . . . . . .

. .Contents 8. . .3 FPGA-Prototypes . 8. . . . . . .1 Running the simulation in a cluster of several 8. . . . . . . . . . References . . . . . . computers . . . . . . . . . . 138 138 138 138 138 138 139 141 143 143 145 146 148 9 Multiple Supply Voltages 9. . . . . . . . . .1. .3 8. . . . . . . .1. . . .4 8. . . . . . . . . . Other Methods to Decrease Simulation Time . . . . . . . . . . . . . . . . .5 8. 9. Future . . . . . . . . . . . . .2 ASIC-emulators . . .3. . . . . . . . . . . . . . . 8. . . . . . . . . . . . . . . . . . . . . . . 9. . . . . .1 Editing Power Connections in the Netlist . . . . . . . .3. . . . . . . . . . . . .1 Demonstration of Dual Supply Voltage in Silicon Ensemble 9. . ElectroScience —6— . . . . .3 Power and Signal Routing . . . .3. . . . . . . . . 9. .2 Floorplan and Placement . . . . . . . . . . . . . .4 Filler Cells .2. . . . . . . . . . . .1.1. . .3 Limitations in Formal Verification Tools .

we will try to describe the design flow from developing code to chip layout. see Figure 1. It could be on signal processing. system level design. The manual is divided into the following main sections: Specification Floating/Fixed Point Modeling C/C++ MATLAB Test Vectors Behavioral Function VHDL/Verilog Mentor Graphics: ModelSim RTL Coding RTL Function Synopsys: Design Compiler Standard Library Synthesis Constraints Prelayout Function Timing Timing Information Cadence: Silicon Ensemble Standard Library Layout Postlayout Function Timing Timing Information Tape out Figure 1: Common digital ASIC design flow. VHDL and other programming languages or arithmetic.Preface When buying a book on hardware design. In this manual. the focus is often limited to one area. —7— Lund Institute of Technology .

This chapter also describes how to develop platform independent code. A design-rule check (DRC) verifies that the fabrication constraints are met and finalizing your design for fabrication. Place & Route Place and Route is the final step before sending the design for fabrication. are added. The design methodology described here is developed by Jiri Gaisler at Chalmers. small area or a combination of these. more an overview and a selection of the most important parts of the language. In order to better understand the effect of setting constraints. and covers the basic synthesis concepts and guidelines for design optimization. This is however not a complete reference to the language itself. low power. Arithmetic The chapter on arithmetic describes how to infer a component for an arithmetic operation in a digital design. supporting several technologies such as programmable logic and various cell libraries. followed by environment setups for synthesis. Each memory hierarchy should be optimized for high speed. VHDL coding style is mentioned as a guideline for better synthesis results. the synthesis process is described. depending on the application. The standard cells determined during synthesis are placed automatically on the cell core. However. At last. ElectroScience —8— . The most commonly used arithmetic operations are described and the connection between the VHDL description and the final hardware structure realized during synthesis is illustrated. Chapter 2 presents a strucured way of writing and using VHDL. It starts with the basic concept of what synthesis is. Instead three different designs are presented in this Chapter in order to show possible optimizations to various memory structures. including caches and off-chip memories. Memories One of the most important topics in digital ASIC design today is memories. Location of the pads and placement of the on-chip memory has to be defined and carried out by the user. Synthesis This chapter gives a overview of the whole synthesis process. has to be considered.Contents VHDL Chapter 1 presents an introduction to VHDL and the basic concepts of describing and simulating hardware. The core supply. power and ground rings. and wireload model basics. Not only the amount of memory but also the memory hierarchy. This chapter gives the reader an overview on how the determine a highly effective hardware structure by coding and controlling the synthesis tool. this is a too large topic to cover in this compendium.

It is also shown how the design tool interacts with information from the cell library and external simulation tools to estimate and optimize power at gate level. After a short review of the sources of power consumption in a digital circuit. —9— Lund Institute of Technology . tool-independent optimization techniques are presented for different abstraction levels. For those who like to probe further.Contents Power Estimation This chapter gives an introduction to power estimation and optimization techniques in an ASIC design flow with Synopsys Power Compiler. additional information is found in the reference list at the end of the chapter.

ElectroScience — 10 — .

usually with a new version number and a higher price tag. VHDL has been standardized by IEEE and the most widely used version is called VHDL’87 (1079-1987). i. the book written by J. When problems arise. The main focus is functionality. since it is only a small part of the program language that can be synthesized to hardware. The focus of this chapter is not to present VHDL in detail. J. or lowest resource requirements.Chapter 1 Introduction to VHDL This chapter is a brief introduction to hardware design using a Hardware Description Language (HDL). memory blocks and input/output connections will affect the size of the design and therefore also the manufacturing cost.e. A programming language often specifies more functionality than actually needed. lots of money. Each mm2 of a chip costs money. In 1994. The degrees of freedom compared with software design have dramatically increased and must be taken into account. In the next chapter. The demands on hardware design are high compared to software. Bhasker [1] is recommended as first choice. an updated — 11 — Lund Institute of Technology . Pascal. or at least very tricky. or other software languages. the basics parts of VHDL programming that can be used in hardware design will be presented. In this chapter. A language describing hardware is quite different from C. but it is still not uncommon that software programs can behave quite unexpected. the functionality must be correct and in addition how the code is written will affect the size and speed of the resulting hardware. to patch hardware after fabrication. Ashenden [2] and is used at many Universities. Often it is not possible. Clearly. sharing the same resources. A software designer using a HDL has to be careful. This is especially true for VHDL.1 Background VHDL is a acronym which stands for Very High Speed Integrated Circuit Hardware Description Language. There are numerous available books that addresses VHDL coding and to probe further. optimal memory management.. new versions of the programs are distributed by the vendor. Another book worth mentioning is written by P. A computer program is dynamic. 1. a design methodology is applied and a structured way of writing VHDL will be presented. allocating resources when needed and not always optimized for maximum speed. The amount of logic cells.

1: Event-driven simulation of a small design. 0 0 1 1 0 1 1 1 0 1 0 1 1 1 0 0 Stable t = T − 1 New inputs t = T t=T +∆ t = T + 2∆ Figure 1. The entity. Therefore. The architecture. packages to store common or global information. it is easy to debug and understand the behaviour of the program. However.e. Introduction to VHDL version called VHDL’93 introduced new functionality and a more symmetric syntax.. The component that is used to include a building block as part of a new building block.2 Event-driven simulation A computer program running on a microprocessor executes one instruction at a time. 1. i. the master clock. The smallest time quantum is called delta (∆). Finally. A compiler translates the source code into machine language and executes the instructions sequentially.. ElectroScience — 12 — .g. This simulator understands the concept of time and delay. 1. which is the time it will take to assign a value to a signal. Hardware is by nature concurrent.e. describing the structure or behaviour of a block. and complex numbers.1 shows an example of event driven simulation.e. and other features not directly associated with hardware design. while testbenches and other nonsynthesizable parts can be written using VHDL’87/93. updating the system registers on a rising or falling edge. but is not even close to being synthesizable. When an external event triggers the system. Figure 1. describing the interface between building blocks. Instead of compiling the VHDL code into a sequential program. the timescale is incremented with ∆ until all signals are stable. using the VHDL’87 syntax guarantees compatibility with both old and new synthesis and simulation tools. floating point. Although being a hardware description language. It is shown that it takes two (∆) to propagate signals from input to output in this design.3 Design units This chapter will present the five design units in VHDL. The external event could be. it is executed using an event-driven simulator. VHDL still supports file access.Chapter 1.. until there are no more events in the system to process. i. statements are executed in parallel depending on the chosen architecture. This comes in handy when writing reference designs or testbenches. A recommendation is to use VHDL’87 syntax for synthesizable code. i. The configuration associating an architecture with an entity. transferred into physical gates and how they are connected which is equivalent to a gate-level netlist or simply called a netlist. e.

: out std_logic Figure 1. : out std_logic. entity TTL is port( A1.3. the circuit in Figure 1. as shown in Section 1. Y3 <= not (A3 and B3). To use two architecture to describe different behaviourals of the same entity. 1. They come in small packages with a number of pins connected to the outside world.3. end.3 Configuration A configuration specifies which architecture to use for a certain instantiation of the entity. after synthesis. it is useful to write a configuration for post-synthesis simulations.. binary counters and flip-flops. see Chapter 6 for more details. This way. Power and ground signals will not be part of the design flow until the design is placed and routed. since it is not synthesized. and finally after place and route. since it is not always supported in the synthesis tool.B4 Y1 .3. Y4 <= A4 or B4. selecting and simulating the gate-level netlist generated by the synthesis tool. input or output. and in this case no configuration is needed. The entity describes the interface..2: The entity describes the interface of the block. For example. : in std_logic. end.B1 .2. and the output signals. to use it in the testbench is okay. Y3 <= A3 or B3. Boolean logic. : in std_logic.2 Architecture Every entity has one or several architectures describing the behaviour of the block. Normally. architecture OR of TTL is begin Y1 <= A1 or B1. A4. there is only one architecture for each entity.3.2 the interface of a TTL circuit is described in VHDL. Design units 1. — 13 — Lund Institute of Technology . However. Y2 <= A2 or B2. end. the input signals.. However. is not commonly used. it is possible to create several configurations to simulate before synthesis.1 Entity The concept of VHDL coding can be compared to building a design using discrete logical circuits and wires. However. This can be compared to an entity in VHDL. without any information about the functionality. inverters. Y2 <= not (A2 and B2). i.e.1.. architecture NAND of TTL is begin Y1 <= not (A1 and B1).3. In Figure 1. 1. using the same entity but replacing the architecture with a description of 4 NOR gates changes the behaviour but not the interface to the block. Y4 <= not (A4 and B4). The once so popular 74xx circuits are series of TTL logic blocks with different logical functions. All signals except for power and ground are declared in the entity together with signal direction.2 contains 4 NAND gates. Y4 )..

: in std_logic. end for. : in std_logic. If no such architecture exists.Chapter 1.3. The default architecture is selected by the simulation or synthesis tool. end for.. ElectroScience — 14 — . For example.B1 . configuration cfg_post_syn_adder of tb_adder is for behavioural for iadd: adder use entity WORK. the code could look like this: entity larger_design is port( A : in std_logic.. The component declaration is either included in a package or placed in the architecture part of the code before begin. : out std_logic.ADDER(STRUCTURAL).3.5 Package A package can be used to store global information such as constants. A4. A package in VHDL is like the header files used in C.. package MY_STUFF is constant ZERO : std_logic := ’0’. end. types. to include the previously mention TTL design in a new larger design. Y4 ). constant ONE : std_logic := ’1’. end for. : out std_logic begin -. the entity will be linked to an empty black box. Y : out std_logic).ADDER(SYN_STRUCTURAL). ----- configuration before synthesis testbench architecture name DUT instantiation name point to behavioural adder ----- configuration after synthesis testbench architecture name DUT instantiation name point to synthesized adder If no configuration exists. function invert(v : std_logic_vector) return std_logic_vector. The package is separated in a declaration and a body. Introduction to VHDL configuration cfg_pre_syn_adder of tb_adder is for behavioural for iadd: adder use entity WORK. The body contains implementations of the functions in the declaration part. end comonent. architecture my_first_arch of larger_design is component TTL port( A1. end for. component declarations or subroutines/functions.use TTL here end.4 Component Component declarations is used to include a pre-made design as part of a new design. a default architecture will be selected. end.B4 Y1 . taking the most recent compiled architecture with the exact same name and port description as the entity. To be able to use the pre-made design in the new design it has to be declared in the code as a component. This could be used for inserting components in the empty space after synthesis. 1. 1.. end.

— 15 — Lund Institute of Technology . package body MY_STUFF function invert(v : variable result : begin result := not v.from the function invert The package can be included at the beginning of a VHDL file using library MY_LIB use MY_LIB. is std_logic_vector) return std_logic_vector is std_logic_vector((v’length)-1 downto 0). return result.all.1. end. -.3.Input v is returned bit wise inverted -.my_stuff. Design units end. end.

or in a package if the types are going to be used between entities. type reg_type is record state : state_type. the next step is to connect the building blocks. and an output signal can only be assigned a value. of . START.4. which makes it possible to pass constant values to a module and to dynamically configure the wordlength of input and output signals in the port map. the direction is not specified. Inputs and outputs described in the entity are also signals but with a predefined direction. only the most useful types will be described. is type state_type is (IDLE.0.. while std logic is defined in a package standardized by IEEE. To simplify this section.std_logic_1164. Signals are used as interconnects between ports and are assigned a value using the <= operator.state := IDLE Representation 32-bit number max U. routing the traces between the individual chips. Subtypes are limited sets of a type. Finally. There are two reasons to use subtypes.U. variable. Types and subtypes can be declared in the architecture.all. Records are structures containing several values of any type. Signals are like wires on a PCB (Printed Circuit Board).Z. For signals declared in the architecture body.2 Signals Signals can be viewed as physical connections between blocks and inside modules. end record.4.. use IEEE.H.. begin -. subtype byte is integer range 0 to 255.W. Integers and enumerated data types are built-in (package standard).any legal name 1. ElectroScience — 16 — . constants and variables will be presented.Z..1. STOP).std_logic package ------ enumeration 8 bit integer 8 bit vector define record using types above Type integer std logic std logic vector enumeration record Description numbers a single bit bit vector set of names set of types Assigning i := 32 v := ’0’ v := ”00” s := IDLE r.1 Type declaration A constant.0. One. signal. or generic can be of any type. specified in the entity. The entity describes the interface of each block and these interfaces are connected using signals.L. to give informative names to the subtypes and secondly.1. Introduction to VHDL 1.g. subtype data_type is std_logic_vector(7 downto 0).. This chapter also presents generics. and will be further discussed in the next chapter about design methodology.4 Data types and modelling After describing the design units.X.Chapter 1. data : data_type. Signals are connected to the building blocks with port maps that specifies which input/output is connected to which signal.H.X. the range 0 to 7 is a subtype of the type integers. The value of an input signal can only be read. the simulator will raise a warning if values exceeds the subtype range. architecture . e.L.W. 1. library IEEE.

begin signal c. creating dynamic designs.WIDTH is given the value 9 and the signals c. The generic value of a specific instantiation is given a value with the generic map command that is similar to the port map command..e. it will improve the readability and simplify the maintenance of the code.1. For example. end.. constant VDD : std_logic := ’1’.4 Constants It is often useful to define constants instead of hard-coding values in the VHDL code.3 Generics One important feature in VHDL is the possibility of building generic components. Data types and modelling entity INVERTER is port( x : in std_logic.4. If the constant names are chosen with care. If the adder component is reused in several places in the design. when designing a component for adding two values.boolean function 1.4. Therefore. ). end component. MEM ADDR BITS to describe the width of an address bus to a memory. constant MAGIC : std_logic_vector(3 downto 0) := "1010". port( a : in std_logic_vector(WIDTH-1 downto 0). . begin internal <= ’0’ when x = ’1’ else ’1’. ----in package logical zero logical one constant vector — 17 — Lund Institute of Technology . end. . d. end entity.g. architecture struct of ALU is component adder generic( WIDTH : integer ). z : out std_logic_vector(WIDTH downto 0) ).the in and out ports of the adder instantiation adder_1. it would be inconvenient if the number of bits of the input operands were static values.using WHEN statement -. package my_constants is constant GND : std_logic := ’0’.d : std_logic_vector(8 downto 0). The value of a generic parameter can be used in the port specification of the entity as well as in the architecture.. no direction is specified -. a static approach would force the designer to create several versions of the same component with various wordlengths. the entity can be described using generic values in addition to the interface. i. z => q).. y : out std_logic.4. architecture behavioural of INVERTER is signal internal : std_logic. and q are connected to -. Constants can be declared in a package or in the architecture. ----component declaration generic parameter signals controlled by generic parameter 1. -. adder_1 : adder generic map(WIDTH => 9) port map(a => c. while constants defined in a package can be used as global values. signal q : std_logic_vector(9 downto 0). b => d. -.. b : in std_logic_vector(WIDTH-1 downto 0). e.internal signal. Constants declared in the architecture are only local. y <= not x..

-. For more details of processes see Section 1. Introduction to VHDL architecture arch_counter of counter is constant MAXVAL : std_logic_vector(3 downto 0) := "1111".. but the hardware will be exactly the same in both cases.in architecture -. The input to comparison operations do not have to be of the same length. From a simulation point of view.assume that z=5 and a=3 then a<=z+1. which all change values at the same time.b=4 after 1 delta and b=7 after 2 delta -. ElectroScience — 18 — .g. Assigning a value to a variable has a different syntax than for a signal. begin -. These changes will trigger the process one more time. b.q=4 after 1 delta and 7 after 2 delta end process. b. The advantage of variables is that they are sequentially executed since a value is assigned immediately. -. -..z) -. begin . variables are only used inside a process.changes to a and b will trigger process again q<=b.a=6 after 1 delta b<=a+1. All logical operations. For wordlengths concerning arithmetic operations see Chapter 3. will assign correct values to a.4.. The first time this process is executed due to a change in z.a=6 b:=a+1. Below is an example of the differences using variables and signals. process(a. are performed bit wise and both input most be of the same length. the old value of a plus one. process(z) -. and and or. it is also faster to work with variables than signals.signal q=7 after one delta end process. -. It is important to note that the hardware will be the same whether variables or signals are used! The hardware will not be sequential.sensitive to a. we will show how variables can be used to avoid signal assignment and to change the concurrent behaviour into sequential execution.6 Common operations Table 1. Normally. using the := operator.sensitive to changes on z variable a.b=7 q<=b.4. and q the first and only time the process is triggered by a change in z. When discussing design methodologies.5 Variables Variables can be used to temporarily store values in a process. e. but the smallest input will be extended to the same length as the widest input in the hardware. -. 1. since a and b are in the sensitivity list of the process. The process to the right uses only signals. -. end. The second time correct values will be assigned to b and q.all one vector 1.Chapter 1.b. z begin -. signals a will become z + 1 and b will become 3 + 1. -.b integer. it is only the code that is executed sequentially. Variables are used to increase the readability of the code and to reduce simulation time.assume that z=5 then a:=z+1.1 shows a list of common operators and which data types that can be used with them. but in later versions of VHDL it is also possible to declare shared variables directly in the architecture.5. The process to the left that uses variables. as in software programs.

b. end. the simulation tool will lock up. counters.1. Otherwise. the combinational blocks are separated by sequential logic. and other sequential units. However. A simple MUX is presented below.. SEL 1. sensing changes in the input signals and changing the output signals accordingly. std_logic.. see Figure 1. else q <= b. The accumulator in Figure 1. flip-flops and latches. The command a <= (others => 0 ) is used to assign zero to all bits in a. using one process to describe the combinatorial logic and one process for the sequential unit.. signals that are being read in the process. adders or multiplexers. in in in out std_logic.5.5. -.4 uses a sequential unit to store the sum of all previous input values. independently of the wordlength of a. i. one containing sequential logic and one containing combinatorial logic. continuing execution when the condition is met. A process without a sensitivity list is executed continuously and has to contain some kind of wait statement. The process is sensitive to changes to the input operands a and b and the select signal sel. Therefore. the simulation result will not be correct. end entity. A process can be viewed as a self-contained module. The code example below shows the implementation of the accumulator. The reset signal forces the sequential unit back to zero.g. A hardware architecture can contain an arbitrary amount of processes. std_logic. to the sensitivity list.sensitive to changes on A. Normally. B.5 Process statement Processes are used to increase code readability. If a signal in the sensitivity list changes. combinatorial logic All the examples so far have demonstrated combinatorial circuits. end process.3. std_logic architecture behavioural of mux is begin process(a. for example Boolean functions. e. flipflops and latches. updated only on the rising or falling edge of the clock signal. It uses a feedback signal to add the current input to the accumulated sum. By combining sequential and combinatorial logic. to increase code readability it is recommended that a maximum of two processes are used. A process has a sensitivity list which reacts to changes in the input signals. reduce simulation time. e. entity mux is port( a : b : sel : q : ).e. and for the synthesis tool to recognize some common hardware blocks. all statements in the process will be executed once. entity ACCUMULATOR is — 19 — Lund Institute of Technology . it is possible to describe state machines.sel) begin if (sel = ’1’) q <= a. A wait statement could for example be wait until sel=’1’. Otherwise.g. it is very important to add all dependent signals. Process statement 1.1 Sequential vs. end if.

4: Simple accumulator with reset. else sum <= feedback + d. architecture MY_GOD_I_HOPE_THIS_WORKS of ACCUMULATOR is signal sum. end if. end process. feedback) begin if (rst = ’0’) then sum <= (others => ’0’). d.assign ’0’ to all vector elements -. d 0 1 0 q clk rst Figure 1. begin comb: process(rst. q <= feedback. Introduction to VHDL clk clk clk Figure 1.update on the rising clock edge ElectroScience — 20 — . std_logic. end process. generic( BITS : port( clk : in rst : in d : in q : out end. end.3: Combinatorial logic divided by sequential registers. The register is sequentially updated by the clock signal. seq: process(clk) begin if rising_edge(clk) then feedback <= sum. end if. std_logic.Chapter 1. std_logic_vector(BITS-1 downto 0)).sequential update -. feedback : std_logic_vector(BITS-1 downto 0). integer ). -. std_logic_vector(BITS-1 downto 0). The adder and MUX are the combinatorial part of the circuit.combinatorial update -.

A frequently used signal processing element is the CORDIC (Coordinate Rotation Digital Computer). Small logical cells build arithmetic units such as adders... architecture . FIR begin myadd1: ADDER generic map ( WIDTH => 4) port map( A => A.3. Instantiate the adder in the FIR architecture (after begin) 6.2. end. of ADDER is . Changing the filter size would require manual work to instantiate a different number of components.6... In the case of an FIR filter.9.. Using previously designed components in a new component is called instantiation. FIR FIR architecture .5: FIR filter instantiating a generic adder component. the number of components depends on the actual filter size. of FIR is component ADDER . an FIR filter instantiates a number of adders and multipliers and connects wires between the instantiated blocks. Connect the signals in the port map to create a filter function ADD entity ADDER is . Z => Z ). 1 2 3 4 5 Figure 1. The generate statement.. then using (instantiating) these basic blocks to build more complex designs. Create a new entity and architecture to the FIR. end.... For example. 5.. first specifying the smallest and most fundamental building block.. Include a component declaration of the adder. for example a direct map FIR filter. The — 21 — Lund Institute of Technology . The procedure for instantiation is presented below. useful for evaluating trigonometrically functions. in the FIR architecture (before begin).. end.. entity ADDER is . This way of designing is called bottom-up. 1. An N bit CORDIC element can be implemented as a time multiplexed unit performing N iterations or by pipelining N units. and several signal processing algorithms can form a complete system.. Specify the adder architecture. Instantiation and port maps 1. can be used for instantiating blocks or connecting signals using the loop counter as index. It performs a series of pseudo-rotations based on shift and add operations. Examples of how to instantiate components can be found in Section 1. myadd2: ADDER . 2.6.6 Instantiation and port maps Hardware design is well suited for a hierarchical approach. similar to the for statement. ADD entity FIR is . However. 4..3. B => B. end component.. Create a new entity adder..7 Generate statements When building highly regular structures. end.. Adders are basic building blocks in signal processing.. VHDL has a construct to cope with the situation of dynamic instantiation. as shown in Section 1.. of FIR is . instantiating components by hand would be a cumbersome procedure. end.1. as shown in Section 1.. 3. as shown in Figure 1. end. architecture . begin .. 1.1.

z : data_type.x. STAGE => i ) port map( clk => clk.x. xo => pipeline(i+1). rst => rst.6: A pipelined CORDIC element with N bit precision. yo => pipeline(i+1). end. yi => pipeline(i). end generate. end record.Chapter 1. xi => pipeline(i). begin gen_pipe: for i in 0 to N-1 generate inst_unit: cordic_unit generic map( BITS => N.ctrl. example in this section shows how to instantiate N units and connecting them in a pipeline.z. signal pipeline : cordic_pipeline. y. ctrli => pipeline(i). x. architecture structural of cordic is subtype data_type is std_logic_vector(N-1 downto 0).ctrl.y. Introduction to VHDL ctrl xyz 0 1 N-1 ctrl xyz Figure 1.z ). zo => pipeline(i+1). type cordic_pipeline is array(0 to N) of cordic_data. ElectroScience — 22 — . ctrlo => pipeline(i+1).y. type cordic_data is record ctrl : cordic_control_type. zi => pipeline(i).

8. component test_design port (clk : in rst : in input : in output: out end component.written to the output file inp_sig <= CONV_STD_LOGIC_VECTOR(d.7. as shown in Figure 1.is written to the DUT and the output is read(buffer_in. use IEEE.std_logic_1164.textio. generate test vectors (a set of inputs)./stimuli/output". instantiating a DUT (Device Under Test).result :integer.buffer_in). it is important to verify its functionality. DUT : test_design port map ( clk => clk.8 1.half period = half clock period std_logic.1. -. architecture behav of tb_example is file in_data:text open read_mode is ". file out_data:text open write_mode is "./stimuli/input". std_logic.result). -. std_logic_vector(1 downto 0). -.start value of clk is ’0’ signal rst: std_logic:=’1’. constant half_period: time :=10 ms.at each positive clock edge a new input readline(in_data. e. process(clk) variable buffer_in.. begin if (clk=’1’) and (clk’event) then -. and monitor the output.. output=> ouput_sig).g.all.d). library IEEE. -. signal clk: std_logic:=’0’. i.7: Testbench providing stimuli to a device under test (DUT).2) after 2 ns. Writing the output values to a file makes it possible to verify the result using. -.device pin (clk) connected with (=>) signal in tb (clk) rst => rst. Simulation 1. — 23 — Lund Institute of Technology . variable d. output_sig : std_logic_vector(1 downto 0). use std. The most common method of doing this is to create a testbench. and read/write information to file. Common testbench tasks are to generate clock and reset signals. An example testbench is found below. signal input_sig.1 Simulation Testbenches When writing a design. -.all. begin rst <= ’0’ after 20 ms.. std_logic_vector(1 downto 0)). input => input_sig. Matlab.buffer_out:line.8.e.delay input 2 ns result:=CONV_INTEGER(output_sig). entity tb_exmple is end tb_exemple. clk rst 0101 read d q DUT write 0101 Figure 1.. write(buffer_out.

The waveform window shows how the output value is incremented with d for each clock cycle.8 and Figure 1.accumulator ModelSim> run 1000 ns Figure 1.5. clk <= not clk. Introduction to VHDL writeline(out_data. ElectroScience — 24 — .Chapter 1. process --clock generator begin wait for half_period. end. end process. The screenshots from the program in Figure 1.9 show the simulation of the accumulator example from Section 1. Can be used for compiling and simulating designs.vhd ModelSim> vsim MYLIB. ModelSim> vcom -work MYLIB accumulator. Simulation is performed by compiling a VHDL file using the Modelsim command vcom.8: Modelsim main window. loading the design with vsim and finally starting the simulation with the run command.2 Modelsim Modelsim is a widely used simulator for VHDL and Verilog-HDL.1. All commands can either be written at the prompt or found in the GUI.buffer_out). end process. 1.8. end if.

and some logical functions. it is possible to delay the signal assignment to emulate the behaviour of gates. The architecture for the design example is found in Figure 1.8. — 25 — Lund Institute of Technology . input control signal.10. these lines of code must be discarded. 1. generating a warning message. and -. The ALU supports addition. Design example Figure 1.pragma translate_on after the part you want to hide from the synthesizing tool. and one output for the ALU result. The accumulated values increases with d for each clock cycle. There are two input operands.pragma translate_off before. most synthesis tools understand this syntax and will simply ignore it. subtraction. memories or external interfaces. It could be printing a message in the simulation window. During synthesis. 1. This can be done by placing a special line before and after the non-synthesizable code. or writing important information to a file.9.3 Non-synthesizable code Sometimes it is convenient to include functions that are non-synthesizable during simulation. It is not necessary to use translate off when adding signal delays. During simulation. This is done by setting the following variable in the synthesis script: hdlin_translate_off_skip_text = true An example of non-synthesizable code is signal delays. When using Synopsys. Place the line -.9: Modelsim waveform window. a system variable must also be set in order for Synopsys to understand the translate off and translate on commands.1. However.9 Design example A small design example of an ALU (arithmetic logical unit) will be presented in this section.

integer). when "100" => X <= -(A+M). : : in : in : out integer). std_logic_vector(1 downto 0). end case. end case.R) begin if MUX=’0’ then ElectroScience — 26 — . std_logic_vector(2 downto 0). M X end. when "111" => X <= (A xor M). std_logic.10: Schematic view of the ALU design.B. std_logic_vector(WL-1 downto 0). when "110" => X <= (A or M). end process. ALU unit entity ALU is generic (WL port( ALU_OP A. Shift unit entity SHIFTOP is generic(WL : port( SHIFT : in X : in S : out end. when others => null. std_logic_vector(W_L-1 downto 0). when "10" => S <= X(WL-3 downto 0) & "00". end.ALU_OP. end.M) begin case ALU_OP is when "000" => X <= A.SHIFT) variable S_temp : std_logic_vector(WL-1 downto 0). when "11" => S <= X(WL-2 downto 0) & ’0’. when "001" => X <= A+M. when "011" => X <= M-A. std_logic_vector(W_L-1 downto 0)). when "010" => X <= A-M. architecture BEHAVIOURAL of SHIFTOP is begin process (X. std_logic_vector(WL-1 downto 0)). begin case SHIFT is when "01" => S <= X(WL-1) & X(WL-1 downto 1).R : in M : out integer). std_logic_vector(WL-1 downto 0). when "101" => X <= (A and M). end process. MUX unit entity MUX is generic(WL : port( MUX : in B. architecture BEHAVIOURAL of ALU is begin process (A. architecture BEHAVIOURAL of MUX is begin process (MUX. std_logic_vector(WL-1 downto 0)).Chapter 1. when others => S <= X. Introduction to VHDL B A Mux M Alu X Shift S R R_reg ALU_OP Q_reg Q Figure 1.

end if.S) begin if (clk=’1’) and (clk’event) then if WR_R=’1’ then R <= S. end if. Design example end. M <= B.9. end. out std_logic_vector(WL-1 downto 0)). else M <= R.1. in std_logic. Registers entity REGISTER generic(WL : port( clk : WR_R : S : R : end. is integer). end. in std_logic_vector(WL-1 downto 0). end if. architecture BEHAVIOURAL of REGISTER is begin process (clk.WR_R. in std_logic. end process. — 27 — Lund Institute of Technology . end process.

Chapter 1. ALU_PORT(5 downto 4).S. in std_logic_vector(WL-1 downto 0). in std_logic. S => S. S => S. M X end component. ALU_PORT(2 downto 0). B : ALU_PORT : clk : Q : end. X => X).connect signals -. ALU_PORT(3). R. M_OP: mux generic map (WL => WL) port map (MUX => MUX. end. in std_logic_vector(7 downto 0). std_logic_vector(WL-1 downto 0)). A_OP : alu generic map (WL => WL) port map (ALU_OP => ALU_OP. M => M). R => R). out std_logic_vector(WL-1 downto 0)). S_OP : shiftop generic map (WL => WL) port map (SHIFT => SHIFT. . -. -. B => B.instatiate units ElectroScience — 28 — . -. WR_R. architecture BEHAVIOURAL of ALU_TOP is component alu generic (WL port( ALU_OP A. <= <= <= <= <= ALU_PORT(7). std_logic_vector(WL-1 downto 0). Introduction to VHDL ALU toplevel The ALU toplevel instantiates the previously created designs and connects the port maps using signals. std_logic_vector(2 downto 0). MUX : std_logic. ALU_OP : std_logic_vector(2 downto 0). S => S.declare components : : in : in : out integer). M => M. Q_OP: register generic map (WL => WL) port map (clk => clk. entity ALU_TOP is generic(WL : port( A.X : std_logic_vector(WL-1 downto 0). nteger := 8).. WR_R => OEN. ALU_PORT(6). R_OP: register generic map (WL => WL) port map (clk => clk. SHIFT : std_logic_vector(1 downto 0).. R => Q). X => X). A => A. WR_R => WR_R.declare signals signal signal signal signal begin OEN WR_R SHIFT MUX ALU_OP OEN.M. R => R.

vector. vector. integer multiplication bit. integer greater or equal bit. string Example a := b and c a := b or c a := b nand c a := b nor c a := not b a := b xor c a := b xnor c if (a = 0) then if (a /= 0) then if (a < 2) then if (a <= 2) then if (a > 2) then if (a >= 2) then a := b + ”0100” a := b . integer addition bit. vector logical or bit. vector equal bit. vector. integer exponent (shift) bit. Description Operand type logical and bit. integer greater than bit. vector.9. vector logical xnor VHDL’93 bit. integer not equal bit. vector. integer subtraction bit. vector logical nor bit.”0100” a := b * c a := 2 ** b a := ”01” & ”10” — 29 — Lund Institute of Technology . vector logical not bit. vector. integer less or equal bit. vector. vector. vector logical xor VHDL’93 bit. vector. vector. integer less than bit. Design example Operator and or nand nor not xor xnor = /= < <= > >= + − ∗ ∗∗ & Table 1.1: List of operations. vector logical nand bit. vector. integer concatenation bit.1.

ElectroScience — 30 — .

After having gotten acquainted with VHDL..g. — 31 — Lund Institute of Technology ..Chapter 2 Design Methodology This chapter proposes an HDL Design Methodology (DM) that is based on utilizing certain properties of the VHDL language. As the design grows. not use more than two processes per entity. abstraction level. An effective DM can simplify these procedures. • Decrease design time. Using a proper DM will certainly help to minimize the time spent on debugging. based on the methodology proposed by Jiri Gaisler [3]. In the subsequent sections. since code orientation and single block functionality verification becomes easier. the actual length and number of source files. • Simplify maintenance.. an ad-hoq coding style often arises. As the design grows. the time spent on functionality debugging becomes a larger part of the design time. The readability can be increased by exploring several important properties of HDL design and VHDL. i. As a result. signals and different source file dispositions. • Simplify debugging. hierarchy. it seems that the language contains many useful properties. the source code will have a uniform disposition simplifying code orientation and the way to address a certain design problem. together with the time spent on debugging. etc.e. together with some additional guidelines. The proposed DM main objectives are to improve the following important properties of HDL design namely: • Increase readability. i. the way a Finite State Machine (FSM) is written. This is especially important when several designers are involved in the same project..e. An important issue after having released an IP-block is proper maintenance and support. one combinatorial and one sequential. a DM is proposed. As a result. e. resulting not only in non-synthesizable code but also in many concurrent processes.e. using case statements. which might lead to an advantage over competitive designs. the need for an effective DM becomes evident. i.

One process contains the sequential logic (synchronous). Component instantiation(s). adopting all paragraphs may seem excessive. The sequential part always looks the same but the combinatorial part looks different in every source file due to the functionality of the block. and the other contains the combinatorial part (asynchronous). The proposed DM is based on these properties and they have to be utilized throughout the complete design process. but many will. behavioral logic. • Utilize records.Chapter 2. e. 2. 4. work for all types of designs. Architecture declaration. two processes per entity..1. • Use local variables.e. The guidelines are summarized below: • Use a uniform source code disposition. Constant declaration(s).g. 6.. 5. Entity declaration(s). The benefits of each paragraph will be explained in detail in the subsequent sections. A typical source code disposition is compiled below: 1. • Explore hierarchy. Combinatorial process declaration. e. i. Included library(s).. with some modification.. Due to this fact. Globally Asynchronous Locally Synchronous (GALS). registers. • Use synchronous reset and clocking. Furthermore. • Utilize sub-programs. the presented guidelines are applicable to any synchronous single-clock design.1 Source code disposition Limiting the number of processes to two per entity improves readability and results in a uniform coding style. i. but as the skill of the designer increases. Design Methodology 2. Furthermore. 2. the sequential part is always put after the combinational part. 7.e.1 Structured VHDL programming Structured VHDL programming involves specific properties of the VHDL language as well as addressing design problems in a certain way. Signal declaration(s). 3. At first glance. the individual benefits will become evident. This section will propose some simple guidelines to achieve the goals mentioned in the previous section. the guidelines should be seen as recommendations and following them will most certainly result in synthesizable and well structured designs. ElectroScience — 32 — .g.

see example below: sequential_part : process(clk) begin if rising_edge(clk) r <= rin. A simple way of avoiding this is to use records since once the signal is included in a record. integers. component declaration.e. etc. updating the registers.1. which obviously can become quite hard to keep track of as the number of lines of code increase.3 Clock and reset signal The clock signal solely controls the sequential part. Sequential process declaration . i. other records..2. entity. i. The sequential part always looks the same and should only have the clock signal in the sensitivity list. many clock tree generation and timing analysis tools do not support clock signals that are part of a bus. 2..1. code orientation. which otherwise can lead to unnecessary latches in the synthesized design.e. sensitivity list.e. This is due to the fact that the record is already included in the necessary places and the actual signal is hidden within the record. A record in VHDL is analogous to a struct in C and is a collection of various type declarations. thus enabling reuse. This is mainly because the clock signal is treated differently by the synthesis tools. namely synchronous and asynchronous.1. Structured VHDL programming 8. end if.2 Records A typical VHDL design consists of several hundreds of signal declarations. of the source file and should be left out of any record declaration. The declarations should not contain any direction. in or out. all these updates are done automatically. and maintenance. Below is an example how a simple record is defined VHDL: type my_type is record A : std_logic_vector(WIDTH-1 downto 0). Also. end process. end record. i. standard logic vectors. simplifying readability. Using records not only reduces many lines of code but also simplifies signal assignment since none of the default signal assignments are forgotten.g.. component instantiation. There are two methods of implementing the reset signal. B : integer. Z : record_type. There are also two places to put the reset assignment in the — 33 — Lund Institute of Technology . e. The record declarations should be compiled in a declaration package that can be imported to the various files. 2.. since it is routed from an input pad throughout the complete design. The source file disposition should always have this given order. Each signal has to be added on several places in the code.

there is no general methodology to explore hierarchy rather that it can and should be utilized in every design. enabling debugging of each block individually. and should be triggered on a logic zero. In the proposed DM. end process. The following example illustrates where the synchronous reset assignment is placed in the source file: combinational_part : process (various signals and records. In VHDL.Chapter 2.. just before the current state record is updated. Choosing between the two methods and deciding where to put the assignment seems to be a religious matter and many of the arguments for choosing either of them are obsolete. which is a commonly used convention. end if. Each program level is controlled by component instantiation. ElectroScience — 34 — . .. synchronous reset assigned in the combinational part is utilized. i. in the sequential or combinational part.1: 1 A synchronous reset signal can be added in a record since it is then treated as any other signal but it is not advisory since it does not support a uniform coding style... the reset signal should not be included in any kind of record declaration1 . hierarchy can be utilized to split a given problem or model into individual modules.. Design Methodology alu operation shift operation alu (top level) mux operation register operation L evel 1 L evel 2 Figure 2. 2.1.state := IDLE. source code.4 Hierarchy Hierarchy is an important concept and is general for all types of modeling since altering the abstraction level tends to simplify a given problem.. since most of the commercial synthesis tools now support them. if (reset = ’0’) then v. Clearly. . and for the same reasons as for the clock signal. reset) begin . designs or sub programs.1: An example of how hierarchy can be used in the ALU example. A block diagram of how hierarchy can be used in the ALU example can be seen in Figure 2.. The reset assignment is put in the latter part of the combinational block. rin <= v.e.

state := CALC.package declaration function shift_operation (A : in data_type. r.1.. The subprograms are defined in packages and imported into the different source files. please refer to Section 2. case r.1.2..2. To see the complete code of the ALU example.. thus increasing readability. end. hiding complexity and enabling reuse.. Basically. Then. Subprograms can be used to increase the hierarchy and abstraction level.5 Local variables By utilizing local variables instead of signals in the combinational part. Below is an example of how to use local variables in the ALU example: ALU_example : process("additional signals and records". . Using local variables will also increase simulation speed since simulation is often event triggered and the tool can execute events in order of appearance and does not have to consider parallel events. 2. control : in control_type) return data_type. end case.. This is mainly to avoid long logic paths and combinational feedback loops. m : in data_type. -.. begin v := r.package body — 35 — Lund Institute of Technology . reset) variable v : reg_type. the input signal to the registers rin is updated to the computed values of the local variable v. Below is an example of how subprograms can be used in the ALU example: -. i.state is when IDLE => v. This is advantageous since humans find it easier to think in sequential terms rather than in parallel.6 Subprograms Using sub-programs is often a good way of compiling commonly used functions and algorithms. each line of code is executed in consecutive order during simulation. Structured VHDL programming 2.1. the local variable v is set to the current values of the registers stored in r. a given design problem is transformed from concurrent to sequential programming. rin <= v. the local variable v is used throughout the combinational part in a sequential programming fashion. A restriction in the use of subprograms is that functions should only contain purely combinatorial logic.e. there is only one restriction when using local variables namely that a computational result stored in a variable should not be reused within the same clock cycle but rather stored in a register to be used in the succeeding clock cycle. . Finally.. . Initially. Variable and signal assignments should be placed exactly as in the example above.

7 Summary The previous sections have proposed guidelines to establish a sound DM to be used throughout the whole design process. a uniform coding style can be established to simplify and increase readability. Finally. when "001" => return A + m. debugging. when "110" => return A or m. when "111" => return A xor m. As a result. 2. • Improved simulation and synthesis speed. when others => return null. adopting proposed guidelines hopefully results in synthesizable and well disposed designs that will help the design team on the way towards the final goal. when "100" => return (others => ’0’). • Altering the abstraction level. control : in control_type) return data_type is begin case control is when "000" => return A. There are also several other benefits of adopting this DM where most important are summarized in the table below: • Less lines of code. end case. Design Methodology function shift_operation(A : in data_type . hiding complexity. which is a robust design using a minimal design time.1. end function shift_operation. and maintenance. Each subprogram can be debugged individually with a separate testbench to ensure correct functionality of the block.m. when "011" => return m.Chapter 2. ElectroScience — 36 — . and enabling reuse. when "101" => return A and m. • Parallel coding is transformed into sequential. m : in data_type. when "010" => return A .

"011". All parts that are not standard cells and pure logic.std_logic_signed. : : : : : in in in in out std_logic.std_logic_1164. "11". Memories and pads are instantiated in the VHDL code by name. fb : std_logic. "100". avoid hard coding the memory name or pad name in the design. for example on-chip memories and the pads connecting the design to the outside world. library IEEE. shift : std_logic_vector(1 downto 0). component alu port ( clk A B control Q end component. use IEEE. "010". constant constant constant constant constant constant constant constant constant constant constant constant ALU_STORE ALU_ADD ALU_SUB ALU_FB ALU_CLEAR ALU_AND ALU_OR ALU_XOR : : : : : : : : std_logic_vector(2 std_logic_vector(2 std_logic_vector(2 std_logic_vector(2 std_logic_vector(2 std_logic_vector(2 std_logic_vector(2 std_logic_vector(2 : : : : downto downto downto downto downto downto downto downto 0) 0) 0) 0) 0) 0) 0) 0) := := := := := := := := "000". end. package alu_pkg is constant WL : integer := 8. "10". The ALU code is divided into a global package and the actual implementation.2 Example The same example as in the previous chapter is now presented. For example.for signed add/sub SHIFT_NONE SHIFT_RIGHT SHIFT_2LEFT SHIFT_1LEFT std_logic_vector(1 std_logic_vector(1 std_logic_vector(1 std_logic_vector(1 downto downto downto downto 0) 0) 0) 0) := := := := subtype data_type is std_logic_vector(WL-1 downto 0). Example 2. which selects the proper component based on the user constraints and the chosen technology.3. and can be found below. using the proposed design methodology. data_type ). use IEEE. The rest of the ALU implementation is presented in Section 2. type control_type is record oen : std_logic. "110". Then let the mapping file choose the correct memory — 37 — Lund Institute of Technology . "101". A simple solution is to use abstraction.5. 2. end record.2. are technology dependent. The package contains constants and type declarations to improve the readability and portability. -.all.2. Start with creating a mapping file. the target technology may not yet be known or changed at a later stage.3 Technology independence During code development. data_type. alu_op : std_logic_vector(2 downto 0). "01". data_type. mux : std_logic. when creating a memory.for std_logic -. specify only the number of data bits and address bits in the code. "111". "00".all. At the same time. "001". control_type. An important design aspect is therefore to keep the code free from parts that are technology specific.

. mem: gen_ram generic map ( ADDR_BITS => 5.set technology here ElectroScience — 38 — .13um memory by name Design. d : std_logic_vector(DATA_BITS-1 downto 0).tech_umc13. instead of the actual memory.all. addr. use work. entity gen_ram generic map ( ADDR_BITS : integer.tech_map.. port map ( clk : std_logic. memories and pads. port map ( clk. q : std_logic_vector(DATA_BITS-1 downto 0) ). The mapping file contains the following information: use work. ).vhd Instantiate and connect UMC . tech_virtex. instantiate a memory from the mapping file as: use work. d. end. In your design. if (target_tech = VIRTEX) then mem: virtex_ram . UMC13 ).2: A chip layout with standard cells (logic). type target_type is ( VIRTEX. 16 ) port map ( . addr : std_logic_vector(ADDR_BITS-1 downto 0).all... if (target_tech = UMC13) then mem: umc13_ram .vhd tech_umc13. Design Methodology pads logic memory Figure 2. Instantiate and connect Virtex memory by name mem: gen_ram generic map ( 6. end.3: Using a mapping file to handle target specific modules.available technologies -. architecture structural of gen_ram is -. q ).all.. end. constant target_tech : target_type := VIRTEX. DATA_BITS => 16 ).Chapter 2.vhd Figure 2. DATA_BITS : integer ). based on the current technology and by using your specifications about the address and data bus. The instantiation will point to the technology mapping file.each technology -.vhd target_tech tech_map.one library for -.tech_virtex. which selects and connects the chosen technology dependent module..

several different pad types are usually required. d. However. Schmitttrigger.3. pad insertion is only a couple of lines in the synthesis script.2 ASIC Pads For smaller projects. use: dc_shell> dc_shell> dc_shell> dc_shell> dc_shell> set_port_is_pad all_inputs() set_port_is_pad all_outputs() set_pad_type -exact IBUF all_inputs(). library files and physical description files. Technology independence begin if (target_tech = VIRTEX) generate mem: virtex_ram generic map ( ADDR_BITS. q ). Physical size phys GDS II LEF PaR DB SYN VHDL SIM Verilog netlist MEM GEN comp entity Figure 2. set_pad_type -exact B2CR all_outputs(). port map ( clk. end generate. and a layout file specifying the size and connections of the memory. for larger designs. The number associated with the output pad is usually the driving strength in mA. if (target_tech = UMC13) generate mem: umc13_ram generic map ( ADDR_BITS. in this case IBUF for input ports and B2CR for output ports. d. end. For ASIC technologies. end generate. insert_pads -respect_hierarchy The only thing you have to select is the pad name from your target library. pads are normally inserted during synthesis. addr. DATA_BITS ).VIRTEX memory -. entity and architecture. 2. port map ( clk.2. For Synopsys. Assigning pads individually and at — 39 — Lund Institute of Technology . The ideal memory generator creates a VHDL simulation model. open-drains or bi-directional pads. names that are technology specific.UMC ASIC memory 2. It could be pads with different driving strength. q ). The memory is instantiated from the VHDL code by using exactly the same name and port connections.3.3. addr. -.4: The memory generator creates simulation files.1 ASIC Memories ASIC memories must be created using a memory generator or by contacting the vendor. The reason is simple: it is less work. DATA_BITS ).

replacing the black box when reading the . Usually the on-chip memories are small.EDF file SYN . The tool is called Core Generator and creates modules (. create generic pads in a mapping file.UCF user constraint file Figure 2. Component description Bitstream EDN ISE CORE GEN VHDL SIM . memories and pads are only two examples of technology specific components. core generator can create memories and multipliers of any size. Xilinx has created a tool for automatically creating small cores that can be used in the design. The solution is to apply the same technique as described for memories. combining the resources in the FPGA. 2. Adding one pad requires changing the synthesis script for all technologies. ElectroScience — 40 — . The modules are instantiated by name (which will create a black box during synthesis). For example. will be difficult.3 FPGA modules For FPGAs.3. Design Methodology the same time creating multiple version. instantiated by name in the VHDL code.edn) that can be inferred during the place and route step.5: Core Generator can create all kinds of modules. but can be grouped to form a larger memory. and the correct pads will be chosen from the mapping file.Chapter 2.edf file. one for each technology. Adding a new pad will only require a pad instantiation at the top level of your design. and point to the current technology.

6: Core Generator main window. Technology independence Figure 2.3. 2. The synthesis tool can therefore handle a lot of things automatically. Generated cores can be instantiated in the VHDL code.4 FPGA Pads Memory and pad insertion is more simple for FPGAs than for ASIC. — 41 — Lund Institute of Technology .2.3. The reason is that the synthesis tool for FPGA already knows what kinds of memories and pads that are available for a certain FPGA.

end record. if (control. case control. s : data_type. x.alu_op is when ALU_STORE => x := A. use work. end. entity alu is port ( clk A.feedback enable -. The ALU is implemented using the two process methodology.shift is when SHIFT_NONE => s when SHIFT_RIGHT => s when SHIFT_2LEFT => s when SHIFT_1LEFT => s when others => null. end process.use ALU types : : : : in in in out std_logic.update registers ElectroScience — 42 — . -. end if.all registers -. signal r. rin <= v. -. control) variable m..drive output from register -. B. end process. . rin : reg_type.output enable -.fb = ’1’) then v. architecture behavioural of alu is type reg_type is record output : data_type.select input or feedback -. if control. data_type.output := s. All sequential elements are declared in a record type. end if.all.alu_pkg. data_type ). begin v := r.output. Q <= r. end case. when ALU_SUB => x := A . else m := r. process(clk) begin if rising_edge(clk) then r <= rin.oen = ’1’) then v. when others => null.arithmetic operation := := := := -. -.Chapter 2.feedback := s.. x(WL-2 downto 0) & ’0’. feedback : data_type.shift operation x. begin process(r.5 ALU unit The ALU unit imports the package on the previous page and uses the type and constant declarations. end case. variable v : reg_type. x(WL-3 downto 0) & "00". Design Methodology 2. if (control. when ALU_ADD => x := A + m.m. separating the sequential and combinatorial logic.feedback. A. x(WL-1) & x(WL-1 downto 1). control_type.3. end if. B control Q end. case control.mux = ’1’ then m := B. end if.

arithmetic describes the four fundamental rules of mathematics: addition. an ALU can perform several other operations. performance or power.all. subtraction.1 Introduction In the original meaning. 3. use IEEE.Library clause in VHDL library IEEE.2 VHDL Packages Operations and data types are specified in the VHDL language in packages and are made accessible by use of a library clause at the beginning of the VHDL file.std_logic_1164. Design Compiler or DC for short. The description in this chapter will concentrate on the VHDL data types based on an array of std logic and the operations that can be performed on this specific data type. This is decided in the step which hardware designers refer to as synthesis. Although the VHDL language specify both the data type and the operations to be performed on this type. the data type and the operation is specified in the VHDL language. multiplication and division. this chapter addresses the implementation of arithmetic operations in a digital ASIC. In addition to the four fundamental rules of arithmetic. When it comes to hardware design with VHDL. To understand the implementation of arithmetic operations we must first consider and distinguish two important concepts: the operation and the data type. it does not say anything about how the operations are to be implemented as a hardware structure.g. — 43 — Lund Institute of Technology . as we want an efficient hardware structure as possible to perform the operation. e. -. a device that performs these operations is called Arithmetic and Logical Unit (ALU). In computer technology. Arithmetic and logical operations are of course very important to understand when designing hardware. which in this chapter will be discussed in the context of the Synopsys synthesis tool. Hence. Different implementations of a particular operation can be distinguished in terms of area. shifting and logical operations.Chapter 3 Arithmetic 3.

In std logic 1164. These types have definitions identical to the type std logic vector. ’0’. are: • std logic 1164 • std logic arith • std logic signed • std logic unsigned The standard package called std logic 1164 is an IEEE standard and the other three are Synopsys de facto industry standards. The type describes a 9-level information carrier. The most important and widely used standard packages when it comes to ASIC design. Table 3. 3. This is an important concept in hardware design. ’-’ Resolved std ulogic Array of std logic Array of std logic Array of std logic −(231 − 1) to (231 − 1) Type std ulogic std logic std logic vector unsigned signed integer Package std logic 1164 std logic 1164 std logic 1164 std logic arith std logic arith standard ElectroScience — 44 — .2.1: Important VHDL types Description ’U’. a resolution function will be automatically invoked to solve the conflict and give the signal a well defined value. the type std logic and the types based on an array of std logic should be used exclusively. is controlled by operator overloading. The standard package std logic arith defines two additional types: signed and unsigned. originate is std ulogic. Arithmetic As there may be several definitions of the same operator in a standard package. ’X’ (forcing unknown). ’1’. ’L’. ’0’ (forcing 0). to describe shared busses. Table 3. by the compiler. When writing synthesizable VHDL code. ’X’. ’L’ (weak 0).1 summarizes the data types and the standard package from which they originate. the important types std logic and std logic vector are defined. ’H’ (weak 1) and ’-’ (don’t care). to find the correct operation from the list of all operations currently accessible through the library clauses in the file. ’W’ (weak unknown). there will always be a defined value of that signal. ’Z’. Operator overloading means that the compiler checks input and output data types. These are the values you can find on a signal when simulating the design in Modelsim. The information carriers are ’U’ (uninitialized). We will briefly explain the content of the packages which also are found in Appendix A.Chapter 3. ’Z’ (high impedance). the actual operator selected. When a conflict occurs in a resolved signal. The data type std logic is a resolved std ulogic. ’H’.1. ’W’. used for writing synthesizable VHDL code. simulation and synthesis. ’1’ (forcing 1).1 Data types The basic data type from which the data types. which means that if several processes assign the same signal. see Table 3.

— 45 — Lund Institute of Technology . In the VHDL language. VHDL Packages 3.2. use IEEE.2. For details about operator overloading and all defined functions. use IEEE. use IEEE. Signed operations are done with two’s complement number representation and unsigned operations with natural binary number representation.std_logic_unsigned. library IEEE.std_logic_signed.all.all. in the type definition.3. -. these cases have been separated in the standard packages. specify the contents and thereby the operation. there are two ways of controlling if signed or unsigned operations should be performed.all. respectively. The other method is to use the types signed and unsigned and hence explicitly.std_logic_1164.3 VHDL examples Here is a VHDL code example where std logic vector is used as the data type in a multiplication.Library clause for signed operations with std_logic_vector library IEEE. -. Hence.std_logic_1164. The operators for signed and unsigned data types are all specified in the std logic arith package. A binary number representation must be specified to perform a predictable operation on the type. use IEEE. several functions for the same operation are defined in each specific package. we refer the reader to Appendix A where all packages described are found. Table 3.std_logic_1164.Library clause for signed or unsigned operations with signed or -.2 Operations It should be familiar to the reader that a number can be coded into a binary representation in different ways.std_logic_arith. Since hardware for unsigned arithmetic is less complex than hardware for signed arithmetic in a given number range. the operation is controlled in the library clause by including the package std logic signed or std logic unsigned to select signed or unsigned operations. -.2 summarizes the arithmetic operations in the packages.2. Whether a signed or an unsigned operator are selected is then controlled by operator overloading.all. use IEEE.unsigned data types library IEEE. use IEEE.Library clause for unsigned operations with std_logic_vector library IEEE. The std logic vector is synonymous to an unsigned by including the std logic unsigned package. the type is just a container for either unsigned or signed numbers.all.all. 3. As already stated. An obvious drawback is that it is not straightforward to mix signed and unsigned operators in the same VHDL file. If std logic vector is used in the design.

library IEEE. Arithmetic Operation ABS + − + − ∗ < <= > >= = /= SHL SHR EXT SXT Table 3.std_logic_1164. entity multiplication is generic(wordlength: integer). use IEEE.std_logic_1164. use IEEE. end multiplication. end.all. The std logic vector is synonymous to a signed by including the std logic signed package. Here is a VHDL code example where std logic vector is used as the data type in a multiplication.std_logic_unsigned. port(in1.all.all. end process. architecture rtl of multiplication is begin process(in1. in2) begin product <= in1 * in2. product : out std_logic_vector(2*wordlength-1 downto 0)).2: Arithmetic operations in Description Arith Absolute value Yes Unary addition Yes Negation Yes Addition Yes Substraction Yes Multiplication Yes Less than Yes Less than or equal to Yes Greater than Yes Greater than or equal to Yes Equal to Yes Not equal to Yes Left shift Yes Right shift Yes Zero extension Yes Sign extension Yes packages Signed Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No No Unsigned No Yes No Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No No use IEEE.Chapter 3. ElectroScience — 46 — .all. in2 : in std_logic_vector(wordlength-1 downto 0).std_logic_signed. entity multiplication is generic(wordlength: integer). use IEEE.

use IEEE. end multiplication. library IEEE. product_s <= std_logic_vector(product_s). use IEEE.all. end process.std_logic_arith. in2) variable product_u : unsigned(2*wordlength-1 downto 0). in2) begin product <= in1 * in2.all. architecture rtl of multiplication is begin process(in1. — 47 — Lund Institute of Technology . Only the std logic arith package are included.3. variable product_s : signed(2*wordlength-1 downto 0). architecture rtl of multiplication is begin process(in1. in2 : in std_logic_vector(wordlength-1 downto 0). end process. product_u : out std_logic_vector(2*wordlength-1 downto 0). entity multiplication is generic(wordlength: integer).Signed multiplication product_s := signed(in1) * signed(in2). product : out std_logic_vector(2*wordlength-1 downto 0)). VHDL Packages port(in1. -. port(in1. product_u <= std_logic_vector(product_u). Hence. end. in2 : in std_logic_vector(wordlength-1 downto 0).2.Unsigned multiplication product_u := unsigned(in1) * unsigned(in2).std_logic_1164. Here is a VHDL code example where signed and unsigned are used as the data types in a multiplication. begin -. end multiplication. no arithmetic operations can be performed directly on std logic vector. product_s : out std_logic_vector(2*wordlength-1 downto 0)). end.

The operator inference is performed by the synthesis tool during the Analyze and Elaborate phase. Table 3.1 Arithmetic Operations using DesignWare In synposys synthesis tool Design Compiler the libraries with pre-built components are called DesignWare libraries.. ElectroScience — 48 — . That is decided in the synthesis step.3. Technology independent means that the implementations are described as a netlist of common logic elements.. to hardware components. During this phase the synthetic operator is mapped to a synthetic module and a implementation is selected. The DesignWare libraries which will be loaded into DC and used during the synthesis process are controlled in the file . and delay.sldb dw01. These pre-built components are technology independent and several components for implementing the same operator may exist. or mapped.Chapter 3. This is done in a hierarchy of abstractions. the designer provides power.sldb standard. the operations are specified.. You can list all libraries currently loaded.3 DesignWare In a VHDL description.3. using the dc shell command list -libraries. or delay constraints to the synthesis tool. which is bound to Synthetic modules.sldb Path ---/usr/local-tde/.sldb. /usr/local-tde/.1 illustrates the flow for operator inference using the addition operator..sldb dw02. The synthesis tool comes with pre-built components that implement the built-in VHDL operators. A DesignWare component can be included in a design using basically three different methods..setup. Each Synthetic module can have several implementations. Operator inference: In operator inference. area. which then selects the most suitable implementation. Figure 3.sldb dw02. the designer sets the design constraints and compiles. A DesignWare library has the file extension . A complete list of the DesignWare components can be found by typing sold at the UNIX prompt. Since.sldb standard. In these libraries there are technology independent implementations of the in-built VHDL operators as well as other more specialized operations.. 3.synopsys dc. which is the most commonly used method. /usr/local-tde/. Arithmetic 3. area. which could later be mapped to standard cells in a specific cell library. select DesignWare Library and then select DesignWare Building Block IP. Next. during which the VHDL operator are mapped to a Synthetic operator. The VHDL operator is associated to a Synthetic operator. the synthesis tool automatically maps operators in the VHDL code to components in the included DesignWare libraries. dc_shell> list -libraries Library File ---------dw01.3 list the most commonly used synthetic modules and implementations used for operator inference. the different implementations have different properties when it comes to power. but it is not decided how these operations are to be implemented.

1: The figure illustrates the steps and abstraction levels used for operator inference of the addition operator Function inference Including the DesignWare component through function inference is done by first making the function available though a library clause in the VHDL file. One method is to use component instantiation or function inference and add VHDL code that specifies also the implementation. This can be done in two different ways. the designer selects only the synthetic module and leaves selection of implementation to the synthesis tool. which makes the VHDL code implementation — 49 — Lund Institute of Technology . However.3. an entity declaration of the component is included and the architecture body contains a port map. The other method. Normally. when including a DesignWare component through one of the above mentioned methods. A complete description on how to include a specific DesignWare component through function inference or component instantiation is found in DesignWare Building Block IP under the specific component. DesignWare sum = in1 + in2 Analyse and Elaborate in1 sum in2 Synthetic operator Constrain and Compile DW01_add (cla) DW01_add (rpl) Figure 3. The component is made accessible through a library clause. and then call the function inside the architecture body.3. it is possible to control also the implementation. Component instantiation Including a DesignWare component through component instantiation is done similarly as including a component through function inference.

Arithmetic Table 3.3: Basic arithmetic synthetic modules and their implementations Synthetic Module Operator Description Implementations DW01 absval ABS Absolute Ripple (rpl) value Carry Look-Ahead(cla) Fast Carry Look-Ahead (clf) DW01 add + Adder Ripple (cla) Carry Look-Ahead (cla) Fast Carry Look-Ahead (clf) Brent-Kung architecture (bk) Conditional Sum (csm) DW01 addsub +. > 2-Function Ripple (rpl) comparator Fast Carry Look-Ahead (clf) Brent-Kung architecture (bk) DW01 cmp6 <. − Adder and Ripple (rpl) subtractor Carry Look-Ahead (cla) Fast Carry Look-Ahead (clf) Brent-Kung architecture (bk) Conditional Sum (csm) DW01 cmp2 <. − Incrementer Ripple (rpl) and Carry Look-Ahead(cla) Decrementer Fast Carry Look-Ahead (clf) DW01 add − Subtractor Ripple (cla) Carry Look-Ahead (cla) Fast Carry Look-Ahead (clf) Brent-Kung architecture (bk) Conditional Sum (csm) DW01 mult ∗ Multiplier Carry Save Array (csa) Non-Both-recoded Wallace (nbw) Wallace Tree (wall) ElectroScience — 50 — .Chapter 3. 6-Function Ripple (rpl) <=. >= comparator Fast Carry Look-Ahead (clf) Brent-Kung architecture (bk) DW01 dec − Decrementer Ripple (rpl) Carry Look-Ahead(cla) Fast Carry Look-Ahead (clf) DW01 inc + Incrementer Ripple (rpl) Carry Look-Ahead(cla) Fast Carry Look-Ahead (clf) DW01 incdec +. >.

14 is implemented.3. select the implementation using the dc shell command set implementation.06-SP1-3 Date : Mon Jul 5 16:58:42 — 51 — Lund Institute of Technology . area and power in the given order. This method is preferable since it keeps the VHDL code free from implementation specific details. if one wants to realize hardware that meets special needs. it can be necessary to select the implementation manually. number and type of adders.. The dc shell script command report resources can be used after the compile step to check which synthetic modules and implementations the synthesis tool has selected for the design. which is then handled in the Synthesis script. can be obtained by dc_shell > report_resources The table below shows an example of how the structure in Figure 3. e. function inference or component instantiation in the VHDL file. If one wants to change the implementation of one of the adders the desired type has to be determined manually. For example: dc_shell> set_implementation DW01_add/rpl cell_name selects the cell cell name to be implemented as a Ripple Carry adder (rpl) from the synthetic module DW01 add. for a specific cell. is to use operator inference.3. see Table 3. The chosen implementation is a ripple-carry structure for all adders.3.4 and needs to be assigned to the target cell as dc_shell > set_implementation cla add_37/plus/plus The implementation of the targeted adder cell has now been changed to a carrylookahead structure. \label{sc:report_resources} **************************************** Report : resources Design: adder Version: 2003. However. Before the implementation type can be chosen the following steps are required a priori: • analyze • elaborate • uniquify • compile Information on the used resources. 3. it is for example possible to map all addition operators in a design to a specific synthetic module and implementation.see the table below. DesignWare independent. after the elaborate stage. Then.g.2 Manual Selection of the Implementation in dc shell The Synopsys Design Compiler automatically determines the hardware implementation to meet the design constraints speed. With some knowledge of the dc shell script commands.

"*plus*") if (plus_list != {}) { set_implementation DW01_add/ + DW_ADD_IMPL plus_list } if (BF == sub) { minus_list = find (cell.vhd =============================================================================== | | | | Contained | | | Resource | Module | Parameters | Resources | Contained Operations | =============================================================================== | r139 | DW01_add | width=12 | | add_37/plus/plus | | r142 | DW01_add | width=12 | | add_37/plus/plus_1 | | r145 | DW01_add | width=12 | | add_37/plus/plus_2 | | r148 | DW01_add | width=12 | | add_37/plus/plus_3 | =============================================================================== Implementation Report ============================================================================= | | | Current | Set | | Cell | Module | Implementation | Implementation | ============================================================================= | add_37/plus/plus | DW01_add | rpl | cla | | add_37/plus/plus_1 | DW01_add | rpl | | | add_37/plus/plus_2 | DW01_add | rpl | | | add_37/plus/plus_3 | DW01_add | rpl | | ============================================================================= If all the implementations in one design needs to be set manually a script can be executed in the dc shell SELECT_DW_CELLS = true /* Choose DW implementations (rpl or cla)*/ DW_ADD_IMPL = rpl DW_SUB_IMPL = rpl DW_CMP_IMPL = rpl design_list = find (design.4: Adder Synthesis Implementation Implementation name Function Library rpl Ripple-carry DWB cla Carry-look-ahead DWB clf Fast carry-look-ahead DWF bk Brent-Kung architecture DWF csm Conditional-Sum DWF rpcs Ripple-carry-select DWF clsa Carry-look-ahead-select DWF csa Carry-select DWF fastcla Fast-carry-look-ahead DWF 2004**************************************** Resource Sharing Report for design adder in file /home/toll/jrs/Synopsys/vhdl/course/adder. "*minus*") if (minus_list != {}) { set_implementation DW01_sub/ + DW_SUB_IMPL minus_list }} else { cmp_list = find (cell. Arithmetic Table 3. -hierarchy) design_list = design_list + current_design if (SELECT_DW_CELLS == true) { foreach (de. "*lt*") if (cmp_list != {}) { ElectroScience — 52 — .Chapter 3. "*". design_list) { current_design de plus_list = find (cell.

DW01 dec. the wordlength of the input operands and the result should be the same in the VHDL code. Although there is no difference between an adder/subtractor for unsigned or signed operands. Figure 3. or DW01 addsub synthetic module will be instantiated through operator inference during compilation. DW01 incdec. Addition and Subtraction set_implementation DW01_cmp2/ + DW_CMP_IMPL cmp_list }}}} 3. Hence. Generally. They are used both in data paths as well as in controllers. independent of whether unsigned or signed data types are used in the addition or subtraction. When a + or − sign is used in the VHDL code. A similar example for unsigned or signed substraction shows that the same basic subtractor can be used for both signed and unsigned numbers. unsigned 0 1 2 3 4 5 6 7 signed 0 1 2 3 −4 −3 −2 −1 code 000 001 010 011 100 101 110 111 signed 2 + (−3) −1 unsigned + 2 5 7 + 010 101 111 3 + (−3) 0 + 3 5 8 011 + 101 (1)000 Figure 3.g. An unsigned number with wordlength w bits lies in the range 0 to 2w − 1 and a signed number with wordlength w lies in the range −2w−1 to 2w−1 − 1. see Table 3. we see that adding 011 with 101 generates the result 000. DW01 sub.1 Basic VHDL example The wordlength of an addition or substraction specifies the number of bits used in the operands.4. DW01 inc. e. but it is an overflow for the unsigned addition. overflow conditions or increasing the wordlength of the operands.4.. the same basic architectures used for the implementations of adders are found in the implementations of subtractors. Addition and subtraction are closely related operations.4 demonstrates signed and unsigned addition for a 3-bit number representation. either the DW01 add. The same number of bits should be used in the result as used in the input operands.3. there are indeed differences when it comes to exceptions.3. The same synthetic modules and implementations are used.4 Addition and Subtraction Addition and substraction are two of the most frequently used arithmetic operations. From this example we see that the same basic addition operation can be used for both unsigned and signed numbers. This is the correct result for the signed addition. A substraction can be performed by negating the subtrahend and add the result to the minuend with a carry in. The wordlength of an addition or — 53 — Lund Institute of Technology .2: 3-bits unsigned and signed addition 3. Hence.

-.Unsigned addition with extended wordlength in1.Signed substraction in1. for the sake of readability. in2. sum <= in1 + in2. to avoid overflow in the result. in2. i. This is handled differently depending on. sum <= in1 + in2. the wordlength of the operands must be increased by one bit before the operation. diff : signed(wordlength-1 downto 0).Unsigned addition in1. wordlength+1) + ext(in2.Signed addition in1. -. The VHDL code used in the examples will not be complete when it comes to entity declaration. 3. we show only the signal declarations and the actual operation. through operator inference during compilation. To be able to do wordlength optimizations.Signed addition with extended wordlength ElectroScience — 54 — . sum : std_logic_vector(wordlength downto 0).2. Instead. etc.e. -. -. diff <= in1 . -. diff <= in1 . sum <= ext(in1. as the result must have the same wordlength as the largest wordlength of the input operands in the VHDL code. extend with the most significant bit of the operand.Chapter 3. wordlength+1). architecture body. The implementation will be selected to meet the design constraints as well as possible. A signed operand should be sign extended. The synthetic module DW01 addsub can perform both addition and substraction and will be inferred into the design if it is possible to do resource sharing between an addition and subtraction. sum : signed(wordlength-1 downto 0). The following VHDL code will instantiate the synthetic module DW01 add for the addition operator and the synthetic module DW01 sub for the substraction operator. Arithmetic subtraction is one of the most important parameters used to control the final area and performance of the implementation. -. sum : unsigned(wordlength-1 downto 0). in2. diff : unsigned(wordlength-1 downto 0).4.in2. An unsigned number should be extended with zeroes. Hence. in2 : std_logic_vector(wordlength-1 downto 0). in2. whether the operands are signed or unsigned.2 Increasing wordlength Addition or subtraction of two operands with the dynamic range −2w−1 to 2w−1 − 1. the designer must be aware of the dynamic range of the input signals and the precisions needed in the operation.in2. as seen in Figure 3. gives a result with the dynamic range −2w to 2w − 1.4.Unsigned substraction in1.

sum <= sxt(in1. respectively. sum : signed(15 downto 0). The incremeter/decrementer is the only case among counters that has a dedicated synthetic module in the DesignWare library.3 Counters Counters and addressing arithmetic are used extensively in controllers. The operand with shortest wordlength will be zero or sign extended. In VHDL. Firstly this means that a half adder is — 55 — Lund Institute of Technology . sum : std_logic_vector(wordlength downto 0). In the following VHDL example in1 will be automatically sign extended to 16 bits. The area and delay through a counter depends on the constant. -. An incrementer will be mapped to the synthetic module DW01 inc and a decrementer to the synthetic module DW01 dec. Addition and Subtraction in1 in2 w w w+1 sum Figure 3. in2 : std_logic_vector(wordlength-1 downto 0).4. wordlength+1) + sxt(in2. In such operations the addition or substraction can be optimized compared to the two-operand operation. counter : std_logic_vector(wordlength-1 downto 0). In the special case when the constant is equal to +1. we have a incrementer. in2 : signed(15 downto 0). A counter is defined as an addition or subtraction where one of the operands is a constant.Signed addition with implicit sign extension in1 : signed(7 downto 0). counter <= input + 5.3. a counter is best described by using an integer to represent the constant. If it is possible to perform resource sharing between incrementation and decrementation. When the constant is equal to −1.3: The dynamic range of the result of a two-operand addition or substraction is increased by one bit compared to the dynamic range of the input operands in1. a logic 0 at bit position i in the constant will prevent carry propagation to the next stage in a ripple carry structure. wordlength+1). An important condition for addition or substraction in VHDL is that the output wordlength must be equal to the largest wordlength of the operands.4. the synthetic module DW01 incdec will be used. -. For example.Counter adding 5 input : std_logic_vector(wordlength-1 downto 0). sum <= in1 + in2 . 3. depending on whether it is unsigned or signed. a decrementer.

to make a counter with a constant that is a power of two.4. sum <= in1 + in2 + in3 + in4 . if the operation where expressed writing +8. in2. The following VHDL example illustrates this procedure with the constant equal to 8.Counter adding 8 input : std_logic_vector(wordlength-1downto 0). which contains optimized architectures for multioperand addition. adding four operands.Multioperand addition in1. in3. the wordlength must be manually controlled in1 in2 sum in3 in4 Figure 3. To perform this in a single cycle a binary tree of adders is used. Implementation as a binary adder tree dictates that the wordlength of the last addition will be w+ log2 M . The beneficial architecture of the incrementer/decrementer in the DesignWare library may also be used. The module is inferred into the design through function inference or component instantiation. 1 bit is added to the wordlength of the input operands.4: Multioperand addition.4. The code will instantiate 3 DW01 add synthetic modules during synthesis. counter <= (input(wordlength-1 downto 3) + 1) & input(2 downto 0). for each depth in the binary adder tree. Arithmetic sufficient at bit position i + 1 and secondly that the critical path through the carry chain will be reduced. see Figure 3. where w is the wordlength of the input operands and M is the number of input operands. Hence.4. ElectroScience — 56 — . -. Multioperand addition can also be performed using the specialized synthetic module DW01 sum. To avoid overflow. -. in4.4 Multioperand addition Multioperand addition means addition with more than two input operands. implemented as a binary adder tree and selected by the designer. The following example shows a single cycle 4-operand addition. sum :signed(wordlength-1 downto 0).Chapter 3. 3. counter : std_logic_vector(wordlength-1 downto 0). Single cycle multioperand addition are expressed in VHDL by using the + operator between all operands. with some efforts. There is no support for operator inference. This example gives the area 205 area units compared to 224 area units.

which spans through the different implementations. depending on the mapping of the structure to a cell library.4. we have used a 16bits adder (DW01 add) and a 16-bits subtractor (DW01 sub) and plotted delay versus area for the different implementations in a 0. The wordlength of the adder and the register should then be selected as the wordlength of the input operands plus log2 (M ) bits.8 1 1. Finally.5: Area-Delay plot for 16-bits DW01 add implementations — 57 — Lund Institute of Technology . see Figure 3. To make some benchmarks which can be illustrated. 3.8 2 Figure 3.) 1000 Infeasible region 900 800 cla 700 600 500 400 0.5 and Figure 3.6 0. Secondly.5 Implementations The design space for an operator is a multidimensional space.13 µm CMOS process.5.4 1.3. This optimal area-delay curve is also plotted in the figures and the optimal implementation for a given delay interval is printed. The plots are generated by running DC several times with different delay constraints. for each implementation there exists an area-delay-power space. Addition and Subtraction A recurrent problem closely connected to multioperand addition. 1500 1400 1300 csm 1200 1100 Area (a. With an area-delay plot for all different implementations we can find the optimal area-delay curve.u.4 0.6 1. The point most far to the right in each implementation is the smallest area possible to achieve with that specific implementation. First. there are several possible implementations for each operator.2 bk cla rpcs rpl clf rpcs bk Optimal rpl cla clf bk csm rpcs rpl 0.4.2 Delay (ns) 1. Such a multidimensional space is not easily illustrated. is that of calculating a sum of M operands using a single time-shared adder and a register to store the partial results. to avoid overflow.4.4. the wordlength of the operator is crucial for the designer to decide.

<=. Figure 3.5 Comparison operation Comparison operators are used frequently in controlling the data flow in data paths and selecting operands and outputs. >= and \ =.u. which often is used to control multiplexors in the data path. <.4 0. 100 < 011 if the operands are signed and 100 > 011 if the operands are unsigned.6: Area-Delay plot for 16-bits DW01 sub implementations 3.) 1400 Infeasible region 1200 clf csm 1000 rpl rpcs 800 cla 600 bk 400 0.4 1. When it comes to comparison operation it is important to distinguish between signed and unsigned operators. The only two operators where it does not matter if the operators are unsigned or signed is equal to (=) and not equal to (\ =).6 1.6 0.8 2 cla rpcs rpl Figure 3.2 Delay (ns) 1.2 0. In VHDL there are six comparison operators: =. >. The result of a comparison operation is a boolean type (true or false). If DW01_cmp6 in1 in2 Less than Greater than Equal to Less than or equal to Greater than or equal to Not equal to signed or unsigned Figure 3.5 shows the input and outputs to the synthetic module DW01 cmp6.7: Synthetic module for the six-function comparator DW01 cmp6 ElectroScience — 58 — .Chapter 3. A comparison operation in the VHDL code will instantiate the synthetic module DW01 cmp6 or DW01 cmp2 through operator inference during compilation. For example.8 1 1. Arithmetic 2200 bk 2000 Optimal rpl cla clf bk csm rpcs 1800 1600 Area (a.

5. Comparing two operands of different types will result in an implicit type conversion. — 59 — Lund Institute of Technology . adding no extra area to the design. Thus. It is also possible to make a comparison between operands with different wordlengths. else diff <= in1 . which will result in sign or zero extension of the operand with smallest wordlength.Comparison used to control a multiplexer in1 : signed(wordlength-1 downto 0). comparing one input operand with a constant makes it possible to reduce the area and delay of the operation. The counter as a special case when building an adder/substractor structure corresponds to doing a comparison with a constant. -. result <= in1 < in2.in1. -. result <= in1 < in2.5. result : boolean. diff : signed(wordlength-1 downto 0). if (in1 < in2) then diff <= in2 . end if.in2. A constant is best described in VHDL by using an integer. in2 : unsigned(wordlength-1 downto 0). 3. in2 : signed(wordlength-1 downto 0).5.Comparison with implicit sign extension in1 : signed(wordlength-1 downto 0). the unused logic will be removed. -.3. The code may be translated to the hardware blocks (or synthetic modules) shown in Figure 3. as shown in the following VHDL example. result : boolean. result : boolean. Then sign extend in1 and zero extend in2.1 Basic VHDL example The most basic example is to compare operands of equal size and data type. As the output of the comparison operation is very often used as an input to a multiplexor. Comparison operation not all the outputs of the synthetic module for comparison operations are used. Hence. result <= in1 < in2. in the context of a comparator structure. For example.Comparison in1 : std_logic_vector(wordlength-1 downto 0). we demonstrate such an example as well. in2 : std_logic_vector(wordlength-1 downto 0). -. the following VHDL code will increase the wordlength of the comparison by one bit. in2 : signed(wordlength-9 downto 0).Comparison between different types in1 : signed(wordlength-1 downto 0).1.

Dependent on the design constraints. wall 3.12. speed. 3. 4. The width of the adders can be reduced to Nx by carrying out the multiplication with right-shifts.g. to control the data flow Table 3. DC chooses the most suitable implementation to meet the design constraints.13. A multiplier is basically a construct the adds the partial products of the multiplicand and multiplier.9: Sequential multiplication of 2 two’s complement numbers. The main differences between the different implementations is the Adder implementation. However. nbw. 6 -stage Multiplier DW02 prod sum component instantiation Sum of products csa. 3. The width of the required Adders is Na + Nx − 1. ElectroScience — 60 — . area. if an operation such as a a product-sum multiplier is desired the instantiation has to be done in the VHDL code. 5.5: Various Multiplier Synthesis Implementation Synthetic Module Inferencing Description * Implementation DW02 mult * Multiplier DW02 mult stage * 2. nbw.8: The output from the comparator used as input to multiplexors. Arithmetic in1 mux in2 in1 Less than in2 DW01_cmp2 in2 DW01_sub diff mux in1 Figure 3. a4 x4 −a4 x0 a3 x1 a2 x2 a1 x3 −a0 x4 p4 a3 x3 a3 x0 a2 x1 a1 x2 a0 x3 p3 a2 x2 a2 x0 a1 x1 a0 x2 p2 a1 x1 a1 x0 a0 x1 a0 x0 a0 x0 x −a4 x2 a3 x3 −a2 x4 p6 −a4 x1 a3 x2 a2 x3 −a1 x4 p5 a4 x4 p8 −a4 x3 −a3 x4 p7 p1 p0 Figure 3.6 Multiplication The implementation of a multiplier in digital hardware can be done in various ways.Chapter 3. and power consumption.9. wall csa. see Table 3. see Figure 3.5. e. The multiplication of 2 two’s complement numbers is illustrated in Figure 3. wall csa. nbw.

9)*input(0).9)*input. 8.hundred28) begin hundred28 <= conv_std_logic_vector(128.3. signal hundred28 : std_logic_vector(N+9-1 downto 0).3)*input(0). signal hundred26 : std_logic_vector(N+9-1 downto 0). — 61 — Lund Institute of Technology . Therefore.6. full-sized multipliers will be implemented. end process. DC replaces the multiplier automatically by a bit-shift operation. hundred26 <=hundred28 . see Figure 3. c. 2. end process. 4. A VHDL example on how to implement a multiplication as a shift-and-add operation is given below: architecture STRUCTURAL of adder126 is signal input : std_logic_vector(N-1 downto 0). it is recommended to use as many power of 2 numbers as possible when designing hardware. Bit-shifts are hardwired and do not require any gates. . if c = 2k it might be beneficial to express the multipliers by bit shifts and additions/subtractions. However. . is a power of 2 number. Multiplication 3. Moreover. if the multiplier.1 Multiplication by constants For applications where multiplications with fixed coefficients are required. DC optimizes the multiplication with smaller multipliers. begin process(input. begin process(input) begin hundred26 <= conv_std_logic_vector(126. The more interested reader is referred to []. If fixed coefficients are read from a ROM. There are various articles on how the find the minimum number of adders for the implementation of a multiplier. signal hundred26 : std_logic_vector(N+9-1 downto 0). . Therefore.conv_std_logic_vector(2.6. the multipliers must be implemented as constants rather than having them stored in a ROM. A multiplication such as 126 · x can be written as 128 · x − 2 · x A VHDL example on how to write a multiplication with a constant is given below: architecture STRUCTURAL of mult126 is signal input : std_logic_vector(N-1 downto 0). Synthesis results show that such a multiplication results in least area and delay if implemented as carry-lookahead.10. The area and delay can be reduced significantly by expressing a multiplication by a bit-shift and add operations. .

u.5 1 1.10: Area/Delay comparison of a implemented multiplication. see Figure 3. Non-Boothrecoded Wallace-tree (nbw) and Booth recoded Wallace-tree (wall).2 Multiplication of signed numbers A multiplication can be implemented as carry-save array (csa).5 1 1. ElectroScience — 62 — .5 Delay (ns) 2 1200 1000 1000 800 800 600 600 400 400 0.) 0.6. Arithmetic 1600 nbw wall 1400 1600 rpl cla clf bk csm rpcs 1400 1200 Area (a.5 Delay (ns) 2 Figure 3.11. The graphs on the left panel refer to an implementation as 126 · x and on the right panel as 128 · x − 2 · x 3. Synthesis results show that the nbw implementation is superior in terms of area and delay compared csa and wall.) Area (a.Chapter 3. The synthesis results of the csa implementation is not shown as it can not compete with the csa and wall implementation.u.

Datapath Manipulation 3.5 2 2.) 1 1. 4000 nbw wall 3500 12000 11500 11000 10500 nbw wall Area (a.) Area (a. The same hardware construct can be obtained as follows: z <= (A + B) + (C + D).5 3000 10000 9500 9000 8500 8000 2500 2000 1500 7500 1. If information on the arrival time of the signal is available the delay can be shortened by rearranging the order of the adders. The signals within the parentheses will be arranged as a balanced adder tree. Assuming that signal E arrives late. see Figure 3. The instruction is mapped to hardware as a balanced tree if possible in order to minimize the delay.7. the delay time can be shortened by using parentheses as z <= (A + B + C + D) + E. A simple arithmetic instruction as an addition can be written as: z <= A + B + C + D.u. — 63 — Lund Institute of Technology . On the left-hand side a 8-bit multiplier and on the right-hand side a 16-bit multiplier.15.3.u.5 2 Delay (ns) 2.11: Area/Delay comparison of multipliers.7 Datapath Manipulation This section shows how the construction of the datapath can be determined by writing appropriate VHDL code. The parser is forced to place E latest in the adder chain.5 Figure 3.5 Delay (ns) 3 3.

Chapter 3.12: Negative Right-Shift Multiplier. Arithmetic x 10101 p(0) +x0 a 2p(1) +x1 a 2p(2) +x2 a 2p(3) +x3 a 2p(4) 00000 10101 10101 1 10101 11010 1 00000 1 11010 1 11101 01 10101 1 10010 01 11001 001 00000 1 11001 001 11100 1001 +(−x4 a) 0 1 0 1 1 00111 1001 p(5) Figure 3.13: Positive Right-Shift Multiplier. A B C D Z Figure 3.14: Balanced Adder Tree ElectroScience — 64 — . x 01011 00000 p(0) 10101 +x0 a 2p(1) 1 1 0 1 0 1 11010 1 10101 +x1 a 2p(2) 1 0 1 1 1 1 1 10111 11 00000 +x2 a 2p(3) 1 1 0 1 1 1 1 1 11011 111 10101 +x3 a 2p(4) 1 1 0 0 0 0 1 1 1 11000 0111 00000 +x4 a 2p(5) 1 1 1 0 0 0 0 1 1 1 11100 00111 p(5) 10101 Figure 3.

3.7. Datapath Manipulation

A

B

C

D

E

Z

Figure 3.15: Unbalanced Adder Tree

— 65 —

Lund Institute of Technology

ElectroScience

— 66 —

Chapter 4

Memories
One of the most important topics in digital ASIC design today is memories. Memories occupy a lot of space and consume a lot of power and the situation becomes even worse if off-chip memories are considered. Even though each new generation of silicon processes can include more transistors on a single chip, it is predicted that the percentage of memory on a chip will increase. In the road map from Japanese system-LSI industry it is predicted that in 2014 more than 90% of a chip is covered with memory, compared to todays 50%. With these numbers in mind, it is reasonable to spend some time to optimize the memory system when designing a new ASIC. Each saved square millimeter of silicon corresponds to a neat pile of money and to be able to use ASICs in mobile or extremely fast applications, low power consumption is required. In mobile applications to maximize battery life time and in fast applications to avoid expensive cooling equipment. Another memory issue is that the larger a memory is the slower and more power hungry it becomes. The consequence is, that the designer have to split large memories into smaller once in order to be more power efficient or to meet timing requirements. This will create a more complex memory hierarchy and introduce more design criterias. This section does not try to cover every aspect of memory design; instead some examples on what can be done to optimize memory systems. These examples do not serve as the final truth but rather show the reader that memory systems have to be tailor made for each specific application. The first example is about minimizing the area of a Shift Register (SR) [4]. The second example deals with a memory that is not always utilized 100% [5]. A cache system to read from a off-chip memory in a image convolution application is presented in the last example [6]. For more information on memory designs consult [7].

4.1

SR example

In this example shift register (SRs) are considered. SRs are commonly used to delay or align data in data paths, for example, in a pipelined Fast Fourier Transform (FFT) processor. Here, four different ways to implement a SR are presented together with an estimate of the required silicon area. Since one input is read and one output is written each clock cycle, the SR could be implemented in at least four different ways, as shown in Figure 4.1: — 67 — Lund Institute of Technology

FF MUX FF Single-port Memory Dual-port Memory FF MUX (a) (b) (c) (d) Figure 4. Single-port memories have smaller memory cells than dual-port memories.1: Four ways to realize a SR. Memories FF FF FF RD FF FF WR FF Single Port Mem. flip-flops have a fast access time and thus allow high operation speed. but additional logic is required outside the memories in order to gather and separate input and output. One dual-port memory: A dual-port memory can perform both a read and write operation each clock cycle and thus suits the requirements perfectly. the memory has to be twice as wide. However. two words are read from the memories every other clock cycle. Two single-port memories of half length and alternating read/write: The input is gathered in blocks of two and written to the memories every other clock cycle. In a similar manner. There are no interconnections included in the flip-flop area. c. One single-port memory of half length and double width. that flip-flops is only the best solution. However. All memory SRs include control logic and a 50µm power ring on three sides of the block. but instead of storing the two inputs in separate memories they are stored as one word in the same memory location. Single Port Mem. compared to a single port memory. Figure 4. Thus. Additionally. in order to perform simultaneous read/write additional logic is required in each memory cell. b. In the figure it can be seen. flip-flops are not optimized for area. Flip-flop based (a).Chapter 4. and one single-port memory of twice width (d). two single-port memories (c). Flip-flops connected in series: This is an easy solution since there is no control logic needed. d. which reads and writes every other clock cycle. from an area perspective. dual-port memory (b). a. if the SR is less than approximately 400 ElectroScience — 68 — .35µm CMOS technology. This solution works in the same way as the previous solution.2 shows the estimated area for the different implementations of SRs in a 0.

to reduce power consumption for cases when N is less than 1024. This is due to extra transistors in each memory cell and double address logic.1: Average current drawn for 128-1024 words memories. With Vdd =3.4. As a trade-off between power reduction and the difficulty to place and — 69 — Lund Institute of Technology .2 are extrapolated down to a zero bit SR. Memory split example 2. can be any power of two between 32 and 1024.3 V and a wordlength of 24 bits.5 Flip-flops Dual-port memory Single-port memories Double width memory 2 1. this operation will be performed continuously. The length of the sequence. bits and that one single port memory of double width is the most area efficient solution for SRs larger than 400 bits. Particularly. This is due to the decreased relative overhead from address logic and buffers. the memory is split into smaller blocks. the memory has to be 1024 words long. If the memory is implemented as one block it will consume much power even if only a subset of the addresses is used. Table 4. N . It can also be seen that memory size does not increase linearly with the number of bits such as flip-flops do. 4. Hence. it can be seen that the overhead for a dual-port memory is extremely large compared to single-port memories in this process. Size [words] 1024 512 256 128 IDD [µA/MHz] 516 463 436 74 Since the largest N in this design is 1024. The application will write N words of data into the memory and then read the same data in a different order from the memory.2 Memory split example The target application in this example is a memory that continuously stores data sequences of different lengths. these parts do not increase in size as fast as the cell matrix does. A rough number on the overhead can be estimated if the memory areas in Figure 4.5 0 0 1 2 3 4 5 FIFO size [Kbits] 6 7 8 9 Figure 4.2.2: Area versus SR size for different implementations.5 Area mm2 1 0.

respectively. the complete image has to be stored on off-chip memory. With N smaller than 512-points. N 1024 512 256 128 One memory [mW] 85 85 85 85 Memory bank [mW] 59 42 12 12 Power savings [%] 30 50 86 86 N . provided as macrocells.2. the memory is split into four blocks of size 128. Table 4.3 Cache example The target application is a custom image convolution processor.. During the convolving operations. In this example x is 256×256 pixels and h is 15×15 pixels large. only the low power memories are used.3 shows a comparison of the power consumption between the one memory and the memory bank solution. The values are given for memories in a standard 0. a very high data throughput is needed.35 µm CMOS process with five metal layers. However.3: Average power consumption for different sizes of given for Vdd =3. Table 4.1) where x and h denote the image to be filtered and kernel functions.3 V and a clock frequency of 50 MHz. The values are given for Vdd =3. Each 128-words memory is used a quarter of the time and the 256-words memory is used half the time. m2 ). The power savings are between 30 and 86 percent. The values are given for a clock frequency of 50 MHz and Vdd =3. In this case the average current is calculated as 436/2 + 74/4 + 74/4 = 255 µA/MHz. Memories Table 4.e. The average current drawn for the memory bank for different sizes of N is shown in Table 4. the drawback is that a larger chip is required.3 V. The last row shows the power savings in percent. For example. The values are 64 85 12 86 32 85 12 86 route many memories. accessing large off-chip memory is expensive in terms of power consumption and so is the signaling due to the large capacitance of package pins and PCB wires.1 shows the average current that the memory draws. both 128-words memories and the 256-words memory is used. since the silicon area increases with the number of memories due to the overhead in address logic and power supply. 128. K2 − m2 )h(m1 . (4. if N = 512-points. Two dimensional image convolution is one of the fundamental processing elements among many image applications and has the following mathematical form: y(K1 . this operation is performed once for each output pixel. When this is performed in real-time. N 1024 512 256 128 64 32 IDD [µA/MHz] 359 255 74 74 74 74 Table 4. Note that the 128-words memory is a low power memory and this was the largest low power memory available for this process.Chapter 4. Furthermore.3 V and a wordlength of 24 bits. ElectroScience — 70 — . 4. 256. each kernel position requires 15×15 pixel values from off-chip memory directly. As a consequence of the image size. K2 ) = x(K1 − m1 . i. and 512 words. K1 and K2 is the pixel index.2: Average current drawn for the memory bank for different sizes of N .

except the one in the extreme corners.9 mJ 20.16.1 when using one.6 mJ 68. a two level memory hierarchy was introduced. for each successive pixel position. or three levels of the hierarchy. Instead of accessing off-chip memories 225 times for each pixel operation. The small cache memories are composed of 15 separate memories to provide one column pixel values — 71 — Lund Institute of Technology . the value in the upper left is discarded and a new value is read from the off-chip memory.4: memory hierarchy schemes and access counts for an image of 256×256 (M0=off-chip 0. 14 full rows of pixel values and 15 values on the 15th row are filled into 15 onchip line memories before any convolution starts. pixel values from the off-chip memories are read only once for the whole convolving process from upper left to lower right. Cache example Table 4. as shown in Figure 7.16 a three level memory hierarchy is shown and the number of memory access required for Equation 4. Image memory off-chip Line memories Kernel memories M2 N Scheme Image Image Image Image line 2 N (M 1) M Image M2(N-M+1)2 N2 MN(N-M+1) Line Kernel M2(N-M+1)2 M2(N-M+1)2 MN(N-M+1) M2(N-M+1)2 kernel line kernel N2 Figure 4.8 mJ By the observation that each pixel value.4. Since accessing large memory consumes more power. After the first pixel operation.3: Three levels of memory hierarchy and the number of reads from each level. Thus the input data rates are reduced from 15×15 accesses/pixel to only 1 read. As a result. is used in several calculations. Using a two level memory structure can reduce both power consumptions and I/O bandwidth requirements.3. two.35µm) Scheme A: M0 B: M0→ C0 C: M0→ C1 D: M0→C0→C1 M0 13176900 65536 929280 65536 C0 13176900 929280 13176900 13176900 C1 energy cost 790 mJ 56. In Figure 7. 15 small cache memories are added to the two level memory hierarchy.18µm. C0 and C1=0. one extra cache level is introduced to further optimize for power consumptions during memory operations.

In order to simplify address calculation. Instead of reading pixel values directly from line memories 15 times for each pixel operation.4. the least access counts to both external memories and line memories are identified in three level memory structure. Compared with the total clock cycles in the magnitude of 107 . Image processor without controller 4 bits address line Cyclic column storage Data out . that the power consumption for the total memory operation could be reduced over 2. this will contribute 61696 clock cycles in total. While for ASIC ElectroScience — 72 — . but this is achieved at the cost of extra clock cycles introduced during each pixel operation. All these possible solutions differ substantially in memory accesses. one column of pixel values are read to cache memories first from the line memories for each new pixel operation except for the first one.4: Controller Architecture with incremental circuitry In fact. In Table 4. Between scheme B and D. such a side effect is negligible. two level hierarchies prevails due to its lower clock cycle counts. respectively.Chapter 4. During each pixel operation. Thus. the counts to the large memory M0 are reduced by nearly 225 times compared to that of the one level solution. reading pixel values from line memories 15 times could be replaced by only once plus 15 times small cache accesses. In the case of an image size of 256×256. direct line memory access is replaced by reading from the cache.4. Memories during each clock cycle. the achieved low power solution has to be traded for extra clock cycles introduced during the initial filling of the third level cache memories. Assuming that the embedded memories are implemented in a 0. alternative schemes exists by different combinations of the three memories in the hierarchy. C1 denote off-chip memory. In addition to the new cache memory. This allows circular operations on the cache memories.4x24 bits APU 3 Cache level 3 (15x16) Processor core 1 Processor core 2 Processor core 3 Processor core 4 Control signals from controller Kernel moving one pixel to the right New column written to cache for each new pixel operation System bus 15x8 bits 8 bits APU 1 2 Line memories with pipelined registers level 2 (15x256) Input buffer Large off-chip memories level 1 (256x256) Unfilled memory elements New value feeded during each new pixel operation Figure 4. the depth for each cache is set to 16. For the two and three level hierarchy structures. by the calculation method in [7]. one extra address calculator is synthesized to handle the cache address calculation. During new pixel operations each new column of data is filled into the cache in a cyclic manner. different memory schemes are shown where M0. line memories and smaller caches. C0.35µm process CMOS. the access counts to the large off-chip memories varies. trade off should be made when different implementation techniques are used. However.5 times compared to that of a two level hierarchy. As a result. For the FPGAs where power consumption is considered less significant. Although the amount of data provided to the datapath remains the same for all four schemes. it is shown in Table 4.

three level hierarchy is more preferable as it results in a reasonable power reduction.3.4. Cache example solutions. — 73 — Lund Institute of Technology .

ElectroScience — 74 — .

The Design Compiler product is the core of the Synopsys synthesis software products. This is illustrated in Figure 5. this is done by either the command read vhdl or analyze/elaborate.1 Basic concepts Synthesis is the process of taking a design written in a hardware description language. design re-timing. the process makes trade-offs between — 75 — Lund Institute of Technology . It comprises tools that synthesize your HDL designs into optimized technology-dependent. However. creating a gate level netlist from RTL behavior. and compiling it into a netlist of interconnected gates which are selected from a user-provided library of various gates. and power. behavioral(high-level)synthesis. gate-level designs. the design in the form of netlists of generic components needs to be mapped into the real logic gates (contained in cell libraries provided by IC vendors). and selecting alternative cells. In design compiler. such as VHDL. logic flattening. According to various user specified constraints and optimization techniques. It supports a wide range of flat and hierarchical design styles and can optimize both combinational and sequential designs for speed. This process is accompanied by design optimization.1. then builds the design using generic (GTECH) components. design compiler maps and optimizes the design with a range of algorithms.. The design after synthesis is a gate-level design that can then be placed and routed onto various IC technologies. which is widely used in industry and universities. namely. Aiming to meet the design constraints.2. etc. RTL synthesis. transformation. Its process of synthesis can be described as translation plus logic optimization plus mapping. logic synthesis. The much simplified version of the design flow is shown in Figure 5. this tutorial will only cover the synthesis tool named Design Compiler by Synopsys. eg. which maps gate level designs into a logic library. Such a process is generally automated by many commercial CAD tools. There are three types of hardware synthesis.Chapter 5 Synthesis 5. After translation. Translation is the process of translating a design written in behavioral hdl into a technology independent netlists of logic gates. that creats a RTL description from an algorithmic representation. area. Design Compiler deals with both RTL and logic synthesis work. It performs syntax checks.

For example. As with technology libraries.1: Simplified ASIC Design flow speed. The libraries come with two types: the standard DesignWare library and the Foundation DesignWare library. The DesignWare libraries are composed of pre-implemented circuits for arithmetic operations in your design.2 Setting up libraries for synthesis Before Design Compiler could synthesize a design. symbol libraries. there are three types of libraries that should be specified in Synopsys setup file ”. The Foundation provided some advanced implementation to provide significant performance boost at the cost of a higher gate count. Wireload models will be discussed in detail in the following sections. If only library source code is provided. the unmapped design in Design Compiler memory is overwritten by the new. such as ”+−∗ <><=>=”. it is necessary to tell the tool which libraries it should use to pick up the cells. In addition to cell information. the classic implementation of multiplication by a 2-D array of full adders. Wireload models are also provided to calculate wire delays during optimization process. Synthesis VHDL Back Annotate Simulation Correct Function? Y N Modelsim Synthesis Design Compiler Place & Route Silicon Ensemble Layout Figure 5. area and power consumption. technology libraries. The technology libraries. you have to use Synopsys Library Compiler to generate .db format. The standard version provides some basic implementation of arithmetic operations. for example.setup”. see Library Compiler manual for further information. a Booth-coded Wallace tree multiplier is provided ElectroScience — 76 — . mapped design.Chapter 5. namely. ready for use in later physical design of place and route. The symbol libraries contain graphical symbols for each library cells. Design compiler uses it to display symbols for each cell in the design schematic.db format. Upon completion. are provided and maintained by the semiconductor vendors and must be in . which contain information about the characteristics and functions of each cell. and such designware parts are inferred through arithmetic operators in the HDL code. In total. 5.synopsys dc. and DesignWare libraries. it is also provided and maintained by semiconductor vendors.

db — 77 — Lund Institute of Technology ..sldb dw01... DW02 prod sum.5. SOP (Sum-of-Products). target libraries set the target technology libraries where all the cells are mapped to.sldb dw02.db c35_IOLIB_4M../. constraints Technology netlist Figure 5. architecture . . ”c35 CORELIB. Here is one example segment of the setup files for AMS035 technology at the department.sldb dw01. while the foundation library need a extra license..2. To setup these libraries in Synopsys setup file.. . vector adder(A+B+C+D).51/synopsys/c35_3.db } symbol_library = { c35_CORELIB.sdb c35_IOLIB_4M. These are the MAC (Multiply-Accumulate)..sdb } synthetic_library = { standard. One thing to note is these parts can not be inferred through HDL operators. entity test is port ( a : in . in the example above.05 -AMS 0.2: Synthesis Process for multiplications. Setting up libraries for synthesis library ieee.sldb } From Synopsys manual. uniquify .35u CMOS v3. DW02 sum respectively. /* -.synopsys_dc.. several complex arithmetic operations are implemented to obtain even higher performance. ”target library...db c35_IOLIB_4M.setup file for Synopsys synthesis Synopsys v2002. In addition to basic arithmetic operations. search_path = { /usr/local-tde/cad1/amslibs/v3.sldb} target_library = { c35_CORELIB. link library.setup -..sldb dw02..*/ company = "TdE LUND" cache_read = ./. symbol library. cache_write = ..50 -.db standard. their counterpart DesignWare components are DW02 mac. modifications to the HDL code has to be done by instantiating or using function calls to utilize those parts.3V } + search_path link_library = { c35_CORELIB. b: in . The standard library comes for free. synthetic library”.Feb 2003 / S Molund -. VHDL Translate GTECH netlist Mapping+optimization create_clock .-. end test. use four library variables.

sldb dw01. To check if the synthesized design meets the goals. Whenever the design is optimized and mapped to the technology gates. cd. One is the command line user interface. Beginners who uses graphical tools can easily find these commands/constraints in the menus. Synthesis c35 IOLIB 4M. design analyzer and design vision. Most common operations in synthesis can be done by ”pushing-button” procedure in the graphical tools. 5.sldb dw02. 5. eg. only synthesis commands/constraints are given for each synthesis steps in the following sections.sldb dw02. trying to pick up cells from the technology libraries to meet the design goals. ElectroScience — 78 — .sdf and .3. however.db c35 IOLIB 4M.v format(verilog) for place&route. To use any of the two graphical tools. speed or area of the circuit) by constraint specification.sldb) should be specified here.1 Sample synthesis script for a simple design The following is a sample synthesis script for a simple design.. Use the command report timing or report area. During this process. These two programs are nothing but simple graphical interfaces for dc shell. eg. which is shown in Figure 5.3 Baseline synthesis flow After correctly setting up the environments. it is a more efficient way to do synthesis for the experienced designer. The shell can be invoked by typing the command: dc shell. This is called optimization.sldb” represents standard DesignWare library while ”dw01. namely.. For this reason. If the designer is more into the graphical interface. The link libraries resolve cell references in your design. To synthesize a design using command line based dc shell is usually confusing for beginners. called dc shell. the design files are first analyzed one by one from bottom up in the hierarchy. which means the whole design is translated into a circuit interconnected by technology independent logic gates (GTECH netlist). which works much like the way of a unix shell.3. The variable ”symbol library”. By running synthesis script (a collection of synthesis commands and constraints) on a dc shell. . where ”stdandard. Given some goals (eg. Assume there is one project containing a hierarchy of design files. To synthesize the design. . ”synthetic library” set symbol libraries and DesignWare libraries respectively. and even some common unix command is also supported in such a shell.vhd format for back-annotate simulation. each design is checked for syntax error and loaded into Design Compiler memory if no errors are found. type in design analyzer & or desgin vision -dcsh mode &. which can be done by the command compile. The top design is then elaborated. ls. and DesignWare libraries(standard. That is. Design Compiler starts mapping and optimization processes at the same time.Chapter 5. there exists two programs provided by Synopsys. Synopsys Design Compiler provides two ways to synthesize the design. it can be saved as different format for future use. your own design(* by default).db). report has to be generated. a synthesis flow can be fully automated without any intervention.sldb” is a foundation DesignWare library. so both target library(c35 CORELIB. refer to Design Vision User Guide or Design Analyzer Reference manual for more details about the graphical tools.db” in the directory indicated in ”search path”. the synthesis tool is ready to start.

5.4. Advanced topics

TOP

Module_1

Module_2

Cell_1

Cell_2

Cell_3

Figure 5.3: Sample hierarchical design

/* Analyze analyze -f analyze -f analyze -f analyze -f analyze -f analyze -f

and ELaborate design */ vhdl -lib WORK cell_1.vhd vhdl -lib WORK cell_2.vhd vhdl -lib WORK cell_3.vhd vhdl -lib WORK module_1.vhd vhdl -lib WORK modele_2.vhd vhdl -lib WORK top.vhd

elaborate top -lib WORK /* constraints (design goals) */ create_clock -period CLK_PERIOD -name CLK find(port CLK) uniquify

/* compile (mapping & optimization) */ current_design top compile /* outputs */ write -format db -hier -output TOP.db write -f vhdl -hier -output TOP.vhd write -f verilog -hier -output TOP.v write_sdf -version 1.0 top.sdf /* reports */ current_design top report_timing >"timing.rpt" report_area > "area.rpt"

5.4

Advanced topics

So far, simple synthesis steps have been given to show how to derive a hardware (netlist) from a behavioral HDL descriptions using Design Compiler. However, due to its inherent complexity in synthesis process, further knowledge on inner working mechanism of synthesis process would be beneficial for a designer to control the synthesis process in order for better synthesis results. The following sections will try to address some of important issues that will lead to a better understanding of the synthesis process, hopefully resulting in better script writing to control the synthesis process. — 79 — Lund Institute of Technology

Chapter 5. Synthesis

Figure 5.4: Statistic distributions of wire capacitance value

5.4.1

Wireload model

During optimization phase, design compiler needs to know both cell delays and wire delays. The cell delays are relatively fixed and dependent on the fanins and fan-outs of those cells. Wire delays, however, can not be determined before post-synthesis place and route. One has to estimate such delays in certain ”pessimistic” way to guarantee the estimated delay is equal to or larger than the real wire delays after place and route, otherwise the circuit might malfunction. Usually, Wireload models are developed by semiconductor vendors, and this is done based on statistical information taken from a variety of example designs. For all nets with a particular fanout, the number of nets with a given capacitance is plotted as a histogram, one capacitance is picked to represent the real values for that fanout depending on how conservative criteria you want the model to be. Figure 5.4 is one example of such histogram. If a criteria of 90 percent is chosen, the value is 0.198 pF in this case. Since the larger a design gets, the larger capacitance the wires might have, Wireload models usually come with a range of area specifications. Use the command report lib, you will see the segment describing the wireload models as the example below.
**************************************** report : library Library: c35_CORELIB Version: 2003.06-SP1-3 Date : Wed Aug 18 13:06:21 2004 ****************************************

Library Type : Technology Tool Created :2000.11 Date Created : Date:2003/03/14 Library Version : c35_CORELIB 1.8 - bka Comments : Owner: austriamicrosystems AG HIT-Kit: Digital Time Unit : 1ns Capacitive Load Unit 1kilo-ohm Voltage Unit 1uA Leakage Power Unit : 1.000000pf Pulling Resistance Unit : : 1V Current Unit : : Not specified.

ElectroScience

— 80 —

5.4. Advanced topics

Bus Naming Style

: %s[%d] (default)

Wire Loading Model: Name : 10k Location : c35_CORELIB Resistance : 0.0014 Capacitance : 0.001633 Area : 1.8 Slope : 5 Fanout Length Points Average Cap Std Deviation -------------------------------------------------------------1 5.00 Name : 30k Location : c35_CORELIB Resistance : 0.0014 Capacitance : 0.001633 Area : 1.8 Slope : 6 Fanout Length Points Average Cap Std Deviation -------------------------------------------------------------1 6.00 Name : 100k Location : c35_CORELIB Resistance : 0.0014 Capacitance : 0.001633 Area : 1.8 Slope : 8 Fanout Length Points Average Cap Std Deviation -------------------------------------------------------------1 8.00 Name : pad_wire_load Location : c35_CORELIB Resistance : 0.0014 Capacitance : 0.001633 Area : 1.8 Slope : 15 Fanout Length Points Average Cap Std Deviation -------------------------------------------------------------1 15.00

Wire Loading Model Selection Group: Name : sub_micron

Selection Wire load name min area max area ------------------------------------------0.00 1427094.00 10k 1427094.00 4280637.00 30k 4280637.00 219533760.00 100k

Wire Loading Model Mode: enclosed.

As to how to select Wireload models for wires crossing design hierarchy, synopsys provides three modes, namely, top, enclosed and segmented. This is shown in Figure 5.5. For the top mode, Design Compiler sees the whole design as if there is no hierarchy, and use the Wireload model used for the top level for all nets. In the enclosed mode, Design Compiler uses the Wireload model specified for the smallest design that encloses the wires across the hierarchy. Segmented mode is the most optimistic mode among all these three, it uses several Wireload models for segments of the wire for each hierarchy. You can choose which mode to use by the command set wire load model in your constraint file. Otherwise, design compiler uses the mode specified in the technology libraries or it’s own default mode.

5.4.2

Constraining the design for better performance

After translating process, the design in behavioral HDL is translated into technology independent gates, after which optimization phase starts. Two major — 81 — Lund Institute of Technology

3 Compile mode concepts Before going into details about optimization techniques. combinational logic optimization plus mapping and sequential logic optimization.Chapter 5. flipflops. the basic concepts regarding different compile modes should be explained. Sequential optimization maps sequential elements (registers.5: Wireload mode tasks take place during this process. For combinational logic optimization. If properly used. Two most commonly used methods to do this are logic flattening and structuring which are described in the section about logic optimization. etc. 5. It tries different cell combinations.4. Combinational logic mapping takes in the technology-independent gates optimized from an early phase and maps those into cells selected from vendor technology libraries.) in the netlists into real sequential cells (standard or scan type) in the technology libraries. using only those that meet or approach the required speed and area goals. Synthesis Mode=top 100K D Q 100K D Q 100K D Q 100K D Q 100K 30K 100K 10K D Q 100K 100K D Q 10K 100K 100K D Q D Q D Q Mode=segmented 100K D Q Mode=enclosed 100K 100K D Q D Q 100K D Q 100K D Q D Q 100K 100K D Q D Q 100K 100K 30K 100K 30K 30K 10K D Q 100K 10K 10K D Q 30K 10K 10K D Q 10K D Q 100K 30K D Q 10K 30K 30K D Q D Q D Q D Q D Q Figure 5. and also tries to further optimize sequential elements on the critical path by attempting to combine sequential elements with it’s surrounding combinational logic and pack those into one complex sequential cells. It is crucial for understanding different compile strategies and scripts writing. a phase called technology-independent optimization is performed to meet your timing and area goals. these compile modes will significantly reduce the compile time and improve the final ElectroScience — 82 — . refer to Design Compiler Reference Manual: Optimization and Timing analysis. latches.

the compile time is reduced substantially. 5. Some blocks might be over-constrained while others under-constrained. It starts the optimization and mapping process all over again no matter whether the design was compiled or not before. and integrated together on the top level. and compile top-down for the sub-modules separately. which is the case for the designs at the department. one capability that comes with Design Compiler. Advanced topics synthesis results. etc. it removes it in the design and rebuild a brand new gate structure. the run time of compiling the whole design could be prohibitive. but such improvements come at the cost of increased compilation time. This in turn will direct Design Compiler to work on the wrong part of a timing path. synthesized separately. Use the command compile -inc for incremental compilation. the script writing is much simpler and easy to understand. Full compilation is simply invoked by the command compile in the scripts. a better results among all alternative strategies can be achieved. It tries to re-map these parts and change the structure only if they are improved. within 150k equivalent gates). It starts with an optimized gate structure and works on the parts with violations. physical data from P&R. the synthesis process has to take place in parallel. like the ones a group of people work on. Just take the top level design in the hierarchy and compile it. This is so called top-down compile strategy. manually setting constraints on the sub-block borders could be laborious and apt to mistakes. and do top level compile to fix interconnect violations — 83 — Lund Institute of Technology . a typical divide-and-conquer procedure. This demands the whole design be separated into several sub-blocks. • Full compilation The most commonly used compile mode is the full compilation. Usually this results in better circuit implementation after the design has been changed. however. In such cases. Solution to the above problems is automatic design budgeting. If prior optimized gate structure already exists. If snake path is present due to not fully registered sub-blocks.) to create a set of constraints among all hierarchies. If the design. It utilizes top level constraints and delay information (cell delays. In practice. Problems always emerge in the sub-block interfacing.4 Compile strategy Most designers would not bother with design strategy if their design is within reasonable size (for example.4. such bottom-up and then top-down strategy would not be manipulated as easily as it sounds. turns into a huge project. Wireload models. Design budgeting can be invoked on dc shell with the command dc allocate budgets at three different levels. BY working on only problematic parts without first phase of logic optimization. The budgeting starts either in unmapped RTL level or mapped gate level or the mix of these two. In addition to the QOR(Quality of Results).5. • Incremental compilation Incremental compilation is useful if only part of the design fails to meet the constraints.4. a designer could then apply these constraints to the sub-blocks interested.

which are implicit constraints defined by the technology libraries. Apart from clock period setting. 5. Comprised of four types of constraints (namely speed. Some examples are: • The maximum transition time for a net. To use optimization constraints to obtain better circuit performance. registers). input and output delays have to be specified to complete timing paths constraints. • The maximum fanout of a cell that drives many load cells. Budgeting for Synthesis for more information. you only have to set clock period constraints on the clock pins of the sequential elements. • The maximum load capacitance a net could have.Chapter 5. specifying a cell with input pin connected to the net to function correctly. optimization constraints are commonly used to optimize for the speed of your design. These constraints are required to guarantee the circuit to function correctly. Synchronous paths are usually the timing paths of combinational logics between any two sequential elements (eg. design compiler can not violate design rule constraints. buffering the outputs of a driving cell. In the example. Use the set input delay and set output delay for this purpose. In other word. synchronous and asynchronous paths. measures are taken to meet the requirements.5 Design Optimization and constraints Design optimization is performed by Design Compiler according to user specified constraints. any paths starting from an input port and end at the sequential elements or paths from sequential elements to output ports are regarded as synchronous paths as well.4. Asynchronous delay is defined as point to point paths with partial or no association with sequential elements. power. When violations are found during the optimization process. To specify the maximum delay of these timing paths. Although the design will meet design rule constraints with default values by technology libraries. the designer could still specify more restrict design rule constraints. ElectroScience — 84 — . you should have in mind that your design is composed of two parts. area. Synthesis for top level modules. The other type is optimization constraints. It has a lower priority over design rule constraints. • A table listing maximum capacitance of a net as a function of the transition times at the inputs of the cell. In addition to timing paths between sequential elements. The following sections will discuss these in detail. eg. To set the constraints on these paths like maximum and minimum delays. These constraints are defined by the designer with the goal to get better synthesis results such as higher speed or less chip area. choosing cells with different size. and porosity). Two types of constraints exist for constraining your design. use the command set max delay and set min delay. One of the two types is the design rule constraints. Refer to Synopsys User Guide. Use the command create clock to specify the clocks. even if it means violating optimization constraints. only clock period is set to constrain any synchronous paths between synchronous elements under the assumptions of 0 input and output delay.

5.4. Advanced topics
TOP A_0 Subdesign A Copy Copy
D Q D Q

TOP A_0

Optimize accrodingly

D

Q

Compile A_1
D Q D Q

A_1
D Q

D

Q

D

Q

Figure 5.6: Uniquify method

5.4.6

Resolving multiple instances

Before compilation, the designer need to tell Design Compiler how multiple instances should be optimized according to the environment around. In a hierarchical design, a sub-design is usually referenced more than once in the upper level, thus reducing the effort of recoding the same functionality by design reuse. In order to obtain better synthesis results, each copy of these sub-designs has to be optimized specifically to be tailored to the surrounding logic environment. Design Compiler provides three different ways of resolving multiple instances. • Uniquify When the environment around each instance of a sub-design differs significantly, it would be better to optimize each copy independently to make it fit to the specific environment around them. Design Compiler provides the command Uniquify to do this. The command makes copies of the sub-design each time it is referenced in the upper level of the hierarchy, and optimize each copy in a unique way according to the conditions and constraints of its surrounding logic. Each copy of the sub-design will be renamed in certain conventions. The default is %s %d, where %s is the sub-design name, and %d is the smallest integer count number that distinguish each instance. This is shown in Figure 5.6. The corresponding sample script is as follows:
dc_shell> current_design top dc_shell> uniquify dc_shell> compile

• Don’t touch There is one another way called Don’t touch that compile the sub-design once and use the same compiled version across all the instances that reference the sub-design. This is shown in Figure 5.7. This is used when the environment around all the instances of the same sub-design is quite similar. Use the command characterize to get the worst environment among all the instances, and the derived constraints are applied when compiling the sub-design. A dont touch attribute is then set on the sub-design so that the following compilation in the upper level will leave the all the instances intact. The major motivation of doing optimization in this way is to save compilation time of the whole design. The script is as follows: — 85 — Lund Institute of Technology

Chapter 5. Synthesis
Subdesign A TOP A
D D Q Q

Compile
D Q

A A Copy
D Q D Q

Figure 5.7: Dont touch method

dc_shell> current_design top dc_shell> characterize U2/U3 dc_shell> current_design C dc_shell> compile dc_shell>current_design top dc_shell> set_dont_touch {U2/U3 U2/U4} dc_shell> compile

• Ungroup If the designer wants the best synthesis results in optimizing multiple instances, use the command Ungroup. Similar to Uniquify, Ungroup makes copies of a sub-design each time it is referenced. Instead of optimizing each instance directly, design compiler removes the hierarchy of the instance to “melt” it into its upper level design in the hierarchy. This is shown in Figure 5.8. In this way, further design optimization could be done at the logics that were originally at the instance border which are impossible to optimize with design hierarchy. The better synthesis result comes at the cost of even longer compilation time. In addition, design hierarchy is destroyed, which makes post-synthesis simulation harder to perform. The sample script is as follows:
dc_shell> current_design B dc_shell> ungroup {U3 U4} dc_shell>current_design top dc_shell> compile

5.4.7

Optimization flow

Until now, enough “peripheral” knowledge has been introduced regarding design optimization. This section will move on to give an overview of the whole optimization process, intended for a better understanding of what is actually happening in synthesis process. This, to the author’s view, is beneficial for constraints settings in optimization. ElectroScience — 86 —

5.4. Advanced topics
TOP A_0 Subdesign A Copy Copy A_1
D Q D Q D Q

D

Q

Compile TOP Two logic blocks merging TOP A_0 Optimize accrodingly

D D Q

Q

Remove Hierarchy
D Q

A_1
D Q

D

Q

D

Q

D D Q

Q

Figure 5.8: Ungroup method In fact, design optimization starts already when it reads in designs in HDL format. Aside from performing grammatical checks, it tries to locate common sub-expressions for possible resource sharing, select appropriate DesignWare implementations for operators, or reorder operators for optimal hardware implementations, etc. These high level optimization steps are based on your constraints and HDL coding style. Synopsys refers this optimization phase as Architectural Optimization. After architectural optimization, the design represented as GTECH netlist goes through logic optimization phase. Two major process happens during this phase: • Logic Structuring For paths not on critical timing ranges, structuring means smaller area at the cost of speed. It checks the logic structure of the netlist for common logic functions, attempting to evaluate and factor out the most commonly used factors, and put them into intermediate variables. Design resource is reduced by hardware sharing. Introducing extra intermediate variables increases the logic levels into the existing logic structure, which in turn makes path delay worse. This is the default mode that design compiler optimize a design. • Logic Flattening Opposite to logic structuring, logic flattening converts combinational logic paths into a two level, sum-of-products structure by flattening logic expressions with common factors. It removes all the intermediate variables and form low depth logic structure of only two levels. The resulting en— 87 — Lund Institute of Technology

Chapter 5. Synthesis

hanced speed specification is a tradeoff for area and compilation time, and some times it is even not practical to perform such tasks due to limited CPU capability. By default, design compiler does not flatten the design. It is even constraints independent, meaning design compiler would not invoke it even timing constraints is present. Designer, however, can enable such capability by using the command set flatten. Gate level optimization is about selecting alternative gates from the cells in the technology libraries, which tries to meet timing and area goals based on your constraints. Setting constraints (both design rule and optimization) is already covered in the previous section. Various compile mode can be used to control the mapping process in gate level optimization.

5.4.8

Behavioral Optimization of Arithmetic and Behavioral Re-timing

For designs with large number of arithmetic operations such as multimedia applications, manually setting constraints to optimize datapath is a tedious work. Designer has to go back to HDL descriptions, manually inserting pipeline stages to the parts on the critical ranges. Specialized arithmetic components have to be coded or instantiated in the HDL file, making it harder to maintain. To address this issue, Design Compiler provides two capabilities that will automate such a process. BOA (Behavioral Optimization of Arithmetic) is a feature that applies various optimization on arithmetic expressions. It is effective for datapaths with large number of computations in the form of tree structures and large wordlength operands. BOA replaces arithmetic computations in the form of adder trees with carry save architecture. This includes decomposing a multiplier into a wallace tree and an adder. It also optimizes multiplication with a constant into a series of shift-and-add operations. Compared to similar implementations from DesignWare components, BOA can also work on much more complicated arithmetic operations and does not need to change the source code manually. Acting as a behavioral optimization techniques, it only recognizes high level synthetic operators (+, −, ∗), and is invoked by the command transform csa after elaboration (but before compile). BRT (Behavioral Re-timing) is a technique to move registers around combinational paths to achieve faster speed or less register area. The concept of retiming is introduced in the course “DSP-Design”. By inserting pipeline registers into timing paths, the path delays between any two registers are reduced. As a result, the total circuit could run at a higher speed at the cost of latency. The idea is easy to understand but hard to implement when this is done in the mapped gate level. Use the command pipeline design provided by Design Compiler. This command will automatically place the pipeline registers on your design to achieve the best timing results according to your requirements. One thing to note is that the targeted design has to be purely combinational circuits. There is another command, optimize registers, which optimizes the number of the registers by moving registers in a sequential design. Refer to Creating High-speed Data-Path Components for more details. ElectroScience — 88 —

end behave. — 89 — Lund Institute of Technology .5 Coding style for synthesis Although synthesis tools will optimize the design to a certain extent. The higher the design level you are working on. end if. which comes to different synthesis results. end process. c.5. d) begin -. o : out std_logic). end if. Two examples on how the synthesis results could differ in a good and bad coding style will be shown. end if_bad. if sel="11" then o<=d. ie. a. library ieee. Since this will cover everything from algorithm down to logic gates. 5.1 Coding style if statement This is a “classical” problem regarding if statement coding style in vhdl. In this context. if sel="01" then o<=b.process if sel="00" then o<=a.all.5.std_logic_1164. b. the most important factor for a good design lies mostly in everything in the upper level. what kind of architecture the whole system is utilized. c. sel : in std_logic_vector(1 downto 0). architecture behave of if_bad is begin process (sel. how good the whole system is partitioned. end if. how well the algorithm is. end if. the more performance boost you will get eventually. use ieee. only 10-20 percent of performance enhancement is usually expected. if sel="10" then o<=c.9: synthesis for if bad 5. entity if_bad is port ( a. only coding styles will be covered in the following section. b.5. Below is two coding styles using if statement for the same functionality. which is beyond the range of the tutorial. Coding style for synthesis Figure 5. whether a good coding style is used in VHDL. d : in std_logic.

d) begin -. Synthesis Figure 5. end if_good. end behave. architecture RTL of LOOP_BAD is type TABLE_8_x_5 is array (1 to 8) of std_logic_vector (4 downto 0).5. c. making a hardware for each iteration of the loop. else o<=d. elsif sel="01" then o<=b. architecture behave of if_good is begin process (sel. The following is an example of two for loop coding styles. elsif sel="10" then o<=c. d : in std_logic. critical path increases due to the delay propagated from input pin d through all the way to the output o. Design Compiler is not smart enough to take the advantage of all these exclusive if conditions. entity if_good is port ( a. IRQ) variable INNER_ADDR: std_logic_vector(4 downto 0).10: synthesis for if good library ieee. This results in three cascaded MUXes. So. the designer should aware of the hardware inference in such for loop.Chapter 5. b. Here is the synthesis results from synopsys for both designs.process if sel="00" then o<=a. ElectroScience — 90 — . use ieee. it uses priority coding for if statements without following else statements. end if. Design Compiler simply unrolls the loop. that should be translated into a single MUX.all. c. begin process (BASE_ADDR. o : out std_logic). and try to use as less hardware as possible within the loop. end process. when writing such loops. 5. Instead.std_logic_1164. b.2 Coding style for loop statement For for loop construct in the vhdl design. For if bad design. sel : in std_logic_vector(1 downto 0). a.

"10100"."10000". "00100"."00010". begin TEMP_OFFSET := "00000". end process. -.5. DONE := ’1’."00100". ADDR <= INNER_ADDR. end process. variable DONE: std_logic. for I in 1 to 8 loop if ((IRQ(I) = ’1’) and (DONE = ’0’)) then INNER_ADDR := INNER_ADDR + OFFSET(I)."10100". — 91 — Lund Institute of Technology . INNER_ADDR := BASE_ADDR."10001". end loop."10001"."11000"). end loop. for I in 8 downto 1 loop if (IRQ(I) = ’1’) then TEMP_OFFSET := OFFSET(I). end if. IRQ) variable TEMP_OFFSET: std_logic_vector(4 downto 0)."00010". end if.Calculate Address of interrupt Vector ADDR <= BASE_ADDR + TEMP_OFFSET."11000"). end RTL."01000". begin DONE := ’0’. end RTL. constant OFFSET: TABLE_8_x_5 := ("00001". Coding style for synthesis Figure 5. "10000". "01000". architecture RTL of LOOP_BEST is type TABLE_8_x_5 is array (1 to 8) of std_logic_vector (4 downto 0). constant OFFSET: TABLE_8_x_5 :=("00001". begin process (BASE_ADDR.5.11: synthesis for loop bad variable DONE: std_logic.

Chapter 5.12: synthesis for loop good ElectroScience — 92 — . Synthesis Figure 5.

Chapter 6 Place & Route 6. this manual was never intended to be complete. the publisher and the authors assume no responsibility for errors or omissions.2 Starting Silicon Ensemble Before you start SE you should ensure that your design has been tested extensively as the place and routing job can be very time consuming dependent on the design size. Several digital ASIC’s have been successfully produced (in our digital ASIC group) according to the presented design flow. The required library elements for the design flow are: • LEF/DEF Files (usually provided from the chip manufacturer) • Netlist in Verilog format (generated with Synopsys) !As SE needs a lot of disk quota to do the routing you should check that you have enough free disk space! — 93 — Lund Institute of Technology . The reader is guided by using the Silicon Ensemble GUI and by the use of script (mac) files. 6. For a more detailed description the reader is referred to the Silicon Ensemble manual. Although every precaution has been taken in the preparation of this manual. Neither is there any liability assumed for damages or financial loss resulting from the use of the information contained herein.1 Introduction and Disclaimer This manual is intended to guide an ASIC designer carrying out the designs steps from netlist to tapeout.1. However. The design steps considered in this manuscript are presented in Figure 6.

jnl will be created • Commands in the initialization file se. • A log file se.1: Silicon Ensemble Design Flow To start SE go to the cadence work directory: Setup the SE environment with To view the SE manual type: Start SE graphical user interface with and following actions will take place: cd NameofDirectory source . Place & Route Initialization Files Import to Silicon Ensemble LEF/DEF Verilog Floorplanning Pad Placement Block and Cell Placement Power Routing Clock Tree Generation and Routing Signal Routing Design Verification Figure 6.ini will be run • The SE GUI pops up and is ready to use ElectroScience — 94 — . openbook seultra -m=500 & • 200 MB RAM will be allocated for the use of SE.Chapter 6....

6.2. — 95 — Lund Institute of Technology . Starting Silicon Ensemble Type seultra -h for more information.

ElectroScience — 96 — .lef (custom memeory) in the given order and terminate each import routine with ok.lef (cells and pads) and memory.lef (process parameters).def file. All the files required to start the SE design flow have now been imported. see Figure 6.v.v. If memory cells are used the respective LEF files have to be read as well. In the next step the supply pads definition file that specifies the names of the different power pins is imported: ⇒ Select File → Import → DEF and browse for the supplyPads. the cells and the pads.3 Import to SE To create a database for your design the LEF files have to be imported to create a physical and logical library. Place & Route 6. MTC45000. The description of the memory and a netlist of the design in verilog format needs to be imported to SE: ⇒ Select File → Import → Verilog and browse for the memory verilog file memory.lef. LEF files contain information on the target technology. Set Compiled Verilog Output Library to projdir. MTC45100. ⇒ Select File → Import → LEF Import cmos035tech. The name of the top module is created after the synthesis and can be quite long as all the generics are included in the name. Such files contains the minimum set and a few additional supply pads. The import can be done by using the GUI. Type the name of your top module under Verilog Top Module. Repeat the same procedure and import filter. in the command line of the SE GUI.2. ⇒ Type save design ”loaded”.Chapter 6.

mac.4 Creating a mac file Instead of using the GUI to import the data.jnl. Scan se. a complete mac file serves as documentation and makes it possible to repeat the flow exactly.2: Import of the netlist in verilog format 6. a script file can be executed to perform the same operations. ⇒ Open a new file. If the structure of a script file and the SE design flow is understood. e. Furthermore. Creating a mac file Figure 6.g. You script file may look as follows: — 97 — Lund Institute of Technology . script files can be more than a competitive alternative to the GUI. Paste the instructions in your newly generated file. The *.g emacs. Verilog and DEF files. e.4. 1 verilogin. with an editor. If parameters needs to be changed it is more convenient to do the changes in the script file rather than doing the changes via the GUI.jnl for the instructions that have been carried out with the GUI to load the LEF. All the instructions needed to set up such a script file can be found in se∗.jnl is a screen dump of the report window.6. For future designs the script files can be easily modified and the time needed to do the place and route job is significantly reduced.

FLOAD DESIGN "LOADED". LIB "cds_vbin" REFLIB "MTC45000 MTC45100 cds_vbin".lef.def . INPUT LEF F LEF/MTC45100. LIB "projdir" REFLIB "MTC45000 MTC45100 cds_vbin".lef. SAVE DESIGN "LOADED" .. are variations in commands that result in the same design task possible. INPUT VERILOG FILE "/. ElectroScience — 98 — . Change the first command (in your mac file) INPUT to FINPUT.. INPUT DEF F DEF/supplyPads.your_design:hdl" . However. INPUT LEF F LEF/MTC45000.lef . Place & Route ##-. The difference between these commands is that it is possible to add data to an existing library with INPUT but not with FINPUT. ##-. INPUT LEF F LEF/memory.v".lef .v".mac All the necessary files have been imported as it has been equivalently done with the GUI.Import Library Data ##-FINPUT LEF F LEF/cmos035tech.Chapter 6./memory. ⇒ Select File → Execute → 1 verilogin./your_design. INPUT VERILOG FILE "/. as SE provides a huge set of commands.your design. DESIGN "projdir.

IO’s. Set the value for Block Halo Per Side (distance between blocks and rows) to 30µ for this design.g. chip area etc. row utilization. • The core rows are created • Sites for corner cells are created if necessary ⇒ Select Floorplan → Initialize Floorplan and Figure 6.5. e. The aspect ratio is set to one by default. which is square.3 the design statistics such as Number of Cells. before creating the floorplan. Figure 6.6.3 pops up. see Figure 6. By Initialize Floorplan following actions will take place: • A starting floorplan is created by an estimate of the required layout. The floorplan. as described in Expected Results. ⇒ Select Flip Every other row and Abut Rows to create rows where VDD and GND alternates. The distance of the IO to the core needs to be set. Nets is provided by SE..3. • Global and detailed routing grids are created. will be created by confirming your settings with apply or ok. The calculate button describes the floorplan in terms of number of rows. 150µ in both cases.3: Initialize Floorplan Dialog Window — 99 — Lund Institute of Technology . In Figure 6. The distance has to be large enough to leave room to route power and ground wires. Floorplanning 6.5 Floorplanning The floorplanning is done by using a combination of automatic functions provided with the GUI in combination with modifications done by the user.

4: Initialized floorplan with memory cell ElectroScience — 100 — . IO to core distance. Place & Route You can try different setting.Chapter 6. and observe how the floorplan changes. A floorplan and a memory cell are visible now.g. Before continuing in the design flow return to your initial settings for the floorplan.4. see Figure 6. e. Figure 6.

⇒ Select Move on the icon panel on the left hand side. Next. Assure that the memory supply faces to one of the core borders (the memory pins become visible if net is selected as visible). Drag the memory cell to one of the corners on the core. currently placed on the lower left site. The more supply pads the better.ioc is to make your ASIC fit better with surrounding blocks. Thus the memory power ring has only to be drawn on two or three sides of the memory.6.6. The aim of placing the pads manually by using ioplace.4.6 Block and Cell Placement The memory cell. ⇒ Select Place → IO’s → Random → ok. ⇒ Select Floorplan → Update Core Rows → ok.5: Initialized floorplan with memory cell — 101 — Lund Institute of Technology . needs to be placed on the core. A file ioplace. ⇒ Select Edit if you want to do changes in the pad placement. ⇒ Finally. Avoid core supply pads in the corners of the ASIC. ⇒ Activate cell as selectable. As a rule of thumb: Place the supply pads evenly in the center of each side of the chip. The order of the cell names in the generated file is the placement order: Left → right in the top and bottom rows and bottom → top in the left and right rows.jnl for the instructions and paste them a new file. ⇒ Select Place → IO’s → I/O constraint file and select Write.ioc is generated that specifies the location/orientation of the pads. place the IO pads with ok. ⇒ Scan se. However. 2 floorplan. North FLPNorth South FLPSouth Figure 6.5.g. The coordinates for cell orientation are illustrated in Figure 6. Block and Cell Placement 6. Interrupt a series of IO pads with pad supply to obtain a better driving of the pads.mac. see Figure 6. the power pins and signal IO’s needs to be placed. e. The rows underneath the memory are cut and the rows surrounding the memory are shortened (halo per side) to give space for the memory power ring. The names of the IO’s as specified in the vhdl code are listed with their respective net name in the verilog file. by setting such constraints it might be a much harder job for the router and some configurations might not be routeable at all.

The placement will be done automatically (QPlace). ⇒ Scan se. Dependent on the design size this can take some time. ⇒ Select Place → Cell and ok and the cells will be placed on the rows.6. see Figure 6. Place & Route The floorplan is now ready to insert the cells.jnl for the instructions carried out under block and cell placement and paste them in a new file.g. Figure 6.mac. ⇒ Save you design as placed. e. 3 place.Chapter 6.6: Closeup of the memory and placed cells ElectroScience — 102 — .

Delete other supplies in the dialog window. Power Routing 6.7. wires crossing the core can be added. stripes and power pads To start Plan Power ⇒ SelectRoute → Plan Power → Add Rings. to 30µ and 10µ. the following can be specified: • layer for the stripes • width and spacing • number of stripes (equally spaced across the core) Figure 6. With Plan Power the following design steps are done: • Power paths can be planned and modified before routing them • Generation of power rings that are surrounding all blocks and cells • Creates stripes over standard cell rows • Connects created ring. as at least metal1 is already reserved for signal routing. metal4. The first net name is the left or bottom-most ring. The order of the rings can be defined by the order of the net names. In order to improve the power supply throughout the rows. If the IO to core distance (defined during initialize floorplan) is to narrow you will get a warning.6.7 Power Routing The power ring structures surrounding the core and the memory cell has to be planned and routed. With the dialog box that pops up.7: Dialog box for Add Rings — 103 — Lund Institute of Technology . Only vdd and gnd is needed as core supply. Specify the width for the core supply ring and channel. ⇒ Select Route → Plan Power → Add Stripes. To prevent routing violations choose metal3.7. see Figure 6. ⇒ Select the type of metal you want to use for the power ring. respectively.9. see Figure 6. located between the memory and core.

4 powerroute.8: Core with power rings To create stripes across the core you may use following parameters: layer metal4 width 5. Place & Route Figure 6.g.Chapter 6. ElectroScience — 104 — .000 spacing 1.mac.000 mode all Set to Set spacing 100.jnl for the instructions carried out during power routing and paste them in a new file. e. ⇒ Scan se.000 The strips are only generated to the edge of the core and will be connected later to the power rings.

2.35 CMOS technology.11. as the script can easily be reused in the AMIS 0.4.mac. The different filler cells can be found in MTC45100. Power Routing Figure 6. — 105 — Lund Institute of Technology .jnl for the instructions and paste them in a script file.8. Fill in the Model as IOFILLER64 and the prefix as FP64.7.g.1). It is recommended to save the commands for the filler cells in a separate script file.10. see Figure 6. see Figure 6. Observe how the gaps between the IO pads are vanishing.16. ⇒ Select Place → FillerCells → AddCells to open the dialog box. This can be done by inserting fillercells (dummy-cells) in the IO ring.lef. ⇒ Scan se. e. You only need to set the area parameter to a rather big value to make it applicable for future designs.6. Iterate this procedure with narrower filler cells (32.9: Dialog box for Add Stripes Filling the gaps between the IO pads The gaps between the the IO pads needs to be closed to maintain continuity in the power supply ring. Flip South under Placement. IOdummys. ⇒ Select North. South. Flip North.

All Ports.Chapter 6.def.12. Block.g 5 connectrings.mac. The following is completed with the Connect Ring command: • Connections between IO power pins within IO rows • Connections between CORE IO ring wires and the IO power pins • Connections between stripes and core rings • Connections between block power pins and the core IO wires ⇒ Select Route → Connect Ring Assure that Stripe. Follow Pins are selected. see Figure 6. IO Ring.10: Adding Filler Cells 6.8 Connecting Rings To connect the rings use the Connect Rings command. IO Pad. ⇒ Scan se.jnl for the instructions carried out during connecting the rings and paste them in a script file. ⇒ File → Export → DEF ElectroScience — 106 — . Export your design as b4CT. Place & Route IO IO IO IO IO IO Before placement IOcell IOcell IOcell IOcell After placement: empty cells result in interruption of the supplies IO IO text text IO IO IO IO text text vdd gnd IOcell IOcell IOcell Filler cell Filler cell IOcell IO IO IO text text IO IO IO vdd gnd Filler cells result in continuous supply lines Figure 6. All the supplies are now connected. e. ⇒ Save the design as connected.

9 Clock Tree Generation As an optional tool to SE comes a clock-tree generator (CT-Gen). • Minimizes skew. With the use of data from your DEF file CT-Gen produces clock tree(s). • Controls buffer and inverter selection Following constraints for the clock tree need to be defined before using CTGen: • Clock root. Clock Tree Generation Figure 6. It can be controlled with the GUI or as a stand-alone tool. computes wire estimates and inserts buffers to reduce clock skew.9. • trise and tf all of the waveform. • Produces code for Silicon Assemble router. You need to export your design as b4CT. The input DEF file must be fully and legally placed. It is recommended to route the clock net only after all the core cells are connected to the supplies as eventual routing violations might be prevented. CT-Gen provides following features: • Construction of an optimized clock tree. The output of CT-Gen is a fully placed netlist with new components such as buffers and inverters.6. We will use CT-gen as a stand-alone tool.def before generating the clock tree. • Minimum/maximum insertion from clock root to any leaf • Maximum skew between insertion delay and any leaf pin — 107 — Lund Institute of Technology .11: IO Filler Cell Window 6.

• clock: after the clock tree has been added.constrainsts and set the constraints (see the comments for explanation).mac and paste the instructions in the SE command line.commands parameters such as the input LEF file. ⇒ Start CT-Gen with ctgentool ctgen.12: Closeup of the layout after the Connect Ring command ⇒ Open the file ctgen. In the ctgen. supply definitions need to be set.Chapter 6. ⇒ Open the verilogin. ⇒ Delete all the files in CTGEN that may remain from previous designs with \rm -r CTGEN/. Place & Route Figure 6. ElectroScience — 108 — . However.<type> where <stage> is one of • initial: without modifications. • violations: lists any violations against the given constraints. • trace: describes the clock path/tree with its components. a number of report files will be produced in CTGEN/rpt.commands) as work library. • final: when the clock tree components are placed correctly. the technology and memory LEF files needs to be imported first. The clock generator tool uses the library CTGEN (defined in ctgen.commands If the the clock generation was successful. The report files are of the syntax <stage>. • timing: the same without the details. The DEF file generated with CT-Gen needs to be loaded to SE. and <type> is one of • analysis: complete description of best and worst path found.

lef.jnl for the place filler cell command. Paste the command in your newly generated file. Fill in the Model as FILLERCELL16 and the prefix as FC16.mac and scan se.10 Filler Cells The row utilization as calculated during floorplan initialization is visible on the floorplan. The different filler cells can be found in MTC45000. see Figure 6. !With increased area in your 6 coredummy. Furthermore. Select North. The red gaps between the cells need to be closed to avoid design rule violations.2.4. Flip South under Placement. South.def ⇒ Select Place → FillerCells → AddCells to open the dialog box. Iterate this procedure and change the width of the filler cells (8.13: Core Filler Cell Window — 109 — Lund Institute of Technology .6. Filler Cells 6. By placing filler cells continuity is maintained in the rows. ⇒ Open a new file.10. ).mac file such file can be reused in other designs! Figure 6.13. ⇒ Select File → Import → DEF and select afterCT. Flip North. e.g 6 coredummy. substrate biasing is improved by filler cells substrate connections.

If the final routing is done identify different parts of the ASIC. Check if the design could be improved. If it crashes decrease memory allocation for SE as WRoute is run outside SE and thus will get more memory. ⇒ Select highlight and the entire clock tree becomes visible. Scan the last lines in se. WRoute is included in the SE package and can be used with the GUI or as stand-alone tool. OK in both boxes. ⇒ Scan se. Trace the highlighted signal path and identify the cell where the net ends. ⇒ Select Edit → Find. The benefits of WRoute are: • 3 to 20 times faster than GRoute/FRoute • Timing driven global and final routing • Automatic search and repair in final routing • Supports prerouted nets • Final clean up WRoute requires a lot of memory.g 7 finalroute. The ∗ option ⇒ Selects every netname that starts with nxxx. ⇒ Select nets as visible. Assure that ALL is selected and proceed with ok To start the final routing ⇒ Select Route → WRoute to open the dialog box.11 Clock and Signal Routing For final routing the Envisia Ultra Router (WRoute) is used. Specify the net name as: nxxxx∗. e.jnl for ”Total wire length” and ”Total number of vias”.mac. !It is recommended to route the clock before routing the remaining signals (shortest wiring possible)! ⇒ Select Route → Clock route. Select Global and Final Route and Auto Search and Repair. Choose net as type. ⇒ Select Options and enable Optimize Wire Length. Place & Route 6. ElectroScience — 110 — . Enable Incremental Final Route. ⇒ Select Route → WRoute. After the improved routing is done find the new ”Total wire length” and ”Total number of vias”. Choose one or several pins and find the clock signal.jnl for the instructions carried out during final routing the rings and paste them in a script file.Chapter 6.

However. The GDS II file is ready to send to the manufacturer. for tapeout. the manufacturer requests the GDS II format. To check if there are any unconnected pins or routing violations ⇒ Select Verify → Connectivity.map and terminate with OK.12 Verification and Tapeout As a last step in the SE design you may test if the place and route process went smoothly. Verification and Tapeout Figure 6. — 111 — Lund Institute of Technology . Assure that Types and Nets are set to ALL. ⇒ Choose the map-file cmos035gds2.6. A map-file is used to translate the design to GDS II.12. Tapeout files The finished design needs to be saved as DEF or as GDS II format. If no errors are found the design is ready for tapeout. ⇒ Select a name of the GDS-II file and a name of the Report file. ⇒ Select File → Export → GDS II.14: Closeup of the final layout 6. The manufacturer replaces the cell abstracts with layout cells.

Place & Route 6. Anders Berkemann. Thanks to Shousheng He.Chapter 6.13 Acknowledgements Thanks to Stefan Molund for setting up a system for our needs. ElectroScience — 112 — . Fredrik Kristensen and Zhan Guo for contributing with their experience and knowledge.

a complex cell can have several levels of logic and hence — 113 — Lund Institute of Technology .1.1 Dynamic power consumption Switching power The switching power is due to charging and discharging of capacitances and can be described by superposition of the individual switching power contributed by every node in a design. Hence. Internal power The internal power consumption is due to voltage changes on nets that are internal to a cell. Dynamic power is dissipated when a circuit is active. the switching power increases as logic transitions increase. that is. static power consumption starts to be the major part of the power consumption even during switching. not switching. that is. Here. This power is consumed during a momentary short caused by a non-perfect signal transition. the voltage on a net changes due to changes at the input.Chapter 7 Optimization strategies for power 7. The load capacitance as seen by a cell is composed of net and gate capacitances on the driving output. transitions on a cell input do not necessarily result in a logic transition at the output. nonzero rise or fall time. Since charging and discharging are result of logic transitions. Internal power consumption of a cell is said to be both state and path dependent. static power is considered to be consumed when the circuit is inactive. Clearly. In a static CMOS cell. providing a direct path between Vdd and ground. However. On the other hand. 7. that is. short-circuit power is the major contributor and often used synonymously.1 Sources of power consumption Two major sources of power consumption are considered: Dynamic and static power. both P and N transistors are conducting simultaneously for a short time. one further distinguishes between switching and internal power consumption. As technology is scaled down.

thus consuming more internal power. I3 is gain-induced drain leakage and finally. the threshold voltage is also decreased to sustain an improvement in gate delay for each new technology generation [8]. the major sources of leakage are the subthreshold current (part of I2 ) and gate leakage due to oxide tunnelling (part of I4 ).2 Optimization Where and when to apply power optimization strategies is simply answered. The current I1 in Figure 7.2 Static power consumption In digital CMOS technology. the gain-induced drain leakage (I3 ). I2 is a combination of subthreshold current and channel punchthrough current. I4 is a combination of oxide tunnelling current and gate current due to hot-carrier injection. which makes leakage increasingly troublesome. Input A and D can each cause an output transition at Z. Despite possible savings at higher abstraction levels. For submicron CMOS technology. However.1: Cell with different logic depths. D only affects one level of logic.1. static power consumption is caused by leakage through the transistors. Leakage power due to subthreshold current increases exponentially with threshold voltage scaling [8]. It consumes a different amount of internal power. 7. Optimization strategies for power transitions on different input pins will consume a different amount of internal power. Since the ratio of supply voltage to threshold voltage affects the circuit delay. as exemplified in Figure 7. However. depending on the number of logic levels that are affected by the transition. Traditionally.1. However. Figure 7. as early as possible and at the highest possible abstraction level. the supply voltage is decreased to reduce electrical field strengths and power dissipation [9]. A B C D Z Figure 7. A typical example for a cell with state dependent internal power is a RAM cell. and the pn-junction leakage (I1 ) may also become important factors.3 shows that the attainable power savings are the greatest at system level and still significant downto register transfer level (RTL)—before the design is committed to a specific technology. In future CMOS technologies. 7. the highest accuracy on a power estimate is at gate and transistor level. the increase in static leakage power cannot be ignored in CMOS circuits of today and will be an even greater issue in the future. ElectroScience — 114 — . six different types of leakage are described.Chapter 7.2 comes from the pn-junction reverse-bias. depending on whether it is in read or write mode. In [8]. leakage has been a rather small contributor to the overall power consumption in digital circuits. whereas A affects all three. In CMOS circuits currently used today.

it is assumed that the voltage swing is equal to Vdd .2: Leakage currents in a MOS transistor.3: Power savings and accuracy attainable at different abstraction levels [10]. Figure 7. synthesis.2. Vswing is the voltage swing. 7.7. RTL power estimation in Synopsys involves basically the same steps as the more accurate gate-level estimation and will hence be neglected in favour of the latter. Optimization I4 Gate I3 I2 I1 Gate Oxide Source Drain Figure 7. and verification flows that come along with it. 1 From our experience. most IC’s are still developed at RTL due to the maturity of design. (7. which is usually true. For simplicity. Following are some of the options that can be thought of. Based on the obtained results.2. Vdd is the operating voltage. design modifications can be carried out1 . and f is the clock frequency. RTL power estimation is usually fast and gives early answers to the questions: • Which architecture consumes least power? • Which module in a design is consuming the most power? • Where is power being consumed within a given block? Doing such an estimation does not introduce much overhead to your design flow and should be carried out if power is a serious objective in your design. — 115 — Lund Institute of Technology .1) where CL is the total capacitive load on the chip that is switched.1 Architecural issues—Pipelining and parallelization The dynamic power consumption Pd for a CMOS design depends on the operating speed as 2 Pd ∝ CL · f · Vdd · Vswing ≈ CL · f · Vdd .

a circuit designed for high ElectroScience — 116 — . Comparing the hardware mapped and the parallel implementation. Then. and time shared (d) implementation.3) 1 t (7. while the time shared design has to double f to reach freq .4: HW mapped (a). with a lower clock frequency. pipelined (b). α. Consequently. and Vt depend on the silicon process. Ccrit is the capacitive load in the critical path. K and α are constants. Note that α accounts for velocity saturation in processes characterized by short channel devices. K(Vdd − Vt )α (7.4 [11]. parallel (c). The hardware mapped and pipelined designs both produce one sample per clock cycle and hence have f equal to the required frequency freq . Optimization strategies for power Let freq be the frequency that is required at the output of a design to achieve a desired throughput. see Figure 7. However. The parallel design produces two samples per clock cycle and f can be reduced to freq /2. it seems as if there is nothing to gain from a power perspective by decreasing f since CL increases accordingly. the architecture of the design will affect both freq and thus the dynamic power consumption. f = 2 · freq FF MUX f = freq FF f = freq FF f = freq /2 FF FF FF FF FF FF FF FF (a) (b) (c) (d) Figure 7.Chapter 7. Vdd can be decreased since f= and tpd ∝ Ccrit Vdd .2) where tpd is the propagation delay. K. and Vt is the threshold voltage. Note that the pipelined design has a shorter critical path and can therefore reach a higher maximum clock frequency.

(7.3 and 1. To design a device with active power control will increase the number of constraints on the design and add additional verification time. α approaches numbers smaller than 2.4(c). With (7.org 55c ≈ = 2 K(Vdd − Vt ) KVorg KVorg (7.26.org = 55c and Ccrit.5) yields 2 · 55c Vorg 56c .par = 112c. The load along the respective critical path is then Ccrit. and 1c.18µm. Hence. the dynamic power consumption is calculated as fpar = Hence.3) and.1) and (7. 3c. The latter produces two samples each clock period and thus the frequency can be halved while the original throughput is maintained. the power savings will not be as large as in the presented example for processes below 0.5 for a 0.6) and (7.8) (7.9) The parallel design has the same throughput at approximately a quarter of the original design’s power consumption and can achieve double throughput at Vdd becomes double power consumption. for example. However.5) forg .4) tpar = 2 · torg .par Vdd Ccrit.2) and (7. Vt is assumed to be much smaller than Vdd and α = 2.7. Note that the approximation Vt too rough and unreliable as silicon processes shrink since Vt does not scale with Vdd .6) torg ≈ (7.7) into (7. tpar ≈ Ccrit. if the design should be able to switch between high throughput/power and low throughput/power mode during run time. Then. between 1. and a multiplexer be 24c. Optimization speed can. with (7.org = 56c and for the parallel design CL. — 117 — Lund Institute of Technology .par = 56c. reduce the power consumption below that of a circuit only designed to operate at low speed. 2 2 Porg CL. As an example.1). an adder. A cell library that is characterized for multiple voltages is needed as well as hardware to generate the voltage levels and logic to control it. it is possible and an example of a commercial processor with active power control is found in [13]. a register. using (7. when operated at low speed. 2 (7. respectively.org Vdd Ccrit. for simplicity and rather traditional. f Vorg 2 CL.7) Substituting (7. = ⇒ Vpar = KVpar KVorg 1.25µm process [12].2. the total load of the original design is CL. Also.par fpar Vpar 112c org ( 1. consider the designs from Figure 7.96 )2 Ppar 2 = = ≈ 0. an active power controller will be required.4(a) and Figure 7.par 56c ≈ = K(Vdd − Vt )2 KVpar KVpar Ccrit.4) Let the capactive load of a multiplier. In addition. 1c.org forg Vorg 56cforg Vorg (7. The drawback with the parallel design is that the amount of hardware is doubled.96 and finally.

there are some differences worth noticing. One simple way to deal with this problem is to gate the clocks into these blocks. These blocks will consume switching power and contribute with unnecessary load to the clock tree. At some point there will be no more combinational logic that can be separated by pipeline registers. However. but larger.5(a). A safer. for pipelining to be really efficient the delay through each stage should be balanced since frequency and voltage are proportional to the critical path. the number of clock cycles from which data is present at the input until the corresponding result is presented at the output. To turn off the gated clock for one clock cycle.5(b).2 Clock gating When a design increases in size and functionality there is a great chance that parts of the hardware are only used at certain times or in certain operation modes. In addition to reducing the clock load it will prevent unnecessary switching activity since almost all hardware modules have registers at their input. Here.5: A simple clock gate (a) and its timing diagram (b). en is latched in order to avoid transitions while the clock signal is high and glitches on the output. en gclk clk (a) clk en gclk (b) Figure 7. This design is simple but sensitive to glitches. Finally. pipelining increases latency. First and most important. 7. turn off the clocks. implementation is shown in Figure 7. the amount of additional hardware is normally much less than in the parallel case since only some extra registers are needed. which must be avoided under all circumstances. that is. that is.Chapter 7. Secondly. Optimization strategies for power To insert additional pipeline stages in the original design has the same effect as parallelizing. When the enable signal en is low the gated clock is turned off. In addition. The simplest clock gate is implemented as an AND-gate. en is lowered at ElectroScience — 118 — . whereas parallelizing is only restricted by area.2. The timing diagram is shown in Figure 7. as shown in Figure 7. either the throughput is increased or the power consumption is reduced. pipelining has an upper limit to the number of pipeline stages that can be inserted.6(a). even though the result is unused.

the datapath operator output is not used if it is an unselected input to a multiplexer or if the datapath operator is an input to a register that is currently disabled. additional logic (AND or OR gates) called isolation logic is inserted along with an activation signal to hold the inputs of the data-path operators stable whenever their output is not used. With the operand isolation approach.3 Operand isolation Operand isolation. Traditionally. also called guarded evaluation [14].6(b). datapath operators are always active. Optimization en D Q G enl gclk clk (a) clk en enl gclk (b) Figure 7. do_operand_isolation = true read -f vhdl /simple. Isolation candidates can be identified in the HDL source by using a pragma annotation or at the GTECH level by using commands in the synthesis script.6: A latched clock gate (a) and its timing diagram (b). This design is safer since the latch is only transparent when the clock is zero keeping the output of the AND-gate at zero.2. In a particular clock cycle. any time in the preceding clock cycle and then raised again in the next clock cycle. However. instead of halting the clock signal.vhd set_operand_isolation_style -logic or set_operand_isolation_cell DW_MULT set_operand_isolation_slack 5 /* define constraints */ — 119 — Lund Institute of Technology . prevents unnecessary operations to be performed. 7. Due to changing inputs. datapath signals are directly disabled and hence no new results are evaluated.7. as shown in Figure 7. they dissipate power even when the output of the operators is not used.2.

such as mobile phones.2. One efficient and often-used method to combat leakage is to use dual or variable threshold technologies [15. power consumption due to leakage remains. thus. A rollback mechanism is also provided to remove the isolation logic if you determine that the delay penalty is not acceptable. AS S0 D0 D1 S1 Figure 7. the inputs to the operator are kept stable. In a dual threshold technology. use OR gates to create the operand isolation since this holds the inputs to the operator at logic 1 during inactive mode. Optimization strategies for power compile In Synopsys. AND gates are used for inputs that mostly stay at logic 0.4 Leakage power optimization While the active power consumption varies with the workload and can be locally eliminated using for instance clock gating.8: Operands isolated from the datapath operator. adders. For inputs that mostly stay at logic 1. By creating an activation signal (AS) based on this observation. 16]. see Figure 7. Ideal candidates for operand isolation are arithmetic modules such as arithmetic logical units (ALUs). a large part of the system is in idle mode most of the time. as long as the power supply voltage is switched on.7. For burst mode applications. 7. the result from the datapath operator is not used in further steps if S0=0 or if S1=1. Consider the example from Figure 7. making simple saving schemes such as clock gating less attractive. operand isolation automatically builds isolation logic for the identified operators.7: Datapath operator and selection circuitry.Chapter 7. Similarily. there ElectroScience — 120 — . S0 D0 D1 S1 Figure 7. and multipliers that frequently perform redundant computations. This makes the energy consumed by leakage a large contributor to the overall energy consumption and.8. Here. minimizing switching activity at the transition from active to inactive mode.

In Figure 7. In [8]. such as clock gating. The difference with these cell libraries is that the oxy-nitride dielectrics are as — 121 — Lund Institute of Technology .5 Beyond clock gating With aggressive gate length scaling follows an excessive gate leakage current that is due to very thin oxy-nitride gate dielectrics. Results in [18] indicate that reverse body bias is an efficient method to reduce leakage. which also results in less leakage current [8]. A simple way to save power is to shut down a function block when it is not used. the potential Vm is higher than 0V due to the leakage current. Therefore. stacking transistors. To retain this information some memory stages can be inserted in the block that stores the state the block was in before it was shut down. power consumption due to leakage is described as Ileak · Vdd . A technique using reverse body bias for standby power management is described in [20]. The leakage current may differ several orders of magnitude for a simple CMOS gate depending on the input pattern [17]. The increased potential at Vm makes the voltage between gate and source negative for transistor T2. making it a feasible technique for power reductions. and in [19]. This method is applicable if it can be predicted when a block is going to be used again. increased channel length.9(b). will cease to work. This rather long time duration is due to the risk of interference with other parts on a chip. see Figure 7. and reverse body bias. Reverse body bias involves increasing the bulk potential for the PMOS transistors and decreasing the bulk potential for the NMOS transistors. 7. For handheld applications a low power dissipation is very important. This stems from the fact that the time to power up a block is in the order of a few microseconds.7. supply voltage scaling and reverse body bias are suitable as adaptive or reconfigurable power management methods. Supply voltage scaling directly affects leakage.2. and thereby reduces both leakage current and maximum transistor current [8]. special low power cell libraries has been developed for new technologies. Therefore. The voltages between bulk and source. one with high threshold voltage transistors for low power and one with low threshold voltage transistors for low delay. Stacking transistors increases the resistance between supply and ground. the static power dissipation will rapidly increase with the development of the technology and methods for power saving. Reduced leakage power is therefore achievable during idle mode depending on the selected patterns for internal nodes. Using low threshold devices only in timing critical paths can radically decrease leakage currents both in standby and operational mode. Optimization are typically two versions of each digital gate. Increased channel length is a static design method while stacking transistors. A disadvantage with these methods is that the block loses its information. the input pattern has great impact on the leakage power. Reverse body bias as well as increased channel length can be utilized to increase the threshold voltage and thereby reduce leakage [8]. A way to reduce the power-up time and still save power is to just lower the voltage in a block when it is not used.9(a). which reduces the leakage current.2. it is shown that the leakage current decreases when the supply voltage is decreased. Due to transistor stacking in CMOS gates. a comparison is made between four different leakage reduction techniques involving supply voltage scaling. and between drain and source also change due to the stacking effect. In [18].

thick as in the previous technology. As an example. the downscaling of technology causes the drain current in a transistor (Gain Induced Drain Leakage) to increase to be in the same order of the gate current. ElectroScience — 122 — . Optimization strategies for power Vdd 0V T1 Vm 0V T2 (a) Vdd Vdd + V1 GN D − V2 (b) Figure 7. a format that models different properties of a cell’s and a design’s behaviour is presented.9: The stacking effect (a).3 Power estimation with Synopsys Power Compiler This section tries to give some practical examples of how to carry out power estimation using Synopsys PowerCompiler. First. the 90 nm and the 130 nm technology use oxy-nitride dielectrics with same thickness. 7. Information provided by a cell library should be briefly verified by inspecting the respective files in order to be able to manually calculate the power consumption of a cell. see [21]. Reverse body bias (b). One can also predict that technologies have to be developed with high-κ gate dielectrics to meet stringent gate leakage and performance requirements.Chapter 7. Using the methods above to decrease the power dissipation of the block has an increase of 5% to 20% in chip area of a block. The drawback with this is that the performance with respect to speed will also be similar to the previous technology. Also. More information about future technology predictions.

(MODULE "HDEXOR2DL" (PORT (Z (COND !A2 RISE_FALL (IOPATH A1 ) COND !A1 RISE_FALL (IOPATH A2 ) COND A2 RISE_FALL (IOPATH A1 ) COND A1 RISE_FALL (IOPATH A2 ) ) ) ) (LEAKAGE (COND (!A2 * !A1) COND (A2 * !A1) COND (!A2 * A1) COND (A2 * A1) COND_DEFAULT) ) ) When a transition at output Z occurs. Here.db dc_shell> lib2saif -output <library_name>.1 Switching Activity Interchange Format (SAIF) In order to allow for power estimation and optimization there has to be input in form of toggle rate and static probability data based on switching activity. Power estimation with Synopsys Power Compiler 7. or port went through either a 0 → 1 or a 1 → 0 transition. the forward annotation SAIF file contains directives that determine which design elements are to be traced during simulation. For example. there are two types of SAIF files: forward annotation and back annotation.13µm process. Furthermore. the transition belonged to COND !A2 RISE FALL (IOPATH A1). if a transition at Z is caused by A1 and the value of A2 was “0” at the time.7. it lists synthesis invariant points (nets. The extra information provided results in more accurate estimates. pin. dc_shell> read -format db <library_name>. The former is provided as input to the simulator.3. Generally. typically available through simulation. the simulator should determine which path caused the transition and evaluate the associated state condition at the input transition time. Considered is a 2-input XOR cell from the UMC 0. for example.db Following is an example of a forward annotation SAIF file for gate level simulation that takes state and path dependent information into account. the toggle rate is a measure of how often a net. all four possible outcomes can — 123 — Lund Institute of Technology . modelling state and path dependent power consumption of cells. These directives are generated by the synthesis tool from the technology-independent elaborated design. the latter is generated by the simulator as input to the following power estimation and optimization tool. the forward annotation SAIF file contains information from the technology library about cells with state and/or path dependent power models. The static probability is the probability that a signal is at a certain logic state. For RTL simulation. With SAIF [22] data. In this example. DC can perform a more precise power estimation. signals). ports) and provides a mapping from RTL identifiers to synthesized gate level identifiers (variables.3. dc_shell> rtl2saif -output rtl_fwd.saif -design <design_name> For gate level simulation.saif <library_name>. Following is the command that can be used from the dc shell prompt.

Unbalanced logic often give rise to these kind of glitches. They consume the same amount of power as a normal toggle transition does and are an ideal candidate for power optimization. the simulation tool provides a de-rating factor (IK) to scale the IG count to an effective count of normal toggle transitions. are extra transitions at the output of a gate before the output signal settles to a steady state. the number of transitions is not enough to accurately estimate their power dissipation. Also. In a back annotated SAIF file there are timing and toggle attributes. non-glitching and glitching. The second category of glitches is called inertial glitches (IG). To model glitching behaviour. Optimization strategies for power be assigned different power values in the back annotation SAIF file. the leakage power section is divided into four outcomes for valid cell states and a default outcome for cell states not covered by the aforementioned ones. Transport glitches (TG). their pulsewidth is smaller than a gate delay and would be filtered out if an inertial delay algorithm were applied in the simulator. However. for example. This information usually yields reasonable analysis and optimization results.Chapter 7. Therefore. Unlike TG. The back annotation SAIF file from the simulator includes the captured switching activity based on the information and directives in the appropriate forward annotation file. Switching activity can be divided into two categories. ( COND (A*B*Y) COND (!A*B*Y) COND (A*!(B*C)) COND B COND C COND_DEFAULT (T1 (T1 (T1 (T1 (T1 (T1 1) 1) 2) 1) 1) 3) (T0 (T0 (T0 (T0 (T0 (T0 8) 8) 7) 8) 8) 6)) ElectroScience — 124 — . The former is the basis of information about toggle rate and static probability. The former group includes: • T0: Total time design object is in “0” state • T1: Total time design object is in “1” state • TX: Total time design object is in unkown “X” state • TZ: Total time design object is in floating “Z” state • TB: Total time design object is in bus contention (multiple drivers simultaneously driving “0” or “1”) The toggle attributes can be summarized as: • TC: Number of “0” to “1” and “1” to “0” transitions • TG: Number of transport glitches • IG: Number of inertial glitches • IK: Inertial glitch de-rating factor Following is an example of how state dependent timing attributes appear in the back annotated file and how they are interpreted. a full-timing simulation is required.

Included are general definitions of base units.2 Information provided by the cell library As an example. In order to calculate — 125 — Lund Institute of Technology . Figure 7. and wire load models.3.10) This is the basic information needed to calculate switching and leakage power. the standard cell library to UMC’s 0. The information is bundled in [23].7. we consider a 2-input XOR cell.11) (7.13µm process is considered.11 shows an example of how the internal energy per transition is obtained using a look-up table.3 Calculating the power consumption As a simple example. There are 9 cycles in total and for every condition defined. The information needed to calculate the power consumption is in a file called HDEXOR2DL. 7. From the simulation time tsim . The path dependent power model distinguishes which state the pin is in and whether a rising or falling transition occured. this look-up is done for every input pin. the toggle rates T R and static probabilities P r{I}. Extending this. an extrapolation takes place if the exact index into the table is not available. the number of cycles for which this condition is fulfilled are counted. operating conditions. Furthermore. 7. Each net is modeled with its timing and toggle attributes. These templates use weighted input transition times and output net capacitances as indeces into the respective look-up tables that specify the energy per transition.rpt. Included is the necessary wire load model and the cell description.10: Simulated waveforms result in back-annotated SAIF data as listed before.10. as indicated in the figure. Power estimation with Synopsys Power Compiler A B C Y A*!(B*C) DEF DEF C !A*B*Y A*B*Y DEF B Figure 7. The respective simulated waveforms are depicted in Figure 7.saif. we list the back annotated SAIF-file that results from a gate level simulation of a simple XOR gate in xor back. Considering a simple cell.3. (7. Each cell’s behaviour is described in conjunction with timing and power model templates. are obtained according to T R = T C/tsim and P r{I} = T 1/tsim .3.

Here you have to make use of the state dependent leakage power information that is provided in the cell library. Figure 7. 42.Chapter 7. t. (7. (7. usually 1ns.4 65. but only half of them affect the power consumption. The factor 1/2 results from the fact that T R accounts for both 0 → 1 and 1 → 0 transitions.4 Power-aware design flow at the gate level After having introduced the basic concepts of power estimation. the number of transitions is counted on a per second basis and hence one has to normalize T R to the default time unit (t. The total leakage power of a cell is determined by a superposition of the probabilities of being in a certain state times the leakage power for this state. N Pleak = i=1 P r{I = i} · Pleak.11: Look-up table based internal energy calculation. Optimization strategies for power Energy/transition z y x 69. Surely.) of the cell library. Also.82 Weighted average input transition time Output load cap. 2m ]. The switching power of a cell can be expressed as Pswitch = 1/2 · TR 2 · (Cwire + Cpin ) · Vdd .67 0.20 0.13) Here.u.56 0.8 0. one gets a feeling for the amount of information that has to be processed to get an estimate of a design’s power consumption.12) where N = 2m is the number of combinations of m binary inputs and I = [1 .rpt. 7.1 10. use the back annotated SAIF file listed earlier in this section. For verification of your results.u. The internal power is a lot more subtle to calculate since the tool’s extrapolation algorithm is hard to track. Based on these facts try to calculate both switching and leakage power of the cell. one has to take the load capacitance into account which is usually comprised of wire load and pin load. . toggle rates and static probabilities of the nets. . the power estimation report is found in xor.i .2 30. it is time to conclude with a complete design flow.9 Figure 7.12 shows an example of a design ElectroScience — 126 — .

In most cases clock gates have to be inserted manually in the design together with control logic but in some cases there are tools that can perform automatic low level clock gating. An incremental compile will then try to optimize the netlist based on the premises it was provided with. Compiling the elaborated RTL or GTECH description results in a gate level netlist on which you can annotate switching activity from a back annotation file (BW-SAIF). Figure 7. These flip-flops are implemented with a feedback from output to input and a multiplexer that selects if the previous output is to be kept or a new input is to be accepted. the switching activity is annotated and power constraints are set to the design. no extra control signals are needed since all control logic is already there. Synopsys claims that the power estimates lie within 10-25% of SPICE accuracy. timing constr. — 127 — Lund Institute of Technology . delay (. netlist (.13(b). delay (. timing and area constraints are set to the design. preferably nc-verilog or vcs since these simulators make use of state and path dependent information provided by the cell library (FW-SAIF).13(a) shows a flip-flop with enable input.sdf) Backannotate BW-SAIF Gate-level sim. This file is created by a gate-level simulator. After RTL clock gating is completed. elaborate clock gating RTL.sdf) Backannotate BW-SAIF Gate-level sim.v). Figure 7.7. netlist (. see Figure 7.14 shows the result of automatic clock gating on a bank of enable flip-flops. Analyze. Unlike manual clock gating.v).4. Power-aware design flow at the gate level flow that estimates and optimizes power consumption of a digital design at the gate level. GTECH Compile area. Note that MS can not make use of this information and hence the power estimates based on the toggle files from this simulator will be inferior. If the design was fully annotated. Now. These tools automatically find flipflops that have an enable control and exchange these to a clock gate and standard flip-flops.12: Gate-level power estimation and optimization flow. capture toggle info FW-SAIF SDPD info Report power Compile -incr power constr. capture toggle info FW-SAIF SDPD info Report power Figure 7.

1 Design flow Figure 7. There is a predefined optimization priority that can be changed. although great care has been taken to debug the scripts.13 µm technology. These scripts can be downloaded from [?]. diverse cell libraries make life easier for the optimization tool. Then. in D Q out en clk clock gate gclk Figure 7. and area constraints. the names of each script and environments they are executed in. All feedback is welcome. Optimization takes place by evaluating optimization moves and their impact on the overall cost function.scr ElectroScience — 128 — . designs dominated by sequential logic often achieve less power reduction since different sequential cells tend to have less variation in power than combinational cells. and max capacitance. No responsibility is taken for the consequences of using these scripts. 7. Following are in descending order dynamic power.5 Scripts This section provides a short description on scripts for DC and Cadence ncverilog that deal with power estimation and optimization. Also.15 shows the design flow for the power optimization scripts.5. The scripts are written for the UMC 0. are the most important ones. Inputs are header. Generally. leakage power. A general knowledge about UNIX and DC is assumed. Furthermore.Chapter 7. design rule constraints set by the technology library. 7. Optimization strategies for power in en D Q out in D Q out en clk (a) clk (b) Figure 7.14: The result of automatic clock gating on a bank of enable flip-flops. a positive timing slack will leave more possibilities for optimization whereas already tight timing constraints might not result in any power improvement at all. By default. max transition. for example.13: An enable flip-flop (a) and its implementation in Synopsys (b). max fanout. timing is the next critical issue.

15: Design flow for power optimization with DC and nc-verilog.sdf) Figure 7.ncverilog switching activity (.scr).saif compile. power constraints on both dynamic and leakage power can be set in the header.ncverilog new switching activity (. All information about the design and the target values for the synthesis is listed in header.scr timing area. and VHDL code.saif) Design Compiler final power. The result is an initial VERILOG netlist together with a standard delay format (SDF) file that will be used in nc-verilog to estimate switching activity in each node of the design.saif) Design Compiler power opt.scr netlist (. This is carried out by running ncverilog -f compile. The first step in the optimization takes place in DC. a power optimized netlist.scr Design Compiler read hdl. is used in nc-verilog to collect the new switching activity by running — 129 — Lund Institute of Technology .5. Timing and area constraints are set in this stage and DC will look for places to insert clock gates. elaborated. Scripts vhdl description header.ncverilog Then. With this results as a guideline.vhd) delay (. All important results are collected in a report file. the switching activity report is used to estimate the power consumption in DC (power opt. Here the design is be read.sdf) nc-verilog umc013.scr.7. and synthesized.scr final power report netlist (.v) first power report nc-verilog compile pwr.scr.scr power optimized netlist (.v) delay (. An incremental synthesis is performed and the result.

ElectroScience — 130 — .scr header.scr script executed in the syn folder will perform all steps in the design flow and the result can be read in the report files.scr timing area. open the header.scr and fill in search paths. the all. However.sdf script has to be executed in PrimeTime due to some naming conventions problems between the UMC libraries and the DC SDF file.scr final pwr. create the hierarchy shown in Figure 7.ncverilog The remaining step is to estimate the final power consumption and collect all important information in a report file. this netlist and the SDF file can be used to perform a post synthesis simulation for verification purposes in MS. since more than one program is involved there are some parts of the scripts that have to be patched manually.scr read hdl. Then. To initialize the project.scr.scr Figure 7.ncverilog vhdl vhdl files stimuli all.1. Optimization strategies for power ncverilog -f compile_pwr. To create an SDF file for VHDL the write.scr script is executed. Finally. The final output are a VHDL netlist. and a report file with the final power consumption estimate.scr write sdf.2 Patches The aim was to generate a fully automatic design flow that could be executed with one script and where all parameters are set in one constraint file.5.ncverilog compile pwr. If the above procedure is performed correctly.ncverilog in the sim folder. Then.syn in the syn folder and source setup. which files and what to change are listed below. constraints. and design names.16: The required file hierarchy for an automated design flow.ncverilog compile. it is strongly recommended that the scripts are executed one at a time and the observed results are correct.16. The complete structure with all script files and an example can be downloaded from [?]. at least once before the all. an SDF file. This is performed in DC with final pwr.16 must be provided. However. the parts that have to be manually patched are shown in Table 7.scr report. In addition to run an automated flow a file hierarchy as shown in Figure 7. 7.Chapter 7.scr power opt. PROJECT syn source setup. Then source setup.syn sim source setup.

Might cause timing errors in DC. tb ALU. Warning: Use clock gates with latch to avoid glitch sensitivity../syn/netlists/ALU W L8.sdf verilog tb verilog tb verilog tb compile.6./syn/netlists/ALU W L8 PWR.v tb name tb ”MODULE”.ncverilog compile pwr. 7.. elaborate ALU -lib WORK -update -gate clock Description: Elaborates the design and looks for places to insert clock gates. set false path -to find(-flat. type sold and look in the Power Management/Power Compiler reference manual/Using Verilog. tb ALU.sdf file name must be tb ”MODULE”.6 Commands in DC Some commands that are frequently used in DC and nc-verilog for power optimization are briefly explained in this section.sdf write.ncverilog compile pwr. feel free to type: dc_shell> list -commands For a complete explanation of the commands in nc-verilog.v 7. e.v. Warning: Will only function correct if the latches in the clock gates are the only latches in the design (this is normally the case). -hierarchy. DC might otherwise insert one clock cycle of delays on the enable signal.v.db result file name netlist/ALU W L8 PWR. — 131 — Lund Institute of Technology .. For a complete explanation use the man command in dc shell. Clock gating set clock gating style -sequential cell latch -minimum bitwidth 3 Description: Tells DC to use latch based clock gates if it finds enable registers that are at least -minimum bitwidth bits wide when elaborating. for example.1: Manual patches. tb ALU.g. dc_shell> man <command_name> To get a glimpse of how many commands there are.v netlist name . e.g. tb ALU.v tb name tb ”MODULE”.7. Parameter Values and example design data base db/ALU W L8 PWR.g.v netlist name . It is assumed that the reader already has basic knowledge about common commands in DC.v. Commands in DC File write..ncverilog Table 7.6. see also set false path. e.1 Some useful DC commands This section is an excerpt of the commands that appear in the power estimation scripts. pin "latch/D") Description: Tells DC to NOT perform timing checks on the enable signal in clock gates.v Design under test must be labeled as dut inputs and parameters see the example..ncverilog compile.

Used to recompile the design after power constraints are specified. Warning: Only necessary if power estimation is performed on RTL.Chapter 7. This file is then used in a simulation tool to capture switching activity. power preserve rtl hier names=true Description: Preserves RTL-hierarchy in RTL design. Miscellaneous commands compile -boundary optimization Description: Will optimize across all hierarchy levels. rtl2saif Description: Used to create a forward annotated SAIF file. The forward SAIF file for gate-level simulation is provided by the technology library (lib2saif). Optimization strategies for power report clock gating -hier -gating elements Description: Reports all inserted clock gates. set max dynamic power 8 mW set max leakage power 20 uW Description: Sets target values for power optimization. reset switching activity -all read saif -input "backward. report report report report report report report ElectroScience area > REPORT FILE timing >> REPORT FILE constraint >> REPORT FILE design >> REPORT FILE reference >> REPORT FILE routability >> REPORT FILE port >> REPORT FILE — 132 — . See also the nc-verilog command toggle report. This information is used to estimate power consumption.saif" -instance "tb ALU/dut" -unit base ns Description: Removes previously annotated switching activity and reads new switching information from the backward. report power -analysis medium -hier -hier level 3 >> REPORT FILE Description: Reports power consumption in the design and lists it for each module down to the third hierarchy level. Warning: Must be used if a second compile is performed on the same design.saif file into the design. Power estimation and optimization link Description: Links all used libraries to the current design. compile -incremental Description: Will only try to solve issues due to new constraints set on an already compiled design. Warning: Must be set to true if rtl2saif is to be used.

sdf". toggle start(). in this case the dut module. set toggle region(tb ALU.6."*plus*") set implementation "DW01 add/cla" cell list remove unconnected ports cell list } Description: Finds all adders in the elaborated design and set them to be implemented in carry-lookahead (CLA) style found in the DesignWare library DW01. design list) { current design de cell list = find (cell. Description: Specifies the top level in the design. Description: Will read the timing information specific to the design dut that is to be simulated.saif"). Description: Specifies at what time toggle count will start. set register type -exact -flip flop HDSDFPQ1 Description: Replaces all registers with the specific register HDSFPQ1. sdf annotate("ALU.6./sdf. "*".log"). design list = find (design. — 133 — Lund Institute of Technology . 7. #50*CYCLE_TIME $toggle_start() states that toggle count will begin after 50 clock cycles (CYCLE TIME is a constant specified in the testbench). Warning: Will not fix all names for vhdl-format. do not be surprised about all the warnings. write sdf -version 1. This file is used to perform a more accurate simulation of the netlist.0 "ALU.sdf" Description: Writes information about timing and delay between all pins in the design. Description: Will read the timing information specific to the technology library that is used. Therefore. change names is used since DC sometimes change names on nets and ports.dut). -hierarchy) design list = design list + current design foreach (de.".2 nc-verilog read lib saif("umce13h210t3 tc 120V 25C.7. dut. for example. Commands in DC report clock gating -hier -gating elements >> REPORT FILE report saif -flat -missing >> REPORT FILE report power -analysis medium -hier -hier level 3 >> REPORT FILE Description: Reports a lot of useful information and store it in REPORT FILE. change names -rules vhdl -hierarchy write -format vhdl -hierarchy -output "netlist/ALU syntesized. .vhd" Description: Change names to match with current db and save the netlist as in vhdl-format.

and ALU CT RL are the control signals. the intermediate results from synthesis are also available in a different report file for comparison purposes.Chapter 7. power constraints are specified and an incremental synthesis is performed. the time scale is nano seconds.0e-9. the new design is power estimated and the result is presented in a report file. ElectroScience — 134 — . This file is used in DC to perform a more accurate power consumption estimate. Description: Specifies at what time toggle count will stop. B A Mux M Alu X Shift S R R reg ALU OP Q reg Q Figure 7. and an estimate of the power consumption is performed based on the inputs in the stimuli file. 7. for example."tb ALU").saif". Figure 7. an ALU.saif. Then. Q is output.7 The ALU example The power optimization scripts described in the previous section includes an example design.17 shows the ALU schematics where A and B are inputs. Optimization strategies for power toggle stop(). The ALU will then be synthesized according to timing constraints.17: An Arithmetic Logical Unit (ALU). toggle report("/backward. The two enable flip-flops will be identified as candidates for clock gating and exchanged with a clock gate and basic flip-flops without enable input. What follows is a short description of the effects of running the scripts on the ALU design. see Chapter ?? for more details. Finally. #1000*CYCLE_TIME $toggle_stop() states that toggle count will begin after 1000 clock cycles (CYCLE TIME is a constant specified in the testbench). that is. the required clock frequency.1. Description: Writes all information about switching activity for each pin and net in the design to the file backward.

By requirements formalization. The system is modelled to obtain a mathematical description of how it works. This means that it will find the difficult corner case even if the designer has missed it. and how they relate to another is described.Chapter 8 Formal Verification With the growing complexity and reduced time-to-market the biggest bottleneck in digital hardware design is design verification. which gives precis requirements. 256 billion test vectors are needed where as with formal verification methods only a mathematical equivalence verification is needed. Of this reason there is a huge demand to find verification methods that are faster without reducing the coverage of the verification. The growing complexity of ASIC designs has caused the time of verification through simulation to increasing very much. It is therefore not necessary for the designer to identify all corner cases and test each one with a set of test vectors. formal verification techniques are also inherently more robust.1 Concepts within Formal Verification In Figure 8. The formal specification is validated to make sure that it correctly captures the requirements and needs. It has been shown that for a design of 20 to 25 million gates the verification time was reduced to less than a tenth of the verification time using a simulation tool. One method that has matured and established it self in real life design flows is formal verification methods. 8. To fully verify the shifter block. In addition to being much faster than simulation. Requirements and needs are more or less precisely formulated requirements and expectations of a system. An insufficiently verified ASIC design may contain hardware bugs that can cause expensive project delays when they are discovered during software or system tests on the real hardware. A formal verification tool will automatically find all cases where a design does not meet the specifications. The real system is the actual design. An example that will make the difference between verification using simulation and formal verification visible is the verification of a 32 bit shifter.1 the most important concepts in formal verification. The modelled description of the real — 135 — Lund Institute of Technology . The formal verification methods can be described as a group of mathematical methods used to prove or disprove properties of a design. one arrives at a formal specification.

By formal verification one can prove or disprove that the system model satisfies the requirements in the formal specification. 8.1 Formal Specification The formal specification gives an exact description of the system function.1. When a formal specification is generated by a tool it is important that the formal specification correctly describes the behavior of the design. some designers uses two formal verification tools from different vendors with the hope that the tools complement each other when it comes to correctness. This is done by validating the specification. To increase the certainty. Also when the system model is generated by a tool the correctness is an issue and therefore the same procedure as for the tool generated formal specification is used.Chapter 8. Another possibility is to use formal construction to create a system model which with certainty satisfies the requirements. It is also essential to make certain that the formal specification correctly describes the functionality of the system.1: The concepts in formal verification.1. This description is called the system model. system is called the system model. 8. The function of the formal specification is therefore to precisely describe the concepts used when talking about the functionality of the system. Since this is an issue of the correctness of the tool used it is hard to have any control over it. Formal Verification Requirements & Needs Requirements Formalisation Validatio n Formal Specification Formal Construction Verification System Model Modelin g Real System Figure 8.2 The System Model Since the formal specification is a mathematical description of what the system should do. a mathematical description of what the system actually does is needed. ElectroScience — 136 — .

respectively. formal design and construction is therefore preferable to applying formal verification on top of a traditional system development process. where the designer does not have time to wait for lengthy simulation runs during each design iteration.2. which is used in different stages in the design flow. 8.3 Formal Verification Formal verification means that using a mathematical proof it can be checked with certainty whether the system (as described by the system model) satisfies the requirements (as expressed by the formal specification).1 Model Checking When the formal verification tool is used as a model checker design specifications are translated into a set of properties described in a formal language. The principle of designing for testability is most certainly valid when formal methods are to be used. 8. that the system model and the formal specifications correctly capture the properties of the system and the requirements. The design is then mathematically verified for consistency with these properties. of course.1. Instead of developing the system with traditional methods and then verifying it. This kind of verification is often used to verify RTL code before synthesis. A precondition for meaningful verification is.2. When relying on testing there is no guarantee that the system does not behave incorrectly in a case which has not been tested. In general.2 Areas of Use for Formal Verification Tools There are two main kinds of usage for formal verification tools namely. 8. What the formal verification will substantially reduced is the amount of testing.1.2.8. 8. It is often difficult to verify an existing hardware system already designed and constructed using traditional methods. This kind of verification is often used to compare two — 137 — Lund Institute of Technology . and above all reduce the amount of debugging. Areas of Use for Formal Verification Tools 8. the system can be developed using a formal construction process (often referred to as synthesis of the system). model checking and equivalence checking.4 Formal System Construction The feasibility of successful verification depends to a major extent on how the system is constructed. In this case the formal specification and the system model is automatically generated by the tool. A successful formal verification does not imply that testing is superfluous since every step in the system development can not be treated formally.2 Equivalence Checking When the formal verification tool is used as an equivalence checker the tool compares two design descriptions and verifies that they have the same behavior. Formal verification differs from testing in that it guarantees that the system acts correctly in every case.

In formal verification tools as in synthesize tools the same class of algorithms is used. Often secured by using tools from different vendors. 8. The FPGA-prototypes has some limitation when it comes to imitate designs on chip. 8.Chapter 8. or it can compare two netlists to ensure that the insertion of a scan chain or other test logic has not ruined the original behavior of the design. To cover delay problems. A feature in ASIC-emulators that makes hardware simulation easier is that the emulator can be compiled in such a way so that nodes in the design is visible to an desirable extent.3.2 ASIC-emulators The ASIC-emulators is mainly used to start software development earlier but can also be used as a hardware simulation tool. timing problems and sequencing problems in a designspecial verificatin tools has to be used. Formal Verification versions of RTL code to ensure that an optimization has not changed behavior of the design.3 8.3 Limitations in Formal Verification Tools With formal verification methods only functionality checks of a design can be done.2. 8. Despite the limitations with FPGA’s.3 FPGA-Prototypes As for the ASIC-emulator the FPGA-prototypes is mainly used for software development but can also be used for hardware simulation. it is one of the most preferred methods when simulating a hardware design. Limitations in development tools for FPGA’s can also give some problems since they are not so advanced as development tools for ASIC’s. 8. Example of limitations is that FPGA’s are developed for a certain type of applications which sets the architecture of the FPGA. To reduce the risk of double error it is therefore important to use independent tools. Therefore when implementing an ASIC-design on an FPGA it is fairly often that the ASIC-architecture not agrees with the architecture of the FPGA. The penalty of these limitations for FPGA-prototypes is mainly that one has to adjust the design methodology to these limitations.3.1 Other Methods to Decrease Simulation Time Running the simulation in a cluster of several computers The simulation is divided up on several computers in a network. In an asynchronous circuit designs formal verification methods are not usable since formal verification methods can not handle non sequential circuit designs. To be able to do this the simulation tool used must have this possibility. The reliability will properly inElectroScience — 138 — .3.4 Future Reliability of formal verification tool has increased very much since the tools become commonly used for ASIC-verification. 8.

Emmanuel Ligeon. Hence. A process in constant progress is to move hardware design to a higher abstraction level.hardi.hardi.l4i.se/eda-tools/why_formal_verification. it can be expected that in new tools formal methods will be more common.com/formalpro/css.html.htm http://www.5 References http://www. 8. Formal methods has shown that they apply very well on the next level of abstraction that is the system level. click on Alcatel.htm http://www.com/eda-tools/literature/Formell_verifiering_i_praktiken_1.se/E_form.com/story/OEG20010403S0006 http://www. — 139 — Lund Institute of Technology .pdf http://www.8.edtnscandinavia. The need of this comes from the desire to reduce time-to-market. References crease even more in the future and reduce the limitations.5.mentor.

ElectroScience — 140 — .

Reducing the power supply voltage is an efficient method to reduce all types of power consumption (due to switching. While the active power consumption varies with the workload. and a prediction of doubled clock frequency every 24 months has also been added [25. However. such as a PDA or a cellular phone to last as long as possible. reducing the power supply voltage for an entire chip might not be possible due to performance constraints. the active power consumption will soon be overtaken by the leakage power. If the increased computing speed and packing density of modern technologies is to result in faster and more complex chips. If the trend shown in Figure 9. such as mobile phones. — 141 — Lund Institute of Technology .1 continues. Active power consumption due to capacitive switching is traditionally regarded as the main source of power consumption in digital design. Furthermore. For burst mode applications. For the past decades. Power management has therefore gained much attention both in industry and in universities. Components of an ASIC or a microprocessor have different requirements on power supply voltage due to different critical paths. To make the battery in a hand-held device. and leakage) [28]. limiting the on-chip power consumption is important due to increased production costs for cooling arrangements. the technology development has followed the predictions by Gordon Moore in “Moore’s law”.26].1 shows the development in active and leakage power [27]. making simple saving schemes such as clock gating less attractive. Therefore. as long as the power supply voltage is switched on. thus. new methods must be developed to handle power consumption. “Moore’s law” has been revised to a factor two improvement every 24 months. running some components on a lower voltage can save power. short circuit. and can be locally eliminated using for instance clock gating. power consumption due to leakage remains. since it causes an increasing power consumption in each technology generation [27].Chapter 9 Multiple Supply Voltages Energy and power consumption are important factors in the design of integrated hardware. The original version of “Moore’s law” was an observation that the number of components on integrated circuits doubles every 12 months [24]. a large part of the system is in idle mode most of the time. Figure 9. reaching deep submicron technologies also makes it necessary to take transistor leakage into account. energy stored in the battery must be used efficiently. However. Later. This makes the energy consumed by leakage a large contributor to the overall energy consumption and.

33. if each output bit is connected to an inverter supplied by a lower voltage. a control circuit is used for generating a square-wave with variable duty factor that controls the switches.001 1960 1970 1980 1990 2000 2010 Figure 9. Furthermore. while using more than two voltages only leads to minor additional savings. all buses and signals are run at the lower voltage. E. For the designs in [29]. and [30]. Multiple Supply Voltages 1000 100 Active Power Leakage Power (W) 10 1 0. there are examples of Buck converters reaching an efficiency as high as 90 %.2b is used in [30. The switched supply is filtered using a low-pass filter to minimize the ripple at Vout . Supply voltage scaling is associated with an overhead due to DC/DC-converters for transforming the supply voltage down to lower voltage levels.01 0. The most frequently used DC/DC-converter for power savings by lowering the supply voltage is the Buck converter [28]. 34]. ElectroScience — 142 — . an increase in the number of supply voltages from two to three results in less than 5 % additional power savings. power savings in the range of 2550 % are achieved for a number of design examples using dual supply voltages compared to a single voltage approach. In the Buck converter. as shown in Figure 9. The signal level converter shown in Figure 9. Besides the power consumption in the control circuits for the Buck converter. Moore. In [31] and [32].2a. The Buck converter converts the main supply voltage down to a lower voltage using large transistor switches. This leads to further power savings [35]. Intel Corp [27] 0.1 Source: G. In a system with more than one supply voltage.Chapter 9. there is always a power loss in the switches due to resistance of the transistors. Increasing the number of available supply voltages from one to two often results in large power savings. signal level converters are needed to ensure correct functionality when going from a low voltage region to a higher voltage region. In [29].1: Processor power. active and leakage.

In this example. The text is intended for anyone having knowledge on how to use scripts in Silicon Ensemble. Buck converter (a).1 Demonstration of Dual Supply Voltage in Silicon Ensemble This part describes a (fairly) simple method to Place & Route a design with multiple supply voltages using the tool Silicon Ensemble.def”.1 Editing Power Connections in the Netlist The netlist contains information about which net to use for power and ground. and signal level converter (b).. However.v).. The contents of the file “supplyPads2Vdd.. Scripts and netlist files for the example are found in [?].2: Overhead components associated with multiple supply voltages. The macro “Verilog2DEF.13µm technology.9. and it is connected to the net “VDP”. assignments of power nets are easily manipulated in a netlist written in Design Exchange Format (DEF). To import supply pads use the DEF-file “supplyPads2Vdd.mac” can be executed in silicon ensemble to import and convert the design “dummy top. 9. and that the designer has full control over the design procedure.. The example is implemented using the UMC 0.1. To simplify edits in the netlist. — 143 — Lund Institute of Technology . 9. Advantages of this method are that no other tool than Silicon Ensemble is necessary.def”. Changing an automatically generated netlist is generally avoided. the top view contains the two blocks “conventional” and “improved”. The hierarchy of the design shall contain a top view where each block is intended to use a separate supply voltage. It is likely that the design is initially described using a Verilog netlist (.def” is shown below. A description of the design used in this demonstration is found in [Matthias paper]. The added supply voltage pad is “pi VDP”.1.Joachims manual. which contains an extra supply voltage pad with net name “VDP”. Demonstration of Dual Supply Voltage in Silicon Ensemble Vhigh VDD Vctrl Vout out Vlow in (a) (b) Figure 9.v” to “dummy dop. Start Silicon Ensemble as described in .. it is converted to DEF format.

V3IO ( pi_* V3IO ) + USE POWER .def” in a text editor and skip down to the line “PINS 28 . COMPONENTS 9 . Skip down to the first line that contains “conventional” and change the block to use “VDP” as shown below. .pi_VDP WVVDD .”.VDP + NET VDP + + DIRECTION INPUT DIRECTION INPUT + + DIRECTION INPUT DIRECTION INPUT + DIRECTION INPUT + + USE GROUND . . .pi_CORNER_RB WCORNER END COMPONENTS .VDD ( pi_VDDC VDD ) + USE POWER . As default. .Chapter 9.”. .pi_CORNER_RT WCORNER .V0IO ( pi_* V0IO ) + USE GROUND . Skip down to the line “SPECIALNETS 5 . . . . END SPECIALNETS END DESIGN Open the netlist “dummy dop. UNITS DISTANCE MICRONS 1000 .pi_CORNER_LB WCORNER .pi_VDDI WVV3IO . SPECIALNETS 5 . Change number of pins to 29 and add a supply voltage pin “VDP” as shown below. . The text between “SPECIALNETS” and “END SPECIALNETS” contains connections for power supply wires. both blocks use power supply “VDD”. ( improved/l_trellis_instance/l_butterfly_instance_0/U70 VDD ) ( improved/l_trellis_instance/l_butterfly_instance_0/U69 VDD ) ElectroScience — 144 — . . . USE POWER . .pi_VDDC WVVDD .V3IO + NET V3IO . .VSS + NET VSS + .VSS ( * VSS ) + USE GROUND .VDP ( pi_VDP VDD ) + USE POWER .V0IO + NET V0IO . USE GROUND . + USE POWER .pi_VSSC WVVSS . Multiple Supply Voltages DESIGN demoDesign . USE POWER . This is changed to “VDP” for the block named “conventional”.pi_VSSI WVV0IO .pi_CORNER_LT WCORNER . PINS 29 .VDD + NET VDD + . . .

Change this to “dummy.1.mac”. ( conventional/l_trellis_instance/l_butterfly_instance_0/U67 VDD ) ( conventional/l_trellis_instance/l_butterfly_instance_0/U66 VDD ) ( pi_VDP VDD ) + USE POWER .1. they are placed as two separate blocks. sets the block “improved” to use the pad “pi VDDC”. which is shown below.def” contains all the edits described in this section. Demonstration of Dual Supply Voltage in Silicon Ensemble ( pi_VDDC VDD ) + USE POWER . SET VAR DRAW. This is controlled by adding “regions” and selecting one region for each block.mac”.def” if you skipped the edits described in chapter 9. REFRESH — 145 — Lund Institute of Technology . The I/O placement is set in the file “ioplace.1. Floorplan and I/O placement is performed with the script “2-floorp and IO.AT "On".REGION. which is found just before the line END SPECIALNETS should be removed to avoid warnings.ioc”.def”. This script imports the netlist “dummy top. Skip down to the last assignment of VDD-pins for the block named “conventional” and change the block to use the pad “pi VDP” as shown below. Since the two parts of the design shall use separate supply voltages.1. Regions are added by running the script “3-regions. .V0IO ( U75 V0IO ) ( U74 V0IO ) ( U73 V0IO ) ( U72 V0IO ) ( U71 V0IO ) ( U70 V0IO ) ( U69 V0IO ) ( U68 V0IO ) ( U67 V0IO ) ( U66 V0IO ) ( U65 V0IO ) ( U64 V0IO ) ( U63 V0IO ) ( U62 V0IO ) ( U61 V0IO ) ( U60 V0IO ) ( U59 V0IO ) The line .VDP ( conventional/U16 VDD ) ( conventional/bm_sig_reg_0_0_3 VDD ) ( conventional/bm_sig_reg_0_0_2 VDD ) ( conventional/bm_sig_reg_0_0_1 VDD ) ( conventional/bm_sig_reg_0_0_0 VDD ) ( conventional/bm_sig_reg_0_1_3 VDD ) The line ( pi_VDDC VDD ) + USE POWER .2 Floorplan and Placement Import the design with “1-import. ADD REGION NAME USEVDD CELL improved* BBOX (50000 -45000) (100000 25000). The file “dummy.mac”. .9. which is the default placement except for the power pads.VDP ( pi_VDP VDD ) + USE POWER . 9. ADD REGION NAME USEVDP CELL conventional* BBOX (-100000 -45000) (-50000 46000).

The command ADD REGION NAME USEVDP CELL conventional* BBOX (-100000 -45000) (-50000 46000). Figure 9.mac”.3 Power and Signal Routing In this design. Deciding size and placement of the regions is an iterative process. VDP. all components belong to one of the blocks that are placed within the two regions. Rows outside the regions can therefore be removed to give space for power routing.mac” (shown below) deletes rows outside the regions. The cells are placed within the defined regions.-50000) (50000. and VSS).3: Floorplan with regions.mac” to create three power rings (VDD. 9. Run the script “6-initialpower. Place cells by running the script “4-placecells. Multiple Supply Voltages Figure 9.50000). creates the region “USEVDP” at the specified coordinates.25000) (100000. DELETE ROW AREA SPLIT (-50000. ElectroScience — 146 — .Chapter 9.50000).1. and sets the placement of all components at hierarchical level “conventional” and below to use the region “USEVDP”.3 shows the floorplan with created regions. Running the script “5-cutrows. DELETE ROW AREA SPLIT (50000. Initial power rings are created and then edited for this design.

-54800). and “VDD” are moved to optimize power routing. — 147 — Lund Institute of Technology . Demonstration of Dual Supply Voltage in Silicon Ensemble Figure 9. -22400). and finally.mac”. 55200) (-41000.16312). and then saved as a script file. MOVE WIRE NOVIASATCROSSOVER DRC SNAP NET VDD (62969.79721) (63120. -68000). -54800).-80009) (62160.mac”. signals are routed by running the script “9-wroute. ADD WIRE SPECIAL DRC NOVIASATCROSSOVER SHORTSCHECK NET VSS LAYER MET2 WIDTH 10000 YFIRST (-41000. Power is routed by running the script “8-routepower. This is due to a grid mismatch and can (at your own risk) be ignored. MOVE WIRE NOVIASATCROSSOVER DRC SNAP NET VDD (134658. ADD WIRE SPECIAL DRC NOVIASATCROSSOVER SHORTSCHECK NET VSS LAYER MET2 WIDTH 10000 YFIRST (40000.16312) (124000. Adding and moving wires is easily done using the menu commands (Edit-Wire-Move/Add). MOVE WIRE NOVIASATCROSSOVER DRC SNAP NET VDD (62969.4 shows the initial power rings.1. -14800). and then wires for the nets “VDP”. MOVE WIRE NOVIASATCROSSOVER DRC SNAP NET VDD (-135842 25901) (27000. There might be geometry violations on signal wires close to pads. Figure 9. The script “7-modifypower.9. shown below contains two types of commands. MOVE WIRE NOVIASATCROSSOVER DRC SNAP NET VDP (120613. 34800).4: Initial power rings. 55200) (40000.23800) (-28000. two wires are added for the net “VSS”. First.mac”.

SROUTE ADDCELL MODEL HDFILL2 PREFIX FC2 NO FN SO FS SPIN VDD NET VDD SPIN VSS NET VSS AREA ( 20000 -75000 ) ( 133000 44000 ) .Chapter 9. The area “( 20000 -75000 ) ( 133000 44000 )” matches the block using supply voltage “VDD”. SROUTE ADDCELL MODEL HDFILL8 PREFIX FC8 NO FN SO FS SPIN VDD NET VDP SPIN VSS NET VSS AREA ( -130000 -75000 ) ( -42000 55000 ) . cells are added in a part of the routing area. SROUTE ADDCELL MODEL HDFILL4 PREFIX FC4 NO FN SO FS SPIN VDD NET VDD SPIN VSS NET VSS AREA ( 20000 -75000 ) ( 133000 44000 ) .mac”. and connected to the power nets.5 shows the completed design with geometry violations. SROUTE ADDCELL MODEL HDFILL8 PREFIX FC8 NO FN SO FS SPIN VDD NET VDD SPIN VSS NET VSS AREA ( 20000 -75000 ) ( 133000 44000 ) . SROUTE ADDCELL MODEL HDFILL4 PREFIX FC4 NO FN SO FS SPIN VDD NET VDP SPIN VSS NET VSS AREA ( -130000 -75000 ) ( -42000 55000 ) . Figure 9. #--Block using VDP SROUTE ADDCELL MODEL HDFILL16 PREFIX FC16 NO FN SO FS SPIN VDD NET VDP SPIN VSS NET VSS AREA ( -130000 -75000 ) ( -42000 55000 ) . SROUTE ADDCELL MODEL HDFILL1 PREFIX FC1 NO FN SO FS SPIN VDD NET VDD SPIN VSS NET VSS AREA ( 20000 -75000 ) ( 133000 44000 ) . Filler cells are also needed in empty spaces between standard cells. The script “11-fillcore. The two blocks must be handled separately since they have different names on power nets.1.4 Filler Cells To fill out empty spaces between I/O-pads. #--Block using VDD SROUTE ADDCELL MODEL HDFILL16 PREFIX FC16 NO FN SO FS SPIN VDD NET VDD SPIN VSS NET VSS AREA ( 20000 -75000 ) ( 133000 44000 ) .mac”. For each command in “11-fillcore. ElectroScience — 148 — . and the area “( -130000 -75000 ) ( -42000 55000 )” matches the block using supply voltage “VDP”. run the script “10-fillperi.mac” shown below adds filler cells to both blocks. Multiple Supply Voltages 9.

Demonstration of Dual Supply Voltage in Silicon Ensemble SROUTE ADDCELL MODEL HDFILL2 PREFIX FC2 NO FN SO FS SPIN VDD NET VDP SPIN VSS NET VSS AREA ( -130000 -75000 ) ( -42000 55000 ) . SROUTE ADDCELL MODEL HDFILL1 PREFIX FC1 NO FN SO FS SPIN VDD NET VDP SPIN VSS NET VSS AREA ( -130000 -75000 ) ( -42000 55000 ) . — 149 — Lund Institute of Technology .1.9.

Multiple Supply Voltages Figure 9.Chapter 9. ElectroScience — 150 — .5: Completed design.

vol. Prentice Hall. [2] P. Rep. NY: [12] R. 8.” IEEE Journal of Solid-State Circuits. J. B. methodology. 1210–1216. [11] K. Nov. J. Copenhagen. pp. Kluwer. May 2004. 2nd ed. Roy. Jiang and V. Vancouver. Kristensen. “Supply and threshold voltage scaling for low power CMOS. Chicago. Nilsson.Bibliography [1] J. 2. [5] ——. and J. no. Denmark. 2002. Sept.com/doc/vhdl2proc. 2000. Catthoor.pdf. P. China. 1st ed. 20th NORCHIP Conference. Illinois. Wiley.” [4] F. 7–12. vol.” Proceedings of the IEEE. Canada. 1999.” in Proceedings of the 10th Great Lakes Symposium on VLSI. “CMOS System-on-a-Chip Voltage Scaling beyond 50nm.” Synopsys. pp. 3. Bhasker. 2003. 32. B. 2003.” in Proc. “Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits. USA. Mahmoodi-Meimand. IEEE Conference on Personal. vol. Tech. [10] “Managing power in ultra deep submicron ASIC/IC design.. Custom Memory Management Methodology. “A generic transmitter for wireless OFDM systems. Kapoor. Indoor. Gordon. Olsson. — 151 — Lund Institute of Technology . Inc. Morgan Kaufman. [9] A. [3] Jiri Gaisler. 305–327. 2001. “A flexible FFT processor.. S. Designers Guide to VHDL. Gonzalez. 2002. 1999. 1998. Ashenden. Bhavnagarwala.” in Proc. and H. [6] H. “High-level design http://www. pp. D. Meindl. [8] K. and A. Feb. [7] F. Austin. Aug. Horowitz. Bejing. A VHDL Primer. New York. pp. 1997. K. and M. Parhi.” in Proc. 3rd ed. Mukhopadhyay.gaisler. 91. and Mobile Radio Communication (PIMRC). 2234–2238. no. “FPGA implementation of controller-datapath pair in custom image processor design. VLSI Digital Signal Processing Systems. IEEE Symposium on Circuits and Systems (ISCAS). wall. A.

Technology Roadmap for Semiconductors.com/research/silicon/moorespaper. S. 1997. [16] G. [15] T.” First Monday./UMCE13H210D3 1.” [22] “Switching activity interchange format (SAIF).” http://www. [23] $UMC LIB/. Borkar. and S. Deutscher. Borkar.1/lib/umce13h210t3 tc 120V 25C. ElectroScience — 152 — . pp.” in Proceedings of the 2003 International Symposium on Low Power Electronics and Design.. “The Lives and Death of Moore’s Law. Sept. “Obeying Moore’s Law Beyond 0. 122–127. Apr. P. and F.. 2003. R. 2002. ISLPED’02. Sery. vol. no. Tiwari.” in Proceedings of the 7th Asia and South Pacific Design Automation Conference and the 15th International Conference on VLSI Design. Tenth International Conference on VLSI Design. Krishnamurthy. Aug. pp. pp. pp. ISLPED’03. “Transmeta LongRun Power Management. Clark. VA. Sachdev. Karnik. R. DAC’02. Hyderabad. Hsu. Gonzalez. E. Toumi. 1965. 26–31. 2000. 78–83. Tech.itrs. Chatterjee.” Electronocs. pp. and R. S. “Modeling and Estimation of Total Leakage Current in Nano-scaled-CMOS Devices Considering the Effect of Parameter Variation. S. 2003.Bibliography [13] Transmeta Corporation. 203–206. T. and V. [19] S. “Life is CMOS: why chase the life after?” in Proceedings of the 39th Design Automation Conference. pp. 185–192. 2002. [Online]. 45–50. “International http://public. N. 2002. “Leakage Power Estimation for Deep Submicron Circuits in an ASIC Design Environment. Available: http://firstmonday. “Sub-90 nm Technologies-Challenges and Opportunities for CAD. [18] B. De.. Aug. Borkar. Mukhopadhyay and K. 2002.” in Proceedings of the 13th Annual IEEE International ASIC/SOC Conference. Rep. Ravikumar. Roy. M. [14] V. Kumar and C.lib. Ricci.” Synopsys. Borkar. [24] G. Inc. Moore. India.org/issues/issue7 11/tuomi/ [26] S.pdf [25] I. “Effectiveness and Scaling Trends of Leakage Control Techniques for Sub-130 nm CMOS Technologies. Nov.net. Demmons.” in Proceedings of the 2003 International Symposium on Low Power Electronics and Design.” in Proceedings of the 2002 IEEE International Conference on Computer Aided Design. 8. “Standby Power Management for a 0. [21] ???. 172–175. 7–12. Aug. “Dynamic power management for microprocessors: A case study. ISLPED’03. pp. S. 38. Arlington. De. Jan. [Online]. Available: ftp://download.intel. 2002. “Cramming More Components onto Integrated Circuits. Donnelly. and V. pp. S.com.” in Proc. Malik. [20] L. [17] R. ACM Press. Jan. June 10-14 2002.18 Micron.18µm Microprocessor. Nov.transmeta.” in Proceedings of the 2002 International Symposium on Low Power Electronics and Design.

Olsson. 2003. May 6-9 2001. Suzuki. vol. [28] A. 1999. Mita. Roy. T. pp. 4. “Dual Supply–Voltage Scaling for A o Reconfigurable SoCs. K. M. P. 83. Fujita.” IEEE Journal of Solid-State Circuits.” in Proceedings of IEEE International Symposium on Circuits and Systems. New Orleans. ISCAS’97. pp. [30] V. pp. T. 20–23. Nakagome. 72–77. “Scheduling and Optimal Voltage Selection for Low Power Multi-Voltage DSP Datapaths. Chandrakasan and R.” in Proceedings of the IEEE. 1995. ˚str¨m. Feb. “No Exponential is Forever: But “Forever” Can Be Delayed!” in Proceedings of the IEEE International Solid-State Circuits Conference. S. Apr. Chiba. Matsuda. vol. [32] T.” in Proceedings of the 36th Design Automation Conference. pp. Apr. — 153 — Lund Institute of Technology . Brodersen. Watanabe. and M. W. “Variable Supply-Voltage Scheme for Low-Power High-Speed CMOS Digital Design. Takeuchi. Energy-Efficient. Sakurai. 4. [35] T. vol. and P. no. no. “Minimizing Power Consumption in Digital CMOS Circuits. Aoki. 3. ISCAS 2001. Moore. ISSCC’03. “Optimal Selection of Supply Voltages and Level Conversions During Data Path Scheduling Under Resource Constraints. 34.-Y. “Synthesis of Low Power CMOS VLSI Circuits using Dual Supply Voltages. Y. C. Parhi. pp. Apr. Horowitz. 28. F. June 1997. Australia. K. Johnson and K. pp. 2. [34] T. USA: ACM Press. LA. F. [31] G. K. “Sub-1-V Swing Internal Bus Architecture for Future Low-Power ULSI’s. T. [33] M.” IEEE Journal of Solid-State Circuits. E. 498–523. 2152–2155. Maeda. 33. 72–75. no. 1993. Sano. 520–528. pp. Isoda. Kuroda. 454–462. Roy. Wei and M. June 21-25 1999. Furuyama. K. vol. Oct. Sydney. Mar.” in Proceedings of IEEE International Symposium on Circuits and Systems. and T. Nilsson. pp. 1996. Johnson and K.Bibliography [27] G. vol. K. 61–64. “A Fully Digital.” in Proceedings of the 1996 IEEE International Conference on Computer Design. no. pp. Itoh. 4. Sundararajan and K. [29] M. 1998.” IEEE Journal of Solid-State Circuits. Yamane. Adaptive Power-Supply Regulator. A. 414–419. P. DAC’99.

ElectroScience — 154 — .

std_ulogic. — 155 — Lund Institute of Technology .(’U’.’0’.Forcing Unknown ’0’. ux01. UX01. std_ulogic r r r r r r : : : : : : std_ulogic std_ulogic std_ulogic std_ulogic std_ulogic std_ulogic ) ) ) ) ) ) ) RETURN RETURN RETURN RETURN RETURN return RETURN UX01. -------------------------------------------------------------------. std_ulogic. -. -. -. UX01.’Z’) SUBTYPE UX01 IS resolved std_ulogic RANGE ’U’ TO ’1’. UX01. -. -. std_ulogic.Weak 1 ’-’ -. -.Forcing 1 ’Z’.(’U’.resolution function ------------------------------------------------------------------FUNCTION resolved ( s : std_ulogic_vector ) RETURN std_ulogic.’1’.common subtypes ------------------------------------------------------------------SUBTYPE X01 IS resolved std_ulogic RANGE ’X’ TO ’1’.Appendix A std logic 1164 PACKAGE std_logic_1164 IS -------------------------------------------------------------------.’X’.High Impedance ’W’.’X’. -------------------------------------------------------------------.’1’.(’X’. std_ulogic.’0’. -.Forcing 0 ’1’. -. -------------------------------------------------------------------. -------------------------------------------------------------------.Weak Unknown ’L’.Don’t care ). UX01.’0’. std_ulogic.Weak 0 ’H’.*** industry standard logic type *** ------------------------------------------------------------------SUBTYPE std_logic IS resolved std_ulogic. -.(’X’.unconstrained array of std_logic for use in declaring signal arrays ------------------------------------------------------------------TYPE std_logic_vector IS ARRAY ( NATURAL RANGE <>) OF std_logic. -------------------------------------------------------------------.unconstrained array of std_ulogic for use with the resolution function ------------------------------------------------------------------TYPE std_ulogic_vector IS ARRAY ( NATURAL RANGE <> ) OF std_ulogic.Uninitialized ’X’.logic state system (unresolved) ------------------------------------------------------------------TYPE std_ulogic IS ( ’U’.’1’) SUBTYPE X01Z IS resolved std_ulogic RANGE ’X’ TO ’Z’. -. -. -.’1’) SUBTYPE UX01Z IS resolved std_ulogic RANGE ’U’ TO ’Z’.overloaded logical operators ------------------------------------------------------------------FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION function FUNCTION "and" "nand" "or" "nor" "xor" "xnor" "not" ( ( ( ( ( ( ( l l l l l l l : : : : : : : std_ulogic.’0’.’Z’) -------------------------------------------------------------------. UX01.

( l. std_logic_vector. std_ulogic_vector. FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION To_StdULogic To_StdLogicVector To_StdLogicVector To_StdULogicVector To_StdULogicVector ( ( ( ( ( b b s b s : : : : : BIT BIT_VECTOR std_ulogic_vector BIT_VECTOR std_logic_vector ) ) ) ) ) RETURN RETURN RETURN RETURN RETURN std_ulogic. ( l. X01Z. UX01. function "xnor" ( l. ( l.Bibliography -------------------------------------------------------------------. FUNCTION "and" ( l. r : std_ulogic_vector ) RETURN std_ulogic_vector. r : std_logic_vector ) RETURN std_logic_vector.conversion functions ------------------------------------------------------------------FUNCTION To_bit ( s : std_ulogic. FUNCTION "or" FUNCTION "or" FUNCTION "nor" FUNCTION "nor" FUNCTION "xor" FUNCTION "xor" ( l. r : std_logic_vector ) RETURN std_logic_vector. FUNCTION To_bitvector ( s : std_ulogic_vector. FUNCTION To_bitvector ( s : std_logic_vector . r : std_ulogic_vector ) return std_ulogic_vector. std_logic_vector std_ulogic_vector std_ulogic BIT_VECTOR BIT_VECTOR BIT ) ) ) ) ) ) -------------------------------------------------------------------. std_ulogic_vector. FUNCTION "not" FUNCTION "not" ( l : std_logic_vector ) RETURN std_logic_vector. -------------------------------------------------------------------. r : std_logic_vector ) return std_logic_vector. function "xnor" ( l. std_logic_vector.edge detection ------------------------------------------------------------------FUNCTION rising_edge (SIGNAL s : std_ulogic) RETURN BOOLEAN. r : std_ulogic_vector ) RETURN std_ulogic_vector. std_logic_vector. r : std_ulogic_vector ) RETURN std_ulogic_vector. std_logic_vector. ( l. FUNCTION Is_X ( s : std_logic_vector ) RETURN BOOLEAN. std_logic_vector. FUNCTION "nand" ( l. std_ulogic_vector. X01. r : std_logic_vector ) RETURN std_logic_vector. UX01. r : std_ulogic_vector ) RETURN std_ulogic_vector.object contains an unknown ------------------------------------------------------------------FUNCTION Is_X ( s : std_ulogic_vector ) RETURN BOOLEAN. ( l : std_ulogic_vector ) RETURN std_ulogic_vector. std_ulogic_vector. -------------------------------------------------------------------. std_ulogic_vector. xmap : BIT := ’0’) RETURN BIT_VECTOR. std_ulogic_vector. std_ulogic_vector. std_logic_vector. -------------------------------------------------------------------. r : std_logic_vector ) RETURN std_logic_vector. X01Z. r : std_logic_vector ) RETURN std_logic_vector. ElectroScience — 156 — .vectorized overloaded logical operators ------------------------------------------------------------------FUNCTION "and" ( l. std_logic_vector. xmap : BIT := ’0’) RETURN BIT. X01.strength strippers and type convertors ------------------------------------------------------------------FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION To_X01 To_X01 To_X01 To_X01 To_X01 To_X01 To_X01Z To_X01Z To_X01Z To_X01Z To_X01Z To_X01Z To_UX01 To_UX01 To_UX01 To_UX01 To_UX01 To_UX01 ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( s s s b b b s s s b b b : : : : : : : : : : : : s s s b b b : : : : : : std_logic_vector std_ulogic_vector std_ulogic BIT_VECTOR BIT_VECTOR BIT std_logic_vector std_ulogic_vector std_ulogic BIT_VECTOR BIT_VECTOR BIT ) ) ) ) ) ) ) ) ) ) ) ) RETURN RETURN RETURN RETURN RETURN RETURN RETURN RETURN RETURN RETURN RETURN RETURN RETURN RETURN RETURN RETURN RETURN RETURN std_logic_vector. xmap : BIT := ’0’) RETURN BIT_VECTOR. FUNCTION "nand" ( l. r : std_ulogic_vector ) RETURN std_ulogic_vector. FUNCTION falling_edge (SIGNAL s : std_ulogic) RETURN BOOLEAN. ( l. std_ulogic_vector.

— 157 — Lund Institute of Technology .Bibliography FUNCTION Is_X ( s : std_ulogic END std_logic_1164. ) RETURN BOOLEAN.

R: UNSIGNED) return SIGNED. R: UNSIGNED) return STD_LOGIC_VECTOR. R: SIGNED) return STD_LOGIC_VECTOR. UNSIGNED. function "+"(L: UNSIGNED) return UNSIGNED. SIGNED.-. INTEGER.Copyright (c) 1990. R: SIGNED) return STD_LOGIC_VECTOR. R: INTEGER) return SIGNED. R: INTEGER) return STD_LOGIC_VECTOR. R: UNSIGNED) return UNSIGNED. R: STD_ULOGIC) return UNSIGNED.-. SIGNED. STD_ULOGIC. R: SIGNED) return STD_LOGIC_VECTOR. INTEGER. R: SIGNED) return STD_LOGIC_VECTOR. INTEGER. R: UNSIGNED) return STD_LOGIC_VECTOR. R: UNSIGNED) return STD_LOGIC_VECTOR.--. UNSIGNED. R: SIGNED) return SIGNED. UNSIGNED. SIGNED. STD_ULOGIC. R: UNSIGNED) return STD_LOGIC_VECTOR. INTEGER. R: INTEGER) return STD_LOGIC_VECTOR. R: INTEGER) return STD_LOGIC_VECTOR. R: STD_ULOGIC) return STD_LOGIC_VECTOR. UNSIGNED. UNSIGNED. R: UNSIGNED) return UNSIGNED.and that any derivative work contains this copyright notice. -. SIGNED.Bibliography std logic arith ---------------------------------------------------------------------------. R: INTEGER) return UNSIGNED. R: UNSIGNED) return STD_LOGIC_VECTOR.-. SIGNED. R: STD_ULOGIC) return SIGNED. Inc. SIGNED.std_logic_1164. R: SIGNED) return SIGNED.-. R: SIGNED) return SIGNED. UNSIGNED.---------------------------------------------------------------------------library IEEE. R: UNSIGNED) return STD_LOGIC_VECTOR. UNSIGNED. STD_ULOGIC. subtype SMALL_INT is INTEGER range 0 to 1. R: SIGNED) return SIGNED. STD_ULOGIC. R: UNSIGNED) return UNSIGNED. R: SIGNED) return SIGNED. R: UNSIGNED) return UNSIGNED. R: SIGNED) return STD_LOGIC_VECTOR.This source file may be used and distributed without restriction -. R: SIGNED) return SIGNED. UNSIGNED. UNSIGNED. R: SIGNED) return STD_LOGIC_VECTOR. INTEGER. SIGNED. R: UNSIGNED) return UNSIGNED. type SIGNED is array (NATURAL range <>) of STD_LOGIC.provided that this copyright statement is not removed from the file -. SIGNED. UNSIGNED. INTEGER. R: SIGNED) return SIGNED. use IEEE.1992 by Synopsys. function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: UNSIGNED. R: STD_ULOGIC) return UNSIGNED. SIGNED. SIGNED.1991. UNSIGNED. R: STD_ULOGIC) return STD_LOGIC_VECTOR. R: INTEGER) return STD_LOGIC_VECTOR. R: SIGNED) return STD_LOGIC_VECTOR. R: SIGNED) return STD_LOGIC_VECTOR. STD_ULOGIC. R: STD_ULOGIC) return SIGNED. STD_ULOGIC. SIGNED.all. SIGNED. INTEGER. STD_ULOGIC. SIGNED. UNSIGNED. UNSIGNED. R: INTEGER) return UNSIGNED. ElectroScience — 158 — . SIGNED. UNSIGNED. package std_logic_arith is type UNSIGNED is array (NATURAL range <>) of STD_LOGIC. R: STD_ULOGIC) return STD_LOGIC_VECTOR. R: UNSIGNED) return UNSIGNED. R: UNSIGNED) return STD_LOGIC_VECTOR. STD_ULOGIC. R: INTEGER) return SIGNED. -. SIGNED. R: UNSIGNED) return SIGNED. R: STD_ULOGIC) return STD_LOGIC_VECTOR. SIGNED. INTEGER. All rights reserved. UNSIGNED. R: UNSIGNED) return STD_LOGIC_VECTOR. R: SIGNED) return SIGNED.

INTEGER. R: UNSIGNED) return BOOLEAN. UNSIGNED. INTEGER. R: UNSIGNED) return STD_LOGIC_VECTOR. R: SIGNED) return BOOLEAN. R: SIGNED) return BOOLEAN. SIGNED. R: UNSIGNED) return SIGNED. SIGNED. function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function function "+"(L: UNSIGNED) return STD_LOGIC_VECTOR. R: SIGNED) return SIGNED. — 159 — Lund Institute of Technology . R: SIGNED) return BOOLEAN. SIGNED. R: INTEGER) return BOOLEAN. R: UNSIGNED) return UNSIGNED. R: SIGNED) return STD_LOGIC_VECTOR. INTEGER. UNSIGNED. INTEGER. SIGNED. R: UNSIGNED) return BOOLEAN. SIGNED. SIGNED. UNSIGNED. UNSIGNED. R: SIGNED) return BOOLEAN. R: SIGNED) return SIGNED. SIGNED. SIGNED. SIGNED. UNSIGNED. R: SIGNED) return BOOLEAN. UNSIGNED. UNSIGNED. R: INTEGER) return BOOLEAN. SIGNED. R: UNSIGNED) return BOOLEAN. R: SIGNED) return BOOLEAN. R: SIGNED) return BOOLEAN. R: SIGNED) return BOOLEAN. R: UNSIGNED) return BOOLEAN. UNSIGNED. R: SIGNED) return BOOLEAN. "+"(L: SIGNED) return STD_LOGIC_VECTOR. "*"(L: "*"(L: "*"(L: "*"(L: "*"(L: "*"(L: "*"(L: "*"(L: "<"(L: "<"(L: "<"(L: "<"(L: "<"(L: "<"(L: "<"(L: "<"(L: UNSIGNED. INTEGER. INTEGER.Bibliography function "+"(L: SIGNED) return SIGNED. R: UNSIGNED) return BOOLEAN. R: INTEGER) return BOOLEAN. ">="(L: ">="(L: ">="(L: ">="(L: ">="(L: ">="(L: ">="(L: ">="(L: "="(L: "="(L: "="(L: "="(L: "="(L: "="(L: "="(L: "="(L: UNSIGNED. UNSIGNED. R: INTEGER) return BOOLEAN. R: UNSIGNED) return BOOLEAN. "<="(L: "<="(L: "<="(L: "<="(L: "<="(L: "<="(L: "<="(L: "<="(L: ">"(L: ">"(L: ">"(L: ">"(L: ">"(L: ">"(L: ">"(L: ">"(L: UNSIGNED. R: INTEGER) return BOOLEAN. R: INTEGER) return BOOLEAN. UNSIGNED. R: INTEGER) return BOOLEAN. COUNT: UNSIGNED) return UNSIGNED. UNSIGNED. R: UNSIGNED) return BOOLEAN. R: SIGNED) return BOOLEAN. R: SIGNED) return BOOLEAN. SIGNED. R: SIGNED) return BOOLEAN. UNSIGNED. R: UNSIGNED) return BOOLEAN. SIGNED. UNSIGNED. R: INTEGER) return BOOLEAN. "/="(L: "/="(L: "/="(L: "/="(L: "/="(L: "/="(L: "/="(L: "/="(L: function SHL(ARG: UNSIGNED. R: UNSIGNED) return BOOLEAN. R: SIGNED) return BOOLEAN. function "ABS"(L: SIGNED) return SIGNED. R: SIGNED) return BOOLEAN. R: INTEGER) return BOOLEAN. R: SIGNED) return BOOLEAN. UNSIGNED. UNSIGNED. R: SIGNED) return BOOLEAN. R: UNSIGNED) return BOOLEAN. R: UNSIGNED) return BOOLEAN. SIGNED. SIGNED. R: UNSIGNED) return BOOLEAN. INTEGER. SIGNED. SIGNED. function "-"(L: SIGNED) return SIGNED. UNSIGNED. R: SIGNED) return STD_LOGIC_VECTOR. R: SIGNED) return BOOLEAN. UNSIGNED. R: UNSIGNED) return BOOLEAN. UNSIGNED. INTEGER. R: UNSIGNED) return BOOLEAN. INTEGER. SIGNED. INTEGER. R: UNSIGNED) return BOOLEAN. R: INTEGER) return BOOLEAN. R: UNSIGNED) return BOOLEAN. UNSIGNED. R: UNSIGNED) return STD_LOGIC_VECTOR. R: UNSIGNED) return BOOLEAN. INTEGER. SIGNED. SIGNED. SIGNED. SIGNED. R: SIGNED) return BOOLEAN. INTEGER. R: UNSIGNED) return BOOLEAN. R: INTEGER) return BOOLEAN. "-"(L: SIGNED) return STD_LOGIC_VECTOR. "ABS"(L: SIGNED) return STD_LOGIC_VECTOR. SIGNED. R: INTEGER) return BOOLEAN.

CONV_UNSIGNED(ARG: CONV_UNSIGNED(ARG: CONV_UNSIGNED(ARG: CONV_UNSIGNED(ARG: CONV_SIGNED(ARG: CONV_SIGNED(ARG: CONV_SIGNED(ARG: CONV_SIGNED(ARG: INTEGER. SIZE: INTEGER) return SIGNED.SIZE < 0 is same as SIZE = 0 -. STD_ULOGIC. UNSIGNED) return INTEGER. function SHR(ARG: UNSIGNED. INTEGER.sign extend STD_LOGIC_VECTOR (ARG) to SIZE. function function function function function function function function function function function function function function function function CONV_INTEGER(ARG: CONV_INTEGER(ARG: CONV_INTEGER(ARG: CONV_INTEGER(ARG: INTEGER) return INTEGER.return STD_LOGIC_VECTOR(SIZE-1 downto 0) function SXT(ARG: STD_LOGIC_VECTOR. UNSIGNED. SIZE: INTEGER) return SIGNED. UNSIGNED. SIZE: INTEGER) return STD_LOGIC_VECTOR.returns STD_LOGIC_VECTOR(SIZE-1 downto 0) function EXT(ARG: STD_LOGIC_VECTOR. end Std_logic_arith. SIZE: INTEGER) return STD_LOGIC_VECTOR. -. SIZE: INTEGER) return STD_LOGIC_VECTOR. INTEGER. SIZE: INTEGER) return STD_LOGIC_VECTOR. SIGNED. COUNT: UNSIGNED) return UNSIGNED. COUNT: UNSIGNED) return SIGNED. STD_ULOGIC. SIZE: INTEGER) return STD_LOGIC_VECTOR. SIZE: INTEGER) return UNSIGNED. COUNT: UNSIGNED) return SIGNED.zero extend STD_LOGIC_VECTOR (ARG) to SIZE. SIGNED. UNSIGNED. STD_ULOGIC. ElectroScience — 160 — .Bibliography function SHL(ARG: SIGNED. SIZE: INTEGER) return UNSIGNED. -.SIZE < 0 is same as SIZE = 0 -. SIZE: INTEGER) return UNSIGNED. -. CONV_STD_LOGIC_VECTOR(ARG: CONV_STD_LOGIC_VECTOR(ARG: CONV_STD_LOGIC_VECTOR(ARG: CONV_STD_LOGIC_VECTOR(ARG: -. SIZE: INTEGER) return SIGNED. SIGNED) return INTEGER. SIZE: INTEGER) return STD_LOGIC_VECTOR. SIGNED. SIZE: INTEGER) return UNSIGNED. STD_ULOGIC) return SMALL_INT. SIZE: INTEGER) return SIGNED. function SHR(ARG: SIGNED.

function "/="(L: INTEGER. STD_LOGIC_VECTOR. function ">="(L: STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR.-All rights reserved. -. Inc. function "+"(L: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return BOOLEAN. 1992 by Synopsys. function "<"(L: STD_LOGIC_VECTOR. INTEGER. R: STD_LOGIC_VECTOR) return BOOLEAN.remove this since it is already in std_logic_arith -function CONV_STD_LOGIC_VECTOR(ARG: INTEGER. R: STD_LOGIC) return STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return BOOLEAN. function ">="(L: INTEGER.---------------------------------------------------------------------------library IEEE. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR.all. function "="(L: INTEGER. package STD_LOGIC_UNSIGNED is function function function function function function function function function function "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: STD_LOGIC_VECTOR. function "<"(L: STD_LOGIC_VECTOR. use IEEE.std_logic_arith. SIZE: INTEGER) return STD_LOGIC_VECTOR. function "="(L: STD_LOGIC_VECTOR.-. STD_LOGIC. R: INTEGER) return BOOLEAN. function "="(L: STD_LOGIC_VECTOR. R: INTEGER) return BOOLEAN.and that any derivative work contains this copyright notice. R: STD_LOGIC_VECTOR) return BOOLEAN.COUNT: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. function "<="(L: STD_LOGIC_VECTOR. R: INTEGER) return STD_LOGIC_VECTOR. use IEEE. R: STD_LOGIC_VECTOR) return BOOLEAN. -. R: INTEGER) return BOOLEAN. R: INTEGER) return STD_LOGIC_VECTOR. STD_LOGIC_VECTOR. STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return BOOLEAN. R: INTEGER) return BOOLEAN. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR.all. R: STD_LOGIC_VECTOR) return BOOLEAN.Copyright (c) 1990. -. function SHR(ARG:STD_LOGIC_VECTOR. function "/="(L: STD_LOGIC_VECTOR. -.-.Bibliography std logic unsigned ---------------------------------------------------------------------------. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. function "<="(L: INTEGER. STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return BOOLEAN. R: STD_LOGIC_VECTOR) return BOOLEAN. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. function ">"(L: STD_LOGIC_VECTOR. R: INTEGER) return BOOLEAN. R: STD_LOGIC_VECTOR) return BOOLEAN.COUNT: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. STD_LOGIC. function "<"(L: INTEGER. function "/="(L: STD_LOGIC_VECTOR. function CONV_INTEGER(ARG: STD_LOGIC_VECTOR) return INTEGER. R: STD_LOGIC_VECTOR) return BOOLEAN.provided that this copyright statement is not removed from the file -. — 161 — Lund Institute of Technology . R: STD_LOGIC_VECTOR) return BOOLEAN. function "*"(L: STD_LOGIC_VECTOR. INTEGER. function SHL(ARG:STD_LOGIC_VECTOR. STD_LOGIC_VECTOR. end STD_LOGIC_UNSIGNED. R: STD_LOGIC) return STD_LOGIC_VECTOR. function ">"(L: INTEGER. function ">"(L: STD_LOGIC_VECTOR. function ">="(L: STD_LOGIC_VECTOR.--.This source file may be used and distributed without restriction -. 1991.std_logic_1164.-. function "<="(L: STD_LOGIC_VECTOR. R: INTEGER) return BOOLEAN.-.

and that any derivative work contains this copyright notice. STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return BOOLEAN. STD_LOGIC_VECTOR. "/="(L: STD_LOGIC_VECTOR. 1992 by Synopsys. function "<="(L: STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return BOOLEAN.---------------------------------------------------------------------------library IEEE.COUNT: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. 1991. R: INTEGER) return BOOLEAN. R: STD_LOGIC_VECTOR) return BOOLEAN. R: STD_LOGIC_VECTOR) return BOOLEAN. R: INTEGER) return STD_LOGIC_VECTOR. R: INTEGER) return BOOLEAN. function "<"(L: STD_LOGIC_VECTOR. function ">"(L: STD_LOGIC_VECTOR.COUNT: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR.all. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR.Copyright (c) 1990.-.std_logic_1164. function "+"(L: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. function "<"(L: INTEGER. SHL(ARG:STD_LOGIC_VECTOR. SIZE: INTEGER) return STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return BOOLEAN. package STD_LOGIC_SIGNED is function function function function function function function function function function "+"(L: "+"(L: "+"(L: "+"(L: "+"(L: "-"(L: "-"(L: "-"(L: "-"(L: "-"(L: STD_LOGIC_VECTOR. SHR(ARG:STD_LOGIC_VECTOR.-. R: INTEGER) return STD_LOGIC_VECTOR. -. use IEEE.-. -. R: STD_LOGIC) return STD_LOGIC_VECTOR. end STD_LOGIC_SIGNED. R: STD_LOGIC_VECTOR) return BOOLEAN. function ">="(L: STD_LOGIC_VECTOR. function "-"(L: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. function CONV_INTEGER(ARG: STD_LOGIC_VECTOR) return INTEGER. function "="(L: STD_LOGIC_VECTOR. INTEGER. use IEEE. function function function function function "/="(L: STD_LOGIC_VECTOR.all. R: STD_LOGIC_VECTOR) return BOOLEAN. function "<="(L: STD_LOGIC_VECTOR. "/="(L: INTEGER. R: STD_LOGIC_VECTOR) return BOOLEAN. function ">="(L: STD_LOGIC_VECTOR. STD_LOGIC_VECTOR. INTEGER. function ">"(L: INTEGER. R: INTEGER) return BOOLEAN. function "<="(L: INTEGER. function ">="(L: INTEGER. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. function "="(L: STD_LOGIC_VECTOR. ElectroScience — 162 — . R: INTEGER) return BOOLEAN.remove this since it is already in std_logic_arith -function CONV_STD_LOGIC_VECTOR(ARG: INTEGER.This source file may be used and distributed without restriction -. Inc. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR.Bibliography std logic signed ---------------------------------------------------------------------------.-All rights reserved.provided that this copyright statement is not removed from the file -. R: STD_LOGIC_VECTOR) return BOOLEAN. R: STD_LOGIC) return STD_LOGIC_VECTOR. R: INTEGER) return BOOLEAN. function "*"(L: STD_LOGIC_VECTOR. STD_LOGIC_VECTOR. -.std_logic_arith. STD_LOGIC.--. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. function "<"(L: STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. function ">"(L: STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return BOOLEAN. R: STD_LOGIC_VECTOR) return BOOLEAN. -. R: STD_LOGIC_VECTOR) return BOOLEAN. STD_LOGIC.-. R: INTEGER) return BOOLEAN. function "ABS"(L: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR. function "="(L: INTEGER. STD_LOGIC_VECTOR. R: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR.

Sign up to vote on this title
UsefulNot useful