
Bar Ilan University

School of Engineering
VLSI Lab

Data Driven
Clock Gating
Academic Advisor: Prof. Shmuel Wimer
Instructor: Mr. Moshe Doron
Industry correspondent: Mr. Roey Mioni

Dov Gropper
Dvir Shasha
Final Fourth Year Project
Computer Engineering

Table of Contents
Main Project Goals
Motivation
Theory
Design Flow
    Design:
    Simulation environment:
    Iterative Perfect Matching Algorithm (IPM):
    Clock gating Implementation:
Hardware and Design Components
Problems and Solutions
Direct Memory Access Controller
    Behavior
    System level
    The block diagram of the DMA controller's state machine:
    The Design:
    Top design, with Verification diagram:
Results
    The SpyGlass Results:
    Result review:
Conclusions
References and Sources
Appendices
    DMAC Spec:

Main Project Goals


Data Driven Clock Gating is a technique developed in the research of Professor Shmuel Wimer. Its main
purpose is to reduce the power consumption of electronic circuits.
Our project implements the technique described in Professor Wimer's research on a
design at the register transfer level (RTL).
Our project consisted of the following stages:
Implementing Data-Driven Clock Gating on a given design.
Creating a design flow for this implementation.
Obtaining an estimate of the reduction in power consumption.

Motivation
The increasing demand for low-power mobile computing and consumer electronics
products has refocused VLSI design in the last two decades on lowering power and
increasing energy efficiency. Power reduction is treated at all design levels of VLSI chips,
from the architecture through the block and logic levels, down to gate-level circuits and
physical implementation. One of the major dynamic power consumers is the system clock
signal, typically responsible for up to 50% of the total dynamic power consumption. Clock
network design is a delicate procedure, and is therefore done in a very conservative
manner under worst-case assumptions. It incorporates many diverse aspects such as the
selection of sequential elements, control of the clock skew, and decisions on the topology
and physical implementation of the clock distribution network.

Theory
Clock gating
Several techniques to reduce the dynamic power have been developed, of which clock
gating is predominant. Ordinarily, when a logic unit is clocked, its underlying sequential
elements receive the clock signal regardless of whether or not they will toggle in the next
cycle.
Clock enabling signals are usually introduced by designers during the system and clock
design phases, where the inter-dependencies of the various functions are well
understood. In contrast, it is very difficult to define such signals at the gate level,
especially in control logic, since the inter-dependencies among the states of the various
flip-flops depend on automatically synthesized logic. There is a big gap between block
disabling that is driven from the HDL definitions, and what can be achieved with
knowledge of the flip-flops' activities and how they are correlated with each other.
The research presents an approach to maximize clock disabling at the gate level, where
the clock signal driving a flip-flop is disabled (gated) when the flip-flop's state is not subject
to a change in the next clock cycle.
Clock gating does not come for free. Extra logic and interconnects are required to
generate the clock enabling signals, and the resulting area and power overhead must be
considered. In the extreme case, each clock input of a flip-flop can be disabled
individually, yielding maximum clock separation. This, however, results in high overhead.
Thus, the clock disabling circuit is shared by a group of several flip-flops in an attempt to
reduce the overhead.

On the other hand, such grouping may lower the disabling effectiveness, since the clock
will be disabled only when the inputs to all the flip-flops in a group do not change. It is
therefore beneficial to group flip-flops whose switching activities are highly correlated, and
to derive a joint enabling signal for them.
This requires gathering statistical information about the flip-flops' activity using simulations and
statistical analysis.
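The grouping trade-off can be illustrated with a small behavioral model. The sketch below is an illustration only, not the project's RTL: it assumes the enable of a group is the OR, over all its flip-flops, of the comparison between each input D and the current output Q, and it counts the clock pulses delivered with and without gating. The function name and the input format are hypothetical.

```python
def simulate_group(d_inputs, initial_q=None):
    """Simulate one group of flip-flops sharing a single clock gater.

    d_inputs: list of cycles; each cycle is a tuple with one D bit per flip-flop.
    Returns (final_q, pulses_without_gating, pulses_with_gating).
    """
    n = len(d_inputs[0])
    q = list(initial_q) if initial_q is not None else [0] * n
    ungated = 0
    gated = 0
    for d in d_inputs:
        ungated += n  # without gating, every flip-flop is clocked every cycle
        # Assumed enable logic: D differs from Q for at least one flip-flop.
        enable = any(di != qi for di, qi in zip(d, q))
        if enable:
            gated += n  # the shared gater passes the pulse to the whole group
            q = list(d)
    return q, ungated, gated


cycles = [(0, 0), (1, 0), (1, 0), (1, 1), (1, 1)]
final_q, ungated, gated = simulate_group(cycles)
print(final_q, ungated, gated)
```

In this example only two of the five cycles change any state, so the gated group receives 4 clock pulses instead of 10. Note that a pulse is delivered to both flip-flops even when only one of them changes, which is the loss of effectiveness that grouping introduces.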
Another issue that influences the effectiveness of the suggested technique is the fan-out
of the gater. The theory presents a formula for calculating the optimal fan-out of the gater,
referred to as k:

where q is the probability that a flip-flop's input is stable, and
CFF, Clatch, Cw are the capacitances of the extra flip-flops, latches, and wires.
In our project, we approached this issue by implementing different fan-outs and
estimating the effect of each one on the power consumption reduction.
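The reason an optimal fan-out exists at all can be shown with a deliberately simplified toy model. This is not the formula from the referenced theory: it assumes that the flip-flops in a group are statistically independent (so the group is idle with probability q^k), that an idle cycle saves a clock load c_ff per flip-flop, and that a single gater cost c_gater is shared by the k flip-flops. All names and default values are hypothetical.

```python
def net_saving_per_ff(k, q, c_ff=1.0, c_gater=1.0):
    """Toy estimate of the net saving per flip-flop for gater fan-out k."""
    # Probability that all k inputs are stable, assuming independence.
    p_group_idle = q ** k
    # Saved clock load when the group is idle, minus the shared gater cost.
    return p_group_idle * c_ff - c_gater / k


def best_fan_out(q, candidates=(2, 4, 8, 16, 32, 64, 128), **caps):
    """Pick the candidate fan-out with the largest toy net saving."""
    return max(candidates, key=lambda k: net_saving_per_ff(k, q, **caps))


for q in (0.99, 0.999):
    print(q, best_fan_out(q))
```

Under these assumptions a small k pays too much gater overhead per flip-flop, while a large k rarely finds the whole group idle. The more stable the inputs are (larger q), the larger the fan-out that the toy model prefers.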

The graph above shows the normalized net power savings per flip-flop obtained by
adaptive gating at the first level of the clock tree, according to the equation above. The savings are
measured relative to the non-gated case, and the optimal fan-out is marked for each toggling probability.
Using the statistical information gathered and the optimal fan-out, we could form groups
of matching flip-flops for the clock gating.

Design Flow

Design:
The design flow begins with a design in RTL. It is important to begin with a design that
has been proven to work properly. The design must not include any IP (intellectual
property) blocks or RTL sources that are not visible to the user and therefore cannot be edited.
At this point, the design flow supports implementation for a single clock domain only.
Moreover, the sequential and combinational logic must be separated in the RTL in order
for the scripts to run properly.

Simulation environment:
In this stage, simulations of the RTL are performed and statistical information is gathered
for analysis. The simulation must exercise a typical use of the design, so that the
statistical information is realistic. Currently, one simulation per design is supported.
The simulation runs with Cadence's SimVision.
The simulation environment steps:
Add tracing code to the design. This is done by running the program ftrc.exe.
To run this program, the user must first modify the file inputs.rti. In the file, the user
sets the following attributes:
The name of the design files list, including the extension (*.vc).
Whether the output design is written as one file or as
multiple files.
Add tracing code to the test-bench manually. The code must be added before the
DUT instantiation.
At this point, the simulation can run.
The simulation produces two output files:
Activities.rpt - a report of the active flip-flops at each point in time, in
milliseconds.
FF_lists.rpt - a report listing the flip-flops in the design.

Iterative Perfect Matching Algorithm (IPM):


This algorithm runs iteratively to select flip-flops whose toggling is highly correlated.
The following steps are to be taken:
Run PrepareIPM.exe. This program takes the activity.rpt and FF_list.rpt files, and
creates separate files for each domain.
Modify the batch file IPM_Run.bat in the following manner:
The first parameter is the executable file name, and should not be changed.
The second is the number of iterations the algorithm will run. The number of
iterations determines the fan-out of the gater: 2^x = k (where x is the
number of iterations, and k is the gater fan-out).
The third is currently not used. It is the maximal physical distance between
flip-flops that will be gated together.
The fourth and fifth parameters are the names of the output files from the
previous stage.
Batch file example:
"Iterative Perfect Matching.exe" 7 50 "Activity_0.rpt" "FFList_0.rpt" > log.txt
Run IPM_Run.bat.
The IPM output consists of one file for each iteration and domain, containing the
flip-flop groups.
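The idea of iterative pairing can be sketched as follows. This is a minimal illustration and not the code of "Iterative Perfect Matching.exe": it assumes each flip-flop's activity is given as a bitmask with one bit per simulated cycle, it takes the cost of merging two groups to be the number of redundant clock pulses the merge causes, and it finds the minimum-cost pairing by exhaustive search, which is only feasible for a handful of flip-flops. All function names are hypothetical. After x iterations each group holds 2^x flip-flops, matching the relation 2^x = k above.

```python
def merge_cost(group_a, group_b):
    """Redundant clock pulses caused by gating two groups together.

    Each group is (names, mask); bit t of mask is 1 if any flip-flop
    of the group toggles in cycle t.
    """
    names_a, mask_a = group_a
    names_b, mask_b = group_b
    joint = mask_a | mask_b
    wasted_a = bin(joint & ~mask_a).count("1") * len(names_a)
    wasted_b = bin(joint & ~mask_b).count("1") * len(names_b)
    return wasted_a + wasted_b


def min_cost_pairing(groups):
    """Exact minimum-cost perfect matching by exhaustive search."""
    if len(groups) % 2:
        raise ValueError("number of groups must be even")
    if not groups:
        return 0, []
    first, rest = groups[0], groups[1:]
    best_cost, best_pairs = None, None
    for i, partner in enumerate(rest):
        cost, pairs = min_cost_pairing(rest[:i] + rest[i + 1:])
        cost += merge_cost(first, partner)
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, [(first, partner)] + pairs
    return best_cost, best_pairs


def iterative_matching(activity, iterations):
    """Pair groups repeatedly; each iteration doubles the group size."""
    groups = [((name,), mask) for name, mask in sorted(activity.items())]
    for _ in range(iterations):
        _, pairs = min_cost_pairing(groups)
        groups = [(a[0] + b[0], a[1] | b[1]) for a, b in pairs]
    return [names for names, _ in groups]


activity = {"a": 0b1100, "b": 0b0011, "c": 0b1110, "d": 0b0111}
print(iterative_matching(activity, 1))
```

In the example, flip-flops a and c toggle in almost the same cycles, as do b and d, so one iteration (k = 2) pairs them accordingly. A production implementation would replace the exhaustive search with a polynomial-time matching algorithm.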

Clock gating Implementation:


This stage uses an executable file, rcg.exe, that creates a clock gate for each group of FFs
received from the IPM, thus completing the process.
These are the necessary steps:
Modify the inputs.rcg file with the following:
The name of the design files list (*.vc).
The name of the folder containing the FF-lists for grouping.
Whether the output design is written as one file or as multiple files.
Run rcg.exe.
Design with Data Driven Clock Gating:
This design has the same logical behavior as the initial design. However, it contains the
clock gating components added by the implementation process.
It is important to note that the process begins and ends with a design in RTL.

Hardware and Design Components
Direct Memory Access Controller in RTL design:
We implemented our design flow on a Direct Memory Access Controller in RTL.
The design does not include any IP (intellectual property) blocks and was developed by
Mr. Moshe Doron in Bar-Ilan's VLSI lab.
The design has a single clock domain and, after editing, its sequential and combinational
logic are separated.
Altera's FPGA board DE-115:
This is the first board used for implementation of the design (before and after) in order to
measure the power usage.
Altera's Quartus II:
We used Altera's EDA tool, Quartus II, which is design software for programmable logic devices. The
features we used include:
Synthesis and implementation of Verilog for hardware description.
Programming the gate-level design onto the DE-115 FPGA board.
Xilinx's FPGA board ML605:
This is the second board used for implementation of the design (before and after) in order
to measure the power usage. We switched to this board because it has
a 0.005 ohm series resistor, which is needed in order to measure directly the current drawn by
the FPGA chip.
Xilinx's ISE:
ISE is a software tool produced by Xilinx that we used for synthesis and analysis of our
HDL design. In addition, after we obtained a gate-level design from ISE's synthesis, we
programmed the design onto the ML605 FPGA board, similarly to our use of Altera's
Quartus II.


Cadence's SimVision:
SimVision is the waveform viewer in the Cadence EDA suite. We mainly used it for
behavioral design verification before FPGA implementation. Due to the RTL changes
made, it was necessary to verify our design before applying our Data
Driven Clock Gating design flow. After the implementation, we used it to make sure the design still
had exactly the same behavior. We used a test bench with two OK signals that indicated
proper behavior of the design. It was crucial to use the same test bench on both
designs, so that we could be certain that the two designs really had the same
behavior.


Problems and Solutions


The evil design problem:
We started our project with the Animation Graphic Engine (AGE) RTL design.
The design used Altera's tools (IP) to implement some elements.
This code could not be edited or viewed. Our attempt to design these components ourselves
failed due to a shortage of memory resources on the FPGA board. In other words, Quartus
synthesized the design with our components and the output was too large to be
implemented on any of the FPGA boards in the VLSI lab's possession.
The solution we chose was to switch to an alternative RTL design, the DMA Controller, which
uses no IP.
The power to measure power problem:
The design was implemented on Altera's FPGA board, but measuring the power consumption savings
failed because the ammeter measured the consumption of the whole board rather
than that of the FPGA, i.e. the measurement lacked resolution.
To try to solve this problem, we implemented the design on Xilinx's ML605
FPGA board, which has a built-in 0.005 ohm series resistor on the FPGA power pads, enabling
power measurement of the FPGA alone, using the ISE software.
Unfortunately, this still did not solve our problem. Due to the high static power consumption of
the FPGA chip itself, we were not able to measure the actual power consumption of the
design without it being masked by the static power. To try to make the
design stand out against the static power, we instantiated it 10 times, and then 100 times. The power
consumption measured for one instance compared to 100 was extremely similar, showing
that our ability to measure power on the FPGA board is limited. In addition, we
contacted field application engineers from Xilinx and they agreed that there is no solution
to our problem short of ASIC implementation. This was not an option available to us.
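The arithmetic behind a series-resistor measurement helps explain the difficulty. The sketch below applies Ohm's law to the 0.005 ohm resistor mentioned above. The 1.0 V supply voltage and the 1 mW design power are assumed values chosen only for illustration, not measurements from the project, and the function names are hypothetical.

```python
R_SHUNT_OHM = 0.005  # series resistor value given in the text


def fpga_power_watts(v_shunt, v_supply, r_shunt=R_SHUNT_OHM):
    """Power drawn through the resistor: I = V_shunt / R, P = I * V_supply."""
    current = v_shunt / r_shunt
    return current * v_supply


def shunt_voltage_change(delta_power_w, v_supply, r_shunt=R_SHUNT_OHM):
    """Change in the resistor voltage caused by a change in power."""
    return delta_power_w / v_supply * r_shunt


# Assumed example: a 1 mW change in design power on a 1.0 V supply rail.
print(shunt_voltage_change(0.001, 1.0))
```

Under these assumptions, a 1 mW change in the design's power moves the voltage across the resistor by only about 5 microvolts, which is easily lost against the much larger drop caused by the static power of the chip.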
The Tri-state Area Problem
Our first flow ran on a design at the gate level, and not in RTL. The tracing elements of the
flip-flops were created as special flip-flops that had a tracer inside them. This meant that it
was necessary to compile the RTL before beginning the process. For this, we used the
RC (RTL Compiler). The RC was instructed to use the special flip-flops when
synthesizing the RTL, so that a simulation that gathers toggling information could later be
run.
The problem was that the RC synthesized the RTL using tri-state buffers. At the FPGA
synthesis stage, Altera Quartus gave a compilation error since it cannot synthesize
tri-state buffers.
The problem was solved by moving the design flow to RTL, allowing Altera
Quartus to synthesize the design without these limitations.
In addition, this change shortened the runtime of the entire design flow.
The Logical separation Problem
In the design flow, the clock gating implementation step requires that the design's
sequential and combinational logic be separated in the RTL.
Because of that, we separated the DMA controller's sequential and combinational logic.
It is not ideal to change the original design in order to perform the flow.
However, the tools are still under development and will be more versatile in the future.
An explanation of these types of logic:
Combinational logic - circuits that implement Boolean functions. The outputs of these circuits are
functions of the current inputs only. An example of combinational logic:

Sequential logic - like a combinational logic circuit, a sequential logic circuit has inputs
and outputs. However, the output depends on the stored state (for example, the state of an FSM) as well as on the inputs.
Furthermore, it is driven by a clock.
An example of sequential logic:
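Both kinds of logic can be sketched in software. The two examples below are illustrative stand-ins chosen for this explanation, not parts of the DMA controller: a majority function as combinational logic, and a 2-bit counter as sequential logic whose state changes only when it is clocked.

```python
def majority(a, b, c):
    """Combinational: the output depends only on the current inputs."""
    return (a & b) | (a & c) | (b & c)


class Counter2Bit:
    """Sequential: the output depends on stored state updated by a clock."""

    def __init__(self):
        self.state = 0

    def clock(self, enable):
        """Model one clock edge; the state advances only when enabled."""
        if enable:
            self.state = (self.state + 1) % 4
        return self.state


print(majority(1, 1, 0), majority(1, 0, 0))
counter = Counter2Bit()
print([counter.clock(1), counter.clock(0), counter.clock(1)])
```

Calling majority with the same inputs always gives the same result, whereas the same call to clock returns different values depending on the history held in state. Keeping these two kinds of description in separate parts of the RTL is what the scripts require.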


Direct Memory Access Controller
Background
As mentioned before, the DUT was changed from the AGE to the DMAC. This meant we
had to become familiar with the DMAC's logic and behavior, since we
needed to create test-benches for it.

Behavior
The DMAC is an integral part of the vendor-specific Graphics-On-Key (GOK) USB2.0
Device. The Device is dedicated to the USB Communication Channel. It has the potential of
being integrated into the Protocol Engine (PE) Device. The DMAC function, within the
GOK Device, is to transfer data between the USB2.0 Protocol Engine Receive/Transmit
(RX/TX) Packet Buffers and the Device Animation Graphics Engine (AGE) Function
Endpoints, in response to PE service requests. The DMAC is the only Bus Master in the
system. It is pre-configured to perform the required data transfers to and from the AGE
Application Function Core. The DMAC is capable of performing word gather-scatter, and
supports a system data bus width of up to 48 bits (6 bytes) and an address bus of up to 24 bits
(16 Mbyte address range).
Flyby and gather-scatter data transfer modes are supported, but memory-to-memory
transfers are not.

System level

The USB 2.0 Device DMAC is pre-programmed (ROM), to perform the required data
transfers to and from the AGE Application Function Core. The DMAC Configuration
Memory contains the necessary information to access any Endpoint Buffer (Memory
or Register Files), in the AGE Core.

The PE issues a Transaction Request command signal and a Packet Transfer
Request signal to the DMAC, for a specific AGE Endpoint. The DMAC responds with
an Acknowledge signal to the PE and starts data transfer transactions between the PE
Packet Buffers and the EP Buffer Registers or Memory, over the system bus, by
issuing Endpoint Buffer Address, Read and Write control signals, while monitoring the
AGE Wait signal (for slow Memories). Data transfers are performed either as single
bus cycle 16-bit word transfers (flyby mode) or in multiple bus cycles (gather-scatter
mode), to match different source and destination bus widths. In both single-packet
and multiple-packet data transfers, termination of a specific EP Input Transaction
(from EP to Host) is done by the DMAC monitoring the End-Of-Transaction signal,
issued by the Function (last EP Buffer address reached). In the case of an Output
Transaction (from Host to EP), if the last packet size is smaller than the predefined EP
MaxPacketSize, or the packet has data size = 0 (zero), the PE de-asserts its Transfer
Request signal. In the case of multiple-packet data transfers, only the Packet Transfer
Request signal is de-asserted, and the DMAC carries on with the next packet's data
transfers as soon as the Packet Transfer Request signal is asserted again. When both
Transfer Request and Packet Request are de-asserted, the DMAC returns to its idle
state and is ready to perform the next transfer request.
The DMAC accesses the PE's RX/TX Buffers (FIFOs) as I/O Devices, using dedicated
PE read/write signals. Data is transferred over the system data bus as 16-bit words.
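The purpose of gather-scatter mode, matching a 16-bit source to a wider or narrower destination, can be sketched as follows. This is an illustration under stated assumptions rather than the DMAC's actual behavior: the word ordering (first 16-bit word in the least significant bits) and the function names are hypothetical, and only bus widths that are multiples of 16 bits are handled.

```python
def gather(words16, width_bits):
    """Pack 16-bit words into wider bus words (first word in the low bits)."""
    per_word = width_bits // 16
    packed = []
    for i in range(0, len(words16), per_word):
        value = 0
        for j, word in enumerate(words16[i:i + per_word]):
            value |= (word & 0xFFFF) << (16 * j)
        packed.append(value)
    return packed


def scatter(wide_words, width_bits):
    """Split wide bus words back into 16-bit words, low bits first."""
    per_word = width_bits // 16
    return [(value >> (16 * j)) & 0xFFFF
            for value in wide_words
            for j in range(per_word)]


words = [0x1111, 0x2222, 0x3333]
wide = gather(words, 48)
print([hex(w) for w in wide], scatter(wide, 48) == words)
```

Here three bus cycles on the 16-bit side correspond to one 48-bit word on the other side. With a 16-bit destination, gather returns the words unchanged, which corresponds to the single-cycle flyby case.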


The block diagram of the DMA controller's state machine:

Notice that the flow splits left and right for the two directions: Rx path, and Tx path. Inside
each direction there are more splits, for different data sizes.


The Design:

This is a block diagram of the DMAC design. It is constructed from a Data Interface Unit
(DIU), a Finite State Machine unit and a Configuration ROM. On the left is the interface with
the Protocol Engine. On the right is the interface with the System Bus and the function.
The next stage was to create an environment that would allow us to visually verify the
design on an FPGA board.


Top design, with Verification diagram:

This block diagram represents the design that was implemented on the FPGA board.
In addition to the DMA controller, we used a stimuli ROM addressed by an 8-bit counter to
emulate data arriving from the Protocol Engine or from the function, i.e. the AGE.
To confirm the correctness of the data transfer, we used a 77-bit comparator and a monitor
ROM. The comparator compared the transferred data to the expected result stored in the
monitor ROM and indicated, using two LEDs, whether the data was transferred correctly and whether the
DMA control signals were in the correct state.
The components:
8-bit counter: a regular 8-bit counter. On each clock the count is increased by 1. The
output returns to 0x00 after reaching 0xFF or on reset.
Stimuli ROM: a Read Only Memory component that contains the data that will be
driven into the inputs of the DMAC. It is made of 57-bit words. It receives its
address from the 8-bit counter as an input.
Monitor ROM: a Read Only Memory component that contains the data that should
appear at the output of the DMAC for the given input address. It receives its address
from the 8-bit counter as an input.
77-bit Comparator: a unit that compares the expected data (from the monitor ROM)
to the collected data (from the DMAC). It splits the comparison into two parts: data and
control signals. If the expected and collected values are identical, both LEDs are
on.
And so, if both LEDs are on while the design is running, the design is working properly. It is important
to note that during reset, only the data OK LED is on.
After debugging the test bench, we obtained two working designs: with and without
Data Driven Clock Gating.
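The self-checking structure described above can be sketched in software. The model below is a simplification for illustration: the DUT is represented by a hypothetical stateless function, whereas the real DMAC is a state machine, and the ROM contents are invented for the example. It shows the counter addressing both ROMs and the comparison being split into a data result and a control result, one per LED.

```python
def run_harness(dut, stimuli_rom, monitor_rom, cycles=256):
    """Return one (data_ok, control_ok) pair per clock cycle."""
    results = []
    address = 0
    for _ in range(cycles):
        data_out, ctrl_out = dut(stimuli_rom[address])
        expected_data, expected_ctrl = monitor_rom[address]
        results.append((data_out == expected_data, ctrl_out == expected_ctrl))
        address = (address + 1) & 0xFF  # 8-bit counter wraps 0xFF -> 0x00
    return results


def toy_dut(word):
    """Hypothetical stand-in for the DMAC: returns (data, control)."""
    return word ^ 0xFF, word & 1


stimuli = list(range(256))
monitor = [toy_dut(word) for word in stimuli]
leds = run_harness(toy_dut, stimuli, monitor)
print(all(data_ok and ctrl_ok for data_ok, ctrl_ok in leds))
```

Because the same stimuli and expected values are used for both versions of the design, any difference in behavior introduced by the clock gating would show up as one of the two flags going false.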


Results
In parallel to our work this year, our flow was run on designs at CEVA.
The VLSI department at CEVA already used clock gating in their design flow.
Their gaters are based on control signals. That means that if an entire clock domain is
not functioning at a given time, the clock signal is blocked and is not forwarded to that
clock domain.
The clock gaters we suggest in the design flow are based on data and statistical
information.
The data-driven clock gaters were added to the design in addition to the control-driven
clock gaters. This limited the achievable power reduction, because the
design was already power-optimized.
To demonstrate the potential of the design flow, an activity test was performed on the DUT. In this test,
flip-flops that did not need a clock signal were sampled:

The table above shows that almost 98% of the flip-flops were active for only 0-5% of the entire
test. This means that there is potential for saving power by implementing the technique on
the design. However, that is not enough to ensure that saving is possible. It is also
necessary to show that many flip-flops have a high correlation between their clock-toggle
vectors, so that they can be gated together. The following graph shows just that:


The X-axis is the correlation percentage. The Y-axis is the number of flip-flops with the
corresponding correlation percentage. As can be seen, a very small percentage of the
flip-flops have low correlation, and a very large percentage have high
correlation.
This supports a prediction of high power saving potential.
After implementing the entire design flow on three different designs and measuring power with
the power estimation tool SpyGlass, the results obtained at CEVA were:

The SpyGlass Results:


The power reduction percentages that SpyGlass measured:
Design A: 22%
Design B: 15%
Design C: 13%


The tables below show the detailed results obtained with SpyGlass on Design C:
Golden design

                       Leakage    Internal   Switching   Total
Total Power:           337uW      12.7mW     40.0mW      53.0mW
Combinational Power:   224uW      2.65mW     22.3mW      25.2mW
Sequential Power:      95.2uW     8.96mW     1.05mW      10.1mW
Black Box Power:       0W         0W         10.3uW      10.3uW
Memory Power:          0W         0W         0W          0W
IO PAD Power:          0W         0W         0W          0W
Clock Power:           17.8uW     1.10mW     16.6mW      17.7mW

Above is the power measurement report that was derived from analysis of the golden
design. This means that no data-driven clock gating was performed on the design.
The next table shows the main power consumption data according to a given k. This
means that the data-driven clock gating process ran, and a separate design was created
for each gater fan-in size.


           Leakage    Internal   Switching   Total
golden     337uW      12.7mW     40.0mW      53.0mW
k=4        415uW      10.6mW     49.6mW      60.6mW
k=8        398uW      9.04mW     42.8mW      52.3mW
k=16       388uW      8.83mW     40.2mW      49.5mW
k=32       387uW      9.52mW     40.3mW      50.2mW
k=64       385uW      10.3mW     41.8mW      52.5mW
k=128      383uW      11.0mW     41.2mW      52.6mW

It is easily noticeable that most values of k reduce power consumption.
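The relative savings can be computed directly from the Total column of the table above. The short calculation below uses only those reported totals; the variable and function names are chosen for this illustration.

```python
GOLDEN_TOTAL_MW = 53.0
TOTAL_MW_BY_K = {4: 60.6, 8: 52.3, 16: 49.5, 32: 50.2, 64: 52.5, 128: 52.6}


def reduction_percent(total_mw, baseline_mw=GOLDEN_TOTAL_MW):
    """Power reduction relative to the golden design, in percent."""
    return 100.0 * (baseline_mw - total_mw) / baseline_mw


reductions = {k: reduction_percent(p) for k, p in TOTAL_MW_BY_K.items()}
best_k = max(reductions, key=reductions.get)

for k, value in reductions.items():
    print(f"k={k}: {value:.1f}%")
print("best k:", best_k)
```

The calculation gives a reduction of about 6.6% for k=16, the largest of the group, smaller reductions for k=8, 32, 64 and 128, and a negative value for k=4, where the total power exceeds that of the golden design.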


The Power reduction vs. K (fan-in):


The following table shows the power consumption data for a fan-in of k=16, with some
variations that were made outside of the design flow.
K=16 (with few variations)

                       Leakage    Internal   Switching   Total
Total Power:           378uW      8.88mW     37.1mW      46.3mW
Combinational Power:   255uW      2.82mW     23.2mW      26.3mW
Sequential Power:      100uW      5.14mW     1.62mW      6.86mW
Black Box Power:       0W         0W         10.8uW      10.8uW
Memory Power:          0W         0W         0W          0W
IO PAD Power:          0W         0W         0W          0W
Clock Power:           22.9uW     917uW      12.2mW      13.2mW

This design consumed about 13% less total power than the original golden design above (46.3mW compared to 53.0mW).

Result review:
It can be seen that the power saving is maximal with k=16. Also, when the
fan-in is too small, as with k=4, the power increases.
The combinational power increases with Data Driven Clock Gating as a result of
the extra logic components, the gaters. But the sequential power and the clock
power decrease more significantly because of the clock disabling.
Although the design already had control-driven clock gating, the activity test shows
that there is still room to save power, because the activity of 98% of the flip-flops
was low and the correlation between most of them was high.


Conclusions
The results shown in the previous chapter demonstrate that
Professor Shmuel Wimer's research on Data-Driven Clock Gating yields a practical and
efficient power reduction tool. The design flow that was developed during this project
turned the research into a practical tool that can transform a given RTL design into a more
energy-efficient one.
Design flow review:
The ability to work in RTL saved a lot of design flow runtime and made
the flow more effective. This change becomes more relevant, and even crucial,
when implementing the design flow on a large design, due to the exponential
growth of runtime in every stage of the design flow.
We added overhead to the design in the form of logic components, the gaters.
The ability to group a number of flip-flops together, using statistical knowledge as
a tool, was the main power saving element. Both of these aspects appeared in the
results, as an increase and a decrease of power, respectively, in the final design.
Even when a design already has control-driven clock gaters, Data-Driven Clock
Gating proves effective. The fact that most of the flip-flops were inactive for most
of the run time, together with the high correlation between most of them, made it possible to
decrease power despite the control-driven gaters.
There is still room for improvement in the versatility and user friendliness of the
scripts and the design flow. The limitations of the scripts create a need to
change the design. This happens because the scripts can't handle a design in which
combinational and sequential logic are mixed. In addition, the scripts won't
work on a design that has a synchronous reset. The code added to the test-bench
for the tracing stage should be inserted as part of the flow (by one of
the programs) and not manually. It would be ideal to create a main program with a
graphical user interface (GUI) that would combine the entire design flow. That way, the flow
would be easier to run and more user friendly.
There is still a need to obtain results in ASIC to confirm the efficiency of
implementing Data-Driven Clock Gating.
A good simulation that mimics a real application of the design
has a significant influence on the effectiveness of the design flow. This is because
the technique is based on statistics and correlation: the more realistic
the simulation, the more accurate the statistical results.
Our attempts to measure the power consumption on the FPGA boards were not
successful. The reason was that the boards have a very high static power
consumption level, due to all their BRAMs and LUTs. Even after instantiating the
design 100 times and measuring the power consumption with ISE ChipScope
using the built-in 0.005 ohm series resistor, the power difference was not apparent.
That is probably the reason FPGA boards are used in industry to
check the design integrity of low-power devices, while the actual devices are
manufactured as ASICs.


References and Sources


1. The Optimal Fan-Out of Clock Network for Power Minimization by Adaptive Gating
By Shmuel Wimer and Israel Koren.
2. Optimal Flip-Flop Grouping in Data-Driven Clock Gating for Maximal Power Saving
By Shmuel Wimer and Israel Koren.


Appendix A

DMAC Spec:

USB2.0 aware DMAC Specification


1. Introduction
The document defines a USB2.0 protocol-aware Direct Memory Access Controller (DMAC)
Device.
The DMAC is an integral part of the vendor-specific Graphics-On-Key (GOK) USB2.0
Device.
The Device is dedicated to the USB Communication Channel. It has the potential of being
integrated into the Protocol Engine (PE) Device. The DMAC function, within the GOK
Device, is to transfer data between the USB2.0 Protocol Engine Receive/Transmit (RX/TX)
Packet Buffers and the Device Animation Graphics Engine (AGE) Function Endpoints, in
response to PE service requests. The DMAC is the only Bus Master in the system. It is
pre-configured to perform the required data transfers to and from the AGE Application Function
Core. The DMAC is capable of performing word gather-scatter, and supports a system data bus
width of up to 48 bits (6 bytes) and an address bus of up to 24 bits (16 Mbyte address range).
Flyby and gather-scatter data transfer modes are supported, but memory-to-memory transfers
are not.

2. System Level Introduction


Fig. 1 - GOK USB2.0 Device System Block Diagram (blocks shown: USB Connector, Transceiver Chip (PHY), UTMI, Protocol Engine, DMAC, AGE, System Bus)
The USB 2.0 Device DMAC is pre-programmed (ROM), to perform the required data
transfers to and from the AGE Application Function Core. The DMAC Configuration Memory
contains the necessary information to access any Endpoint Buffer (Memory or Register Files),
in the AGE Core.
The PE issues a Transaction Request command signal and a Packet Transfer Request signal to
the DMAC, for a specific AGE Endpoint. The DMAC responds with an Acknowledge signal
to the PE and starts data transfer transactions between the PE Packet Buffers and the EP Buffer
Registers or Memory, over the system bus, by issuing Endpoint Buffer Address, Read and
Write control signals, while monitoring the AGE Wait signal (for slow Memories). Data transfers
are performed either as single bus cycle 16-bit word transfers (flyby mode) or in multiple
bus cycles (gather-scatter mode), to match different source and destination bus widths. In both
single-packet and multiple-packet data transfers, termination of a specific EP Input Transaction
(from EP to Host) is done by the DMAC monitoring the End-Of-Transaction signal, issued by
the Function (last EP Buffer address reached). In the case of an Output Transaction (from Host to
EP), if the last packet size is smaller than the pre-defined EP MaxPacketSize, or the packet has
data size = 0 (zero), the PE de-asserts its Transfer Request signal. In the case of multiple-packet
data transfers, only the Packet Transfer Request signal is de-asserted, and the DMAC carries
on with the next packet's data transfers as soon as the Packet Transfer Request signal is asserted again.
When both Transfer Request and Packet Request are de-asserted, the DMAC returns to its idle
state and is ready to perform the next transfer request.
The DMAC accesses the PE's RX/TX Buffers (FIFOs) as I/O Devices, using dedicated PE
read/write signals. Data is transferred over the system data bus as 16-bit words.

2.1. DMAC Top Level Introduction
The DMAC's three main modules are the Control Core (FSM), the Configuration ROM and the Data
Interface Unit (DIU).

[Block diagram: USB2.0-aware DMAC with its Control Core (FSM), DIU and Configuration ROM, together with the Protocol Engine and the System Bus]
Fig. 2 - DMAC Block Diagram

2.2. DMAC Modules
The DMAC is partitioned into modules, as shown in the Fig. 2 Block Diagram and described
below.
2.2.1. Configuration ROM
The Configuration ROM contains the essential information necessary to access any
pre-defined Application Function Endpoint Buffer (Memory or Register Files). The
Configuration information enables the DMAC to properly carry out the data
transactions requested by the PE. Since the PE supplies the Endpoint's transfer
direction (IN/OUT) and the Endpoint number (1-15) at transaction request time, the specific
Endpoint Buffer can be selected directly; the EP Buffer data width (DW), however, must reside within
the Configuration ROM.
2.2.2. Control Core
The Control Core is the main Finite State Machine (FSM), handling all Device
operations.
At system boot time, the DMAC enters its Idle State, ready to carry out data transfers.
It operates under PE control.
The Control Core translates PE requests into data transfer actions, according to the
information stored in the Configuration ROM. The PE initiates a data transfer operation by
asserting the Transfer Request signal and supplying the Endpoint info (4-bit EP number + 1-bit in/out).
PE requests are passed to the Control Core. The Control Core uses the
pre-programmed EP Buffer Data Width (DW) information to perform either a
flyby transaction or a gather-scatter transaction. With each bus data transfer, the value
in the current address counter is driven onto the address bus, and the current address
counter is automatically incremented. At transaction completion (single or multiple
packets), the address counter is cleared. The FSM controls the address counter increment and its reset at transaction
completion, as well as reads and writes of the DIU's Gather-Scatter Registers.
When the PE issues the transaction request signal, the DMAC responds with an
Acknowledge signal to the PE, and when the PE issues the packet transfer request, the
DMAC starts transferring data as requested. The Control Core issues the required control
signals for both the EP Buffer and the PE RX/TX FIFOs, in the correct sequence, to
perform either a flyby or a gather-scatter data transfer operation: it issues the Read/Write
control signals, monitors the Wait signal and increments the address counter for as long as
the data transfer continues.
When the last data byte has been received from or sent to the PE Packet Buffer, the
PE negates the DMA Request signals.
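The request/acknowledge sequence described above can be sketched as a small behavioural model. This is only an illustrative Python sketch, not the project's Verilog FSM: the state names, the class ControlCoreModel and its step() method are hypothetical, signals are modelled as booleans with True meaning asserted (the real pins are active low), and End-Of-Transaction handling is left out.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    WAIT_PACKET = auto()  # transaction acknowledged, waiting for a packet request
    TRANSFER = auto()

class ControlCoreModel:
    """Toy cycle-level model; True means 'asserted' (assumption, pins are active low)."""

    def __init__(self):
        self.state = State.IDLE
        self.addr = 0      # current address counter
        self.dack = False  # Acknowledge towards the PE

    def step(self, treq, preq, wait=False):
        """Advance one clock. Returns the address driven this cycle, or None."""
        if not treq and not preq:
            # Both requests de-asserted: back to idle, address counter cleared.
            self.state, self.addr, self.dack = State.IDLE, 0, False
            return None
        if self.state is State.IDLE:
            if treq:
                self.dack = True  # acknowledge the transaction request
                self.state = State.WAIT_PACKET
            return None
        if not preq:
            # Between packets: keep the address counter, wait for the next packet.
            self.state = State.WAIT_PACKET
            return None
        self.state = State.TRANSFER
        if wait:
            return None  # slow memory extends the bus cycle, no increment
        driven = self.addr
        self.addr += 1
        return driven
```

For example, asserting treq alone produces the acknowledge, and each following cycle with preq asserted and wait de-asserted drives the next buffer address.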
2.2.3. Data Interface Unit (DIU)
The DIU contains the temporary Registers for the gather-scatter transfer operation
and their control logic. It also contains the 3-state Data Buffers that manage the
bidirectional data flow to/from the AGE Buffers and to/from the PE RX/TX FIFOs.
Gather-Scatter Registers are used for data transfers between source and
destination of different data widths, i.e. the PE RX/TX FIFOs (I/O) are always 16 bits wide, while
Function Endpoint Buffers can be 16/24/32/48 bits wide. 24-bit transfers must be
performed in an even number of transfers.
There are 4 gather-scatter registers: R1 (16-bit), R2 & R3 (8-bit each), R4 (16-bit).
 47          32 31      24 23      16 15           0
|      R4      |    R3    |    R2    |      R1      |

Gather-Scatter operation:
- IN (from EP to PE TX FIFO)
32bit: Read 32bit word into R1-R2-R3. Write 2 16bit words from R1 & R2+R3.
24bit: Read 2 24bit words into R1+R2 & R3+R4. Write 3 16bit words from R1,
R2+R3 & R4.
48bit: Read 48bit word into R1-R2-R3-R4. Write 3 16bit words from R1, R2+R3 &
R4.
- OUT (from PE RX FIFO to EP)
32bit: Read 2 16bit words into R1 & R2+R3. Write 32bit word from R1-R2-R3.
24bit: Read 3 16bit words into R1-R2-R3-R4. Write 2 24bit words from R1+R2 &
R3+R4.
48bit: Read 3 16bit words into R1-R2-R3-R4. Write 48bit word from R1-R2-R3-R4.
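
The width conversion listed above can be illustrated with a short Python sketch. It treats R1-R4 as one 48-bit value (R1 in bits 15:0 up to R4 in bits 47:32, as in the register layout) and assumes that the first word of a group lands in the least significant registers; that ordering, the GROUPS table and the function names gather_in and scatter_out are illustrative assumptions, not taken from the design.

```python
# EP buffer width -> (EP words per group, 16-bit PE words per group)
GROUPS = {32: (1, 2), 24: (2, 3), 48: (1, 3)}

def gather_in(ep_words, width):
    """IN direction: EP words of `width` bits -> list of 16-bit PE TX FIFO words."""
    n_ep, n_pe = GROUPS[width]
    if len(ep_words) % n_ep:
        raise ValueError("incomplete group (24-bit needs an even word count)")
    out = []
    for i in range(0, len(ep_words), n_ep):
        regs = 0  # models R4:R3:R2:R1 as a single 48-bit value
        for k, word in enumerate(ep_words[i:i + n_ep]):
            regs |= (word & ((1 << width) - 1)) << (k * width)
        out += [(regs >> (16 * j)) & 0xFFFF for j in range(n_pe)]
    return out

def scatter_out(pe_words, width):
    """OUT direction: 16-bit PE RX FIFO words -> list of EP words of `width` bits."""
    n_ep, n_pe = GROUPS[width]
    if len(pe_words) % n_pe:
        raise ValueError("incomplete group of 16-bit words")
    out = []
    for i in range(0, len(pe_words), n_pe):
        regs = 0
        for j, word in enumerate(pe_words[i:i + n_pe]):
            regs |= (word & 0xFFFF) << (16 * j)
        out += [(regs >> (k * width)) & ((1 << width) - 1) for k in range(n_ep)]
    return out
```

Under these assumptions, two 24-bit EP words become three 16-bit words and scatter_out reverses the operation, which is why 24-bit transfers come in pairs.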
2.3. Interfaces
Signal Name   Signal Type      Description
dbus[47:0]    Bi-directional   Data Bus. These pins serve as the input and output System data bus (for the local µC, PE Packet Buffers and Application Buffers).
abus[23:0]    Bi-directional   Address Bus. Serves as System Address Bus for the DMAC. 16 LSBs are used by the µC to access the Control Registers.
nrd           Bi-directional   System Read signal issued by Bus Masters (DMAC or µC).
nwr           Bi-directional   System Write signal issued by Bus Masters (DMAC or µC).
npbrd         Out              Read signal for PE during data transfers. Active low.
npbwr         Out              Write signal for PE during data transfers. Active low.
nwait         In-Active low    Used to extend bus cycle for slow Application Memories.
ndack         In-Active low    DMAC Acknowledge to PE Transfer Request.
ntreq         In-Active low    DMA Transaction Request signal from Protocol Engine.
npreq         In-Active low    DMA Packet Request signal from Protocol Engine.
neot          In-Active low    End-Of-Transaction signal, issued by the Function.
epn[3:0]      In-Active Hi     Endpoint number (1-15) for requested data transfer.
ep_dir        In-Active Hi     Endpoint Direction IN (1) or OUT (0) for requested data transfer.
clk           Input            Oscillator input. Connected to an External Oscillator.
nrst          In-Active low    Reset. External asynchronous static reset.
Vcc           Input            Internal Power Source (derived from USB V+, via LDO).
Vss           Input            Internal Power Ground (derived from USB V-, via LDO).

Note: The DMAC uses the Endpoint number (epn[3:0]) and the Transaction direction (ep_dir) as
its internal ROM address, to perform the expected data transfer to/from the specific Endpoint
at the Function Core. They also serve as chip selects for the Buffers within the Function Cores.
The DMAC also issues the nrd/nwr and npbrd/npbwr signals and the current EP Buffer address to handle
the data transfer. The control signals npbrd and npbwr are used by the PE to drive RX FIFO output
data onto the system data bus or to latch data from the system bus into the TX FIFO,
depending on the transfer direction.
2.4. Programming Model
2.4.1. Configuration ROM
The ROM holds the configuration data. A single function within a single
configuration, having up to 15 OUT and 15 IN Endpoints, is supported.
A set of 15 OUT & 15 IN Endpoints information is pre-programmed in the ROM.
Information for each Endpoint includes:
- EP Data Width (DW in 16bit words) - 2bits/EP [00-16bit, 01-32bit, 10-2x24bit, 11-48bit].
Default - 00.
Data per EP is 2 bits, so a 16-bit ROM Word holds the DW info of 8 Endpoints.
The information for 15 EPs per direction per Function is stored in 2 16-bit ROM Words.
Total Configuration ROM size is 2 x 2 = 4 16-bit ROM Words.
Individual ROM Words are accessed via an internal 2-bit address bus.
2.4.2. ROM Words Data Formats
Fn_EPn_I/O Registers EP Data Size (Data Width units)

ROM Words holding EPs 1-8:
MSB                                                                           LSB
| D15 D14 | D13 D12 | D11 D10 | D9  D8  | D7  D6  | D5  D4  | D3  D2  | D1  D0  |
|---EP8---|---EP7---|---EP6---|---EP5---|---EP4---|---EP3---|---EP2---|---EP1---|

ROM Words holding EPs 9-15:
D15 D14 D13 D12 D11 D10 D9 D8 D7 D6 D5 D4 D3 D2 D1 D0
|----EP15------|-----EP14----|----EP13-----|----EP12----|----EP11----|----EP10----|----EP9-----|

2.4.3. Configuration ROM Words List


Address   Register Name   Register Description
0h        F1_EP1_8_O      Function #1, Output EPs 1-8 DW
1h        F1_EP9_15_O     Function #1, Output EPs 9-15 DW
2h        F1_EP1_8_I      Function #1, Input EPs 1-8 DW
3h        F1_EP9_15_I     Function #1, Input EPs 9-15 DW
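
A Data Width lookup based on the ROM Words List can be sketched in Python as follows. The address mapping (ep_dir as the upper address bit, EP range as the lower bit) follows from the list above; placing EPs 9-15 in the low 14 bits of their word, by analogy with EPs 1-8, is an assumption. The names DW_BITS, rom_address and ep_data_width are hypothetical.

```python
# 2-bit DW code -> EP buffer width in bits (code 10 means 2 x 24-bit transfers)
DW_BITS = {0b00: 16, 0b01: 32, 0b10: 24, 0b11: 48}

def rom_address(epn, ep_dir):
    """2-bit ROM address: 0h/1h for OUT (ep_dir = 0), 2h/3h for IN (ep_dir = 1)."""
    if not 1 <= epn <= 15:
        raise ValueError("endpoint number must be 1..15")
    return (ep_dir << 1) | (1 if epn >= 9 else 0)

def ep_data_width(rom, epn, ep_dir):
    """Return the EP buffer width in bits from a list of four 16-bit ROM words."""
    word = rom[rom_address(epn, ep_dir)]
    # Assumed field position: EP1 and EP9 sit in D1:D0, the next EP in D3:D2, etc.
    shift = 2 * ((epn - 1) % 8)
    return DW_BITS[(word >> shift) & 0b11]
```

For instance, with rom = [0x0000, 0x0004, 0x0030, 0x0000], OUT EP10 is read as 32 bits wide and IN EP3 as 48 bits wide, while every other endpoint keeps the 16-bit default.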

3. Implementation
The DMAC is designed as a Front-End for a near-future ASIC implementation. It is written
in Verilog HDL and simulated and logically verified for correct operation using the Cadence
Incisive Simulator.
An intermediate hardware implementation, for proof of concept and verification of correct functionality, is
performed on the FPGA device of an Altera DE2 Development Board, under the Quartus II
Development Environment. The Verilog code logically verified in Incisive is used for this
implementation.
The Quartus II MegaFunction Wizard is not used.
There is an option to incorporate the Protocol-Aware DMAC into the USB2.0 Protocol Engine.

4. USB2.0 Device System Diagram

[System diagram: XTAL Oscillator, LDO, USB Connector, USB2.0 PHY (UTMI), USB2.0 Protocol Engine, USB2.0-Aware DMAC and Function Core 0, with the Data Bus and Address Bus and the signals ndack, ntreq, npreq, ep_dir, epn[3:0], npbwr, npbrd, nwr, nrd, neot, nwait, abus, dbus, clk and nrst]
Fig. 3 USB2.0 GOK Device Controller System Diagram


4.1. DMA Transfer Types and Modes:
4.1.1. Flyby DMA transfer
The fastest DMA transfer type is referred to as a single-cycle, single-address, or flyby
transfer.
In a flyby DMA transfer, a single bus operation is used to accomplish the transfer,
with data read from the source and written to the destination simultaneously. In flyby
operation, the device requesting service (PE) asserts a DMA request on the
appropriate channel request line of the DMAC (specific Function Endpoint). In
response, the DMAC issues an acknowledge signal to the requesting device (PE) and
starts the data transfer by issuing the appropriate control signals and the Endpoint buffer
start address (0). These control signals alert the requesting device to drive the data onto the
system data bus or to latch the data from the system bus, depending on the direction of
the transfer. In other words, a flyby DMA transfer looks like a memory read or write
cycle with the DMAC supplying the address and the I/O device reading or writing the
data. Because flyby DMA transfers involve a single memory cycle per data transfer,
these transfers are very efficient; however, memory-to-memory transfers are not
possible in this mode.
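As a rough illustration of the flyby idea, the Python sketch below moves one 16-bit word per modelled bus cycle directly between a PE FIFO and an EP buffer, with no intermediate DMAC register. The functions flyby_out and flyby_in, and the use of a deque and a list as stand-ins for the FIFO and the buffer, are assumptions made for the example.

```python
from collections import deque

def flyby_out(rx_fifo, ep_buffer, count, start=0):
    """OUT direction: PE RX FIFO -> EP buffer, one 16-bit word per bus cycle."""
    addr = start
    for _ in range(count):
        # Single bus cycle: the DMAC supplies addr and the write strobe while
        # the PE drives the FIFO word onto the data bus.
        ep_buffer[addr] = rx_fifo.popleft() & 0xFFFF
        addr += 1
    return addr  # value of the address counter after the transfer

def flyby_in(ep_buffer, tx_fifo, count, start=0):
    """IN direction: EP buffer -> PE TX FIFO, one 16-bit word per bus cycle."""
    addr = start
    for _ in range(count):
        # Single bus cycle: the EP buffer drives the bus, the PE latches the word.
        tx_fifo.append(ep_buffer[addr] & 0xFFFF)
        addr += 1
    return addr
```

Each loop iteration corresponds to one bus cycle, which is why a word never needs to be stored inside the DMAC in this mode.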
4.1.2. Gather-Scatter DMA Transfer
This type of transfer is useful for interfacing devices with different data bus sizes. The
DMAC employs multiple-cycle, multiple-address data transfers, called Gather-Scatter
transfers.
The data being transferred is first read from the I/O device or memory into
temporary DMA internal data registers. The data is then written to the memory or I/O
device in the following cycles.
This device has only a single address counter and hence supports only memory-to-I/O
transfers.
