HW/SW Co-design Fundamentals

HW/SW Co-design
National Chiao Tung University

Chun-Jen Tsai
05/17/2011
Many materials in the slides are from the book: J. Staunstrup and W. Wolf (Eds.), Hardware/Software Co-Design,
Kluwer Academic, 1997
System Design Process
 The design process is the set of design tasks that transform a model into an architecture:
 A model is a formal system consisting of objects and composition rules, and is used for describing a system’s characteristics
 An architecture defines system implementation by specifying the number and types of components as well as the connection between them

Generic Hardware Architecture (1/2)
 Any digital circuit can be divided into two parts: datapath and controller
 Datapath
 Performs data processing: e <- (a+b) * (c+d);
 Computes functions concurrently
 Controller
 Enforces sequential execution of system behavior
 Follows algorithms step by step
Step 1: c <- a + b;
Step 2: if (c > 5) then
          d <- a + b;
        else
          d <- a - b;
        end
Step 3: c <- c * d;

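The three control steps above can be sketched as a small state machine that drives register updates. This is only an illustration: the state names STEP1..STEP3/DONE, the register dictionary, and the function name are assumptions, not part of the slides.

```python
def run_controller(a, b):
    """Execute the three-step algorithm, one controller state per step."""
    regs = {"a": a, "b": b, "c": 0, "d": 0}  # datapath registers (assumed names)
    state = "STEP1"
    trace = []
    while state != "DONE":
        trace.append(state)
        if state == "STEP1":
            regs["c"] = regs["a"] + regs["b"]          # c <- a + b
            state = "STEP2"
        elif state == "STEP2":
            if regs["c"] > 5:                          # controller tests a datapath result
                regs["d"] = regs["a"] + regs["b"]      # d <- a + b
            else:
                regs["d"] = regs["a"] - regs["b"]      # d <- a - b
            state = "STEP3"
        elif state == "STEP3":
            regs["c"] = regs["c"] * regs["d"]          # c <- c * d
            state = "DONE"
    return regs, trace


regs, trace = run_controller(4, 3)
print(regs["c"], regs["d"], trace)
```

The controller only sequences the steps; every arithmetic operation is a datapath function selected by the current state.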
Generic Hardware Architecture (2/2)
 One can rewrite control into datapath and vice versa:
 Control: if i1 = '0' then o1 <= a; else o1 <= b; end if;
 Datapath: o1 <= ((i1 = '0') and a) or ((i1 = '1') and b);
 Data/control distinction is useful but not fundamental
 Generic hardware component:
[Figure: combinational logic with an input and an output, plus a state register (D/Q) whose value is fed back into the logic]

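A quick way to convince yourself that the two forms are equivalent is to check them exhaustively for one-bit signals. The sketch below assumes 0/1 integers stand in for the logic values; the function names are made up for illustration.

```python
from itertools import product


def mux_control(i1, a, b):
    """Control form: a conditional selects which input reaches o1."""
    if i1 == 0:
        return a
    return b


def mux_datapath(i1, a, b):
    """Datapath form: o1 = ((i1 = '0') and a) or ((i1 = '1') and b)."""
    sel0 = 1 - i1  # 1 when i1 is 0
    sel1 = i1      # 1 when i1 is 1
    return (sel0 & a) | (sel1 & b)


# Compare both forms on all 8 combinations of one-bit inputs.
equivalent = all(
    mux_control(i1, a, b) == mux_datapath(i1, a, b)
    for i1, a, b in product((0, 1), repeat=3)
)
print(equivalent)
```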
HW/SW Distinction?
 In principle, there is really no such thing as “HW/SW Codesign,” since a “software” component can be regarded as a register-transfer machine processing some input data (program instruction codes):
[Figure: clocked chain of combinational logic blocks separated by registers (D/Q), ending in an output]

System Models
 Control-operation oriented model

 Finite-State Machine (FSM)
 Data-processing oriented model
 Dataflow Graph (DFG)
 Joint model
 FSM with Datapath
 Hierarchical Concurrent FSM
 Programming-based model
 Program-State Machine
FSM for Controller Model
 The finite state machine (FSM) model consists of a set of states, a set of transitions between states, and a set of actions associated with these states or transitions
 FSM is used for modeling control systems
[Figure: two FSM diagrams with edges labeled input (condition) / output: (a) input-based (Mealy-type FSM), (b) state-based (Moore-type FSM)]

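The difference between the two FSM types can be sketched with a toy detector that reports when two consecutive 1s have been seen. The detector, the state names, and the table layout are assumptions chosen for illustration; they do not come from the slide's figure.

```python
# Mealy: the output is attached to the transition (state, input).
MEALY = {
    ("S0", 0): ("S0", 0),
    ("S0", 1): ("S1", 0),
    ("S1", 0): ("S0", 0),
    ("S1", 1): ("S1", 1),  # second consecutive 1: output 1 on this transition
}

# Moore: the output is attached to the state, so an extra state is needed.
MOORE_NEXT = {
    ("S0", 0): "S0", ("S0", 1): "S1",
    ("S1", 0): "S0", ("S1", 1): "S2",
    ("S2", 0): "S0", ("S2", 1): "S2",
}
MOORE_OUT = {"S0": 0, "S1": 0, "S2": 1}


def run_mealy(inputs):
    state, outputs = "S0", []
    for x in inputs:
        state, y = MEALY[(state, x)]
        outputs.append(y)
    return outputs


def run_moore(inputs):
    state, outputs = "S0", []
    for x in inputs:
        state = MOORE_NEXT[(state, x)]
        outputs.append(MOORE_OUT[state])  # sampled after the state update
    return outputs


bits = [1, 1, 0, 1, 1, 1]
print(run_mealy(bits), run_moore(bits))
```

Here the Moore output is sampled right after each state update, so both lists line up; in clocked hardware the Moore output would typically become visible one cycle after the Mealy output.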

Dataflow Graph for Datapath Model
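 Data flow graph (DFG) is used for describing computational components in a system
[Figure: DFG computing sqrt(a^2 + b^2): inputs a and b each pass through a square node, the results go to an add node, and the sum goes to a sqrt node; the levels are labeled t1 to t4]
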
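A DFG like this one can be evaluated by firing each node once all of its inputs are available. The sketch below assumes Python 3.9+ (for `graphlib`); the node names `sq_a`, `sq_b`, `add`, `sqrt` and the `OPS` table are illustrative choices, not notation from the slides.

```python
import math
from graphlib import TopologicalSorter

# Each node: (function, list of predecessor nodes whose values it consumes).
OPS = {
    "sq_a": (lambda a: a * a, ["a"]),
    "sq_b": (lambda b: b * b, ["b"]),
    "add": (lambda x, y: x + y, ["sq_a", "sq_b"]),
    "sqrt": (math.sqrt, ["add"]),
}


def evaluate(inputs):
    """Fire every node in a data-dependency-respecting order."""
    values = dict(inputs)
    graph = {name: set(deps) for name, (_, deps) in OPS.items()}
    for node in TopologicalSorter(graph).static_order():
        if node in values:  # primary inputs a and b are already known
            continue
        fn, deps = OPS[node]
        values[node] = fn(*(values[d] for d in deps))
    return values["sqrt"]


print(evaluate({"a": 3, "b": 4}))
```

The two square nodes have no dependency on each other, so a hardware datapath could compute them concurrently; the sequential loop here only fixes one valid order.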
Finite-State Machine with Datapath
 FSMD models a system with mixed control and computation functions
 FSMD contains a datapath driven by the signals generated from a controller
[Figure: FSMD with a start arrow into state S1; transitions are annotated with an input (condition), an output, and an action (data processing): (floor != requested) / floor = requested / move = requested - current, and (floor == requested) / x / move = 0]

Hierarchical Concurrent FSM
 HCFSM captures a more complex control system by further dividing each state into hierarchical and/or concurrent sub-states
[Figure: state Y containing sub-states A and D, with lower-level states B, C, E, F, G and transitions labeled a(P)/c and b]

Program-State Machine (PSM)
 PSM is a heterogeneous model that integrates an HCFSM with a programming language paradigm
[Figure: state Y containing sub-states A and D, with states B and C and transitions e1 and e2; a state's behavior is described by the program below. Transition types: Transition-On-Completion, Transition-Immediately]
int A[20];
int idx, max;
max = 0;
for (idx=0; idx<20; idx++)
{
    if (A[idx] > max)
    {
        max = A[idx];
    }
}

System Architecture
 Various architectures are suitable for implementation of system models:
 Hard-wired logic implementation
 Controller
 Datapath
 FSMD
 Software programming implementation
 CISC
 RISC
 VLIW
 Combined Co-design Architecture
Controller Architecture
 An FSM can be implemented using the controller architecture
 Control steps are synchronized to clock cycles
 A longer schedule means more states in the controller

Controllers and Scheduling
 For a functional model, you can design different execution schedules using different controllers:
x <= a + b;
y <= c + d;
[Figure: one-state design: a single state s1 with a transition labeled –/ x = a + b; y = c + d]
[Figure: two-state design: states s1 and s2 with transitions labeled –/ x = a + b and –/ y = c + d]

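The trade-off between the two designs can be sketched by treating a schedule as a list of controller states, each holding the additions it performs in one clock cycle. The tuple format `(dst, src1, src2)` and the names `ONE_STATE`/`TWO_STATE` are assumptions made for this example.

```python
def run_schedule(schedule, inputs):
    """Run one controller state per clock cycle and report the cost."""
    regs = dict(inputs)
    for ops in schedule:
        # All operations in a state read the old register values,
        # then the results are written together at the clock edge.
        results = {dst: regs[x] + regs[y] for dst, x, y in ops}
        regs.update(results)
    cycles = len(schedule)                      # one cycle per state
    adders = max(len(ops) for ops in schedule)  # adders needed in the busiest state
    return regs, cycles, adders


ONE_STATE = [[("x", "a", "b"), ("y", "c", "d")]]
TWO_STATE = [[("x", "a", "b")], [("y", "c", "d")]]

data = {"a": 1, "b": 2, "c": 3, "d": 4}
print(run_schedule(ONE_STATE, data))
print(run_schedule(TWO_STATE, data))
```

Both schedules compute the same x and y; the one-state design needs two adders for one cycle, while the two-state design can reuse one adder over two cycles.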
Distributed Controllers
 You can use a single monolithic controller, or several smaller synchronized controllers for your design
[Figure: one centralized controller vs. two distributed controllers]

Synchronization between FSMs
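 To pass values between two machines, you must schedule output of one machine to coincide with input expected by the other:
[Figure: machine M1 with states s1, s2, s3 (transitions i1=0/– and i1=1/– out of s1, then –/ w=0 and –/ w=1) drives signal w into machine M2 with states t1 and t2 (transitions w=0/– and w=1/–)]
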
Datapath Architecture
 Datapath is a direct implementation of DFGs
 Example: two implementations of an FIR filter:
[Figure: 3-stage pipeline datapath and 4-stage pipeline datapath]

Data Operators
 Arithmetic operations are easy to spot in hardware description languages:
 x <= a + b;
 Multiplexers are implied by conditionals, or result from sharing adders, etc.
HDL code:
if x = '0' then
  reg1 <= a;
else
  reg1 <= b;
end if;
[Figure: register-transfer view: a 2-to-1 multiplexer (input 0 = a, input 1 = b, select = x) feeding the D input of register reg1]

Hardware Resource Sharing
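 Fewer adders, more cycles:
[Figure: inputs a, b, c selected by two multiplexers into a single adder whose result is stored in a register (D/Q)]
 More adders, fewer cycles:
[Figure: inputs a, b, c feeding two adders and a multiplexer]
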
Logic Pipeline
 Dividing a datapath into a pipeline provides higher utilization of logic
 For example, the snapshot of a pipelined datapath at time t is as follows:
[Figure: combinational logic 1, 2, and 3 separated by registers (D/Q); each stage takes its input at time t and its output becomes effective at time t+1]

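The behavior of such a pipeline can be sketched by simulating clock ticks: on every tick each stage computes from its current input, and the results are latched into the registers for the next tick. The stage functions and the use of `None` as an empty (bubble) slot are assumptions for this illustration.

```python
def simulate(stages, inputs):
    """Clock a linear pipeline; registers sit between consecutive stages."""
    regs = [None] * (len(stages) - 1)       # pipeline registers, initially empty
    outputs = []
    ticks = 0
    # Extra empty ticks at the end flush the values still in flight.
    for x in list(inputs) + [None] * (len(stages) - 1):
        stage_inputs = [x] + regs            # stage i reads register i-1
        stage_outputs = [
            None if v is None else f(v) for f, v in zip(stages, stage_inputs)
        ]
        if stage_outputs[-1] is not None:
            outputs.append(stage_outputs[-1])
        regs = stage_outputs[:-1]            # latch results at the clock edge
        ticks += 1
    return outputs, ticks


stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
print(simulate(stages, [1, 2, 3]))
```

Once the pipeline is full, all three logic blocks work on different data in the same tick, which is where the higher utilization comes from.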
Pipelines with Control
 A pipeline may do different things at different times
 Example: CPU architecture
[Figure: clocked pipeline of registers (D/Q) and combinational logic blocks from data input to output, with control inputs applied to the logic blocks]

FSMD Architecture
 FSMD architecture combines a controller with a datapath

CISC Architecture
 CISC is a program execution logic which is program-controlled itself:

RISC Architecture
 RISC is a program execution logic with hardwired control:

VLIW Architecture
 Very Long Instruction Word (VLIW) architecture issues multiple independent instructions in each cycle
 It is very difficult to develop optimizing compilers for such architectures, so hand-coding in assembly is still a necessity to program such devices
[Figure: a four-way VLIW architecture with memory and a 16-port register file]

Co-Design Architecture
 Co-design architecture interconnects several concurrent processing elements (PEs):
[Figure: processing elements GPP 1, ASIC 1, and GPP 2 connected to Local Memory 1 to 3 and Global Memory 1 to 2]
GPP: General purpose processor (CISC or RISC)
ASIC: Application specific IC (custom logic)

Co-Design Descriptions
 To map models of system behaviors into an architecture, a language that can describe certain system features is required
 The system must be defined on different levels of abstraction at each design step; general characteristics to be described are:
 Hierarchy
 Concurrency
 Communication/Exception handling
 Timing/Synchronization
 …

Hierarchy Description
 Structural hierarchy:
[Figure: a Processor and a Memory connected by a data bus and control lines]
 Behavioral hierarchy:
 Behavior P is composed of behavior Q and behavior R …

Behavioral Decomposition
 After hierarchical decomposition, each bottom-level sub-behavior can further be classified as either sequential, concurrent, or pipelined

Concurrency Classifications
 Data-driven concurrency:
 Operation execution depends only upon the availability of data; the degree of concurrency is limited by data dependencies
 Pipelined concurrency:
 An extension to data-driven concurrency by dividing operations into groups (stages), which operate on different data sets concurrently
 Control-driven concurrency:
 Also referred to as thread-level parallelism; explicit constructs are used to specify concurrent execution of multiple control tasks

Communication
 When two components interact with each other in a system, a mechanism of communication must be defined
 Software constructs: function calls, global variables, messages sent through a network, etc.
 Hardware constructs: inter-connection wires, shared memories, register files, mailboxes, etc.
 In general, a communication model can be described as follows:
[Figure: Behavior B1 (Port P1) connected through interface I1, Channel C, and interface I2 to Behavior B2 (Port P2)]

Examples of Communication
 Mapping of communication design to two real mechanisms:
 Shared memory (direct implementation)
int x;                                      int y;
...                                         ...
M = x;          int M;                      y = M;
...                                         ...
 Channel (model implementation)
int x;          Channel C                   int y;
...             void send(int d) {...}      ...
C.send(x);      int receive(void) {...}     y = C.receive();
...                                         ...

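The channel mechanism can be sketched in software with a blocking queue between two threads. This is one possible mapping, not the slide's implementation: the `Channel` class, the one-slot queue, and the two behaviors `b1`/`b2` are assumptions for illustration.

```python
import queue
import threading


class Channel:
    """Minimal blocking channel with send() and receive()."""

    def __init__(self):
        self._q = queue.Queue(maxsize=1)  # one-slot buffer (assumed capacity)

    def send(self, d):
        self._q.put(d)      # blocks while the slot is full

    def receive(self):
        return self._q.get()  # blocks until a value is available


def demo():
    c = Channel()
    received = []

    def b1():               # sending behavior
        x = 42
        c.send(x)

    def b2():               # receiving behavior
        y = c.receive()
        received.append(y)

    t1 = threading.Thread(target=b1)
    t2 = threading.Thread(target=b2)
    t2.start()
    t1.start()
    t1.join()
    t2.join()
    return received[0]


print(demo())
```

Unlike the shared-memory version, neither behavior touches a common variable directly; the channel hides both the storage and the synchronization.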
Timing
 An abstract computational model does not need to specify cycle-accurate timing; however, you need to do so for a physical mapping since you must follow some “protocol” to pass data around, internally or externally
 A timing relation can be described by a 4-tuple T = (e1, e2, min, max), where e1 precedes e2 by at least min time units and at most max time units
 Timing delay: when the relation is used to describe a real component
 Timing constraint: when the relation is used to describe a component specification

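The 4-tuple translates directly into a check over event timestamps. In this sketch the class name, the field names `min_delay`/`max_delay`, and the event names `req`/`ack` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingRelation:
    """T = (e1, e2, min, max): e1 precedes e2 by min..max time units."""

    e1: str
    e2: str
    min_delay: int
    max_delay: int

    def satisfied_by(self, timestamps):
        """Check the relation against a mapping of event name -> time."""
        delta = timestamps[self.e2] - timestamps[self.e1]
        return self.min_delay <= delta <= self.max_delay


# Hypothetical constraint: 'ack' must follow 'req' by 2 to 5 time units.
t = TimingRelation("req", "ack", 2, 5)
print(t.satisfied_by({"req": 10, "ack": 13}))
print(t.satisfied_by({"req": 10, "ack": 16}))
```

Used as a constraint, the check validates a specification; fed with measured timestamps, the same tuple describes the delay of a real component.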
Synchronization
 In modeling concurrent processes, the description of synchronization among processes is very important
 Control-dependent synchronization: the control structure of the system prearranges the synchronization of processes in the system (e.g. via fork-join statements)
 Data-dependent synchronization: processes are synchronized by the output data or generated events of the other processes

Control-dependent Synchronization
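 A behavior X with a fork-join control structure:
behavior X
begin
  Q();
  fork A(); B(); C(); join;
  R();
end behavior X;
[Figure: Q runs first, then A, B, and C run concurrently and meet at the synchronization point before R runs]
 State description of control-dependent synchronization
[Figure: concurrent processes A (states A1, A2) and B (states B1, B2) with a transition labeled e]
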
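The fork-join structure of behavior X can be sketched with threads: starting the threads is the fork, and waiting for all of them is the join. The logging helper and the use of `threading` are assumptions; the slide's notation is language-neutral.

```python
import threading


def behavior_x():
    """Q, then A/B/C concurrently, then R after all three have finished."""
    log = []
    lock = threading.Lock()

    def run(name):
        with lock:              # protect the shared log
            log.append(name)

    run("Q")
    threads = [threading.Thread(target=run, args=(n,)) for n in ("A", "B", "C")]
    for t in threads:           # fork
        t.start()
    for t in threads:           # join: the synchronization point
        t.join()
    run("R")
    return log


print(behavior_x())
```

The order of A, B, and C in the log may vary between runs, but Q is always first and R is always last because of the join.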
Data-dependent Synchronization
 Data and events trigger state changes among processes:
[Figure: three pairs of processes A (states A1, A2) and B (states B1, B2): synchronization by common event (transitions on event e), synchronization by common variable (assignments x:=0 and x:=1 in A, condition x=1 in B), and synchronization by status detection (condition "entered A2")]

Co-design Methodology
 Synthesis flow of a sophisticated system:
[Figure: System spec -> Allocation, partitioning -> Partition model -> Scheduling -> Scheduling model -> Communication synthesis -> Communication model -> High-level synthesis / Compilation / Interface synthesis -> Implementation model -> Manufacturing]

System Specification
 A system must be specified at multiple levels:
[Figure: control-flow view: top-level behavior B0 containing behaviors B1 to B7; B6 and B4 interact through the variable shared and the event sync]
Atomic behaviors:
B1()          B3()          B7()
{             {             {
  stmt;         stmt;         stmt;
  ...           ...           ...
}             }             }

B6()                      B4()
{                         {
  int local;                int local;
  ...                       wait(sync);
  shared = local+1;         local = shared-1;
  signal(sync);             ...
}                         }

Allocation and Partitioning
 Allocation: selection of components (processors, memories, IPs, etc.) to implement the system
 Partitioning: defines the mapping between system behaviors and allocated processing elements (PEs)
 New control behaviors may need to be inserted into the system whenever child behaviors are assigned to different PEs from their parent behaviors
 Synchronization operations between PEs must also be inserted

System Model After A & P
[Figure: system model: Top holds shared, sync, B1_start, B1_done, B4_start, and B4_done; the behaviors of B0 are distributed over PE0 and PE1, with control behaviors B1_ctrl and B4_ctrl inserted in place of B1 and B4]
Atomic behaviors:
B1()                      B1_ctrl()
{                         {
  wait(B1_start);           signal(B1_start);
  ...                       wait(B1_done);
  signal(B1_done);          ...
}                         }

B3()          B7()          B4_ctrl()
{             {             {
  stmt;         stmt;         signal(B4_start);
  ...           ...           wait(B4_done);
}             }               ...
                            }

B4()                      B6()
{                         {
  wait(B4_start);           ...
  wait(sync);               shared = local+1;
  local = shared-1;         signal(sync);
  ...                     }
  signal(B4_done);
}

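The start/done handshake between an inserted control behavior and the behavior it replaces can be sketched with two events. Threads stand in for the two PEs here, and `threading.Event` stands in for `signal`/`wait`; both are assumptions made for illustration.

```python
import threading


def run_model():
    """B1_ctrl (on one PE) starts B1 (on another PE) and waits for it."""
    b1_start = threading.Event()
    b1_done = threading.Event()
    trace = []

    def b1():                        # the behavior that was moved away
        b1_start.wait()              # wait(B1_start)
        trace.append("B1 body")
        b1_done.set()                # signal(B1_done)

    def b1_ctrl():                   # inserted in B1's original position
        trace.append("B1_ctrl: start")
        b1_start.set()               # signal(B1_start)
        b1_done.wait()               # wait(B1_done)
        trace.append("B1_ctrl: done")

    remote = threading.Thread(target=b1)
    local = threading.Thread(target=b1_ctrl)
    remote.start()
    local.start()
    local.join()
    remote.join()
    return trace


print(run_model())
```

From the parent's point of view B1_ctrl behaves like the original B1: it does not finish until the remote behavior has completed.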
Allocation & Partition Issues (1/2)
 Key question for allocation:
 How can we reach desired performance with minimal components?
 Key considerations for partitioning:
 Behavior execution time on each PE
 Data transfer time between PEs
 Synchronization overhead between PEs
 Note that data transfer time includes:
 Flushing register/cache values to main memory
 Time required for a PE to set up a transaction
 Overhead of data transfers by bus packets, handshaking, etc.

Allocation & Partition Issues (2/2)
 There are no simple, general rules for the allocation/partition problem; solving it requires good domain knowledge
 For example, for a video encoder, should the partition between software and the accelerator logic be defined at frame-level or macroblock-level?
[Figure: video encoder block diagram: video data -> ME/MC -> subtract -> DCT -> Quantize -> VLC -> bitstream, with a reconstruction loop (Quantize^-1 -> IDCT -> add -> reference frame -> MC) and rate-control blocks (Calculate target bits, Apply RD model, Adapt RD model, Get QP) annotated as frame level and macroblock level]

Example of Partitioning
 Step 1: Divide functional specification into units
 Step 2: Determine proper level of parallelism
[Figure: sequential behaviors f3(f1(), f2()) vs. parallel behaviors f1() and f2() feeding f3()]
 Step 3: Map behaviors onto allocated components
[Figure: task graph (P1 and P2 feed P3 over edges d1 and d2) and hardware platform (M1 and M2)]

Execution vs. Communication Times
 Execution time table:
        M1    M2
P1      5     5
P2      5     6
P3      –     5
 Communication time table:
 Assume communication within PE is free
 Cost of communication from P1 to P3 is d1 = 2, cost of P2 to P3 communication is d2 = 4

First Design
 Allocate P1, P2 -> M1; P3 -> M2.
[Figure: timeline from 0 to 20 with rows M1 (P1, P2), M2 (P3), and network (d2); Time = 19]

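The total of 19 for this design can be reproduced with a small timing model. The model is an assumption, not taken from the slide: each process starts as soon as its machine is free and all of its inputs have arrived, transfers start when the producer finishes, and transfers between processes on the same machine are free. Other scheduling assumptions give other totals.

```python
# Execution times (process, machine) and transfer costs (producer, consumer)
# copied from the tables above; P3 cannot run on M1.
EXEC = {("P1", "M1"): 5, ("P1", "M2"): 5,
        ("P2", "M1"): 5, ("P2", "M2"): 6,
        ("P3", "M2"): 5}
COMM = {("P1", "P3"): 2, ("P2", "P3"): 4}


def finish_time(mapping, order):
    """Completion time for processes run in 'order' (producers first)."""
    machine_free = {}   # machine -> time at which it becomes idle
    done = {}           # process -> completion time
    for p in order:
        m = mapping[p]
        ready = 0
        for (src, dst), cost in COMM.items():
            if dst == p:
                # Transfers inside one machine are assumed to be free.
                arrival = done[src] + (cost if mapping[src] != m else 0)
                ready = max(ready, arrival)
        start = max(ready, machine_free.get(m, 0))
        done[p] = start + EXEC[(p, m)]
        machine_free[m] = done[p]
    return max(done.values())


first_design = {"P1": "M1", "P2": "M1", "P3": "M2"}
print(finish_time(first_design, ["P1", "P2", "P3"]))
```

In this model P1 and P2 run back to back on M1 (0 to 10), the d2 transfer occupies 10 to 14, and P3 runs on M2 from 14 to 19.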
Second Design
 Allocate P1 -> M1; P2, P3 -> M2:
[Figure: timeline from 0 to 20 with rows M1 (P1), M2 (P2, P3), and network (d1); Time = 18]

Static vs. Dynamic Partitioning
 It is a common practice to perform software-hardware partitioning of a system statically at design time
 With new generations of multimedia devices and applications, static partitioning of a system's behaviors does not reach optimal performance at runtime
 Execution time of a behavior on a PE may not be fixed at any given time
 Dynamic partitioning of system behaviors is possible if a behavior can run on multiple PEs

Scheduling
 When a number of partially-ordered hierarchical behaviors are allocated to a PE, the total order of invocation among all behaviors for this PE must be determined
 Coarse-level scheduling is done during the allocation & partitioning step
 Scheduling can be very difficult when the system is “interactive”
 Multimedia handsets are less stable than 2G handsets due to man-machine interaction

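Turning a partial order into one total order of invocation is a topological sort. The sketch below assumes Python 3.9+ (for `graphlib`); the precedence relation among B3, B6, and B7 is a made-up example, not the one from the slides.

```python
from graphlib import TopologicalSorter


def total_order(precedes):
    """precedes maps each behavior to the set of behaviors that must run before it."""
    return list(TopologicalSorter(precedes).static_order())


def respects(order, precedes):
    """Check that every predecessor appears before its successor."""
    position = {b: i for i, b in enumerate(order)}
    return all(
        position[pred] < position[b]
        for b, preds in precedes.items()
        for pred in preds
    )


# Hypothetical partial order on one PE: B6 before B7, and both before B3.
partial = {"B7": {"B6"}, "B3": {"B6", "B7"}}
order = total_order(partial)
print(order, respects(order, partial))
```

A partial order usually admits several valid total orders; a real scheduler would pick among them using execution times and deadlines, which this sketch ignores.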
System Model after Scheduling
[Figure: system model: Top holds shared, sync, B6_start, and B3_start; behaviors B1, B6, B7, B4, and B3 are ordered on PE0 and PE1 and linked by the events B6_start, sync, and B3_start]
Atomic behaviors:
B1()                      B3()                      B7()
{                         {                         {
  signal(B6_start);         wait(B3_start);           stmt;
  ...                       ...                       ...
}                         }                         }

B4()                      B6()
{                         {
  wait(sync);               wait(B6_start);
  local = shared-1;         ...
  ...                       shared = local+1;
  signal(B3_start);         signal(sync);
}                         }

Discussions
 Optimal HW/SW system co-design is determined by many factors, including
 Design cost
 Implementation cost (human cost & target cost)
 Performance
 Runtime resource consumption (memory, power, etc.)
 System robustness
 To achieve optimal design, good domain knowledge of the application is crucial
[Annotations on the factor list: “Easy to quantify!” and “Difficult to quantify!”]
