
HW/SW Co-design

National Chiao Tung University


Chun-Jen Tsai
05/17/2011
Many materials in the slides are from the book: J. Staunstrup and W. Wolf (Eds.), Hardware/Software Co-Design, Kluwer Academic, 1997
System Design Process

- The design process is the set of design tasks that transform a model into an architecture:
  - A model is a formal system consisting of objects and composition rules, and is used for describing a system's characteristics
  - An architecture defines system implementation by specifying the number and types of components as well as the connection between them

2/50
Generic Hardware Architecture (1/2)

- Any digital circuit can be divided into two parts: datapath and controller
  - Datapath
    - Performs data processing: e <- (a+b) * (c+d);
    - Computes functions concurrently
  - Controller
    - Enforces sequential execution of system behavior
    - Follows algorithms step by step:
        Step 1: c <- a + b;
        Step 2: if (c > 5) then
                  d <- a + b;
                else
                  d <- a - b;
                end
        Step 3: c <- c * d;
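As a rough illustration, the three controller steps above can be modelled in Python as a small state loop. The function name `run_controller` and the explicit `state` counter are assumptions made for this sketch; the slide only gives the step list.

```python
def run_controller(a, b):
    """Execute the three control steps in order, one step per controller state."""
    state = 1
    c = d = None
    while state <= 3:
        if state == 1:
            c = a + b                         # Step 1: c <- a + b
        elif state == 2:
            d = a + b if c > 5 else a - b     # Step 2: branch on c
        else:
            c = c * d                         # Step 3: c <- c * d
        state += 1                            # the controller advances one step per cycle
    return c, d
```

The datapath operations (add, subtract, multiply) are the same in every run; the controller only decides which one is active in each step.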
3/50
Generic Hardware Architecture (2/2)

- One can rewrite control into datapath and vice versa:
  - Control: if i1 = '0' then o1 <= a; else o1 <= b; end if;
  - Datapath: o1 <= ((i1 = '0') and a) or ((i1 = '1') and b);
- The data/control distinction is useful but not fundamental
- Generic hardware component:
  [Figure: combinational logic with input and output, plus a state register (D Q) whose value is fed back into the logic]
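A quick way to see that the control form and the datapath form describe the same function is to compare them exhaustively for one-bit signals. This is a minimal Python sketch; the function names are illustrative, and the signals are modelled as the integers 0 and 1.

```python
from itertools import product

def control_form(i1, a, b):
    """if i1 = '0' then o1 <= a; else o1 <= b; end if;"""
    return a if i1 == 0 else b

def datapath_form(i1, a, b):
    """o1 <= ((i1 = '0') and a) or ((i1 = '1') and b);"""
    return (int(i1 == 0) & a) | (int(i1 == 1) & b)

def forms_agree():
    """Compare both forms on all eight combinations of one-bit inputs."""
    return all(
        control_form(i1, a, b) == datapath_form(i1, a, b)
        for i1, a, b in product((0, 1), repeat=3)
    )
```

Calling `forms_agree()` checks every input combination, which is feasible here because there are only three one-bit inputs.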

4/50
HW/SW Distinction?

- In principle, there is really no such thing as “HW/SW Codesign,” since a “software” component can be regarded as a register-transfer machine processing some input data (program instruction codes):
  [Figure: a clocked chain of registers (D Q) separated by combinational logic blocks, ending at the output]

5/50
System Models

- Control-operation oriented model
  - Finite-State Machine (FSM)
- Data-processing oriented model
  - Dataflow Graph (DFG)
- Joint model
  - FSM with Datapath
  - Hierarchical Concurrent FSM
- Programming-based model
  - Program-State Machine

6/50
FSM for Controller Model

- The finite state machine (FSM) model consists of a set of states, a set of transitions between states, and a set of actions associated with these states or transitions
- FSM is used for modeling control systems
  [Figure: two state diagrams with transitions labeled "input (condition) / output": (a) input-based (Mealy-type FSM), (b) state-based (Moore-type FSM)]
7/50
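The difference between the two FSM types on the preceding slide can be sketched in Python. The example machine (it outputs 1 once two consecutive 1 inputs have been seen), the state names, and the function names are all assumptions chosen for illustration; they do not come from the slide.

```python
def mealy_run(inputs):
    """Mealy-type: the output is attached to the transition (state, input)."""
    table = {
        ("S0", 0): ("S0", 0),
        ("S0", 1): ("S1", 0),
        ("S1", 0): ("S0", 0),
        ("S1", 1): ("S1", 1),
    }
    state, outputs = "S0", []
    for x in inputs:
        state, y = table[(state, x)]
        outputs.append(y)
    return outputs

def moore_run(inputs):
    """Moore-type: the output is attached to the state that is entered."""
    next_state = {
        ("S0", 0): "S0", ("S0", 1): "S1",
        ("S1", 0): "S0", ("S1", 1): "S2",
        ("S2", 0): "S0", ("S2", 1): "S2",
    }
    output_of = {"S0": 0, "S1": 0, "S2": 1}
    state, outputs = "S0", []
    for x in inputs:
        state = next_state[(state, x)]
        outputs.append(output_of[state])
    return outputs
```

Both machines produce the same output sequence here, but the Moore version needs an extra state (`S2`) because its output can depend only on the current state.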


Dataflow Graph for Datapath Model

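- Data flow graph (DFG) is used for describing computational components in a system
  [Figure: DFG with inputs a and b, two "square" nodes, an "add" node, and a "sqrt" node producing sqrt(a^2 + b^2); the levels of the graph are marked t1 to t4]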
8/50
Finite-State Machine with Datapath

- FSMD models a system with mixed control and computation functions
  - FSMD contains a datapath driven by the signals generated from a controller
  [Figure: FSMD with start state S1; transitions are labeled with an input (condition), an output, and an action (data processing):
     (floor != requested) / floor = requested / move = requested - current
     (floor == requested) / x / move = 0]
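One way to read the transition labels above is as a single-state machine whose datapath computes the move distance. The Python sketch below assumes that `floor` and `current` in the labels refer to the same stored floor value; that reading, and the function name `fsmd_step`, are assumptions rather than something the slide states.

```python
def fsmd_step(current, requested):
    """One transition of the single-state FSMD; returns (new_current, move)."""
    if current != requested:
        move = requested - current    # datapath: signed distance to travel
        current = requested           # datapath: update the stored floor
    else:
        move = 0                      # already at the requested floor
    return current, move
```

The condition test plays the role of the controller, while the subtraction and the register update are the datapath actions it triggers.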

9/50
Hierarchical Concurrent FSM

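- HCFSM captures a more complex control system by further dividing each state into hierarchical and/or concurrent sub-states
  [Figure: state Y contains states A and D; A has sub-states B and C, D has sub-states E, F and G; transitions are labeled a(P)/c and b]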

10/50
Program-State Machine (PSM)

- PSM is a heterogeneous model that integrates an HCFSM with a programming language paradigm
  [Figure: state Y contains states A and D; A has sub-states B and C with transitions e1 and e2; a state is described by the following program]

    int A[20];
    int idx, max;
    max = 0;
    for (idx=0; idx<20; idx++)
    {
      if (A[idx] > max)
      {
        max = A[idx];
      }
    }

  Transition types: Transition-On-Completion, Transition-Immediately
11/50
System Architecture

- Various architectures are suitable for implementation of system models:
  - Hard-wired logic implementation
    - Controller
    - Datapath
    - FSMD
  - Software programming implementation
    - CISC
    - RISC
    - VLIW
  - Combined Co-design Architecture

12/50
Controller Architecture

- An FSM can be implemented using the controller architecture
  - Control steps are synchronized to clock cycles
  - A longer schedule means more states in the controller

13/50
Controllers and Scheduling

- For a functional model, you can design different execution schedules using different controllers:
    x <= a + b;
    y <= c + d;
  [Figure: one-state design: state s1 with a transition labeled "–/ x = a + b; y = c + d"]
  [Figure: two-state design: states s1 and s2 with transitions labeled "–/ x = a + b" and "–/ y = c + d"]
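The two schedules above can be compared with a small Python sketch that treats a schedule as a list of control steps, each holding the additions performed in that step. The tuple encoding, the names `ONE_STATE` and `TWO_STATE`, and the adder count are assumptions made for illustration.

```python
def run_schedule(schedule, env):
    """Run a schedule; each step is a list of (dest, src1, src2) additions.

    Returns the final variable values, the number of control steps (cycles),
    and the largest number of additions in any one step (adders needed).
    """
    env = dict(env)
    adders = 0
    for step in schedule:
        # All additions in a step read the values present before the step.
        results = {dest: env[s1] + env[s2] for dest, s1, s2 in step}
        env.update(results)
        adders = max(adders, len(step))
    return env, len(schedule), adders

ONE_STATE = [[("x", "a", "b"), ("y", "c", "d")]]    # both sums in one step
TWO_STATE = [[("x", "a", "b")], [("y", "c", "d")]]  # one sum per step
```

Both schedules compute the same `x` and `y`; the one-state design finishes in one cycle but needs two adders, while the two-state design takes two cycles and can reuse one adder.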

14/50
Distributed Controllers

- You can use a single monolithic controller, or several smaller synchronized controllers for your design
  [Figure: two distributed controllers vs. one centralized controller]


15/50
Synchronization between FSMs

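- To pass values between two machines, you must schedule the output of one machine to coincide with the input expected by the other:
  [Figure: machine M1 (states s1, s2, s3) drives signal w, which is read by machine M2 (states t1, t2). M1 transitions: "i1 = 0 / –", "i1 = 1 / –", "– / w = 0", "– / w = 1". M2 transitions: "w = 0 / –", "w = 1 / –"]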

16/50
Datapath Architecture

- Datapath is a direct implementation of DFGs
- Example: two implementations of an FIR filter:
  [Figure: 3-stage pipeline datapath]
  [Figure: 4-stage pipeline datapath]

17/50
Data Operators

- Arithmetic operations are easy to spot in hardware description languages:
  - x <= a + b;
- Multiplexers are implied by conditionals, or arise from sharing adders, etc.

  HDL code:
    if x = '0' then
      reg1 <= a;
    else
      reg1 <= b;
    end if;

  [Figure: register-transfer view: a multiplexer with inputs a (0) and b (1), select input sel driven by x, feeding register reg1 (D Q)]
18/50
Hardware Resource Sharing

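- Fewer adders, more cycles:
  [Figure: one adder (+), two multiplexers (mux) and a register (D Q), with inputs a, b and c]
- More adders, fewer cycles:
  [Figure: two adders (+) and one multiplexer (mux), with inputs a, b and c]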
19/50
Logic Pipeline

- Dividing a datapath into a pipeline provides higher utilization of logic
  - For example, the snapshot of a pipelined datapath at time t is as follows:
    [Figure: combinational logic 1, 2 and 3 separated by registers (D Q); each register takes its input at time t, and its output becomes effective at time t+1]
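The timing in this snapshot can be sketched in Python as a cycle-by-cycle simulation of three logic blocks separated by two registers. The function name `simulate_pipeline` and the example stage functions used to exercise it are assumptions; any three one-argument functions would do.

```python
def simulate_pipeline(stages, inputs):
    """Clock a pipeline that has a register between consecutive stages.

    stages: list of one-argument functions (the combinational logic blocks)
    inputs: one value per clock cycle (None is treated as an empty slot)
    Returns the last stage's output per cycle (None while the pipeline fills).
    """
    regs = [None] * (len(stages) - 1)    # pipeline registers, initially empty
    outputs = []
    for x in list(inputs) + [None] * len(regs):   # extra cycles drain the pipeline
        # Each stage sees the register value latched at the previous clock edge.
        stage_in = [x] + regs
        results = [f(v) if v is not None else None
                   for f, v in zip(stages, stage_in)]
        outputs.append(results[-1])
        regs = results[:-1]              # clock edge: values become effective at t+1
    return outputs
```

Once the pipeline is full, all three logic blocks work on different inputs in the same cycle, which is where the higher utilization comes from.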


20/50
Pipelines with Control

- A pipeline may do different things at different times
- Example: CPU architecture
  [Figure: clocked pipeline in which data passes through registers (D Q) and combinational logic blocks to the output; control inputs drive the logic blocks]

21/50
FSMD Architecture

- FSMD architecture combines a controller with a datapath

22/50
CISC Architecture

- CISC is a program execution logic which is itself program-controlled

23/50
RISC Architecture

- RISC is a program execution logic with hardwired control

24/50
VLIW Architecture

- Very Long Instruction Word (VLIW) architecture issues multiple independent instructions in each cycle
  - It is very difficult to develop optimizing compilers for such an architecture, so hand-coding in assembly is still a necessity to program such devices
  [Figure: a four-way VLIW architecture with memory and a 16-port register file]


25/50
Co-Design Architecture

- Co-design architecture interconnects several concurrent processing elements (PEs):
  [Figure: interconnected components GPP 1, ASIC 1 and GPP 2, Local Memory 1 to 3, and Global Memory 1 and 2]
  GPP: General purpose processor (CISC or RISC)
  ASIC: Application specific IC (custom logic)

26/50
Co-Design Descriptions

- To map models of system behaviors into an architecture, a language that can describe certain system features is required
- The system must be defined on different levels of abstraction at each design step; general characteristics to be described are:
  - Hierarchy
  - Concurrency
  - Communication/Exception handling
  - Timing/Synchronization
  - …

27/50
Hierarchy Description

- Structural hierarchy:
  [Figure: a Processor and a Memory connected by a data bus and control lines]
- Behavioral hierarchy:
  - Behavior P is composed of behavior Q and behavior R …

28/50
Behavioral Decomposition

- After hierarchical decomposition, each bottom-level sub-behavior can further be classified as sequential, concurrent, or pipelined

29/50
Concurrency Classifications

- Data-driven concurrency:
  - Operation execution depends only upon the availability of data; the degree of concurrency is limited by data dependencies
- Pipelined concurrency:
  - An extension to data-driven concurrency by dividing operations into groups (stages), which operate on different data sets concurrently
- Control-driven concurrency:
  - Also referred to as thread-level parallelism; an explicit construct is used to specify concurrent execution of multiple control tasks

30/50
Communication

- When two components interact with each other in a system, a mechanism of communication must be defined
  - Software constructs: function calls, global variables, messages sent through a network, etc.
  - Hardware constructs: interconnection wires, shared memories, register files, mailboxes, etc.
- In general, a communication model can be described as follows:
  [Figure: Behavior B1 with Port P1 and Behavior B2 with Port P2, connected through interfaces I1 and I2 to Channel C]

31/50
Examples of Communication

- Mapping of communication design to two real mechanisms:
  - Shared memory (direct implementation)
      Shared variable:  int M;
      Behavior B1:      int x; ... M = x; ...
      Behavior B2:      int y; ... y = M; ...
  - Channel (model implementation)
      Channel C:        void send(int d) {...}   int receive(void) {...}
      Behavior B1:      int x; ... C.send(x); ...
      Behavior B2:      int y; ... y = C.receive(); ...
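The channel variant can be sketched in Python with two threads standing in for behaviors B1 and B2. The `Channel` class below is a hypothetical wrapper around the standard-library `queue.Queue`; its one-slot capacity and the value 42 are assumptions made for the example, not details from the slide.

```python
import queue
import threading

class Channel:
    """Blocking channel with the send/receive interface shown on the slide."""

    def __init__(self):
        self._q = queue.Queue(maxsize=1)   # one-slot buffer (an assumption)

    def send(self, d):
        self._q.put(d)                     # blocks while the slot is occupied

    def receive(self):
        return self._q.get()               # blocks until a value is available

def run_behaviors(x=42):
    """Run B1 (sender) and B2 (receiver) concurrently; return B2's y."""
    c = Channel()
    result = {}

    def b1():
        c.send(x)

    def b2():
        result["y"] = c.receive()

    threads = [threading.Thread(target=b2), threading.Thread(target=b1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["y"]
```

Unlike the shared-variable version, the receiver cannot read a stale value: `receive` blocks until `send` has delivered one, so the channel carries the synchronization as well as the data.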
32/50
Timing

- An abstract computational model does not need to specify cycle-accurate timing; however, you need to do so for a physical mapping since you must follow some "protocol" to pass data around, internally or externally
- A timing relation can be described by a 4-tuple T = (e1, e2, min, max), where e1 precedes e2 by at least min time units and at most max time units
  - Timing delay: when the relation is used to describe a real component
  - Timing constraint: when the relation is used to describe a component specification
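A timing relation of this form is easy to check against recorded event times. In the Python sketch below, the class name `TimingRelation`, the field names `t_min` and `t_max`, and the event names `req` and `ack` are illustrative assumptions.

```python
from typing import NamedTuple

class TimingRelation(NamedTuple):
    """T = (e1, e2, min, max): e1 precedes e2 by min..max time units."""
    e1: str
    e2: str
    t_min: float
    t_max: float

def satisfies(rel, times):
    """Return True if the event times respect the relation.

    times: mapping from event name to the time at which it occurred.
    """
    gap = times[rel.e2] - times[rel.e1]
    return rel.t_min <= gap <= rel.t_max
```

Used as a constraint, `satisfies` tells whether a measured trace meets the specification; used as a delay, the same tuple describes the range a real component is known to exhibit.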
33/50
Synchronization

- In modeling concurrent processes, the description of synchronization among processes is very important
  - Control-dependent synchronization: the control structure of the system prearranges the synchronization of processes in the system (e.g. via fork-join statements)
  - Data-dependent synchronization: processes are synchronized by the output data or generated events of each other

34/50
Control-dependent Synchronization

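- A behavior X with a fork-join control structure:

    behavior X
    begin
      Q();
      fork A(); B(); C(); join;
      R();
    end behavior X;

  [Figure: Q forks into A, B and C, which join at a synchronization point before R]
- State description of control-dependent synchronization:
  [Figure: state diagram with behaviors A (sub-states A1, A2) and B (sub-states B1, B2), linked by event e]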
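The fork-join structure of behavior X can be sketched with Python threads. The function name `behavior_x` and the shared log used to record execution order are assumptions made for this illustration.

```python
import threading

def behavior_x():
    """Run Q, then A, B and C concurrently, then R; return the execution log."""
    log = []
    lock = threading.Lock()

    def step(name):
        with lock:                  # protect the shared log
            log.append(name)

    step("Q")
    threads = [threading.Thread(target=step, args=(n,)) for n in ("A", "B", "C")]
    for t in threads:
        t.start()                   # fork
    for t in threads:
        t.join()                    # join: the synchronization point
    step("R")
    return log
```

The order of A, B and C in the log may vary between runs, but Q is always first and R is always last, which is exactly what the control structure guarantees.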

35/50
Data-dependent Synchronization

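- Data and events trigger state changes among processes:
  [Figure: three pairs of concurrent behaviors A (states A1, A2) and B (states B1, B2):
     synchronization by common event: transitions triggered by event e
     synchronization by common variable: assignments x:=0 and x:=1, and a transition guarded by x=1
     synchronization by status detection: a transition triggered by "entered A2"]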

36/50
Co-design Methodology

- Synthesis flow of a sophisticated system:
  [Figure: System Spec -> Allocation, partitioning -> Partition model -> Scheduling -> Scheduling model -> Communication synthesis -> Communication model -> High-level synthesis / Compilation / Interface synthesis -> implementation model -> manufacturing]
37/50
System Specification

- A system must be specified in multiple levels:
  [Figure: control-flow view of top-level behavior B0, composed of behaviors B1 to B7, with variables shared and sync]

  Atomic behaviors:

    B1() { stmt; ... }
    B3() { stmt; ... }
    B7() { stmt; ... }

    B6()
    {
      int local;
      ...
      shared = local+1;
      signal(sync);
    }

    B4()
    {
      int local;
      wait(sync);
      local = shared-1;
      ...
    }
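The `wait(sync)` / `signal(sync)` pair between B6 and B4 can be sketched in Python with `threading.Event`. The sketch assumes that B4 reads `shared` after the signal and computes `local = shared - 1`, as in the post-partitioning model later in the deck; the function name `run_b6_b4` and the starting value are illustrative.

```python
import threading

def run_b6_b4(b6_local=10):
    """Run B6 and B4 concurrently; B4 blocks until B6 signals sync."""
    sync = threading.Event()
    state = {"shared": None, "b4_local": None}

    def b6():
        local = b6_local
        state["shared"] = local + 1                 # shared = local+1;
        sync.set()                                  # signal(sync);

    def b4():
        sync.wait()                                 # wait(sync);
        state["b4_local"] = state["shared"] - 1     # local = shared-1;

    threads = [threading.Thread(target=b4), threading.Thread(target=b6)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

Even though B4 is started first, it cannot read `shared` before B6 has written it, because the event is set only after the assignment.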

38/50
Allocation and Partitioning

- Allocation: selection of components (processors, memories, IPs, etc.) to implement the system
- Partitioning: defines the mapping between system behaviors and allocated processing elements (PEs)
  - New control behaviors may need to be inserted into the system whenever child behaviors are assigned to PEs different from those of their parent behaviors
  - Synchronization operations between PEs must also be inserted

39/50
System Model After A & P

[Figure: system model: Top contains PE0 and PE1 plus variables shared, sync, B1_start, B1_done, B4_start and B4_done; behaviors B0, B1_ctrl, B1, B2, B5, B6, B7, B4_ctrl, B4 and B3 are distributed over the two PEs and linked by the start/done signals]

Atomic behaviors:

    B1()
    {
      wait(B1_start);
      ...
      signal(B1_done);
    }

    B1_ctrl()
    {
      signal(B1_start);
      wait(B1_done);
      ...
    }

    B4_ctrl()
    {
      signal(B4_start);
      wait(B4_done);
      ...
    }

    B3() { stmt; ... }
    B7() { stmt; ... }

    B4()
    {
      int local;
      wait(B4_start);
      wait(sync);
      local = shared-1;
      ...
      signal(B4_done);
    }

    B6()
    {
      int local;
      ...
      shared = local+1;
      signal(sync);
    }

40/50
Allocation & Partition Issues (1/2)

- Key question for allocation:
  - How can we reach the desired performance with minimal components?
- Key considerations for partitioning:
  - Behavior execution time on each PE
  - Data transfer time between PEs
  - Synchronization overhead between PEs
- Note that data transfer time includes:
  - Flushing register/cache values to main memory
  - Time required for a PE to set up a transaction
  - Overhead of data transfers by bus packets, handshaking, etc.

41/50
Allocation & Partition Issues (2/2)

- There are no simple, general rules for the allocation/partition problem; it requires good domain knowledge
  - For example, for a video encoder, should the partition between software and the accelerator logic be defined at frame level or macroblock level?
  [Figure: video encoder block diagram. Macroblock level: Video data, ME/MC, subtraction (-), DCT, Quantize, VLC to Bitstream, Quantize^-1, IDCT, addition (+), Reference frame, MC. Frame level: Calculate Target bits, Apply RD Model, Get QP, Adapt RD Model]

42/50
Example of Partitioning

- Step 1: Divide functional specification into units
- Step 2: Determine proper level of parallelism
    Sequential behaviors: f3(f1(), f2())
    vs.
    Parallel behaviors: f1() and f2() side by side, both feeding f3()
- Step 3: Map behaviors onto allocated components
  [Figure: task graph in which P1 and P2 feed P3 through edges d1 and d2; hardware platform with M1 and M2]


43/50
Execution vs. Communication Times

- Execution time table:

         M1   M2
    P1    5    5
    P2    5    6
    P3    –    5

- Communication time table:
  - Assume communication within a PE is free
  - Cost of communication from P1 to P3 is d1 = 2, cost of P2 to P3 communication is d2 = 4

44/50
First Design

- Allocate P1, P2 -> M1; P3 -> M2.
  [Figure: timeline with the time axis marked 5, 10, 15, 20; rows M1 (P1, P2), M2 (P3) and network (d2). Time = 19]
45/50
Second Design

- Allocate P1 -> M1; P2, P3 -> M2:
  [Figure: timeline with the time axis marked 5, 10, 15, 20; rows M1 (P1), M2 (P2, P3) and network (d1). Time = 18]

46/50
Static vs. Dynamic Partitioning

- It is a common practice to perform software-hardware partitioning of a system statically at design time
- With new generations of multimedia devices and applications, static partitioning of a system's behaviors does not reach optimal performance at runtime
  - Execution time of a behavior on a PE may not be fixed at any given time
- Dynamic partitioning of system behaviors is possible if a behavior can run on multiple PEs

47/50
Scheduling

- When a number of partially ordered hierarchical behaviors are allocated to a PE, the total order of invocation among all behaviors for this PE must be determined
  - Coarse-level scheduling is done during the allocation & partitioning step
- Scheduling can be very difficult when the system is "interactive"
  - Multimedia handsets are less stable than 2G handsets due to man-machine interaction
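Turning a partial order into a total order of invocation is a topological sort. The Python sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later); the function name `total_order` and the precedence constraints used to exercise it are hypothetical and only illustrate the idea.

```python
from graphlib import TopologicalSorter

def total_order(must_run_before):
    """Return one total order that is consistent with the partial order.

    must_run_before: maps each behavior to the set of behaviors that
    have to complete before it may start.
    """
    return list(TopologicalSorter(must_run_before).static_order())
```

For a partial order such as `{"B6": {"B1"}, "B4": {"B6"}, "B3": {"B4"}, "B7": {"B1"}}`, several total orders are valid (B7 may run anywhere after B1), so a real scheduler would pick among them using execution-time estimates rather than taking the first one found.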

48/50
System Model after Scheduling

[Figure: system model: Top contains PE0 and PE1 plus variables shared, sync, B6_start and B3_start; behaviors B1, B6, B7, B4 and B3 are linked by B6_start, sync and B3_start]

Atomic behaviors:

    B1()
    {
      signal(B6_start);
      ...
    }

    B3()
    {
      wait(B3_start);
      ...
    }

    B7() { stmt; ... }

    B4()
    {
      int local;
      wait(sync);
      local = shared-1;
      ...
      signal(B3_start);
    }

    B6()
    {
      int local;
      wait(B6_start);
      ...
      shared = local+1;
      signal(sync);
    }

49/50
Discussions

- Optimal HW/SW system co-design is determined by many factors, including
  - Design cost
  - Implementation cost (human cost & target cost)
  - Performance
  - Runtime resource consumption (memory, power, etc.)
  - System robustness
  [Callouts on the list: "Easy to quantify!" at the top and "Difficult to quantify!" at the bottom]
- To achieve optimal design, good domain knowledge of the application is crucial

50/50
