
Building Bigger Systems:


Hardware Threads
Lecture L06

18-545 — Advanced Digital Design

with credit to G. Larson


ECE Department

Many elements © Don Thomas, 2014, used with permission


Today
• We build on our knowledge of FSMs
• Want to make more complex systems
- Stuff that "Computes"
• How should they be organized?
• How should we think about them?

• This material is often not taught in a Digital Design class

Sequential Systems
• Every synchronous sequential design can be
classified as a “finite-state” machine
- all that really means is that it has a finite number of flip-flop/
registers in the design

• Intel Itanium2 has more than 2^27 state bits and more
than 2^(2^27) distinguishable states
- No. There is not a gigantic state-transition diagram somewhere
- Yes. There are control FSMs (the way you understand FSMs now)
✦ may be many tens of states in the largest ones
✦ more complicated control exists as many cooperating FSMs
- The majority of the “logic” is in the datapath
✦ very stylistic usage of state and logic
✦ a very different way of design

Motivation: serial adder example
• Add two N-bit numbers in serial
- unsigned numbers appear (A_i, B_i)
- 1 bit per cycle (LSB first) starting after reset
- assert “done”, “Cout” and “sum” when the computation is finished

[Figure: serial adder block with inputs Reset, Ai, Bi and Clk, and outputs done, Cout and sum (N bits).]

How many different states does this FSM need?


If you don’t know any better
[Figure: state-transition diagram that starts at "init" after reset and branches on every input pair (Ai, Bi) into states named for the bits seen so far ("0+0", "0+1", "1+0", "1+1", then "00+00", "00+01", ..., then "000+100", "000+101", ...).]

This could work, but it is obviously not recommended


A better idea: datapath + control
• Much of the functionality can be built with pieces
you already know how to use (Full Adder, FFs)

[Figure: full adder with inputs ai, bi and ci and outputs s and co; s drives sum.]

• Still need something to “control the system”
• Use an FSM for control
- at reset, the carry flip-flop gets cleared (perhaps also the sum bits)
- you need to “count” from 0 to N-1 after reset
- when the FSM is at N-1, you need to signal “done”
- only N states and log2 N state bits!!

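The datapath-plus-counter idea can be sketched as a behavioural model in Python. This is only an illustration under assumptions: the names `serial_add`, `to_bits` and `from_bits` are hypothetical, the loop body stands in for one clock cycle, and the list `sum_bits` stands in for whatever register collects the sum.

```python
def to_bits(value, n):
    """Split an unsigned integer into n bits, LSB first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Rebuild an unsigned integer from bits given LSB first."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(a_bits, b_bits):
    """Add two N-bit numbers that arrive one bit pair per cycle, LSB first."""
    n = len(a_bits)
    carry = 0        # carry flip-flop, cleared at reset
    count = 0        # control FSM state, counts 0 .. N-1
    sum_bits = []    # sum bits collected so far, LSB first
    done = False
    for a, b in zip(a_bits, b_bits):
        # Full adder: the same equations as the combinational slide.
        s = a ^ b ^ carry
        carry = (a & b) | (a & carry) | (b & carry)
        sum_bits.append(s)
        done = count == n - 1   # "done" is signalled in the last count state
        count += 1
    return sum_bits, carry, done


bits, cout, done = serial_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(bits), cout, done)
```

The control part is just the counter: it needs N states no matter what data arrives, which is the point of separating it from the full adder and carry flip-flop.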
These are called “datapaths”
• Datapaths
- Data is stored in the registers…
- Flows along the lines…
- Is transformed or selected by the combinational components…
✦ e.g., adder, mux
- Loaded into another register
• What can this datapath do?
- increment: sum = sum + 1
✦ ld_L=0, cl_L=1, sel=1
- add_input: sum = sum + input
✦ ld_L=0, cl_L=1, sel=0
- synchronous clear of sum, cl_L=0
• Inputs sel, ld_L, cl_L are called control points

[Figure: a mux (control sel) chooses between Input and the constant 1; an adder sums the mux output with the Sum register's output Q and drives the register's D input; the Sum register has ld_L, cl_L and clock inputs.]

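As a rough illustration of how the three control points choose a register transfer, here is a small Python model of one clock edge. It is a sketch under assumptions: the function name `datapath_step` is hypothetical, the width is taken as 16 bits, and clear is assumed to take priority over load, which the slide does not state.

```python
WIDTH = 16                    # assumed register width
MASK = (1 << WIDTH) - 1


def datapath_step(sum_reg, data_in, ld_L, cl_L, sel):
    """Return the Sum register's value after one clock edge.

    ld_L and cl_L are active low; sel=1 picks the constant 1, sel=0 picks data_in.
    """
    if cl_L == 0:             # synchronous clear (assumed to win over load)
        return 0
    if ld_L == 0:             # load the adder output
        operand = 1 if sel == 1 else data_in
        return (sum_reg + operand) & MASK
    return sum_reg            # neither asserted: hold


print(datapath_step(5, 10, ld_L=0, cl_L=1, sel=1))  # increment
print(datapath_step(5, 10, ld_L=0, cl_L=1, sel=0))  # add_input
print(datapath_step(5, 10, ld_L=1, cl_L=0, sel=0))  # clear
```

The datapath itself has no notion of order; something else has to drive sel, ld_L and cl_L cycle by cycle.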
Computations
• In combinational logic, (think Boolean algebra here)
- sum = a ⊕ b ⊕ cin, and
- cout = a•b + a•cin + b•cin
- … all the variables are 1-bit values
• In register-transfer level systems, we say
- a = b + c (addition!)
✦ don’t think Boolean algebra here, think C or SystemVerilog
- … oh, and by the way:
✦ they’re all 37-bit numbers, the add occurs in FSM state 17
✦ the result is stored in a register containing a bunch of FFs
collectively called “a”
• The difference: the level of abstraction
- We’re describing computations, the order and FSM state they occur
in, and what components (registers, ALUs) they use
- It looks a little like programming

What can this circuit do?
• Three supported register transfers (RTs)
- sum = sum + inputA
- sum = 0
- sum = sum
• What is “sum” in this case?
- The state of a computation
• With proper sequencing we can do a computation:
sum = 0;
for (i = 0; i < 2; i++)
    sum += InputA;
• What provides “the proper sequencing”?
- A Finite State Machine

[Figure: 16-bit datapath in which an adder sums InputA with the register output Q and drives the register's D input; the register has ld_L, cl_L and clock inputs.]

FSM sequencing
• Control: a three-state FSM that enters state A on reset_L, then steps to B, then C
- A: ld_L = 1, cl_L = 0
- B: ld_L = 0, cl_L = 1
- C: ld_L = 0, cl_L = 1
• Datapath: the 16-bit adder and register (InputA, ld_L, cl_L, clock)

sum = 0;
for (i = 0; i < 2; i++)
    sum += InputA;

It clears Q to zero and then adds InputA in each successive state

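The sequencing can be traced with a short Python model. This is a sketch, not the course's code: `CONTROL`, `run` and the starting value `initial_sum` are hypothetical names, and the C-to-A transition is taken from the looping version of the FSM.

```python
# state: (next state, ld_L, cl_L), outputs depend only on the state
CONTROL = {
    "A": ("B", 1, 0),   # clear the register
    "B": ("C", 0, 1),   # load sum + InputA
    "C": ("A", 0, 1),   # load sum + InputA
}


def run(input_a, cycles, width=16, initial_sum=0xBEEF):
    """Return a list of (state, register value after that state's clock edge)."""
    mask = (1 << width) - 1
    state, total = "A", initial_sum   # register contents are arbitrary at reset
    trace = []
    for _ in range(cycles):
        next_state, ld_L, cl_L = CONTROL[state]
        if cl_L == 0:
            total = 0
        elif ld_L == 0:
            total = (total + input_a) & mask
        trace.append((state, total))
        state = next_state
    return trace


print(run(7, 3))
```

With InputA held at 7, the register holds 0 after state A, 7 after B and 14 after C, matching the two loop iterations of the software snippet.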
Tracing the state changes
• State A: ld_L=1, cl_L=0 (sum = 0)
• State B: ld_L=0, cl_L=1 (sum += inputA)
• State C: ld_L=0, cl_L=1 (sum += inputA)

[Figure: the three-state FSM drawn as one D flip-flop per state (A, B, C) sharing Clk and reset_L, next to the 16-bit adder and register datapath it controls through ld_L and cl_L.]

Generalize on what we just did
• FSM-D — A finite state machine with a datapath
- The finite state machine is what we’ve been studying
- A datapath is combinational logic and registers that can do
computation (sometimes spelled data-path, or data path)
- What senses and controls the computation?
✦ The FSM

• FSM-D is often called a “Hardware Thread”


[Figure: an FSM block and a Datapath block side by side, with inputs, outputs, clock and reset.]

What Does This Look Like in SystemVerilog?
module top
  #(parameter W = 16)
  (input logic [W-1:0] inputA,
   input logic clock, rst_L);

  logic cl_L, ld_L;
  logic [W-1:0] sum, addOut;

  // The adder's Cout port is left unconnected ("no connection")
  adder #(W) a1 (inputA, sum, 1'b0, addOut, );
  regLoad #(W) r1 (addOut, ld_L, cl_L, clock, rst_L, sum);

  fsm c1 (clock, rst_L, ld_L, cl_L);

endmodule: top

The module definitions:

module regLoad (d, ld_L, cl_L, clk, reset_L, q);
module adder (A, B, Cin, Sum, Cout);

How about the FSM Module?
module fsm
  (input logic clk, rst_L,
   output logic ld_L, cl_L);

  enum logic [2:0] {A=3'b100, B=3'b010, C=3'b001} ns, cs;

  always_comb
    unique case (cs)
      A: begin // load zero
        ns = B;
        cl_L = 0; ld_L = 1;
      end
      B: begin // add input
        ns = C;
        cl_L = 1; ld_L = 0;
      end
      C: begin // add input
        ns = A;
        cl_L = 1; ld_L = 0;
      end
    endcase

  always_ff @(posedge clk, negedge rst_L)
    if (~rst_L) cs <= A;
    else cs <= ns;

endmodule: fsm

Termination
• Our solution doesn't exactly match our specification
- i.e., the code snippet
sum = 0;
for (i = 0; i < 2; i++)
    sum += InputA;
• When complete, our FSM loops to do it again
• If we care, then go to a "finish" state and stay there
- Computes eternally: A (ld_L = 1, cl_L = 0), B (ld_L = 0, cl_L = 1), C (ld_L = 0, cl_L = 1), then back to A
- Computes once, holds answer eternally: A, B and C as above, then Stop (ld_L = 1, cl_L = 1), which loops to itself

A Thorough Example
Ones Counter
• Problem statement
- When the d_in_ready signal is asserted, read the 30-bit input word (d_in), count the number of bits in it that are set to one, make this 5-bit number available at the d_out output, assert the d_out_ready signal, and wait for the next d_in_ready signal
• What components do we need?
- To store the input word? 30-bit shift register
- To count the number of ones? 5-bit up-counter
- To count the number of bits examined? 5-bit up-counter
- To determine when we are done? Comparator (count to 30)

[Figure: Ones Counter block containing an FSM and a Datapath, with inputs d_in_ready, d_in, clock and reset, and outputs d_out_ready and d_out.]

Example: Datapath Components
BTW, this is but one approach to this datapath problem
• Shift Register (30-bit input D; controls load_L, shift_L; output lowbit)
- We’ll load this register when the d_in_ready signal is asserted. Then we’ll shift it right 30 times, each time looking at the low-order output bit (bit 0)
• Shift Count Register (5 bits; controls clr_L, inc_L)
- This register will count from zero to 30
• Comparator (input A from the shift count, input B = 5'd30; output eq drives done)
- The comparator will tell us when we’re done
• Ones Count Register (5 bits; controls clr_L, inc_L)
- This register will count the number of 1s we see on the low-order bit of the shift register

Start Piecing the System Together
• Datapath inputs and outputs

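[Figure: the datapath. d_in (30 bits) feeds the Shift Register (load_L, shift_L), whose output is lowbit; the Shift Count Register (clr_L, inc_L) feeds input A of a comparator whose input B is 5'd30 and whose output is done; the Ones Count Register (clr_L, inc_L) drives d_out (5 bits). The FSM has inputs d_in_ready, clock and reset and output d_out_ready.]
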
Start Piecing the System Together
• Datapath control points
- Inputs to the datapath used by FSM to control the datapath
[Figure: the same datapath with its control points marked: load_L and shift_L on the Shift Register, and clr_L and inc_L on each of the two counters.]

Start Piecing the System Together
• Datapath status points
- Values in datapath used by the FSM on state transitions
[Figure: the same datapath with its status points marked: lowbit and done.]

Start Piecing the System Together
• Hook up the rest of the inputs and outputs

[Figure: the same datapath with d_in_ready, clock and reset wired to the FSM, d_out_ready driven by the FSM, and d_out driven by the Ones Count Register.]

The FSM — state by state
• Reset enters state A
• Arc A to A: ~d_in_ready
• Arc A to B: d_in_ready / Cclr_L, Sload_L, Oclr_L
- When we get to state B, what will be the values in the registers?

Note: we’re only showing the output signals asserted in each state

[Figure: the two-state diagram beside the datapath, whose control points are now named per register: Cclr_L and Cinc_L (Shift Count Register, output SC), Sload_L and Sshift_L (Shift Register, output lowbit), Oclr_L and Oinc_L (Ones Count Register).]

FSM — arc by arc
• New arc, B to B: lowBit & (~done) / Oinc_L, Cinc_L, Sshift_L
- If the low bit is 1, and the shift count is not 30, increment the counters and shift

FSM — arc by arc
• New arc, B to B: ~lowBit & (~done) / Cinc_L, Sshift_L
- If the low bit is 0 and the shift count is not 30, increment the shift counter and shift. Don’t enable the Ones Count Register

And a final arc
• Final arc, B to A: done / D_out_ready
- When the shift count is 30, signal D_out_ready
• The complete FSM
- Reset enters A
- A to A: ~d_in_ready
- A to B: d_in_ready / Cclr_L, Sload_L, Oclr_L
- B to B: lowBit & (~done) / Oinc_L, Cinc_L, Sshift_L
- B to B: ~lowBit & (~done) / Cinc_L, Sshift_L
- B to A: done / D_out_ready

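To check the arcs against the problem statement, the FSM and datapath can be stepped together in a small Python model. This is a sketch under assumptions: `ones_counter` is a hypothetical name, d_in_ready is assumed to be asserted in the first cycle only, and each loop iteration stands for one clock cycle in which the FSM sees the current status points and the registers update at the edge.

```python
def ones_counter(d_in, width=30):
    """Return (ones count, clock cycles used) for one input word."""
    mask = (1 << width) - 1
    state = "A"
    shift_reg = shift_count = ones_count = 0
    d_in_ready = True            # assumed: asserted during the first cycle only
    cycles = 0
    while True:
        cycles += 1
        # Status points seen by the FSM during this cycle.
        low_bit = shift_reg & 1
        done = shift_count == width
        if state == "A":
            if d_in_ready:
                # Cclr_L, Sload_L, Oclr_L asserted.
                shift_reg, shift_count, ones_count = d_in & mask, 0, 0
                state = "B"
            d_in_ready = False
        elif done:
            # D_out_ready asserted; the FSM returns to A.
            return ones_count, cycles
        else:
            # Cinc_L and Sshift_L asserted; Oinc_L only when lowBit is 1.
            ones_count += low_bit
            shift_count += 1
            shift_reg >>= 1


print(ones_counter(0b1011))
```

In this model every word takes the same number of cycles: one to load, 30 to shift, and one in which done is seen and D_out_ready is signalled.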
Specify the Main Module
module OnesCount
  #(parameter w = 30)
  (input logic d_in_ready,
   input logic clock, reset,
   output logic d_out_ready,
   input logic [w-1:0] d_in,
   output logic [$clog2(w)-1:0] d_out);
  // ceiling of log2 of w
  // instantiate FSM and Datapath components here

endmodule: OnesCount

$clog2(i) is a system function (indicated by the “$”) that calculates the ceiling of log2(i)

[Figure: Ones Counter block containing an FSM and a Datapath, with inputs d_in_ready, d_in, clock and reset, and outputs d_out_ready and d_out.]

FSM SystemVerilog: State A
module fsm #(…) (clock, reset, … );

  enum logic {A = 1'b0, B = 1'b1} cur_state, n_state;

  always_comb begin
    case (cur_state)
      A: begin // State A
        n_state = d_in_ready ? B : A;
        Cclr_L = d_in_ready ? 0 : 1;
        Sload_L = d_in_ready ? 0 : 1;
        Oclr_L = d_in_ready ? 0 : 1;
        Sshift_L = 1;
        Cinc_L = 1;
        Oinc_L = 1;
        dor = 0; // D_out_ready
      end
      B: begin // State B

      end
    endcase
  end

  always_ff @(posedge clock, posedge reset)
    if (reset) cur_state <= A;
    else cur_state <= n_state;

endmodule: fsm

FSM SystemVerilog: State B
module fsm #(…) (clock, reset, … );

  enum logic {A = 1'b0, B = 1'b1} cur_state, n_state;

  always_comb begin
    case (cur_state)
      A: begin // State A
        …
      B: begin // State B
        n_state = (done) ? A : B;
        dor = (done) ? 1 : 0;
        Cclr_L = 1;
        Sload_L = 1;
        Oclr_L = 1;
        Cinc_L = (done) ? 1 : 0;
        Sshift_L = (done) ? 1 : 0;
        Oinc_L = (done) ? 1 : ~lowBit;
      end
    endcase
  end

endmodule: fsm

More…
module OnesCount
  #(parameter w = 30)
  (input logic d_in_ready, clock, reset,
   input logic [w-1:0] d_in,
   output logic dor,
   output logic [$clog2(w)-1:0] d_out);

  logic lowBit, done, Cclr_L, Cinc_L;
  logic Sload_L, Sshift_L, Oclr_L;
  logic Oinc_L; // dor is already declared as an output port

  logic [$clog2(w)-1:0] SC;

  fsm #(w) control (.*);

  ShiftReg_PISO_Right #(w) sr (lowBit, d_in, clock, Sload_L, Sshift_L);

  counter #($clog2(w)) sc (clock, Cclr_L, Cinc_L, SC);

  compare #($clog2(w)) cmp (, done, , SC, 'd30);

  counter #($clog2(w)) oct (clock, Oclr_L, Oinc_L, d_out);
endmodule: OnesCount

module fsm
  #(parameter w = 30)
  (input logic clock, reset, done,
   input logic d_in_ready, lowBit,
   input logic [$clog2(w)-1:0] SC, // same width as SC in OnesCount
   output logic Cclr_L, Cinc_L,
                Sload_L, Sshift_L,
                Oclr_L, Oinc_L, dor);

  enum logic {A = 1'b0, B = 1'b1} cur_state, n_state;

  always_comb begin
    case (cur_state)
      A: begin // State A
        …
    endcase
  end

  always_ff @(posedge clock, posedge reset)
  begin
    if (reset) cur_state <= A;
    else cur_state <= n_state;
  end

endmodule: fsm

Trace a Transition
[Figure: timing diagram across four clock cycles in states A, A, B, B. Traces: CLK, d_in_ready, Cclr_L, Sload_L, Oclr_L, SC (unknown, then 0), OC (unknown, then 0), ShiftReg (unknown, then the input word), lowBit and Oinc_L. The FSM state diagram is shown alongside.]

An Alternate Approach
• Lose the Shift Count Reg/Comp
- Let’s just have 30 states where we do the shifting and then just return to the A state
- I got tired, and didn’t draw all 30 of them!

[Figure: state diagram. Reset enters A, which loops on ~d_in_ready; A to B on d_in_ready / Cclr_L, Sload_L, Oclr_L; each of B, C, etc. moves to the next state on ~lowBit / Sshift_L or on lowBit / Oinc_L, Sshift_L. The datapath keeps the Shift Register (Sload_L, Sshift_L, lowbit) and the Ones Count Register (Oclr_L, Oinc_L, 5-bit output OC).]

How Different Will This Be?
• Pick a state encoding
- How about A = 00000, B = 00001, C = 00010, D = 00011, E = 00100, …
- How would our design be different?

Two Approaches
• These two state transition diagrams suggest two
ways of envisioning a controller
- Exclude all of the states where you are just counting from the FSM
✦ Treat the counter as something else to control and monitor for
when it’s done
- Include all of the states, i.e. ones where you’re just counting, in the
FSM
✦ This was our second approach

• Comparison
- Excluding the counter states
✦ Smaller, simpler FSM to design — functional partitioning
✦ Give the synthesis tool a smaller thing to design
- Including all of the states
✦ Bigger, more complex FSM to design
✦ Let a synthesis tool wrestle with a state encoding
Cooperating FSMs
• Turns out, they’re about the same
• One view: Cooperating FSMs
- Control FSM + shift-count FSM
✦ two separate FSMs each with simple-to-think-about control
sequences
✦ compose them to form the more elaborate control sequences

More Alternates
• Don’t use a shift register
- Use a register called DIN to hold the input data
- Put a 32-to-1 mux on the output of DIN
- Use the old shift count (now called BCount) to select one of the bits to feed to the FSM

[Figure: the DIN Register (30 bits, load_L) feeds a 32-to-1 MUX whose sel input is BCount, the 5-bit output of the Bit Count Register (Cclr_L, Cinc_L); the selected bit goes to the FSM as “lowBit”. A comparator checks BCount against 5'd30 to produce done, and the Ones Count Register (Oclr_L, Oinc_L) produces the 5-bit OC.]

Here’s Another
• Who needs a state machine?
- Just build up a big combinational network of adders
- Load the DIN register, and several gate delays later you’ll have the answer
• Comparison
- The “all combinational” versions of these circuits generally take fewer cycles but require more gates (no reuse)
- The “all combinational” version may not be faster if the clock frequency has to go way down

[Figure: the DIN Register (30 bits, load_L) feeding a big combinational circuit of adders that adds up all of the bits.]

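One way to picture the network of adders is as a tree that sums neighbouring values level by level. The Python sketch below models that structure; `adder_tree_popcount` is a hypothetical name, and the pairwise tree is an assumed arrangement, since the slide only shows "a big combinational circuit of adders".

```python
def adder_tree_popcount(word, width=30):
    """Return (ones count, number of adder levels) for a pairwise adder tree."""
    values = [(word >> i) & 1 for i in range(width)]   # one 1-bit value per input bit
    levels = 0
    while len(values) > 1:
        # Each adder in this level sums two neighbouring partial counts.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])                  # odd one out passes through
        values = paired
        levels += 1
    return values[0], levels


print(adder_tree_popcount(0b1011))
```

For a 30-bit word this arrangement needs 5 levels of adders, so the answer appears after one long combinational delay rather than after 30 shifts, at the cost of using every adder only once.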
… And Another
• How might you write the loop in software?
for (Ocount = 0, ShiftReg = in; ShiftReg != 0; ShiftReg >>= 1)
    if (ShiftReg & 1) Ocount++;
- Like before, no loop counter, but end when the shift register has no 1’s left
• Observations
- Many software “alternates” apply
✦ Some S/W, some H/W
- Hardware faster, a two-state machine
- But you couldn’t have used the all-combinational string of adders in software
- Evaluation functions for hardware and software differ

[Figure: the Shift Register (30 bits, Sload_L, Sshift_L) drives lowbit and a comparator against 30'd0 whose eq output yields ShiftReg != 0; the Ones Count Register (Oclr_L, Oinc_L) produces the 5-bit OC.]

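A quick Python model shows how the early exit changes the cycle count. It is a sketch only: `ones_count_early_exit` is a hypothetical name, and each loop iteration is assumed to correspond to one shift cycle in the hardware version.

```python
def ones_count_early_exit(word):
    """Return (ones count, shift cycles), stopping once no 1s remain."""
    ones = 0
    cycles = 0
    shift_reg = word
    while shift_reg != 0:        # comparator against zero replaces the loop counter
        ones += shift_reg & 1    # count the low-order bit
        shift_reg >>= 1          # shift right by one
        cycles += 1
    return ones, cycles


print(ones_count_early_exit(0b1011))
```

The number of cycles now depends on the position of the highest set bit, so small values finish quickly while a word with bit 29 set still takes 30 shifts.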
Another: 15-513 Datalab Way
V = d_in;
for (onesCount = 0; V; onesCount++)
{
    // clear the least significant bit set
    V &= V - 1;
}
• This one loops only once for each set bit
• A value and that value -1 differ only from the least significant set bit down
0101001000 ← A value
0101000111 ← That value -1
      ^^^^ Only different here
• Sometimes known as "Brian Kernighan's way"

[Figure: a 2-to-1 MUX selects d_in or the AND result into a 30-bit Register (load_L); the register output goes to a Subtracter with 30'd1 and to 30 AND gates, and a comparator against 30'd0 produces eq, which drives done.]

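The same loop can be tried directly in Python to see that it iterates once per set bit. This is a minimal sketch; the function name `ones_count_kernighan` is hypothetical.

```python
def ones_count_kernighan(word):
    """Count set bits by repeatedly clearing the least significant set bit."""
    count = 0
    while word:
        word &= word - 1     # word and word-1 differ from the lowest set bit down
        count += 1
    return count


value = 0b0101001000
print(bin(value & (value - 1)))      # the lowest set bit has been cleared
print(ones_count_kernighan(value))
```

Using the value from the slide, one AND step turns 0101001000 into 0101000000, and the loop finishes after three iterations because three bits were set.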
An exercise for the student

unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 14-bit values in v:
c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;

// option 2, for at most 24-bit values in v:
c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
     % 0x1f;

// option 3, for at most 32-bit values in v:
c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
     % 0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

Thanks to: graphics.stanford.edu/~seander/bithacks.html

Reviewing the Parts
[Figure: the FSM and Datapath blocks, each with inputs, outputs, clock and reset. Inside the FSM: next state and output logic, plus State FFs. Inside the Datapath: ALUs, MUXes, comparators, etc., plus Registers.]

Notes on Hardware Thread Design
• Design the computational machinery (datapath)
separate from the control (FSM) machinery
- Keep control points and status points straight
- Make sure the FSM inputs status points and outputs control points
• Datapath should be structured as RTL
- Registers hold values
- At clock edges, values are transferred from a register, through
combinational circuitry, into a (usually different) register
- Might even list transformations during design as:
✦ Register A ➙ Register B
✦ Register C + Register D ➙ Register C

• Think of transformations using standard components whenever possible
• When the datapath definition is complete, develop the FSM to drive it

Summary
• RT-level systems
- FSM-D — finite state machine and datapath
- Hardware Thread
- Described in terms of the functional units
✦ datapath and controller(s)
- Generally multibit ops — we’re describing computations
• We’ve now seen Mealy and Moore implementations
- Plenty of alternate implementations
- Software background can help suggest alternates
- But software and hardware implementations are very different
✦ What’s good for one isn’t necessarily good for the other
