EE 465 Final Design Project

Fall, 2013
Lab 9: System Power Optimization


Xia, Jiwei
Li, Muzi
12/10/2013

EE 465 FINAL DESIGN PROJECT December 10, 2013


EE 465 – Fall 2013
Final Design Project
Introduction: In this project, we design a triangle rendering engine.
We first write Verilog code for the circuit, synthesize it with RTL
Compiler to obtain a schematic, and then use Encounter to produce the
layout.
After recording the results from RTL Compiler and Encounter, we redo
the design with an optimized circuit. Our options are as follows:
1. Using pipelining to reduce the area. In this lab, the main use of
pipelining is to reduce the number of multipliers.
2. Applying clock gating to turn off unnecessary activity in the
circuit, which may save power and area.
3. Modifying the logic of the essential part so that the calculation is
done more efficiently.

Project description: Given the supporting files and a skeleton of the
Verilog code, we design a triangle rendering engine.
Files given:
Triangle Rendering Engine description (pdf)
Supporting files (tar): input.dat, testfixture.v, triangle.sdc,
triangle.v, expect.dat, triangle.vhd


The block overview is shown in Fig. 1.

Fig. 1 block overview

What we need to design is the shaded block on the right. The I/O
interface description is given in Fig. 2.

Fig. 2 I/O Interface

Functional Example
The inputs are the vertices (x1,y1), (x2,y2) and (x3,y3) of a triangle,
which are (1,1), (6,3) and (1,6) in this case.
Note: the inputs are constrained so that x1=x3 and y1<y2<y3.

This produces the triangle shown in Fig. 3.

Fig.3 Example of a triangle

The triangle rendering engine outputs the valid coordinates in the
following order:
(1,1), (1,2), (2,2), (3,2), (1,3), (2,3), (3,3), (4,3), (5,3), (6,3),
(1,4), (2,4), (3,4), (4,4), (1,5), (2,5), (1,6)
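The scan order above can be reproduced with a small software reference
model. The sketch below is written in Python rather than Verilog so it
can be run without a simulator; the function name render_triangle and
the size parameter are our own illustrative choices, and the inside
test uses the cross-multiplied conditions from the design methodology
section. It is a checking aid under these assumptions, not the
synthesized design.

```python
def render_triangle(p1, p2, p3, size=8):
    """Return the covered grid points in row-major scan order."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Input constraints from the project description.
    assert x1 == x3 and x1 != x2 and y1 < y2 < y3
    points = []
    for y in range(size):            # rows first ...
        for x in range(size):        # ... then columns, like the X/Y counters
            bot = (x - x1) * (y2 - y1) - (x2 - x1) * (y - y1)
            top = (x - x3) * (y2 - y3) - (x2 - x3) * (y - y3)
            if x1 < x2:              # slanted vertex on the right
                inside = y >= y1 and x >= x1 and bot <= 0 and top >= 0
            else:                    # slanted vertex on the left
                inside = y >= y1 and x <= x1 and bot >= 0 and top <= 0
            if inside:
                points.append((x, y))
    return points


EXAMPLE = render_triangle((1, 1), (6, 3), (1, 6))
print(EXAMPLE)
```

For the example triangle this prints the 17 coordinates in the same
order as the list above, which makes it a convenient cross-check for
expect.dat style data.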

Design methodology and details
For coordinate imports:
From the test bench we know that the coordinates of the triangle are
input one pair per CYCLE.
`define CYCLE 100 // Modify your clock period here (unit: 0.1ns)

Each CYCLE is 100*0.1 = 10 ns.
For three points, we need to wait three cycles to receive the complete
set of coordinates.
if (Start_importing && ~busy) begin
    current_i <= current_i + 1;   // counter selects which point is recorded
end

if (current_i == 3'b001) begin
    x1 <= xii;
    y1 <= yii;
end

if (current_i == 3'b010) begin
    x2 <= xii;
    y2 <= yii;
end

if (current_i == 3'b011) begin
    x3 <= xii;
    y3 <= yii;
end

After the counter reaches 3, we set busy to 1, which indicates that the
engine is busy calculating and prevents new data from being received
from the test bench. (The full listing extends the same scheme to six
points, so that two triangles are loaded before busy is set.)
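The counter-driven capture can be sketched as a small software model.
This is only an illustration: import_points and n_points are
hypothetical names, one loop iteration stands for one CYCLE, and the
one-cycle delay through the xii/yii registers is deliberately left out.

```python
def import_points(stream, n_points=3):
    """Capture one (x, y) pair per cycle until n_points are stored."""
    regs = []              # stands in for x1/y1, x2/y2, x3/y3
    current_i = 1          # same role as the current_i counter
    busy = False
    for xi, yi in stream:  # one iteration = one CYCLE
        if busy:
            break          # new data is ignored while busy
        regs.append((xi, yi))      # current_i selects the destination
        if current_i == n_points:
            busy = True            # all points received
        current_i += 1
    return regs, busy


POINTS, BUSY = import_points([(1, 1), (6, 3), (1, 6), (5, 5)])
print(POINTS, BUSY)
```

The fourth pair in the example stream is dropped because busy is
already set, which mirrors how the engine holds off the test bench.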
Calculation
To judge whether a point is inside or on the edge of the triangle, we
need to use the equations provided.


Note: for the upper line, we need to invert the inequality sign.
However, these equations require a divider, which is difficult to
implement. More importantly, a divider consumes more area and power
than a multiplier. As a result, we modify the equations to use
multipliers instead.
reg signed [7:0] bot_line;
reg signed [7:0] top_line;

bot_line <= (X_next-x1)*(y2-y1) - (x2-x1)*(Y_next-y1);
top_line <= (X_next-x3)*(y2-y3) - (x2-x3)*(Y_next-y3);

To keep these expressions equivalent to the original equations, we only
compare the values for rows between y1 and y3. Because y1<y2<y3, the
denominators y2-y1 (positive) and y2-y3 (negative) have fixed signs, so
the direction of each inequality after multiplying both sides by the
denominator is known in advance.
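As a sanity check on the cross-multiplication, the sketch below
compares the multiplier form of the lower-edge test with a division
form over every combination of 3-bit coordinates. The division form
x <= x1 + (x2-x1)*(y-y1)/(y2-y1) is our assumption of what the original
equation looks like, since the figure with the provided equations is
not reproduced here; the function names are illustrative. The same loop
also records the range of the expression, which indicates how wide the
signed registers need to be.

```python
from fractions import Fraction
from itertools import product


def bot_line(x, y, x1, y1, x2, y2):
    """Cross-multiplied test value for the lower edge, as in the Verilog."""
    return (x - x1) * (y2 - y1) - (x2 - x1) * (y - y1)


def on_or_left_of_edge_div(x, y, x1, y1, x2, y2):
    """Assumed division form: x <= x1 + (x2-x1)*(y-y1)/(y2-y1)."""
    return x <= x1 + Fraction((x2 - x1) * (y - y1), y2 - y1)


lo = hi = 0
mismatches = 0
for x, y, x1, y1, x2, y2 in product(range(8), repeat=6):   # 3-bit values
    v = bot_line(x, y, x1, y1, x2, y2)
    lo, hi = min(lo, v), max(hi, v)
    if y2 > y1:   # constraint y1 < y2 keeps the denominator positive
        if (v <= 0) != on_or_left_of_edge_div(x, y, x1, y1, x2, y2):
            mismatches += 1

print(mismatches, lo, hi)
```

The two forms agree for every case with y2 > y1, and the expression
stays within -49 to 49 (it is twice the signed area of a triangle
inside a 7 by 7 square), so it fits in a 7-bit signed register.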
For a triangle like the one in Fig. 4, where x1=x3<x2:

if (Y>=y1 && x1<x2 && X>=x1 && bot_line<=0 && top_line>=0) begin
    po <= 1;
    xo <= X;
    yo <= Y;
end

Fig.4 Example of a triangle

For a triangle like the one in Fig. 5, where x1=x3>x2:

if (Y>=y1 && x1>x2 && X<=x1 && bot_line>=0 && top_line<=0) begin
    po <= 1;
    xo <= X;
    yo <= Y;
end

Fig.5 Example of a triangle

Verilog Design
Code Version 1
Code
module triangle (clk, reset, nt, xi, yi, busy, po, xo, yo);
input clk, reset, nt;
input [2:0] xi, yi;
output busy, po;
output [2:0] xo, yo;

reg [2:0] x1, y1;            // coordinates of points 1, 2, 3
reg [2:0] x2, y2;
reg [2:0] x3, y3;

reg [2:0] x11, y11;          // coordinates of points 4, 5, 6; they are
reg [2:0] x22, y22;          // passed to x1,y1, x2,y2, x3,y3 later for
reg [2:0] x33, y33;          // the calculation

reg [2:0] xii, yii;          // temporary storage of xi, yi

reg busy, po;
reg [2:0] xo, yo;

reg [2:0] current_i;         // counter indicating which point's coordinates
                             // are being recorded, e.g. current_i=1: x1=xi, y1=yi
reg [2:0] X;
reg [2:0] X_next;            // next value of X
reg [2:0] Y;
reg [2:0] Y_next;            // next value of Y
reg Start_importing;         // whether coordinates are still being recorded

reg signed [7:0] bot_line;   // expressions for judging whether the point is
reg signed [7:0] top_line;   // on the left or right of each line

reg cycle;                   // which triangle is being calculated;
                             // cycle = 0 while the first triangle is processed

always @(posedge clk or posedge reset) begin
    if (reset) begin
        busy <= 0;           // reset all registers
        po <= 0;
        x2 <= 3'bzzz;
        y2 <= 3'bzzz;
        x1 <= 3'bzzz;
        y1 <= 3'bzzz;
        x3 <= 3'bzzz;
        y3 <= 3'bzzz;

        x22 <= 3'bzzz;
        y22 <= 3'bzzz;
        x11 <= 3'bzzz;
        y11 <= 3'bzzz;
        x33 <= 3'bzzz;
        y33 <= 3'bzzz;

        xo <= 3'bzzz;
        yo <= 3'bzzz;

        current_i <= 3'b001;
        X <= 3'b0;
        Y <= 3'b0;
        X_next <= 3'b001;
        Y_next <= 3'b000;

        Start_importing <= 0;
        cycle <= 0;
    end

    else begin               // start importing coordinates from the testbench
        xii <= xi;
        yii <= yi;

        if (nt && ~busy) Start_importing <= 1;

        // current_i increases from 1 to 6 to indicate which point to record
        if (Start_importing && ~busy) begin
            current_i <= current_i + 1;
        end

        if (Start_importing && current_i == 3'b001) begin
            x1 <= xii;
            y1 <= yii;
        end

        if (current_i == 3'b010) begin
            x2 <= xii;
            y2 <= yii;
        end

        if (current_i == 3'b011) begin
            x3 <= xii;
            y3 <= yii;
        end

        if (current_i == 3'b100) begin
            x11 <= xii;
            y11 <= yii;
        end

        if (current_i == 3'b101) begin
            x22 <= xii;
            y22 <= yii;
        end

        if (current_i == 3'b110 && ~busy) begin
            x33 <= xii;
            y33 <= yii;
            busy <= 1;
            Start_importing <= 0;
        end

        if (busy) begin      // start judging whether the point is inside the triangle
            X <= X + 1;      // loop from (0,0) to (7,7)
            if (X == 3'b111) begin Y <= Y + 1; end

            X_next <= X_next + 1;
            if (X_next == 3'b111) begin Y_next <= Y_next + 1; end

            bot_line <= (X_next-x1)*(y2-y1) - (x2-x1)*(Y_next-y1);
            top_line <= (X_next-x3)*(y2-y3) - (x2-x3)*(Y_next-y3);

            po <= 0;

            if (Y>=y1 && x1<x2 && X>=x1 && bot_line<=0 && top_line>=0) begin
                po <= 1;
                xo <= X;
                yo <= Y;
            end

            if (Y>=y1 && x1>x2 && X<=x1 && bot_line>=0 && top_line<=0) begin
                po <= 1;
                xo <= X;
                yo <= Y;
            end

            if (Y == 3'b111 && X == 3'b111) begin   // current triangle is done
                cycle <= cycle + 1;                 // move to the second triangle
                // pass the coordinates of the second triangle to the first
                // three register pairs so that the expressions for bot_line
                // and top_line do not need to be modified
                x2 <= x22;
                y2 <= y22;
                x1 <= x11;
                y1 <= y11;
                x3 <= x33;
                y3 <= y33;

                xo <= 3'bzzz;
                yo <= 3'bzzz;

                current_i <= 3'b001;
                X <= 3'b0;
                Y <= 3'b0;
                X_next <= 3'b001;
                Y_next <= 3'b000;
                if (cycle) busy <= 0;
            end
        end
    end
end

endmodule



Simulation Result


In Transcript window:
# ****** START to VERIFY the Triangel Rendering Enginen OPERATION ******
#
# Waiting for the rendering operation of the triangle points with:
# (x1, y1)=(1, 0)
# (x2, y2)=(7, 2)
# (x3, y3)=(1, 7)
# Waiting for the rendering operation of the triangle points with:
# (x1, y1)=(6, 1)
# (x2, y2)=(0, 3)
# (x3, y3)=(6, 6)
# PASS! All data have been generated successfully!
# ---------------------------------------------
# Total delay: 126000 ns
# ---------------------------------------------
# ** Note: $finish : /home/alfredx/ee330/Formaltest.v(147)
# Time: 12800 ns Iteration: 1 Instance: /test


Fig.6 Simulation Result for version 1

RTL Compiler Result:
Schematic


Timing report
Cost Group : 'clk' (path_group 'clk')
Timing slack : 629ps
Start-point : nt
End-point : Start_importing_reg/SE

Fig.7 Schematic for version 1

Power report
============================================================
                    Leakage     Dynamic       Total
Instance   Cells  Power(nW)   Power(nW)   Power(nW)
------------------------------------------------------------
triangle     528  20742.271  348657.011  369399.281

Area report
============================================================
Instance   Cells  Cell Area  Net Area  Total Area  Wireload
------------------------------------------------------------
triangle     528       2883         0        2883  ZeroWireload (S)

Encounter result
After selecting File > RTL Synthesis




Fig.8 Layout after RTL Synthesis

Select Floorplan > Specify Floorplan
After clicking “Apply”

Fig.9 Layout after RTL Synthesis

After the floorplan has been mapped

Fig.10 Layout after Floorplan

Select Power > Power Planning > Add Ring

Fig.11 Layout after Adding Ring

Select Place > Place Standard Cell

Fig.12 Layout after Place Standard Cell

Report Power
Total Power
------------------------------------------------------------
Total Internal Power:    0.2861    52.78%
Total Switching Power:   0.2328    42.94%
Total Leakage Power:     0.02322   4.283%
Total Power:             0.542
------------------------------------------------------------
Power Units = 1mW

Area Information

Area = 280 * 148 = 41440 um^2

Fig.12 Area Measurement

Debug Timing



The clock period we use is 5 ns in the triangle.sdc constraint file.

Fig.13 Debug Timing

Code Version 2 (Optimized Circuit)
module triangle (clk, reset, nt, xi, yi, busy, po, xo, yo);
input clk, reset, nt;
input [2:0] xi, yi;
output busy, po;
output [2:0] xo, yo;

reg [2:0] x1, y1;            // coordinates of points 1, 2, 3
reg [2:0] x2, y2;
reg [2:0] x3, y3;

reg [2:0] x11, y11;          // coordinates of points 4, 5, 6; they are
reg [2:0] x22, y22;          // passed to x1,y1, x2,y2, x3,y3 later for
reg [2:0] x33, y33;          // the calculation

reg [2:0] xii, yii;          // temporary storage of xi, yi

reg busy, po, EN;
reg [2:0] xo, yo;

reg [2:0] current_i;         // counter indicating which point's coordinates
                             // are being recorded, e.g. current_i=1: x1=xi, y1=yi
reg [2:0] X;
reg [2:0] Y;
reg Start_importing;         // whether coordinates are still being recorded

reg signed [6:0] bot_line;   // expressions for judging whether the point is
reg signed [6:0] top_line;   // on the left or right of each line

reg cycle;                   // which triangle is being calculated;
                             // cycle = 0 while the first triangle is processed

wire ENCLK1, ENCLK2;
assign ENCLK1 = clk | EN;    // has rising edges only while EN = 0
assign ENCLK2 = clk | (~EN); // has rising edges only while EN = 1

reg signed [6:0] m0, m1;     // partial products shared by both line tests

reg [1:0] sel_tmp;           // step counter of the multiply sequence

always @(posedge clk or posedge reset) begin
    if (reset) begin
        busy <= 0;           // reset all registers
        po <= 0;
        EN <= 0;
        x2 <= 3'bzzz;
        y2 <= 3'bzzz;
        x1 <= 3'bzzz;
        y1 <= 3'bzzz;
        x3 <= 3'bzzz;
        y3 <= 3'bzzz;

        x22 <= 3'bzzz;
        y22 <= 3'bzzz;
        x11 <= 3'bzzz;
        y11 <= 3'bzzz;
        x33 <= 3'bzzz;
        y33 <= 3'bzzz;

        xo <= 3'bzzz;
        yo <= 3'bzzz;

        current_i <= 3'b001;
        X <= 3'b0;
        Y <= 3'b0;

        Start_importing <= 0;
        cycle <= 0;
        sel_tmp <= 0;
    end

    else begin               // start importing coordinates from the testbench
        if (~busy) begin
            xii <= xi;
            yii <= yi;

            if (nt) Start_importing <= 1;

            // current_i increases from 1 to 6 to indicate which point to record
            if (Start_importing) begin
                current_i <= current_i + 1;
            end

            if (current_i == 3'b001) begin
                x1 <= xii;
                y1 <= yii;
            end

            if (current_i == 3'b010) begin
                x2 <= xii;
                y2 <= yii;
            end

            if (current_i == 3'b011) begin
                x3 <= xii;
                y3 <= yii;
            end

            if (current_i == 3'b100) begin
                x11 <= xii;
                y11 <= yii;
            end

            if (current_i == 3'b101) begin
                x22 <= xii;
                y22 <= yii;
            end

            if (current_i == 3'b110) begin
                x33 <= xii;
                y33 <= yii;
                busy <= 1;
                Start_importing <= 0;
            end
        end
    end
end

always @(posedge ENCLK1) begin
    if (busy) begin          // start judging whether the point is inside the triangle
        X <= X + 1;          // loop from (0,0) to (7,7)
        if (X == 3'b111) Y <= Y + 1;

        EN <= 1;             // hand over to the ENCLK2 block

        if (Y>=y1 && x1<x2 && X>=x1 && bot_line<=0 && top_line>=0) begin
            po <= 1;
            xo <= X;
            yo <= Y;
        end

        if (Y>=y1 && x1>x2 && X<=x1 && bot_line>=0 && top_line<=0) begin
            po <= 1;
            xo <= X;
            yo <= Y;
        end

        if (Y == 3'b111 && X == 3'b111) begin   // current triangle is done
            cycle <= cycle + 1;                 // move to the second triangle
            // pass the coordinates of the second triangle to the first
            // three register pairs so that the expressions for bot_line
            // and top_line do not need to be modified
            x2 <= x22;
            y2 <= y22;
            x1 <= x11;
            y1 <= y11;
            x3 <= x33;
            y3 <= y33;

            xo <= 3'bzzz;
            yo <= 3'bzzz;

            X <= 3'b0;
            Y <= 3'b0;
            if (cycle) busy <= 0;
        end
    end
end

always @(posedge ENCLK2) begin
    if (Y >= y1 && Y <= y2) begin
        sel_tmp <= sel_tmp + 1;
        m1 <= m0;            // move m0 to m1
        top_line <= 0;

        // ~sel_tmp is a bitwise NOT, so it is also non-zero for
        // sel_tmp = 1 and 2; the later assignments take precedence there
        if (~sel_tmp) m0 <= (X-x1)*(y2-y1);          // step 0: first product
        if (sel_tmp == 2'b01) m0 <= (x2-x1)*(Y-y1);  // step 1: second product

        if (sel_tmp == 2'b10) begin                  // step 2: bot_line = m1 - m0
            bot_line <= m1 - m0;
            EN <= 0;
            sel_tmp <= 0;
            m0 <= 0;
            m1 <= 0;
        end
    end

    else if (Y > y2 && Y <= y3) begin
        sel_tmp <= sel_tmp + 1;
        m1 <= m0;            // move m0 to m1
        bot_line <= 0;

        if (~sel_tmp) m0 <= (X-x3)*(y2-y3);          // step 0: first product
        if (sel_tmp == 2'b01) m0 <= (x2-x3)*(Y-y3);  // step 1: second product

        if (sel_tmp == 2'b10) begin                  // step 2: top_line = m1 - m0
            top_line <= m1 - m0;
            EN <= 0;
            sel_tmp <= 0;
            m0 <= 0;
            m1 <= 0;
        end
    end

    if (Y < y1 || Y > y3) EN <= 0;
    if (po == 1) po <= 0;
end

endmodule

Optimization Strategy
In the original code,
bot_line <= (X_next-x1)*(y2-y1) - (x2-x1)*(Y_next-y1);
top_line <= (X_next-x3)*(y2-y3) - (x2-x3)*(Y_next-y3);
we calculate the values of bot_line and top_line at the same time,
which is inefficient and power consuming, because when Y<=y2 we only
need to know the value of bot_line, and when Y>y2 we only need to know
the value of top_line.
Furthermore, these two equations use 4 multipliers in total, which may
take a lot of area. As a result, we use pipelining to reduce the number
of multipliers.
Here’s the modified code:
always @(posedge ENCLK2) begin
    if (Y >= y1 && Y <= y2) begin
        sel_tmp <= sel_tmp + 1;
        m1 <= m0;            // move m0 to m1
        top_line <= 0;

        if (~sel_tmp) m0 <= (X-x1)*(y2-y1);          // step 0: first product
        if (sel_tmp == 2'b01) m0 <= (x2-x1)*(Y-y1);  // step 1: second product

        if (sel_tmp == 2'b10) begin                  // step 2: bot_line = m1 - m0
            bot_line <= m1 - m0;
            EN <= 0;
            sel_tmp <= 0;
            m0 <= 0;
            m1 <= 0;
        end
    end

    else if (Y > y2 && Y <= y3) begin
        sel_tmp <= sel_tmp + 1;
        m1 <= m0;            // move m0 to m1
        bot_line <= 0;

        if (~sel_tmp) m0 <= (X-x3)*(y2-y3);          // step 0: first product
        if (sel_tmp == 2'b01) m0 <= (x2-x3)*(Y-y3);  // step 1: second product

        if (sel_tmp == 2'b10) begin                  // step 2: top_line = m1 - m0
            top_line <= m1 - m0;
            EN <= 0;
            sel_tmp <= 0;
            m0 <= 0;
            m1 <= 0;
        end
    end

    if (Y < y1 || Y > y3) EN <= 0;
    if (po == 1) po <= 0;
end

By using pipelining, we reduce the number of active multipliers to only
1, which would save a lot of power: each product is computed in its own
clock cycle and stored in m0 and m1, and the subtraction is done in a
third cycle.
In addition, to make the pipelining convenient, we only need to extend
the number of clock cycles for the block that calculates bot_line and
top_line, while the other parts remain unchanged.
To do this, we pulled the calculation out and put it in a separate
always block which is driven by ENCLK2.
This is a kind of clock gating.
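The three-step sequence can be traced with a small software model.
This is a sketch in Python under our own naming
(edge_value_shared_multiplier is a hypothetical helper); it mimics the
non-blocking updates of m0, m1 and sel_tmp for the lower line only and
counts how many multiplications are issued in each step.

```python
def edge_value_shared_multiplier(x, y, x1, y1, x2, y2):
    """Compute (x-x1)*(y2-y1) - (x2-x1)*(y-y1) with one multiply per step."""
    m0 = m1 = 0
    result = None
    multiplies = []                      # multiplications issued per step
    for sel_tmp in range(3):
        # Right-hand sides use the values from before this step, which is
        # how non-blocking assignments in a clocked always block behave.
        next_m1 = m0                     # m1 <= m0
        if sel_tmp == 0:
            next_m0 = (x - x1) * (y2 - y1)   # first product
            multiplies.append(1)
        elif sel_tmp == 1:
            next_m0 = (x2 - x1) * (y - y1)   # second product
            multiplies.append(1)
        else:
            result = m1 - m0             # bot_line <= m1 - m0
            next_m0 = next_m1 = 0        # registers cleared for the next point
            multiplies.append(0)
        m0, m1 = next_m0, next_m1
    return result, multiplies


VALUE, MULTS = edge_value_shared_multiplier(3, 2, 1, 1, 6, 3)
print(VALUE, MULTS)
```

For the point (3,2) of the functional example this gives -1, the same
value as the single-cycle expression, with at most one multiplication
in any step. The price is three steps per line evaluation instead of
one.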


The type of the clock gating cell we use is as follows


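Here is the structure we use:
input clk;
wire ENCLK1, ENCLK2;

assign ENCLK1 = clk | EN;      // held high while EN = 1, so it only
                               // has rising edges while EN = 0
assign ENCLK2 = clk | (~EN);   // held high while EN = 0, so it only
                               // has rising edges while EN = 1

always @(posedge clk or posedge reset) begin
    // do some operation to EN
end

always @(posedge ENCLK1) begin
    // scan counters and output registers
end

always @(posedge ENCLK2) begin
    // multiply sequence for bot_line and top_line
end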
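The effect of the two OR gates can be illustrated by counting rising
edges on sampled waveforms. The sketch below is a simplified Python
model with hypothetical names (rising_edges, gated_clocks); it assumes
EN only changes while clk is high, and the EN pattern of one scan cycle
followed by three compute cycles is our reading of the listing, not a
measured waveform.

```python
def rising_edges(levels):
    """Count 0 -> 1 transitions in a list of sampled logic levels."""
    return sum(1 for a, b in zip(levels, levels[1:]) if (a, b) == (0, 1))


def gated_clocks(en_per_cycle):
    """Sample clk, clk|EN and clk|~EN twice per cycle (low, then high phase).

    en_per_cycle[k] is the value EN holds during the low phase before the
    k-th rising edge of clk.
    """
    clk, enclk1, enclk2 = [], [], []
    for en in en_per_cycle:
        for level in (0, 1):                 # low phase, then high phase
            clk.append(level)
            enclk1.append(level | en)        # ENCLK1 = clk | EN
            enclk2.append(level | (1 - en))  # ENCLK2 = clk | ~EN
    return clk, enclk1, enclk2


EN_PATTERN = [0, 1, 1, 1] * 2   # assumed: 1 scan cycle, then 3 compute cycles
CLK, ENCLK1, ENCLK2 = gated_clocks(EN_PATTERN)
COUNTS = (rising_edges(CLK), rising_edges(ENCLK1), rising_edges(ENCLK2))
print(COUNTS)
```

With this pattern the 8 rising edges of clk split into 2 on ENCLK1 and
6 on ENCLK2, so every clock edge triggers exactly one of the two gated
blocks and the other block stays idle.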


Fig.14 Integrated clock gating cell using DFF

Simulation Result


In Transcript window:
# ****** START to VERIFY the Triangel Rendering Enginen OPERATION ******
#
# Waiting for the rendering operation of the triangle points with:
# (x1, y1)=(1, 0)
# (x2, y2)=(7, 2)
# (x3, y3)=(1, 7)
# Waiting for the rendering operation of the triangle points with:
# (x1, y1)=(6, 1)
# (x2, y2)=(0, 3)
# (x3, y3)=(6, 6)
# PASS! All data have been generated successfully!
# ---------------------------------------------
# Total delay: 464000 ns
# ---------------------------------------------
# ** Note: $finish : /home/alfredx/ee330/Formaltest.v(147)
# Time: 46600 ns Iteration: 1 Instance: /test

Fig.14 Simulation Result for version 2

RTL Compiler Result:
Schematic



Timing report
Cost Group : 'cg_enable_group_clk' (path_group 'cg_enable_group_clk')
Timing slack : 678ps
Start-point : reset
End-point : RC_CG_HIER_INST2/RC_CGIC_INST/E

Fig.15 Schematic for version 2

Power report
============================================================
                    Leakage     Dynamic       Total
Instance   Cells  Power(nW)   Power(nW)   Power(nW)
------------------------------------------------------------
triangle     341  11485.826  147180.865  158666.690

Area report
============================================================
Instance   Cells  Cell Area  Net Area  Total Area  Wireload
------------------------------------------------------------
triangle     341       1604         0        1604  ZeroWireload (S)


Encounter result
Physical View



After floor plan

Power ring



Report Power
Total Power
------------------------------------------------------------
Total Internal Power:    0.1725    67.07%
Total Switching Power:   0.07019   27.3%
Total Leakage Power:     0.01449   5.633%
Total Power:             0.2572

Area Information

Area = 262 * 128 = 33536 um^2


Debug Timing

The clock period we use in the .sdc file is also 5 ns.



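Comparison for both circuits
RTL Result:
            Power (nW)   Cells   Cell Area   Clock Period
Circuit 1   369,399      528     2883        5ns
Circuit 2   158,666      341     1604        5ns

Encounter Result
            Power (mW)   Area (um^2)   Clock Period
Circuit 1   0.542        41,440        5ns
Circuit 2   0.2572       33,536        5ns

The simulation transcripts report a finish time of 12800 ns for
Circuit 1 and 46600 ns for Circuit 2, so the optimized circuit trades a
longer rendering time for its lower power and area.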
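The relative savings can be worked out directly from the two tables.
The short Python calculation below uses only the numbers reported
above; reduction_percent is an illustrative helper name.

```python
def reduction_percent(before, after):
    """Relative saving of `after` versus `before`, in percent."""
    return 100.0 * (before - after) / before


# (Circuit 1, Circuit 2) values copied from the comparison tables.
RESULTS = {
    "RTL power (nW)":        (369399, 158666),
    "RTL cell area":         (2883, 1604),
    "Encounter power (mW)":  (0.542, 0.2572),
    "Encounter area (um^2)": (41440, 33536),
}

SAVINGS = {
    name: round(reduction_percent(before, after), 1)
    for name, (before, after) in RESULTS.items()
}
for name, pct in SAVINGS.items():
    print(f"{name}: {pct}% lower in Circuit 2")
```

This gives reductions of about 57.0% (RTL power), 44.4% (RTL cell
area), 52.5% (Encounter power) and 19.1% (Encounter area).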


Conclusion: We did our best in this project and spent about 50 hours on
it in total. The difficulty was just right for us to experience the
design process.
From this project, we learned quite a lot about how to write Verilog
code and how it differs from other languages. We also learned that we
needed to use non-blocking assignments consistently; otherwise our code
could not pass synthesis.
One suggestion is to have RTL Compiler and Encounter installed in the
lab, rather than always having to connect remotely to the server for
synthesis and layout. The server is sometimes unstable or shut down,
which is frustrating.