Professional Documents
Culture Documents
Dynamic Operands Insertion For VLIW Architecture With A Reduced Bit-Width Instruction Set
Dynamic Operands Insertion For VLIW Architecture With A Reduced Bit-Width Instruction Set
Jongwon Lee , Jonghee M. Youn , Jihoon Lee , Minwook Ahn , Yunheung Paek
School
of EECS,
Seoul Natl Univ., Korea,
{jwlee,jhlee,ypaek}@sor.snu.ac.kr,
Department
of CSE,
Gangneung-Wonju Natl Univ., Korea
jhyoun@gwnu.ac.kr
I. I NTRODUCTION
As multimedia applications such as videos, sounds and
images are becoming ever more complex and diverse, embedded systems targeting such applications have an increased tendency to attain a desirable performance with
VLIW (very long instruction word) to take advantage of
instruction-level parallelism (ILP). VLIW processors are
designed to enhance the performance by executing opera1530-2075/12 $26.00 2012 IEEE
DOI 10.1109/IPDPS.2012.21
Samsung
119
120
32-bit
H1
32-bit
32-bit
H2
A0
20-bit
16-bit
32-bit
A1
20-bit
Figure 1.
16-bit
instructions designate the same name for source and destination operands. In our work, we try to utilize this fact as
much as we can to avoid creating remote operands. As in
the case of add r4,r4,r5 of Figure 3, we thereby were
able to generate complete 16-bit instructions frequently in
our experiment. But not surprisingly, most instructions still
become partial instructions coupled with remote operands.
To distinguish between complete and partial instructions, we
set aside one bit, called D-bit, in the 16-bit ISA. D-bit is
represented in assembly syntax as shown in Figure 3 (c). In
Section III, we discuss how D-bit is decoded by hardware
in the pipeline stage.
Program Memory
A0
P0
P0
At Runtime
A1
A2
P1
P1
A0
Assembled
Complete Instruction
A1
Assembled
Complete Instruction
Legend
&RPSOHWH,QVWUXFWLRQ
Figure 2.
5HPRWH
RSHUDQG
3DUWLDO,QVWUXFWLRQ
4GOQVG1RGTCPF#TTC[
Conceptual View
add
add
r10
r11
r12
A. ISA Design
'ELW
r5
r4
D-bit value
syntax
example
add
sematic
r1 = r1 + r2
0,r1,r2
add
1,r1,r2
remote operand = r1 + r2
Figure 3.
121
reg
reg
reg
4bit
4bit
4bit
reg
reg
4bit
4bit
4bit
16
(a) R-type Instruction
reg
reg
imm16
4bit
4bit
16bit
reg
reg
4bit
4bit
imm4
32
Remote Operands
reg
32
Remote Operands
imm4
imm4
imm4
16bit
16
(b) I-type Instruction
imm20
20bit
32
Remote Operands
imm8
imm4
4bit
imm4
imm4
12bit
16
(c) J-type Instruction
Figure 4.
ISA Conversion
122
cycle
slot 0
0 ROA 3,0xa,0xb
sub 0,r3,r4
1
slot 2
add 1,r1,r2
mul 1,r6,r7
slot 3
movi 0,r10,10
addi 0,r10,4
buffer
(a) an example code and an initial buffer
cycle
slot 0
slot 1
0 ROA 3,0xa,0xb ROA 0xc,0xd,8
sub 0,r3,r4
lw 1,r5,[r10]
1
buffer
slot 2
add 1,r1,r2
mul 1,r6,r7
slot 3
movi 0,r10,10
addi 0,r10,4
slot 0
0 ROA 3,0xa,0xb
sub 0,r3,r4
1
buffer
C. Microarchitecture
slot 1
ROA 0xc,0xd,8
lw 1,r5,[r10]
slot 2
add 1,r1,r2
mul 1,r6,r7
slot 3
movi 0,r10,10
addi 0,r10,4
slot 1
ROA 0xc,0xd,8
lw 1,r5,[r10]
cycle
slot 0
0 ROA 3,0xa,0xb
sub 0,r3,r4
1
buffer
slot 1
ROA 0xc,0xd,8
lw 1,r5,[r10]
slot 2
add 1,r1,r2
mul 1,r6,r7
slot 3
movi 0,r10,10
addi 0,r10,4
(d) add 1,r1,r2 reads one remote operand from the buffer
cycle
slot 0
0 ROA 3,0xa,0xb
sub 0,r3,r4
1
buffer
slot 1
ROA 0xc,0xd,8
lw 1,r5,[r10]
C
slot 2
add 1,r1,r2
mul 1,r6,r7
slot 3
movi 0,r10,10
addi 0,r10,4
slot 0
0 ROA 3,0xa,0xb
sub 0,r3,r4
1
slot 1
ROA 0xc,0xd,8
lw 1,r5,[r10]
buffer
slot 2
add 1,r1,r2
mul 1,r6,r7
slot 3
movi 0,r10,10
addi 0,r10,4
(f) mul 1,r6,r7 reads one remote operand from the buffer
Figure 5.
(1)
123
VORW
'HFRGH
%XIIHU
5HDG
3RLQWHU
&DOFXODWRU
RI52$
EXIIHUZRUGV
VORW
VORW
)
(
'
&
'HFRGH
:ULWH
,QGH[
:ULWH
'DWD
:ULWH
,QGH[
:ULWH
'DWD
:ULWH
,QGH[
:ULWH
'DWD
:ULWH
,QGH[
:ULWH
'DWD
RI52$
EXIIHUZRUGV
'HFRGH
%XIIHU
:ULWH
3RLQWHU
&DOFXODWRU
RI52$
EXIIHUZRUGV
VORW
5HDG
,QGH[
5HDG
,QGH[
5HDG
,QGH[
5HDG
,QGH[
'HFRGH
RI52$
EXIIHUZRUGV
5HDG
'DWD
5HDG
'DWD
5HDG
'DWD
'
&
(
;
5HDG
'DWD
52$%XIIHU
Figure 6.
RI52$EXIIHUZRUGV
WREHUHDGZULWWHQDW
VORW
RI52$EXIIHUZRUGV
WREHUHDGZULWWHQDW
VORW
RI52$EXIIHUZRUGV
WREHUHDGZULWWHQDW
VORW
Figure 7.
$
'
'
$
'
'
*OREDO
5HDG
:ULWH
%XIIHU
3RLQWHU
$
'
'
$
'
'
,QGH[ZKHUH52$
EXIIHUZRUGVDUH
UHDGZULWWHQDW
VORW
,QGH[ZKHUH52$
EXIIHUZRUGVDUH
UHDGZULWWHQDW
VORW
,QGH[ZKHUH52$
EXIIHUZRUGVDUH
UHDGZULWWHQDW
VORW
,QGH[ZKHUH52$
EXIIHUZRUGVDUH
UHDGZULWWHQDW
VORW
124
add
mov
addi
addi
mul
add
mul
add
sub
r5,r1,r2
r12,r13
r14,r13,0xabcd
r10,r10,1
r5,r5,r6
r7,r12,r3
r9,r6,r8
r1,r10,r13
r11,r14,r12
Slot 2
Slot 1
Slot 0
for each ni G
current_inst = NULL;
unresolved_inst = NULL;
input_code_seq = ni.get_code_seg();
output_code_seq = NULL;
resolved_inst_seq = NULL;
ROA_buffer = NULL;
temp_ROA_buffer = NULL;
ROA_slot = NULL;
ROA_slot_generated = false;
while !input_code_seq.empty() || unresolved_inst NULL do
if unresolved_inst NULL then
current_inst = unresolved_inst;
unresolved_inst = NULL;
else
current_inst = input_code_seq.get_front();
buffer_words=convert_32bit_to_16bit_and_get_ROA_
buffer_words(current_inst);
temp_ROA_buffer.insert(buffer_words);
Slot 3
add r5,r1,r2
mov r12,r13
addi r14,r13,0xabcd addi r10,r10,1
mul r5,r5,r6
add r7,r12,r3
mul r9,r6,r8
add r1,r10,r13
sub r11,r14,r12
(b) a code after VLIW scheduling
cycle
0
1
2
3
Slot 1
Slot 0
ROA 5,0xa,0xb
add 1,r1,r2
mul 0,r5,r6
sub 1,r14,r12
ROA 0xc,0xd,7
mov 0,12,r13
add 1,r12,r3
Slot 2
ROA 9,1,11
addi 1,r14,r13
mul 1,r6,r8
Slot 3
addi 0,r10,1
add 1,r10,r13
Figure 8.
fi
while !temp_ROA_buffer.empty() do
ROA_buffer.insert( temp_ROA_buffer.get_front() );
if ROA_buffer.size() % 3 = 0 then
ROA_slot = generate_ROA_slot(ROA_buffer);
ROA_slot_generated = true;
if ROA_buffer.full() then
ROA_buffer.clear();
fi
break
fi
od
if temp_ROA_buffer.empty() then
resolved_inst_seq.insert(current_inst);
else
unresolved_inst = current_inst;
fi
if ROA_slot_generated == true then
output_code_seq.insert(ROA_slot);
output_code_seq.insert(resolved_inst_seq);
ROA_slot_generated = false;
fi
od
end
for each ni G
build_data_dependence_graph (ni);
apply_list_scheduling(ni);
end
Figure 9.
125
ROA 5,0xa,0xb
6LQJOHLVVXH&RGH
*HQHUDWLRQZLWKELW,6$
add r5,r1,r2
move r12,r13
ELW,QVWUXFWLRQ
*HQHUDWLRQ
ROA 0xc,0xd,7
addi
''**HQHUDWLRQ
6FKHGXOLQJ
r14,r13,0xabcd
addi r10,r10,1
mul r5,r5,r6
Figure 10.
Compilation Framework
add r7,r12,r3
ROA 9,1,11
r5,r1,r2
r12,r13
r14,r13,0xabcd
r10,r10,1
r5,r5,r6
r7,r12,r3
r9,r6,r8
r1,r10,r13
r11,r14,r12
add
mov
addi
addi
mul
add
mul
add
sub
5,0xa,0xb
r5,r1,r2
r12,r13
0xc,0xd,7
r14,r13,0xabcd
r10,r10,1
r5,r5,r6
r7,r12,r3
9,1,11
r9,r6,r8
r1,r10,r13
r11,r14,r12
add
mov
addi
addi
mul
add
mul
add
sub
r5,r1,r2
r12,r13
r14,r13,0xabcd
r10,r10,1
r5,r5,r6
r7,r12,r3
r9,r6,r8
r1,r10,r13
r11,r14,r12
mul r9,r6,r8
sub r11,r14,r12
Figure 12.
[D[E[F[G
r5,r1,r2
r12,r13
r14,r13,0xabcd
r10,r10,1
r5,r5,r6
r7,r12,r3
r9,r6,r8
r1,r10,r13
r11,r14,r12
: ROA slot
: complete instruction
: partial instruction
add r1,r10,r13
[D[E[F[G
cycle
slot 0
slot 2
mov 0,r12,r13
slot 3
ROA 5,0xa,0xb
add 1,r1,r2
addi 1,r14,r13
add 0,r10,1
mul 0,r5,r6
add 1,r12,r3
ROA 9,1,11
mul 1,r6,r8
add 1,r10,r13
sub 1,r14,r12
Figure 13.
slot 1
ROA 0xc,0xd,7
126
Table II
C ELL AREA OF EACH VLIW
Table I
T WO VLIW
Name
32-bit ISA VLIW
processor (32VP)
16-bit ISA VLIW
processor (16VP)
Description
32-bit instruction set, 32-bit data
word length, ve stages of pipeline
(IF,DC,EX,MEM,WB)
16-bit instruction set, 32-bit data word
length, the same ve pipeline stages with
a modied DC stage, a ROA buffer capable
of storing 6 words(each word is 4 bits)
Combinational area
Non-combinational area
Net interconnect area
Total cell area
Total area
PROCESSOR
32VP
556391.13
79797.41
undened
636188.54
undened
16VP
593813.89
69759.37
undened
663573.26
undened
Table III
C LOCK CYCLE TIME OF EACH VLIW PROCESSOR ( UNIT : NS )
Point
Clock clk main(rise edge)
Library setup time
Data required time
Data arrival time
Slack
32VP
Incr.
Path
2.00
2.00
-0.07
1.93
1.93
-3.67
-1.74
16VP
Incr.
Path
2.00
2.00
-0.05
1.95
1.95
-3.84
-1.90
B. Results
The synthesizable HDL codes for both of two VLIW
processors are generated by the HDL generator in Synopsys
127
UHPDLQLQJ123
52$UHSODFLQJZLWK123
52$DGGHGDVQHZ9/,:SDFNHW
Figure 15.
,GHDO
93
Figure 14.
93
93
Figure 16.
128
93
93
Figure 17.
93
129
VII. C ONCLUSION
In our approach, we attempt not only to reduce the
instruction encoding bit width into 16-bit but also to suggest
a strategy that extends the instruction bit width to 32
bits dynamically on demand with hardware and compiler
support. In this attempt, we have introduced notions of
partial instructions and remote operands which are not
encoded in the instruction word but stored in a special slot
called a ROA slot. Our compiler tries to have ROA slots
occupy as many slots in the VLIW packets as possible
that otherwise would be lled with NOPs, and synchronizes
partial instructions with ROA slots. A partial instruction
and its coupled remote operands combine together to form
a complete instruction at run time. To prove the overall
efciency of our technique, we have conducted experiments
with a set of benchmark programs. In our experiments, we
obtain 13% power reduction and 45% code size reduction on
average in comparison with the 32-bit ISA VLIW processor,
while the execution time slows down slightly. This result,
therefore, proves that the proposed strategy converts 32bit ISA into an equivalent 16-bit ISA without loss of the
original semantics, while leading to a substantial reduction
of code size along with a potentially large reduction of power
consumption and chip size at a slight run time cost.
ACKNOWLEDGMENT
This work was supported by the Korea Science and
Engineering Foundation (KOSEF) NRL Program grant (No.
2011-0018609), the Engineering Research Center of Excellence Program (Grant 2011-0000975) funded by the Korea
government (MEST), Basic National Research Foundation
of Korea (NRF) funded by MEST (No. 2011-0012522).
R EFERENCES
[1] TMS320C6x Programmers Guide, http://www.ti.com, 2003.
[2] S. Gurumurthi, A. Sivasubramaniam, M. J. Irwin, N. Vijaykrishnan,
M. Kandemir, T. Li, and L. K. John, Using Complete Machine Simulation for Software Power Estimation: The SoftWatt Approach, in
Proceedings of the 8th International Symposium on High-Performance
Computer Architecture, 2002, pp. 141.
[3] A. Suga and K. Matsunami, Introducing the FR500 Embedded
Microprocessor, IEEE Micro, vol. 20, pp. 2127, July 2000.
[4] Xtensa
Architecture
White
Paper,
http://www.tensilica.com/products/literature-docs/whitepapers/xtensa-architecture/.
[5] A. Krishnaswamy and R. Gupta, Prole guided selection of ARM
and thumb instructions, in Proceedings of the joint conference on
Languages, compilers and tools for embedded systems: software and
compilers for embedded systems, 2002, pp. 5664.
[6] A. Shrivastava and N. Dutt, Energy efcient code generation exploiting reduced bit-width instruction set architectures (rISA), in
Proceedings of the 2004 Asia and South Pacic Design Automation
Conference, 2004, pp. 475477.
130