Vector Unit Architecture For Emotion Synthesis
TWO VECTOR UNITS EMBEDDED IN THE EMOTION ENGINE CHIP SUPPORT
ENTERTAINMENT SYSTEM.
Atsushi Kunimatsu, Nobuhiro Ide, Toshinori Sato, Yukio Endo, Hiroaki Murakami, Takayuki Kamei, Masashi Hirano, Fujio Ishihara, Haruyuki Tago
Toshiba Corporation

Masaaki Oka, Akio Ohba, Teiji Yutaka, Toyoshi Okada, Masakazu Suzuoki
Sony Computer Entertainment

Processors designed for computer entertainment must perform 3D graphics calculations, especially geometry and perspective transformations. In the PlayStation2, we introduced the idea of synthesizing emotion, called Emotion Synthesis, and devised a new processor architecture to support its graphics demands [1]. The architecture is embodied in the PlayStation2’s Emotion Engine CPU [2], which uses vector units (VUs) [3] as the key units for floating-point calculations.

Emotion synthesis means the real-time synthesis of a computer graphics animation scene that projects a great deal of atmosphere. For example, when a female character walks into a video game scene, her motion must be determined by solving physical equations in response to interactive events instead of replaying prerecorded data. Moreover, differential equations with a large number of variables must be used to describe, for example, the waving motions of her hair in a breeze. For authenticity in emotion synthesis, the CPU must execute these calculations in real time.

The Emotion Engine has achieved a peak performance of 5.5 Gflops at an operating frequency of 300 MHz. Its vector units operate in two modes: VLIW mode, for use as a stand-alone processor, and coprocessor mode, for use as a MIPS COP2. This arrangement allows a vector unit to simultaneously execute huge amounts of 3D graphics calculations and flexible calculations in collaboration with the CPU core [2]. Both calculation types are indispensable in producing emotion synthesis. By employing software pipelining techniques, a vector unit can execute 3D perspective transformations with seven-cycle throughput at 300 MHz (85 Mvectors/sec). The vector unit in the 0.25-micron CMOS Emotion Engine chip contains 5.8 million transistors in 8.76 mm × 7.87 mm.

Architecture design strategy
The 3D graphics calculations in emotion synthesis require perspective transformation and lighting calculations, which are achieved through well-established algorithms. We designed vector unit VU1 for this purpose, giving it larger instruction and data memories in comparison with vector unit VU0 and equipping it with better stand-alone performance. There is no direct control path from the CPU core.

An example of flexible calculation is a motion calculation of a body or a liquid; their algorithms are rich in variety. To collaborate with the CPU core, we closely coupled VU0 with the CPU core via a 128-bit coprocessor bus. An example of the roles that VU0 and VU1 play in a fighting game is a scene that consists of main characters fighting each other and background objects such as buildings or a cheering audience. Since the main characters need to move and change their shapes in real time for a player’s
[Figure 1. Emotion Engine block diagram; image not reproduced. Labels legible in the extraction: GIF; instruction memory and data memory (16 Kbytes each); instruction memory and data memory (4 Kbytes each); instruction cache; data cache; scratchpad RAM; VIF0; VIF1; system bus (128 bits); image processing unit; 10-channel DMAC; memory interface; I/O interface.]
input, the flexible calculations in VU0 process these needs. On the other hand, VU1 processes the background objects, which consist of many polygons.

The vector units feature four parallel fMACs (floating-point multiply accumulation units) [4], a high-speed fDIV (floating-point division unit), a broadcast mechanism, and a VLIW architecture. To compute 4 × 4 matrix operations efficiently, we use the four parallel fMACs with the broadcast mechanism. We also employ VLIW instruction formats and the high-speed fDIV for efficient perspective transformations and vector normalizations.

VU architecture
Figure 1 shows the Emotion Engine block diagram. The chip contains three processors: the CPU core, VU0, and VU1. The CPU core [2, 5] contains 128-bit registers, 128-bit ALUs for multimedia processing, and a coprocessor interface for VU0 and VU1.

A vector processing unit (VPU) block includes vector units VPU0 and VPU1, instruction and data memories, and some interface units. The interfaces are VU interface 0 (VIF0), VU interface 1 (VIF1), and graphics synthesizer interface (GIF). These interfaces are responsible for DMA access from and to the vector unit memories together with data expansion and compression. By using these interfaces, the vector unit, and its memory double-buffering techniques simultaneously, we achieve nonstop, continuous processes.

As mentioned earlier, while the VLIW mode is available in both VU0 and VU1, the coprocessor mode is available only in VU0. For the interfaces, VU0 also includes the coprocessor interface and VIF0, while VU1 includes VIF1 and the graphics synthesizer interface. VU0 includes a 4-Kbyte instruction RAM and a 4-Kbyte data RAM. VU1 includes a 16-Kbyte instruction RAM and a 16-Kbyte data RAM. VU1 also has an elementary function unit (EFU). By using the fMAC’s 0.6 Gflops and the fDIV’s 0.04 Gflops for performance calculation, the VU0 reaches a peak performance of 2.44 Gflops, and VU1 achieves 3.08 Gflops for a total performance of 5.52 Gflops.

VU1 is a stand-alone processor mainly responsible for conventional 3D graphics calculations. Therefore VU1 must process larger amounts of data and calculations than the
MARCH–APRIL 2000 41
[Figure 2. Block diagram of VPU1; image not reproduced. Labels legible in the extraction: fMACx, fMACy, fMACz, fMACw; floating registers (128 bits × 32); integer registers (16 bits × 16); iALU; LSU; fDIV; EFU; fMAC; RANDU/etc; VIF; GIF; Path 3; system bus; bus widths of 16, 64, and 128.]
VU0. VU1 has four times more memory than VU0 and the additional elementary function unit. This unit includes an fMAC and an fDIV to calculate elementary functions such as exp, sin, and so on.

However, this job assignment for the vector units is just one of the examples of emotion synthesis. Individual user programmers can alter it, depending on interpretation of the chip. For example, they can execute conventional 3D graphics calculations on both of the vector units.
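The peak-performance figures quoted for VU0 and VU1 can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the article (300 MHz, 0.6 Gflops per fMAC, 0.04 Gflops per fDIV, four fMACs and one fDIV per vector unit, plus one extra fMAC and fDIV in VU1's elementary function unit). Counting a multiply-accumulate as two floating-point operations per cycle is an assumption that happens to reproduce the 0.6-Gflops figure; the function name `peak_gflops` is illustrative.

```python
# Peak-performance arithmetic from figures quoted in the article.
CLOCK_HZ = 300e6      # operating frequency
FDIV_CYCLES = 7       # fDIV throughput/latency in cycles

# Assumption: one fMAC counts as a multiply plus an add per cycle.
fmac_gflops = 2 * CLOCK_HZ / 1e9                       # 0.6
# One fDIV result every seven cycles, rounded as in the article.
fdiv_gflops = round(CLOCK_HZ / FDIV_CYCLES / 1e9, 2)   # 0.04


def peak_gflops(n_fmac, n_fdiv):
    """Peak Gflops of a unit with the given fMAC and fDIV counts."""
    return n_fmac * fmac_gflops + n_fdiv * fdiv_gflops


vu0 = peak_gflops(4, 1)   # four parallel fMACs and one fDIV
vu1 = peak_gflops(5, 2)   # the same, plus the EFU's fMAC and fDIV
total = vu0 + vu1

print(f"VU0 {vu0:.2f}  VU1 {vu1:.2f}  total {total:.2f} Gflops")
```

Under these assumptions the sums come out to 2.44, 3.08, and 5.52 Gflops, matching the values in the text.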
42 IEEE MICRO
[Instruction format diagram; image not reproduced. Fields legible in the extraction, most significant bit first:]
Upper 32 bits (bits 63-32): flags I, E, M, D, T and two unused bits (1 bit each); dest (4 bits); ft reg (5 bits); fs reg (5 bits); fd reg (5 bits); opcode (4 bits); bc (2 bits). Example shown: MADD with broadcast (FMAC), opcode 0010.
Lower 32 bits (bits 31-00): lower OP (7 bits); dest (4 bits); ft reg (5 bits); is reg (5 bits); 11-bit field. Example shown: LQI (load/store), lower OP 1000000, 11-bit field 01101 1111 00.
Figure 4. Instruction format to support two modes: 64-bit VLIW (a) and 32-bit COP2 (b).
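As a rough illustration of the 64-bit VLIW format, the sketch below splits an instruction word into its upper and lower 32-bit halves and extracts bit fields. The field boundaries are an assumption read from the widths legible in the instruction format figure (7 flag bits, 4-bit dest, three 5-bit register fields, 4-bit opcode, 2-bit broadcast field for the upper half; 7-bit lower opcode, 4-bit dest, two 5-bit register fields, and an 11-bit field for the lower half). The function and key names are hypothetical, not taken from any Sony or Toshiba tool.

```python
def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of word as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def split_vliw(word64):
    """Split a 64-bit VLIW word into (upper, lower) 32-bit instructions."""
    return bits(word64, 63, 32), bits(word64, 31, 0)


def decode_upper(upper32):
    # Assumed layout: flags | dest | ft | fs | fd | opcode | bc
    return {
        "flags": bits(upper32, 31, 25),   # I, E, M, D, T and two unused bits
        "dest": bits(upper32, 24, 21),
        "ft": bits(upper32, 20, 16),
        "fs": bits(upper32, 15, 11),
        "fd": bits(upper32, 10, 6),
        "op": bits(upper32, 5, 2),
        "bc": bits(upper32, 1, 0),
    }


def decode_lower(lower32):
    # Assumed layout: lower opcode | dest | ft | is | 11-bit field
    return {
        "lower_op": bits(lower32, 31, 25),
        "dest": bits(lower32, 24, 21),
        "ft": bits(lower32, 20, 16),
        "is": bits(lower32, 15, 11),
        "field11": bits(lower32, 10, 0),
    }


# Build a word from the example encodings legible in the figure:
# upper opcode 0010 (MADD with broadcast), lower opcode 1000000 (LQI).
# Register numbers and the bc value are arbitrary illustrative choices.
upper = (0b1111 << 21) | (3 << 16) | (2 << 11) | (1 << 6) | (0b0010 << 2) | 0b01
lower = (0b1000000 << 25) | 0b01101111100
word = (upper << 32) | lower

up, lo = split_vliw(word)
print(decode_upper(up))
print(decode_lower(lo))
```

In coprocessor mode only one 32-bit half would be present per COP2 instruction, so a decoder along these lines would apply `decode_upper` or `decode_lower` alone after stripping the COP2 header, whose layout the text does not give.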
Figure 2 shows the block diagram of the VPU1 [5]. The major internal blocks of the vector unit are

1. an upper execution unit with four parallel fMACs,
2. a lower execution unit with fDIV, load/store, iALU, and branch,
3. 128-bit × 32 floating registers, and
4. 16-bit × 16 integer registers.

Table 1 lists the VU0 and VU1 features.

In the VLIW mode, each 64-bit VLIW instruction format is split into upper instruction and lower instruction parts. In the coprocessor mode, each 32-bit COP2 instruction consists of a COP2 header and an “opcode body” of the upper or lower instruction parts. Only an upper or a lower instruction can be selected in a COP2 instruction.

Figure 3 provides operation examples of the four parallel fMACs. The upper half of this figure shows a normal four-parallel SIMD multiply-accumulation operation. The lower half shows a four-parallel SIMD multiply-accumulation operation with broadcasting.

Figure 4 shows the VLIW mode pipeline stages and the coprocessor mode pipeline stages. The fMAC executes sequential single-precision floating-point multiply-accumulate operations with a throughput of one cycle. The fDIV executes a single-precision floating-point divide/square-root operation with a throughput/latency of seven cycles. Floating-point numbers calculated by the VU are compatible with the IEEE-754 format, while its rounding technique is not.

The coprocessor mode has two kinds of instructions. One corresponds to an upper instruction, a lower instruction, and a COP2 instruction. The other is a “call VLIW mode” instruction. Prior to using this instruction, the program stores VLIW mode instructions and large data to the VU0 memory first using DMA via VIF0. Then the program executes a call VLIW mode instruction as a subroutine call instruction. For example, in the case of calculating a physical dynamics simulation, an inner loop may be coded as a VLIW mode program, while the CPU core controls complicated flows.

Instruction sets
The vector units have various instruction sets. The VLIW mode has 164 instructions: 95 upper instructions (which include 68 instructions with broadcasts) and 69 lower instructions. The coprocessor mode has 130 instructions, including most of the upper instructions and the lower instructions, and all of the COP2 instructions.

Table 2 (next page) summarizes the upper instruction varieties.
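The two fMAC operation styles described for Figure 3, a plain four-lane multiply-accumulate and one with broadcasting, can be modeled in a few lines. This is a behavioral sketch, not a description of the hardware: the lane order (index 0 for x through index 3 for w), the column-wise matrix layout, and the function names `fmac4`, `fmac4_bc`, and `transform` are all assumptions made for illustration.

```python
X, Y, Z, W = range(4)   # assumed lane numbering


def fmac4(acc, a, b):
    """Four-parallel multiply-accumulate: lane i computes acc[i] + a[i] * b[i]."""
    return [acc[i] + a[i] * b[i] for i in range(4)]


def fmac4_bc(acc, a, b, lane):
    """Multiply-accumulate with broadcast: one field of b feeds all four lanes."""
    return [acc[i] + a[i] * b[lane] for i in range(4)]


def transform(columns, v):
    """4 x 4 matrix times vector as four broadcast multiply-accumulates.

    columns[k] holds column k of the matrix, so each step adds
    column k scaled by the k-th field of v to the accumulator.
    """
    acc = [0.0, 0.0, 0.0, 0.0]
    for lane in (X, Y, Z, W):
        acc = fmac4_bc(acc, columns[lane], v, lane)
    return acc


# A translation by (5, 6, 7) applied to the point (1, 2, 3, 1).
translate = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [5.0, 6.0, 7.0, 1.0],
]
print(transform(translate, [1.0, 2.0, 3.0, 1.0]))
```

The broadcast form is what lets a matrix-vector product finish in four accumulate steps: without it, each vector field would first have to be copied into all four lanes of a register.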
[Figure 5, panels (a) and (b); image not reproduced. Each panel shows Source register 1 and Source register 2, each with fields w, z, y, x, feeding fMACw, fMACz, fMACy, and fMACx, which accumulate into ACCw, ACCz, ACCy, and ACCx.]
Figure 5. Geometry transformation pipeline: four parallel FMADDs without broadcast (a) and with broadcast (b).

The operations in the lower instruction are

• Division/square root/reciprocal square root;

Examples
By employing a broadcast mechanism and the four parallel fMACs with a throughput of one cycle, the vector unit can execute a 4 × 4-matrix geometry transformation with only four instructions (see Figure 5a,b). By using two operation issues in the VLIW mode, the fDIV with seven-cycle latency, a 128-bit load/store unit, and software pipelining techniques, the vector unit can execute a perspective transformation with a throughput of seven cycles.

The flow of software pipelining for perspective transformation follows:

1. Read four single-precision floating numbers (x, y, z, w) by a 128-bit load instruction with address post increment.
2. Execute geometry transformation.
3. Calculate 1/w using the results of the geometry transformation, which move to a temporary register to keep the current x, y, z.
4. Multiply x, y, z in the temporary register by 1/w.
5. Write final results to memory by a 128-bit store instruction with address post increment.

With this flow, vector unit performance reaches as high as 85 Mvectors/sec at 300 MHz. This seven-cycle loop of perspective transformation includes load/store operations of vertex data and loop controls.
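The five-step perspective transformation flow can be written out sequentially to make the data movement explicit. The sketch below is a plain Python model under several assumptions: the matrix is stored column-wise, the fourth output field holds 1/w (the text does not say what is stored there), and the names `perspective_transform` and `mvectors_per_sec` are illustrative. It does not model the software pipelining itself, in which the steps for successive vertices overlap so that one vertex completes every seven cycles.

```python
CLOCK_HZ = 300e6     # operating frequency from the article
LOOP_CYCLES = 7      # cycles per vertex in the pipelined loop


def perspective_transform(columns, vertices):
    """Apply a 4 x 4 transform and perspective divide to each vertex."""
    results = []
    for v in vertices:                       # step 1: load (x, y, z, w)
        # Step 2: geometry transformation (matrix times vector).
        t = [sum(columns[k][i] * v[k] for k in range(4)) for i in range(4)]
        # Step 3: reciprocal of the transformed w (done by the fDIV).
        inv_w = 1.0 / t[3]
        # Step 4: scale the transformed x, y, z by 1/w.
        x, y, z = t[0] * inv_w, t[1] * inv_w, t[2] * inv_w
        # Step 5: store; keeping 1/w in the last field is an assumption.
        results.append((x, y, z, inv_w))
    return results


def mvectors_per_sec(n_units):
    """Throughput in Mvectors/sec for n_units running the 7-cycle loop."""
    return n_units * CLOCK_HZ / LOOP_CYCLES / 1e6


identity = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(perspective_transform(identity, [(2.0, 4.0, 6.0, 2.0)]))
print(round(mvectors_per_sec(1), 1), round(mvectors_per_sec(2), 1))
```

One vertex every seven cycles at 300 MHz is about 43 Mvectors/sec for a single vector unit, so the 85 Mvectors/sec figure appears to correspond to two units running the loop, which agrees with the article's statement that Figure 6 reflects the combined performance of VU0 and VU1.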
Performance
Figure 6 compares our chip’s performance with that of the Intel Pentium III SSE. The right bar indicates the performance of the Emotion Engine at 300 MHz, and the left bar shows that of the Pentium III SSE at 600 MHz. Table 3 lists the conditions and assumptions for this comparison.

For the vector unit performance values, we developed an actual program and counted the number of execution clocks. Figure 6 reflects the combined performance of VU0 and VU1. For Pentium III SSE, we assume the maximum software pipelining efficiency where the latencies of the instructions limit the performance. We employ the latency values from Intel reference manuals [6]. In the Pentium III SSE, we do not use approximate functions in order to keep the same condition on computational precisions. With respect to physical dynamics simulations frequently processed in emotion synthesis, we believe approximate functions are not usable.

[Figure 6 chart; image not reproduced. Vertical axis: Mvectors/sec, with ticks every 20. Horizontal axis: geometry transformation, perspective transformation, distance, 1/distance. Legend: Pentium III (600 MHz), Emotion Engine (300 MHz). Data labels legible in the extraction: 150, 85, 85, 75, 46, 33, 21, 13; their assignment to individual bars is not recoverable.]
Figure 6. Performance comparison: Emotion Engine (right bars) and Pentium III SSE (left bars).

Table 3. Conditions and assumptions for performance comparison in Figure 6.

Task                              Critical operation
Geometry transformation           4 × 4 parallel multiply-adds
Perspective transformation        Division
Distance calculation              Square root
Reciprocal distance calculation   Reciprocal square root

This performance chart reveals that the Emotion Engine at 300 MHz performs at
Implementation
We implemented the Emotion Engine in
8.76 mm × 7.87 mm.

[Chip layout figure; image not reproduced. Labels legible in the extraction: Vector unit 0, fMAC, fDIV, EFU, RAM, 4 Kbytes, 16 Kbytes, DMAC, TLB, SPRAM, PLL.]
Yvette, France, ISBN 2-86332-246-X, 1999. pp. 106-109.
5. M. Oka and M. Suzuoki, “Designing and Programming the Emotion Engine,” IEEE
is working on the design of a microprocessor core and a system LSI chip for entertainment applications. Earlier, he was a visiting scholar at the Computer Science Department of the University of Illinois.

Toshinori Sato is an associate professor in the Department of Artificial Intelligence at Kyushu Institute of Technology in Iizuka, Japan. Sato holds BE, ME, and PhD degrees in electronic engineering from Kyoto University. He is a member of the IEEE Computer Society; ACM; Institute of Electronics, Information and Communication Engineers; and Information Processing Society of Japan.

Masaaki Oka, Akio Ohba, Teiji Yutaka, Toyoshi Okada, and Masakazu Suzuoki work for Sony Computer Entertainment in Tokyo. Oka holds a BS degree in science from Kyoto University. Ohba works in the Architecture Laboratory and holds an MS degree in biophysical engineering from Osaka University. Yutaka received his BSEE from Keio University, Tokyo, and develops real-time computer graphics systems such as the PlayStation. Okada received the BEng degree in electrical engineering from Keio University and develops video game consoles such as the PlayStation. Suzuoki works in the Software Development Department and holds an MS degree in electronic engineering from Tokyo University.

Direct comments concerning this article to Atsushi Kunimatsu, System ULSI Engineering Laboratory, Toshiba Corporation; atsushi.kunimatsu@toshiba.co.jp.