Collated by: Professor Kurt Keutzer Computer Science 252, Spring 2000 With contributions from: Dr. Brock Barton, Clark Hise TI; Dr. Surendar S. Magar, Berkeley Concept Research Corporation
Multipliers (MUL)
Application Examples
Video/Imaging, W-CDMA, Radars, Digital Radios, High-End Control, Modems, Voice Coding, Instruments, Low-End Modems, Industrial Control
[Figure: timeline axis 1980, 1985, 1990, 1995; starting label "C and Analog"]
Time Frame | Approach | Primary Application | Enabling Technologies
Early 1970s | Discrete logic | Non-real time processing, Simulation | Bipolar SSI, MSI; FFT algorithm
Late 1970s | Building block | Military radars, Digital Comm. | Single chip bipolar multiplier; Flash A/D
Early 1980s | Single Chip DSP µP | Telecom, Control | µP architectures; NMOS/CMOS
Late 1980s | Function/Application specific chips | Computers, Communication | Vector processing; Parallel processing
Early 1990s | Multiprocessing | (not given) | Advanced multiprocessing; VLIW, MIMD, etc.
Late 1990s | Single-chip multiprocessing | (not given) | Low power single-chip DSP; Multiprocessing
TMS32010, TMS320C25, TMS320C30, TMS320C50, TMS320C2XXX, Multiprocessor Based TMS320C80, TMS320C62XX, TMS320C67XX
20 40 33 57
400 100 60 35 25
5 20 33 60 80
Features
- 200 ns instruction cycle (5 MIPS)
- 144 words (16 bit) on-chip data RAM
- 1.5K words (16 bit) on-chip program ROM - TMS32010
- External program memory expansion to a total of 4K words at full speed
- 16-bit instruction/data word
- Single cycle 32-bit ALU/accumulator
- Single cycle 16 x 16-bit multiply in 200 ns
- Two cycle MAC (5 MOPS)
- Zero to 15-bit barrel shifter
- Eight input and eight output channels
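The cycle-time and throughput figures in this feature list are related by simple arithmetic. The following is a minimal sketch, assuming one instruction completes per 200 ns cycle and that the "5 MOPS" figure counts the multiply and the add of each two-cycle MAC as separate operations (an interpretation, not something the slides state). The function names are illustrative.

```python
def instructions_per_second(cycle_ns: float) -> float:
    """Instruction rate for a machine that completes one instruction per cycle."""
    return 1e9 / cycle_ns


def mac_rate(cycle_ns: float, cycles_per_mac: int) -> float:
    """Multiply-accumulate operations completed per second."""
    return instructions_per_second(cycle_ns) / cycles_per_mac


mips = instructions_per_second(200) / 1e6      # 200 ns instruction cycle
macs = mac_rate(200, cycles_per_mac=2) / 1e6   # two-cycle MAC
ops = 2 * macs                                 # assumed: multiply + add counted separately
print(mips, macs, ops)
```

Under these assumptions the 200 ns cycle gives the quoted 5 MIPS, and a two-cycle MAC gives 2.5 million MACs per second.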
[Figure: TMS32010 memory map, 16-bit words, Microprocessor Mode. Words 0, 1, 2: Reset 1st Word, Reset 2nd Word, Interrupt. Internal Memory Space Reserved For Testing, with boundary labels 1525 and 1536. External Memory Space up to 4095.]
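[Figure: FIR filter flow. Coefficients a(n-1), a(n-2), ..., a1, a0 are multiplied with samples X0, X1, ..., X(n-1); the products are added (+) into the accumulator (Acc) until End.]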
For N=50, Indirect Addressing: t = 42 µs (23.8 kHz)
For N=50, Direct Addressing: t = 21.6 µs (40.2 kHz)
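The FIR computation and the relation between per-output time and sample rate can be sketched as follows. This is an illustration only: it assumes the times quoted above are in microseconds (which is what the 23.8 kHz figure implies for the 42 case), and the function names are hypothetical rather than taken from the slides.

```python
def fir_output(coeffs, samples):
    """One FIR output: y = sum(a[k] * x[k]), accumulated one tap at a time."""
    if len(coeffs) != len(samples):
        raise ValueError("need one sample per coefficient")
    acc = 0  # plays the role of the accumulator (Acc)
    for a, x in zip(coeffs, samples):
        acc += a * x  # one multiply-accumulate per tap
    return acc


def max_sample_rate_hz(time_per_output_s):
    """Highest sample rate if each output takes time_per_output_s to compute."""
    return 1.0 / time_per_output_s


y = fir_output([1, 2, 3], [4, 5, 6])   # 1*4 + 2*5 + 3*6
rate = max_sample_rate_hz(42e-6)        # indirect-addressing case, N = 50
print(y, round(rate / 1e3, 1), "kHz")
```

The sample rate is simply the reciprocal of the time needed to produce one output, so a faster inner loop translates directly into a higher usable bandwidth.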
- One 4K x 32-bit single-cycle dual-access on-chip ROM block
- Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks
- 64 x 32-bit instruction cache
- 32-bit instruction and data words, 24-bit addresses
- 40/32-bit floating-point/integer multiplier and ALU
- 32-bit barrel shifter
807FFFh: end of the preceding region
808000h - 8097FFh: Peripheral Bus Memory Mapped Registers (Internal) (6K)
809800h - 809BFFh: RAM Block 0 (1K) (Internal)
809C00h - 809FFFh: RAM Block 1 (1K) (Internal)
80A000h - 0FFFFFFh: External, STRB Active
Microprocessor Mode
Microcomputer Mode
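The address ranges in this memory-map fragment can be checked against the stated region sizes with a small lookup. This is a sketch under assumptions: the region boundaries are read from the fragment above, with 808000h and 80A000h taken as the six-digit forms of the values printed as "80800h" and "80A00h". The helper names are illustrative.

```python
# Address ranges as read from the memory-map fragment; 808000h and 80A000h
# are assumed six-digit forms of the values printed as "80800h" and "80A00h".
REGIONS = [
    (0x808000, 0x8097FF, "Peripheral bus memory-mapped registers (internal)"),
    (0x809800, 0x809BFF, "RAM block 0 (internal)"),
    (0x809C00, 0x809FFF, "RAM block 1 (internal)"),
    (0x80A000, 0xFFFFFF, "External, STRB active"),
]


def region_of(addr):
    """Return the name of the region containing addr, or None if not listed here."""
    for start, end, name in REGIONS:
        if start <= addr <= end:
            return name
    return None


def size_words(start, end):
    """Number of word addresses in an inclusive range."""
    return end - start + 1


for start, end, name in REGIONS[:3]:
    print(f"{start:06X}h-{end:06X}h  {size_words(start, end):5d} words  {name}")
```

With these boundaries the peripheral region spans 6K words and each RAM block spans 1K words, which agrees with the sizes given in the map.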
C54x Architecture
[Figure: multiply-accumulate. MPY forms the product an * xn and ADD sums it into y.]
[Figure: C54x datapath detail. Surviving labels: S/U, S/U, A, B, O, MUX, ALU, E Bus.]
@x, T @a, A
EXP Encoder
A B For example: A = xa
[Figure: C54x block diagram. Surviving labels: INTERNAL MEMORY, EXTERNAL MEMORY, MUXES, P, D, C, E, T, ALU, SHIFTER, MAC, A, B.]
- Prefetch: Calculate address of instruction
- Fetch: Collect instruction
- Decode: Interpret instruction
- Access: Collect address of operand
- Read: Collect operand
- Execute: Perform operation
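How the six stages listed above overlap across consecutive instructions can be sketched with a small model. It assumes an ideal pipeline, in which one new instruction enters every cycle and nothing stalls. The slides do not state that assumption, and the function names are illustrative.

```python
STAGES = ["Prefetch", "Fetch", "Decode", "Access", "Read", "Execute"]


def stage_at(instruction, cycle):
    """Stage occupied by `instruction` (0-based) at `cycle` (0-based), or None.

    Assumes an ideal pipeline: one new instruction enters every cycle and
    nothing stalls.
    """
    position = cycle - instruction
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


def completion_cycle(instruction):
    """Cycle in which `instruction` is in its Execute stage."""
    return instruction + len(STAGES) - 1


# Print a small timing chart: rows are instructions, columns are cycles.
for i in range(4):
    row = [(stage_at(i, c) or ".")[0] for c in range(9)]
    print(f"I{i}: " + " ".join(row))
```

Once the pipeline is full, one instruction reaches Execute every cycle even though each individual instruction still takes six cycles from Prefetch to Execute.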
[Figure: pipeline timing chart. Surviving cells: X3; R4 X4; A5 R5 X5; D6 A6 R6 X6.]
C62x Architecture
TMS320C6201 Revision 2
- Program Cache / Program Memory: 32-bit address, 256-bit data, 512K bits RAM
- Pwr Dwn
- Host Port Interface
- C6201 CPU Megamodule: Program Fetch, Instruction Dispatch, Instruction Decode, Control Registers, Control Logic, Test, Emulation, Interrupts
- 4 DMA
- Data Path 1: A Register File; units L1, S1, M1, D1
- Data Path 2: B Register File; units D2, M2, S2, L2
- Data Memory: 32-bit address; 8-, 16-, 32-bit data; 512K bits RAM
- Data
  - 32K x 16
  - Single Ported, Accessible by Both CPU Data Buses
  - 4 x 8K 16-bit Banks
    - 2 Possible Simultaneous Memory Accesses (4 Banks)
    - 4-Way Interleave, Banks and Interleave Minimize Access Conflicts
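The effect of four-way interleaving on simultaneous accesses can be illustrated with a toy model. It rests on assumptions the slides do not spell out: that addresses are byte addresses, that consecutive 16-bit halfwords rotate through the four banks (low-order interleave), and that two same-cycle accesses conflict only when they land in the same bank. All names are hypothetical.

```python
NUM_BANKS = 4  # "4 x 8K 16-bit banks"


def bank_of(byte_addr):
    """Bank holding the 16-bit halfword at byte_addr.

    Assumption: consecutive halfwords rotate through the banks
    (low-order interleave). The slides do not give the exact mapping.
    """
    return (byte_addr >> 1) % NUM_BANKS


def conflicts(byte_addr_1, byte_addr_2):
    """True if two same-cycle halfword accesses target the same bank."""
    return bank_of(byte_addr_1) == bank_of(byte_addr_2)


print(conflicts(0x0000, 0x0002))  # neighbouring halfwords
print(conflicts(0x0000, 0x0008))  # four halfwords apart
```

In this model, accesses to neighbouring halfwords fall in different banks and can proceed together, while accesses a multiple of four halfwords apart collide.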
- Interrupt Acknowledge (IACK) and Number (INUM) Signals
- Branch Delay Slots Protected From Interrupts
- Edge Triggered
C62x Datapaths
[Figure: C62x datapath diagram. Register file A (Registers A0 - A15) and register file B (Registers B0 - B15), linked by cross paths 1X and 2X. Functional units L1, S1, M1, D1 on the A side and D2, M2, S2, L2 on the B side, with ports labeled S1, S2, D, SL, DL. Address outputs DADR1 and DADR2. Legend: Cross Paths; 40-bit Write Paths (8 MSBs); 40-bit Read Paths/Store Paths.]
Functional Units
- L-Unit (L1, L2): 40-bit Integer ALU, Comparisons; Bit Counting, Normalization
- S-Unit (S1, S2): 32-bit ALU, 40-bit Shifter; Bitfield Operations, Branching
- M-Unit (M1, M2): 16 x 16 -> 32
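The "16 x 16 -> 32" multiply and the 40-bit integer width mentioned above can be modeled with plain integer arithmetic. This is a sketch, not the device's exact behavior: it assumes signed two's-complement operands and a wrapping (non-saturating) 40-bit accumulator, and it treats the 8 bits above bit 31 as headroom for accumulation, which is an interpretation. The helper names are illustrative.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mpy16(a, b):
    """Signed 16 x 16 -> 32 multiply."""
    return to_signed(to_signed(a, 16) * to_signed(b, 16), 32)


def add40(acc, value):
    """Add into a 40-bit accumulator, wrapping on overflow."""
    return to_signed(acc + value, 40)


product = mpy16(0x8000, 0x8000)   # (-32768) * (-32768), the largest product
acc = 0
for _ in range(4):
    acc = add40(acc, product)     # four accumulations of the largest product
print(product, acc, to_signed(acc, 32))
```

The largest 16 x 16 product still fits in 32 bits, but summing a few of them does not. The 40-bit result holds the sum, whereas truncating it to 32 bits loses it.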
[Figure: the C62x datapath diagram again (register files A0 - A15 and B0 - B15, cross paths 1X and 2X, units L1, S1, M1, D1, D2, M2, S2, L2), this time also labeling DDATA_I2 (load data) alongside DADR1 (address) and DADR2 (address). Legend: Cross Paths; 40-bit Write Paths (8 MSBs); 40-bit Read Paths/Store Paths.]
Example 1: A B C D E F G H
Example 2: A B C D E F G H
Example 3: A B C D E F G H
- Execute Packet
  - CPU executes 1 to 8 instructions/cycle
  - Fetch packets can contain multiple execute packets
PG: Program Address Generate
PS: Program Address Send
PW: Program Access Ready Wait
PR: Program Fetch Packet Receive
E1 - E5: Execute 1 through Execute 5
Execute Packet 1  PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 2     PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 3        PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 4           PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 5              PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 6                 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 7                    PG PS PW PR DP DC E1 E2 E3 E4 E5
- Parallelism
  - 8 new instructions can always be dispatched every cycle
- Signed/Unsigned Byte, Half-Word, Word, Double-Word Addressable
- Register or 5-Bit Unsigned Constant Index
  - Indexes are Scaled by Type
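Scaling an index by the access type means the index counts elements rather than bytes. The following is a minimal sketch, assuming element sizes of 1, 2, 4 and 8 bytes for byte, half-word, word and double-word accesses. The function name and the `constant_index` flag are illustrative and not part of any TI tool.

```python
SIZE_BYTES = {"byte": 1, "halfword": 2, "word": 4, "doubleword": 8}


def effective_address(base, index, access_type, constant_index=False):
    """Byte address = base + index * element size.

    With constant_index=True the index must fit in 5 unsigned bits (0..31).
    """
    if constant_index and not 0 <= index <= 31:
        raise ValueError("constant index must fit in 5 unsigned bits")
    return base + index * SIZE_BYTES[access_type]


print(hex(effective_address(0x1000, 3, "word", constant_index=True)))
print(hex(effective_address(0x1000, 3, "halfword", constant_index=True)))
```

The same index value of 3 therefore reaches a different byte offset depending on whether the access is to words or to half-words.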
C67x Architecture
- 4 Channel DMA
- Data Path 2: B Register File; units D2, M2, S2, L2
- 2 Timers
- Data Memory: 32-bit address; 8-, 16-, 32-bit data; 512K bits RAM
- Load store architecture
- 3.3-V I/Os, 1.8-V internal
- Single- and double-precision IEEE floating-point
- Dual data paths
  - 6 floating-point units / 8 x 32-bit instructions
- 4-channel bootloading DMA
- 16-bit host port interface
- 1Mbit on-chip SRAM
- 2 multichannel buffered serial ports (T1/E1)
- Pin compatible with C6201
Multiplier Unit
Floating-Point Capabilities
C67x Interrupts
- 12 Maskable Interrupts
- Non-Maskable Interrupt (NMI)
- Interrupt Return Pointers (IRP, NRP)
- Fast Interrupt Handling
  - Branches Directly to 8-Instruction Service Fetch Packet
  - 7 Cycle Overhead: Time When No Code is Running
  - 12 Cycle Latency: Interrupt Response Time
- Interrupt Acknowledge (IACK) and Number (INUM) Signals
- Branch Delay Slots Protected From Interrupts
- Edge Triggered
.M Unit
Floating Point Multiply Unit: MPYSP, MPYDP, MPYI, MPYID, MPY24, MPY24H
.S Unit
ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
C67x Datapaths
- 2 Data Paths
- 8 Functional Units
  - Orthogonal/Independent
  - 2 Floating Point Multipliers
  - 2 Floating Point Arithmetic
  - 2 Floating Point Auxiliary
- Control
  - Independent
  - Up to 8 32-bit Instructions
- Registers
  - 2 Files
  - 32, 32-bit registers total
- Cross paths (1X, 2X)
- L-Unit (L1, L2): Floating-Point, 40-bit Integer ALU; Bit Counting, Normalization
- S-Unit (S1, S2): Floating Point Auxiliary Unit; 32-bit ALU/40-bit shifter; Bitfield Operations, Branching
- M-Unit (M1, M2): Multiplier: Integer & Floating-Point
- D-Unit (D1, D2): 32-bit add/subtract; Addr Calculations
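[Figure: C67x datapath diagram. Register file A (Registers A0 - A15) and register file B (Registers B0 - B15), linked by cross paths 1X and 2X. Functional units L1, S1, M1, D1 on the A side and D2, M2, S2, L2 on the B side, with ports labeled S1, S2, D, SL, DL.]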
Example 1: A B C D E F G H
Example 2: A B C D E F G H
Example 3: A B C D E F G H
- Fetch Packet
  - CPU fetches 8 instructions/cycle
- Execute Packet
  - CPU executes 1 to 8 instructions/cycle
  - Fetch packets can contain multiple execute packets
- Parallelism determined at compile/assembly time
- Examples
  1) 8 parallel instructions
  2) 8 serial instructions
  3) Mixed Serial/Parallel Groups
     - A // B
     - C
     - D
     - E // F // G // H
- Reduces
  - Codesize
  - Number of Program Fetches
  - Power Consumption
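The way one eight-instruction fetch packet breaks into execute packets can be sketched in a few lines. The representation is an assumption made for illustration: each instruction carries a boolean saying whether the next instruction issues in the same cycle, and the function name is hypothetical.

```python
def execute_packets(fetch_packet):
    """Split a fetch packet into execute packets.

    fetch_packet is a list of (name, parallel_with_next) pairs; the flag says
    whether the following instruction issues in the same cycle. This flag
    encoding is an illustrative assumption.
    """
    packets, current = [], []
    for name, parallel_with_next in fetch_packet:
        current.append(name)
        if not parallel_with_next:
            packets.append(current)
            current = []
    if current:  # trailing instructions whose chain runs past the packet end
        packets.append(current)
    return packets


# Example 3 from the slide: A // B, C, D, E // F // G // H
mixed = [("A", True), ("B", False), ("C", False), ("D", False),
         ("E", True), ("F", True), ("G", True), ("H", False)]
for cycle, packet in enumerate(execute_packets(mixed), start=1):
    print(cycle, " // ".join(packet))
```

Because the grouping is fixed at compile or assembly time, the hardware only has to read the flags and needs no run-time dependence checking to decide what issues together.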
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
- Operate in Lock Step
- Fetch
  - PG: Program Address Generate
  - PS: Program Address Send
  - PW: Program Access Ready Wait
  - PR: Program Fetch Packet Receive
- Decode
  - DP: Instruction Dispatch
  - DC: Instruction Decode
- Execute
  - E1 - E5: Execute 1 through Execute 5
  - E6 - E10: Double Precision Only

Execute Packet 1  PG PS PW PR DP DC
Execute Packet 2     PG PS PW PR DP
Execute Packet 3        PG PS PW PR
Execute Packet 4           PG PS PW
Execute Packet 5              PG PS
Execute Packet 6                 PG
Execute Packet 7
4 Delay Slots
Branch Target
[Figure: CPU block diagram. M-Unit 1 and M-Unit 2: Multiplier Unit. D-Unit 1 and D-Unit 2: Data Load/Store. S-Unit 1 and S-Unit 2: Auxiliary Logic Unit. L-Unit 1 and L-Unit 2: Arithmetic Logic Unit. Other blocks: Control Registers, Emulation, Register file, Decode.]
Copyright 1999