Computer Architecture
Course ID: IT3283
Chapter 8
PARALLEL ARCHITECTURES
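[Figure: one control unit (CU) sends a single instruction stream (IS) to one processing unit (PU), which exchanges a single data stream (DS) with the memory unit (MU).]
[Figure: one CU broadcasts a single IS to processing units PU1 ... PUn; each PU exchanges its own DS with a local memory LM1 ... LMn.]
[Figure: control units CU1 ... CUn each send their own IS to PU1 ... PUn; every PU exchanges a DS with a shared memory.]
[Figure: CU1 ... CUn each send their own IS to PU1 ... PUn; each PU exchanges a DS with its local memory LM1 ... LMn, and the nodes are joined by a high-performance interconnection network.]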
Figure 8-26. Three bus-based multiprocessors. (a) Without caching. (b) With
caching. (c) With caching and private memories.
If the bus is busy when a CPU wants to read or write memory, the CPU just
waits until the bus becomes idle. Herein lies the problem with this design. With
two or three CPUs, contention for the bus will be manageable; with 32 or 64 it will
be unbearable. The system will be totally limited by the bandwidth of the bus, and
most of the CPUs will be idle most of the time.
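The saturation effect described above can be sketched with a deliberately crude model. It assumes that every CPU would like to hold the bus for the same fixed fraction of its time (`bus_fraction`, a hypothetical parameter that does not come from the source) and that a saturated bus is shared evenly. The function name `cpu_efficiency` is likewise only illustrative.

```python
def cpu_efficiency(n_cpus, bus_fraction):
    """Fraction of time each CPU does useful work under a simple saturation model.

    Assumption (not from the source): each CPU wants the bus for `bus_fraction`
    of its time, and once total demand exceeds the bus capacity of 1.0 the
    available bandwidth is divided evenly among the CPUs.
    """
    demand = n_cpus * bus_fraction        # total requested share of the bus
    if demand <= 1.0:
        return 1.0                        # bus is not the bottleneck
    return 1.0 / demand                   # every CPU is throttled equally


if __name__ == "__main__":
    for n in (2, 3, 32, 64):
        print(n, cpu_efficiency(n, bus_fraction=0.1))
```

Under these assumptions a handful of CPUs run at full speed, while with 32 or 64 CPUs each one spends most of its time waiting, which is the behaviour the paragraph describes.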
NKK-CA2021.1.0 IT3283-Computer Architecture
The solution is to add a cache to each CPU, as depicted in Fig. 8-26(b). [...]
SMP or UMA (Uniform Memory Access)
[Figure labels: MMU; four local buses; system bus]
Figure 8-32. A NUMA machine based on two levels of buses. The Cm* was the
first multiprocessor to use this design.
- There is a single address space shared by all CPUs
- Each CPU can remotely access the memory of another CPU
- Remote memory access is slower than local memory access
Memory coherence is guaranteed in an NC-NUMA machine because no caching is
present. Each word of memory lives in exactly one location, so there is no
danger of one copy having stale data: there are no copies. Of course, it now
matters a great deal which page is in which memory because the performance
penalty for being in the wrong place is so high. Consequently, NC-NUMA machines
use elaborate software to move pages around to maximize performance.
Typically, a daemon process called a page scanner runs every few seconds.
Its job is to examine the usage statistics and move pages around in an attempt
to improve performance. If a page appears to be in the wrong place, the page
scanner [...]
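A minimal sketch of one page-scanner pass follows. The source text breaks off before saying what the scanner actually does with a misplaced page, so the policy here is an assumption: a page is moved directly to the node that referenced it most, provided that node's count is at least `min_gain` times the home node's count. `Page`, `scan`, and `min_gain` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    home: int                                      # node whose memory holds the page
    accesses: dict = field(default_factory=dict)   # node id -> references since last scan


def scan(pages, min_gain=2.0):
    """Run one pass of a hypothetical page scanner; return ids of moved pages.

    Assumed policy (not from the source): move a page when some remote node
    referenced it at least `min_gain` times as often as its home node did.
    """
    moved = []
    for page_id, page in pages.items():
        if not page.accesses:
            continue                               # no statistics, nothing to decide
        busiest = max(page.accesses, key=page.accesses.get)
        local = page.accesses.get(page.home, 0)
        if busiest != page.home and page.accesses[busiest] >= min_gain * max(local, 1):
            page.home = busiest                    # simplified: migrate immediately
            moved.append(page_id)
        page.accesses.clear()                      # start a fresh measurement interval
    return moved


pages = {
    0: Page(home=0, accesses={0: 1, 1: 10}),       # mostly used by node 1
    1: Page(home=0, accesses={0: 5, 1: 6}),        # roughly balanced, stays put
}
moved_pages = scan(pages)
```

The threshold exists because moving a page has a cost of its own; a page with roughly balanced usage is left where it is.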
Multicore processors
Evolution of the processor:
- Pipeline
- Multithreading
- Multicore: multiple CPUs on one chip
[Figure 18.1 Alternative Chip Organizations. Recoverable panel title: (c) Multicore, showing processors 1 to n (each superscalar or SMT), each with its own L1-I and L1-D cache, and an L2 cache. Labels from the other panels: issue logic, instruction fetch unit, execution units and queues, PC 1 to PC n, registers 1 to n, L2 cache.]
[Figure: multicore cache organizations. Recoverable panel titles: (a) Dedicated L1 cache, (b) Dedicated L2 cache. Labels: L2 cache, main memory, I/O.]
Intel - Core Duo
- 2006
- Dedicated L1 cache per core
[...] to manage chip heat dissipation to maximize processor performance within
thermal constraints. Thermal management also improves ergonomics with a cooler
system and lower fan acoustic noise. In essence, the thermal management unit
monitors digital sensors for high-accuracy die temperature measurements. Each
core can be defined as an independent thermal zone. The maximum temperature for
each [...]
[Figure: Intel Core Duo block diagram. Per core: thermal control, architectural state, execution resources, 32-kB L1 caches. Shared: L2 cache, bus interface, front-side bus.]
[Figure 18.10: Intel Core i7-990X. Six cores, each with a 32 kB L1-I and a 32 kB L1-D cache; 12 MB L3 cache.]
The general structure of the Intel Core i7-990X is shown in Figure 18.10. Each
core has its own dedicated L2 cache and the six cores share a 12-MB L3 cache.
One mechanism Intel uses to make its caches more effective is prefetching, in
which [...]
8.3. Distributed-memory multiprocessing
[...] 8.3.3. Many different topologies, switching schemes, and routing algorithms
are used. What all multicomputers have in common is that when an application
program executes the send primitive, the communication processor is notified and
transmits a block of user data to the destination machine (possibly after first
asking for and getting permission). A generic multicomputer is shown in Fig. 8-36.
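The division of labour between the application and the communication processor can be sketched in a single process, with threads standing in for hardware. This is only an illustration: the `Node` class, its `send` and `receive` methods, and the dictionary used as a stand-in for the interconnect are all hypothetical, and the optional permission handshake mentioned above is omitted.

```python
import queue
import threading


class Node:
    """One multicomputer node: an application side plus a communication processor."""

    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network            # hypothetical interconnect: node id -> Node
        self.outbox = queue.Queue()       # blocks handed to the communication processor
        self.inbox = queue.Queue()        # blocks delivered from other nodes
        network[node_id] = self
        worker = threading.Thread(target=self._communication_processor, daemon=True)
        worker.start()

    def send(self, dest, data):
        """The send primitive: notify the communication processor and return."""
        self.outbox.put((dest, bytes(data)))

    def receive(self, timeout=1.0):
        """Block until a message arrives; returns (source node id, data)."""
        return self.inbox.get(timeout=timeout)

    def _communication_processor(self):
        # Runs independently of the application and transmits queued blocks.
        while True:
            dest, block = self.outbox.get()
            self.network[dest].inbox.put((self.node_id, block))


network = {}
node_a = Node(0, network)
node_b = Node(1, network)
node_a.send(1, b"hello")
message = node_b.receive()
```

The point of the design is that `send` returns as soon as the block is queued; the transfer itself is carried out by the separate communication thread, much as the CPU hands the work to the communication processor.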
[Figure 8-36: a generic multicomputer. Each node contains a CPU, memory, disks, I/O, a local interconnect, and a communication processor.]
- Cluster computers (clusters)
In Fig. 8-36 we see that multicomputers are held together by interconnection
networks. Now it is time to look more closely at these interconnection networks.
Interestingly enough, multiprocessors and multicomputers are surprisingly similar
in this respect because multiprocessors often have multiple memory modules that
must also be interconnected with one another and with the CPUs. Thus the material
in this section frequently applies to both kinds of systems.
Figure 8-37. Various topologies. The heavy dots represent switches. The CPUs
and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree.
(d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.
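As a small illustration of the last two topologies in the caption, a hypercube can be described by numbering its switches in binary: two switches are linked exactly when their numbers differ in one bit. The helper names below are hypothetical; the sketch only shows how neighbours and hop counts follow from that numbering.

```python
def hypercube_neighbors(node, dims):
    """Neighbours of `node` in a `dims`-dimensional hypercube (flip one bit each)."""
    return [node ^ (1 << bit) for bit in range(dims)]


def hop_count(src, dst):
    """Minimum number of links between two nodes: the bits in which they differ."""
    return bin(src ^ dst).count("1")


DIMS = 4                                   # the 4D hypercube of panel (h)
nodes = range(2 ** DIMS)                   # 16 switches
diameter = max(hop_count(a, b) for a in nodes for b in nodes)
```

For the 4D case every switch has four neighbours and the longest shortest path is four hops; the cube of panel (g) is the same construction with `DIMS = 3`.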
Massively Parallel Processors
Figure 8-39. The BlueGene/P: (a) chip. (b) card. (c) board. (d) cabinet.
(e) system.
The cards are mounted on plug-in boards, with 32 cards per board for a total of
32 chips (and thus 128 CPUs) per board. Since each card contains 2 GB of
DRAM, the boards contain 64 GB apiece. One board is illustrated in Fig. 8-39(c).
At the next level, 32 of these boards are plugged into a cabinet, packing 4096
CPUs into a single cabinet. A cabinet is illustrated in Fig. 8-39(d).
Finally, a full system, consisting of up to 72 cabinets with 294,912 CPUs, is
depicted in Fig. 8-39(e). A PowerPC 450 can issue up to 6 instructions/cycle, thus [...]
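The packaging figures quoted for the BlueGene/P can be checked with a few lines of arithmetic. The constants are taken from the text; the number of CPUs per chip is not stated directly here and is derived as 128 CPUs / 32 chips. The variable names are only illustrative.

```python
CARDS_PER_BOARD = 32
CHIPS_PER_CARD = 1          # 32 cards give 32 chips per board
CPUS_PER_CHIP = 4           # derived: 128 CPUs / 32 chips
DRAM_GB_PER_CARD = 2
BOARDS_PER_CABINET = 32
MAX_CABINETS = 72

cpus_per_board = CARDS_PER_BOARD * CHIPS_PER_CARD * CPUS_PER_CHIP    # 128
dram_gb_per_board = CARDS_PER_BOARD * DRAM_GB_PER_CARD               # 64
cpus_per_cabinet = BOARDS_PER_CABINET * cpus_per_board               # 4096
cpus_per_system = MAX_CABINETS * cpus_per_cabinet                    # 294912
```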
[...] hold exactly 80 PCs and switches can be larger or smaller than 128 ports;
these are just typical values for a Google cluster.
[Figure: a typical Google cluster. Labels: OC-12 fiber, OC-48 fiber, two gigabit Ethernet links, 80-PC rack.]
- Streaming multiprocessor
- 8 × Streaming processors
- Each SM has 32 CUDA cores.
- Each CUDA (Compute Unified Device Architecture) core has one FPU and one integer unit (IU)
[Figure: Fermi streaming multiprocessor (SM). Labels: instruction cache, two warp schedulers, register file (32,768 x 32-bit), CUDA cores (each with an FP unit, an INT unit, and a result queue), LD/ST units, SFUs, interconnect network.]
Third Generation Streaming Multiprocessors (SM)
The third generation SM introduces several architectural innovations that make
it not only the most powerful SM yet built, but also the most programmable and
efficient.
512 High Performance CUDA cores
[...] designs. Each CUDA core has a pipelined integer arithmetic logic unit
(ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating
point arithmetic. The Fermi architecture implements the new IEEE 754-2008
floating-point standard, providing the fused multiply-add (FMA) instruction for
both single and double precision arithmetic. FMA improves over a multiply-add [...]
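The difference between a fused and an unfused multiply-add can be illustrated in plain Python. This is a sketch, not GPU code: a fused multiply-add rounds only once, after the exact product and sum, and that behaviour is emulated here with exact rational arithmetic from the standard `fractions` module. The helper names `multiply_then_add` and `fused_multiply_add` are hypothetical, and the operands are chosen so that the intermediate rounding matters.

```python
from fractions import Fraction


def multiply_then_add(a, b, c):
    """Unfused: the product is rounded to a double, then the sum is rounded again."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulated FMA: compute a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30        # exact product is 1 - 2**-60
c = -1.0

unfused = multiply_then_add(a, b, c)    # product rounds to 1.0, so the small term is lost
fused = fused_multiply_add(a, b, c)     # the small term survives the single rounding
```

With these inputs the unfused version returns exactly zero, while the emulated fused version keeps the tiny non-zero result.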
Figure 6. GA100 Full GPU with 128 SMs (A100 Tensor Core GPU has 108 SMs)
A100 SM Architecture
The new A100 SM significantly increases performance, builds upon features introduced in both [...]
NVIDIA A100 Tensor Core GPU Architecture In-Depth
End of Chapter 8