It3283-Ca2021 1 0-CH8

COMPUTER ARCHITECTURE
Course ID: IT3283
Nguyễn Kim Khánh

Course contents
Chapter 1. Introduction
Chapter 2. Computer Systems
Chapter 3. Computer Arithmetic
Chapter 4. Instruction Set Architecture
Chapter 5. The Processor
Chapter 6. Computer Memory
Chapter 7. Input/Output Systems
Chapter 8. Parallel Architectures
NKK-CA2021.1.0 IT3283-Computer Architecture

Computer Architecture
Chapter 8
PARALLEL ARCHITECTURES

Contents
8.1. Classification of computer architectures
8.2. Shared-memory multiprocessors
8.3. Distributed-memory multiprocessors
8.4. General-purpose graphics processors

8.1. Classification of computer architectures
Classification of computer architectures (Michael Flynn, 1966)
§ SISD - Single Instruction Stream, Single Data Stream
§ SIMD - Single Instruction Stream, Multiple Data Stream
§ MISD - Multiple Instruction Stream, Single Data Stream
§ MIMD - Multiple Instruction Stream, Multiple Data Stream

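SISD
[Diagram: the control unit (CU) sends the instruction stream (IS) to the processing unit (PU), which exchanges the data stream (DS) with the memory unit (MU).]
§ CU: Control Unit
§ PU: Processing Unit
§ MU: Memory Unit
§ One processor
§ A single instruction stream
§ Data is stored in a single memory
§ This is the (sequential) von Neumann architecture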

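SIMD
[Diagram: one control unit (CU) broadcasts a single instruction stream (IS) to processing units PU1 ... PUn; each PUi exchanges its own data stream (DS) with its local memory LMi.]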

SIMD (continued)
§ A single instruction stream controls all processing units (PUs) simultaneously
§ Each processing unit has its own data memory, LM (local memory)
§ Each instruction is executed on a set of different data items
§ SIMD models
  § Vector computer
  § Array processor
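As a minimal sketch of the SIMD idea above, the following Python snippet models several processing units, each with its own local memory, all executing the same instruction. The names `simd_execute` and `local_memories` are hypothetical and chosen only for illustration; real SIMD hardware runs the PUs simultaneously, whereas this sequential loop only mimics the semantics.

```python
from typing import Callable, List

def simd_execute(instruction: Callable[[int], int],
                 local_memories: List[List[int]]) -> List[List[int]]:
    """Apply ONE instruction to the data held in every PU's local memory.

    Illustrative model only: the outer loop stands in for PUs that would
    run at the same time under a single control unit.
    """
    return [[instruction(x) for x in lm] for lm in local_memories]

# Three PUs, each with its own local memory (LM1..LM3).
local_memories = [[1, 2], [10, 20], [100, 200]]

# A single instruction stream ("add 1") is broadcast to all PUs.
result = simd_execute(lambda x: x + 1, local_memories)
print(result)
```

Every PU applies the same operation, but to different data, which is the defining property of SIMD.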

MISD
§ A single data stream is delivered to a set of processors
§ Each processor executes a different instruction sequence
§ No real machine of this kind exists yet
§ May appear in the future

MIMD
§ A set of processors
§ The processors simultaneously execute different instruction sequences on different data
§ MIMD models
  § Multiprocessors (shared memory)
  § Multicomputers (distributed memory)

MIMD - Shared Memory
Shared-memory multiprocessors
[Diagram: control units CU1 ... CUn each send an instruction stream (IS) to their own processing unit PU1 ... PUn; every PU exchanges a data stream (DS) with a single shared memory.]

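MIMD - Distributed Memory
Distributed-memory multiprocessors (or multicomputers)
[Diagram: each node i has a control unit CUi, a processing unit PUi and a local memory LMi, linked by its own instruction stream (IS) and data stream (DS); the nodes are connected by a high-performance interconnection network.]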

Classification of parallel techniques
§ Instruction-level parallelism
  § Pipelining
  § Superscalar
§ Data-level parallelism
  § SIMD
§ Thread-level parallelism
  § MIMD
§ Request-level parallelism
  § Cloud computing

8.2. Shared-memory multiprocessors
§ Symmetric multiprocessor systems (SMP, Symmetric Multiprocessors)
§ Non-symmetric multiprocessor systems (NUMA, Non-Uniform Memory Access)
§ Multicore processors

SMP or UMA (Uniform Memory Access)
The simplest multiprocessors are based on a single bus, as illustrated in Fig. 8-26(a). Two or more CPUs and one or more memory modules all use the same bus for communication. When a CPU wants to read a memory word, it first checks to see whether the bus is busy. If the bus is idle, the CPU puts the address of the word it wants on the bus, asserts a few control signals, and waits until the memory puts the desired word on the bus.
[Figure 8-26. Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.]
If the bus is busy when a CPU wants to read or write memory, the CPU just waits until the bus becomes idle. Herein lies the problem with this design. With two or three CPUs, contention for the bus will be manageable; with 32 or 64 it will be unbearable. The system will be totally limited by the bandwidth of the bus, and most of the CPUs will be idle most of the time.
The solution is to add a cache to each CPU, as depicted in Fig. 8-26(b).
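The bus-contention argument can be made concrete with a rough saturation model. This is only a sketch under a stated assumption that is not in the source: each CPU, when running unimpeded, needs the bus for a fixed fraction of its time (`bus_fraction`, a hypothetical parameter), and the single bus can serve at most one request at a time.

```python
def effective_cpus(n_cpus: int, bus_fraction: float) -> float:
    """Upper bound on the number of CPUs doing useful work.

    Assumed model: the shared bus is busy at most 100% of the time, so at
    most 1 / bus_fraction CPUs can be kept fully busy.
    """
    if not 0.0 < bus_fraction <= 1.0:
        raise ValueError("bus_fraction must be in (0, 1]")
    return min(float(n_cpus), 1.0 / bus_fraction)

def idle_fraction(n_cpus: int, bus_fraction: float) -> float:
    """Average fraction of time each CPU spends waiting for the bus."""
    return 1.0 - effective_cpus(n_cpus, bus_fraction) / n_cpus

# Hypothetical workload: every CPU needs the bus 25% of the time.
for n in (2, 4, 32, 64):
    print(n, effective_cpus(n, 0.25), round(idle_fraction(n, 0.25), 3))
```

Under this assumption, two or four CPUs are fully used, while with 32 or 64 CPUs only four CPUs' worth of work gets done and the rest wait. Per-CPU caches help because they lower the fraction of accesses that reach the bus.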

SMP (continued)
§ A computer with n >= 2 identical processors
§ The processors share the memory and the I/O system
§ Memory access time is the same for all processors
§ All processors can perform the same functions
§ The system is controlled by a distributed operating system
§ Performance: tasks can be executed in parallel
§ Fault tolerance

NUMA (Non-Uniform Memory Access)
§ There is a single address space shared by all CPUs
§ Each CPU can remotely access the memory of other CPUs
§ Remote memory access is slower than local memory access
... the request was routed over the system bus to the system containing the word, which then responded. Of course, the latter took much longer than the former. While a program could run happily out of remote memory, it took 10 times longer to execute than the same program running out of local memory.
[Figure 8-32. A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design. Diagram labels: CPU, Memory, MMU, Local bus, System bus.]
Memory coherence is guaranteed in an NC-NUMA machine because no caching is present. Each word of memory lives in exactly one location, so there is no danger of one copy having stale data: there are no copies. Of course, it now matters a great deal which page is in which memory because the performance penalty for being in the wrong place is so high. Consequently, NC-NUMA machines use elaborate software to move pages around to maximize performance.
Typically, a daemon process called a page scanner runs every few seconds. Its job is to examine the usage statistics and move pages around in an attempt to improve performance. If a page appears to be in the wrong place, the page scanner ...
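To see why page placement matters so much, the sketch below computes an average memory access time as a function of the fraction of accesses that hit local memory. It assumes, for illustration only, that a remote access costs 10 times a local one (borrowing the 10x figure quoted above, which the source states for whole-program run time rather than per access). The function and parameter names are hypothetical.

```python
def average_access_time(local_fraction: float,
                        local_time: float = 1.0,
                        remote_penalty: float = 10.0) -> float:
    """Weighted average of local and remote access times.

    local_fraction : share of accesses served by the CPU's own memory
    local_time     : cost of one local access (arbitrary unit)
    remote_penalty : assumed ratio remote/local (10x here)
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be in [0, 1]")
    remote_time = local_time * remote_penalty
    return local_fraction * local_time + (1.0 - local_fraction) * remote_time

for f in (1.0, 0.9, 0.5, 0.0):
    print(f, average_access_time(f))
```

Even with 90% of accesses local, the average cost nearly doubles under this assumption. That is the motivation for a page scanner that migrates pages toward the CPU that uses them.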
Multicore processors
Evolution of the processor:
§ Sequential
§ Pipelined
§ Superscalar
§ Multithreaded
§ Multicore: multiple CPUs on one chip
[Figure 18.1 Alternative Chip Organizations. (a) Superscalar: one program counter and a single-thread register file feeding issue logic, an instruction fetch unit, and execution units and queues, with L1 instruction and data caches and an L2 cache. (b) Simultaneous multithreading (SMT): PC 1 ... PC n and register sets 1 ... n sharing the same issue logic, execution units and caches. (c) Multicore: processors 1 ... n, each with its own L1-I and L1-D caches, sharing an L2 cache.]
For each of these innovations, designers have over the years attempted to increase the performance of the system by adding complexity.
Forms of multicore processor organization
[Figure 18.8 Multicore Organization Alternatives. (a) Dedicated L1 cache. (b) Dedicated L2 cache. (c) Shared L2 cache. (d) Shared L3 cache. Each panel shows CPU cores 1 ... n with per-core L1-D and L1-I caches above the L2 cache (and, in (d), the L3 cache), main memory and I/O.]
4. Interprocessor communication is easy to implement, via shared memory locations.
Intel - Core Duo
§ 2006
§ Two x86 superscalar cores, shared L2 cache
§ Dedicated L1 cache per core
  § 32 KiB instruction and 32 KiB data
§ 2 MiB shared L2 cache
... density of today's chips, thermal management is a fundamental capability, especially for laptop and mobile systems. The Core Duo thermal control unit is designed to manage chip heat dissipation to maximize processor performance within thermal constraints. Thermal management also improves ergonomics with a cooler system and lower fan acoustic noise. In essence, the thermal management unit monitors digital sensors for high-accuracy die temperature measurements. Each core can be defined as an independent thermal zone.
[Figure 18.9 Intel Core Duo Block Diagram. Diagram labels, per core: 32-kB L1 caches, execution resources, architectural state, thermal control, APIC. Shared: power management logic, 2 MB L2 shared cache, bus interface, front-side bus.]

Intel Core i7-990X
[Figure 18.10 Intel Core i7-990X Block Diagram.
  Cores:                    Core 0 ... Core 5
  L1 caches (per core):     32 kB L1-I + 32 kB L1-D
  L2 cache (per core):      256 kB
  L3 cache (shared):        12 MB
  DDR3 memory controllers:  3 × 8B @ 1.33 GT/s
  QuickPath Interconnect:   4 × 20B @ 6.4 GT/s]
The general structure of the Intel Core i7-990X is shown in Figure 18.10. Each core has its own dedicated L2 cache and the six cores share a 12-MB L3 cache. One mechanism Intel uses to make its caches more effective is prefetching.
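As a quick worked check of the figures in the Core i7-990X block diagram, the sketch below totals the on-chip cache and estimates the peak DDR3 transfer rate. Two readings are assumptions rather than statements from the source: "kB" and "MB" are taken as 1024-based units, and "3 × 8B @ 1.33 GT/s" is read as three channels moving 8 bytes per transfer.

```python
KB = 1024
MB = 1024 * KB

cores = 6
l1_per_core = 32 * KB + 32 * KB    # L1-I + L1-D
l2_per_core = 256 * KB
l3_shared = 12 * MB

# Total on-chip cache: private L1 and L2 per core plus the shared L3.
total_cache_bytes = cores * (l1_per_core + l2_per_core) + l3_shared
total_cache_kb = total_cache_bytes // KB

# Assumed reading: 3 channels x 8 bytes/transfer x 1.33 G transfers/s.
ddr3_peak_gb_per_s = 3 * 8 * 1.33

print(total_cache_kb, "kB of cache in total")
print(round(ddr3_peak_gb_per_s, 2), "GB/s peak DDR3 transfer rate")
```

These are theoretical peak numbers derived only from the diagram labels, not measured values.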
8.3. Distributed-memory multiprocessors
§ Large-scale computers (Warehouse-Scale Computers or Massively Parallel Processors, MPP)
§ Cluster computers (clusters)
Many different topologies, switching schemes, and routing algorithms are used. What all multicomputers have in common is that when an application program executes the send primitive, the communication processor is notified and transmits a block of user data to the destination machine (possibly after first asking for and getting permission). A generic multicomputer is shown in Fig. 8-36.
[Figure 8-36. A generic multicomputer. Diagram labels: Node, CPU, Memory, Disk, I/O, Local interconnect, Communication processor, High-performance interconnection network.]
In Fig. 8-36 we see that multicomputers are held together by interconnection networks. Now it is time to look more closely at these interconnection networks. Interestingly enough, multiprocessors and multicomputers are surprisingly similar in this respect because multiprocessors often have multiple memory modules that must also be interconnected with one another and with the CPUs. Thus the material in this section frequently applies to both kinds of systems.
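The send primitive mentioned above can be mimicked in a few lines. This is a toy sketch, not a real multicomputer API: each "node" is a Python thread that touches only its own `local_memory`, and `queue.Queue` objects stand in for the communication processors and the interconnection network. All names (`node`, `run`, `inboxes`) are hypothetical.

```python
import queue
import threading

def node(rank, inboxes, local_memory, results):
    """One node: computes on private data, communicates only by messages."""
    partial = sum(local_memory)            # work on local memory only
    if rank == 0:
        total = partial
        for _ in range(len(inboxes) - 1):
            total += inboxes[0].get()      # "receive": blocks until a message arrives
        results["total"] = total
    else:
        inboxes[0].put(partial)            # "send" the partial sum to node 0

def run(data_per_node):
    """Start one thread per node and return the sum collected by node 0."""
    n = len(data_per_node)
    inboxes = [queue.Queue() for _ in range(n)]
    results = {}
    threads = [
        threading.Thread(target=node,
                         args=(r, inboxes, data_per_node[r], results))
        for r in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results["total"]

total = run([[1, 2, 3], [4, 5], [6]])
print(total)
```

No node reads another node's data directly. The only way information moves is through explicit send and receive operations, which is what distinguishes a multicomputer from a shared-memory multiprocessor.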

Distributed-memory multiprocessors

Interconnection networks
[Figure 8-37. Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.]
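Two of the topologies in Figure 8-37 are easy to explore in code. The sketch below assumes the usual convention, not stated in the source, that hypercube nodes are numbered with binary labels so that neighbors differ in exactly one bit. It compares the diameter (worst-case hop count) of a 4D hypercube with that of a 16-node ring. Function names are hypothetical.

```python
from itertools import combinations
from typing import List

def hypercube_neighbors(node: int, dim: int) -> List[int]:
    """Neighbors of `node` in a dim-dimensional hypercube: flip one bit each."""
    return [node ^ (1 << bit) for bit in range(dim)]

def hypercube_distance(a: int, b: int) -> int:
    """Minimum hop count = number of differing label bits (Hamming distance)."""
    return bin(a ^ b).count("1")

def ring_distance(a: int, b: int, n: int) -> int:
    """Minimum hop count on an n-node ring, going either way round."""
    d = abs(a - b) % n
    return min(d, n - d)

def diameter(nodes, distance) -> int:
    """Largest shortest-path distance over all node pairs."""
    return max(distance(a, b) for a, b in combinations(nodes, 2))

print(hypercube_neighbors(0b0000, 4))
print(diameter(range(16), hypercube_distance))
print(diameter(range(16), lambda a, b: ring_distance(a, b, 16)))
```

With the same 16 nodes, the hypercube needs at most 4 hops while the ring needs up to 8, at the cost of 4 links per node instead of 2.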
Massively Parallel Processors
§ Large-scale systems
§ Expensive: many millions of USD
§ Used for scientific computing and for problems with very large numbers of operations and amounts of data
§ Supercomputers

IBM Blue Gene/P
... go to the main DRAM takes about 75 cycles.
The four CPUs are connected via a high-bandwidth bus to a 3D torus network, which requires six connections: up, down, north, south, east, and west. In addition, each processor has a port to the collective network, used for broadcasting data to all processors. The barrier port is used to speed up synchronization operations, giving each processor fast access to a specialized synchronization network.
At the next level up, IBM designed a custom card that holds one of the chips shown in Fig. 8-38 along with 2 GB of DDR2 DRAM. The chip and the card are shown in Fig. 8-39(a)–(b) respectively.

Level        Contents                                               Memory
(a) Chip     4 processors, 8-MB L3 cache
(b) Card     1 chip, 4 CPUs                                         2 GB (DDR2 DRAM)
(c) Board    32 cards, 32 chips, 128 CPUs                           64 GB
(d) Cabinet  32 boards, 1024 cards, 1024 chips, 4096 CPUs           2 TB
(e) System   72 cabinets, 73728 cards, 73728 chips, 294912 CPUs     144 TB

Figure 8-39. The BlueGene/P: (a) chip. (b) card. (c) board. (d) cabinet. (e) system.
The cards are mounted on plug-in boards, with 32 cards per board for a total of 32 chips (and thus 128 CPUs) per board. Since each card contains 2 GB of DRAM, the boards contain 64 GB apiece. One board is illustrated in Fig. 8-39(c). At the next level, 32 of these boards are plugged into a cabinet, packing 4096 CPUs into a single cabinet. A cabinet is illustrated in Fig. 8-39(d).
Finally, a full system, consisting of up to 72 cabinets with 294,912 CPUs, is depicted in Fig. 8-39(e). A PowerPC 450 can issue up to 6 instructions/cycle.
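The packaging hierarchy of Figure 8-39 is just repeated multiplication, which the following sketch reproduces from the per-level factors given in the text (4 CPUs per chip, 1 chip and 2 GB per card, 32 cards per board, 32 boards per cabinet, 72 cabinets). The helper `totals` is a hypothetical name, and memory is converted with 1 TB = 1024 GB.

```python
CPUS_PER_CHIP = 4
CHIPS_PER_CARD = 1
GB_PER_CARD = 2
CARDS_PER_BOARD = 32
BOARDS_PER_CABINET = 32
CABINETS = 72

def totals(cards: int) -> dict:
    """Chip, CPU and memory counts for a given number of cards."""
    chips = cards * CHIPS_PER_CARD
    return {
        "cards": cards,
        "chips": chips,
        "cpus": chips * CPUS_PER_CHIP,
        "memory_gb": cards * GB_PER_CARD,
    }

board = totals(CARDS_PER_BOARD)
cabinet = totals(CARDS_PER_BOARD * BOARDS_PER_CABINET)
system = totals(CARDS_PER_BOARD * BOARDS_PER_CABINET * CABINETS)

for name, level in (("board", board), ("cabinet", cabinet), ("system", system)):
    print(name, level)
print("system memory in TB:", system["memory_gb"] // 1024)
```

The computed values match the figure: 128 CPUs and 64 GB per board, 4096 CPUs and 2 TB per cabinet, and 294912 CPUs and 144 TB for the full 72-cabinet system.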

Cluster
§ Many computers connected to one another by a high-speed interconnection network (~ Gbps)
§ Each computer can work independently (a PC or an SMP)
§ Each computer is called a node
§ The computers can be managed to work in parallel as a group (cluster)
§ The whole system can be regarded as one parallel computer
§ High availability
§ High fault tolerance

Google's PC cluster
... hold exactly 80 PCs and switches can be larger or smaller than 128 ports; these are just typical values for a Google cluster.
[Figure 8-44. A typical Google cluster. Diagram labels: OC-12 fiber, OC-48 fiber, two 128-port Gigabit Ethernet switches, two gigabit Ethernet links, 80-PC rack.]
Power density is also a key issue. A typical PC burns about 120 watts or about 10 kW per rack. A rack needs about 3 m² so that maintenance personnel can ...
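The "about 10 kW per rack" figure follows directly from the numbers quoted for a typical Google cluster, as this short worked calculation shows. It uses only values from the text (80 PCs per rack, about 120 W per PC, about 3 m² per rack). The power-density line is a derived illustration and not a number taken from the source.

```python
PCS_PER_RACK = 80
WATTS_PER_PC = 120
RACK_AREA_M2 = 3

# Power drawn by one full rack of PCs.
rack_watts = PCS_PER_RACK * WATTS_PER_PC

# Derived illustration: power per square metre of floor space.
density_w_per_m2 = rack_watts / RACK_AREA_M2

print(rack_watts / 1000, "kW per rack")
print(density_w_per_m2, "W per square metre")
```

The exact product is 9.6 kW, which the text rounds to about 10 kW.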
8.4. General-purpose graphics processors
§ SIMD architecture
§ Derived from the GPU (Graphics Processing Unit), which supports 2D and 3D graphics: data-parallel processing
§ GPGPU: General-Purpose Graphics Processing Unit
§ Hybrid CPU/GPGPU systems
  § The CPU is the host: it executes sequentially
  § The GPGPU performs the parallel computation

Graphics processors in a computer

GPGPU: NVIDIA Tesla
§ Streaming multiprocessor
§ 8 × streaming processors

GPGPU: NVIDIA Fermi
The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.
Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
NVIDIA Fermi
§ There are 16 Streaming Multiprocessors (SMs)
§ Each SM has 32 CUDA cores
§ Each CUDA (Compute Unified Device Architecture) core has one FPU and one integer unit (IU)
[Figure: Fermi Streaming Multiprocessor (SM). Diagram labels: Instruction Cache, two Warp Schedulers each with a Dispatch Unit, Register File (32,768 x 32-bit), Cores, LD/ST units, SFUs, Interconnect Network, 64 KB Shared Memory / L1 Cache, Uniform Cache. Inset, CUDA Core: Dispatch Port, Operand Collector, FP Unit, INT Unit, Result Queue.]
Third Generation Streaming Multiprocessor
The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.
512 High Performance CUDA cores
Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly ...
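The difference between MAD (two roundings) and FMA (one final rounding) can be observed on an ordinary CPU. The sketch below does not use any GPU or hardware FMA instruction. As an assumption made for illustration, it emulates the fused operation by computing a*b + c exactly with `fractions.Fraction` and rounding once when converting back to a float. The function names `mad` and `fma` are hypothetical.

```python
from fractions import Fraction

def mad(a: float, b: float, c: float) -> float:
    """Multiply-add with two roundings: a*b is rounded, then the sum is rounded."""
    return a * b + c

def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: exact a*b + c, then ONE final rounding."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# Inputs chosen so that a*b = 1 - 2**-60, which rounds to exactly 1.0
# in double precision before the addition takes place.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print("MAD:", mad(a, b, c))
print("FMA:", fma(a, b, c))
```

With these inputs the separately rounded product loses the small term entirely, so MAD yields 0.0, while the single-rounding result keeps it and yields -2**-60. This is the accuracy benefit the passage attributes to FMA.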
NVIDIA A100 Tensor Core GPU Architecture In-Depth
[Figure 6. GA100 Full GPU with 128 SMs (A100 Tensor Core GPU has 108 SMs)]
A100 SM Architecture
The new A100 SM significantly increases performance, builds upon features introduced in both ...
[Figure 7. GA100 Streaming Multiprocessor (SM)]
Computer Architecture
End of Chapter 8
