
William Stallings

Computer Organization
and Architecture
8th Edition
Chapter 18
Multicore Computers

Hardware Performance Issues


Microprocessors have seen an exponential
increase in performance
Improved organization
Increased clock frequency

Increase in Parallelism
Pipelining
Superscalar (multi-issue)
Simultaneous multithreading (SMT)

Diminishing returns
More complexity requires more logic
Increasing chip area for coordinating and
signal transfer logic
Harder to design, make and debug

Alternative Chip
Organizations

http://www.cadalyst.com/files/cadalyst/nodes/2008/6351/i4.jpg

Intel Hardware
Trends
Exponential speedup trend
ILP has come and gone

http://smoothspan.files.wordpress.com/2007/09/clockspeeds.jpg

http://www.ixbt.com/cpu/semiconductor/intel-65nm/power_density.jpg

Increased Complexity
Power requirements grow exponentially with chip
density and clock frequency
Can use more chip area for cache

Memory transistors are smaller than logic transistors
Order of magnitude lower power requirements

By 2015

100 billion transistors on 300 mm² die


Cache of 100MB
1 billion transistors for logic

http://techreport.com/r.x/core-i7/die-callout.jpg

http://www.tomshardware.com/reviews/core-duo-notebooks-trade-batterylife-quicker-response,1206-4.html

Power and Memory Considerations

More action

Less action

We passed 50%!!!
Is this a RAM or a processor?

Increased Complexity
Pollack's rule:

Performance is roughly proportional to the square root of
the increase in complexity
Double complexity gives 40% more performance

Multicore has the potential for near-linear
improvement (needs some programming effort
and won't work for all problems)
Unlikely that one core can use all of a huge
cache effectively, so add PEs to make an MPSoC
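
The contrast between Pollack's rule and the near-linear multicore case can be checked numerically. The following is a minimal sketch; the function names are hypothetical and the multicore figure is a best case that ignores serial code and communication overheads.

```python
import math


def pollack_speedup(complexity_ratio: float) -> float:
    """Single-core performance gain under Pollack's rule (square root of complexity)."""
    return math.sqrt(complexity_ratio)


def ideal_multicore_speedup(core_count: int) -> float:
    """Best-case gain when the same transistors are spent on extra cores instead."""
    return float(core_count)


if __name__ == "__main__":
    for ratio in (2, 4, 8):
        single = pollack_speedup(ratio)
        multi = ideal_multicore_speedup(ratio)
        print(f"{ratio}x transistors: one complex core {single:.2f}x, "
              f"{ratio} simple cores up to {multi:.2f}x")
```

Doubling complexity gives a factor of about 1.41, which is the "40% more performance" quoted above.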

Chip Utilization of Transistors

Cache

CPU

Software Performance Issues


Performance benefits dependent on
effective exploitation of parallel resources
(obviously)
Even small amounts of serial code impact
performance (not so obvious)
10% inherently serial on 8 processor system
gives only 4.7 times performance
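
The 4.7 figure follows from Amdahl's law. A minimal sketch, assuming the usual form speedup = 1 / (f + (1 - f) / N), where f is the serial fraction and N the number of processors (the function name is hypothetical):

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # 10% serial code on 8 processors, the case quoted on the slide.
    print(round(amdahl_speedup(0.10, 8), 1))
    # Even with very many processors the speedup stays below 1 / 0.10 = 10.
    print(round(amdahl_speedup(0.10, 1_000_000), 2))
```

The serial fraction sets a hard ceiling of 1/f, which is why even small amounts of serial code matter.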

Many overheads of MPSoC:


Communication
Distribution of work
Cache coherence

Some applications effectively exploit
multicore processors

Effective Applications for Multicore Processors


Database (e.g. Select *)
Servers handling independent transactions
Multi-threaded native applications
Lotus Domino, Siebel CRM

Multi-process applications
Oracle, SAP, PeopleSoft

Java applications
Java VM is multi-threaded with scheduling and memory
management (not so good at SSE)
Sun's Java Application Server, BEA's WebLogic, IBM
WebSphere, Tomcat

Multi-instance applications
One application running multiple times

Multicore Organization
Main design variables:
Number of core processors on chip (dual, quad ... )
Number of levels of cache on chip (L1, L2, L3, ...)
Amount of shared cache vs. not shared (1MB, 4MB, ...)

The following slide has examples of each organization:


a) ARM11 MPCore
b) AMD Opteron
c) Intel Core Duo
d) Intel Core i7

ARM11 MPCore

AMD Opteron

Intel Core Duo

Intel Core i7

Multicore Organization Alternatives

No shared

Shared

Advantages of shared L2 Cache


Constructive interference reduces overall miss
rate (A wants X, then B wants X: good!)
Data shared by multiple cores not replicated at
cache level (one copy of X for both A and B)
With proper frame replacement algorithms, the
amount of shared cache dedicated to each core is
dynamic
Threads with less locality can have more cache

Easy inter-process communication through
shared memory
Cache coherency confined to small L1
Dedicated L2 cache gives each core more rapid
access
Good for threads with strong locality

Shared L3 cache may also improve performance

Core i7 and Duo


Let us review these two Intel
architectures

Individual Core Architecture


Intel Core Duo uses superscalar cores
Intel Core i7 uses simultaneous multithreading (SMT)
Scales up number of threads supported
4 SMT cores, each supporting 4 threads, appear as
16 cores (my Core i7 has 2 threads per core)

Core i7

Core 2 Duo

Intel x86 Multicore Organization Core Duo (1)

2006
Two x86 superscalar, shared L2 cache
Dedicated L1 cache per core
32KB instruction and 32KB data

Thermal control unit per core


Manages chip heat dissipation with sensors, clock speed
is throttled
Maximize performance within thermal constraints
Improved ergonomics (quiet fan)

Advanced Programmable Interrupt
Controller (APIC)
Inter-processor interrupts between cores
Routes interrupts to appropriate core
Includes timer so OS can self-interrupt a core

Intel x86 Multicore Organization Core Duo (2)

Power Management Logic


Monitors thermal conditions and CPU activity
Adjusts voltage (and thus power consumption)
Can switch on/off individual logic subsystems
to save power
Split-bus transactions can sleep on one end

2MB shared L2 cache


Dynamic allocation
MESI support for L1 caches
Extended to support multiple Core Duo in SMP
(not SMT)
L2 data shared between local cores (fast) or external

Bus interface is FSB

Intel Core Duo Block Diagram

Intel x86 Multicore Organization Core i7

November 2008
Four x86 SMT processors
Dedicated L2, shared L3 cache
Speculative pre-fetch for caches
On chip DDR3 memory controller
Three 8 byte channels (192 bits) giving 32GB/s
No front side bus (just like labs 1 & 2 with the SDRAM
controller)
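
The 32GB/s figure can be reproduced from the channel width. This is a sketch under one assumption that the slide does not state: a transfer rate of about 1.333 G transfers per second per channel (DDR3-1333). The function name is hypothetical.

```python
def peak_memory_bandwidth_gb_s(channels: int,
                               bytes_per_transfer: int,
                               transfers_per_second: float) -> float:
    """Peak bandwidth in GB/s: channels x bytes per transfer x transfer rate."""
    return channels * bytes_per_transfer * transfers_per_second / 1e9


if __name__ == "__main__":
    # Three 8-byte channels = 24 bytes = 192 bits moved per transfer.
    # ASSUMPTION: 1.333e9 transfers/s per channel (not given on the slide).
    bandwidth = peak_memory_bandwidth_gb_s(3, 8, 1.333e9)
    print(f"{bandwidth:.1f} GB/s")
```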

QuickPath Interconnect (QPI video if time allows)


Cache coherent point-to-point link
High speed communications between processor chips
6.4G transfers per second, 16 bits per transfer
Dedicated bi-directional pairs
Total bandwidth 25.6GB/s
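
The QPI total follows directly from the numbers above: 16 bits per transfer at 6.4G transfers per second in each direction, with both directions counted. A minimal sketch (the function name is hypothetical):

```python
def qpi_bandwidth_gb_s(transfers_per_second: float,
                       bits_per_transfer: int,
                       directions: int = 2) -> float:
    """Aggregate link bandwidth in GB/s across all directions."""
    bytes_per_transfer = bits_per_transfer / 8
    return transfers_per_second * bytes_per_transfer * directions / 1e9


if __name__ == "__main__":
    per_direction = qpi_bandwidth_gb_s(6.4e9, 16, directions=1)
    total = qpi_bandwidth_gb_s(6.4e9, 16, directions=2)
    print(f"{per_direction:.1f} GB/s per direction, {total:.1f} GB/s total")
```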

Intel Core i7 Block Diagram

ARM11 MPCore
ARM vs. x86 and Microsoft
Intel started this fight by challenging ARM
with its Atom processor, which is moving
downmarket and towards
smartphones. Apparently, the major ARM
vendors are feeling the threat, are now
moving upmarket and are beginning to
make their run at low-end PCs and
storage appliances to put the pressure
back on Intel.

http://www.tgdaily.com/trendwatch-features/41561-the-coming-arm-vs-intel-pc-battle

ARM11 MPCore
Up to 4 processors each with own L1 instruction and data
cache
Distributed Interrupt Controller (DIC)
Recall the APIC from Intel's Core architecture

Timer per CPU


Watchdog (feed or it barks!)

Warning alerts for software failures


Counts down from predetermined values
Issues warning at zero

CPU interface

Interrupt acknowledgement, masking and completion
acknowledgement

CPU

Single ARM11 called MP11

Vector floating-point unit (VFP)


FP co-processor

L1 cache
Snoop control unit

L1 cache coherency
http://barfblog.foodsafety.ksu.edu/DogObedienceTraining.jpg

ARM11 MPCore Block Diagram

ARM11 MPCore Interrupt Handling


Distributed Interrupt Controller (DIC) collates
interrupts from many sources (ironically it is a
centralized controller)
It provides
Masking (who can ignore an interrupt)
Prioritization (CPU A is more important than CPU B)
Distribution to target MP11 CPUs
Status tracking (of interrupts)
Software interrupt generation

Number of interrupts independent of MP11 CPU
design
Memory mapped DIC control registers
Accessed by CPUs via private interface through
SCU
DIC can:
Route interrupts to single or multiple CPUs
Provide inter-process communication

Thread on one CPU can cause activity by thread on another CPU

DIC Routing

Direct to specific CPU


To defined group of CPUs
To all CPUs
OS can generate interrupt to:
All but self
Self
Other specific CPU

Typically combined with shared memory
for inter-process communication
16 interrupt IDs available for inter-process
communication (per CPU)
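
The routing options listed above can be sketched as a function that returns the set of target CPUs. This is an illustration only: the enum, the function, and the four-CPU default are hypothetical and do not model the real DIC register interface.

```python
from enum import Enum


class Target(Enum):
    SPECIFIC = "specific CPU"
    GROUP = "defined group of CPUs"
    ALL = "all CPUs"
    ALL_BUT_SELF = "all but self"
    SELF = "self"


def route(target: Target, source_cpu: int, cpus=(), cpu_count: int = 4) -> set:
    """Return the CPU numbers that should receive the interrupt."""
    all_cpus = set(range(cpu_count))
    if source_cpu not in all_cpus:
        raise ValueError("unknown source CPU")
    if target is Target.SELF:
        return {source_cpu}
    if target is Target.ALL_BUT_SELF:
        return all_cpus - {source_cpu}
    if target is Target.ALL:
        return all_cpus
    # SPECIFIC and GROUP need an explicit, valid list of CPUs.
    chosen = set(cpus)
    if not chosen or not chosen <= all_cpus:
        raise ValueError("target CPUs missing or out of range")
    if target is Target.SPECIFIC and len(chosen) != 1:
        raise ValueError("SPECIFIC takes exactly one CPU")
    return chosen


if __name__ == "__main__":
    print(sorted(route(Target.ALL_BUT_SELF, source_cpu=1)))
    print(sorted(route(Target.GROUP, source_cpu=0, cpus=(2, 3))))
```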

Interrupt States
Inactive
Non-asserted
Completed by that CPU but pending or active
in others
E.g. allgather

Pending
Asserted
Processing not started on that CPU

Active
Started on that CPU but not complete
Can be pre-empted by higher priority interrupt
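
The three states can be read as a small per-CPU state machine. A minimal sketch, assuming the simple cycle Inactive, Pending, Active, Inactive; the event names are hypothetical, and pre-emption by a higher-priority interrupt is not modelled.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"  # non-asserted, or completed by this CPU
    PENDING = "pending"    # asserted, processing not started on this CPU
    ACTIVE = "active"      # started on this CPU but not complete


# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS = {
    (State.INACTIVE, "assert"): State.PENDING,
    (State.PENDING, "acknowledge"): State.ACTIVE,
    (State.ACTIVE, "complete"): State.INACTIVE,
}


def step(state: State, event: str) -> State:
    """Advance one interrupt's state on one CPU, rejecting invalid events."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not valid in state {state.value}")


if __name__ == "__main__":
    state = State.INACTIVE
    for event in ("assert", "acknowledge", "complete"):
        state = step(state, event)
        print(event, "->", state.value)
```

Because the state is tracked per CPU, the same interrupt can be inactive on one CPU while still pending or active on others.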

Interrupt Sources
Inter-processor Interrupts (IPI)
Private to CPU
ID0-ID15 (16 IPIs per CPU as mentioned earlier)
Software triggered
Priority depends on receiving CPU not source

Private timer and/or watchdog interrupt


ID29 and ID30

Legacy FIQ line


Legacy FIQ pin, per CPU, bypasses interrupt distributor
Directly drives interrupts to CPU

Hardware
Triggered by programmable events on associated
interrupt lines
Up to 224 lines
Start at ID32
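
The ID ranges above can be summarised as a lookup. This is a sketch with a hypothetical function name; IDs that the slide does not describe are reported as such rather than guessed, and the legacy FIQ pin is left out because it bypasses the distributor.

```python
def classify_interrupt(interrupt_id: int) -> str:
    """Map an interrupt ID to the source class described on the slide."""
    hardware_first = 32
    hardware_lines = 224  # "up to 224 lines", so IDs 32..255
    if 0 <= interrupt_id <= 15:
        return "inter-processor interrupt (software triggered)"
    if interrupt_id in (29, 30):
        return "private timer or watchdog"
    if hardware_first <= interrupt_id < hardware_first + hardware_lines:
        return "hardware interrupt line"
    return "not described on the slide"


if __name__ == "__main__":
    for interrupt_id in (0, 15, 29, 30, 31, 32, 255, 256):
        print(interrupt_id, classify_interrupt(interrupt_id))
```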

ARM11 MPCore Interrupt Distributor

Cache Coherency
Snoop Control Unit (SCU) resolves most shared
data bottleneck issues
Note: L1 cache coherency based on MESI similar to
Intel's Core architecture

3 types of SCU shared data resolution:


1. Direct data intervention

Copying clean entries between L1 caches without accessing
external memory or L2
Can resolve local L1 miss from remote L1 rather than L2
Reduces read after write from L1 to L2

2. Duplicated tag RAMs

Cache tags implemented as separate block of RAM; a copy is held
in the SCU, so the SCU knows when 2 CPUs have the same cache
lines.
Tag RAM has same length as number of lines in cache
TAG duplicates used by SCU to check data availability before
sending coherency commands
Only send to CPUs that must update coherent data cache
Less bus locking due to less communication during coherency step

3. Migratory lines

Allows moving dirty data between CPUs without writing to L2 and
reading back from external memory (see Stallings Ch. 18.5, p. 703)
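
The duplicated tag RAM idea (technique 2) can be sketched in software: the SCU keeps its own copy of each CPU's L1 tags and sends coherency commands only to CPUs that actually hold the line. The class and method names below are hypothetical, and the model tracks only which lines are present, not MESI states or data.

```python
class SnoopControlUnit:
    """Toy model of an SCU that filters coherency traffic with duplicate tags."""

    def __init__(self, cpu_count: int):
        # One duplicate tag set per CPU, mirroring that CPU's L1 contents.
        self.duplicate_tags = [set() for _ in range(cpu_count)]

    def record_fill(self, cpu: int, line_address: int) -> None:
        """A CPU loaded a line into its L1; mirror the tag in the SCU."""
        self.duplicate_tags[cpu].add(line_address)

    def record_evict(self, cpu: int, line_address: int) -> None:
        """A CPU dropped a line from its L1; remove the mirrored tag."""
        self.duplicate_tags[cpu].discard(line_address)

    def coherency_targets(self, writer_cpu: int, line_address: int) -> list:
        """CPUs (other than the writer) that must be told about a write."""
        return [cpu for cpu, tags in enumerate(self.duplicate_tags)
                if cpu != writer_cpu and line_address in tags]


if __name__ == "__main__":
    scu = SnoopControlUnit(cpu_count=4)
    scu.record_fill(0, 0x1000)
    scu.record_fill(2, 0x1000)
    # Only CPU 2 needs a coherency command; CPUs 1 and 3 are not disturbed.
    print(scu.coherency_targets(writer_cpu=0, line_address=0x1000))
```

Without the duplicate tags, every write to shared data would have to be broadcast to all CPUs.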

Performance Effect of Multiple Cores

Recommended Reading

Multicore Association web site
Stallings chapter 18
ARM web site
(if we have time)
http://www.intel.com/technology/quickpath/index.htm
http://www.arm.com/products/CPUs/ARM11MPCoreMultiprocessor.html
http://www.eetimes.com/news/design/features/showArticle.jhtml?articleID=23901143