Fragouli - AMD Mpros

 AMD’s Microprocessors Architecture
AMD’s Microprocessor Model                                               Launch Yr   Competitor Intel Model
I.   K5 (Four-Issue Out-of-order Processor, first chip of K86 family)    1994        100 MHz Pentium
II.  K7 (Three-Issue SuperScalar Processor)                              1998        Katmai
III. Athlon (World’s First 7th Generation x86 Processor)                 2000        Pentium III
Fields of discussion:
 Basic features of each model’s architecture,
 Instruction handling,
 Caches & Memory management,
 Pipeline (# of stages, possible penalties).
Advanced Computer Architectures 1
 Part I: AMD’s K5
- Basic Features
- Advanced four-issue superscalar core that supports speculative, out-of-order execution and register renaming.
- Static, 3.3-V design with 4.3 million transistors on the die, implemented in AMD’s 0.5-micron, three-metal CMOS process.
- About 30% faster than the Pentium at the same clock rate (2.5 times as fast as a 486). The K5 is a flexible, aggressive microarchitecture that achieves higher performance on integer code; floating-point performance receives less emphasis.
- The challenge of decoding multiple x86 instructions in parallel is met by predecoding them as they are fetched from memory into the instruction cache.

Block diagram of AMD’s K5
 K5 adds predecode bits to x86 instructions before caching them.

 x86 instructions are converted, 16 bytes at a time, into one or more microinstructions (RISC operations, or ROPs); up to four ROPs are issued per cycle.
K5’s instruction translation process
 Instructions are fed into a 16-byte queue,
 Up to four ROPs’ worth of instructions can be pulled from the queue during each clock cycle,
 Four identical ROP converters translate x86 instructions into ROPs.
 Past the ROP converters, the instruction stream is RISC-like.
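The four-ROPs-per-cycle limit described above can be illustrated with a small sketch. It is not a model of the real K5 decoder: the function name `dispatch_cycles`, the per-instruction ROP counts, and the rule that an instruction's ROPs are never split across cycles are all assumptions made for illustration.

```python
from collections import deque

MAX_ROPS_PER_CYCLE = 4  # from the slides: up to four ROPs issued per cycle


def dispatch_cycles(rop_counts):
    """Group instructions (given as ROP counts, in program order) into cycles.

    Assumptions (not from the slides): an instruction's ROPs all issue in the
    same cycle, and no instruction needs more than MAX_ROPS_PER_CYCLE ROPs.
    """
    for n in rop_counts:
        if not 1 <= n <= MAX_ROPS_PER_CYCLE:
            raise ValueError("each instruction must need 1..4 ROPs in this sketch")

    queue = deque(rop_counts)
    cycles = []
    while queue:
        used, group = 0, []
        # Pull instructions in order until the next one would exceed the limit.
        while queue and used + queue[0] <= MAX_ROPS_PER_CYCLE:
            n = queue.popleft()
            used += n
            group.append(n)
        cycles.append(group)
    return cycles


# Hypothetical instruction stream needing 1, 2, 1, 3, 1 and 1 ROPs.
schedule = dispatch_cycles([1, 2, 1, 3, 1, 1])
```

With this hypothetical stream, the first three instructions fill one cycle, the next two fill a second, and the last instruction issues alone in a third.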
K5’s Execution Units
 ROPs are dispatched to reservation stations at the Execution Units, where they wait for the needed resources to become available.
Available Execution Units (6 in number) include:
 2 Dual ALUs (one with a shifter, the other with the divider – 2
execution stations)
 1 FPU (1 execution station)
 2 Dual Load/Store Units (2 execution stations each), and
 1 Branch Unit (2 execution stations).
 8 Operand Buses feed the Execution Units (4 units are fed with 2
operands on every cycle).
 5 Result Buses, each 41 bits wide; 2 are used in parallel to transfer floating-point data for x86-compatible floating-point operations.
 40-Word Register File
 One 16-entry reorder buffer (ROB) holds results from speculatively executed instructions until they are written to the register file.
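The role of the reorder buffer can be sketched in a few lines. This is a generic textbook ROB, assuming in-order allocation and in-order retirement; the class and method names are hypothetical, and only the 16-entry size comes from the slides.

```python
from collections import deque


class ReorderBuffer:
    """Minimal reorder buffer: results complete in any order, retire in order."""

    def __init__(self, size=16):  # 16 entries, as stated for the K5
        self.size = size
        self.entries = deque()    # program order, oldest entry on the left
        self.regfile = {}         # architectural register file

    def allocate(self, dest):
        """Reserve an entry at dispatch; return None if the buffer is full."""
        if len(self.entries) >= self.size:
            return None
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Record a (possibly speculative) result without touching regfile."""
        entry["value"] = value
        entry["done"] = True

    def retire(self):
        """Write finished results to the register file, oldest first."""
        retired = 0
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.regfile[e["dest"]] = e["value"]
            retired += 1
        return retired


rob = ReorderBuffer()
older = rob.allocate("eax")
younger = rob.allocate("ebx")
rob.complete(younger, 7)   # the younger instruction finishes first
blocked = rob.retire()     # nothing retires while the oldest entry is pending
rob.complete(older, 3)
drained = rob.retire()     # both entries now retire, in program order
```

Keeping results out of the register file until retirement is what lets a mispredicted or faulting instruction be discarded without corrupting architectural state.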

Pipeline in K5’s architecture
 The pipeline has 6 stages, but only 5 of them affect instruction timing (and therefore performance).
 An extra decode stage is required for the x86-to-ROP translation.
 Address generation is combined into the same stage as cache access.
Extra decode stage
1st phase: calculation of cache index
2nd phase: full linear 32-bit address calculation, segment-protection checking
Caches and Memory management
 The Data Cache is divided into four banks. There are 2 access ports, one for each Load/Store Unit.
 The Instruction Cache has a 16-byte line size; a 16-byte buffer ensures compatibility with the Pentium bus, which performs 32-byte bursts, so it holds the second cache line.
 The caches are virtually addressed and tagged to avoid the need to
translate addresses before a cache access.
 A single set of physical tags is shared by the instruction and data caches. This eliminates conflicts with the CPU for cache access and ensures consistency between the instruction and data caches.

Summarizing about AMD’s K5
 K5 goes further in combining x86 compatibility with a RISC-like core (employing an architecture similar to NexGen’s Nx586), achieving:
- a large software base,
- high performance.
 K5’s competitors at that time:
- Cyrix’s M1 (not yet launched in the market),
- Pentium, which
 holds a large part of the marketplace, often forcing AMD to provide higher performance at the same price,
 is already on the way to increasing its clock rate (promising a 120-MHz version in 1 year’s time, early 1995, or even a 150-MHz version in 2 years’ time, late 1995).

 Part II: AMD’s K7 3-issue superscalar processor - Basic Features
 10-stage pipeline -> high clock rates,
 Large L1 instruction and data caches -> functions in systems with or without a backside L2,
 Astonishingly small die (184 mm²) despite its transistor count (22 million).
 Up to 72 instructions can be in execution in K7’s out-of-order integer pipe, floating-point pipe, and load/store unit.
K7’s deep 10-stage pipeline
 Long pipeline (the first half is occupied by x86 instruction decoding),
 Simple branch prediction using a 2,048-entry branch history table with a 2-bit Smith prediction algorithm, but
 A misprediction costs a minimum of 10 cycles.
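A 2-bit Smith predictor keeps one saturating counter per table entry. The sketch below shows the general technique only: the indexing function (address modulo table size) and the initial counter value are assumptions, since the slides give just the table size and the counter width.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating counters (Smith predictor)."""

    def __init__(self, entries=2048):   # 2,048 entries, as stated for the K7
        self.entries = entries
        self.table = [1] * entries      # assumed start state: weakly not-taken

    def _index(self, pc):
        return pc % self.entries        # assumed indexing scheme

    def predict(self, pc):
        """Predict taken when the counter is in one of the two upper states."""
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def mispredictions(predictor, pc, outcomes):
    """Count wrong predictions for one branch over a sequence of outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# Hypothetical loop branch: taken nine times, then falls through, run twice.
loop = ([True] * 9 + [False]) * 2
misses = mispredictions(TwoBitPredictor(), 0x400, loop)
min_penalty_cycles = misses * 10   # at least 10 cycles per misprediction
```

The second bit is what keeps a single loop exit from flipping the prediction, so the second pass through the loop mispredicts only at its exit.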

K7’s Instruction Decoding delivers High Instruction Bandwidth
 x86 instructions are decoded to MOPs (3/cycle) and dispatched either to the integer pipe (via the direct-path or vector-path decoders, depending on their complexity) or to the floating-point pipe (passed directly to the ICU).

K7’s Out-of-Order Integer Pipe Issuing 6 ROPs/Cycle
 The Integer Scheduler is an out-of-order, 15-entry reservation station organized as three 5-MOP queues. A MOP equals 1 or 2 ROPs (load, store, load/store, ALU operation, or branch).
 The integer pipe provides 3 IEUs and 3 AGUs.
 Each of the scheduler’s 3 queues is physically associated with an IEU/AGU pair.
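The figure of 6 ROPs per cycle follows from three queues that each feed one IEU and one AGU. The sketch below only illustrates that counting: it assumes every queued ROP is ready and independent, which real scheduling does not, and the class name and MOP representation are hypothetical.

```python
QUEUES = 3        # one queue per IEU/AGU pair
QUEUE_DEPTH = 5   # 3 x 5 = 15 scheduler entries


class IntegerScheduler:
    """Toy model of three 5-entry queues, each paired with an IEU and an AGU."""

    def __init__(self):
        self.queues = [[] for _ in range(QUEUES)]

    def capacity(self):
        return QUEUES * QUEUE_DEPTH

    def insert(self, lane, mop):
        """Queue a MOP, given as {'alu': bool, 'agu': bool}; False if full."""
        if len(self.queues[lane]) >= QUEUE_DEPTH:
            return False
        self.queues[lane].append(dict(mop))
        return True

    def issue(self):
        """One cycle: each lane's IEU and AGU each take at most one ROP.

        Assumes all ROPs are ready; dependencies are ignored in this sketch.
        """
        issued = 0
        for q in self.queues:
            for unit in ("alu", "agu"):
                for mop in q:
                    if mop[unit]:
                        mop[unit] = False
                        issued += 1
                        break
            # Drop MOPs whose ROPs have all issued.
            q[:] = [m for m in q if m["alu"] or m["agu"]]
        return issued


sched = IntegerScheduler()
for lane in range(QUEUES):
    sched.insert(lane, {"alu": True, "agu": False})   # an ALU operation
    sched.insert(lane, {"alu": False, "agu": True})   # an address generation
first_cycle = sched.issue()
```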

K7 provides large Memory Bandwidth
to support instruction execution
 The 44-entry LSU queues memory requests from the AGUs (3/cycle) and issues them to the D-cache out of order (the cache can provide up to 8 Gbytes/s of bandwidth).
 Data is snooped from the result buses.
 The cache is nonblocking.
 The data cache is physically tagged. A 2-level translation lookaside buffer (TLB) translates effective addresses to physical addresses in parallel with the cache lookup.

K7’s additional features
 On-chip tags support a large external L2.
 K7 connects to the chip set via a point-to-point interconnect instead of a shared bus.
 This requires more pins in MP configurations but allows the bus to run at higher speed.
 K7’s bus comprises three separate ports: address in, address out, and a 72-bit bidirectional data port.
 It uses a five-state MOESI cache coherence protocol (‘owned’ state added).
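The effect of the added 'owned' state can be seen in a simplified transition function. This follows the usual textbook description of MOESI rather than the K7's actual bus protocol; the event names and the `others_have_copy` flag are hypothetical.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    O = "Owned"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def next_state(state, event, others_have_copy=False):
    """Simplified textbook MOESI transitions for one cache line.

    Events (names assumed for this sketch):
      local_read / local_write : access by this cache's own processor
      snoop_read / snoop_write : access by another processor, seen on the bus
    """
    if event == "local_read":
        if state is State.I:   # read miss: fill as Shared or Exclusive
            return State.S if others_have_copy else State.E
        return state
    if event == "local_write":
        return State.M         # other copies are invalidated by the write
    if event == "snoop_read":
        if state is State.M:
            return State.O     # supply the dirty data, keep ownership
        if state is State.E:
            return State.S
        return state
    if event == "snoop_write":
        return State.I
    raise ValueError(f"unknown event: {event}")


after_snoop = next_state(State.M, "snoop_read")
```

In a four-state MESI protocol a modified line would typically have to be written back to memory before it could be shared; with the Owned state the owning cache can supply the dirty data directly and defer the write-back.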

Summarizing about K7 microprocessor
 K7 is the most complex of the current x86 processors.
 It seems to outperform Intel’s models on an instructions-per-clock basis.
 It promises AMD performance leadership, allowing the company to increase both prices and profit margins.

 Part III: AMD’s Athlon (1st member of the 7th-generation AMD processor family)
Figure: Processor Architecture/Technology Competitive Comparison (AMD Seventh Generation vs. Intel Previous Generation)
AMD Athlon Processor Microarchitecture Features
I. The industry's first nine-issue, superpipelined, superscalar x86 processor
microarchitecture designed for high clock frequencies
 Multiple x86 instruction decoders
 Three out-of-order, superscalar, fully pipelined floating point
execution units, which execute all x87 (floating point), MMX and
3DNow! Instructions
 Three out-of-order, superscalar, pipelined integer units
 Three out-of-order, superscalar, pipelined address calculation units
 72-entry instruction control unit
 Advanced dynamic branch prediction
II. High-performance cache architecture featuring an integrated 128KB L1
cache and a programmable, high-speed backside L2 cache interface
III. 200MHz AMD Athlon processor system bus (scalable beyond 400 MHz)
enabling leading-edge system bandwidth for data movement-intensive
applications
IV. Enhanced 3DNow! technology with new instructions to enable improved
integer math calculations for speech or video encoding and improved data
movement for Internet plug-ins and other streaming applications

AMD Athlon Processor Architecture Block Diagram

II. High-Performance Cache Design
 Integrated, dual-ported 64KB split-L1 data and instruction caches with a separate snoop port and eight banks to support concurrent access by two 64-bit loads or stores,
 Multi-level translation look-aside buffers (TLBs),
 A scalable L2 cache controller with a 72-bit interface, and
 An integrated tag for cost-effective 512KB L2 configurations
 First to incorporate a system-based MOESI (Modify, Owner, Exclusive, Shared, Invalid) cache control protocol for x86 multiprocessing platforms, thus delivering exceptional performance in both uniprocessor and multiprocessor systems
 It supports error correction code (ECC) protection
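Banking lets two accesses proceed in the same cycle as long as they fall in different banks. The slides do not say how addresses map to banks, so the sketch below assumes the simplest scheme, interleaving on 8-byte (64-bit) boundaries; both function names are hypothetical.

```python
NUM_BANKS = 8          # eight banks, as stated in the slides
BANK_WIDTH_BYTES = 8   # assumed: one 64-bit word per bank, interleaved


def bank_of(addr):
    """Bank selected by an address under the assumed interleaving."""
    return (addr // BANK_WIDTH_BYTES) % NUM_BANKS


def can_access_together(addr_a, addr_b):
    """True if two 64-bit accesses map to different banks in this model."""
    return bank_of(addr_a) != bank_of(addr_b)


# Hypothetical addresses: neighbouring words versus words 64 bytes apart.
neighbours_ok = can_access_together(0x1000, 0x1008)
stride64_ok = can_access_together(0x1000, 0x1040)
```

Under this assumed mapping, neighbouring 64-bit words land in different banks, while addresses 64 bytes apart collide in the same bank.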

III. The Industry's First 200-MHz System Bus for x86 Platforms
 ADVANTAGES OFFERED
Figure: AMD Seventh Generation vs. Intel Previous Generation

IV. Leading-Edge Floating Point and
3D Multimedia Technology
 The three execution units (Fmul, Fadd, and Fstore) in the AMD Athlon processor's floating point pipeline handle all x87 (floating point) instructions, MMX instructions, and enhanced 3DNow! instructions.
(Using a packed data format and single-instruction multiple-data (SIMD) operations, it can deliver as many as four 32-bit, single-precision floating point results per clock cycle, resulting in a peak performance of 4.0 Gigaflops at 1000 MHz.)
 Uses AMD's original 21 3DNow! instructions plus 24 new ones:
 12 that improve multimedia-enhanced integer math calculations,
 7 that accelerate data movement and Internet functionality,
 5 DSP instructions that enhance the performance of communications applications (!!! unique in Athlon)
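The peak figure and the instruction counts quoted above are simple arithmetic, worked through below. The function and variable names are only illustrative.

```python
def peak_gflops(results_per_cycle, clock_mhz):
    """Peak floating-point rate: results per cycle times cycles per second."""
    return results_per_cycle * clock_mhz / 1000.0


# Four single-precision results per clock at 1000 MHz, as quoted in the slides.
athlon_peak = peak_gflops(4, 1000)

# Breakdown of the new 3DNow! instructions, as listed in the slides.
new_instructions = {"integer math": 12, "data movement": 7, "DSP": 5}
new_total = sum(new_instructions.values())
enhanced_3dnow_total = 21 + new_total   # original 21 plus the new ones
```

The breakdown is consistent with the stated total: 12 + 7 + 5 = 24 new instructions, 45 in all.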
Examples of applications that benefit
from Athlon’s processor capabilities
 Advanced imaging software for processing digital imaging

 Enhanced Internet browsing using next-generation browser features
 Architectural 3D rendering systems
 CAD/CAE software packages
 Near real-time MPEG-2 video encoding/editing for higher quality video
 Speech recognition in Web browsing and word processing
 Financial modelling and trading software
 Realistic 3D software, including 3D games and flight simulators.

Summarizing about AMD’s Athlon processor
 It uses the latest microarchitecture innovation and system bus technology to deliver the industry’s highest performance for x86-compatible platforms.
 Compatible with x86 versions of Microsoft Windows, as well as other
operating systems.
 Provides a new level of performance and data movement capabilities for
the next generation computation-intensive software.
 Powers the Next Generation for the growing fields of
digital imaging,
the Internet,
enterprise computing,
CAD/CAE packages,
scientific/technical applications, and
3D gaming.

 CONCLUSION
 AMD’s microprocessors always seem to be competing against Intel’s performance and market dominance,
 Each model takes on the challenge of offering:
 High performance at the same clock rate,
 Advanced techniques of register renaming,
out-of-order execution and of superscalar
design,
 Support for high-end desktop products,
uniprocessor and multiprocessing
workstations and servers.
 AMD’s forthcoming optimized chipsets are planned to enable multiprocessing system design based on 2 or more AMD family processors.
