
 AMD’s Microprocessor Architectures

AMD’s Microprocessor Model                          Launch Yr   Competitor Intel Model
I.   K5 (Four-Issue Out-of-order Processor,         1994        100 MHz Pentium
     first chip of K86 family)
II.  K7 (Three-Issue Superscalar Processor)         1998        Katmai
III. Athlon (World’s First 7th Generation           2000        Pentium III
     x86 Processor)

Fields of discussion:
 Basic features of each model’s architecture,
 Instruction handling,
 Caches & Memory management,
 Pipeline (# of stages, possible penalties).
Advanced Computer Architectures 1
 Part I: AMD’s K5
- Basic Features

- Advanced four-issue superscalar core that supports speculative,
out-of-order execution and register renaming.
- Static chip of 4.3 million transistors, designed for 3.3 V and
implemented in AMD’s 0.5-micron, three-metal CMOS process.
- 30% faster than the Pentium at the same clock rate (2.5 times as
fast as a 486). The K5 is a flexible, aggressive microarchitecture that
achieves higher performance on integer code; less emphasis is
given to floating-point performance.
- The challenge of decoding multiple x86 instructions in parallel
is met by predecoding them as they are fetched from memory
into the instruction cache.



Block diagram of AMD’s K5

 K5 adds predecode bits to x86 instructions before caching them.
 x86 instructions are converted, 16 bytes at a time, into one or more
microinstructions (RISC operations, or ROPs); up to four ROPs are issued per cycle.
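
The predecode idea above can be illustrated with a minimal sketch. The slides do not give the K5's actual predecode format, so the two flags used here (instruction start and instruction end per byte), the function name `predecode`, and the example instruction lengths are all assumptions made for illustration.

```python
def predecode(lengths, line_size=16):
    """Mark the first and last byte of each instruction in one cache line.

    lengths: hypothetical instruction lengths in bytes, in fetch order.
    Returns two lists of 0/1 flags, one flag per byte of the line.
    """
    start = [0] * line_size
    end = [0] * line_size
    pos = 0
    for n in lengths:
        if pos + n > line_size:
            break  # instruction spills into the next line; ignored in this sketch
        start[pos] = 1
        end[pos + n - 1] = 1
        pos += n
    return start, end


# Four made-up instructions of 1, 3, 2 and 5 bytes.
start_bits, end_bits = predecode([1, 3, 2, 5])
print([i for i, b in enumerate(start_bits) if b])  # byte offsets where instructions begin
print([i for i, b in enumerate(end_bits) if b])    # byte offsets where instructions end
```

With boundaries stored alongside the bytes, several decoders can each pick up an instruction in the same cycle instead of finding boundaries one instruction at a time.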
K5’s instruction translation process

 Instructions are fed into a 16-byte queue.
 Up to four ROPs’ worth of instructions can be pulled from the queue during
each clock cycle.
 Four identical ROP converters translate x86 instructions into ROPs.
 Past the ROP converters, the instruction stream is RISC-like.
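
The four-per-cycle limit can be sketched as follows. This is a simplified model, not the K5's actual dispatch logic: the ROP counts per instruction are invented, and the sketch assumes the ROPs of one instruction may be split across two cycles, which the slides do not state.

```python
def dispatch(rop_counts, width=4):
    """Group ROPs into cycles, at most `width` ROPs per cycle.

    rop_counts: hypothetical number of ROPs each x86 instruction expands to.
    Returns a list of cycles; each cycle is a list of (instruction, rop) pairs.
    """
    rops = [(i, k) for i, n in enumerate(rop_counts) for k in range(n)]
    return [rops[c:c + width] for c in range(0, len(rops), width)]


# Five made-up instructions that expand to 8 ROPs in total.
cycles = dispatch([1, 2, 1, 3, 1])
for number, group in enumerate(cycles):
    print(number, group)
```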
K5’s Execution Units
 ROPs are dispatched to the execution units’ reservation stations, where they
wait for the needed resources to become available.
Available execution units (6 in number):
 2 ALUs (one with a shifter, the other with the divider; 2
reservation stations)
 1 FPU (1 reservation station)
 2 load/store units (2 reservation stations each), and
 1 branch unit (2 reservation stations).

 8 operand buses feed the execution units (4 units are fed with 2
operands on every cycle).

 5 result buses, 41 bits wide, support transfers of floating-point data (2 are
used in parallel to support x86-compatible floating-point operations).

 40-word register file

 One 16-entry reorder buffer (ROB) stores results from speculatively
executed instructions; the results are later written to the register file.
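
A reorder buffer of this kind can be sketched in a few lines. The class below is illustrative only: the entry layout, method names, and register names are assumptions, and it models just the property named above, namely that results may complete out of order but reach the register file in program order.

```python
from collections import deque


class ReorderBuffer:
    """Minimal reorder buffer: allocate in order, complete in any order, retire in order."""

    def __init__(self, size=16):
        self.size = size
        self.entries = deque()  # oldest entry on the left
        self.regfile = {}       # architectural register file

    def allocate(self, dest):
        """Reserve an entry for an instruction writing register `dest`."""
        if len(self.entries) >= self.size:
            return None  # buffer full: dispatch would stall
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Record a speculative result; the register file is not touched yet."""
        entry["value"] = value
        entry["done"] = True

    def retire(self):
        """Write finished results to the register file, oldest first."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.regfile[entry["dest"]] = entry["value"]
            retired.append(entry["dest"])
        return retired


rob = ReorderBuffer()
first = rob.allocate("eax")
second = rob.allocate("ebx")
rob.complete(second, 2)       # the younger instruction finishes first
early = rob.retire()          # nothing retires: the older one is still pending
rob.complete(first, 1)
late = rob.retire()           # both retire, in program order
```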



Pipeline in K5’s architecture
 Six pipeline stages, of which only five affect instruction timing (and hence performance).
 An extra decode stage is required for the x86-to-ROP translation.
 Address generation is combined into the same stage as cache access.

Extra decode stage
1st phase: calculation of cache index
2nd phase: full linear 32-bit address calculation, segment-protection checking
Caches and Memory management

 The data cache is divided into four banks. There are 2 access ports,
one for each load/store unit.
 The instruction cache has a 16-byte line size. A 16-byte buffer ensures
compatibility with the Pentium bus, which performs 32-byte bursts: the
buffer holds the second cache line.

 The caches are virtually addressed and tagged, which avoids the need to
translate addresses before a cache access.
 A single set of physical tags is shared by the instruction and data caches.
This eliminates conflicts with the CPU for cache access and ensures
consistency between the instruction and data caches.
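
The four banks and two access ports of the data cache suggest a simple rule of thumb: two accesses can proceed in the same cycle when they fall in different banks. The sketch below illustrates that rule only. How the K5 actually selects a bank is not given in the slides, so the 4-byte interleaving and both function names are assumptions.

```python
def bank_of(address, num_banks=4, interleave_bytes=4):
    """Return the bank an address falls in (interleaving granularity is assumed)."""
    return (address // interleave_bytes) % num_banks


def can_issue_together(addr_a, addr_b, num_banks=4, interleave_bytes=4):
    """Two accesses share a cycle in this model only if they target different banks."""
    return bank_of(addr_a, num_banks, interleave_bytes) != bank_of(
        addr_b, num_banks, interleave_bytes
    )


print(can_issue_together(0x1000, 0x1004))  # adjacent words land in different banks
print(can_issue_together(0x1000, 0x1010))  # 16 bytes apart maps back to the same bank
```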



Summary: AMD’s K5

 The K5 goes further in combining x86 compatibility with a RISC-like core
(employing an architecture similar to NexGen’s Nx586),
achieving: - a large software base,
- high performance.

 The K5’s competitors at the time:
- Cyrix’s M1 (not yet launched),
- Pentium, which
 holds a large part of the marketplace, often forcing AMD to
provide higher performance at the same price,
 is already on the way to increasing its clock rate (promising
a 120-MHz part in 1 year’s time, early 1995, or even a
150-MHz part in 2 years’ time, late 1995).



 Part II: AMD’s K7 3-issue superscalar processor - Basic Features

 10 pipeline stages -> high clock rates,
 Large L1 instruction and data caches -> functions in systems
with or without a backside L2,
 Astonishingly small die (184 mm²), despite its transistor
count (22 million).

 Up to 72 instructions can be in execution in the K7’s out-of-order integer pipe,
floating-point pipe, and load/store unit.
K7’s deep 10-stage pipeline

 Long pipeline (the first half is occupied by x86 instruction decoding),
 Simple branch prediction using a 2,048-entry branch history
table with a 2-bit Smith prediction algorithm, but
 A misprediction costs a minimum of 10 cycles.
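
A 2-bit Smith predictor keeps one saturating counter per table entry and predicts "taken" when the counter is in its upper half. The sketch below pairs such a table with the 10-cycle minimum penalty quoted above. The indexing function, the initial counter value, and the example branch address are assumptions, since the slides give only the table size and the algorithm name.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating counters (0..3)."""

    def __init__(self, entries=2048):
        self.entries = entries
        self.table = [1] * entries  # start weakly not-taken (assumed initial state)

    def _index(self, pc):
        return pc % self.entries  # assumed indexing: low bits of the branch address

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # 2 or 3 means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def run(outcomes, pc=0x400, penalty=10):
    """Return (mispredictions, minimum penalty cycles) for one branch's outcomes."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses, misses * penalty


# A loop branch taken nine times, then falling through once.
misses, cycles_lost = run([True] * 9 + [False])
print(misses, cycles_lost)
```

In this run the predictor misses the first iteration and the loop exit, so a ten-iteration loop loses at least 20 cycles to mispredictions under these assumptions.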



K7’s Instruction Decoding
Delivers High Instruction Bandwidth

 x86 instructions are decoded into MOPs (3 per cycle) and dispatched
either to the integer pipe (via the direct-path or vector-path decoders,
depending on their complexity) or to the floating-point pipe
(passed directly to the ICU).



K7’s Out-of-Order Integer Pipe Issuing 6 ROPs/Cycle

 The integer scheduler is an out-of-order, 15-entry
reservation station organized as three 5-MOP queues.
A MOP equals 1 or 2 ROPs (load, store, load/store, ALU
operation, or branch).
 The integer pipe provides 3 IEUs and 3 AGUs.
 Each of the 3 scheduler queues is physically associated
with an IEU/AGU pair.
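
The numbers on this slide fit together arithmetically: three queues of five MOPs give 15 entries, and three IEU/AGU pairs give a peak of six ROPs per cycle. The sketch below checks that arithmetic with a toy scheduler. The class, its method names, and the idea that the caller picks the queue are assumptions; the slides do not describe how MOPs are steered to queues.

```python
class IntegerScheduler:
    """Toy model: three 5-entry MOP queues, each tied to one IEU/AGU pair."""

    NUM_QUEUES = 3
    QUEUE_DEPTH = 5

    def __init__(self):
        self.queues = [[] for _ in range(self.NUM_QUEUES)]

    @property
    def capacity(self):
        return self.NUM_QUEUES * self.QUEUE_DEPTH

    def insert(self, queue_index, mop):
        """Place a MOP in a queue; return False if that queue is full."""
        queue = self.queues[queue_index]
        if len(queue) >= self.QUEUE_DEPTH:
            return False
        queue.append(mop)
        return True

    def peak_rops_per_cycle(self):
        """Each queue can feed one IEU and one AGU, so two ROPs per queue."""
        return self.NUM_QUEUES * 2


sched = IntegerScheduler()
accepted = [sched.insert(0, f"mop{i}") for i in range(6)]  # sixth MOP finds queue 0 full
print(sched.capacity, sched.peak_rops_per_cycle(), accepted)
```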



K7 provides large Memory Bandwidth
to support instruction execution

 The 44-entry LSU queues memory requests from the AGUs (3 per cycle), and
 issues them to the D-cache out of order (the cache can
provide up to 8 Gbytes/s of bandwidth).
 Data is snooped from the result buses.
 The cache is nonblocking.
 The data cache is physically tagged. A 2-level translation lookaside buffer
(TLB) translates effective addresses to physical addresses in parallel.
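
The 8 Gbytes/s figure can be reproduced with simple arithmetic if one assumes two 64-bit cache accesses per cycle (the dual-ported L1 described in Part III) and a 500-MHz core clock. The clock rate is an assumption here; the slides do not state it.

```python
def peak_bandwidth_gbytes(ports, bytes_per_access, clock_mhz):
    """Peak cache bandwidth in Gbytes/s (1 Gbyte taken as 10**9 bytes)."""
    return ports * bytes_per_access * clock_mhz * 1e6 / 1e9


# Two ports, 8 bytes (64 bits) each, at an assumed 500 MHz.
bandwidth = peak_bandwidth_gbytes(ports=2, bytes_per_access=8, clock_mhz=500)
print(bandwidth)
```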



K7’s additional features
 On-Chip tags support Large External L2.

 The K7 connects to the chip set via a point-to-point interconnect
instead of a shared bus.
 This requires more pins in MP configurations but
allows the bus to run at a higher speed.
 The K7’s bus comprises three separate ports:
address in, address out, and a 72-bit bidirectional data port.
 It uses a five-state MOESI cache coherence protocol (‘owned’ state added).
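
The role of the added ‘owned’ state is easiest to see in a state-transition sketch. The function below follows the textbook MOESI transitions for a single cache line, not the K7's documented bus behaviour; the event names and the `others_have_copy` flag are illustrative assumptions.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def next_state(state, event, others_have_copy=False):
    """Return the new state of one cache line after a local or snooped access."""
    if event == "local_write":
        return State.MODIFIED  # other copies are invalidated
    if event == "snoop_write":
        return State.INVALID   # another processor takes ownership
    if event == "local_read":
        if state is State.INVALID:
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state
    if event == "snoop_read":
        if state in (State.MODIFIED, State.OWNED):
            return State.OWNED  # keep the dirty line and supply it to the reader
        if state is State.EXCLUSIVE:
            return State.SHARED
        return state
    raise ValueError(f"unknown event: {event}")


print(next_state(State.MODIFIED, "snoop_read"))
```

Without the owned state (plain MESI), a snooped read of a modified line would typically force a write-back to memory before the line could be shared; with it, the dirty line can be shared directly between caches.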



Summary: the K7 microprocessor

 The K7 is the most complex of the current x86 processors.
 It appears to outperform Intel’s models on an
instructions-per-clock basis.
 It promises AMD performance leadership, allowing the company
to increase both prices and profit margins.


 Part III: AMD’s Athlon (1st member of the 7th-generation AMD processor
family)
[Figure: Processor Architecture/Technology, Competitive Comparison (AMD Seventh Generation vs. Intel Previous Generation)]
AMD Athlon Processor Microarchitecture Features
I. The industry's first nine-issue, superpipelined, superscalar x86 processor
microarchitecture designed for high clock frequencies
 Multiple x86 instruction decoders
 Three out-of-order, superscalar, fully pipelined floating point
execution units, which execute all x87 (floating point), MMX and
3DNow! instructions
 Three out-of-order, superscalar, pipelined integer units
 Three out-of-order, superscalar, pipelined address calculation units
 72-entry instruction control unit
 Advanced dynamic branch prediction
II. High-performance cache architecture featuring an integrated 128KB L1
cache and a programmable, high-speed backside L2 cache interface
III. 200MHz AMD Athlon processor system bus (scalable beyond 400 MHz)
enabling leading-edge system bandwidth for data movement-intensive
applications
IV. Enhanced 3DNow! technology with new instructions to enable improved
integer math calculations for speech or video encoding and improved data
movement for Internet plug-ins and other streaming applications



AMD Athlon Processor Architecture Block Diagram



II. High-Performance Cache Design

 Integrated, dual-ported 64KB split L1 data and instruction caches with a
separate snoop port and eight banks to support concurrent access by two
64-bit loads or stores,
 Multi-level translation look-aside buffers (TLBs),
 A scalable L2 cache controller with a 72-bit interface, and
 An integrated tag for cost-effective 512KB L2 configurations

 First to incorporate a system-based MOESI (Modified, Owned, Exclusive,
Shared, Invalid) cache control protocol for x86 multiprocessing platforms,
thus delivering exceptional performance in both uniprocessor and multiprocessor systems
 It supports error correction code (ECC) protection



III. The Industry's First 200-MHz System Bus for x86 Platforms
 ADVANTAGES OFFERED

[Figure: AMD Seventh Generation vs. Intel Previous Generation]



IV. Leading-Edge Floating Point and
3D Multimedia Technology

 The three execution units (Fmul, Fadd, and Fstore) in the AMD Athlon
processor's floating point pipeline handle all x87 (floating point)
instructions, MMX instructions, and enhanced 3DNow! instructions.
(Using a data format and single-instruction multiple-data (SIMD)
operations, it can deliver as many as four 32-bit, single-precision
floating point results per clock cycle, resulting in a peak
performance of 4.0 Gigaflops at 1000 MHz.)
 Uses AMD's original 21 3DNow! instructions plus 24 new ones:
 12 that improve multimedia-enhanced integer math calculations,
 7 that accelerate data movement and Internet functionality,
 5 DSP instructions that enhance the performance of communications
applications (unique to the Athlon).
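
Both sets of figures on this slide can be checked with a few lines of arithmetic: four results per clock at 1000 MHz, and the breakdown of the 24 new instructions. The helper name `peak_gflops` and the dictionary labels are made up for this check.

```python
def peak_gflops(results_per_clock, clock_mhz):
    """Peak floating-point rate in Gigaflops (10**9 results per second)."""
    return results_per_clock * clock_mhz / 1000


# Four single-precision results per clock at 1000 MHz.
peak = peak_gflops(results_per_clock=4, clock_mhz=1000)

# Breakdown of the new instructions as listed on the slide.
new_instructions = {"integer math": 12, "data movement": 7, "DSP": 5}
ORIGINAL_3DNOW = 21
total_new = sum(new_instructions.values())
total = ORIGINAL_3DNOW + total_new

print(peak, total_new, total)
```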
Examples of applications that benefit
from Athlon’s processor capabilities

 Advanced imaging software for processing digital imaging


 Enhanced Internet browsing using next-generation browser features
 Architectural 3D rendering systems
 CAD/CAE software packages
 Near real-time MPEG-2 video encoding/editing for higher quality video
 Speech recognition in Web browsing and word processing
 Financial modelling and trading software
 Realistic 3D software, including 3D games and flight simulators.



Summary: AMD’s Athlon processor

 It uses the latest microarchitecture innovations and system bus
technology to deliver the industry’s highest performance for
x86-compatible platforms.
 Compatible with x86 versions of Microsoft Windows, as well as other
operating systems.
 Provides a new level of performance and data movement capabilities for
the next generation computation-intensive software.
 Powers the Next Generation for the growing fields of
digital imaging,
the Internet,
enterprise computing,
CAD/CAE packages,
scientific/technical applications, and
3D gaming.



 CONCLUSION

 AMD’s microprocessors have consistently competed against
Intel’s performance and market dominance.
 Each model takes up the challenge of offering:
 High performance at the same clock rate,
 Advanced techniques: register renaming,
out-of-order execution, and superscalar design,
 Support for high-end desktop products,
uniprocessor and multiprocessing
workstations, and servers.

 AMD’s forthcoming optimized chipsets are planned to enable multiprocessing
system designs based on 2 or more AMD family processors.
