Superscalar Processor Pentium Pro

I NTRODUCTION
As it is clear from the name of the topic , the topic deals with the techniques associated with
superscalar architecture in computers and its application in modern Pentium pro processors.
This is my attempt to provide all the information as a zest or summary. I have tried to explain
each and every term associated with superscalar architecture and their application. I hope that
the reader will be able to know the fundamentals of superscalar processing technique after
reading this document.
1
SUPERSCALAR PROCESSOR PENTIUM PRO
At first we should have knowledge of the following terms associated with the term
superscalar architecture.
PIPELINING-
pipelining is a technique of decomposing a sequential process into sub operations, with each
sub process being executed in a special dedicated segment that operates concurrently with all
other segments. A pipeline can be visualized as a collection of processing segments through
which binary information flows. Each segment performs partial processing dictated by the
way the task is partitioned. The result obtained from the computation in each segment is
transferred to the next segment in the pipeline. The final result is obtained after the data have
passed through all segments. The name “pipeline” implies a flow of information analogous to
an industrial assembly line. It is characteristic of pipelines that several computations can be in
progress in distinct segments at the same time. The overlapping of computation is made
possible by associating a register with each segment in the pipeline.
SUPERSCALAR PROCESSORS-
A super scalar architecture has a form of parallelism on a single chip allowing the system as a
whole to run much faster than it would otherwise be able to at a given clock speed. A
superscalar architecture fetches, executes and returns results from more than one instruction
during a single pipeline stage . A scalar processor processes one data item at a time. In a
vector processor, by contrast, a single instruction operates simultaneously on multiple data
items. The difference is analogous to the difference between scalar and vector arithmetic. A
superscalar processor is a sort of a mixture of the two. Each instruction processes one data
item, but there are multiple processing units so that multiple instructions can be processing
separate data items at the same time.
PARALLEL COMPUTING-
• Traditionally, software has been written for Serial computation:

o To be run on a single computer having a single Central Processing Unit
(CPU);
o A problem is broken into a discrete series of instructions.
o Instructions are executed one after another.
o Only one instruction may execute at any moment in time.
2
• In the simplest sense, Parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem:
o To be run using multiple CPUs
o A problem is broken into discrete parts that can be solved concurrently
o Each part is further broken down to a series of instructions
o Instructions from each part execute simultaneously on different CPUs
VECTOR PROCESSING-
A computer with built-in instructions that perform multiple calculations on vectors (one-
dimensional arrays) simultaneously. It is used to solve the same or similar problems as an
array processor; however, a vector processor passes a vector to a functional unit, whereas an
array processor passes each element of a vector to a different arithmetic unit.
SCALAR VS VECTOR PROCESSING-
3
RISC- (REDUCED INSTRUCTION SET COMPUTING)
A RISC (reduced instruction set computer) is a microprocessor that is designed to perform a

smaller number of types of computer instruction so that it can operate at a higher speed
(perform more million instructions per second, or millions of instructions per second). Since
each instruction type that a computer must perform requires additional transistors and
circuitry, a larger list or set of computer instructions tends to make the microprocessor more
complicated and slower in operation.
CISC-( COMPLEX INSTRUCTION SET COMPUTING)
It stands for Complex Instruction Set Computer, is a philosophy for designing chips that are
easy to program and which make efficient use of memory. Each instruction in a CISC
instruction set might perform a series of operations inside the processor. This reduces the
number of instructions required to implement a given program, and allows the programmer to
learn a small but flexible set of instructions.
PENTIUM PRO PROCESSORS-
200 MHz Pentium Pro with a 512 KB L2 cache in PGA package
4
The Pentium Pro is a sixth-generation x86 microprocessor developed and manufactured by
Intel and was introduced in November 1995. It introduced the P6 micro architecture
(sometime referred as i686) and was originally intended to replace the original Pentium in a
full range of applications. While the Pentium and Pentium MMX had 3.1 and 4.5 million
transistors, respectively, the Pentium Pro contained 5.5 million transistors. Later, it was
reduced to a more narrow role as a server and high-end desktop processor and was used in
supercomputers like ASCI Red. The Pentium Pro was capable of both dual- and quad-
processor configurations. It only came in one form factor, the relatively large rectangular
Socket 8.
200 MHz Pentium Pro with a 1 MB L2 cache in PPGA package
Opposite to its name, the Pentium Pro had a completely new micro architecture, a departure
from the Pentium rather than an extension of it. The Pentium Pro (P6) featured many
advanced concepts not found in the Pentium, although it wasn't the first or only x86
processor to implement them. The Pentium Pro pipeline had extra decode stages to
5
dynamically translate IA-32 (Intel Architecture, 32-bit) instructions into buffered micro-
operation sequences which could then be analysed, reordered, and renamed in order to detect
parallelizable operations that may be issued to more than one execution unit at once. The
Pentium Pro thus featured out of order execution, including speculative execution via register
renaming. It also had a wider 36-bit address bus usable by PAE (Physical Address
Extension).
The Pentium Pro has an 8 KB instruction cache, from which up to 16 bytes are fetched on
each cycle and sent to the instruction decoders. There are three instruction decoders. The
decoders are not equal in capability: only one can decode any x86 instruction, while the
other two can only decode simple x86 instructions. This restricts the Pentium Pro's ability to
decode multiple instructions simultaneously, limiting superscalar execution. x86 instructions
are decoded into 118-bit micro-operations (micro-ops). The micro-ops are RISC-like; that is,
they encode an operation, two sources, and a destination. The general decoder can generate
up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each
per cycle. Thus, x86 instructions that operate on the memory (e.g. add this register to this
location in the memory) can only be processed by the general decoder, as this operation
requires at a minimum of three micro-operations. Likewise, the simple decoders are limited
to instructions that can be translated into one micro-operation. Instructions that require more
micro-operations than four are translated with the assistance of a sequencer, which generates
the required micro-operations over multiple clock cycles.
Uncapped Pentium Pro 256 KB
6
Micro-operations exit the ROB and enter a reserve station, where they await dispatch to the
execution units. In each clock cycle, up to five micro-operations can be dispatched to five
execution units. The Pentium Pro has two integer units and one floating-point unit (FPU).
One of the integer units shares the same ports as the FPU, and therefore the Pentium Pro can
only dispatch two integer micro-operations and one floating-point micro-operation per a
cycle. Of the two integer units, only one has the full complement of functions such as a barrel
shifter, multiplier and divider. The second integer unit, which shares paths with the FPU,
does not have these facilities and is limited to simple operations such as add, subtract, and the
calculation of branch target addresses.
The FPU executes floating-point operations. Addition and multiplication are pipelined and
have a latency of three and five cycles, respectively. Division and square-root are not
pipelined and are executed in separate units that share the FPU's ports. Division and
multiplication have a latency of 18 to 36 and 29 to 69 cycles, respectively. The smallest
number is for single precision (32-bit) floating-point numbers and the largest for extended
precision (80-bit) numbers. Division and square root can operate simultaneously with adds
and multiplies, preventing them from executing only when the result has to be stored in the
ROB.
After the microprocessor was released, a bug was discovered in the floating point unit,
commonly called the "Pentium Pro and Pentium II FPU bug" and by Intel as the "flag
erratum". The bug occurs under some circumstances during floating point-to-integer
conversion when the floating point number won't fit into the smaller integer format, causing
the FPU to deviate from its documented behavior. The bug is considered to be minor and
occurs under such special circumstances that very few, if any, software programs are
affected.
The Pentium Pro micro architecture was used in one form or another by Intel for more than a
decade. The pipeline would scale from its initial 150 MHz start, all the way up to 1.4 GHz
with the "Tualatin" Pentium III. The design's various traits would continue after that in the
derivative core called "Banias" in Pentium M and Intel Core (Yonah), which itself would
evolve into the Core architecture (Core 2 processor) in 2006 and onward.
PERFORMANCE-
Performance with 32-bit code was excellent and well ahead of the older Pentiums at the time,
by 25-35%; however, the Pentium Pro's 16-bit performance was approximately only 20%
faster than that of a Pentium due to the fact that register renaming was done on full 32-bit
registers only (this was fixed in the Pentium-II). It was this, along with the Pentium Pro's
high price, that caused the rather lackluster reception among PC enthusiasts, given the
dominance at the time of the 16-bit MS-DOS, 16/32-bit Windows 3.1x, and 32/16-bit
Windows 95 (parts of Windows 95, such as USER.exe, were still mostly 16-bit). To gain the
full advantages of Pentium Pro's micro architecture, one needed to run a fully 32-bit
Operating Systems such as Windows NT 3.51, Unix, Linux or OS/2.
Compared to RISC microprocessors, the Pentium Pro, when introduced, slightly

outperformed the fastest RISC microprocessors on integer performance when running the
SPECint95 benchmark. Floating-point performance was significantly lower, half of some
RISC microprocessors.
7
AN INNOVATION IN CACHE-
Likely Pentium Pro's most noticeable addition was its on-package L2 cache, which ranged
from 256 KB at introduction to 1 MB in 1997. At the time, manufacturing technology did not
feasibly allow a large L2 cache to be integrated into the processor core. Intel instead placed
the L2 die(s) separately in the package which still allowed it to run at the same clock speed as
the CPU core. Additionally, unlike most motherboard-based cache schemes that shared the
main system bus with the CPU, the Pentium Pro's cache had its own back-side bus (called
dual independent bus by Intel). Because of this, the CPU could read main memory and cache
concurrently, greatly reducing a traditional bottleneck. The cache was also "non-blocking",
meaning that the processor could issue more than one cache request at a time (up to 4),
reducing cache-miss penalties. (This is an example of MLP, Memory Level Parallelism.)
These properties combined to produce an L2 cache that was immensely faster than the
motherboard-based caches of older processors. This cache alone gave the CPU an advantage
in input/output performance over older x86 CPUs. In multiprocessor configurations, Pentium
Pro's integrated cache skyrocketed performance in comparison to architectures which had
each CPU sharing a central cache.
However, this far faster L2 cache did come with some complications. The Pentium Pro's "on-
package cache" arrangement was unique. The processor and the cache were on separate dies
in the same package and connected closely by a full-speed bus. The two or three dies had to
be bonded together early in the production process, before testing was possible. This meant
that a single, tiny flaw in either die made it necessary to discard the entire assembly, which
was one of the reasons for the Pentium Pro's relatively low production yield and high cost.
All versions of the chip were expensive, those with 1024 KB being particularly so, since it
required two 512 KB cache dies as well as the processor die.
AVAILABLE MODELS-
Pentium Pro clock speeds were 150, 166, 180 or 200 MHz with a 60 or 66 MHz external bus
clock. Some users chose to overclock their Pentium Pro chips, with the 200 MHz version
often being run at 233 MHz, and the 150 MHz version often being run at 166 MHz. The chip
was popular in symmetric multiprocessing configurations, with dual and quad SMP server
and workstation setups being commonplace.
EVOLUTION IN FABRICATION-
As time progressed, the process used to fabricate the Pentium Pro processor die and its
separate cache memory die changed, leading to a combination of processes used in the same
package:
• The 133 MHz Pentium Pro prototype processor die was fabricated in a 0.6 µm
BiCMOS process.
• The 150 MHz Pentium Pro processor die was fabricated in a 0.50 µm BiCMOS
process.
• The 166, 180, and 200 MHz Pentium Pro processor die was fabricated in a 0.35 µm
BiCMOS process.
• The 256 KB L2 cache die was fabricated in a 0.50 µm BiCMOS process.
• The 512 and 1024 KB L2 cache die was fabricated in a 0.35 µm BiCMOS process.
8
PACKAGING-
The Pentium Pro is packaged in a ceramic multi-chip module (MCM). The MCM contains
two underside cavities in which the microprocessor die and its companion cache die reside.
The dies are bonded to a heat slug, whose exposed top helps enables the heat from the dies to
be transferred more directly to cooling apparatus such as a heat sink. The dies are connected
to the package using conventional wire bonding. The cavities are capped with a ceramic
plate. The Pentium Pro with 1 MB of cache uses a plastic MCM. Instead of two cavities,
there is only one, in which the three dies reside, bonded to the package instead of a heat slug.
The cavities are filled in with epoxy.
The MCM has 387 pins, of which approximately half are arranged in a pin grid array (PGA)
and half in an interstitial pin grid array (IPGA). The packaging was designed for Socket 8.
Upgrade paths
In 1998, the 300/333 MHz Pentium II Overdrive processor for Socket 8 was released.
Featuring 512 KB of full-speed cache, it was produced by Intel as a drop-in upgrade option
for owners of Pentium Pro systems. However, it only supported two-way glueless
multiprocessing, not four-way or higher, which did not make it a usable upgrade for quad-
processor systems.
As Slot 1 motherboards became prevalent, several manufacturers released socket adapters,

such as the Tyan M2020, Asus C-P6S1, Tekram P6SL1, and the Abit KP6. The sockets
allowed Pentium Pro processors to be used with Slot 1 motherboards. The Intel 440FX
chipset explicitly supported both Pentium Pro and Pentium II processors, but the Intel 440BX
and later Slot 1 chipsets did not explicitly support the Pentium Pro, so the Socket 8 sockets
did not see wide use. Sockets, in the form of Socket 370 to Slot 1 adapters, saw renewed
popularity when Intel introduced Socket 370 Celeron and Pentium III processors.
9
SUMMARY
Intel's successor to the Pentium is called the Pentium Pro. The Pentium Pro was the first chip
in the P6 or sixth-generation processor family. It was introduced in November 1995 and
became widely available in 1996. The chip is a 387-pin unit that resides in Socket 8, so it is
not pin compatible with earlier Pentiums. The chip is unique among processors because it is
constructed in a multichip module (MCM) physical format, which Intel calls a dual cavity
PGA package. Inside the 387-pin chip carrier are two dies. One contains the actual Pentium
Pro processor, and the other contains a 256KB, 512KB, or 1MB L2 cache. The processor die
contains 5.5 million transistors, the 256KB cache die contains 15.5 million transistors, and
the 512KB cache die(s) have 31 million transistors each for a potential total of nearly 68
million transistors in a Pentium Pro with 1MB of internal cache! A Pentium Pro with 1MB
cache has two 512KB cache die and a standard P6 processor die.
The main processor die includes a 16KB split L1cache with an 8KB two-way set associative
cache for primary instructions and an 8KB four-way set associative cache for data.
Another sixth-generation processor feature found in the Pentium Pro is the DIB architecture,
which addresses the memory bandwidth limitations of previous-generation processor
architectures. Two buses make up the DIB architecture: the L2 cache bus (contained entirely
within the processor package) and the processor-to-main memory system bus. The speed of
the dedicated L2 cache bus on the Pentium Pro is equal to the full-core speed of the
processor. This was accomplished by embedding the cache chips directly into the Pentium
Pro package. The DIB processor bus architecture addresses processor-to-memory bus
bandwidth limitations. It offers up to three times the performance bandwidth of the single-
bus, "Socket 7" generation processors, such as the Pentium.
Pentium Pro Family Processor

Specifications
Introduced November 1995
Maximum rated speeds 150MHz, 166MHz, 180MHz, 200MHz
CPU clock 2x, 2.5x, 3x, 3.5x, 4x
Internal registers 32-bit
External data bus 64-bit
Memory address bus 36-bit
Addressable memory 64GB
Virtual memory 64TB
Integral L1 cache size 8KB code, 8KB data (16KB total)
Integrated L2 cache bus 64-bit, full-core speed
Socket/Slot Socket 8
Physical package 387-pin dual cavity PGA
Package dimensions 2.46 (6.25cm)x2.66 (6.76cm)
Math coprocessor Built-in FPU
Power management SMM
10
Pentium Pro Family Processor
Specifications
Introduced November 1995
Operating voltage 3.1V or 3.3V
11

Superscalar Processor Pentium Pro

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Superscalar Processor Pentium Pro

Uploaded by

Copyright:

Available Formats

I NTRODUCTION

• Traditionally, software has been written for Serial computation:

SCALAR VS VECTOR PROCESSING-

A RISC (reduced instruction set computer) is a microprocessor that is designed to perform a

CISC-( COMPLEX INSTRUCTION SET COMPUTING)

PENTIUM PRO PROCESSORS-

200 MHz Pentium Pro with a 512 KB L2 cache in PGA package

200 MHz Pentium Pro with a 1 MB L2 cache in PPGA package

Uncapped Pentium Pro 256 KB

Compared to RISC microprocessors, the Pentium Pro, when introduced, slightly

As Slot 1 motherboards became prevalent, several manufacturers released socket adapters,

Pentium Pro Family Processor

Maximum rated speeds 150MHz, 166MHz, 180MHz, 200MHz

CPU clock 2x, 2.5x, 3x, 3.5x, 4x

Internal registers 32-bit

External data bus 64-bit

Memory address bus 36-bit

Addressable memory 64GB

Virtual memory 64TB

Integral L1 cache size 8KB code, 8KB data (16KB total)

Integrated L2 cache bus 64-bit, full-core speed

Physical package 387-pin dual cavity PGA

Package dimensions 2.46 (6.25cm)x2.66 (6.76cm)

Math coprocessor Built-in FPU

Power management SMM

Operating voltage 3.1V or 3.3V

You might also like