direct hardware about what to do. Microsoft Office, a suite of programs used to write letters, calculate audit reports, and prepare presentations on the computer, is one type of software. The operating system is also software; it helps the user operate the computer and manage its resources. The microprocessor is the main and vital part of a computer. It is made up of many transistors fabricated on a single chip. The size of a microprocessor varies with the length of the data it handles: depending on the data format, it may be a 4-bit, 8-bit, 16-bit, 32-bit, or 64-bit microprocessor. As human needs have grown and technology has advanced, the size of the microprocessor in terms of data length (the maximum number of data bits processed at a time) has also grown.
2.1.1 The First Arrival of the Microprocessor
The journey from the first fully functional processor, the Intel 4004 of 1971, to the Intel Core i3 has been full of ups and downs, yet it has succeeded in maintaining consistent growth in both speed and performance. In parallel with Intel, other companies such as Motorola, DEC (Digital Equipment Corporation), and TI (Texas Instruments) also launched microprocessors over time. Various vendors, including IBM, MIPS, Apple, and Sun, designed microcomputers based on these processors.
Microcomputers arrived at the service of people:
Before 1980, semiconductor chips were designed manually. The VLSI design methodology introduced in 1980 by Carver Mead and Lynn Conway was a revolutionary turning point; it showed how complex circuitry such as pipelines and prefetch buffers could be formed on a single chip. Analysis tools such as switch-level simulators and static timing analyzers made module design much easier. The decade that started with 3-micron technology had reached 1.25 microns by 1985. The Intel 386DX, launched in 1985, had a 1-micron gate length. The entire CPU was on a single die, although the FPU and MMU were still on separate chips. As time passed, CMOS displaced NMOS because of its lower power consumption. In the mid-1980s, GaAs was tried as an alternative to silicon. Desktop computing then moved from hobby to business applications, and the first operating system supporting this was MS-DOS. Some early desktops featured BASIC as the primary programming language. Later, the wide use of Unix and C accelerated the need for more advanced processors. Apple introduced the Apple III in 1980, based on the 6502, with a modem, hard disk, and floppy drive. In 1981 IBM launched a desktop using the Intel 8088, with 64 KB RAM, 40 KB ROM, and a 4.77 MHz clock, running PC DOS 1.0. In June 1982 the first IBM PC clone was released. The availability of the BIOS was the key to the clones: once the BIOS was available, anyone could assemble a PC. IBM continued development with the XT in 1983, which had a 10 MB hard drive and 128 KB RAM, and introduced the AT in 1984 with a 6 MHz 80286 processor, 256 KB RAM, a 1.2 MB floppy drive, and PC DOS 3.0. The first desktop based on the 68000 was the Apple Lisa in 1983, with a 5 MHz clock, 1 MB RAM, 2 MB ROM, and a 5 MB hard drive. The Lisa was the first to introduce a GUI. Later, Apple introduced the Macintosh, based on an 8 MHz 68000 CPU. The era of IBM-compatible and IBM-clone PCs had begun. Advances in applications followed: WordPerfect in 1982, Lotus 1-2-3 in 1984, and the first desktop publishing with Aldus PageMaker.
internal execution. The Zilog Z8000 variant chose to abandon compatibility with the Z80 in order to make better use of 16-bit registers and a 16-bit bus. The concept of RISC was already evident in systems such as the IBM 801, a non-microprocessor architecture. RISC used a load/store architecture with fixed-length instructions and no direct memory-to-memory transfers. Where CISC used multi-cycle execution, RISC processors employed single-cycle execution. The VAX and IBM 370 were CISC machines with many instructions, a small subset of which were frequently used. Research teams at Berkeley and Stanford analyzed this and, based on the results, designed RISC-I and RISC-II around a large register file. RISC-I had a two-stage pipeline and RISC-II a three-stage pipeline. This idea was later used in Sun's SPARC architecture. The same teams started working on compilers for performance enhancement, treating the system and the compiler as one unit. They began work on the Microprocessor without Interlocked Pipeline Stages (MIPS) to push optimizing compiler technology further. The MIPS architecture requires a compiler capable of managing all interlocks, data dependencies between instructions, and control dependencies of branches. They also introduced a VLIW-like structure in which two instructions were packed into one 32-bit instruction word.
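The packing of two instructions into one 32-bit word can be sketched as a small encoding exercise. A minimal sketch, assuming an even split into two 16-bit slots purely for illustration (this is not the actual Stanford encoding):

```python
def pack_vliw(slot0: int, slot1: int) -> int:
    """Pack two 16-bit instruction slots into one 32-bit VLIW word.
    The 16-bit slot width is an illustrative assumption."""
    assert 0 <= slot0 < (1 << 16) and 0 <= slot1 < (1 << 16)
    return (slot1 << 16) | slot0

def unpack_vliw(word: int) -> tuple[int, int]:
    """Recover the two 16-bit slots from a 32-bit word."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF

word = pack_vliw(0x1234, 0xABCD)
assert unpack_vliw(word) == (0x1234, 0xABCD)
```

The hardware dispatches both slots in the same cycle; the compiler, not the hardware, is responsible for ensuring the two packed instructions are independent.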
2.1.4 RISC on the Way to Proving Superiority over CISC
The Berkeley RISC and Stanford MIPS showed the path for RISC-based processor development. The MIPS R2000, based on the Stanford MIPS, was launched in 1986. RISC processors were focused primarily on performance, and hence there was strong competition between lower-priced CISC systems and higher-performing UNIX-based RISC systems. IBM-compatible PCs were lower in price than Apple's Macintosh. Advanced processors such as the 80386 and 80486 grew in the market because of low cost and open standards.
Additional architectural features evolved in the second generation of advanced processors. The pipeline was deepened to five stages. The inclusion of data and code cache memories, an on-chip floating-point unit, and a memory management unit made these processors more powerful, with improved performance over past designs. All of this was possible due to advances in microelectronics and fabrication technology. The number of transistors increased from 275,000 in the 80386DX to about 1.2 million in the Intel 80486DX. The Intel 80386 and Motorola 68020 are considered second-generation CISC processors with a limited-capacity pipeline. More than 30 new instructions were added, along with an MMU and four privilege modes. Motorola launched the 68030 in 1987, with a three-stage pipeline and a 20 MHz clock; the FPU was still a separate chip, interfaced as a coprocessor. The first CPU with an on-chip FPU and cache memory came with one million transistors in one-micron CMOS technology, operating at a clock speed of 25 MHz. In 1991 Motorola followed with the 68040, with 1.2 million transistors, two 4 KB caches, and an on-chip FPU.
2.1.5 RISC from various vendors
RISC processors were basically inspired by the Stanford MIPS and Berkeley RISC, with 32-bit instructions and single-cycle execution. Load and store are the only instructions that access memory and memory-mapped I/O, which is why this is also referred to as a load/store architecture. Keeping the two source registers separate from the destination register allowed registers to be reused, unlike in CISC, where one of the operands is overwritten by the result. A large register file with two reads and one write per cycle was a unique feature of RISC. Instruction decoding was made simple and fast by avoiding complex instructions. The MIPS R2000 was the first commercially available RISC microprocessor. Interlocks were avoided; to ensure registers always hold the latest value, the compiler inserts a one-clock-cycle delay where needed to guarantee correct operation. One of the best features MIPS offered was loading and storing misaligned data using only two instructions. Two special registers, HI and LO, hold the quotient and remainder. The theme was a processor with an efficient pipeline, since MIPS had no interlocks. The MIPS R3000 was launched in 1989, with 144 pins and a 54 sq-mm die, clocked at 25 MHz. Around the same time, Intel introduced the 80486, with an on-chip FPU and 8 KB of on-chip cache, 168 pins, a 33 MHz clock, and a 165 sq-mm die. The Intel part was nevertheless cheaper than the R3000. In parallel, Sun Microsystems introduced the RISC-based SPARC architecture, with its unique register-window feature, which allows saving and restoring registers on subroutine calls. It also had a tagged-addition feature to support artificial intelligence languages.
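The load/store style described above can be shown with a toy register machine. A minimal sketch, with illustrative register numbers and addresses (not real MIPS encodings): only LOAD and STORE touch memory, and ADD writes a destination register distinct from its two sources, so neither operand is destroyed.

```python
# Toy load/store machine: only LOAD/STORE access memory; ADD is
# register-to-register with a separate destination, so both source
# operands survive for reuse (unlike two-operand CISC, where one
# operand is overwritten by the result).
memory = {100: 7, 104: 5}
regs = [0] * 8

def load(rd, addr):    regs[rd] = memory[addr]        # rd <- mem[addr]
def store(rs, addr):   memory[addr] = regs[rs]        # mem[addr] <- rs
def add(rd, rs1, rs2): regs[rd] = regs[rs1] + regs[rs2]

load(1, 100)   # r1 <- mem[100]
load(2, 104)   # r2 <- mem[104]
add(3, 1, 2)   # r3 <- r1 + r2; r1 and r2 are preserved
store(3, 108)  # mem[108] <- r3

assert (regs[1], regs[2], regs[3], memory[108]) == (7, 5, 12, 12)
```

A memory-to-memory CISC instruction would do the whole sequence in one multi-cycle instruction; the RISC version trades instruction count for simple, single-cycle operations that pipeline cleanly.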
2.1.6 RISC Era and Deeper Pipeline Architecture
The middle-generation microprocessors came with deeper pipelines in the RISC era. Intel, AMD, and Motorola all moved to introduce RISC, producing the Intel 80960K, the AMD 29000, and the Motorola 88100 with embedded-feature support. Motorola, however, failed in competition with Intel and AMD. By this era RISC had already become popular, and the new Advanced RISC Machine (ARM) architecture came to market. In general-purpose computing, the success of x86 was extraordinary, and hence other vendors such as AMD started building x86 clones. The number of instructions per second and the clock frequency were the major issues at that time. IBM, Motorola, and Apple joined hands to build a RISC processor, named PowerPC, to counter the threat from Intel, but the alliance failed; meanwhile Intel concentrated on the market and launched performance-oriented designs, setting aside the question of whether they were RISC or CISC. In the middle-generation microprocessors it was necessary to highlight the
of the most recent as well as most frequent branches. The PowerPC was weak in this area, having only simple static branch prediction.
2.1.7 Superpipelined Architecture and Intel Dominance
In 1992 the MIPS R4000 was launched, with a superpipelined 64-bit architecture. The address and data buses doubled in width, with a high clock frequency of 100 MHz. It was mainly used in graphics and game machines. Sun also came out with the SuperSPARC, with dual processors, but it remained unpopular. Sun came back into the picture with the launch of the UltraSPARC, which had a 167 MHz clock, two integer units, five floating-point units, a branch unit, and a load/store unit. The caches were 16 KB: direct-mapped for data and 2-way set-associative for instructions. A Visual Instruction Set for pixel processing was a unique feature introduced for the first time; it supported the Motion Picture Experts Group (MPEG) standards for motion-estimation computations. HP also joined this race, designing the PA-RISC 7100 and 7200 32-bit processor series, with a high-speed external cache built from static RAMs. The 180 MHz PA-RISC 8000, launched in 1996, was HP's first 64-bit processor. New processors generally dominated older ones; only the Alpha remained popular for a decade or more against newly arrived processors.
Intel dominance:
After the invention of the mouse, Microsoft designed an interactive OS called Windows 3.0 for the IBM PC. Windows-based IBM PCs captured a large market. The 80386 and 80486 were cloned by manufacturers such as AMD and Cyrix. To distinguish itself from these clones, and because of trademark problems, Intel changed the name of its 80586 processor to Pentium. The Pentium was launched with a superscalar pipeline, a 60 MHz clock, two integer pipes, dual instruction issue, deep pipelines, and separate data and code caches, built in BiCMOS technology to get more speed than CMOS. The clock was later increased to 100 MHz without major changes. Macintosh systems based on the PowerPC were costlier at that time, which is why Apple slowly lost market share. In 1995 Microsoft launched the Windows 95 OS, with device recognition and support for installing CD-ROM drives, printers, modems, and so on. Windows NT had earlier been introduced for workstations, with database-management and spreadsheet applications and multitasking. The Pentium was further improved and launched as the Pentium Pro in 1996. Many new pixel-processing instructions were added to improve animation and graphics processing, a feature marketed as MMX. More features, together with price cuts, made Intel the leading processor manufacturer and seller of the time. AMD and Cyrix still followed, but always lagged somewhat behind Intel.
2.1.8 Speed-Increasing Mechanisms and the Thermal Problem
From 1995 onward, speed was increased by raising the clock frequency and making devices faster through improved fabrication technologies: reducing the delay in the flow of data between gates and making transistors switch faster. But the great loophole in this approach was higher power consumption and, in turn, heat generation. The solution found at that time was to drop the supply voltage to 2 to 3 volts, which reduced power consumption considerably. The heat problem was addressed with proper heat sinks and fan cooling. Out-of-order issue, speculative execution, branch prediction, register renaming, and prefetch buffers all increased complexity, and it was a challenge to build such processors without bugs. The floating-point division bug was one that Intel was not at all happy with, and it put Intel in serious trouble for some time.
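Why dropping the voltage helped follows from the standard CMOS dynamic-power relation, P ≈ C·V²·f: power scales with the square of the supply voltage. A short numeric sketch (the capacitance and frequency values below are arbitrary placeholders):

```python
def dynamic_power(c_eff, vdd, freq):
    """Approximate CMOS dynamic (switching) power: P = C * V^2 * f."""
    return c_eff * vdd ** 2 * freq

# Same chip, same clock: drop the supply from 5 V to 3 V.
p_5v = dynamic_power(1e-9, 5.0, 100e6)  # placeholder C and f
p_3v = dynamic_power(1e-9, 3.0, 100e6)
reduction = 1 - p_3v / p_5v
print(f"power reduced by {reduction:.0%}")  # 3^2 / 5^2 = 0.36, i.e. a 64% cut
```

Since C and f cancel in the ratio, the 64% saving depends only on the voltage drop, which is why lowering the supply was such an effective lever.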
The shrinking line widths of next-generation ICs required better and more sophisticated lithographic techniques to draw more, and finer, lines. At the time it was possible to reduce the intrinsic gate delay and hence increase the clock in the same proportion. Later, resistance-capacitance (RC) delay became prominent at high clock rates. Attempts were made to lower resistance by using copper instead of aluminium, and to reduce capacitance by using insulators with a lower dielectric constant. The remaining significant delay was due to interconnects. After this saturation, designers again looked toward the VLIW approach, making processors more complex and aggressively exploiting parallelism. Multiple operations in one very long instruction word brought extensive parallelism into action; the challenge was the need for fast buses to feed the multiple execution units.
2.1.9 Chip Multi Processing (CMP) and Multicore Era
The traditional way to increase processor speed was to raise the clock frequency and pack hardware more densely so that more work could be done in less time. The increased power dissipation due to rising clock rates imposed a physical limitation, and it became difficult to gain speed by this traditional approach alone. CMP came into existence when designers decided to add more transistors and fabricate multiple processor cores on the same chip to carry out parallel processing and share hardware; this architecture later became known as multicore processor technology. According to Flynn's classification it is a true MIMD architecture. At the lowest level, the execution unit itself can have a superscalar architecture. A hardware-controlled parallel usage of execution units
more benefited by similar parallel cores, but this will not be the experience in every case. Many applications lack inherent TLP, making parallelization impossible. In such cases a multicore processor is advantageous only if heterogeneous cores are used according to requirements. Parallel applications, divided into several similar operations, can run on homogeneous cores.
chip. Size reduction is also going on in parallel, which in turn increases device density. This has led to many problems, some of which have been tackled and some of which we are still struggling with.
The reduction in manufacturing process size: Fig. 2.1 shows the trend of manufacturing process size reduction from 180 nm to 22 nm, and research is under way to reach 14 nm by the end of 2014.
include register renaming, trace caches, reorder buffers, dynamic/software scheduling, and data value prediction. There have also been
advances in power- and temperature-aware architectures. There are two flavors of power-sensitive architectures: low-power and power-aware designs. Low-power architectures minimize power consumption while satisfying performance constraints, e.g., embedded systems where low power and real-time performance are vital. Power-aware architectures maximize performance parameters while satisfying power constraints. Temperature-aware design uses simulation to determine where hot spots lie on the chip and revises the architecture to decrease the number and effect of hot spots.
2.2.3 Why Switch to Multicore Technology?
It is well-recognized that computer processors have increased in speed and decreased
in cost at a tremendous rate for a very long time. This observation was first made
popular by Gordon Moore in 1965, and is commonly referred to as Moore's Law. Specifically, Moore's Law states that the advancement of electronic manufacturing technology makes it possible to double the number of transistors per unit area about every 12 to 18 months. It is this advancement that has fueled the phenomenal growth
in computer speed and accessibility over more than four decades. Smaller transistors
have made it possible to increase the number of transistors that can be applied to
processor functions and reduce the distance signals must travel, allowing processor
clock frequencies to soar. This simultaneously increases system performance and
reduces system cost. All of this is well understood. But lately Moore's Law has begun to show signs of failing. It is not actually Moore's Law that is showing weakness, but the performance increases people expect, which occur as a side effect of Moore's Law. One often associates performance with high processor clock
frequencies. In the past, reducing the size of transistors has meant reducing the
distances between the transistors and decreasing transistor switching times.
Together, these two effects have contributed significantly to faster processor clock
frequencies. Another reason processor clocks could increase is the number of
transistors available to implement processor functions. Most processor functions, for
example, integer addition, can be implemented in multiple ways. One method uses
very few transistors, but the path from start to finish is very long. Another method
shortens the longest path, but it uses many more transistors. Clock frequencies are
limited by the time it takes a clock signal to cross the longest path within any stage.
Longer paths require slower clocks. Having more transistors to work with allows
more sophisticated implementations that can be clocked more rapidly. But there is a
downside. As processor frequencies climb, the amount of waste heat produced by the processor climbs with it. The ability to cool the processor inexpensively has, within the last few years, become a major factor limiting how fast a processor can go.
This is offset, somewhat, by reducing the transistor size because smaller transistors
can operate on lower voltages, which allows the chip to produce less heat.
Unfortunately, transistors are now so small that the quantum behavior of electrons
can affect their operation. According to quantum mechanics, very small particles
such as electrons are able to spontaneously tunnel, at random, over short distances.
The transistor base and emitter are now close enough together that a measurable
number of electrons can tunnel from one to the other, causing a small amount of
leakage current to pass between them, which causes a small short in the transistor.
As transistors decrease in size, the leakage current increases. If the operating
voltages are too low, the difference between a logic one and a logic zero becomes
too close to the voltage due to quantum tunneling, and the processor will not operate.
In the end, this complicated set of problems allows the number of transistors per unit
area to increase, but the operating frequency must go down in order to be able to
keep the processor cool. This issue of cooling the processor places processor
designers in a dilemma. The approach toward making higher performance has
changed. The market has high expectations that each new generation of processors will be faster than the previous generation; if not, why should one buy it? But
quantum mechanics and thermal constraints may actually make successive
generations slower. On the other hand, later generations will also have more
transistors to work with and they will require less power. Speeding up processor
frequency had run its course in the earlier part of this decade, so computer architects
needed a new approach to improve performance. Adding an additional processing
core to the same chip would, in theory, result in twice the performance and dissipate
less heat; though in practice the actual speed of each core is slower than the fastest
single core processor. In September 2005 the IEE Review noted that power
consumption increases by 60% with every 400MHz rise in clock speed. So, what is
a designer to do? Manufacturing technology has now reached the point where there are enough transistors to place two processor cores, a dual-core processor, on a single chip. The tradeoff that must now be made is that each processor core is slower
than a single-core processor, but there are two cores, and together they may be able
to provide greater throughput even though the individual cores are slower. Each
following generation will likely increase the number of cores and decrease the clock
frequency. The slower clock speed has significant implications for processor performance, especially in the case of the AMD Opteron processor. The fastest dual-core Opteron processor will have higher throughput than the fastest single-core Opteron, at least for workloads that are processor-core limited, but each task may be completed more slowly. Such workloads do not spend much time waiting for data to come from memory or from disk, but find most of their data in registers or cache.
Since each core has its own cache, adding the second core doubles the available
cache, making it easier for the working set to fit. For dual core to be effective, the
work load must also have parallelism that can use both cores. When an application is
not multi threaded, or it is limited by memory performance or by external devices
such as disk drives, dual core may not offer much benefit, or it may even deliver less
performance. Opteron processors use a memory controller that is integrated into the
same chip and is clocked at the same frequency as the processor. Since dual core
processors use a slower clock, memory latency will be slower for dual core Opteron
processors than for single core, because commands take longer to pass through the
memory controller. Applications that perform a lot of random access read and write
operations to memory, applications that are latency bound, may see lower
performance using dual core. On the other hand, memory bandwidth increases in
some cases. Two cores can provide more sequential requests to the memory controller than can a single core, which allows the controller to interleave commands to memory more efficiently. Another factor that affects system
performance is the operating system. The memory architecture is more complex, and
an operating system not only has to be aware that the system is NUMA (that is, it has Non-Uniform Memory Access), but it must also be prepared to deal with the more
complex memory arrangement. It must be dual-core-aware. The performance implications of operating systems that are dual-core-aware will not be explored here, but we state without further justification that operating systems without such awareness show considerable variability when used with dual-core processors. Operating systems that are dual-core-aware show better performance, though there is still room for improvement.
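Two of the quantitative claims in this section, transistor counts doubling every 12 to 18 months and power rising 60% per 400 MHz of added clock, can be illustrated numerically. A short sketch (the 4004 starting count and the chosen spans are illustrative, not measurements):

```python
# Transistor doubling every 18 months (one reading of Moore's Law):
transistors = 2_300          # Intel 4004, 1971
for _ in range(20):          # 20 doublings = 30 years at 18 months each
    transistors *= 2
print(transistors)           # ~2.4 billion: the trend, not an exact product count

# The IEE Review figure: power grows 60% with every 400 MHz of clock.
power = 1.0
for _ in range(3):           # raising the clock by 3 x 400 MHz = 1.2 GHz
    power *= 1.60
print(f"{power:.2f}x")       # 1.6^3 ~ 4.10x the original power
```

The exponential in the second loop is the crux of the argument above: modest-looking clock increases compound into untenable power budgets, which is what pushed designers toward multiple cores instead.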
2.2.4 The study of Multicore Fundamentals
The following is not specific to any one multicore design, but is rather a basic overview of multicore architecture. Although manufacturers' designs differ from one another, multicore architectures share certain aspects. The basic configuration of a microprocessor is shown in Fig. 2.5 for the multicore concept and Fig. 2.6 for a distributed memory model.
[Fig. 2.5 and Fig. 2.6: multiple processors, each with a private cache, connected by a single bus to shared memory and I/O; and a distributed-memory model in which each processor-cache-memory node is connected to the others by a network.]
discussed below for Intel's Core 2 Duo, Advanced Micro Devices' Athlon 64 X2, the Sony-Toshiba-IBM CELL processor, and finally Tilera's TILE64.
request to main memory. In contrast, the Athlon follows a distributed memory model with discrete L2 caches. These L2 caches share a system request interface, eliminating the need for a bus. The system request interface also connects the cores with an on-chip memory controller and an interconnect called HyperTransport. HyperTransport effectively reduces the number of buses required in a system, reducing bottlenecks and increasing bandwidth. The Core 2 Duo instead uses a bus interface. The Core 2 Duo also has explicit thermal and power control units on-chip. There is no definitive performance advantage of a bus versus an interconnect, and the Core 2 Duo and Athlon 64 X2 achieve similar performance measures, each using a different communication protocol.
2.3.2 The CELL processor
A Sony-Toshiba-IBM partnership (STI) built the CELL processor for use in Sony's PlayStation 3; CELL is therefore highly customized for gaming and graphics rendering, which means superior processing power for gaming applications. The CELL is a heterogeneous multicore processor consisting of nine cores: one Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs), as seen in Fig. 2.8. With CELL's real-time broadband architecture, 128 concurrent transactions to memory per processor are possible. The PPE is an extension of the 64-bit PowerPC architecture and manages the operating system and control functions. Each SPE has a simplified instruction set using 128-bit SIMD instructions and has
256KB of local storage. Direct Memory Access is used to transfer data between local
storage and main memory which allows for the high number of concurrent memory
transactions. The PPE and SPEs are connected via the Element Interconnect Bus
providing internal communication. Other interesting features of the CELL are the
Power Management Unit and Thermal protection. The ability to measure and
account for power and temperature changes has a great advantage in that the
processor should never overheat or draw too much power.
power consumption. The TILE64 uses a 3-way VLIW (very long instruction word) pipeline to deliver 12 times the instructions of a single-issue, single-core processor. When VLIW is combined with MIMD (multiple instruction, multiple data) processing, multiple operating systems can be run simultaneously, and advanced multimedia applications such as video conferencing and video-on-demand can be run efficiently.
environment. Several factors can potentially affect the internal scalability of multiple cores, such as the system compiler as well as architectural considerations including memory, I/O, front-side bus (FSB), chipset, and so on. For instance, enterprises can buy a dual-processor server today to run Microsoft Exchange and provide e-mail, calendaring, and messaging functions. Dual-processor servers are designed to deliver excellent price/performance for messaging applications.
overhead from multiprocessing activities. As a result, the two processors do not scale linearly; that is, a dual-processor system does not achieve a 200 percent performance increase over a single-processor system, but instead provides approximately 180 percent of the performance that a single-processor system provides. Here, the single-core scalability factor is referred to as external, or socket-to-socket, scalability. Comparing two single-core 3.6 GHz processors in two individual sockets, the dual processors would deliver an effective performance level of approximately 6.48 GHz. For multicore processors, administrators must take into account not only socket-to-socket scalability but also internal, or core-to-core, scalability: the scalability between multiple cores that reside within the same processor module. In this example, core-to-core scalability is estimated at 70 percent, meaning that the second core delivers 70 percent of its processing power. Thus, in the example system using 2.8 GHz dual-core processors, each dual-core processor would behave more like a 4.76 GHz processor when the performance of the two cores (2.8 GHz plus 1.96 GHz) is combined. For demonstration purposes, this example assumes that, in a server combining two such dual-core processors within the same system architecture, the socket-to-socket scalability of the two dual-core processors would be similar to that in a server containing two single-core processors: 80 percent scalability. This would lead to an effective performance level of 8.57 GHz. Continuing the comparison by postulating that socket-to-socket scalability would be the same for these two architectures, a multicore architecture could enable greater performance than a single-core processor architecture, even if the processor cores in the multicore architecture run at a lower clock speed than the cores in the single-core architecture. In this way, a multicore architecture has the potential to deliver higher performance than
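The effective-performance arithmetic in the example above can be reproduced in a few lines. A minimal sketch, using the example's own simple additive model (the function name is ours; the 80 and 70 percent scalability figures come from the example):

```python
def with_second_unit(base_ghz, scalability):
    """Effective performance when a second identical unit contributes
    only `scalability` of its nominal performance (the example's model)."""
    return base_ghz * (1 + scalability)

single_socket_pair = with_second_unit(3.6, 0.80)   # two single-core sockets
dual_core_chip     = with_second_unit(2.8, 0.70)   # one dual-core processor
two_dual_sockets   = with_second_unit(dual_core_chip, 0.80)

print(round(single_socket_pair, 2))  # 6.48
print(round(dual_core_chip, 2))      # 4.76
print(round(two_dual_sockets, 2))    # 8.57
```

The comparison of 8.57 against 6.48 is the whole argument: four slower cores with imperfect scaling still beat two faster cores, which is why the lower per-core clock of multicore parts can be acceptable.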
limit the amount of power. By powering off unused cores and using clock gating, the amount of leakage in the chip is reduced. To lessen the heat generated by multiple cores on a single chip, the chip is architected so that the number of hot spots does not grow too large and the heat is spread out across the chip. The majority of the heat in the CELL processor is dissipated in the Power Processing Element, and the rest is spread across the Synergistic Processing Elements. The CELL processor follows a common trend of building temperature monitoring into the system, with one linear sensor and ten internal digital sensors.
2.5.2 Cache Coherence
Coherence is a concern in a multicore environment because of the distributed L1 and L2 caches. Since each core has its own cache, the copy of the data in that cache may not always be the most up-to-date version. For example, imagine a dual-core processor where each core brings a block of memory into its private cache. One core writes a value to a specific location; when the second core attempts to read that value from its cache, it won't have the updated copy unless its cache entry is invalidated and a cache miss occurs. This cache miss forces the second core's cache entry to be updated. If this coherence policy were not in place, garbage data would be read and invalid results produced, possibly crashing the program or the entire computer. In general there are two schemes for cache coherence: a snooping protocol and a directory-based protocol. The snooping protocol works only with a bus-based system, and uses a number of states to determine whether it needs to update cache entries and whether it has control over writing to the block. The directory-based protocol can be used on an arbitrary network and is therefore scalable to many processors or cores, in contrast to snooping, which is not scalable. In this scheme a directory holds information about which memory locations are being
shared in multiple caches and which are used exclusively by one core's cache. The directory knows when a block needs to be updated or invalidated. Intel's Core 2 Duo tries to speed up cache coherence by being able to query the second core's L1 cache and the shared L2 cache simultaneously. Having a shared L2 cache also has the added benefit that a coherence protocol does not need to be maintained for that level. AMD's Athlon 64 X2, however, has to monitor cache coherence in both the L1 and L2 caches. This is sped up using the HyperTransport connection, but still has more overhead than Intel's model.
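The invalidate-then-miss behavior described above can be modeled with a toy write-invalidate scheme. A minimal sketch, not a full MSI/MESI protocol (class and method names are illustrative): on a write, the writing core updates memory and invalidates the other core's copy, so the next read misses and refetches fresh data instead of returning a stale value.

```python
# Toy write-invalidate coherence: a write updates memory (write-through,
# for simplicity) and invalidates every peer's cached copy, forcing the
# next read on a peer to miss and refetch the current value.
memory = {0x10: 1}

class Cache:
    def __init__(self, mem):
        self.mem, self.lines = mem, {}
    def read(self, addr):
        if addr not in self.lines:          # cache miss: refetch from memory
            self.lines[addr] = self.mem[addr]
        return self.lines[addr]
    def write(self, addr, value, others):
        self.mem[addr] = value              # write-through to memory
        self.lines[addr] = value
        for other in others:                # "snoop": invalidate peer copies
            other.lines.pop(addr, None)

c0, c1 = Cache(memory), Cache(memory)
c1.read(0x10)             # core 1 caches the old value (1)
c0.write(0x10, 42, [c1])  # core 0 writes; core 1's copy is invalidated
assert c1.read(0x10) == 42  # miss + refetch yields the updated value
```

Without the invalidation step, `c1.read(0x10)` would still return 1, which is exactly the garbage-data hazard described above.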
2.5.3 Multithreading
The last, and most important, issue is using multithreading or other parallel-
processing techniques to extract the most performance from a multicore processor.
To obtain the full benefit, the software must expose thread-level parallelism (TLP).
Rebuilding applications to be multithreaded usually means a complete rework by
programmers. They must write applications whose subroutines can run on different
cores, which means data dependencies must be resolved or accounted for (e.g.
communication latency, or contention on a shared cache). Applications should also
be balanced: if one core is used much more heavily than another, the program is not
taking full advantage of the multicore system. Some companies have heard the call
and designed new products with multicore capabilities; Microsoft's and Apple's
newest operating systems, for example, can run on up to four cores.
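The decomposition described above can be sketched with a small example: a summation split into independent subranges, one per worker, with the pool sized to the number of available cores. This is illustrative only; note that in CPython the global interpreter lock limits true parallelism for pure-Python code, so CPU-bound work typically uses a process pool instead, but the partitioning pattern is the same.

```python
# Sketch of thread-level parallelism: split work into independent chunks.
# Each chunk has no shared mutable state, so there are no data dependencies
# between workers to resolve.
import os
from concurrent.futures import ThreadPoolExecutor

def partial_sum(lo, hi):
    return sum(range(lo, hi))

def parallel_sum(n, workers=None):
    workers = workers or os.cpu_count() or 1
    step = n // workers
    # Balanced chunks: the last worker absorbs the remainder.
    bounds = [(i * step, (i + 1) * step if i < workers - 1 else n)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda b: partial_sum(*b), bounds))

print(parallel_sum(1_000_000) == sum(range(1_000_000)))  # prints True
```

Sizing the chunks evenly is the "balance" point from the text: if one worker's range were much larger than the others, the remaining cores would sit idle while it finished.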
2.5.4 Crucial Design Issues
With numerous cores on a single chip there is an enormous need for increased
memory. 32-bit processors, such as the Pentium 4, can address at most 4 GB of main
memory. With cores now using 64-bit addresses, the amount of addressable memory
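The 4 GB figure follows from simple address arithmetic, sketched below. The note on physical address widths is an added observation: the AMD64 architecture defines up to 52 physical address bits, and shipping parts implement fewer, which is still vastly more than 4 GiB.

```python
# An n-bit address can name 2**n distinct bytes, which is where the
# 4 GB limit of 32-bit processors comes from.
def addressable_bytes(bits):
    return 2 ** bits

GiB = 2 ** 30
print(addressable_bytes(32) // GiB)   # prints 4 (i.e. 4 GiB)
# A full 64-bit address space would be 2**64 bytes (16 EiB); real x86-64
# parts implement fewer physical address bits, but far more than 32.
```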
to produce, since the same instruction set is used across all cores and each core
contains the same hardware. But is this the most efficient use of multicore
technology? Each core in a heterogeneous environment could have a specific
function and run its own specialized instruction set. Building on the CELL example,
a heterogeneous model could have a large centralized core built for generic
processing and running an OS, alongside a graphics core, a communications core, an
enhanced mathematics core, an audio core, a cryptographic core, and so on [33].
This model is more complex, but may offer efficiency, power, and thermal benefits
that outweigh its complexity. With major manufacturers on both sides of this issue,
the debate will stretch on for years to come; it will be interesting to see which side
comes out on top.
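The heterogeneous idea can be modeled in software as routing each task to the core type specialized for it, with everything else falling back to the large general-purpose core. The core names and task types below are hypothetical, loosely following the CELL-style model described above.

```python
# Toy model of heterogeneous dispatch: specialized work goes to a
# specialized core, everything else to the general-purpose core.
# All names here are illustrative, not a real scheduler API.
SPECIALIZED_CORES = {
    "graphics": "gfx_core",
    "audio": "audio_core",
    "crypto": "crypto_core",
    "math": "math_core",
}

def dispatch(task_type):
    # Unrecognized work falls back to the OS-running generic core.
    return SPECIALIZED_CORES.get(task_type, "general_core")

print(dispatch("crypto"))     # prints crypto_core
print(dispatch("database"))   # prints general_core
```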
2.5.5 Multicore Advantages
Although the most important advantage of the multicore architecture has already
been discussed, multicore processors offer several further advantages, described here.
Power and cooling advantages of multicore processors: Although the preceding
example illustrates the scalability potential of multicore processors, scalability is
only part of the challenge for IT organizations. High server density in the data center
can create significant power-consumption and cooling requirements. A multicore
architecture can help alleviate the environmental challenges created by high-clock-
speed, single-core processors. Heat is a function of several factors, two of which are
processor density and clock speed; other drivers include cache size and the size of
the core itself. In traditional architectures, the heat generated by each new generation
of processors has increased at a greater rate than clock speed. In contrast, by using a
shared cache (rather than separate dedicated caches for each processor core) and
lower clock speeds, multicore processors may help administrators minimize heat
while maintaining high overall performance. This capability may make future
multicore processors attractive for IT deployments in which density is a key factor,
such as high-performance computing (HPC) clusters, Web farms, and large clustered
applications. Environments in which blade servers are being deployed today could
be enhanced by potential power savings and potential heat reductions from multicore
processors. Currently, technologies such as demand-based switching (DBS) are
beginning to enter the mainstream, helping organizations reduce the utility power
and cooling costs of computing. DBS allows a processor to reduce power
consumption (by lowering frequency and voltage) during periods of low computing
demand. In addition to potential performance advances, multicore designs combined
with DBS technology hold great promise for reducing the power and cooling costs
of computing. DBS is available in single-core processors today, and its inclusion in
multicore processors may add capabilities for managing power consumption and,
ultimately, heat output. The potential utility-cost savings could help accelerate the
movement from proprietary platforms to energy-efficient industry-standard
platforms.
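The reason lowering frequency and voltage together pays off so well is that dynamic CMOS power scales roughly as P ∝ C·V²·f, so the savings compound. The numbers below are illustrative placeholders, not measurements of any real part.

```python
# Rough dynamic-power model for CMOS: P ~ C * V^2 * f.
# Values are illustrative only; the point is that DBS/DVFS lowers
# voltage and frequency together, so the savings multiply.
def dynamic_power(capacitance, voltage, frequency):
    return capacitance * voltage ** 2 * frequency

base = dynamic_power(1.0, 1.2, 3.0e9)     # full speed: 1.2 V at 3 GHz
scaled = dynamic_power(1.0, 1.0, 2.0e9)   # low demand: 1.0 V at 2 GHz
print(round(scaled / base, 2))            # prints 0.46
```

A one-third drop in frequency paired with a modest voltage reduction cuts dynamic power by more than half, which is why DBS targets both knobs rather than frequency alone.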
Significance of sockets in a multicore architecture:
As multicore processors become available, IT organizations will need to consider
system architectures for industry-standard servers from a different perspective. For
example, administrators currently segregate applications into single-processor, dual-
processor, and quad-processor classes. Multicore processors, however, call for a new
mind-set that considers processor cores as well as sockets. Single-threaded
applications that perform best today in a single-processor environment will likely
continue to be deployed on single-processor, single-core system architectures. For
single-threaded applications, which cannot make use of
would then have eight concurrent threads to distribute and manage workloads,
leading to potential performance increases in processor utilization and processing
efficiency.
[Comparison table; column headings and two row labels were lost in the source:
  Operating frequency : 7.8 GHz     | 4 GHz
  (unlabeled)         : 7.8 Gb/s    | 4 Gb/s
  Bandwidth           : 125 GBytes  | 1 Tera Bytes
  Power               : 429.78 W    | 107.39 W
  (unlabeled)         : 3840        | 9000 (estimated)
  (unlabeled)         : 2840        | 4500 (estimated)]
Processor       No. of Cores    Speed       Power Consumption
Dual Core       2               2.70 GHz    65 W
Core 2 Duo      2               2.93 GHz    65 W
Core i3 540     2               3.06 GHz    73 W
Core i5 660     2               3.33 GHz    87 W
Core i7 950     4               3.06 GHz    130 W
(not named)                     3.33 GHz    130 W
(not named)                     2.90 GHz    60 W
(not named)                     3.00 GHz    95 W
(not named)                     3.00 GHz    95 W
(not named)                     3.00 GHz    125 W
2.7 Conclusion
Shift in focus toward multicore technology: Before multicore processors, the
performance increase from generation to generation was easy to see: an increase in
frequency. This model broke down when high frequencies pushed power
consumption and heat dissipation to impractical levels. Adding multiple cores
within a processor made it possible to run at lower frequencies, but introduced
interesting new problems. Multicore processors are architected to cope with
increased power consumption, heat dissipation, and the demands of cache coherence
protocols. However, many issues remain unsolved. To use a multicore processor at
full capacity, the applications run on the system must be multithreaded, and there
are relatively few applications (and, more importantly, few programmers with the
know-how) written with any level of parallelism. The memory systems and
interconnection networks also need improvement. And finally, it is still