Department of Computer Science & Engineering

We feel greatly honoured to present this paper at Techno-Expert, conducted at the College of Engineering, Bandera. We especially thank the IEEE for organizing such a national-level symposium. This paper presentation competition has helped us gain knowledge and has given us a deep insight into the field of computer science, and it has encouraged us to take an interest in all recent developments in the field.

Title: Advanced Research in Processors

Topic: Core 2 Duo Processor Technology
Name of authors:

Sampada V. Bawane
Phone: 9423100360
Postal Address: Jijaoo Girls Hostel, Government College of Engineering, Amravati.

Dipali R. Chawre
Phone: 9766575230

Index
1. Abstract
2. Introduction
3. Core Details
4. Development
5. Advantages
6. Disadvantages
7. Multi-Chip Module
8. Features
9. The 64-bit Advantage
10. Conclusion
11. References


Abstract:-

The Core 2 brand refers to a range of Intel's consumer 64-bit dual-core and 2x2 MCM quad-core CPUs with the x86-64 instruction set, based on the Intel Core microarchitecture, which was derived from Intel's 32-bit dual-core mobile processors. The 2x2 MCM dual-die quad-core CPU had two separate dual-core dies (CPUs), next to each other, in one quad-core MCM package. The Core 2 relegated the Pentium brand to the mid-range market and reunified the laptop and desktop CPU lines. The Core microarchitecture returned to lower clock speeds and improved the processors' usage of both available clock cycles and power compared with the preceding NetBurst microarchitecture of the Pentium 4/D-branded CPUs. The Core microarchitecture provides more efficient decoding stages, execution units, caches, and buses, reducing the power consumption of Core 2-branded CPUs while increasing their processing capacity. Intel's CPUs have varied widely in power consumption according to clock speed, architecture, and semiconductor process, as shown in the CPU power dissipation tables. The Core 2 brand was introduced on July 27, 2006, comprising the Solo (single-core), Duo (dual-core), Quad (quad-core), and Extreme (dual- or quad-core CPUs for enthusiasts) branches. During 2007, Intel Core 2 processors with vPro technology (designed for businesses) included the dual-core and quad-core branches.


Introduction:-

Diagram of a generic dual-core processor, with CPU-local Level 1 caches and a shared, on-die Level 2 cache.

A multi-core CPU (or chip-level multiprocessor, CMP) combines two or more independent cores into a single package composed of a single integrated circuit (IC), called a die, or more dies packaged together. A dual-core processor contains two cores, and a quad-core processor contains four cores. A multi-core microprocessor implements multiprocessing in a single physical package. A processor with all cores on a single die is called a monolithic processor. Cores in a multi-core device may share a single coherent cache at the highest on-device cache level or may have separate caches. The processors also share the same interconnect to the rest of the system. Each "core" independently implements optimizations such as superscalar execution, pipelining, and multithreading. A system with n cores is most effective when it is presented with n or more threads concurrently. The most commercially significant multi-core processors are those used in personal computers and game consoles. In this context, "multi" typically means a relatively small number of cores. However, the technology is widely used in other areas, especially embedded processors such as network processors and digital signal processors.

The amount of performance gained by the use of a multi-core processor depends on the problem being solved and the algorithms used, as well as on their implementation in software. For so-called "embarrassingly parallel" problems, a dual-core processor with two cores at 2 GHz may perform very nearly as fast as a single core at 4 GHz; other problems may not yield as much speedup. This all assumes, however, that the software has been designed to take advantage of the available parallelism. If it has not, there will not be any speedup at all, though the processor will still multitask better, since it can run two programs at once.
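As an illustrative sketch (not from this paper), an "embarrassingly parallel" job is one that splits into fully independent chunks. Because the chunks share no state, a multi-core CPU can run them on separate cores; the short Python sketch below shows the decomposition itself using a thread pool (the function name and chunking scheme are our own invention):

```python
# Hypothetical sketch of an "embarrassingly parallel" decomposition:
# each chunk is an independent unit of work that touches no data owned
# by other chunks, so the chunks can run on separate cores.
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(chunk):
    # Independent unit of work: no shared state with other chunks.
    return sum(n * n for n in chunk)

def parallel_sum_of_squares(data, workers=2):
    # One chunk per worker; partial results are combined at the end.
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))
```

The result is identical to the serial computation; only the scheduling of the independent chunks changes, which is exactly why such problems scale well across cores.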


Core Details:-

Core is a pipelined architecture, in which instructions move through a number of internal stages between entering and leaving the processor. As an instruction exits a stage, another can enter, minimising the idle time of each internal component. Core has around fourteen stages in its pipeline; as with most modern architectures, there are a number of complications, such as early completion and out-of-order execution, which make it hard to define exactly how many stages there are.
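The throughput gain from pipelining can be sketched with a toy timing model (an idealised illustration, not Intel's actual design): if a new instruction can enter the pipeline every cycle, the last of N instructions leaves after S + N - 1 cycles rather than the S * N cycles an unpipelined design would need.

```python
# Toy timing model of an idealised S-stage pipeline.
def pipelined_cycles(stages, instructions):
    # One instruction enters per cycle; the last one drains through
    # the remaining stages after all have been issued.
    return stages + instructions - 1

def unpipelined_cycles(stages, instructions):
    # Without pipelining, each instruction occupies the whole machine.
    return stages * instructions
```

With Core's roughly fourteen stages, 1000 instructions take about 1013 cycles pipelined versus 14000 unpipelined, which is why keeping the pipeline full matters so much.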

Intel's Core micro-architecture.

The front end of the machine fetches instructions and does preliminary analysis and reconstruction work on them. Core is a four-wide machine, with portions of five or six wide, meaning it can execute at least four instructions at once. That's wider than any previous x86 architecture. Internally, Core has its own microcode, and the first stage in dealing with x86 instructions is translating them to micro-ops in that microcode while working out which instructions can be safely combined into single operations -- 'macrofusion'. As with all chip designers, Intel spends a lot of time analysing software, looking for common combinations of instructions -- for example, a mathematical comparison followed by a switch to a different section of code depending on the result of that comparison. By fusing those two x86 operations into a single micro-op, the chip can complete them much faster. Core also does 'microfusion', where it does something similar but for those occasions when a single x86 instruction translates into multiple micro-ops. Where possible, the processor binds two of those micro-ops together and treats them as one; again, this can reduce the number of processing steps by around ten percent in some cases. Once we've got streams of micro-ops rattling through the pipelines, considerable performance gains can be achieved by spotting those instructions that'll take some time to complete and starting them as early as possible.
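The macrofusion idea above can be sketched as a toy micro-op counter (our own illustration, not Intel's decoder logic): a compare followed by a conditional jump is emitted as one micro-op instead of two.

```python
# Toy model of macrofusion: count micro-ops for a short x86-like
# instruction stream, fusing each cmp + conditional-jump pair into one.
def micro_op_count(instructions):
    count, i = 0, 0
    while i < len(instructions):
        op = instructions[i].split()[0]
        nxt = instructions[i + 1].split()[0] if i + 1 < len(instructions) else ""
        if op == "cmp" and nxt.startswith("j"):
            count += 1   # cmp + jcc fuse into a single micro-op
            i += 2
        else:
            count += 1   # every other instruction maps to one micro-op here
            i += 1
    return count
```

In this toy model the three-instruction sequence cmp / jne / add produces only two micro-ops, which is the kind of saving the fusion hardware is after.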

Typically, such long-latency instructions involve reads or writes to memory: if you know that ten steps down the pipeline you'll need to load some information, it's best to send the request out to the relatively slow memory system as early as possible. Unfortunately, instructions already in the pipeline may change the data at the memory location that you've preloaded, making your version out of date by the time it comes into play. Core copes with this by using prediction hardware that allows a read from memory to happen even if there's a write already in progress, provided the predictor thinks that the write is unlikely to cause a problem. Checking afterwards catches the times when this prediction is wrong, triggering a relatively slow process of recovering the right information; on balance, however, the gains from guessing right outweigh the losses from guessing wrong.
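The "gains outweigh the losses" argument is just an expected-value calculation. A back-of-the-envelope model with made-up cycle counts (illustrative numbers, not Intel's figures) shows why speculating pays off when the predictor is usually right:

```python
# Expected cost of a speculative load: the fast path always runs, and a
# recovery penalty is paid only on the fraction of mispredictions.
def average_speculative_cost(hit_rate, base_cost, recovery_penalty):
    return base_cost + (1.0 - hit_rate) * recovery_penalty
```

With a 90%-accurate predictor, a 4-cycle speculative load, and a 20-cycle recovery, the average cost is about 6 cycles, well under the cost of conservatively waiting for every earlier store to resolve.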

By the time instructions reach the end of the pipeline, they will have been operated on in any order that the chip deems most efficient. It has a single unified scheduler that decides what happens when, and that controls every execution unit on the chip.


Development:-

While manufacturing technology continues to improve, reducing the size of individual gates, the physical limits of semiconductor-based microelectronics have become a major design concern. Some effects of these physical limitations can cause significant heat dissipation and data synchronization problems. The demand for more capable microprocessors causes CPU designers to use various methods of increasing performance. Some instruction-level parallelism (ILP) methods, like superscalar pipelining, are suitable for many applications but are inefficient for others that tend to contain difficult-to-predict code. Many applications are better suited to thread-level parallelism (TLP) methods, and using multiple independent CPUs is one common way to increase a system's overall TLP. The combination of increased available space due to refined manufacturing processes and the demand for increased TLP is the logic behind the creation of multi-core CPUs.


Advantages:-

1. The proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip. Assuming that the die can fit into the package, multi-core CPU designs require much less printed circuit board (PCB) space than multi-chip SMP designs.

2. A dual-core processor uses slightly less power than two coupled single-core processors, principally because less power is required to drive signals external to the chip, and because the smaller silicon process geometry allows the cores to operate at lower voltages; the cores' physical proximity also reduces signal latency.

3. The cores share some circuitry, such as the L2 cache and the interface to the front side bus (FSB). In terms of competing technologies for the available silicon die area, a multi-core design can reuse proven CPU core library designs and produce a product with lower risk of design error than devising a new, wider core design. Also, adding more cache suffers from diminishing returns.


Disadvantages:-

1. In addition to operating system (OS) support, adjustments to existing software are required to maximize utilization of the computing resources provided by multi-core processors. The ability of multi-core processors to increase application performance also depends on the use of multiple threads within applications.

2. Integration of a multi-core chip drives production yields down, and multi-core chips are more difficult to manage thermally than lower-density single-chip designs. Intel has partially countered the yield problem by creating its quad-core designs from two dual-core dies packaged together, so that any two working dual-core dies can be used, as opposed to producing four cores on a single die and requiring all four to work.

3. From an architectural point of view, single-CPU designs may ultimately make better use of the silicon surface area than multiprocessing cores, so a development commitment to this architecture may carry the risk of obsolescence.

4. Finally, raw processing power is not the only constraint on system performance. Two processing cores sharing the same system bus and memory bandwidth limit the real-world performance advantage.

5. If a single core is close to being memory-bandwidth limited, going to dual-core might only give a 30% to 70% improvement. If memory bandwidth is not a problem, a 90% improvement can be expected. It would even be possible for an application that used two CPUs to end up running faster on one dual-core chip if communication between the CPUs was the limiting factor.
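The 30%-90% range quoted above can be reasoned about with Amdahl's law, a standard model (not from this paper) relating speedup to the fraction of the work that can actually run in parallel:

```python
# Amdahl's law: speedup on n cores when a fraction p of the work is
# parallelisable and the remaining (1 - p) stays serial.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)
```

On two cores, work that is 90% parallel yields roughly a 1.82x speedup (about an 82% improvement), while work that is only 50% parallel yields roughly 1.33x (about 33%), which brackets the improvement range the list above cites.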

Multi-Chip Module (MCM):-

It is a specialized electronic package where multiple integrated circuits (ICs), semiconductor dies or other modules are packaged in such a way as to facilitate their use as a single IC. The MCM itself will often be referred to as a "chip" in designs, thus illustrating its integrated nature.

Multi-Chip Modules come in a variety of forms depending on the complexity and development philosophies of their designers. These can range from using pre-packaged ICs on a small printed circuit board (PCB) meant to mimic the package footprint of an existing chip package, to fully custom chip packages integrating many chip dies on a High Density Interconnection (HDI) substrate. Multi-Chip Module packaging is an important facet of modern electronic miniaturization and micro-electronic systems. MCMs are classified according to the technology used to create the HDI substrate:

• MCM-L - laminated MCM. The substrate is a multi-layer laminated printed circuit board (PCB).
• MCM-D - deposited MCM. The modules are deposited on the base substrate using thin-film technology.
• MCM-C - ceramic-substrate MCM, such as LTCC.


POWER5 MCM with four processors

Features:-

• Dual-core processing - Two independent processor cores in one physical package run at the same frequency, and share up to 6 MB of L2 cache as well as up to a 1333 MHz Front Side Bus, for truly parallel computing.

• Intel® Wide Dynamic Execution - Improves execution speed and efficiency, delivering more instructions per clock cycle. Each core can complete up to four full instructions simultaneously.

• Intel® Smart Memory Access - Optimizes the use of the data bandwidth from the memory subsystem to accelerate out-of-order execution. A newly designed prediction mechanism reduces the time in-flight instructions have to wait for data, and new pre-fetch algorithms move data from system memory into the fast L2 cache in advance of execution. These functions keep the pipeline full, improving instruction throughput and performance.

• Intel® Advanced Smart Cache - The shared L2 cache is dynamically allocated to each processor core based on workload. This efficient, dual-core-optimized implementation increases the probability that each core can access data from the fast L2 cache.

• Intel® 64 architecture - Enables the processor to access larger amounts of memory. With appropriate 64-bit supporting hardware and software, platforms based on an Intel processor supporting Intel 64 architecture can allow the use of extended virtual and physical memory.

• Execute Disable Bit - Allows memory to be marked as executable or non-executable, so that the processor can raise an error to the operating system if malicious code attempts to run in non-executable memory, thereby preventing the code from infecting the system. Provides enhanced virus protection when deployed with a supported operating system.

The 64-bit advantage:-

The Core 2 Duo, like the Pentium D before it, is based on Intel's Extended Memory 64 Technology (EM64T), also called AMD64 by AMD and known generically as x86-64. Basically, it is the old x86 architecture (called "general purpose instructions" and commonly referred to as the IA-32 instruction set architecture (ISA)) extended with:

• Increased number of general purpose registers
• 64-bit addressing
• 128-bit (SSE, SSE2, SSE3) media instructions
• Improved physical and virtual memory management

The EM64T ISA includes twice as many general purpose registers as the old x86 design, and all of them are twice as wide (64 bits); the instruction pointers also increase from 32 to 64 bits. Having more and wider general purpose registers means that memory can be used much more efficiently and memory traffic can be minimized, which in turn allows compilers to produce programs that run much faster on your machine.

64-bit addressing means that the physical memory limitation rises to 1 TB (1024 GB) from the 32-bit limit of 4 GB, and the processor can also work with longer instructions. To really notice this advantage, you have to stress the system to a degree that most desktop users don't with current software, but as desktop applications demand more from processing hardware, this advantage will become much more important.

128-bit media instructions refer specifically to Intel's SSE, SSE2, and SSE3 (Streaming SIMD -- Single Instruction, Multiple Data -- Extensions) technologies. These instructions are very useful for working with large blocks of data, which benefits anyone who deals with a lot of scientific data, high-performance media, or anything that uses floating-point math.

EM64T deals with both physical and virtual memory in a much more sensible manner than x86, treating the entire virtual memory space as one unsegmented block and eliminating a lot of translation layers from the process of addressing physical memory. Previously, x86 would segment virtual memory into small blocks for use by different programs and functions, but this ended up being inefficient and rarely used by software. EM64T eliminates that inefficiency by letting the software choose how it will handle virtual memory.
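The addressing claims above come down to simple powers of two. A quick sketch (the 40-bit figure is a common EM64T physical-addressing implementation limit, an assumption on our part rather than a number from this paper):

```python
# How many bytes a given number of address bits can reach.
def addressable_bytes(bits):
    return 2 ** bits

GIB = 2 ** 30  # one gibibyte

# 32-bit addressing reaches 4 GiB -- the old x86 limit.
assert addressable_bytes(32) // GIB == 4
# 40 bits of physical addressing reach 1024 GiB, i.e. 1 TiB.
assert addressable_bytes(40) // GIB == 1024
```

This is the arithmetic behind the jump from the 4 GB limit to the terabyte range quoted above.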

Conclusion :-

We conclude that combining equivalent CPUs on a single die significantly improves the performance of cache snoop operations. Put simply, signals between different CPUs travel shorter distances and therefore degrade less. These higher-quality signals allow more data to be sent in a given time period, since individual signals can be shorter and do not need to be repeated as often. The enthusiast-oriented Extreme edition has been described as "a frequency limited processor with additional support for ratio overrides higher than the maximum Intel-tested bus-to-core ratio."

THANK YOU