
Report: Intel Nehalem architecture - BeHardware

>> Processors
Written by Franck Delattre
Published on September 17, 2008
URL: http://www.behardware.com/art/lire/733/

Page 1: Introduction

After the big splash made by the Core 2 architecture, which easily dominated two generations of AMD processors and handed the leading position back to Intel, the manufacturer is now getting ready to put the Core i7 and Core i7 Extreme Edition, the first representatives of the Nehalem architecture, on the market.

In vino veritas

"Nehalem" is henceforth a name synonymous with the Core i7 architecture and its various incarnations. Nehalem is a river in Oregon, and a project manager who was also a connoisseur of the region's wine chose the name. Wine making is decidedly a rich source of inspiration, as Asus recently showed by giving the label "Pinot Noir" to one of its motherboards...

As is now our habit, we are going to try to understand Intel's goals in the design of this new architecture and see what the new range of processors is going to bring in terms of how we use our PCs. For once, we won't have to wait too long, as the first Core i7s are expected to be available at the end of the year.

In 2006, Intel left its mark on the world of PCs when it released a processor that combined high performance, low energy consumption and a relatively good fit for all platforms. From the moment it was put on the market, the Core 2 eclipsed AMD's Athlon 64, which had in turn fairly easily imposed itself over the Pentium 4 and its Netburst architecture. AMD more or less missed its chance at a successful challenger with a weak Phenom, and Intel made matters worse with the 45 nm Core 2 models, which raised the bar another notch.

Once Mobile, always Mobile

Core 2 clearly has mobile origins. This "miracle" processor, which firmly reinstated Intel as the leader, was not by any means a totally new creation: it is a direct descendant of the Mobile processors initiated under the Pentium M name (Banias and Dothan) and continued up to the Core Duo (Yonah). Keep in mind that these models are the fruit of Intel's research centres in Haifa, Israel. A fundamentally mobile architecture, the Core 2 was nevertheless able to adapt to the demands of other platforms. So as to get it released as quickly as possible and with the least amount of modification, however, Intel had to push several innovations back, and note that this wasn't without consequences. Performances weren't disappointing, but in terms of innovation the Core 2 didn't represent the same sort of technological leap as the Pentium 4 did when it was introduced in 2000 (even if that leap didn't guarantee the Pentium 4's success), and 64-bit performance fell slightly (some processor capabilities were not activated in this mode). Indeed, from its introduction the Core 2 suffered from a certain technological out-datedness: an aging processor bus (in comparison with the K8's HyperTransport bus), an external memory controller (integrated directly into the K8 since 2004!), as well as a dual-core architecture which, in contrast to that of the Opteron — the first native dual core on the market — only preceded the first AMD native quad cores by 12 months.

Of course, the Core 2's design does keep the main strengths and weaknesses of this mobile heritage: design specifications for a mobile processor are obviously different from those of a desktop or server processor, and although the Core 2 gives excellent results on all types of platforms (desktop PCs, servers and of course portable machines), as good as it was in the absolute it never dominated the server platform. In particular, for server use, a quad core "assembly" had difficulty with the old processor bus technology and with the single memory bus shared by several processors, in contrast to the Opteron Barcelona, which was designed especially for the job. This goes to show that no processor can be best at everything — and it gives an idea of what is at stake with Nehalem.

More Lego than processor

Unlike the Core 2, Nehalem is above all a flexible and adaptable architecture, the goal being to cover that which is lacking in the Core 2. Yet how is it possible to come up with an architecture that meets the demands of all platforms? The key is modularity: Nehalem is, even according to Intel's engineers, more Lego than processor. This means that many of the different improvements of Nehalem that we will talk about below will not equip all models. In fact, there will be so many versions of the architecture that the nomenclature (the commercial naming, that is) will be a real headache.

Nehalem also introduces the "uncore" notion to Intel's products, this term designating any part of the processor that isn't directly part of the instruction processing engine. Unlike the Core 2, which was based on a single clock distribution (the entire processor runs on a single clock value), Nehalem uses a complex clock distribution, going further than previous implementations: each core can thus run at its own frequency, and this goes as well for the entire "uncore" part of the processor.

Page 2: Quad and SMT

What's new? Nehalem is a "monolithic" quad core architecture, meaning that it doesn't result from the fusion of two dual cores.

Some new and some old

The Nehalem's cores benefit from SMT (Simultaneous Multi-Threading), a technology which appeared with the Pentium 4 under the name Hyper-Threading (the commercial name of SMT on Netburst) and which we also find on the first generations of Atom processors. SMT is a technique which aims to facilitate the handling of several threads by a single execution core. In the absence of SMT, a core successively processes pieces of the different threads it is in charge of at a given moment. Each time, the core must save the context in which the execution of a thread was carried out (the state of the registers and of the stack) and load the context of the new thread. The constant transition from one thread to another gives the illusion that they are being executed simultaneously, but in actuality a lot of time is devoted to these transitions. The concept of SMT is to offer the core the possibility of holding not one but two contexts at the same time, which enables it to process two threads, this time in a truly simultaneous manner. The core's resources (caches and execution units) are shared between the two threads either in a static manner (for example, a buffer is separated into two identical parts) or in a dynamic one (the threads access the resource competitively, depending on their specific needs).

So, will SMT be advantageous for the Nehalem? Nehalem inherits its execution engine from the Core 2, and it's only legitimate to wonder about the gain added by SMT on execution cores that are already reputed to be efficient. The answer is yes, because we will see in our study that many of the improvements added to the Nehalem's core were made with the optimal functioning of SMT on the new architecture in mind, precisely in order to maximize the performance gains.

As the operating system only handles threads, it interprets the presence of the two contexts as two distinct logical processors, in the same way as two cores. The Nehalem's four cores thus appear in the Windows task manager in the form of 8 logical processors.

Besides the time no longer lost in context transitions, SMT's performance gain comes from a better use of the core's execution units. In the first place, the flows of instructions from the two threads are independent of each other, which notably benefits the out-of-order (OOO) execution engine, one of whose functional constraints is the interdependency of instructions. The possibility of filling the execution pipeline in an optimal manner is thus greater, and in the end the efficiency of the execution core increases. In fact, the whole concept of SMT resides in increasing the efficiency of an execution core, and the possible gain is all the more significant if the starting efficiency of the architecture in question is low. Netburst, for example, has problems with its very long pipeline (20 and then 30 stages), which is therefore difficult to fill in an optimal manner; for this reason the architecture benefits from SMT, and the gain can attain 40% in certain applications on the Pentium 4 Prescott. The effect is even more advantageous on the Atom, whose in-order execution engine — unable to re-order instructions — sees its speed directly tied to the dependence of successive instructions on one another: according to Intel, SMT practically doubles its performance. This may be considered a somewhat "horizontal" optimization, compared to the "vertical" one sought with the added length of the Netburst pipeline. And so as not to waste a thing, the addition of SMT to a core is economical compared to the benefit it brings in terms of performance as soon as more than one thread is handled by the processor. SMT notably allows the Nehalem's 4-wide engine (in other words, an engine capable of processing up to four instructions simultaneously) to make full use of its width.

All of this sounds quite advantageous, but SMT technology isn't exempt from defects. One of them resides in the competing access of the two threads to the caches, in particular to L1. The Nehalem's L1 caches are fortunately sufficiently large to easily accommodate two threads, and they are at any rate better equipped than those of the Pentium 4 in this domain.
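Incidentally, it is easy to check how the operating system enumerates the logical processors mentioned above. The sketch below is a minimal example assuming a Linux system with the usual sysfs layout (the Windows task manager view is the graphical equivalent): sysconf() reports the number of online logical CPUs, and the sysfs file lists the SMT siblings of CPU 0.

/* smt_view.c - minimal sketch (Linux assumed): how many logical CPUs does
 * the OS see, and which ones are SMT siblings of CPU 0?
 * Build: gcc -O2 -o smt_view smt_view.c
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Number of logical processors currently online: on a four-core
     * Core i7 with SMT enabled this should report 8. */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Logical processors online: %ld\n", logical);

    /* SMT siblings of CPU 0, e.g. "0,4", depending on how the kernel
     * enumerates them. The sysfs path is the standard Linux one. */
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list", "r");
    if (f) {
        char buf[64];
        if (fgets(buf, sizeof buf, f))
            printf("SMT siblings of CPU 0: %s", buf);
        fclose(f);
    }
    return 0;
}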

SMT works miracles in server and database management environments, or at least this is what resulted from its use on the Xeon with Netburst architecture. The interest of SMT will of course be less on desktop PCs, notably in the framework of office or gaming use (and even more so on the mobile platform); whatever the case, the gain will necessarily vary depending on the application. For a certain time it was even planned that non-server versions of the Nehalem would not be equipped with SMT, but Intel has come back on this decision and the Bloomfield (the high end version for desktop PCs) will have it. Moreover, SMT remains optional, so it will always be possible to decide whether its presence is desired.

Page 3: Memory controller, TLB

An in-depth review of the memory hierarchy

The Nehalem's memory pipeline was the object of numerous evolutions compared to the Core 2. First off, two innovations immediately stand out: the memory controller is integrated into the processor, and there is now a third cache level.

IMC = Integrated Memory Controller

The Nehalem's integrated memory controller is capable of supporting up to three DDR3-1333 memory channels, thus offering a maximum bandwidth of 32 GB/s (compared to the 21 GB/s that the discrete memory controllers accompanying the Core 2 could provide). While the interest of the integrated memory controller is real for desktop PCs, it's especially server platforms that will benefit the most, notably in configurations with several sockets, for which the available memory bandwidth then increases proportionally to the number of processors present in the system. The advantage over a solution with a shared memory bus is enormous, and Intel thus hopes to recuperate some of the PC server market, on which the Opteron (K8 and K10) shines thanks to this capability of scaling the bandwidth. The integration of the memory controller also means a drastic reduction in memory access latency.

On the other hand, integrating the memory controller into the processor also means less manoeuvrability in the support of DRAM technologies. Will the Nehalem's modularity go as far as support for different types of memory controllers? This is planned, but we already know that server versions of the processor will no longer support FB-DIMM. Maybe Intel didn't find it wise to keep investing in this standard in favor of a more recent server memory technology? A lot of questions, and we will follow these developments closely. Otherwise, it's probable that versions of the processor reserved for ultra-mobile use will do without the integrated controller in favor of a reduced thermal envelope.

A reviewed TLB hierarchy

TLBs (Translation Lookaside Buffers) are buffers that store the translations of the virtual addresses used by programs into their physical address equivalents. We heard a lot about them recently due to the famous bug found in the Phenom. The Core 2's TLB structure gives high performance due to the presence, in addition to a classic TLB of 288 entries, of a very small and fast micro-TLB devoted solely to reads. Notably because of SMT, however, Intel had to review its copy, and the micro-TLB was abandoned on the Nehalem in favor of a classic TLB, which is more capable of holding the addresses of two threads. Nehalem thus keeps two TLB levels: two first level buffers for code and data (192 entries in total) and a unified second level buffer offering no less than 512 entries. These two TLB levels are capable of efficiently handling a much larger quantity of data than on the Core 2, which should enable feeding the eight threads that the Nehalem can process thanks to SMT.

In addition, the Nehalem's TLBs innovate with the presence of a virtual processor ID, which enables marking an entry as specific to processing within a virtual machine. While on the Core 2 the transition to a virtual host (for example, one created by VMware) means flushing the TLB, this trick enables the Nehalem to accelerate the transitions between local and virtual machines. Here the interest is mostly intended for professional use, in particular the management of large databases — and once again, it's the server market that is first and foremost the target of this new characteristic.
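As a side note, the 32 GB/s figure quoted above for the integrated controller is simply the sum of three 64-bit DDR3-1333 channels. The little sketch below redoes that arithmetic; the 1333 MT/s transfer rate and the triple-channel configuration are the assumptions taken from the text.

/* ddr3_bw.c - theoretical peak bandwidth of the Nehalem IMC (sketch).
 * Peak = channels x transfers per second x 8 bytes (64-bit channel width).
 */
#include <stdio.h>

int main(void)
{
    const double transfers_per_s = 1333.33e6; /* DDR3-1333: ~1333 MT/s        */
    const double bytes_per_transfer = 8.0;    /* 64-bit channel               */
    const int channels = 3;                   /* triple channel on Bloomfield */

    double per_channel = transfers_per_s * bytes_per_transfer / 1e9;
    printf("Per channel   : %.1f GB/s\n", per_channel);              /* ~10.7 GB/s */
    printf("Three channels: %.1f GB/s\n", per_channel * channels);   /* ~32 GB/s   */
    return 0;
}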

Page 4: Cache hierarchy

The new cache hierarchy

In a monolithic architecture, maintaining the coherence of the data manipulated by each of the cores is accomplished via a shared cache. The implementation of quadruple cores on the Core 2 relies on the processor bus to maintain this coherence, which isn't optimal for performance. The Core 2 thus integrates a large L2 cache shared between two cores, and with only two cores we already know that this works fine. On the other hand, things are much more complicated with four cores instead of two: a cache can only respond to requests from four cores soliciting it in an intensive manner, and without significant latency, if its technical characteristics are improved — but this implies complexity beyond that of a consumer processor. The economical solution thus consists of reducing the number of requests that reach the shared cache. To do this, Intel inserted a small cache of 256 KB between the L1 of each core and the shared cache. These four 256 KB caches do not take up too much space on the chip, and their small size is a guarantee of speed. Such a size does not translate into record hit rates, but this is not the goal: if each L2 offers a hit rate of only 50% (which is pessimistic), every other request will not reach the shared cache, and things happen as if the requests only came from two of the four cores. It's therefore almost natural to find a large cache shared between the four cores on the Nehalem (die shot source: Chip Architect).
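The filtering role of the private 256 KB L2 described above can be expressed in a couple of lines. In this sketch, the 50% hit rate is the deliberately pessimistic assumption used in the text, not a measured value.

/* l2_filter.c - how a private L2 reduces the request rate seen by the
 * shared L3 (sketch; the hit rate is an illustrative assumption).
 */
#include <stdio.h>

int main(void)
{
    const int cores = 4;
    const double l1_misses_per_core = 1.0; /* normalized request stream    */
    const double l2_hit_rate = 0.50;       /* pessimistic, as in the text  */

    double to_l3_per_core = l1_misses_per_core * (1.0 - l2_hit_rate);
    double total_to_l3 = to_l3_per_core * cores;

    /* With a 50% hit rate, four cores generate the L3 traffic of only two. */
    printf("Requests reaching L3: %.1f (equivalent to %.1f cores)\n",
           total_to_l3, total_to_l3 / l1_misses_per_core);
    return 0;
}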

Inclusive cache

The Nehalem's cache hierarchy necessarily reminds us of the Phenom's. However, the resemblance stops at the number of levels, because the caches do not function in the same way in the two architectures. This begins with another special characteristic of the Nehalem: its shared L3 cache has an inclusive relationship with all of the other cache levels, meaning that it contains a copy of the contents of L1 and L2. This distinguishes it from AMD's choice on the Phenom, whose L3 has a pseudo-exclusive relationship with the other cache levels (data cannot be found in two cache levels at the same time — although "pseudo" means that there are a few exceptions). An inclusive cache relationship generally translates into higher performance, but to the detriment of the total amount of useful cache (due to the redundancy of certain data in two successive levels). In a multi-core architecture, this inclusive relationship amplifies the defect: of the 8 MB of L3, more than 1 MB is occupied by copies of the L1 and L2 caches.

Why this choice? Because in the case of an L3 cache miss, we are sure that the data is not in the private caches of any of the cores (otherwise it would be in L3, due to the inclusive relationship), which enables avoiding any verification and immediately issuing a read request to memory. Things become more complicated in the case of an L3 hit, because verification is then required to see if the data is also present in one of the private caches, which means checking the caches of each core. This step, necessary for cache coherence, is called "cache snooping" and can be a significant source of latency. To overcome this problem, the Nehalem keeps, for each line of L3 cache, a flag that indicates in which core's (or cores') private cache the data is to be found. The storage of these flags adds a little weight to the structure of the L3, but besides the appreciable gain in time, the mechanism has the advantage of disturbing the private L1 and L2 caches less. In the end, it's a compromise.

The first latency tests showed an average of 40 processor cycles for the L3 cache of current Nehalem models (4 cycles for L1 and roughly 10 cycles for L2). Such a value can be partly explained by the fact that the L3 cache functions at a different frequency (and voltage) from the rest of the processor, like the whole "uncore" part of the Nehalem: on the 2.93 GHz model, the L3 runs at 2.66 GHz, which slightly distorts latency measurements expressed in processor cycles.

L1

The first level (L1) caches of each of the Nehalem's cores have the same size characteristics as on the Core 2: 32 KB for data and 32 KB for instructions. The data L1's latency, however, goes to 4 cycles (versus 3 with the Core 2): doing away with the micro-TLB, which we mentioned above, unfortunately translates into a slight increase in L1 access time. For the L1 devoted to instructions, Intel chose to favor latency to the detriment of associativity: by reducing the associativity of the L1 instruction cache from 8 to 4 ways, it keeps a latency of 3 cycles like on the Core 2, despite the absence of a micro-TLB. Indeed, managing cache ways takes time, and all the more so the greater the number of ways. Why treat the two L1 caches differently? Because an instruction cache is more sensitive to latency than a data cache: each access to the instruction cache is directly affected by higher latency, in particular the accesses carried out by the branch prediction mechanisms, while access latency on the data side can be (at least partially) compensated for by the work of the OOO engine, which reorganizes instructions in order to mask latency (and the Nehalem's has been considerably improved). The instruction L1 thus has less to lose by reducing its associativity than by increasing its latency. Finally, the Nehalem's L1 is capable of handling more parallel cache misses than the Core 2's. This increase proves particularly interesting for SMT, as two threads generate more cache misses than a single one, and it also exploits the gain in bandwidth offered by the integrated memory controller: a cache miss signifies a memory access, and the average time between two memory requests diminishes.
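Latency figures such as the 4, 10 and roughly 40 cycles quoted above are typically obtained with a pointer-chasing loop, in which every load depends on the previous one so that the out-of-order engine cannot hide the access time. The following sketch only illustrates the principle: it reports nanoseconds per dependent load for a working set arbitrarily set at 4 MB here, and converting to core cycles, pinning the thread and fully defeating the prefetchers would take more care than shown.

/* ptr_chase.c - naive pointer-chasing latency probe (sketch).
 * Average time per iteration approximates the access latency of whichever
 * cache level WORKING_SET fits in. Build: gcc -O2 -o ptr_chase ptr_chase.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define WORKING_SET (4 * 1024 * 1024)            /* 4 MB: beyond L1 and L2  */
#define ENTRIES (WORKING_SET / sizeof(size_t))
#define ITERS 50000000L

int main(void)
{
    size_t *ring = malloc(ENTRIES * sizeof *ring);
    size_t *order = malloc(ENTRIES * sizeof *order);
    if (!ring || !order) return 1;

    /* Random cyclic permutation so the hardware prefetchers get little help. */
    for (size_t i = 0; i < ENTRIES; i++) order[i] = i;
    srand(42);
    for (size_t i = ENTRIES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < ENTRIES; i++)
        ring[order[i]] = order[(i + 1) % ENTRIES];
    free(order);

    struct timespec t0, t1;
    size_t idx = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        idx = ring[idx];                  /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (final index %zu)\n", ns / ITERS, idx);
    free(ring);
    return 0;
}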
The separate frequencies and voltages add more flexibility to the processor's design and notably avoid having to align the processor's overall frequency on other, slower elements. In addition, as we will see later, this enables better control of the overall thermal dissipation of the socket. Still in terms of flexibility, the size of the Nehalem's L3 is easily adaptable to the capabilities targeted by each processor version and to each evolution in manufacturing: the transition to 32 nm engraving will probably be accompanied by an L3 cache of 12 MB, as was the case for the Core 2.

Page 5: QPI Bus, the core

A new processor bus

One of the defects of the Core 2 resides in its use of a processor bus of older design. While mobile and desktop platforms have no problem with it, this isn't the case for servers, where the old FSB is a bottleneck in the interconnection between sockets. In this area, the Opteron and its HyperTransport bus have been without serious competition up until now. Nehalem therefore abandons the FSB for a more modern interconnection bus called QPI (QuickPath Interconnect). This new point-to-point, bidirectional bus shares numerous characteristics with the HyperTransport bus, and the principle is fundamentally similar: just like its rival, the QPI links (in blue on the diagram) play the role of interconnection between the processors and also between each processor and an IOH (an input/output hub which, for example, serves as the interface with the PCI-Express bus). QPI offers great flexibility in its implementation, and systems will be able to integrate as many QPI links as the bandwidth requires: each processor is capable of handling up to four of them. On a monoprocessor machine, as in this example, a single QPI link between the processor and the IOH (in this case an X58) is of course sufficient.

The QPI bus is announced with transfer rates of 4.8 to 6.4 GT/s (giga-transfers per second) and a bus width that can attain 20 bits per direction. This gives a maximum raw speed of 6.4 x 20 / 8 = 16 GB/s per direction, or 32 GB/s for a bidirectional link. Counting only the 16 bits of each transfer that actually carry data, the first implementations of QPI on Nehalem provide a useful bandwidth of 25.6 GB/s per link, or double that offered by a classic FSB at 1600 MHz.
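The QPI arithmetic above fits in a few lines. Note that the split between the 20 physical bits per direction and the 16 bits per transfer that carry data is our assumption for reconciling the raw 16/32 GB/s figures with the 25.6 GB/s quoted for the first implementations.

/* qpi_bw.c - QPI link bandwidth arithmetic (sketch). */
#include <stdio.h>

int main(void)
{
    const double gt_per_s = 6.4e9;   /* 6.4 GT/s signalling rate             */
    const double link_bits = 20.0;   /* physical width per direction         */
    const double data_bits = 16.0;   /* bits per transfer carrying payload   */

    double raw_one_way = gt_per_s * link_bits / 8.0 / 1e9;       /* 16 GB/s   */
    double data_both   = gt_per_s * data_bits / 8.0 / 1e9 * 2.0; /* 25.6 GB/s */

    printf("Raw, one direction : %.1f GB/s (x2 = %.1f GB/s)\n",
           raw_one_way, raw_one_way * 2.0);
    printf("Payload, both ways : %.1f GB/s\n", data_both);
    return 0;
}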

Improvements to the core

Compared to the Core 2, many of the Nehalem's improvements were motivated by the support of SMT and, more generally, by the new memory hierarchy (the three cache levels and the increase in available memory bandwidth). This also holds for the processing cores: all along the pipeline, the stages were more or less slightly improved compared to what was found on the Core 2.

Branching prediction

Let's start with branching prediction which, as we saw in our look at the Core 2, is one of the mechanisms with the most significant influence on the performance of the processing pipeline. Its goal is to avoid breaks in the flow of code, as these slow traffic in the pipeline and thus lower its speed. Nehalem inherits the mechanisms already found on the Core 2: a loop detector and the management of direct and indirect branches. In addition, the new architecture integrates a second BTB (Branch Target Buffer), an address buffer whose role is to store a history of the destination addresses that were effectively taken. While the first BTB is devoted to "local" addresses, the second is meant for addresses further away, such as those found in heavier applications (yes, like the management of databases). On top of this, Intel has added a new mechanism that relies on the storing of return addresses (and not of destination addresses, like the BTB), called the RSB (Return Stack Buffer). Note that each thread has its own RSB in order to avoid any conflict in the management of this buffer when SMT is activated.

Page 6: The core (continued)

Loop detector

The Core 2 also introduced a mechanism that optimizes the decoding of loops, called the Loop Stream Detector, or LSD. The principle is based on the detection of loops in the code flow: the code in question is then placed in a dedicated buffer and put on a special path that avoids certain redundant phases of processing. It is useless, for example, to resort to branch prediction on every iteration of the loop. Nehalem uses the same concept, except that loop detection is carried out after decoding into micro-operations, with the goal of also saving the decoding phase of the loop at each iteration. The buffer therefore doesn't store x86 instructions but micro-operations that have already been decoded. It's interesting to note the similarity in concept with the trace cache of the Pentium 4, which functioned as a code cache containing instructions that had already been decoded. This is just another example of how technology gets recycled.

Fusion

The instruction decoding step was also reviewed. You may recall that this phase consists of transforming x86 instructions into the elementary micro-operations that are comprehensible to the rest of the processing pipeline. Nehalem keeps the four decoders already found on the Core 2 but improves certain mechanisms brought in by its predecessor. Macro-fusion was one of the innovations of the Core 2 architecture; it consists of detecting pairs of predefined x86 instructions, such as "compare + jump" (CMP + JCC), and transforming them into a single micro-operation. The technique enables both an increase in decoding capacity and a reduction in the number of micro-operations generated — and all the more so the more frequently these instruction pairs appear. Nehalem adds new instruction pairs capable of "macro-fusing" and, especially, enables macro-fusion in 64-bit mode (which is unfortunately not the case with the Core 2, which thus does not benefit from its potential when it runs under a 64-bit operating system).
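To make these front-end optimizations a little more concrete, here is the kind of loop they target: a tight counted loop ends with a compare-and-branch pair of the type macro-fusion merges, and its body is small enough to be replayed from a buffer of already-decoded micro-operations instead of being re-decoded at every iteration. Whether a given pair actually fuses depends on the exact instructions the compiler emits, so treat this as an illustration rather than a guarantee.

/* loop_example.c - the sort of hot loop that macro-fusion and the Loop
 * Stream Detector are designed to speed up (illustrative sketch).
 * Compile with "gcc -O2 -S loop_example.c" and look at the assembly:
 * the loop typically ends with a cmp/jne (or dec/jne) pair.
 */
#include <stddef.h>

long sum_array(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)   /* i < n  ->  CMP + Jcc at the bottom */
        sum += a[i];
    return sum;
}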

OOO engine

The Nehalem's out-of-order (OOO) execution engine underwent several modifications, mostly intended to support SMT. Thus, the size of the buffer for re-ordering instructions (ROB: Re-order Buffer) was increased to 128 entries (96 on the Core 2) and is shared in two equal parts between the two threads. The micro-instructions of the two threads are then sent to a buffer called the Reservation Station, which is responsible for dispatching them to the calculation units. This buffer now contains 36 entries (32 on the Core 2) and uses a dynamic sharing policy between the two threads: if one of the two threads is waiting for a memory operand, the other can benefit from more entries in the RS.

The following step consists of the actual execution of the instructions by the calculation units. This step is not directly affected by SMT, whose influence at this level is only in terms of the rate at which instructions arrive. The Nehalem's execution units are therefore identical to those of the Core 2 in every way.
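The value of a large out-of-order window is easiest to see by contrasting a single dependency chain, where each operation must wait for the previous one, with several independent accumulators that the scheduler can keep in flight at the same time. The sketch below is only illustrative; the exact speed-up depends on the core and the compiler options.

/* ilp_demo.c - dependent chain vs. independent accumulators (sketch).
 * The OOO engine can overlap the four independent chains in sum4(),
 * while sum1() is serialized by its single dependency chain.
 */
#include <stddef.h>

double sum1(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                 /* every add waits for the previous one */
    return s;
}

double sum4(const double *a, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* four independent dependency chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)             /* tail elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}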

SSE4.2

Nehalem introduces a new instruction set, or more precisely a complement to the SSE4.1 instructions of the 45 nm Core 2. Intel has made some effort to communicate on these new instructions and now presents them in a more concrete form for users. SSE4.2 is thus broken down into STTNI (String and Text New Instructions), whose purpose is to accelerate the processing of character strings, and ATA (Application Targeted Accelerators), which groups together instructions specialized in the calculation of checksums (used, for example, in compression algorithms) and others aimed at data searches. Note that version 10 of the Intel C++ compiler already supports these new instructions; a small usage sketch is given a little further down.

Page 7: TDP and Turbo Mode

A higher TDP

The first Core i7 "Bloomfields" that will soon be available have an announced TDP of 130 W for clock frequencies between 2.66 and 3.20 GHz. This is almost 35% more than the Core 2 Quad "Yorkfield" (45 nm) and its TDP of 95 W at up to 3 GHz. Is the Nehalem architecture economical in energy use? Not really — or rather, not in all models. However, we should keep in mind that the processor now integrates a memory controller whose dissipation is included in these 130 W, which isn't the case for a machine based on the Core 2, where the memory controller sits in the Northbridge: those watts were simply consumed elsewhere. The Core i7's caches are also a significant source of thermal dissipation. If we only consider dissipation related to leakage, just the presence of the L2 caches (added, you may recall, for the sole purpose of relieving the L3 cache, without increasing the total amount of cache) increases the cache sub-system's dissipation by almost 13% (4 x 256 KB compared to 8 MB). On the other hand, the Core i7 concentrates this dissipation on a smaller surface and close to the cores, which are the hottest part of the chip. The variation in the number of clock and voltage domains luckily enables better control of the processor's overall thermal dissipation — or at least more efficient control than on the Core 2 (which, moreover, does not really give off excessive heat, even in its quad core configuration) — and this great flexibility in design will be the key to low power Core i7 models (or whatever their name will be).
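Coming back to SSE4.2 for a moment, the CRC32 instruction of the ATA group is exposed through compiler intrinsics. The sketch below assumes GCC or a compatible compiler with -msse4.2 and, of course, a processor that supports SSE4.2; note that the instruction implements CRC-32C (the Castagnoli polynomial), not the CRC-32 used by zip.

/* crc32_sse42.c - using the SSE4.2 CRC32 instruction via intrinsics (sketch).
 * Build: gcc -O2 -msse4.2 -o crc32_sse42 crc32_sse42.c
 */
#include <stdio.h>
#include <string.h>
#include <nmmintrin.h>   /* SSE4.2 intrinsics */

unsigned int crc32c(const unsigned char *data, size_t len)
{
    unsigned int crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = _mm_crc32_u8(crc, data[i]);   /* one byte per instruction here */
    return crc ^ 0xFFFFFFFFu;
}

int main(void)
{
    const char *msg = "Nehalem";
    printf("CRC-32C(\"%s\") = 0x%08X\n", msg,
           crc32c((const unsigned char *)msg, strlen(msg)));
    return 0;
}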

Turbo mode and overclocking

The Turbo mode of the Core i7 is not an architectural characteristic but a functionality that Intel has already implemented on certain versions of the Core 2 Mobile under the name IDA (Intel Dynamic Acceleration). This mechanism consists of accelerating, in a dynamic and temporary way, the clock speed of one or several cores when the others are not called upon. The concept is based on the fact that many applications consist of one or two threads and therefore do not use all of the multi-thread processing potential of a multi-core processor. When the case arises, Turbo mode comes into play (under control of the operating system) and increases the multiplier coefficient of the core or cores in question.

Nehalem also represents an evolutionary step in the control of internal voltage and clock frequency. Up until now, this control was handled by the operating system, and the processor offered the possibility of external control of the multiplier coefficient (FID: Frequency Identifier) and of the voltage (VID: Voltage Identifier), which forms the foundation of EIST (Enhanced Intel SpeedStep Technology).

With the Core i7, these parameters are no longer external: the processor alone can change them, in order to keep control over its own thermal dissipation. Indeed, the processor is capable of estimating the power it consumes at each moment (voltage x intensity of the current drawn) and, of course, of controlling it with the help of the frequency and voltage parameters. Software (the BIOS and operating system) no longer controls the FID/VID combination as it did with previous generations, but rather "Power States" (or P-states). These power levels are defined based on the overall TDP of the processor at its maximum frequency outside of Turbo mode. As an example, the Core i7 has an announced TDP of 130 Watts at 2.93 GHz, or 22 x 133 MHz. The "uncore" part of the processor (IMC and L3 cache) is not subject to P-states, so 20 of the 130 Watts are constant. For each intermediate multiplier coefficient, the estimated TDP equals:

TDP[coeff] = (coeff / max_coeff)³ x TDP_core + TDP_uncore

(the dissipation of the cores varying roughly with the cube of the frequency, since the voltage is lowered along with it). We therefore obtain, for example, at 14 x 133 = 1.86 GHz: TDP[14] = (14/22)³ x 110 + 20 ≈ 48 Watts. The multiplier coefficient varies between 12x and 22x, which gives around ten P-states between 37 and 130 Watts. A Core i7 model will therefore no longer be characterized by its maximum FID and VID but by its maximum TDP. Turbo mode operates within the framework of this internal control of the processor's overall TDP: the absence of activity on one or several cores results in a lowering of the overall TDP and thereby offers Turbo mode the opportunity to accelerate the cores that are called upon.

With this new protection mechanism via control of the TDP, there is the obvious question of overclocking, as the processor's maximum TDP is quickly surpassed. A priori, Intel will finally not set any limitations here and it will be possible to go beyond this value, but of course Turbo mode will then not be in effect. Only the Core i7 "Extreme Edition" models will enable modifying the TDP ceiling: should the need arise, it will thus be possible to raise the maximum multiplier coefficient as well as the TDP, though this will not mean that the processor functions under these parameters the entire time. Note that these modifiable parameters on the Core i7 XE will only concern Turbo mode.
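As a small sketch of the TDP estimate described above, the following program sweeps the multiplier from 12x to 22x; the 110 W core share, 20 W "uncore" share and cubic law are taken from the article's example rather than from any official Intel documentation.

/* pstate_tdp.c - estimated TDP per P-state for a 130 W Core i7 (sketch).
 * TDP[coeff] = (coeff / max_coeff)^3 * TDP_core + TDP_uncore
 * Build: gcc -O2 -o pstate_tdp pstate_tdp.c -lm
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double tdp_core = 110.0;  /* W, core share at the top multiplier     */
    const double tdp_uncore = 20.0; /* W, IMC + L3, not subject to P-states    */
    const double max_coeff = 22.0;  /* 22 x 133 MHz = 2.93 GHz                 */

    for (int coeff = 12; coeff <= 22; coeff++) {
        double tdp = pow(coeff / max_coeff, 3.0) * tdp_core + tdp_uncore;
        printf("%2dx (%4.0f MHz): ~%3.0f W\n", coeff, coeff * 133.33, tdp);
    }
    return 0;   /* e.g. 14x -> ~48 W, matching the worked example above */
}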

Page 8: The different versions

Different versions of Nehalem architecture

Nehalem's flexibility enables the existence of several versions of the processor, each fitting the demands of its platform as closely as possible. It's possible to create a profile for each of these variations.

The desktop platform

The desktop PC market is big enough to accommodate several Nehalem models. At first we will see the high end versions (the Bloomfield), which will soon be launched. Three models are already planned:
- Core i7 920: 2.66 GHz, four cores, SMT, an 8 MB L3 cache, integrated DDR3-1066 memory controller and socket LGA1366, for an estimated price of around 260 euros.
- Core i7 940: characteristics identical to the 920 but clocked at 2.93 GHz, for a price of roughly 500 euros.
- Core i7 Extreme Edition 965: the XE version, clocked at 3.2 GHz, for a price of around 1,000 euros.

Evidently the supplementary MHz come at an added price! Note that these models will finally have DDR3-1066 support and not 1333; Intel seems to have reserved the higher memory frequency for server models in order to make the differentiation a little bigger. Most motherboard manufacturers will release LGA1366 models at the same time as these first Core i7 versions become available. Based on Intel's X58 "Tylersburg" chipset and combined with an Intel ICH10, these motherboards will have six memory slots and at least two PCI-Express 16x slots.

We will have to wait until the third quarter of 2009 for a more affordable model, the Lynnfield. Destined for the new Socket LGA1160, it will have DDR3 memory support on "only" two channels and will be equipped with a PCI-Express controller, which will enable a simplified chipset design (the Ibex Peak will be composed of a single chip). Next will come the Havendale in early 2010, a dual core that will directly integrate the graphics component which up until now was found in the chipset. We should keep in mind that the gains of these dual core versions over current Core 2 Duos may be lower than those of the quad core Core i7s over the Core 2 Quads. Indeed, while the Nehalem's cache hierarchy offers optimal performance in a quad core configuration, it is less well adapted to a dual core — certainly less so than that of the Core 2, which was designed to function exclusively with two cores. Of course, the other improvements of the Nehalem should largely compensate for this.

The server platform

Server versions of Nehalem, known under the codenames "Gainestown" (DP) and "Beckton" (MP), will be the best equipped. Planned for release at the same time as the Bloomfield, the Gainestown will share the same characteristics but have a supplementary QPI bus destined for the inter-CPU connection. The Beckton will not arrive before 2009 and will be equipped with three or four QPI links.

The mobile platform

The desktop Nehalems Lynnfield and Havendale will have their mobile equivalents, planned for release over the same period, in the third quarter of 2009 and early 2010. The Clarksfield will be quad cores and the Auburndale will be a dual core with an integrated GPU. Intel also mentions a more aggressive Turbo mode on mobile processors, with frequency increases of more than 50%.

Page 9: Conclusion

In designing the Nehalem, Intel largely based it on the Core 2 while adding modifications in many areas. From simple "touch ups" to in-depth changes, nothing really went unchanged. Taking a closer look at the Nehalem reveals Intel's desire to free its new architecture from the constraints related to its mobile lineage and align it more with server use. Thus, Intel has strived to renew its bond with server machines, a domain in which AMD is still very present, all the while keeping a certain modularity enabling it to offer the architecture in all sectors. So will you and I, as individual users, benefit from the transition to the new architecture? Certainly, because applications, even current ones, tend to process a growing volume of data, and what is better suited to processing such large volumes than a processor designed for server use?

With Nehalem, Intel redefines the concept of a multi-platform processor, and this time without any compromise. After having lost some ground in the home PC sector, AMD may now lose some more market share to its giant rival. While AMD has put up a fight up until now in the professional world, the Texan manufacturer may have some problems with Intel's latest creation.