
Nehalem Architecture

You can look at the Nehalem microprocessor as a chip that has two main sections: a core and
then the surrounding components called the un-core. The core of the microprocessor contains
the following elements:
• The processors, which do the actual number crunching. This can include anything from simple
mathematical operations like adding and subtracting to much more complex functions.
• A section devoted to out-of-order scheduling and retirement logic. In other words, this part lets the
microprocessor tackle instructions in whichever order is fastest, making it more efficient.
• Cache memory takes up about one-third of the microprocessor's core. The cache allows the
microprocessor to store information temporarily on the chip itself, decreasing the need to pull information from
other parts of the computer. There are two sections of cache memory in the core.
• A branch prediction section on the core allows the microprocessor to anticipate functions based on
previous input. By predicting functions, the microprocessor can work more efficiently. If it turns out the
predictions are wrong, the chip discards the work it did in advance and switches to the correct function.
• The rest of the core orders functions, decodes information and organizes data.
The un-core section has an additional 8 megabytes of memory contained in the L3 cache. The L3 cache sits outside the
core because the Nehalem microprocessor is scalable and modular. That means Intel can build chips that have multiple
cores. The cores all share the same L3 memory cache. That means multiple cores can work from the same information at the
same time.

Why create scalable microprocessors? It's an elegant solution to a tricky problem -- building more processing power without
having to reinvent the processor itself. In a way, it's like connecting several batteries in a series. Intel plans on building
Nehalem microprocessors in dual, quad and eight-core configurations. Dual-core processors are good for small devices like
smartphones. You're more likely to find a quad-core processor in a desktop or laptop computer. Intel designed the eight-core
processors for machines like servers -- computers that handle heavy workloads.
Intel says that it will offer Nehalem microprocessors that incorporate a graphics processing unit (GPU) in the un-core. The
GPU will function much the same way as a dedicated graphics card.

Next, we'll look at the way the Nehalem transmits information.

Nehalem and QuickPath

Courtesy Intel
Intel built the Core i7 chip series using the Nehalem microarchitecture.
According to Intel, the Nehalem microarchitecture uses a system the company calls QuickPath. QuickPath encompasses the
connections between the processors, memory and other components.
In older Intel microprocessors, commands come in through an input/output (I/O) controller to a centralized memory
controller. The memory controller contacts a processor, which may request data. The memory controller retrieves this data
from memory storage and sends it to the processor. The processor makes computations based upon that data and sends the
results back through the memory controller to the I/O controller. As microprocessors become more complex with multiple
processors on a single chip, this model becomes less efficient.
Using the old microarchitecture, Intel's chips had a memory bandwidth of up to 21 gigabytes per second. QuickPath
connectivity improves the memory bandwidth, allowing more information to pass each second.

Processors using the new technology decentralize communication between processors and memory. That means that instead
of a centralized memory controller, each processor has its own memory controller, dedicated memory and cache memory.
The processors communicate directly with the I/O controller. Commands come from the I/O controllers to the processors.
Because each processor has a dedicated memory controller, memory and cache, information flows more freely. Each
processor can communicate with its dedicated memory at a speed of 32 gigabytes per second.
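
As a rough sanity check on a figure like this, peak memory bandwidth is simply the number of memory channels multiplied by the transfer rate and the bytes moved per transfer. The sketch below assumes three 64-bit channels running at 1,333 million transfers per second. The article doesn't state those parameters; they're an illustrative configuration that happens to land near 32 gigabytes per second, and the function name is hypothetical.

```python
def peak_bandwidth_gb_per_s(channels, mega_transfers_per_s, bus_width_bits):
    """Theoretical peak bandwidth in gigabytes per second (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    bytes_per_second = channels * mega_transfers_per_s * 1_000_000 * bytes_per_transfer
    return bytes_per_second / 1e9


# Assumed configuration (not given in the article): 3 channels, 1333 MT/s, 64 bits wide.
memory = peak_bandwidth_gb_per_s(channels=3, mega_transfers_per_s=1333, bus_width_bits=64)
print(f"peak memory bandwidth: {memory:.1f} GB/s")
```

This is a theoretical ceiling. Real programs see less, because memory can't deliver a transfer on every single cycle.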
Nehalem-based processors also have point-to-point interconnections between each other. That means if one processor needs
to access data within another processor's cache, it can send a request directly to the respective processor and get a
response. Within each interconnection are distinct data pathways. Data can flow in both directions at the same time, speeding
up data transfers. Transfer speeds between the multiple processors and the I/O controller can be up to 25.6 gigabytes per second.
QuickPath allows processors to take shortcuts when they ask other processors for information. Imagine a quad-core
microprocessor with processors A, B, C and D. There are links between each processor. In older architectures, if processor A
needed information from D, it would send a request. D would then send a request to processors B and C to make sure D had
the most recent instance of that data. B and C would send the results to D, which would then be able to send information back
to A. Each round of messages is called a hop -- this example had four hops.
QuickPath skips one of these steps. Processor A would send its initial request -- called a "snoop" -- to B, C and D, with D
designated as the respondent. Processors B and C would send data to D. D would then send the result to A. This method
skips one round of messages, so there are only three hops. It seems like a small improvement, but over billions of calculations
it makes a big difference.

In addition, if one of the other processors had the information A requests, it can send the data directly to A. That reduces the
hops to 2. QuickPath also packs information in more compact payloads.
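
One way to keep the hop counts straight is to write each exchange down as a list of message rounds and count them. The sketch below does that for the three scenarios described above. The variable and function names are made up for illustration, and the third scenario assumes processor B happens to be the one holding the data.

```python
# Each protocol is a list of rounds ("hops"); each round lists (sender, receiver) messages.
OLDER_PROTOCOL = [
    [("A", "D")],              # A asks D for the data
    [("D", "B"), ("D", "C")],  # D checks with B and C
    [("B", "D"), ("C", "D")],  # B and C report back to D
    [("D", "A")],              # D returns the data to A
]

QUICKPATH_SNOOP = [
    [("A", "B"), ("A", "C"), ("A", "D")],  # A snoops everyone at once
    [("B", "D"), ("C", "D")],              # B and C report to D, the respondent
    [("D", "A")],                          # D returns the result to A
]

QUICKPATH_DIRECT = [
    [("A", "B"), ("A", "C"), ("A", "D")],  # A snoops everyone at once
    [("B", "A")],                          # assumed: B holds the data and replies directly
]


def hop_count(protocol):
    """Number of sequential message rounds, which is what determines latency."""
    return len(protocol)


def message_count(protocol):
    """Total number of individual messages sent across all rounds."""
    return sum(len(round_) for round_ in protocol)


for name, protocol in [("older", OLDER_PROTOCOL),
                       ("QuickPath", QUICKPATH_SNOOP),
                       ("QuickPath, direct reply", QUICKPATH_DIRECT)]:
    print(f"{name}: {hop_count(protocol)} hops, {message_count(protocol)} messages")
```

Notice that the three-hop QuickPath exchange sends the same number of messages as the older scheme. The savings come from sending more of them at the same time, not from sending fewer.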
Nehalem Branches and Loops

Courtesy Intel
The Core i7 chip with the heatspreader removed.
In a microprocessor, everything runs on clock cycles. A clock cycle is the basic tick of time in which the chip does its work, and
clock speed is the number of cycles the chip completes each second. Instructions take one or more cycles to execute, so the faster
the clock speed, the more instructions the microprocessor will be able to handle per second.
One way microprocessors like the Core i7 try to increase efficiency is to predict future instructions based on old instructions.
It's called branch prediction. When branch prediction works, the microprocessor completes instructions more efficiently. But
if a prediction turns out to be inaccurate, the microprocessor has to compensate. This can mean wasted clock cycles, which
translates into slower performance.
Nehalem has two branch target buffers (BTB). These buffers load instructions for the processors in anticipation of what the
processors will need next. Assuming the prediction is correct, the processor doesn't need to call up information from
the computer's memory. Nehalem's two buffers allow it to load more instructions, decreasing the lag time in the event one set
turns out to be incorrect.
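
To get a feel for how a chip can guess based on past behavior, here's a minimal sketch of a classic textbook scheme called a two-bit saturating counter. This is not Nehalem's actual predictor, whose internals Intel doesn't fully describe, and the class and function names are hypothetical. It only illustrates the general idea: remember what a branch did recently and bet that it will do the same thing again.

```python
class TwoBitPredictor:
    """Textbook 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.state = 2  # start at "weakly taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes):
    """Run one branch's outcome history through the predictor and count wrong guesses."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch that is taken nine times, then falls through once, repeated three times.
history = ([True] * 9 + [False]) * 3
print(count_mispredictions(history), "mispredictions out of", len(history))
```

With this history the predictor is only wrong when each loop exits. Every one of those wrong guesses stands for the wasted clock cycles described above.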
Another efficiency improvement involves software loops. A loop is a string of instructions that the software repeats as it
executes. It may come in regular intervals or intermittently. With loops, branch prediction becomes unnecessary -- one
instance of a particular loop should execute the same way as every other. Intel designed Nehalem chips to recognize loops
and handle them differently than other instructions.
Microprocessors without loop stream detection tend to have a hardware pipeline that begins with branch predictors, then
moves to hardware designed to retrieve -- or fetch -- instructions, decode the instructions and execute them. Loop stream
detection can identify repeated instructions, bypassing some of this process.

Intel used loop stream detection in its Penryn microprocessors. Penryn's loop stream detection hardware sits between the
fetch and decode components of older microprocessors. When the Penryn chip's detector discovers a loop, the
microprocessor can shut down the branch prediction and fetch components. This makes the pipeline shorter. But Nehalem
goes a step further. Nehalem's loop stream detector is at the end of the pipeline. When it sees a loop, the microprocessor can
shut down everything except the loop stream detector, which sends out the appropriate instructions to a buffer.
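
The payoff of keeping a loop's instructions in a buffer can be sketched with a toy model. The code below is a simplification under stated assumptions, not a description of Intel's hardware. It treats the buffer as a small first-in, first-out store of already-decoded instructions, and the 28-entry capacity is an illustrative value rather than a figure from this article. Any instruction found in the buffer skips the fetch and decode stages.

```python
from collections import deque


def simulate_front_end(trace, buffer_size=28):
    """Count instructions needing fetch and decode versus those replayed from a buffer.

    trace is a sequence of instruction addresses in execution order.
    buffer_size is a hypothetical capacity chosen for illustration.
    """
    buffer = deque(maxlen=buffer_size)  # most recently decoded instructions
    decoded = replayed = 0
    for address in trace:
        if address in buffer:
            replayed += 1           # already decoded: stream it straight from the buffer
        else:
            decoded += 1            # first sighting: fetch, decode, then remember it
            buffer.append(address)  # a full deque silently drops its oldest entry
    return decoded, replayed


small_loop = list(range(4)) * 100  # a 4-instruction loop run 100 times
large_loop = list(range(40)) * 3   # a 40-instruction loop, too big for the buffer
print(simulate_front_end(small_loop))
print(simulate_front_end(large_loop))
```

In this model the small loop is decoded once and then replayed on every later pass. The large loop never fits, so every instruction goes through the full pipeline each time around.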

The improvements to branch prediction and loop stream detection are all part of Intel's "tock" strategy. The transistors in
Nehalem chips are the same size as Penryn's, but Nehalem's design makes more efficient use of the hardware.
Next, we'll take a look at how Nehalem microprocessors handle data streams.

Nehalem and Multithreading

Courtesy Intel
The back side of the Core i7 chip with Nehalem microarchitecture.
As software applications become more sophisticated, sending instructions to processors becomes complicated. One way to
simplify the process is through threading. Threading starts on the software side of the equation. Programmers build
applications with instructions that processors can split into multiple streams or threads. Processors can work on individual
threads of instructions, teaming up to complete a task. In the world of microprocessors, this is called parallelism because
multiple processors work on parallel threads of data at the same time.
Nehalem's architecture allows each processor to handle two threads simultaneously. That means an eight-core Nehalem
microprocessor can process 16 threads at the same time. This gives the Nehalem microprocessor the ability to process
complex instructions more efficiently. According to Intel, the multithreading capability is more efficient than adding more
processing cores to a microprocessor. Nehalem microprocessors should be able to meet the demands of sophisticated
software like video editing programs or high-end video games.
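
Here's a minimal sketch of what threading looks like from the software side: one job split into chunks that separate threads can work on. It uses Python's standard thread pool, and the function names are made up for this example. One caveat is that the standard Python interpreter doesn't run pure-Python arithmetic on several processors at once, so the sketch shows how the work is divided rather than demonstrating a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

CORES = 8
THREADS_PER_CORE = 2
HARDWARE_THREADS = CORES * THREADS_PER_CORE  # 16, as in the eight-core example above


def sum_of_squares(chunk):
    return sum(n * n for n in chunk)


def parallel_sum_of_squares(numbers, workers=HARDWARE_THREADS):
    # Give each worker every workers-th element so the chunks are roughly equal in size.
    chunks = [numbers[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is summed on its own thread; the partial results are then combined.
        return sum(pool.map(sum_of_squares, chunks))


numbers = list(range(1000))
print(parallel_sum_of_squares(numbers) == sum_of_squares(numbers))
```

The answer is the same whether the work runs on one thread or 16. That independence between chunks is what makes a task easy to split in the first place.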

Another benefit to multithreading is that the processor can handle multiple applications at the same time. This lets you work on
complex programs while running other applications like virus scanners in the background. With older processors, these
activities could cause a computer to slow down or even crash.

Rock Around the Overclock

Nehalem's turbo boost feature is similar to an old hacking trick
called overclocking. To overclock a microprocessor is to increase
its processing frequency beyond the normal parameters of the
chip. Some gamers overclock the processors on their machines to
get better performance when playing sophisticated video games.
But overclocking isn't always a good idea -- it can cause chips to
Intel has incorporated an additional technology the company calls turbo boost within Nehalem's architecture. If the processor
is running below its limits on power consumption, processing capacity and temperature levels, it can increase its clock
frequency. This makes the active processors work faster. With older applications that have a single thread, the chip can
increase clock speeds even more.
The turbo boost feature is dynamic -- it makes the Nehalem microprocessor work harder as the workload increases, provided
the chip is within its operating parameters. As workload decreases, the microprocessor can work at its normal clock
frequency. Because the microchip has a monitoring system, you don't have to worry about the chip overheating or working
beyond its capacity. And when you aren't placing heavy demands on your processor, the chip conserves power.
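
The decision a feature like this has to make can be sketched as a simple control step: check the limits, then nudge the clock up or down. Everything specific in the code below is an assumption made for illustration, including the 133 MHz reference clock, the multiplier range, the power and temperature limits and the function name. Intel's real control logic is more involved.

```python
BASE_CLOCK_MHZ = 133  # assumed reference clock; core clock = reference clock * multiplier


def next_multiplier(current, base, max_turbo, busy, power_w, temp_c,
                    power_limit_w=130.0, temp_limit_c=100.0):
    """Pick the clock multiplier for the next control interval (all limits hypothetical)."""
    if not busy:
        return base                         # light workload: return to the normal frequency
    if power_w < power_limit_w and temp_c < temp_limit_c:
        return min(current + 1, max_turbo)  # headroom available: step up one notch
    return max(current - 1, base)           # at or over a limit: step back toward base


multiplier = 20
for power_w, temp_c in [(90, 60), (110, 75), (135, 90), (100, 70)]:
    multiplier = next_multiplier(multiplier, base=20, max_turbo=22, busy=True,
                                 power_w=power_w, temp_c=temp_c)
    print(f"power {power_w} W, temperature {temp_c} C -> {multiplier * BASE_CLOCK_MHZ} MHz")
```

The chip speeds up only while it has headroom and backs off as soon as a limit is reached. That's the difference between this kind of built-in boost and manual overclocking.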

If Nehalem is Intel's latest "tock," what will be the next "tick"? And what comes after that? Find out in the next section.

Intel's Tick Tock

Courtesy Intel
Intel Executive Vice President Sean Maloney demonstrates the power of the Nehalem microarchitecture using a
touchscreen interface at a press conference.
Developing a microprocessor takes years. While Intel unveiled Nehalem in 2008, the project was more than five years old at
the time. That means even as people wait for an announced microchip to make its way into various electronic devices
and computers, manufacturers like Intel are working on the next step in microprocessor evolution. They have to, if they want
to keep up with Moore's Law.
The next step for Intel is another "tick" development. That means reducing transistors down to 32 nanometers wide. Producing
one microprocessor with transistors that size is an amazing achievement. But what's even more daunting is finding a way to
mass produce millions of chips with transistors that small in an efficient, reliable and cost-effective way.

The codename for the next Intel chip is Westmere. Westmere will use the same microarchitecture as Nehalem but will have
the 32-nanometer transistors. That means Westmere will be more powerful than Nehalem. But that doesn't mean Westmere's
architecture will make the most sense for a microprocessor with transistors that small. That will fall to the next "tock."
And the tock already has a name: Sandy Bridge. The Sandy Bridge microchip will have an architecture optimized for 32-
nanometer transistors. It may take a couple of years before we see Sandy Bridge rolled out into the commercial market, but
when it does it will likely be just as revolutionary as Nehalem is today.

Where will Intel go after that? It's hard to say. While transistors have shrunk down to sizes practically unimaginable a decade
ago, we're getting close to hitting some fundamental laws of physics that could put a halt to rapid development. That's
because as you work with smaller materials, you begin to enter the realm of quantum mechanics. The world of quantum
mechanics can seem strange to someone only familiar with classical physics. Particles and energy behave in ways that seem
counterintuitive from a classical perspective.
One of those behaviors is particularly problematic when it comes to microprocessors: electron tunneling. Normally,
transistors can funnel electrons without much risk of leakage. But as barriers get thinner, the possibility for electron tunneling
becomes more likely. When an electron encounters a very thin barrier -- something on the order of a single nanometer in
width -- it can pass from one side of the barrier to the other even if the electron's energy levels seem too low for that to
happen normally. Scientists call the phenomenon tunneling even though the electron doesn't make a physical hole in the barrier.
This is a big problem for microprocessors. Microprocessors work by channeling electrons through transistor switches.
Microprocessors with transistors on the nanoscale already have to deal with some levels of electron leakage. Leakage makes
microprocessors less efficient. Without a dramatic change to the way Intel designs transistors, there's a danger that Moore's
Law will finally become moot.
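
To see why thin barriers are such a worry, it helps to put rough numbers on tunneling. A standard textbook approximation for a simple rectangular barrier says the chance of an electron getting through falls off exponentially with the barrier's width: T is roughly exp(-2 * kappa * d), where kappa depends on how far the electron's energy falls short of the barrier height. The sketch below uses that approximation and assumes a shortfall of 1 electron-volt. That figure and the function name are illustrative choices, not values from this article, and real transistor barriers are more complicated.

```python
import math

HBAR = 1.054571817e-34         # reduced Planck constant, in joule-seconds
ELECTRON_MASS = 9.1093837e-31  # in kilograms
EV = 1.602176634e-19           # joules per electron-volt


def tunneling_probability(barrier_width_nm, energy_deficit_ev=1.0):
    """Approximate transmission T ~ exp(-2 * kappa * d) through a rectangular barrier.

    energy_deficit_ev is how far the electron's energy falls below the barrier height
    (an assumed value; the approximation holds when the result is much smaller than 1).
    """
    kappa = math.sqrt(2 * ELECTRON_MASS * energy_deficit_ev * EV) / HBAR  # per meter
    return math.exp(-2 * kappa * barrier_width_nm * 1e-9)


for width_nm in (3.0, 2.0, 1.0):
    print(f"{width_nm} nm barrier: T ~ {tunneling_probability(width_nm):.1e}")
```

Under these assumptions, each nanometer shaved off the barrier raises the odds of leakage by a factor of tens of thousands. That's why shrinking transistors eventually runs into trouble.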

Still, engineers tend to think of ways around problems that seem completely insurmountable. Even if transistors can't get any
smaller after one or two more generations, it won't be the end of electronics. It just might mean we advance a little more
slowly than we're accustomed to.
