
How a CPU Works

Introduction

Even though every microprocessor has its own internal design, all microprocessors share the same basic concept, which we will explain in this tutorial. We will take a look inside a generic CPU architecture, so you will be able to understand more about Intel and AMD products and the differences between them.

The CPU (Central Processing Unit), which is also called microprocessor or processor, is in charge of processing data. How it processes data depends on the program. The program can be a spreadsheet, a word processor or a game: for the CPU it makes no difference, since it doesn't understand what the program is actually doing. It just follows the orders (called commands or instructions) contained inside the program. These orders could be to add two numbers or to send a piece of data to the video card, for example.

When you double-click on an icon to run a program, here is what happens:

1. The program, which is stored on the hard disk drive, is transferred to the RAM memory. A program is a series of instructions to the CPU.
2. The CPU, using a circuit called the memory controller, loads the program data from the RAM memory.
3. The data, now inside the CPU, is processed.
4. What happens next depends on the program. The CPU could continue to load and execute the program, or could do something with the processed data, like displaying something on the screen.

Figure 1: How stored data is transferred to the CPU.

In the past, the CPU controlled the data transfer between the hard disk drive and the RAM memory. Since the hard disk drive is slower than the RAM memory, this slowed down the system, since the CPU would be busy until all the data was transferred from the hard disk drive to the RAM memory. This method is called PIO, Processor I/O (or Programmed I/O). Nowadays data transfer between the hard disk drive and the RAM memory is made without using the CPU, thus making the system faster. This method is called bus mastering or DMA (Direct Memory Access). In order to simplify our drawing we didn't put the north bridge chip between the hard disk drive and the RAM memory on Figure 1, but it is there.

Processors from AMD based on sockets 754, 939 and 940 (Athlon 64, Athlon 64 X2, Athlon 64 FX, Opteron and some Sempron models) have an embedded memory controller. This means that for these processors the CPU accesses the RAM memory directly, without using the north bridge chip shown on Figure 1.


Clock

So, what is the clock anyway? The clock is a signal used to synchronize things inside the computer. Take a look at Figure 2, where we show a typical clock signal: it is a square wave changing from "0" to "1" at a fixed rate. On this figure you can see three full clock cycles ("ticks"). The beginning of each cycle is when the clock signal goes from "0" to "1"; we marked this with an arrow. The clock rate is measured in a unit called Hertz (Hz), which is the number of clock cycles per second. A clock of 100 MHz means that in one second there are 100 million clock cycles.
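To make the relation between clock rate and cycle duration concrete, here is a minimal Python sketch. The function name clock_period_ns is ours, introduced only for illustration.

```python
MHZ = 1_000_000  # one megahertz, in cycles per second


def clock_period_ns(frequency_hz):
    """Return the duration of one clock cycle in nanoseconds."""
    return 1e9 / frequency_hz


cycles_per_second = 100 * MHZ          # a 100 MHz clock
period = clock_period_ns(cycles_per_second)

print(cycles_per_second)  # 100000000 cycles in one second
print(period)             # 10.0 nanoseconds per cycle
```

The higher the clock rate, the shorter each cycle: doubling the frequency halves the period.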

Figure 2: Clock signal.

In the computer, all timings are measured in terms of clock cycles. For example, a RAM memory with a "5" latency means that it will take five full clock cycles to start delivering data. Inside the CPU, all instructions take a certain number of clock cycles to be performed. For example, a given instruction can take seven clock cycles to be fully executed.

Regarding the CPU, the interesting thing is that the CPU knows how many clock cycles each instruction will take, because it has a table which lists this information. So if it has two instructions to be executed and it knows that the first will take seven clock cycles to be executed, it will automatically start the execution of the next instruction on the 8th clock tick. Of course this is a generic explanation for a CPU with just one execution unit; modern processors have several execution units working in parallel and could execute the second instruction at the same time as the first. This is called superscalar architecture and we will talk more about this later.

So, what does the clock have to do with performance? To think that clock and performance are the same thing is the most common misconception about processors.

If you compare two completely identical CPUs, the one running at a higher clock rate will be faster. In this case, with a higher clock rate, the time between each clock cycle will be shorter, so things are going to be performed in less time and the performance will be higher. But when you compare two different processors, this is not necessarily true.

If you get two processors with different architectures (for example, from two different manufacturers, like Intel and AMD), things inside the CPU are completely different. As we mentioned, each instruction takes a certain number of clock cycles to be executed. Let's say that processor "A" takes seven clock cycles to perform a given instruction, and that processor "B" takes five clock cycles to perform this same instruction.
If they are running at the same clock rate, processor "B" will be faster, because it can process this instruction in less time.

For modern CPUs there is much more in the performance game, as CPUs have different numbers of execution units, different cache sizes, different ways of transferring data inside the CPU, different ways of processing the instructions inside the execution units, different clock rates with the outside world, etc. Don't worry; we will cover all that in this tutorial.

As the processor clock signal became very high, one problem showed up. The motherboard where the processor is installed could not work using the same clock signal. If you look at a motherboard, you will see several tracks or paths. These tracks are wires that connect the several circuits of the computer. The problem is that with higher clock rates, these wires started to work as antennas, so the signal, instead of arriving at the other end of the wire, would simply vanish, being transmitted as radio waves.
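Going back to the processor "A" and processor "B" comparison, the time an instruction takes is its number of clock cycles divided by the clock rate. The sketch below works this out. The 1 GHz and 1.4 GHz clock rates and the function name are assumptions chosen only for illustration, not figures from any real CPU.

```python
GHZ = 1_000_000_000  # one gigahertz, in cycles per second


def instruction_time_ns(cycles, clock_hz):
    """Time in nanoseconds to execute an instruction that needs `cycles` clock cycles."""
    return cycles * 1_000_000_000 / clock_hz


# Same (assumed) 1 GHz clock: B needs fewer cycles, so B finishes first.
time_a = instruction_time_ns(7, 1 * GHZ)   # 7.0 ns
time_b = instruction_time_ns(5, 1 * GHZ)   # 5.0 ns

# A only catches up with B when its clock is raised to an (assumed) 1.4 GHz.
time_a_faster_clock = instruction_time_ns(7, 1_400_000_000)  # 5.0 ns

print(time_a, time_b, time_a_faster_clock)
```

This is why the clock rate alone does not tell you which of two different processors is faster.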

Figure 3: The wires on the motherboard can work as antennas.

External Clock

So the CPU manufacturers started using a new concept, called clock multiplication, which started with the 486DX2 processor. Under this scheme, which is used in all CPUs nowadays, the CPU has an external clock, which is used when transferring data to and from the RAM memory (using the north bridge chip), and a higher internal clock.

To give a real example, on a 3.4 GHz Pentium 4 this "3.4 GHz" refers to the CPU internal clock, which is obtained by multiplying its 200 MHz external clock by 17. We illustrated this example on Figure 4.
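The clock multiplication in this example is plain arithmetic, sketched below in Python with the values from the text. The variable names are ours.

```python
external_clock_mhz = 200   # external clock of the Pentium 4 in the example
multiplier = 17            # clock multiplier

internal_clock_mhz = external_clock_mhz * multiplier

print(internal_clock_mhz)          # 3400 (MHz)
print(internal_clock_mhz / 1000)   # 3.4 (GHz)
```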


Figure 4: Internal and external clocks on a Pentium 4 3.4 GHz.

The huge difference between the internal clock and the external clock on modern CPUs is one major roadblock to overcome in order to increase computer performance. Continuing the Pentium 4 3.4 GHz example, it has to reduce its speed by 17x when it has to read data from RAM memory! During this process, it works as if it were a 200 MHz CPU!

Several techniques are used to minimize the impact of this clock difference. One of them is the use of a memory cache inside the CPU. Another one is transferring more than one data chunk per clock cycle. Processors from both AMD and Intel use this feature, but while AMD CPUs transfer two data chunks per clock cycle, Intel CPUs transfer four data chunks per clock cycle.
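As a rough sketch of what transferring several data chunks per clock cycle does to the effective rate of the external bus, the Python snippet below multiplies the real clock by the number of transfers per cycle. It reuses the 200 MHz external clock from the example; the function name is hypothetical.

```python
def effective_clock_mhz(real_clock_mhz, transfers_per_cycle):
    """Effective clock of a bus that moves several data chunks per real clock cycle."""
    return real_clock_mhz * transfers_per_cycle


amd_effective = effective_clock_mhz(200, 2)    # two transfers per cycle
intel_effective = effective_clock_mhz(200, 4)  # four transfers per cycle

print(amd_effective)    # 400
print(intel_effective)  # 800
```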

Figure 5: Transferring more than one data chunk per clock cycle.

Because of that, AMD CPUs are listed as having double their real external clocks. For example, an AMD CPU with a 200 MHz external clock is listed as 400 MHz. The same happens with Intel CPUs: an Intel CPU with a 200 MHz external clock is listed as having an 800 MHz external clock.

The technique of transferring two data chunks per clock cycle is called DDR (Double Data Rate), while the technique of transferring four data chunks per clock cycle is called QDR (Quad Data Rate).

Block Diagram of a CPU

On Figure 6 you can see a basic block diagram for a modern CPU. There are many differences between AMD and Intel architectures (read Inside Pentium 4 Architecture for a detailed view of the Pentium 4 architecture). Understanding the basic block diagram of a modern CPU is the first step to understanding how CPUs from Intel and AMD work and the differences between them.

Figure 6: Basic block diagram of a CPU.

The dotted line on Figure 6 represents the CPU body, as the RAM memory is located outside the CPU. The datapath between the RAM memory and the CPU is usually 64 bits wide (or 128 bits when a dual channel memory configuration is used), running at the memory clock or the CPU external clock, whichever is lower. The number of bits used and the clock rate can be combined in a unit called transfer rate, measured in MB/s. To calculate the transfer rate, the formula is number of bits x clock / 8. For a system using DDR400 memories in single channel configuration (64 bits) the memory transfer rate will be 3,200 MB/s, while the same system using dual channel memories (128 bits) will have a 6,400 MB/s memory transfer rate.

All the circuits inside the dotted box run at the CPU internal clock. Depending on the CPU, some of its internal parts can even run at a higher clock rate. Also, the datapath between the CPU units can be wider, i.e. transfer more bits per clock cycle than 64 or 128. For example, the datapath between the L2 memory cache and the L1 instruction cache on modern processors is usually 256 bits wide. The higher the number of bits transferred per clock cycle, the faster the transfer will be done (in other words, the transfer rate will be higher). On Figure 6 we used a red arrow between the RAM memory and the L2 memory cache and green arrows between all other blocks to express the different clock rates and datapath widths used.

Memory Cache

Memory cache is a high-performance kind of memory, also called static memory. The kind of memory used in the computer's main RAM memory is called dynamic memory. Static memory consumes more power, is more expensive and is physically bigger than dynamic memory, but it is a lot faster. It can work at the same clock as the CPU, which dynamic memory is not capable of.

Since going to the "external world" to fetch data makes the CPU work at a lower clock rate, the memory cache technique is used. When the CPU loads data from a certain memory position, a circuit called the memory cache controller (not drawn on Figure 6 for simplicity) loads into the memory cache a whole block of data below the current position that the CPU has just loaded. Since programs usually flow in a sequential way, the next memory position the CPU will request will probably be the position immediately below the memory position that it has just loaded. Since the memory cache controller already loaded a lot of data below the first memory position read by the CPU, the next data will be inside the memory cache, so the CPU doesn't need to go outside to grab the data: it is already loaded in the memory cache embedded in the CPU, which it can access at its internal clock rate.

The cache controller is always observing the memory positions being loaded and loading data from several memory positions after the memory position that has just been read. To give you a real example, if the CPU loaded data stored at address 1,000, the cache controller will load data from "n" addresses after address 1,000. This number "n" is called page; if a given processor is working with 4 KB pages (which is a typical value), it will load data from 4,096 addresses below the current memory position being loaded (address 1,000 in our example). By the way, 1 KB equals 1,024 bytes, which is why 4 KB is 4,096 and not 4,000. On Figure 7 we illustrate this example.
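The behavior of the cache controller in this example can be imitated with a toy model. The Python sketch below is only an illustration under simplifying assumptions: it keeps a single page in the cache, it treats every address as one unit of data, and the class name ToyCache is ours. It counts how many reads are served from the cache when a program reads 5,000 consecutive addresses starting at address 1,000.

```python
PAGE_SIZE = 4 * 1024  # 4 KB page, the typical value quoted in the text (4,096)


class ToyCache:
    """Single-page cache: on a failed lookup, load the page starting at that address."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self.cached = range(0)  # nothing cached yet
        self.hits = 0           # reads served from the cache
        self.misses = 0         # reads that had to go to RAM

    def load(self, address):
        if address in self.cached:
            self.hits += 1
        else:
            self.misses += 1
            # The controller loads the block of addresses following the one just read.
            self.cached = range(address, address + self.page_size)


cache = ToyCache()
for address in range(1000, 1000 + 5000):  # a program reading sequentially
    cache.load(address)

print(cache.hits, cache.misses)  # 4998 2
```

With purely sequential reads, only two of the 5,000 accesses have to go out to the slow RAM memory: the very first one and the one that runs past the end of the first page.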

Figure 7: How the memory cache controller works.

The bigger the memory cache, the higher the chances that the data required by the CPU is already there, so the CPU will need to directly access RAM memory less often, thus increasing the system performance (just remember that every time the CPU needs to access the RAM memory directly it needs to lower its clock rate for this operation).

We call it a "hit" when the CPU loads required data from the cache, and a "miss" when the required data isn't there and the CPU needs to access the system RAM memory.

L1 and L2 mean "Level 1" and "Level 2", respectively, and refer to the distance they are from the CPU core (execution unit). One common doubt is why there are three separate cache memories (L1 data cache, L1 instruction cache and L2 cache). Pay attention to Figure 6 and you will see that the L1 instruction cache works as an "input cache", while the L1 data cache works as an "output cache". The L1 instruction cache, which is usually smaller than the L2 cache, is particularly efficient when the program starts to repeat a small part of it (a loop), because the required instructions will be closer to the fetch unit.

On the specs page of a CPU the L1 cache can be found with different kinds of representation. Some manufacturers list the two L1 caches separately (sometimes calling the instruction cache "I" and the data cache "D"), some add the amounts of the two and write "separated" (so "128 KB, separated" would mean a 64 KB instruction cache and a 64 KB data cache), and some simply add the two and you have to guess that the amount is the total and that you should divide it by two to get the capacity of each cache. The exception, however, goes to the Pentium 4 and newer Celeron CPUs based on sockets 478 and 775.

Pentium 4 processors (and Celeron processors using sockets 478 and 775) don't have an L1 instruction cache; instead they have a trace execution cache, which is a cache located between the decode unit and the execution unit. So, the L1 instruction cache is there, but with a different name and a different location. We are mentioning this here because it is a very common mistake to think that Pentium 4 processors don't have an L1 instruction cache.
So when comparing the Pentium 4 to other CPUs people would think that its L1 cache is much smaller, because they are only counting the 8 KB L1 data cache. The trace execution cache of Pentium 4 and Celeron CPUs is 150 KB and should be taken into account, of course.

Branching

As we mentioned several times, one of the main problems for the CPU is having too many cache misses, because the fetch unit must directly access the slow RAM memory, thus slowing down the system. Usually the use of the memory cache avoids this a lot, but there is one typical situation where the cache controller will miss: branches. If in the middle of the program there is an instruction called JMP ("jump" or "go to") sending the program to a completely different memory position, this new position won't be loaded in the L2 memory cache, making the fetch unit go get that position directly in the RAM memory. In order to solve this issue, the cache controller of modern CPUs analyzes the memory block it loaded, and whenever it finds a JMP instruction in there it will load the memory block for that position into the L2 memory cache before the CPU reaches that JMP instruction.


Figure 8: Unconditional branching situation.

This is pretty easy to implement. The problem is when the program has a conditional branch, i.e. the address the program should go to depends on a condition not yet known. For example, if a <= b go to address 1, or if a > b go to address 2. We illustrate this example on Figure 9. This would cause a cache miss, because the values of a and b are unknown and the cache controller would be looking only for JMP-like instructions. The solution: the cache controller loads both targets into the memory cache. Later, when the CPU processes the branching instruction, it will simply discard the one that wasn't chosen. It is better to load the memory cache with unnecessary data than to directly access the RAM memory.
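The idea of scanning a loaded block for branches and preloading their destinations can be sketched as follows. This is a toy model, not how any real cache controller is built: the tuple-based instruction format, the JCOND mnemonic for a conditional branch and the function name are all invented for the illustration.

```python
def prefetch_targets(block):
    """Return the addresses a toy cache controller would preload for a block of instructions."""
    targets = []
    for op, *args in block:
        if op == "JMP":
            # Unconditional branch: one known destination.
            targets.append(args[0])
        elif op == "JCOND":
            # Conditional branch: the outcome is unknown, so preload both destinations.
            targets.extend(args)
    return targets


block = [
    ("ADD",),           # ordinary instruction, nothing to preload
    ("JMP", 5000),      # go to address 5000
    ("JCOND", 1, 2),    # if a <= b go to address 1, otherwise go to address 2
]

print(prefetch_targets(block))  # [5000, 1, 2]
```

One of the two destinations preloaded for the conditional branch will end up unused, which is the price paid for avoiding a trip to the RAM memory.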

Figure 9: Conditional branching situation.

Processing Instructions

The fetch unit is in charge of loading instructions from memory. First, it will check whether the instruction required by the CPU is in the L1 instruction cache. If it is not, it goes to the L2 memory cache. If the instruction is also not there, then it has to load it directly from the slow system RAM memory.

When you turn on your PC all the caches are empty, of course, but as the system starts loading the operating system, the CPU starts processing the first instructions loaded from the hard drive, the cache controller starts loading the caches, and the show begins.

After the fetch unit grabs the instruction required by the CPU to be processed, it sends it to the decode unit. The decode unit will then figure out what that particular instruction does. It does that by consulting a ROM memory that exists inside the CPU, called microcode. Each instruction that a given CPU understands has its own microcode. The microcode will "teach" the CPU what to do. It is like a step-by-step guide to every instruction. If the instruction loaded is, for example, add a+b, its microcode will tell the decode unit that it needs two parameters, a and b. The decode unit will then request the fetch unit to grab the data present in the next two memory positions, which hold the values for a and b. After the decode unit has "translated" the instruction and grabbed all the data required to execute it, it will pass all the data and the "step-by-step cookbook" on how to execute that instruction to the execute unit.

The execute unit will finally execute the instruction. On modern CPUs you will find more than one execution unit working in parallel. This is done in order to increase the processor performance. For example, a CPU with six execution units can execute six instructions in parallel, so in theory it could achieve the same performance as six processors with just one execution unit. This kind of architecture is called superscalar architecture.

Usually modern CPUs don't have several identical execution units; they have execution units specialized in one kind of instruction.
The best example is the FPU, Floating-Point Unit, which is in charge of executing complex math instructions. Usually between the decode unit and the execution unit there is a unit (called the dispatch or schedule unit) in charge of sending the instruction to the correct execution unit, i.e. if the instruction is a math instruction it will send it to the FPU and not to one "generic" execution unit. By the way, "generic" execution units are called ALUs, Arithmetic and Logic Units.

Finally, when the processing is over, the result is sent to the L1 data cache. Continuing our add a+b example, the result would be sent to the L1 data cache. This result can then be sent back to RAM memory or to another place, such as the video card, for example. But this will depend on the instruction that is going to be processed next (the next instruction could be "print the result on the screen").

Another interesting feature that all microprocessors have had for a long time is called the "pipeline", which is the capability of having several different instructions at different stages of the CPU at the same time.

After the fetch unit has sent the instruction to the decode unit, it will be idle, right? So, how about, instead of doing nothing, putting the fetch unit to grab the next instruction? When the first instruction goes to the execution unit, the fetch unit can send the second instruction to the decode unit and grab the third instruction, and so on.

In a modern CPU with an 11-stage pipeline (stage is another name for each unit of the CPU), it will probably have 11 instructions inside it at the same time almost all the time. In fact, since all modern CPUs have a superscalar architecture, the number of instructions simultaneously inside the CPU will be even higher.

Also, for an 11-stage pipeline CPU, an instruction has to pass through 11 units to be fully executed. The higher the number of stages, the longer an instruction will take to be fully executed. On the other hand, keep in mind that because of this concept several instructions can be running inside the CPU at the same time. The very first instruction loaded by the CPU can take 11 steps to get out of it, but once it goes out, the second instruction will get out right after it (and not another 11 steps later).

There are several other tricks used by modern CPUs to increase performance. We will explain two of them, out-of-order execution (OOO) and speculative execution.

Out-of-Order Execution (OOO)

Remember that we said that modern CPUs have several execution units working in parallel? We also said that there are different kinds of execution units, like the ALU, which is a generic execution unit, and the FPU, which is a math execution unit. Just as a generic example in order to understand the problem, let's say that a given CPU has six execution engines, four "generic" and two FPUs. Let's also say that the program has the following instruction flow at a given moment:

1. generic instruction
2. generic instruction
3. generic instruction
4. generic instruction
5. generic instruction
6. generic instruction
7. math instruction
8. generic instruction
9. generic instruction
10. math instruction

What will happen? The schedule/dispatch unit will send the first four instructions to the four ALUs but then, at the fifth instruction, the CPU will need to wait for one of its ALUs to be free in order to continue processing, since all four generic execution units are busy. That's not good, because we still have two math units (FPUs) available, and they are idle. So, a CPU with out-of-order execution (all modern CPUs have this feature) will look at the next instruction to see if it can be sent to one of the idle units. In our example, it can't, because the sixth instruction also needs one ALU to be processed. The out-of-order engine continues its search and finds out that the seventh instruction is a math instruction that can be executed in one of the available FPUs. Since the other FPU will still be available, it will go down the program looking for another math instruction. In our example, it will pass the eighth and the ninth instructions and will load the tenth instruction.

So, in our example, the execution units will be processing, at the same time, the first, the second, the third, the fourth, the seventh and the tenth instructions.

The name out-of-order comes from the fact that the CPU doesn't need to wait; it can pull an instruction from the bottom of the program and process it before the instructions above it are processed. Of course the out-of-order engine cannot go on looking forever if it cannot find a suitable instruction. The out-of-order engine of all CPUs has a depth limit on how far it can crawl looking for instructions (a typical value would be 512).

Speculative Execution

Let's suppose that one of these generic instructions is a conditional branch. What will the out-of-order engine do? If the CPU implements a feature called speculative execution (all modern CPUs do), it will execute both branches. Consider the example below:

1. generic instruction
2. generic instruction
3. if a <= b go to instruction 15
4. generic instruction
5. generic instruction
6. generic instruction
7. math instruction
8. generic instruction
9. generic instruction
10. math instruction
…
15. math instruction
16. generic instruction
…

When the out-of-order engine analyzes this program, it will pull instruction 15 into one of the FPUs, since it will need one math instruction to fill one of the FPUs that otherwise would be idle. So at a given moment we could have both branches being processed at the same time. If, when the CPU finishes processing the third instruction, a is greater than b, the CPU will simply discard the processing of instruction 15. You may think this is a waste of time, but in fact it is not.
It doesn't cost the CPU anything to execute that particular instruction, because the FPU would be otherwise idle anyway. On the other hand, if a <= b the CPU will have a performance boost, since when instruction 3 asks for instruction 15 it will already be processed, going straight to instruction 16 or even further, if instruction 16 has also already been processed by the out-of-order engine.
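The out-of-order example with four generic units and two FPUs can be reproduced with a small simulation. The Python sketch below is a deliberately simplified model: it ignores dependencies between instructions, looks at a single moment in time, and uses a function name and a window parameter of our own (the default of 512 follows the typical depth limit mentioned in the text).

```python
def dispatch(instructions, free_units, window=512):
    """Return the 1-based numbers of the instructions sent to execution units.

    instructions: list of unit kinds ("generic" or "math"), in program order.
    free_units:   how many units of each kind are idle, e.g. {"generic": 4, "math": 2}.
    window:       how far down the program the engine is allowed to look.
    """
    free = dict(free_units)
    issued = []
    for number, kind in enumerate(instructions[:window], start=1):
        if free.get(kind, 0) > 0:
            free[kind] -= 1          # occupy one idle unit of the right kind
            issued.append(number)
        if not any(free.values()):   # every unit is busy, stop searching
            break
    return issued


program = ["generic"] * 6 + ["math"] + ["generic"] * 2 + ["math"]
issued = dispatch(program, {"generic": 4, "math": 2})

print(issued)  # [1, 2, 3, 4, 7, 10]
```

The fifth, sixth, eighth and ninth instructions are skipped because no generic unit is free, while the seventh and tenth fill the two FPUs that would otherwise sit idle.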


Of course everything we explained in this tutorial is an oversimplification in order to make this very technical subject easier to understand (read Inside Pentium 4 Architecture in order to study the architecture of a specific processor).

Inside Pentium 4 Architecture

Introduction

In this tutorial we will explain how the Pentium 4 works in easy-to-follow language. You will learn exactly how its architecture works so you will be able to compare it more precisely to previous processors from Intel and competitors from AMD.

Pentium 4 and new Celeron processors use Intel's seventh generation architecture, also called NetBurst. You can see its overall look on Figure 1. Don't get scared. We will explain in depth what this diagram is about.

In order to continue, however, you need to have read "How a CPU Works". There we explain the basics of how a CPU works. In the present tutorial we are assuming that you have already read it, so if you didn't, please take a moment to read it before continuing, otherwise you may find yourself a little bit lost.

Figure 1: Pentium 4 block diagram.

Here are the basic differences between the Pentium 4 architecture and the architecture of other CPUs:


• Externally, the Pentium 4 transfers four data chunks per clock cycle. This technique is called QDR (Quad Data Rate) and makes the local bus have a performance four times its actual clock rate; see the table below. On Figure 1 this is shown as "3.2 GB/s System Interface"; since this slide was produced when the very first Pentium 4 was released, it mentions the "400 MHz" system bus.

Real Clock    Performance    Transfer Rate
100 MHz       400 MHz        3.2 GB/s
133 MHz       533 MHz        4.2 GB/s
200 MHz       800 MHz        6.4 GB/s
266 MHz       1,066 MHz      8.5 GB/s

• The datapath between the L2 memory cache ("L2 cache and control" on Figure 1) and the L1 data cache ("L1 D-Cache and D-TLB" on Figure 1) is 256 bits wide. On previous processors from Intel this datapath was only 64 bits wide. So this communication can be four times faster than on processors from previous generations when running at the same clock. The datapath between the L2 memory cache ("L2 cache and control" on Figure 1) and the pre-fetch unit ("BTB & I-TLB" on Figure 1), however, continues to be 64 bits wide.
• The L1 instruction cache was relocated. Instead of being before the fetch unit, the L1 instruction cache is now after the decode unit, with a new name, "Trace Cache". This trace cache can hold up to 12 K microinstructions. Since each microinstruction is 100 bits wide, the trace cache is 150 KB (12 K x 100 / 8). One of the most common mistakes people make when commenting on the Pentium 4 architecture is saying that the Pentium 4 doesn't have any instruction cache at all. That's absolutely not true. It is there, but with a different name and a different location.
• On the Pentium 4 there are 128 internal registers; on Intel's 6th generation processors (like Pentium II and Pentium III) there were only 40 internal registers. These registers are in the Register Renaming Unit (a.k.a. RAT, Register Alias Table, shown as "Rename/Alloc" on Figure 1).
• The Pentium 4 has five execution units working in parallel and two units for loading and storing data on RAM memory.
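The transfer rates in the table above can be approximated with the formula given in "How a CPU Works" (number of bits x clock / 8). The Python sketch below assumes a 64-bit system bus, which the table itself does not state, and uses the "Performance" clock column as the clock; the function name is ours.

```python
def transfer_rate_mb_s(bus_width_bits, clock_mhz):
    """Transfer rate in MB/s: bits per transfer x transfers per second / 8 bits per byte."""
    return bus_width_bits * clock_mhz / 8


BUS_WIDTH_BITS = 64  # assumption: width of the Pentium 4 system bus

rates = {clock: transfer_rate_mb_s(BUS_WIDTH_BITS, clock)
         for clock in (400, 533, 800, 1066)}

for clock, rate in rates.items():
    print(clock, "MHz ->", rate, "MB/s")
```

Under these assumptions the results are 3,200, 4,264, 6,400 and 8,528 MB/s, which is close to the 3.2, 4.2, 6.4 and 8.5 GB/s listed in the table when 1 GB/s is taken as 1,000 MB/s.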

Of course this is just a summary for those who already have some knowledge of the architecture of other processors. If all this looks like Greek to you, don't worry. We will explain everything you need to know about the Pentium 4 architecture in easy-to-follow language on the next pages.

Pentium 4 Pipeline

The pipeline is a list of all the stages a given instruction must go through in order to be fully executed. On 6th generation Intel processors, like the Pentium III, the pipeline had 11 stages. The Pentium 4 has 20 stages! So, on a Pentium 4 processor a given instruction takes much longer to be executed than on a Pentium III, for instance! If you take the new 90 nm Pentium 4 generation processors, codenamed "Prescott", the case is even worse because they use a 31-stage pipeline! Holy cow!

This was done in order to increase the processor clock rate. By having more stages each individual stage can be constructed using fewer transistors. With fewer transistors it is easier to achieve higher clock rates. In fact, the Pentium 4 is only faster than the Pentium III because it works at a higher clock rate. Under the same clock rate, a Pentium III CPU would be faster than a Pentium 4 because of the size of the pipeline.

Because of that, Intel has already announced that their 8th generation processors will use the Pentium M architecture, which is based on Intel's 6th generation architecture (Pentium III architecture) and not on the NetBurst (Pentium 4) architecture.

On Figure 2 you can see the Pentium 4 20-stage pipeline. So far Intel hasn't disclosed Prescott's 31-stage pipeline, so we can't talk about it.

Figure 2: Pentium 4 pipeline.

Here is a basic explanation of each stage, which explains how a given instruction is processed by Pentium 4 processors. If you think this is too complex for you, don't worry. This is just a summary of what we will be explaining on the next pages.

• TC Nxt IP: Trace cache next instruction pointer. This stage looks at the branch target buffer (BTB) for the next microinstruction to be executed. This step takes two stages.
• TC Fetch: Trace cache fetch. Loads this microinstruction from the trace cache. This step takes two stages.
• Drive: Sends the microinstruction to be processed to the resource allocator and register renaming circuit.
• Alloc: Allocate. Checks which CPU resources will be needed by the microinstruction, for example the memory load and store buffers.
• Rename: If the program uses one of the eight standard x86 registers, it will be renamed into one of the 128 internal registers present on the Pentium 4. This step takes two stages.
• Que: Queue. The microinstructions are put in queues according to their types (for example, integer or floating point). They are held in the queue until there is an open slot of the same type in the scheduler.
• Sch: Schedule. Microinstructions are scheduled to be executed according to their type (integer, floating point, etc). Before arriving at this stage, all instructions are in order, i.e. in the same order they appear in the program. At this stage, the scheduler re-orders the instructions in order to keep all execution units full. For example, if there is one floating point unit going to be available, the scheduler will look for a floating point instruction to send to this unit, even if the next instruction in the program is an integer one. The scheduler is the heart of the out-of-order engine of Intel 7th generation processors. This step takes three stages.
• Disp: Dispatch. Sends the microinstructions to their corresponding execution engines. This step takes two stages.
• RF: Register file. The internal registers, stored in the instructions pool, are read. This step takes two stages.
• Ex: Execute. Microinstructions are executed.
• Flgs: Flags. The microprocessor flags are updated.
• Br Ck: Branch check. Checks if the branch taken by the program is the same predicted by the branch prediction circuit.
• Drive: Sends the results of this check to the branch target buffer (BTB) present on the processor's entrance.
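As a quick sanity check, the stage counts in the list above can be added up to confirm that they give the 20-stage pipeline. The Python sketch below assumes that every step for which the text gives no count takes exactly one stage.

```python
# (step name, number of pipeline stages); steps with no stated count are assumed to take one.
PIPELINE = [
    ("TC Nxt IP", 2),
    ("TC Fetch", 2),
    ("Drive", 1),
    ("Alloc", 1),
    ("Rename", 2),
    ("Que", 1),
    ("Sch", 3),
    ("Disp", 2),
    ("RF", 2),
    ("Ex", 1),
    ("Flgs", 1),
    ("Br Ck", 1),
    ("Drive", 1),
]

total_stages = sum(stages for _, stages in PIPELINE)
print(total_stages)  # 20
```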

Memory Cache and Fetch Unit

Pentium 4's L2 memory cache can be 256 KB, 512 KB, 1 MB or 2 MB, depending on the model. The L1 data cache is 8 KB or 16 KB (on 90 nm models).

As we explained before, the L1 instruction cache was moved from before the fetch unit to after the decode unit using a new name, "trace cache". So, instead of storing program instructions to be loaded by the fetch unit, the trace cache stores microinstructions already decoded by the decode unit. The trace cache can store up to 12 K microinstructions and since Pentium 4 microinstructions are 100 bits wide, the trace cache is 150 KB (12,288 x 100 / 8).

The idea behind this architecture is really interesting. In the case of a loop in the program (a loop is a part of a program that needs to be repeated several times), the instructions to be executed will be already decoded, because they are stored already decoded in the trace cache. On other processors, the instructions need to be loaded from the L1 instruction cache and decoded again, even if they were decoded a few moments before.

The trace cache also has its own BTB (Branch Target Buffer) of 512 entries. The BTB is a small memory that lists all identified branches in the program. As for the fetch unit, its BTB was increased to 4,096 entries. On Intel 6th generation processors, like the Pentium III, this buffer had 512 entries and on Intel 5th generation processors, like the first Pentium processor, this buffer had only 256 entries.
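The trace cache size quoted above follows directly from the number of microinstructions and their width. A short Python check of that arithmetic, with variable names of our own:

```python
microinstructions = 12 * 1024   # 12 K entries = 12,288
bits_per_microinstruction = 100

size_bytes = microinstructions * bits_per_microinstruction // 8
size_kb = size_bytes / 1024

print(microinstructions)  # 12288
print(size_bytes)         # 153600
print(size_kb)            # 150.0
```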


On Figure 3 you see the block diagram for what we were discussing. TLB means Translation Lookaside Buffer.

Figure 3: Fetch and decode units and trace cache.

Decoder

Since the previous generation (6th generation), Intel processors use a hybrid CISC/RISC architecture. The processor must accept CISC instructions, also known as x86 instructions, since all software available today is written using this kind of instruction. A RISC-only CPU couldn't be created for the PC because it wouldn't run the software we have available today, like Windows and Office. So, the solution used by all processors available on the market today from both Intel and AMD is to use a CISC/RISC decoder. Internally the CPU processes RISC-like instructions, but its front-end accepts only CISC x86 instructions.

CISC x86 instructions are referred to as "instructions", while the internal RISC instructions are referred to as "microinstructions" or "µops". These RISC microinstructions, however, cannot be accessed directly, so we couldn't create software based on these instructions to bypass the decoder. Also, each CPU uses its own RISC instructions, which are not publicly documented and are incompatible with microinstructions from other CPUs. I.e., Pentium III microinstructions are different from Pentium 4 microinstructions, which are different from Athlon 64 microinstructions.

Depending on the complexity of the x86 instruction, it has to be converted into several RISC microinstructions. The Pentium 4 decoder can decode one x86 instruction per clock cycle, as long as the instruction decodes into up to four microinstructions. If the x86 instruction to be decoded is complex and will be translated into more than four microinstructions, it is routed to a ROM memory ("Microcode ROM" on Figure 3) that has a list of all complex instructions and how they should be translated. This ROM memory is also called MIS (Microcode Instruction Sequencer).

As we said earlier, after being decoded, microinstructions are sent to the trace cache, and from there they go to a microinstructions queue. The trace cache can put up to three microinstructions on the queue per clock cycle; however, Intel doesn't disclose the depth (size) of this queue. From there, the instructions go to the Allocator and Register Renamer. The queue can also deliver up to three microinstructions per clock cycle to the allocator.

Allocator and Register Renamer

What the allocator does:

• Reserves one of the 126 reorder buffers (ROB) for the current microinstruction, in order to keep track of the microinstruction completion status. This allows the microinstructions to be executed out-of-order, since the CPU will be able to put them in order again by using this table.
• Reserves one of the 128 register files (RF) in order to store the data resulting from the microinstruction processing.
• If the microinstruction is a load or a store, i.e. it will read (load) or write (store) data from/to RAM memory, it will reserve one of the 48 load buffers or one of the 24 store buffers accordingly.
• Reserves an entry on the memory or general queue, depending on the kind of microinstruction it is.
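The allocator's bookkeeping can be pictured as a set of counters. The sketch below is a toy model, not Intel's design: the class and method names are invented, the counts (126, 128, 48, 24) come from the list above, and it assumes that a microinstruction simply fails to allocate when any resource it needs is exhausted. Resources are never released here, and the queue entry is left out.

```python
from dataclasses import dataclass


@dataclass
class Allocator:
    """Toy model of the resource counts listed above (hypothetical names)."""
    rob_free: int = 126     # reorder buffer entries
    rf_free: int = 128      # register file entries
    load_free: int = 48     # load buffers
    store_free: int = 24    # store buffers

    def allocate(self, kind: str) -> bool:
        """Reserve resources for one microinstruction.

        kind is "load", "store" or anything else for a non-memory operation.
        Returns False, reserving nothing, if a needed resource is exhausted.
        """
        if self.rob_free == 0 or self.rf_free == 0:
            return False
        if kind == "load" and self.load_free == 0:
            return False
        if kind == "store" and self.store_free == 0:
            return False
        self.rob_free -= 1
        self.rf_free -= 1
        if kind == "load":
            self.load_free -= 1
        elif kind == "store":
            self.store_free -= 1
        return True


alloc = Allocator()
results = [alloc.allocate("store") for _ in range(25)]
print(results.count(True), alloc.rob_free)  # 24 102
```

In this model the 25th store in a row cannot be allocated, because only 24 store buffers exist, even though reorder buffer entries are still free.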

After that the microinstruction goes to the register renaming stage. The CISC x86 architecture has only eight 32-bit registers (EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). This number is simply too low, especially because modern CPUs can execute code out-of-order, which would "kill" the contents of a given register, crashing the program. So, at this stage, the processor maps each register used by the program onto one of the 128 internal registers available. This allows an instruction to run at the same time as another instruction that uses the exact same standard register, or even out-of-order, i.e. the second instruction can run before the first instruction even if they both touch the same register.

It is interesting to note that Pentium 4 actually has 256 internal registers: 128 registers for integer instructions and 128 registers for floating point and SSE instructions. The Pentium 4 renamer is capable of processing three microinstructions per clock cycle.

From the renamer the microinstructions go to a queue, according to their type: the memory queue, for memory-related microinstructions, or the Integer/Floating Point Queue, for all other instruction types.
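Register renaming is easier to see on a small example. The sketch below is a simplified illustration, not the Pentium 4's actual renamer: the function name and the (destination, source, source) tuple format are made up, and it assumes that every write receives a fresh internal register and that internal registers are never recycled.

```python
def rename(instructions, num_physical=128):
    """Map x86 register names onto internal register numbers.

    instructions: list of (dest, src1, src2) register names.
    Returns a list of (dest, src1, src2) internal register numbers.
    """
    free = list(range(num_physical))
    mapping = {}    # x86 name -> internal register holding its latest value

    def lookup(name):
        # A register read before any write still needs an internal home.
        if name not in mapping:
            mapping[name] = free.pop(0)
        return mapping[name]

    renamed = []
    for dest, src1, src2 in instructions:
        p1 = lookup(src1)
        p2 = lookup(src2)
        pd = free.pop(0)        # each write gets a fresh internal register
        mapping[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


program = [
    ("EAX", "EBX", "ECX"),   # EAX = EBX op ECX
    ("EDX", "EAX", "EBX"),   # reads the EAX written above
    ("EAX", "ESI", "EDI"),   # overwrites EAX
]
print(rename(program))  # [(2, 0, 1), (3, 2, 0), (6, 4, 5)]
```

The third instruction writes internal register 6 while the second still reads register 2, so the third could run first without destroying the value the second one needs.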

Figure 4: Allocator and Register Renamer.

Dispatch and Execution Units

As we've seen, Pentium 4 has four dispatch ports numbered 0 thru 3. Each port is connected to one, two or three execution units, as you can see on Figure 6.

Figure 6: Dispatch and execution units.

The units marked as "clock x2" can execute two microinstructions per clock cycle. Ports 0 and 1 can send two microinstructions per clock cycle to these units. So the maximum number of microinstructions that can be dispatched per clock cycle is six:

• Two microinstructions on port 0;
• Two microinstructions on port 1;
• One microinstruction on port 2;
• One microinstruction on port 3.
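The per-port limits above can be expressed as a small calculation. This is only an illustration: the function name is invented, and it assumes that each ready microinstruction is already tagged with the port it must use.

```python
from collections import Counter

# Microinstructions each dispatch port can send per clock cycle (from the list above).
PORT_WIDTH = {0: 2, 1: 2, 2: 1, 3: 1}


def dispatched_this_cycle(ready_ports):
    """Count how many ready microinstructions can be dispatched in one cycle.

    ready_ports: one port number (0-3) per microinstruction waiting to dispatch.
    """
    wanted = Counter(ready_ports)
    return sum(min(wanted[port], width) for port, width in PORT_WIDTH.items())


print(dispatched_this_cycle([0, 0, 1, 1, 2, 3]))     # 6
print(dispatched_this_cycle([0, 0, 0, 1, 2, 2, 3]))  # 5
```

The second call shows that seven ready microinstructions do not help if they are concentrated on the same ports: only five of them leave in that cycle.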

Keep in mind that complex instructions may take several clock cycles to be processed. Let's take the example of port 1, where the complete floating point unit is located. While this unit is processing a very complex instruction that takes several clock ticks to be executed, the port 1 dispatch unit won't stall: it will keep sending simple instructions to the ALU (Arithmetic and Logic Unit) while the FPU is busy. So, even though the maximum dispatch rate is six microinstructions, the CPU can actually have up to seven microinstructions being processed at the same time.


Actually that's why ports 0 and 1 have more than one execution unit attached. If you pay attention, Intel put on the same port one fast unit together with at least one complex (and slow) unit. So, while the complex unit is busy processing data, the other unit can keep receiving microinstructions from its corresponding dispatch port. As we mentioned before, the idea is to keep all execution units busy all the time.

The two double-speed ALUs can process two microinstructions per clock cycle. The other units need at least one clock cycle to process the microinstructions they receive. So, the Pentium 4 architecture is optimized for simple instructions.

As you can see on Figure 6, dispatch ports 2 and 3 are dedicated to memory operations: load (read data from memory) and store (write data to memory), respectively. As for memory operations, it is interesting to note that port 0 is also used during store operations (see Figure 5 and the list of operations on Figure 6). On such operations, port 3 is used to send the memory address, while port 0 is used to send the data to be stored at this address. This data can be generated by either the ALU or the FPU, depending on the kind of data to be stored (integer or floating point/SSE).

On Figure 6 you have a complete list of the kinds of instructions each execution unit deals with. FXCH and LEA (Load Effective Address) are two x86 instructions. Intel's implementation of the FXCH instruction on Pentium 4 caused a great deal of surprise to all experts, because on processors from the previous generation (Pentium III) and processors from AMD this instruction can be executed in zero clock cycles, while on Pentium 4 it takes some clock cycles to be executed.

Chipsets

Introduction

After all, what is a chipset? What are its functions? What is its importance? What is its influence on computer performance? In this tutorial we will answer all these questions and more.

Chipset is the name given to the set of chips (hence its name) used on a motherboard. On the first PCs, the motherboard used discrete integrated circuits, so a lot of chips were needed to create all the circuitry necessary to make the computer work. On Figure 1 you can see a motherboard from a PC XT.


Figure 1: PC XT motherboard.

After some time the chip manufacturers started to integrate several chips into larger chips. So, instead of requiring dozens of small chips, a motherboard could now be built using only a half-dozen big chips. The integration continued and around the mid-1990's motherboards using only two or even one big chip could be built. On Figure 2 you can see a 486 motherboard circa 1995 using only two big chips with all necessary functions to make the motherboard work.


Figure 2: A 486 motherboard; this model uses only two big chips.

With the release of the PCI bus, a new concept, which is still used nowadays, could be used for the first time: the use of bridges. Usually motherboards have two big chips: north bridge and south bridge. Sometimes chip manufacturers integrate the north and south bridges into a single chip; in this case the motherboard will have just one big integrated circuit! With the use of bridges chipsets could be better standardized, and we will explain the role of these chips on the next pages.

Chipsets can be manufactured by several companies, like ULi (new name for ALi), Intel, VIA, SiS, ATI and nVidia. In the past other players were in the market, like UMC and OPTi.

A common confusion is to mix up the chipset manufacturer with the motherboard manufacturer. For example, just because a motherboard uses a chipset manufactured by Intel, this does not mean that Intel manufactured the board. ASUS, ECS, Gigabyte, MSI, DFI, Chaintech, PCChips, Shuttle and also Intel are just some of the many motherboard manufacturers present in the market. So, the motherboard manufacturer buys the chipsets from the chipset manufacturer and builds its boards around them. Actually there is a very interesting aspect of this relationship. To build a motherboard, the manufacturer can follow the chipset manufacturer's standard project, also known as "reference design", or can create its own project, modifying some things here and there in order to provide better performance or more features.

North Bridge

The north bridge chip, also called MCH (Memory Controller Hub), is connected directly to the CPU and has basically the following functions:

• Memory controller (*)
• AGP bus controller (if available)
• PCI Express x16 controller (if available)
• Interface for data transfer with south bridge

(*) Except for socket 754, socket 939 and socket 940 CPUs (CPUs from AMD like Athlon 64), because on these CPUs the memory controller is located in the CPU itself, not in the north bridge.

Some north bridge chips also control PCI Express x1 lanes. On other PCI Express chipsets it is the south bridge that controls the PCI Express x1 lanes. In our explanations we will assume that the south bridge is the component in charge of controlling the PCI Express x1 lanes, but keep in mind that this can vary according to the chipset model.


On Figure 3 you can see a diagram explaining the role of the north bridge in the computer.

Figure 3: North bridge.

As you can see, the CPU does not directly access the RAM memory or the video card; it is the north bridge that accesses these devices. Because of that, the north bridge chip has a decisive role in computer performance. If a north bridge chip has a better memory controller than another north bridge, the performance of the whole computer will be better. That's one explanation for why two motherboards targeted at the same class of processors can achieve different performance.

As we mentioned, on Athlon 64 CPUs the memory controller is embedded in the CPU and that's why there is almost no performance difference among motherboards for this platform.

Since the memory controller is in the north bridge, it is this chip that limits the types and maximum amount of memory you can have in your system (on Athlon 64 it is the CPU that sets these limits).

The connection between the north bridge and the south bridge is done through a bus. At first the PCI bus was used, but later it was replaced by a dedicated bus. We will explain more about this later, since the kind of bus used on this connection can affect the computer performance.

South Bridge

The south bridge chip, also called ICH (I/O Controller Hub), is connected to the north bridge and is basically in charge of controlling I/O devices and on-board devices, like:

• Hard disk drive ports (Parallel and Serial ATA ports)
• USB ports
• On-board audio (*)
• On-board LAN (**)
• PCI bus
• PCI Express lanes (if available)
• Real time clock (RTC)
• CMOS memory
• Legacy devices like interrupt controller and DMA controller

(*) If the south bridge has a built-in audio controller, it will need an external chip called codec (short for coder/decoder) to operate.
(**) If the south bridge has a built-in network controller, it will need an external chip called phy (short for physical) to operate.

The south bridge is also connected to two other chips available on the motherboard: the ROM chip, better known as BIOS, and the Super I/O chip, which is in charge of controlling legacy devices like serial ports, parallel port and floppy disk drive. On Figure 4 you can see a diagram explaining the role of the south bridge in the computer.

Figure 4: South bridge.

As you can see, while the south bridge can have some influence on hard disk drive performance, this component is not as critical to performance as the north bridge. Actually, the south bridge has more to do with the features your motherboard will have than with performance. It is the south bridge that sets the number (and speed) of USB ports and the number and types (regular ATA or Serial ATA) of hard disk drive ports that your motherboard has, for example.

Inter-Bridge Architecture

When the bridge concept started to be used, the communication between the north bridge and the south bridge was done through the PCI bus, as we show on Figure 5. The problem with this approach is that the bandwidth available on the PCI bus (132 MB/s) is shared between all PCI devices in the system and the devices hooked to the south bridge, especially hard disk drives. At that time, this wasn't a problem, since hard drive maximum transfer rates were of 8 MB/s and 16 MB/s.

Figure 5: Communication between north and south bridges using the PCI bus.

But when high-end video cards (at that time, the video cards were PCI) and high-performance hard disk drives were launched, a bottleneck arose. Just think of modern ATA/133 hard disk drives, which have the same theoretical maximum transfer rate as the PCI bus! So, in theory, an ATA/133 hard drive would "kill" the entire bandwidth, slowing down the communication speed of all devices connected to the PCI bus.

For the high-end video cards, the solution was the creation of a new bus connected directly to the north bridge, called AGP (Accelerated Graphics Port).

The final solution came when the chipset manufacturers started using a new approach: using a dedicated high-speed bus between north and south bridges and connecting the PCI bus devices to the south bridge.
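The bottleneck on the shared PCI bus comes down to simple arithmetic. The sketch below is a rough illustration using theoretical peak rates only; the function and constant names are invented, and it assumes the standard 33 MHz, 32-bit PCI bus, which gives the 132 MB/s figure quoted in the text.

```python
PCI_CLOCK_MHZ = 33        # assumed standard PCI clock
PCI_WIDTH_BYTES = 4       # assumed 32-bit bus
PCI_BANDWIDTH_MB_S = PCI_CLOCK_MHZ * PCI_WIDTH_BYTES   # 132 MB/s


def bandwidth_left(device_rates_mb_s, bus_mb_s=PCI_BANDWIDTH_MB_S):
    """Return the bus bandwidth left over once the listed devices run flat out."""
    return max(0, bus_mb_s - sum(device_rates_mb_s))


print(bandwidth_left([8, 16]))   # two old hard drives: 108 MB/s still free
print(bandwidth_left([133]))     # one ATA/133 drive: 0 MB/s left for anything else
```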


Figure 6: Communication between north and south bridges using a dedicated bus.

When Intel started using this architecture it started calling the bridges "hubs": the north bridge became MCH (Memory Controller Hub) and the south bridge became ICH (I/O Controller Hub). It is just a matter of nomenclature in order to clarify the architecture that is being used.

Using this new architecture, which is the architecture that motherboards use nowadays, when the CPU reads data from a hard drive, the data is transferred from the hard drive to the south bridge, then to the north bridge (using the dedicated bus) and then to the CPU (or directly to memory, if the bus mastering method, a.k.a. DMA, is being used). As you can see, the PCI bus is not used at all in this transfer, which was not the case in the previous architecture, where the PCI bus sat in the middle of the path.

The speed of this dedicated bus depends on the chipset model. For example, on the Intel 925X chipset this bus has a maximum transfer speed of 2 GB/s. Also, the manufacturers call this bus by different names:

• Intel: DMI (Direct Media Interface) or Intel Hub Architecture (*)
• ULi/ALi: HyperTransport
• VIA: V-Link
• SiS: MuTIOL (**)
• ATI: A-Link or PCI Express
• nVidia: HyperTransport (**)

(*) The DMI interface is newer, used on i915 and i925 chipsets onward, and uses two separate data paths, one for data transmission and another for reception (full-duplex communication). Intel Hub Architecture, used by previous chipsets, uses the same data path for both transmission and reception (half-duplex communication).
(**) Some nVidia and SiS chipsets use only one chip, i.e. the functionalities of both north and south bridges are integrated into a single chip.

Also, on the Radeon Xpress 200 from ATI, the communication between north and south bridges uses two PCI Express lanes. This doesn't affect the performance of the system because, contrary to PCI, the PCI Express bus is not shared between all PCI Express devices. It is a point-to-point solution, which means that the bus only connects two devices, the receiver and the transmitter; no other device can be attached to this connection. One lane is used for data transmission and the other for data reception (full-duplex communication).

The HyperTransport bus also uses separate data paths, one for data transmission and another for reception (full-duplex communication).

If you want to know the details of a given chipset, just go to the chipset manufacturer's website.

As a last comment, you may be wondering what the "on-board PCI devices" listed on Figures 5 and 6 are. On-board devices like LAN and audio can be controlled by the chipset (south bridge) or by an extra controller chip. When this second approach is used, this controller chip is connected to the PCI bus.

PC MOTHERBOARDS :
If you've ever taken the case off of a computer, you've seen the one piece of equipment that ties everything together -- the motherboard. A motherboard allows all the parts of your computer to receive power and communicate with one another. Motherboards have come a long way in the last twenty years. The first motherboards held very few actual components. The first IBM PC motherboard had only a processor and card slots. Users plugged components like floppy drive controllers and memory into the slots. Today, motherboards typically boast a wide variety of built-in features, and they directly affect a computer's capabilities and potential for upgrades. In this article, we'll look at the general components of a motherboard. Then, we'll closely examine five points that dramatically affect what a computer can do.

Form Factor
A motherboard by itself is useless, but a computer has to have one to operate. The motherboard's main job is to hold the computer's microprocessor chip and let everything else connect to it. Everything that runs the computer or enhances its performance is either part of the motherboard or plugs into it via a slot or port.


A modern motherboard.

The shape and layout of a motherboard is called the form factor. The form factor affects where individual components go and the shape of the computer's case. There are several specific form factors that most PC motherboards use so that they can all fit in standard cases. For a comparison of form factors, past and present, check out Motherboards.org.

The form factor is just one of the many standards that apply to motherboards. Some of the other standards include:

• The socket for the microprocessor determines what kind of Central Processing Unit (CPU) the motherboard uses.
• The chipset is part of the motherboard's logic system and is usually made of two parts -- the northbridge and the southbridge. These two "bridges" connect the CPU to other parts of the computer.
• The Basic Input/Output System (BIOS) chip controls the most basic functions of the computer and performs a self-test every time you turn it on. Some systems feature dual BIOS, which provides a backup in case one fails or in case of error during updating.
• The real time clock chip is a battery-operated chip that maintains basic settings and the system time.

The slots and ports found on a motherboard include:

• Peripheral Component Interconnect (PCI) - connections for video, sound and video capture cards, as well as network cards
• Accelerated Graphics Port (AGP) - dedicated port for video cards
• Integrated Drive Electronics (IDE) - interfaces for the hard drives
• Universal Serial Bus or FireWire - external peripherals
• Memory slots

Some motherboards also incorporate newer technological advances:

• Redundant Array of Independent Discs (RAID) controllers allow the computer to recognize multiple drives as one drive.
• PCI Express is a newer protocol that acts more like a network than a bus. It can eliminate the need for other ports, including the AGP port.
• Rather than relying on plug-in cards, some motherboards have on-board sound, networking, video or other peripheral support.

A Socket 754 motherboard

Many people think of the CPU as one of the most important parts of a computer. We'll look at how it affects the rest of the computer in the next section.

Sockets and CPUs
The CPU is the first thing that comes to mind when many people think about a computer's speed and performance. The faster the processor, the faster the computer can think. In the early days of PC computers, all processors had the same set of pins that would connect the CPU to the motherboard, called the Pin Grid Array (PGA). These pins fit into a socket layout called Socket 7. This meant that any processor would fit into any motherboard.


A Socket 939 motherboard

Today, however, CPU manufacturers Intel and AMD use a variety of PGAs, none of which fit into Socket 7. As microprocessors advance, they need more and more pins, both to handle new features and to provide more and more power to the chip.

Current socket arrangements are often named for the number of pins in the PGA. Commonly used sockets are:

• Socket 478 - for older Pentium and Celeron processors
• Socket 754 - for AMD Sempron and some AMD Athlon processors
• Socket 939 - for newer and faster AMD Athlon processors
• Socket AM2 - for the newest AMD Athlon processors
• Socket A - for older AMD Athlon processors


A Socket LGA775 motherboard

The newest Intel CPU does not have a PGA. It has an LGA, also known as Socket T. LGA stands for Land Grid Array. An LGA is different from a PGA in that the pins are actually part of the socket, not the CPU.

Anyone who already has a specific CPU in mind should select a motherboard based on that CPU. For example, if you want to use one of the new multi-core chips made by Intel or AMD, you will need to select a motherboard with the correct socket for those chips. CPUs simply will not fit into sockets that don't match their PGA.

The CPU communicates with other elements of the motherboard through a chipset. We'll look at the chipset in more detail next.

Chipsets
The chipset is the "glue" that connects the microprocessor to the rest of the motherboard and therefore to the rest of the computer. On a PC, it consists of two basic parts -- the northbridge and the southbridge. All of the various components of the computer communicate with the CPU through the chipset.


The northbridge and southbridge

The northbridge connects directly to the processor via the front side bus (FSB). A memory controller is located on the northbridge, which gives the CPU fast access to the memory. The northbridge also connects to the AGP or PCI Express bus and to the memory itself.

The southbridge is slower than the northbridge, and information from the CPU has to go through the northbridge before reaching the southbridge. Other busses connect the southbridge to the PCI bus, the USB ports and the IDE or SATA hard disk connections.

Chipset selection and CPU selection go hand in hand, because manufacturers optimize chipsets to work with specific CPUs. The chipset is an integrated part of the motherboard, so it cannot be removed or upgraded. This means that not only must the motherboard's socket fit the CPU, the motherboard's chipset must work optimally with the CPU.

Next, we'll look at busses, which, like the chipset, carry information from place to place.

Bus Speed
A bus is simply a circuit that connects one part of the motherboard to another. The more data a bus can handle at one time, the faster it allows information to travel. The speed of the bus, measured in megahertz (MHz), refers to how many transfer cycles the bus completes per second; together with the width of the bus, it determines how much data can move across the bus in a given time.


Busses connect different parts of the motherboard to one another

Bus speed usually refers to the speed of the front side bus (FSB), which connects the CPU to the northbridge. FSB speeds can range from 66 MHz to over 800 MHz. Since the CPU reaches the memory controller through the northbridge, FSB speed can dramatically affect a computer's performance.

Here are some of the other busses found on a motherboard:

• The back side bus connects the CPU with the level 2 (L2) cache, also known as secondary or external cache. The processor determines the speed of the back side bus.
• The memory bus connects the northbridge to the memory.
• The IDE or ATA bus connects the southbridge to the disk drives.
• The AGP bus connects the video card to the memory and the CPU. The speed of the AGP bus is usually 66 MHz.
• The PCI bus connects PCI slots to the southbridge. On most systems, the speed of the PCI bus is 33 MHz. Also compatible with PCI is PCI Express, which is much faster than PCI but is still compatible with current software and operating systems. PCI Express is likely to replace both PCI and AGP busses.

The faster a computer's bus speed, the faster it will operate -- to a point. A fast bus speed cannot make up for a slow processor or chipset.

Now let's look at memory and how it affects the motherboard's speed.


Memory and Other Features
We've established that the speed of the processor itself controls how quickly a computer thinks. The speed of the chipset and busses controls how quickly it can communicate with other parts of the computer. The speed of the RAM connection directly controls how fast the computer can access instructions and data, and therefore has a big effect on system performance. A fast processor with slow RAM is going nowhere.

The amount of memory available also controls how much data the computer can have readily available. RAM makes up the bulk of a computer's memory. The general rule of thumb is the more RAM the computer has, the better.

Much of the memory available today is double data rate (DDR) memory. This means that the memory can transmit data twice per cycle instead of once, which makes the memory faster. Also, most motherboards have space for multiple memory chips, and on newer motherboards, they often connect to the northbridge via a dual bus instead of a single bus. This further reduces the amount of time it takes for the processor to get information from the memory.
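The effect of transferring data twice per cycle can be put into numbers. The sketch below is a back-of-the-envelope calculation, not a benchmark: the function name is invented, and it assumes a 64-bit wide memory module and theoretical peak rates.

```python
def peak_bandwidth_mb_s(clock_mhz, transfers_per_cycle, bus_width_bits=64):
    """Theoretical peak transfer rate in MB/s (bus width is an assumed 64 bits)."""
    return clock_mhz * transfers_per_cycle * bus_width_bits / 8


print(peak_bandwidth_mb_s(200, 1))   # one transfer per cycle:  1600.0
print(peak_bandwidth_mb_s(200, 2))   # DDR, two per cycle:      3200.0
```

At the same 200 MHz clock, the DDR module's theoretical peak is twice that of a module transferring once per cycle.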

200-pin DDR SODIMM RAM

A motherboard's memory slots directly affect what kind and how much memory is supported. Just like other components, the memory plugs into the slot via a series of pins. The memory module must have the right number of pins to fit into the slot on the motherboard.

In the earliest days of motherboards, virtually everything other than the processor came on a card that plugged into the board. Now, motherboards feature a variety of onboard accessories such as LAN support, video, sound support and RAID controllers.

64MB SDRAM SIMM

Motherboards with all the bells and whistles are convenient and simple to install. There are motherboards that have everything you need to create a complete computer -- all you do is stick the motherboard in a case and add a hard disk, a CD drive and a power supply. You have a completely operational computer on a single board.

For many average users, these built-in features provide ample support for video and sound. For avid gamers and people who do high-intensity graphic or computer-aided design (CAD) work, however, separate video cards provide much better performance.

Introduction to INTEL's new microprocessors architecture

Sandy Bridge is the name of the new microarchitecture Intel CPUs are using starting in 2011. It is an evolution of the Nehalem microarchitecture that was first introduced in the Core i7 and also used in the Core i3 and Core i5 processors.

If you don't follow the CPU market that closely, let's make a quick recap. After the Pentium 4, which was based on Intel's 7th generation microarchitecture, called Netburst, Intel decided to go back to their 6th generation microarchitecture (the same one used by Pentium Pro, Pentium II, and Pentium III, dubbed P6), which proved to be more efficient. From the Pentium M CPU (which is a 6th generation Intel CPU), Intel developed the Core architecture, which was used on the Core 2 processor series (Core 2 Duo, Core 2 Quad, etc). Then, Intel took this architecture, tweaked it a little bit more (the main innovation was the addition of an integrated memory controller), and released the Nehalem microarchitecture, which was used on the Core i3, Core i5, and Core i7 processor series. And, from this microarchitecture, Intel developed the Sandy Bridge microarchitecture, which was used by the new generation of Core i3, Core i5, and Core i7 processors in 2011 and 2012.

To better understand the present tutorial, we recommend that you read the following tutorials, in this particular order:

• Inside Pentium M Architecture
• Inside Intel Core Microarchitecture
• Inside Intel Nehalem Microarchitecture

The main specifications for the Sandy Bridge microarchitecture are summarized below. We will explain them in more detail in the next pages.


• The north bridge (memory controller, graphics controller and PCI Express controller) is integrated in the same chip as the rest of the CPU. In Nehalem-based CPUs, the north bridge is located in a separate silicon chip packed together with the CPU silicon chip. In fact, with 32-nm Nehalem-based CPUs the north bridge is manufactured under a 45-nm process.
• First models use a 32-nm manufacturing process
• Ring architecture
• New decoded microinstructions cache (L0 cache, capable of storing 1,536 microinstructions, which translates to more or less 6 kB)
• 32 kB L1 instruction and 32 kB L1 data cache per CPU core (no change from Nehalem)
• L2 memory cache was renamed to "mid-level cache" (MLC) with 256 kB per CPU core
• L3 memory cache is now called LLC (Last Level Cache), it is not unified anymore, and is shared by the CPU cores and the graphics engine
• Next generation Turbo Boost technology
• New AVX (Advanced Vector Extensions) instruction set
• Improved graphics controller
• Redesigned DDR3 dual-channel memory controller supporting memories up to DDR3-1333
• Integrated PCI Express controller supporting one x16 lane or two x8 lanes (no change from Nehalem)
• First models use a new socket with 1155 pins


Figure 1: Sandy Bridge microarchitecture summary

Enhancements to the CPU Pipeline

Let's start our journey talking about what is new in the way instructions are processed in the Sandy Bridge microarchitecture.

There are four instruction decoders, meaning that the CPU can decode up to four instructions per clock cycle. These decoders are in charge of decoding IA32 (a.k.a. x86)
