
How a CPU Works

Author: Gabriel Torres

Even though every microprocessor has its own internal design, all microprocessors share the same basic concept, which we will explain in this tutorial. We will take a look inside a generic CPU architecture, so you will be able to understand more about Intel and AMD products and the differences between them.

The CPU (Central Processing Unit), which is also called the microprocessor or processor, is in charge of processing data. How it processes data depends on the program. The program can be a spreadsheet, a word processor or a game: for the CPU it makes no difference, since it doesn't understand what the program is actually doing. It just follows the orders (called commands or instructions) contained in the program. These orders could be to add two numbers or to send a piece of data to the video card, for example.

When you double-click an icon to run a program, here is what happens:

1. The program, which is stored on the hard disk drive, is transferred to RAM. A program is a series of instructions to the CPU.
2. The CPU, using a circuit called the memory controller, loads the program data from RAM.
3. The data, now inside the CPU, is processed.
4. What happens next depends on the program. The CPU could continue to load and execute the program, or it could do something with the processed data, like displaying something on the screen.

Figure 1: How stored data is transferred to the CPU.

In the past, the CPU controlled the data transfer between the hard disk drive and RAM. Since the hard disk drive is slower than RAM, this slowed down the system, because the CPU was kept busy until all the data had been transferred from the hard disk drive to RAM. This method is called PIO (Programmed I/O, sometimes expanded as Processor I/O). Nowadays data transfer between the hard disk drive and RAM is made without using the CPU, which makes the system faster. This method is called bus mastering or DMA (Direct Memory Access). To simplify our drawing we didn't put the north bridge chip between the hard disk drive and RAM in Figure 1, but it is there.

Processors from AMD based on sockets 754, 939 and 940 (Athlon 64, Athlon 64 X2, Athlon 64 FX, Opteron and some Sempron models) have an embedded memory controller. This means that these processors access RAM directly, without using the north bridge chip shown in Figure 1.

To better understand the role of the chipset in a computer, we recommend that you read our tutorial Everything You Need to Know About Chipsets.

Clock

So, what is the clock anyway? The clock is a signal used to synchronize things inside the computer. Take a look at Figure 2, where we show a typical clock signal: it is a square wave changing from "0" to "1" at a fixed rate. In this figure you can see three full clock cycles ("ticks"). The beginning of each cycle is when the clock signal goes from "0" to "1"; we marked this with an arrow. The clock signal is measured in a unit called Hertz (Hz), which is the number of clock cycles per second. A clock of 100 MHz means that in one second there are 100 million clock cycles.

Figure 2: Clock signal.

In the computer, all timings are measured in terms of clock cycles. For example, RAM with a latency of "5" takes five full clock cycles to start delivering data. Inside the CPU, every instruction takes a certain number of clock cycles to be performed. For example, a given instruction can take seven clock cycles to be fully executed.

Regarding the CPU, the interesting thing is that it knows how many clock cycles each instruction will take, because it has a table that lists this information. So if it has two instructions to execute and it knows that the first will take seven clock cycles, it will automatically start executing the next instruction on the 8th clock tick. Of course this is a generic explanation for a CPU with just one execution unit. Modern processors have several execution units working in parallel and could execute the second instruction at the same time as the first. This is called superscalar architecture and we will talk more about it later.

So, what does the clock have to do with performance? To think that clock and performance are the same thing is the most common misconception about processors.

If you compare two completely identical CPUs, the one running at a higher clock rate will be faster. In this case, with a higher clock rate, each clock cycle is shorter, so things are performed in less time and the performance is higher. But when you compare two different processors, this is not necessarily true.

If you take two processors with different architectures (for example, from two different manufacturers, like Intel and AMD), things inside the CPU are completely different. As we mentioned, each instruction takes a certain number of clock cycles to be executed. Let's say that processor "A" takes seven clock cycles to perform a given instruction, and that processor "B" takes five clock cycles to perform this same instruction. If they are running at the same clock rate, processor "B" will be faster, because it can process this instruction in less time.

For modern CPUs there is much more to the performance game, as CPUs have different numbers of execution units, different cache sizes, different ways of transferring data inside the CPU, different ways of processing the instructions inside the execution units, different clock rates with the outside world, etc. Don't worry, we will cover all that in this tutorial.

External Clock

As the processor clock signal became very high, one problem showed up. The motherboard where the processor is installed could not work using the same clock signal. If you look at a motherboard, you will see several tracks or paths. These tracks are wires that connect the several circuits of the computer. The problem is that with higher clock rates, these wires started to work as antennas, so the signal, instead of arriving at the other end of the wire, would simply vanish, being transmitted as radio waves.

Figure 3: The wires on the motherboard can work as antennas.

So the CPU manufacturers started using a new concept, called clock multiplication, which started with the 486DX2 processor. Under this scheme, which is used in all CPUs nowadays, the CPU has an external clock, which is used when transferring data to and from RAM (using the north bridge chip), and a higher internal clock.

To give a real example, on a 3.4 GHz Pentium 4 this "3.4 GHz" refers to the CPU internal clock, which is obtained by multiplying its 200 MHz external clock by 17. We illustrate this example in Figure 4.

Figure 4: Internal and external clocks on a Pentium 4 3.4 GHz.
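The clock arithmetic used in this section (cycles per instruction, and internal clock obtained from the external clock and a multiplier) can be sketched in a few lines of Python. This is only an illustration: the function names and the 100 MHz clock chosen for processors "A" and "B" are assumptions made for the example, while the 7-cycle and 5-cycle figures and the 200 MHz x 17 Pentium 4 numbers come from the text.

```python
def instruction_time_ns(cycles: int, clock_hz: int) -> float:
    """Time in nanoseconds needed by an instruction that takes `cycles` clock cycles."""
    return cycles * 1_000_000_000 / clock_hz


def internal_clock_mhz(external_clock_mhz: int, multiplier: int) -> int:
    """Internal CPU clock obtained by clock multiplication."""
    return external_clock_mhz * multiplier


# Processors "A" and "B" from the text, both running at an assumed 100 MHz.
CLOCK_HZ = 100_000_000
time_a = instruction_time_ns(7, CLOCK_HZ)  # seven cycles of 10 ns each
time_b = instruction_time_ns(5, CLOCK_HZ)  # five cycles of 10 ns each
print(f"A: {time_a} ns, B: {time_b} ns")

# Pentium 4 example from the text: 200 MHz external clock, multiplier of 17.
print(internal_clock_mhz(200, 17), "MHz")
```

At the same clock rate, processor "B" finishes the instruction sooner simply because it needs fewer cycles, which is why clock rate alone does not tell you which processor is faster.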

The huge difference between the internal clock and the external clock on modern CPUs is one major roadblock to overcome in order to increase computer performance. Continuing the Pentium 4 3.4 GHz example, it has to reduce its speed by 17x when it has to read data from RAM! During this process, it works as if it were a 200 MHz CPU!

Several techniques are used to minimize the impact of this clock difference. One of them is the use of a memory cache inside the CPU. Another one is transferring more than one data chunk per clock cycle. Processors from both AMD and Intel use this feature, but while AMD CPUs transfer two data chunks per clock cycle, Intel CPUs transfer four data chunks per clock cycle.

Figure 5: Transferring more than one data per clock cycle.

Because of that, AMD CPUs are listed as having double their real external clocks. For example, an AMD CPU with a 200 MHz external clock is listed as 400 MHz. The same happens with Intel CPUs: an Intel CPU with a 200 MHz external clock is listed as having an 800 MHz external clock.

The technique of transferring two data chunks per clock cycle is called DDR (Double Data Rate), while the technique of transferring four data chunks per clock cycle is called QDR (Quad Data Rate).

Block Diagram of a CPU

In Figure 6, you can see a basic block diagram for a modern CPU. There are many differences between AMD and Intel architectures. Read our tutorial Inside Pentium 4 Architecture for a detailed view of the Pentium 4 architecture. We still plan to write a specific tutorial about the Athlon 64 architecture in the near future. We think that understanding the basic block diagram of a modern CPU is the first step to understanding how CPUs from Intel and AMD work and the differences between them.
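The relationship between the real external clock and the "listed" external clock described above is a simple multiplication, sketched below in Python. The table and function names are assumptions made for this illustration, and the "SDR" entry (one transfer per cycle) is added only as a baseline that the text does not name.

```python
# Transfers per clock cycle for each technique.
# DDR and QDR come from the text; "SDR" is an assumed label for the plain case.
TRANSFERS_PER_CYCLE = {"SDR": 1, "DDR": 2, "QDR": 4}


def listed_clock_mhz(real_clock_mhz: int, technique: str) -> int:
    """External clock as advertised, given the real clock and the transfer technique."""
    return real_clock_mhz * TRANSFERS_PER_CYCLE[technique]


def slowdown_factor(internal_clock_mhz: int, external_clock_mhz: int) -> float:
    """How many times slower the CPU works when it goes to the external clock."""
    return internal_clock_mhz / external_clock_mhz


print(listed_clock_mhz(200, "DDR"))  # AMD example from the text
print(listed_clock_mhz(200, "QDR"))  # Intel example from the text
print(slowdown_factor(3400, 200))    # Pentium 4 3.4 GHz example
```

The listed figure describes how many transfers happen per second, not how fast the clock signal itself oscillates; in both examples the real external clock is still 200 MHz.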

Figure 6: Basic block diagram of a CPU.

The dotted line in Figure 6 represents the CPU body, as RAM is located outside the CPU. The datapath between RAM and the CPU is usually 64 bits wide (or 128 bits when a dual channel memory configuration is used), running at the memory clock or the CPU external clock, whichever is lower. The number of bits used and the clock rate can be combined in a unit called transfer rate, measured in MB/s. To calculate the transfer rate, the formula is number of bits x clock / 8. For a system using DDR400 memories in a single channel configuration (64 bits) the memory transfer rate will be 3,200 MB/s, while the same system using dual channel memories (128 bits) will have a 6,400 MB/s memory transfer rate. For more information on this subject, read our tutorial Everything You Need to Know About DDR Dual Channel.

All the circuits inside the dotted box run at the CPU internal clock. Depending on the CPU, some of its internal parts can even run at a higher clock rate. Also, the datapath between the CPU units can be wider, i.e., transfer more bits per clock cycle than 64 or 128. For example, the datapath between the L2 memory cache and the L1 instruction cache on modern processors is usually 256 bits wide. The higher the number of bits transferred per clock cycle, the faster the transfer will be (in other words, the transfer rate will be higher). In Figure 6 we used a red arrow between RAM and the L2 memory cache and green arrows between all other blocks to express the different clock rates and datapath widths used.

Memory Cache

Memory cache is a high performance kind of memory, also called static memory. The kind of memory used for the computer's main RAM is called dynamic memory. Static memory consumes more power, is more expensive and is physically bigger than dynamic memory, but it is a lot faster. It can work at the same clock as the CPU, which dynamic memory is not capable of.

Since going to the "external world" to fetch data makes the CPU work at a lower clock rate, the memory cache technique is used. When the CPU loads data from a certain memory position, a circuit called the memory cache controller (not drawn in Figure 6 for the sake of simplicity) loads into the memory cache a whole block of data below the position that the CPU has just loaded. Since programs usually flow in a sequential way, the next memory position the CPU requests will probably be the position immediately below the one it has just loaded. Since the memory cache controller has already loaded a lot of data below the first memory position read by the CPU, the next data will be inside the memory cache, so the CPU doesn't need to go outside to grab the data: it is already loaded in the memory cache embedded in the CPU, which it can access at its internal clock rate.

The cache controller is always observing the memory positions being loaded and loading data from several memory positions after the memory position that has just been read. To give you a real example, if the CPU loaded data stored at address 1,000, the cache controller will load data from "n" addresses after address 1,000. This number "n" is called the page; if a given processor is working with 4 KB pages (which is a typical value), it will load data from the 4,096 addresses below the current memory position being loaded (address 1,000 in our example). By the way, 1 KB equals 1,024 bytes, which is why 4 KB is 4,096 and not 4,000. In Figure 7 we illustrate this example.

Figure 7: How the memory cache controller works.

The bigger the memory cache, the higher the chances that the data required by the CPU is already there, so the CPU will need to directly access RAM less often, thus increasing the system performance (just remember that every time the CPU needs to access RAM directly it needs to lower its clock rate for this operation).
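The transfer rate formula (number of bits x clock / 8) and the 4 KB page arithmetic from this section can be checked with a short Python sketch. The function names are assumptions made for this illustration. The clock passed to the formula is the listed DDR400 figure of 400 MHz, and `prefetch_window` assumes, for simplicity, one byte per address and a window that starts right after the address just read.

```python
def transfer_rate_mb_s(bus_width_bits: int, clock_mhz: int) -> float:
    """Transfer rate in MB/s: number of bits x clock / 8."""
    return bus_width_bits * clock_mhz / 8


KB = 1024  # 1 KB equals 1,024 bytes


def prefetch_window(address: int, page_bytes: int = 4 * KB) -> range:
    """Addresses the cache controller loads after `address` (toy model, one byte per address)."""
    return range(address + 1, address + 1 + page_bytes)


# DDR400 examples from the text.
print(transfer_rate_mb_s(64, 400))   # single channel
print(transfer_rate_mb_s(128, 400))  # dual channel

# Page example from the text: the CPU has just read address 1,000.
window = prefetch_window(1000)
print(len(window), window[0], window[-1])
```

Doubling the datapath width doubles the transfer rate at the same clock, which is the whole point of a dual channel configuration.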

We call it a "hit" when the CPU loads the required data from the cache, and we call it a "miss" when the required data isn't there and the CPU needs to access the system RAM.

L1 and L2 mean "Level 1" and "Level 2", respectively, and refer to the distance they are from the CPU core (execution unit). One common doubt is why there are three separate cache memories (L1 data cache, L1 instruction cache and L2 cache). Pay attention to Figure 6 and you will see that the L1 instruction cache works as an "input cache", while the L1 data cache works as an "output cache". The L1 instruction cache, which is usually smaller than the L2 cache, is particularly efficient when the program starts to repeat a small part of itself (a loop), because the required instructions will be closer to the fetch unit.

On the specs page of a CPU the L1 cache can be represented in different ways. Some manufacturers list the two L1 caches separately (sometimes calling the instruction cache "I" and the data cache "D"), some add the two amounts and write "separated" (so "128 KB, separated" would mean a 64 KB instruction cache and a 64 KB data cache), and some simply add the two, so you have to guess that the amount is the total and divide it by two to get the capacity of each cache.

The exception, however, is the Pentium 4 and newer Celeron CPUs based on sockets 478 and 775. Pentium 4 processors (and Celeron processors using sockets 478 and 775) don't have an L1 instruction cache; instead they have a trace execution cache, which is a cache located between the decode unit and the execution unit. So, the L1 instruction cache is there, but with a different name and in a different location. We are mentioning this here because it is a very common mistake to think that Pentium 4 processors don't have an L1 instruction cache. So when comparing the Pentium 4 to other CPUs people would think that its L1 cache is much smaller, because they are only counting the 8 KB L1 data cache. The trace execution cache of Pentium 4 and Celeron CPUs is 150 KB and should be taken into account, of course.

Branching

As we mentioned several times, one of the main problems for the CPU is having too many cache misses, because the fetch unit must then directly access the slow RAM, thus slowing down the system.

The memory cache usually avoids most of these misses, but there is one typical situation where the cache controller will miss: branches. If in the middle of the program there is an instruction called JMP ("jump" or "go to") sending the program to a completely different memory position, this new position won't be loaded in the L2 memory cache, making the fetch unit go and get that position directly from RAM. In order to solve this issue, the cache controller of modern CPUs analyzes the memory block it has loaded, and whenever it finds a JMP instruction in there it loads the memory block for that position into the L2 memory cache before the CPU reaches that JMP instruction.
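To make the hit/miss behaviour and the JMP problem more concrete, here is a toy Python simulation. It is a deliberately simplified sketch, not a model of any real cache controller: the class and function names, the 16-address window and the address trace are all assumptions made for this example, and prefetching the jump target up front stands in for "before the CPU reaches that JMP instruction".

```python
class ToyCache:
    """Toy cache: on a miss it loads the requested address plus the next `window` addresses."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.loaded = set()
        self.hits = 0
        self.misses = 0

    def prefetch(self, address: int) -> None:
        # Load a whole block starting at `address`.
        self.loaded.update(range(address, address + self.window + 1))

    def read(self, address: int) -> None:
        if address in self.loaded:
            self.hits += 1
        else:
            self.misses += 1
            self.prefetch(address)


def run(trace, window=16, jump_targets=()):
    """Replay an address trace; `jump_targets` are targets the controller found in advance."""
    cache = ToyCache(window)
    for target in jump_targets:
        cache.prefetch(target)
    for address in trace:
        cache.read(address)
    return cache.hits, cache.misses


# Sequential code at 1000..1009, then a JMP to 9000 followed by 9001..9004.
trace = list(range(1000, 1010)) + list(range(9000, 9005))

print(run(trace))                       # controller ignores the JMP
print(run(trace, jump_targets=[9000]))  # controller prefetched the JMP target
```

In this trace the sequential part costs a single miss (the very first read), and the jump costs a second miss unless the target block was loaded ahead of time.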

Figure 8: Unconditional branching situation.

This is pretty easy to implement. The problem is when the program has a conditional branch, i.e., the address the program should go to depends on a condition not yet known. For example, if a <= b go to address 1, or if a > b go to address 2. We illustrate this example in Figure 9. This would cause a cache miss, because the values of a and b are unknown and the cache controller would be looking only for JMP-like instructions. The solution: the cache controller loads both possible targets into the memory cache. Later, when the CPU processes the branching instruction, it will simply discard the one that wasn't chosen. It is better to load the memory cache with unnecessary data than to access RAM directly.

Figure 9: Conditional branching situation.
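The "load both targets, discard the one not chosen" idea for conditional branches can be sketched as follows. This is an illustration under stated assumptions only: the cache is modelled as a plain dictionary, and `load_block`, `prefetch_branch_targets`, `resolve_branch` and the addresses 100 and 200 are hypothetical names and values invented for this example.

```python
def load_block(address: int, size: int = 4) -> list:
    """Pretend to load a block of `size` consecutive addresses from RAM."""
    return list(range(address, address + size))


def prefetch_branch_targets(cache: dict, targets) -> None:
    """Load the block for every possible branch target into the cache."""
    for target in targets:
        cache[target] = load_block(target)


def resolve_branch(cache: dict, a: int, b: int, if_le: int, if_gt: int) -> int:
    """Once a and b are known, keep the chosen target and discard the other one."""
    taken, not_taken = (if_le, if_gt) if a <= b else (if_gt, if_le)
    cache.pop(not_taken, None)
    return taken


cache = {}
prefetch_branch_targets(cache, [100, 200])  # both targets loaded before a and b are known
taken = resolve_branch(cache, a=3, b=5, if_le=100, if_gt=200)
print(taken, sorted(cache))
```

Whichever way the comparison turns out, the block for the chosen target is already in the cache, at the cost of having loaded one block that is thrown away.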

Processing Instructions

The fetch unit is in charge of loading instructions from memory. First, it checks whether the instruction required by the CPU is in the L1 instruction cache. If it is not, it goes to the L2 memory cache. If the instruction is not there either, then it has to be loaded directly from the slow system RAM.

When you turn on your PC all the caches are empty, of course, but as the system starts loading the operating system, the CPU starts processing the first instructions loaded from the hard drive, the cache controller starts loading the caches, and the show begins.

After the fetch unit has grabbed the instruction required by the CPU, it sends it to the decode unit. The decode unit will then figure out what that particular instruction does. It does that by consulting a ROM memory that exists inside the CPU, called microcode. Each instruction that a given CPU understands has its own microcode. The microcode will "teach" the CPU what to do. It is like a step-by-step guide to every instruction. If the instruction loaded is, for example, add a+b, its microcode will tell the decode unit that it needs two parameters, a and b. The decode unit will then request the fetch unit to grab the data present in the next two memory positions, which hold the values for a and b. After the decode unit has "translated" the instruction and grabbed all the data required to execute it, it passes all the data and the "step-by-step cookbook" on how to execute that instruction to the execution unit.

The execution unit will finally execute the instruction. On modern CPUs you will find more than one execution unit working in parallel. This is done in order to increase processor performance. For example, a CPU with six execution units can execute six instructions in parallel, so in theory it could achieve the same performance as six processors with just one execution unit each. This kind of architecture is called superscalar architecture.

Usually modern CPUs don't have several identical execution units; they have execution units specialized in one kind of instruction. The best example is the FPU, Floating Point Unit, which is in charge of executing complex math instructions. Usually between the decode unit and the execution unit there is a unit (called the dispatch or schedule unit) in charge of sending the instruction to the correct execution unit, i.e., if the instruction is a math instruction it will send it to the FPU and not to a "generic" execution unit. By the way, "generic" execution units are called ALUs, Arithmetic and Logic Units.

Finally, when the processing is over, the result is sent to the L1 data cache. Continuing our add a+b example, the result would be sent to the L1 data cache. This result can then be sent back to RAM or to another place, such as the video card. But this will depend on the instruction that is going to be processed next (the next instruction could be "print the result on the screen").

Another interesting feature that all microprocessors have had for a long time is called the "pipeline", which is the capability of having several different instructions at different stages of the CPU at the same time.

After the fetch unit has sent the instruction to the decode unit, it will be idle, right? So, instead of doing nothing, how about having the fetch unit grab the next instruction? When the first instruction goes to the execution unit, the fetch unit can send the second instruction to the decode unit and grab the third instruction, and so on.

A modern CPU with an 11-stage pipeline (stage is another name for each unit of the CPU) will probably have 11 instructions inside it at the same time almost all the time. In fact, since all modern CPUs have a superscalar architecture, the number of instructions simultaneously inside the CPU will be even higher.

Also, on an 11-stage pipeline CPU, an instruction has to pass through 11 units to be fully executed. The higher the number of stages, the longer an instruction takes to be fully executed. On the other hand, keep in mind that because of this concept several instructions can be running inside the CPU at the same time. The very first instruction loaded by the CPU can take 11 steps to get out of it, but once it goes out, the second instruction will get out right after it (and not another 11 steps later).

There are several other tricks used by modern CPUs to increase performance. We will explain two of them, out-of-order execution (OOO) and speculative execution.

Out-Of-Order Execution (OOO)

Remember that we said that modern CPUs have several execution units working in parallel? We also said that there are different kinds of execution units, like the ALU, which is a generic execution unit, and the FPU, which is a math execution unit. Just as a generic example in order to understand the problem, let's say that a given CPU has six execution engines, four "generic" ones and two FPUs. Let's also say that the program has the following instruction flow at a given moment:

1. generic instruction
2. generic instruction
3. generic instruction
4. generic instruction
5. generic instruction
6. generic instruction
7. math instruction
8. generic instruction
9. generic instruction
10. math instruction

What will happen? The schedule/dispatch unit will send the first four instructions to the four ALUs but then, at the fifth instruction, the CPU will need to wait for one of its ALUs to be free in order to continue processing, since all four of its generic execution units are busy. That's not good, because we still have two math units (FPUs) available, and they are idle. So, a CPU with out-of-order execution (all modern CPUs have this feature) will look at the next instruction to see if it can be sent to one of the idle units. In our example, it can't, because the sixth instruction also needs an ALU to be processed. The out-of-order engine continues its search and finds out that the seventh instruction is a math instruction that can be executed in one of the available FPUs. Since the other FPU will still be available, it will go down the program looking for another math instruction. In our example, it will pass over the eighth and the ninth instructions and will load the tenth instruction.

So, in our example, the execution units will be processing, at the same time, the first, the second, the third, the fourth, the seventh and the tenth instructions.
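The ten-instruction out-of-order example (four ALUs, two FPUs) can be reproduced with a small Python sketch. It is only an illustration of the selection step: the function names are assumptions, and the model ignores data dependencies between instructions, which a real out-of-order engine must respect. The `depth` parameter stands for the limit on how far the engine may look ahead; 512 is the typical value quoted in this tutorial.

```python
def in_order_issue(program, alus=4, fpus=2):
    """Issue instructions strictly in program order; stop at the first one that has no free unit."""
    free = {"generic": alus, "math": fpus}
    issued = []
    for number, kind in enumerate(program, start=1):
        if free[kind] == 0:
            break  # the CPU waits here, even if other units are idle
        free[kind] -= 1
        issued.append(number)
    return issued


def out_of_order_issue(program, alus=4, fpus=2, depth=512):
    """Skip instructions that cannot be issued and keep looking for work for the idle units."""
    free = {"generic": alus, "math": fpus}
    issued = []
    for number, kind in enumerate(program[:depth], start=1):
        if free[kind] > 0:
            free[kind] -= 1
            issued.append(number)
        if not any(free.values()):
            break  # every execution unit is busy
    return issued


# Instruction flow from the text: 1-6 generic, 7 math, 8-9 generic, 10 math.
program = ["generic"] * 6 + ["math", "generic", "generic", "math"]

print(in_order_issue(program))
print(out_of_order_issue(program))
```

Without out-of-order execution only the first four instructions are issued and both FPUs sit idle; with it, instructions 7 and 10 are pulled forward to fill them.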

The name out-of-order comes from the fact that the CPU doesn't need to wait; it can pull an instruction from the bottom of the program and process it before the instructions above it are processed.

Of course the out-of-order engine cannot go on looking forever if it cannot find a suitable instruction. The out-of-order engine of every CPU has a limit on how deep it can crawl looking for instructions (a typical value would be 512).

Speculative Execution

Let's suppose that one of these generic instructions is a conditional branch. What will the out-of-order engine do? If the CPU implements a feature called speculative execution (all modern CPUs do), it will execute both branches. Consider the example below:

1. generic instruction
2. generic instruction
3. if a <= b go to instruction 15
4. generic instruction
5. generic instruction
6. generic instruction
7. math instruction
8. generic instruction
9. generic instruction
10. math instruction
…
15. math instruction
16. generic instruction
…

When the out-of-order engine analyzes this program, it will pull instruction 15 into one of the FPUs, since it needs a math instruction to fill one of the FPUs that would otherwise be idle. So at a given moment we could have both branches being processed at the same time. If, when the CPU finishes processing the third instruction, a is greater than b, the CPU will simply discard the processing of instruction 15. You may think this is a waste of time, but in fact it is not. It doesn't cost the CPU anything to execute that particular instruction, because the FPU would otherwise be idle anyway. On the other hand, if a <= b the CPU will have a performance boost, since when instruction 3 asks for instruction 15 it will already have been processed, and the CPU can go straight to instruction 16 or even further, if instruction 16 has also already been processed by the out-of-order engine.
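The keep-or-discard decision behind speculative execution can be illustrated with a very small Python sketch. Everything here is hypothetical: the function name is invented, and instruction 15 is assumed to be a multiplication of two values `x` and `y` purely so that there is something to compute. Real CPUs do this in hardware, with no program-visible step like this one.

```python
def run_with_speculation(a: int, b: int, x: float, y: float):
    """Compute instruction 15 before the branch at instruction 3 is resolved."""
    # Instruction 15 (assumed here to be x * y) runs early on an idle FPU.
    speculative_result = x * y

    # Instruction 3 is resolved later: "if a <= b go to instruction 15".
    if a <= b:
        return ("kept", speculative_result)  # result was ready in advance
    return ("discarded", None)               # work thrown away, nothing lost


print(run_with_speculation(a=1, b=2, x=3.0, y=4.0))
print(run_with_speculation(a=5, b=2, x=3.0, y=4.0))
```

When the branch goes to instruction 15 the result is already available; when it does not, the only cost is work done by a unit that had nothing else to do.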
Of course everything we explained in this tutorial is an oversimplification in order to make this very technical subject easier to understand.