How a CPU Works

Author: Gabriel Torres

Even though every microprocessor has its own internal design, all microprocessors share the same basic concept, which we will explain in this tutorial. We will take a look inside a generic CPU architecture, so you will be able to understand more about Intel and AMD products and the differences between them.

The CPU (Central Processing Unit), which is also called microprocessor or processor, is in charge of processing data. How it processes data depends on the program. The program can be a spreadsheet, a word processor or a game: for the CPU it makes no difference, since it doesn't understand what the program is actually doing. It just follows the orders (called commands or instructions) contained inside the program. These orders could be to add two numbers or to send a piece of data to the video card, for example.

When you double click on an icon to run a program, here is what happens:

1. The program, which is stored on the hard disk drive, is transferred to the RAM memory. A program is a series of instructions to the CPU.
2. The CPU, using a circuit called memory controller, loads the program data from the RAM memory.
3. The data, now inside the CPU, is processed.
4. What happens next depends on the program. The CPU could continue to load and execute the program or could do something with the processed data, like displaying something on the screen.
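To make the idea of "just following orders" concrete, here is a minimal sketch of a toy CPU loop. It is only an illustration: the instruction names (LOAD, ADD, PRINT) and the single register are invented for this example and do not correspond to any real processor.

```python
# Toy model: a "program" is just a list of instructions sitting in RAM.
# The instruction names below are hypothetical, chosen only for this sketch.
def run(program):
    """Execute the instructions one by one, without knowing what the program is for."""
    register = 0   # a single working value inside the toy CPU
    output = []    # stands in for the screen / video card
    for opcode, operand in program:   # load the next instruction from "RAM"
        if opcode == "LOAD":
            register = operand
        elif opcode == "ADD":
            register += operand
        elif opcode == "PRINT":
            output.append(register)
        else:
            raise ValueError(f"unknown instruction: {opcode}")
    return output


program_in_ram = [("LOAD", 2), ("ADD", 3), ("PRINT", None)]
print(run(program_in_ram))  # [5]
```

The loop never knows whether it is running a spreadsheet or a game; it only sees one instruction after another.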

Figure 1: How stored data is transferred to the CPU.

In the past, the CPU controlled the data transfer between the hard disk drive and the RAM memory. Since the hard disk drive is slower than the RAM memory, this slowed down the system, because the CPU would be busy until all the data was transferred from the hard disk drive to the RAM memory. This method is called PIO, Processor I/O (or Programmed I/O). Nowadays data transfer between the hard disk drive and the RAM memory is made without using the CPU, thus making the system faster. This method is called bus mastering or DMA (Direct Memory Access). In order to simplify our drawing we didn't put the north bridge chip between the hard disk drive and the RAM memory in Figure 1, but it is there.

Processors from AMD based on sockets 754, 939 and 940 (Athlon 64, Athlon 64 X2, Athlon 64 FX, Opteron and some Sempron models) have an embedded memory controller. This means that for these processors the CPU accesses the RAM memory directly, without using the north bridge chip shown in Figure 1.

To better understand the role of the chipset in a computer, we recommend that you read our tutorial Everything You Need to Know About Chipsets.

Clock

So, what is clock anyway? Clock is a signal used to sync things inside the computer. Take a look at Figure 2, where we show a typical clock signal: it is a square wave changing from "0" to "1" at a fixed rate. On this figure you can see three full clock cycles ("ticks"). The beginning of each cycle is when the clock signal goes from "0" to "1"; we marked this with an arrow. The clock signal is measured in a unit called Hertz (Hz), which is the number of clock cycles per second. A clock of 100 MHz means that in one second there are 100 million clock cycles.

Figure 2: Clock signal.

In the computer, all timings are measured in terms of clock cycles. For example, a RAM memory with a "5" latency means that it will take five full clock cycles to start delivering data. Inside the CPU, all instructions take a certain number of clock cycles to be performed. For example, a given instruction can take seven clock cycles to be fully executed.

Regarding the CPU, the interesting thing is that the CPU knows how many clock cycles each instruction will take, because it has a table which lists this information. So if it has two instructions to be executed and it knows that the first will take seven clock cycles, it will automatically start the execution of the next instruction on the 8th clock tick. Of course this is a generic explanation for a CPU with just one execution unit. Modern processors have several execution units working in parallel, so they could execute the second instruction at the same time as the first. This is called superscalar architecture and we will talk more about this later.

So, what does clock have to do with performance? To think that clock and performance are the same thing is the most common misconception about processors.

If you compare two completely identical CPUs, the one running at a higher clock rate will be faster. In this case, with a higher clock rate, the time between each clock cycle will be shorter, so things are going to be performed in less time and the performance will be higher. But when you compare two different processors, this is not necessarily true.

If you get two processors with different architectures (for example, from two different manufacturers, like Intel and AMD), things inside the CPU are completely different.

As we mentioned, each instruction takes a certain number of clock cycles to be executed. Let's say that processor "A" takes seven clock cycles to perform a given instruction, and that processor "B" takes five clock cycles to perform this same instruction. If they are running at the same clock rate, processor "B" will be faster, because it can process this instruction in less time.

For modern CPUs there is much more in the performance game, as CPUs have different numbers of execution units, different cache sizes, different ways of transferring data inside the CPU, different ways of processing the instructions inside the execution units, different clock rates with the outside world, etc. Don't worry; we will cover all that in this tutorial.

As the processor clock signal became very high, one problem showed up. The motherboard where the processor is installed could not work using the same clock signal. If you look at a motherboard, you will see several tracks or paths. These tracks are wires that connect the several circuits of the computer. The problem is that with higher clock rates, these wires started to work as antennas, so the signal, instead of arriving at the other end of the wire, would simply vanish, being transmitted as radio waves.

Figure 3: The wires on the motherboard can work as antennas.

External Clock

So the CPU manufacturers started using a new concept, called clock multiplication, which started with the 486DX2 processor. Under this scheme, which is used in all CPUs nowadays, the CPU has an external clock, which is used when transferring data to and from the RAM memory (using the north bridge chip), and a higher internal clock.

To give a real example, on a 3.4 GHz Pentium 4 this "3.4 GHz" refers to the CPU internal clock, which is obtained by multiplying its 200 MHz external clock by 17. We illustrated this example in Figure 4.

Figure 4: Internal and external clocks on a Pentium 4 3.4 GHz.
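The clock arithmetic used in the examples above can be checked with a few lines of code. This is only a sketch with hypothetical helper names; it assumes clock rates given in MHz and ignores everything else that affects real performance (caches, parallel execution units and so on).

```python
def cycle_time_ns(clock_mhz):
    """Duration of one clock cycle, in nanoseconds, for a clock given in MHz."""
    return 1000 / clock_mhz


def instruction_time_ns(cycles, clock_mhz):
    """Time an instruction takes if it needs `cycles` clock cycles."""
    return cycles * cycle_time_ns(clock_mhz)


def internal_clock_mhz(external_mhz, multiplier):
    """Clock multiplication: internal clock = external clock x multiplier."""
    return external_mhz * multiplier


print(cycle_time_ns(100))            # 10.0 (a 100 MHz clock ticks every 10 ns)
print(instruction_time_ns(7, 100))   # 70.0 (processor "A": seven cycles)
print(instruction_time_ns(5, 100))   # 50.0 (processor "B": five cycles, same clock)
print(internal_clock_mhz(200, 17))   # 3400 (the 3.4 GHz Pentium 4 example)
```

At the same clock rate the five-cycle processor finishes the instruction sooner, which is why clock rate alone does not tell you which of two different CPUs is faster.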

The huge difference between the internal clock and the external clock on modern CPUs is one major roadblock to overcome in order to increase the computer performance. Continuing the Pentium 4 3.4 GHz example, it has to reduce its speed by 17x when it has to read data from RAM memory! During this process, it works as if it were a 200 MHz CPU!

Several techniques are used to minimize the impact of this clock difference. One of them is the use of a memory cache inside the CPU. Another one is transferring more than one data chunk per clock cycle. Processors from both AMD and Intel use this feature, but while AMD CPUs transfer two data per clock cycle, Intel CPUs transfer four data per clock cycle.

Figure 5: Transferring more than one data per clock cycle.

Because of that, AMD CPUs are listed as having double their real external clocks. For example, an AMD CPU with a 200 MHz external clock is listed as 400 MHz (200 MHz x 2). The same happens with Intel CPUs: an Intel CPU with a 200 MHz external clock is listed as having an 800 MHz external clock (200 MHz x 4).

The technique of transferring two data per clock cycle is called DDR (Double Data Rate), while the technique of transferring four data per clock cycle is called QDR (Quad Data Rate).

Block Diagram of a CPU

In Figure 6, you can see a basic block diagram for a modern CPU. There are many differences between AMD and Intel architectures. Read our tutorial Inside Pentium 4 Architecture for a detailed view on Pentium 4 architecture. We still plan to write a specific tutorial about Athlon 64 architecture in the near future. We think that understanding the basic block diagram of a modern CPU is the first step to understanding how CPUs from Intel and AMD work and the differences between them.

Figure 6: Basic block diagram of a CPU.

The dotted line in Figure 6 represents the CPU body, as the RAM memory is located outside the CPU. The datapath between the RAM memory and the CPU is usually 64-bit wide (or 128-bit when dual channel memory configuration is used), running at the memory clock or the CPU external clock, whichever is lower. The number of bits used and the clock rate can be combined in a unit called transfer rate, measured in MB/s. To calculate the transfer rate, the formula is number of bits x clock (in MHz) / 8. For a system using DDR400 memories in single channel configuration (64 bits) the memory transfer rate will be 3,200 MB/s (64 x 400 / 8), while the same system using dual channel memories (128 bits) will have a 6,400 MB/s memory transfer rate (128 x 400 / 8). For more information on this subject, read our tutorial Everything You Need to Know About DDR Dual Channel.

All the circuits inside the dotted box run at the CPU internal clock. Depending on the CPU some of its internal parts can even run at a higher clock rate. Also, the datapath between the CPU units can be wider, i.e., transfer more bits per clock cycle than 64 or 128. For example, the datapath between the L2 memory cache and the L1 instruction cache on modern processors is usually 256-bit wide. The higher the number of bits transferred per clock cycle, the faster the transfer will be done (in other words, the transfer rate will be higher). In Figure 6 we used a red arrow between the RAM memory and the L2 memory cache and green arrows between all other blocks to express the different clock rates and datapath widths used.

Memory Cache

Memory cache is a high performance kind of memory, also called static memory. The kind of memory used on the computer main RAM memory is called dynamic memory. Static memory consumes more power, is more expensive and is physically bigger than dynamic memory, but it is a lot faster. It can work at the same clock as the CPU, which dynamic memory is not capable of.

Since going to the "external world" to fetch data makes the CPU work at a lower clock rate, the memory cache technique is used. When the CPU loads data from a certain memory position, a circuit called memory cache controller (not drawn in Figure 6 in the name of simplicity) loads into the memory cache a whole block of data below the current position that the CPU has just loaded. Since programs usually flow in a sequential way, the next memory position the CPU will request will probably be the position immediately below the memory position that it has just loaded. Since the memory cache controller already loaded a lot of data below the first memory position read by the CPU, the next data will be inside the memory cache, so the CPU doesn't need to go outside to grab the data: it is already loaded in the memory cache embedded in the CPU, which it can access at its internal clock rate.

The cache controller is always observing the memory positions being loaded and loading data from several memory positions after the memory position that has just been read. To give you a real example, if the CPU loaded data stored in the address 1,000, the cache controller will load data from "n" addresses after the address 1,000. This number "n" is called page; if a given processor is working with 4 KB pages (which is a typical value), it will load data from 4,096 addresses below the current memory position being loaded (address 1,000 in our example). By the way, 1 KB equals 1,024 bytes, that's why 4 KB is 4,096 not 4,000. In Figure 7 we illustrate this example.

Figure 7: How the memory cache controller works.

The bigger the memory cache, the higher the chances that the data required by the CPU is already there, so the CPU will need to directly access RAM memory less often, thus increasing the system performance (just remember that every time the CPU needs to access the RAM memory directly it needs to lower its clock rate for this operation).
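The effect of loading a whole block after each position read can be seen in a small simulation. This is a deliberately simplified sketch, not how a real cache controller is built: it assumes a cache that holds only one block at a time, and the function name and access patterns are made up for the example. It counts how many accesses find their data already in the cache and how many have to go out to RAM.

```python
def simulate_cache(addresses, block_size=4096):
    """Count accesses served from the cache versus accesses that go to RAM.

    Toy model: on every access that is not covered by the cache, the
    "controller" loads the requested address plus the next `block_size`
    addresses, replacing whatever was cached before.
    """
    in_cache = 0
    from_ram = 0
    cached_first = cached_last = None   # inclusive range currently in the cache
    for address in addresses:
        if cached_first is not None and cached_first <= address <= cached_last:
            in_cache += 1
        else:
            from_ram += 1
            cached_first, cached_last = address, address + block_size
    return in_cache, from_ram


# A program flowing sequentially from address 1,000: almost every access is cached.
sequential = range(1000, 11000)
print(simulate_cache(sequential))   # (9997, 3)

# A program that jumps 10,000 addresses on every access: the cache never helps.
jumping = [i * 10_000 for i in range(100)]
print(simulate_cache(jumping))      # (0, 100)
```

The sequential pattern goes to RAM only three times in 10,000 accesses, while the jumping pattern goes to RAM every single time.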

We call it a "hit" when the CPU loads a required data from the cache, and we call it a "miss" if the required data isn't there and the CPU needs to access the system RAM memory.

L1 and L2 mean "Level 1" and "Level 2", respectively, and refer to the distance they are from the CPU core (execution unit). One common doubt is why there are three separate cache memories (L1 data cache, L1 instruction cache and L2 cache). Pay attention to Figure 6 and you will see that the L1 instruction cache works as an "input cache", while the L1 data cache works as an "output cache". The L1 instruction cache, which is usually smaller than the L2 cache, is particularly efficient when the program starts to repeat a small part of it (loop), because the required instructions will be closer to the fetch unit.

On the specs page of a CPU the L1 cache can be found with different kinds of representation. Some manufacturers list the two L1 caches separately (sometimes calling the instruction cache "I" and the data cache "D"), some add the amounts of the two and write "separated" (so "128 KB, separated" would mean 64 KB instruction cache and 64 KB data cache), and some simply add the two and you have to guess that the amount is the total and that you should divide by two to get the capacity of each cache. The exception, however, goes to the Pentium 4 and newer Celeron CPUs based on sockets 478 and 775.

Pentium 4 processors (and Celeron processors using sockets 478 and 775) don't have an L1 instruction cache; instead they have a trace execution cache, which is a cache located between the decode unit and the execution unit. So, the L1 instruction cache is there, but with a different name and a different location. We are mentioning this here because it is a very common mistake to think that Pentium 4 processors don't have an L1 instruction cache. So when comparing the Pentium 4 to other CPUs people would think that its L1 cache is much smaller, because they are only counting the 8 KB L1 data cache. The trace execution cache of Pentium 4 and Celeron CPUs is 150 KB and should be taken into account, of course.

Branching

As we mentioned several times, one of the main problems for the CPU is having too many cache misses, because the fetch unit must directly access the slow RAM memory, thus slowing down the system.

Usually the use of the memory cache avoids this a lot, but there is one typical situation where the cache controller will miss: branches. If in the middle of the program there is an instruction called JMP ("jump" or "go to") sending the program to a completely different memory position, this new position won't be loaded in the L2 memory cache, making the fetch unit go get that position directly in the RAM memory. In order to solve this issue, the cache controller of modern CPUs analyzes the memory block it loaded and whenever it finds a JMP instruction in there it will load the memory block for that position in the L2 memory cache before the CPU reaches that JMP instruction.

Figure 8: Unconditional branching situation.

This is pretty easy to implement. The problem is when the program has a conditional branching, i.e., the address the program should go to depends on a condition not yet known. For example, if a <= b go to address 1, or if a > b go to address 2. We illustrate this example in Figure 9. This would make a cache miss, because the values of a and b are unknown and the cache controller would be looking only for JMP-like instructions. The solution: the cache controller loads both conditions into the memory cache. Later, when the CPU processes the branching instruction, it will simply discard the one that wasn't chosen. It is better to load the memory cache with unnecessary data than to directly access the RAM memory.

Figure 9: Conditional branching situation.
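The two branching situations can be sketched as a cache controller scanning a block of instructions for branch targets. This is an illustrative toy only: the tuple format and the names JMP and BRANCH_IF are assumptions made for this example, not a real instruction encoding.

```python
def targets_to_preload(block):
    """Return the addresses a toy cache controller would load ahead of time.

    Hypothetical instruction format:
      ("JMP", target)                               unconditional branch
      ("BRANCH_IF", target_if_true, target_if_false) conditional branch
    Any other tuple is treated as an ordinary instruction.
    """
    targets = []
    for instruction in block:
        if instruction[0] == "JMP":
            targets.append(instruction[1])          # only one possible destination
        elif instruction[0] == "BRANCH_IF":
            targets.extend(instruction[1:3])        # outcome unknown: load both
    return targets


def resolve(preloaded, chosen):
    """Once the branch is processed, keep the chosen target and discard the rest."""
    kept = [t for t in preloaded if t == chosen]
    discarded = [t for t in preloaded if t != chosen]
    return kept, discarded


# "if a <= b go to address 1, or if a > b go to address 2"
block = [("ADD",), ("BRANCH_IF", 1, 2)]
preloaded = targets_to_preload(block)
print(preloaded)                       # [1, 2]

a, b = 3, 5
print(resolve(preloaded, 1 if a <= b else 2))   # ([1], [2])
```

Loading address 2 turned out to be unnecessary here, but it cost far less than a trip to RAM would have if the branch had gone the other way.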

Processing Instructions

The fetch unit is in charge of loading instructions from memory. First, it will look if the instruction required by the CPU is in the L1 instruction cache. If it is not, it goes to the L2 memory cache. If the instruction is also not there, then it has to directly load from the slow system RAM memory.

When you turn on your PC all the caches are empty, of course, but as the system starts loading the operating system, the CPU starts processing the first instructions loaded from the hard drive, the cache controller starts loading the caches, and the show begins.

After the fetch unit has grabbed the instruction required by the CPU to be processed, it sends it to the decode unit.

The decode unit will then figure out what that particular instruction does. It does that by consulting a ROM memory that exists inside the CPU, called microcode. Each instruction that a given CPU understands has its own microcode. The microcode will "teach" the CPU what to do. It is like a step-by-step guide to every instruction. If the instruction loaded is, for example, add a+b, its microcode will tell the decode unit that it needs two parameters, a and b. The decode unit will then request the fetch unit to grab the data present in the next two memory positions, which hold the values for a and b. After the decode unit has "translated" the instruction and grabbed all required data to execute the instruction, it will pass all data and the "step-by-step cookbook" on how to execute that instruction to the execute unit.

The execute unit will finally execute the instruction. On modern CPUs you will find more than one execution unit working in parallel. This is done in order to increase the processor performance. For example, a CPU with six execution units can execute six instructions in parallel, so in theory it could achieve the same performance as six processors with just one execution unit each. This kind of architecture is called superscalar architecture.

Usually modern CPUs don't have several identical execution units; they have execution units specialized in one kind of instruction. The best example is the FPU, Floating Point Unit, which is in charge of executing complex math instructions. Usually between the decode unit and the execution unit there is a unit (called dispatch or schedule unit) in charge of sending the instruction to the correct execution unit, i.e., if the instruction is a math instruction it will send it to the FPU and not to one "generic" execution unit. By the way, "generic" execution units are called ALU, Arithmetic and Logic Unit.

Finally, when the processing is over, the result is sent to the L1 data cache. Continuing our add a+b example, the result would be sent to the L1 data cache. This result can then be sent back to RAM memory or to another place, such as the video card, for example. But this will depend on the instruction that is going to be processed next (the next instruction could be "print the result on the screen").

Another interesting feature that all microprocessors have had for a long time is called "pipeline", which is the capability of having several different instructions at different stages of the CPU at the same time.

After the fetch unit has sent the instruction to the decode unit, it will be idle, right? So, how about instead of doing nothing, put the fetch unit to grab the next instruction? When the first instruction goes to the execution unit, the fetch unit can send the second instruction to the decode unit and grab the third instruction, and so on.

In a modern CPU with an 11-stage pipeline (stage is another name for each unit of the CPU), it will probably have 11 instructions inside it at the same time almost all the time. In fact, since all modern CPUs have a superscalar architecture, the number of instructions simultaneously inside the CPU will be even higher.

Also, for an 11-stage pipeline CPU, an instruction to be fully executed will have to pass through 11 units. The higher the number of stages, the longer an instruction will take to be fully executed. On the other hand, keep in mind that because of this concept several instructions can be running inside the CPU at the same time. The very first instruction loaded by the CPU can take 11 steps to get out of it, but once it goes out, the second instruction will get out right after it (and not another 11 steps later). In the ideal case, with nothing stalling the pipeline, 100 instructions would leave an 11-stage pipeline after 11 + 99 = 110 steps, instead of the 1,100 steps needed if each instruction had to wait for the previous one to finish.

There are several other tricks used by modern CPUs to increase performance. We will explain two of them, out-of-order execution (OOO) and speculative execution.

Out-Of-Order Execution (OOO)

Remember that we said that modern CPUs have several execution units working in parallel? We also said that there are different kinds of execution units, like ALU, which is a generic execution unit, and FPU, which is a math execution unit. Just as a generic example in order to understand the problem, let's say that a given CPU has six execution engines, four "generic" and two FPUs. Let's also say that the program has the following instruction flow in a given moment:

1. generic instruction
2. generic instruction
3. generic instruction
4. generic instruction
5. generic instruction
6. generic instruction
7. math instruction
8. generic instruction
9. generic instruction
10. math instruction

What will happen? The schedule/dispatch unit will send the first four instructions to the four ALUs but then, at the fifth instruction, the CPU will need to wait for one of its ALUs to be free in order to continue processing, since all its four generic execution units are busy. That's not good, because we still have two math units (FPUs) available, and they are idle. So, a CPU with out-of-order execution (all modern CPUs have this feature) will look at the next instruction to see if it can be sent to one of the idle units. In our example, it can't, because the sixth instruction also needs one ALU to be processed. The out-of-order engine continues its search and finds out that the seventh instruction is a math instruction that can be executed in one of the available FPUs. Since the other FPU will still be available, it will go down the program looking for another math instruction. In our example, it will pass the eighth and the ninth instructions and will load the tenth instruction.

So, in our example, the execution units will be processing, at the same time, the first, the second, the third, the fourth, the seventh and the tenth instructions.
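The out-of-order example above can be reproduced with a short simulation of a single dispatch round. This is only a sketch under simplifying assumptions: it ignores dependencies between instructions (a real out-of-order engine must respect them), and the function name and the `window` parameter, which stands in for the search depth limit, are invented for this example.

```python
def dispatch_round(instructions, alus=4, fpus=2, out_of_order=True, window=512):
    """Return the 1-based positions of the instructions sent to execution units.

    `instructions` is a list of "generic" or "math" strings in program order.
    Generic instructions need a free ALU, math instructions need a free FPU.
    """
    free_units = {"generic": alus, "math": fpus}
    dispatched = []
    for position, kind in enumerate(instructions[:window], start=1):
        if free_units[kind] > 0:
            free_units[kind] -= 1
            dispatched.append(position)
        elif not out_of_order:
            break   # an in-order CPU waits here until a suitable unit is free
        if not any(free_units.values()):
            break   # every execution unit is busy, nothing more to dispatch
    return dispatched


program = ["generic"] * 6 + ["math", "generic", "generic", "math"]

print(dispatch_round(program, out_of_order=False))  # [1, 2, 3, 4]
print(dispatch_round(program, out_of_order=True))   # [1, 2, 3, 4, 7, 10]
```

Without out-of-order execution the two FPUs sit idle; with it, instructions 7 and 10 are pulled forward to keep them busy.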

The name out-of-order comes from the fact that the CPU doesn't need to wait; it can pull an instruction from the bottom of the program and process it before the instructions above it are processed. Of course the out-of-order engine cannot go on forever looking for an instruction if it cannot find one. The out-of-order engine of all CPUs has a depth limit on how far it can crawl looking for instructions (a typical value would be 512).

Speculative Execution

Let's suppose that one of these generic instructions is a conditional branching. What will the out-of-order engine do? If the CPU implements a feature called speculative execution (all modern CPUs do), it will execute both branches. Consider the example below:

1. generic instruction
2. generic instruction
3. if a <= b go to instruction 15
4. generic instruction
5. generic instruction
6. generic instruction
7. math instruction
8. generic instruction
9. generic instruction
10. math instruction
…
15. math instruction
16. generic instruction
…

When the out-of-order engine analyzes this program, it will pull instruction 15 into one of the FPUs, since it will need one math instruction to fill one of the FPUs that otherwise would be idle. So at a given moment we could have both branches being processed at the same time. If, when the CPU finishes processing the third instruction, a is greater than b, the CPU will simply discard the processing of instruction 15. You may think this is a waste of time, but in fact it is not. It doesn't cost anything to the CPU to execute that particular instruction, because the FPU would be otherwise idle anyway. On the other hand, if a <= b the CPU will have a performance boost, since when instruction 3 asks for instruction 15 it will be already processed, going straight to instruction 16 or even further, if instruction 16 has also already been processed by the out-of-order engine.

Of course everything we explained in this tutorial is an oversimplification in order to make this very technical subject easier to understand.
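The speculative execution example from the previous section can be sketched as follows. This is a toy model with invented names: it assumes that instruction 15 (the branch target) and instruction 7 (a math instruction on the fall-through path) were both started early, and that the work on the path not chosen is thrown away once the condition of instruction 3 is known.

```python
def resolve_speculation(a, b):
    """Decide which early work is kept after "if a <= b go to instruction 15".

    Assumption for this sketch: one instruction from each path was
    started ahead of time in an otherwise idle FPU.
    """
    early_work = {
        "branch taken": ["instruction 15"],      # target of the branch
        "branch not taken": ["instruction 7"],   # next math instruction in sequence
    }
    outcome = "branch taken" if a <= b else "branch not taken"
    kept = early_work[outcome]
    discarded = [
        instruction
        for path, instructions in early_work.items()
        if path != outcome
        for instruction in instructions
    ]
    return outcome, kept, discarded


print(resolve_speculation(1, 2))  # ('branch taken', ['instruction 15'], ['instruction 7'])
print(resolve_speculation(5, 2))  # ('branch not taken', ['instruction 7'], ['instruction 15'])
```

Whichever way the condition turns out, part of the work for the chosen path is already done, and the discarded part only used a unit that had nothing else to do.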
