Profile Guided Optimizations

And other optimization details
Shachar Shemesh
Lingnu Open Source Consulting Ltd.

Credits and License
This lecture is free to use under the Creative Commons Attribution Share Alike license (cc-bysa)
Please give credit to Shachar Shemesh and a link to

All syntax highlighting courtesy of enscript.

An Apology to People at Home
This lecture makes extensive use of "objdump" to view the assembly code the compiler produces. There is no sane way to capture that short of taking videos. If you are reading these slides outside the lecture – my apologies.

Optimization – minimizing or maximizing a certain program attribute (Wikipedia)
Run time, memory usage, power consumption, etc.

A crucial part in allowing production of readable code.

Platform Independent Optimizations

Optimizations that are independent of the platform the program is compiled for


One Program – Unoptimized
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int i;
    double f;

    for( i=1; i<=10; ++i ) {
        printf("%d\n", i);
    }

    for( f=0; f<=1; f+=0.1 ) {
        printf("%.1f\n", f );
    }
}


"Optimize for Memory Use"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    {
        int i;

        for( i=1; i<=10; ++i ) {
            printf("%d\n", i);
        }
    }
    {
        double f;

        for( f=0; f<=1; f+=0.1 ) {
            printf("%.1f\n", f );
        }
    }
}


"Optimize for Speed"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    printf("1\n");
    printf("2\n");
    printf("3\n");
    printf("4\n");
    printf("5\n");
    printf("6\n");
    printf("7\n");
    printf("8\n");
    printf("9\n");
    printf("10\n");
    printf("0.0\n");
    printf("0.1\n");
    printf("0.2\n");
    printf("0.3\n");
    printf("0.4\n");
    printf("0.5\n");
    printf("0.6\n");
    printf("0.7\n");
    printf("0.8\n");
    printf("0.9\n");
    printf("1.0\n");
}


"Optimize for Speed" Even More
#include <stdio.h>

int main( int argc, char *argv[] )
{
    printf("1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n"
           "0.0\n0.1\n0.2\n0.3\n0.4\n0.5\n0.6\n0.7\n"
           "0.8\n0.9\n1.0\n");

    return 0;
}

Purpose of Optimizer
The first two optimizations are automatically done by the compiler, given the right compilation flags.
The third is out of scope of almost any optimizer I know.
Sort of.

A good optimizer allows producing reasonably efficient code without changing coding style or paradigm.

The Optimizer is not a Panacea
Will not fix inefficient algorithms
Will not fix bugs
In fact, it may cause bugs!
May aggravate bugs that were otherwise minor

Optimization's Effect on Debugging
Debugging an optimized binary with a debugger can be very difficult:
Program flow is non-linear. Inline functions "get in the way" all the time.
Worse for C++

Variables – local, static and global – may not be somewhere the debugger can even find them.
They can move around as the code progresses.

And without a debugger:
Adding "debug printf" actually changes the optimizer's output.

Optimizer's Limitations
Essentially a compile-time/run-time efficiency trade-off. Lacking "total awareness", some supposedly obvious optimizations are outside the compiler's reach.


What Does This Program Do?
#include <stdio.h>
#include "custom_type.h"

int main(int argc, char *argv[] )
{
    int a;
    custom_type b;

    a=5;
    a+=3;
    a+=2;

    b=5;
    b+=3;
    b+=2;

    printf("%d %d\n", a, (int)b );

    return 0;
}

If you don't know, how should the optimizer?


What it Really Does
#ifndef CUSTOM_TYPE_H
#define CUSTOM_TYPE_H

class custom_type {
    int var;
public:
    custom_type() {}
    custom_type( int val ) : var(val) {}

    operator int() const { return var; }

    custom_type &operator+=( const custom_type &rhs )
    {
        var+=rhs.var+1;

        return *this;
    }
    ...
};

#endif // CUSTOM_TYPE_H

I cheated

The Importance of Inline Functions
You couldn't have known about the cheat. Neither could the optimizer.
Yet it did!

The "operator+=" method was inlined into "main". Once that happens, the optimizer has context for the operation.
It can aggregate the entire set of operations, and replace it with the final result.

Inline can also happen in C.


The Great Divide
#include <stdio.h>

int process( int num )
{
    return 4500/num;
}

int main( int argc, char *argv[] )
{
    printf("%d\n", process(2) );

    return 0;
}

Dividing by Powers of Two
Most CPUs have an assembly instruction for dividing two integers (as well as one for two floating-point values). Dividing by a power of two can be done more efficiently with a shift operation.
The compiler obviously needs to know the divisor is a power of two.

Why did it keep "process" around if it inlined it?


A Lesser Divide
#include <stdio.h>

static int process( int num )
{
    return 4500/num;
}

int main( int argc, char *argv[] )
{
    printf("%d\n", process(2) );

    return 0;
}

Static and Inline
A static function can only be used within the same source file.
If the compiler sees that all uses have been inlined, it will not bother emitting the original function.

If the program only has one file, you can pass -fwhole-program to make it assume all functions are static. If the function is not defined in the same file, it cannot be inlined at all.

Platform Dependent Optimizations

Optimizations that take the CPU's internal structure into account

Revolution With a RISC
RISC – Reduced Instruction Set Computer. Core idea – benchmark programs spend 90% of their time executing the same 3 assembly commands, and 95% executing the same 5.
Leave only those 5. Make them very quick.

Smoking Commands in a Pipe
Use a pipeline to execute the commands:
Split the entire command processing into distinct parts. Execute each part in a separate clock cycle
You can now reduce the time each clock cycle takes – higher clock rate.

Start executing the next command as soon as the previous one is done with the first pipeline stage.
Work on as many commands as there are pipe segments at once. Average throughput is 1 command per clock cycle!

DLX – a Didactic RISC Processor

IF – Instruction Fetch
ID – Instruction Decode
EX – ALU operations
MEM – Memory access (loads and stores)
WB – Write back to registers

A Few General Notes
An instruction stopped before it reaches WB has no effect. The design dictates the assembly. Are the following commands possible?
store r2, (r3+r4)
Yes: ALU step before memory access step

load (r3+r4), r2
No, for precisely the same reason.

Bubbles (Soft)
It can happen that a later command's operands come from an earlier command's pipe step that has not been performed yet.
add r2, r3
add r4, r3
In the above case, we can "short path" the data and have it ready in time. Most CPUs actually do that.

Bubbles (Hard)
How about this sequence?
load (r3), r4
add r2, r4
The memory read of the first line happens in the same cycle as the ALU operation of the second command. The data is, physically, not present inside the CPU when we need it.

Delay the second command for one cycle until the data is ready. This is called a pipeline bubble.

Optimizer as Bubble Popper
We expect the optimizer to minimize the bubbles in the pipe.
Put an unrelated instruction between the two and prevent a wasted cycle.

This requires that the optimizer know the precise details of the CPU's pipeline. RISC, in general, assumes a compiler. Efficient manual assembly programming of RISC is between very tough and impossible.

The Branch Problem
Consider the following sequence:
compare r2, r3
beq location

The branch requires an ALU operation (though DLX pretends that it doesn't). We only know where to branch to at the end of the third cycle. We need to fetch the next instruction at the beginning of the second cycle.
Two cycles of bubbles for each branch!

Branching (cont.)
How serious is the problem?
Statistics claim that a branch happens every 4 assembly instructions, on average. Turning every 4 instructions into 6 is a 50% slowdown!

Branch Solutions: Unconditional Execution
A solution employed by many RISC platforms: execute the instruction right after the branch (the "delay slot") – always.
Fills an unconditional bubble with meaning.

Almost always:
Do not perform this fill if we are going to have a bubble anyway.

Branch Solutions: Branch Prediction
A priori, a branch pointing backwards has a 90% chance of being taken (probably a loop)
Branches pointing forward have only a 50% chance.

The CPU can keep a list of branches, and where they, likely, will go. This list is called "branch prediction". Some platforms have means of "helping" with this guess.
If you know what will likely happen, you can code it into the assembly.

Branch Prediction and the Optimizer
How can the optimizer know what is likely to happen?
Option 1 – guess.
Not really a wild guess. Uses static program flow analysis.

Option 2 – benchmark.
Run the program. For each branch, keep track of how many times it was taken, and how many times it was not. Compile the program again, using this information as an optimization helper.

Profile Guided Optimization
Optimization that eliminates guesses by using real-life data. If done properly, it can significantly speed up a program.
If not done properly, it is useless at best.

Cache Locality
The CPU keeps recently processed code and data in an internal cache, which works best when data sits close to other data in use. PGO allows the optimizer to identify frequently used and rarely used areas of code, and keep the frequently used areas together,
maximizing cache efficiency.

May even split single functions into different ELF segments.

Using PGO with GCC
To turn on all PGO collection compile with -fprofile-generate.
Make sure to pass it during compilation AND linkage.

Run the program through a typical use scenario.
Do NOT run it through all program features. This will actually hurt optimization.

Compile again, this time with -fprofile-use.
Again – during linkage as well.


PGO Difficulties
Build environment – of the three projects I tried, only one had a build environment where PGO could be just plugged in.
Profile location – hard-coded to the source location.
Prevents use of ccache and other compiler wrappers. Fixed in gcc 4.4.0 – the path may be overridden.

Test cases
Not always easy to find. Sometimes interactive.

PGO and Cross Compilation
The profile files are created in the same directory as the source files. When cross compiling, you need to make sure these directories exist on the target. You also need to transfer the result files back to the build machine for the rebuild.
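One way to avoid recreating the build tree on the target is the gcov runtime's GCOV_PREFIX and GCOV_PREFIX_STRIP environment variables; a sketch (the host name, program name, and paths are hypothetical):

```shell
# Target side: redirect .gcda output under /tmp/profiles instead of
# the compiled-in source directory. GCOV_PREFIX_STRIP drops leading
# components of the compiled-in path before prepending the prefix.
GCOV_PREFIX=/tmp/profiles GCOV_PREFIX_STRIP=3 ./prog

# Copy the profiles back to the build machine before the
# -fprofile-use rebuild:
scp -r target:/tmp/profiles/ ./
```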

PGO and the Kernel
The kernel is not compiled with PGO. It seems to be possible, but would require nontrivial work.
Mostly in making sure the profile files are created correctly.

There is a report from 2004 of someone running PGO with the Intel compiler and gaining 40% performance.
The idea was rejected because it is impossible to reproduce a PGO kernel binary twice – debugging becomes hard.

PGO Domain in GCC
Branch prediction statistics
Variable value statistics
Function placement for increased cache locality

The Intel Assembly Family RISC in CISC Clothing
The Intel assembly today is still capable of running programs written for the 8080 CPU:
First released April 1974
8-bit CPU
Accumulator-based machine language
CISC
Runs WordStar on CP/M

Intel Assembly
Still contains many CISC constructs. The CPU has several internal pipelines (2 for the original Pentium). When commands are transferred into the cache, they are translated into RISC-like micro-operations.

RISC and CPU Compiler Familiarity
RISC assumes intimate familiarity between the compiler and the CPU.
Familiarity to a level that a minor CPU revision may invalidate.

Sometimes this is feasible (embedded). In modern PCs, not so much. The CPU has its own optimizer in hardware that re-does some of the things the compiler's optimizer does.
That's why "memory barriers" exist.

Optimizer Induced Limitations
Some optimization options assume attributes of your program:
-fstrict-aliasing – the compiler assumes that pointers of different types never point to the same memory.
-fstrict-overflow – the compiler assumes that no signed integer overflow can happen.

If your program does not live up to those assumptions, compiling with -O2 or -Os may break your code.
The compiler will try to issue a warning, but no promises...

Subjects Not Covered
Tail recursion optimization
Copy constructor optimization

GCC online manual:
Make sure you explicitly pick the version you are using!

Thank You

Visit us at