
Optimization in GCC | Linux Journal http://www.linuxjournal.com/article/7269?page=0,2


Optimization in GCC
Jan 26, 2005 By M. Tim Jones
From Issue #131, March 2005

Here's what the O options mean in GCC, why some optimizations aren't optimal after
all and how you can make specialized optimization choices for your application.

Alignment Optimizations

In the second optimization level, we saw that a number of alignment optimizations
were introduced that had the effect of increasing performance but also increasing the
size of the resulting image. Three additional alignment optimizations specific to this
architecture are available. The -malign-int option allows types to be aligned on
32-bit boundaries. If you're running on a 16-bit aligned target, -mno-align-int can be
used. (GCC's manual lists -malign-int among the M680x0 options rather than the x86
ones.) The -malign-double option controls whether doubles, long doubles and
long-longs are aligned on two-word boundaries (disabled with -mno-align-double).
Aligning doubles provides better performance on Pentium architectures at the expense
of additional memory.
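The effect of an alignment option can be observed from inside a program. The following is a minimal sketch, not something taken from the article: the struct and function names are hypothetical, and the values returned depend on the target and ABI. It reports where a double lands inside a structure and what alignment the compiler requires for it.

```c
#include <assert.h>
#include <stdalign.h>  /* alignof (C11) */
#include <stddef.h>    /* offsetof, size_t */

/* Hypothetical record: a char followed by a double, so the compiler
 * must insert padding for 'value' to meet its alignment requirement. */
struct sample {
    char   tag;
    double value;
};

/* Alignment that the current target and flags require for a double. */
size_t double_alignment(void)
{
    return alignof(double);
}

/* Byte offset of the double member inside struct sample. */
size_t sample_value_offset(void)
{
    size_t offset = offsetof(struct sample, value);

    /* A member is always placed at a multiple of its required alignment. */
    assert(offset % alignof(double) == 0);
    return offset;
}
```

Printing both values from builds made with and without -malign-double on a 32-bit x86 target is one way to see whether the option changed the layout. Because the option changes structure layout, code built with it may not interoperate with libraries built without it.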

Stacks also can be aligned by using the option -mpreferred-stack-boundary. The
developer specifies the alignment as a power of two: the option's value is the
exponent. For example, if the developer specified -mpreferred-stack-boundary=4, the
stack would be aligned on a 2^4 = 16-byte boundary (the default). On the Pentium and
Pentium Pro targets, stack doubles should be aligned on 8-byte boundaries, but the
Pentium III performs better with 16-byte alignment.
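Because the option takes an exponent rather than a byte count, it is easy to misread. The arithmetic can be sketched as follows; the function name is hypothetical and the code only mirrors the "2 raised to num" rule, it does not query the compiler.

```c
#include <assert.h>

/* Bytes of stack alignment requested by -mpreferred-stack-boundary=num,
 * following the rule that the boundary is 2 raised to the power num. */
unsigned long stack_boundary_bytes(unsigned num)
{
    assert(num < 32);   /* keep the shift well defined in this sketch */
    return 1UL << num;
}
```

With this rule, a value of 4 corresponds to 16 bytes, 3 to 8 bytes and 2 to 4 bytes.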

Speed Optimizations

For applications that utilize standard functions, such as memset, memcpy or strlen,
the -minline-all-stringops option can increase performance by inlining string
operations. This has the side effect of increasing the size of the image.

Loop unrolling reduces the number of loop iterations by doing more work per
iteration. This process increases the size of the image, but it also can increase its
performance. It can be enabled using the -funroll-loops option, which unrolls only
loops whose iteration count can be determined at compile time or on entry to the
loop. For loops whose iteration count cannot be determined in that way, all loops can
be unrolled using the -funroll-all-loops optimization.
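To make the trade-off concrete, here is a hand-written sketch of what unrolling by a factor of four looks like. It is an illustration only: the function names are hypothetical, and the code GCC actually generates for -funroll-loops will differ.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: one addition and one loop test per element. */
long sum_simple(const int *values, size_t count)
{
    assert(values != NULL || count == 0);
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* Same result, unrolled by four: one loop test per four additions.
 * The body is larger, which is where the growth in image size comes from. */
long sum_unrolled(const int *values, size_t count)
{
    assert(values != NULL || count == 0);
    long total = 0;
    size_t i = 0;

    for (; i + 4 <= count; i += 4) {
        total += values[i];
        total += values[i + 1];
        total += values[i + 2];
        total += values[i + 3];
    }
    /* Remainder loop for the last zero to three elements. */
    for (; i < count; i++)
        total += values[i];
    return total;
}
```

Both functions return the same sum for any input; only the number of loop tests and the code size differ.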

A useful option that has the disadvantage of making an image difficult to debug is
-momit-leaf-frame-pointer. This option keeps the frame pointer out of a register in
leaf functions (functions that call no other functions), which means less setup and
restore of this value. In addition, it makes the register available for the code to
use. The optimization -fomit-frame-pointer, which does the same for every function
that doesn't need a frame pointer, also can be useful.

When operating at level -O3 or having -finline-functions specified, the size limit of
the functions that may be inlined can be specified through a special parameter
interface. The following command illustrates capping the size of the functions to
inline at 40 instructions:

gcc -o sort sort.c --param max-inline-insns=40

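This can be useful to control the size by which an image is increased using
-finline-functions. The parameter has an effect only when inlining is actually taking
place, so in practice the command would also include an optimization level such as
-O3.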

Code Size Optimizations


The default stack alignment setting is 4, which means 2^4 = 16 bytes. For
space-constrained systems, the default can be reduced to 4 bytes by using the option
-mpreferred-stack-boundary=2 (a value of 3 gives 8 bytes). When constants are
defined, such as strings or floating-point values, these independent values commonly
occupy unique locations in memory. Rather than allow each to be unique, identical
constants can be merged together to reduce the space that's required to hold them.
This particular optimization can be enabled with -fmerge-constants.
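Whether two identical constants share storage can be checked at run time. The sketch below is an assumption-laden illustration (the variable and function names are hypothetical): it compares the addresses of two identical string literals. The C language leaves it unspecified whether such literals are distinct, so either answer is valid and it may change with the compiler options used.

```c
#include <assert.h>
#include <string.h>

/* Two separately written but identical string constants. */
static const char *const first_label  = "optimization";
static const char *const second_label = "optimization";

/* Returns 1 when both literals ended up at the same address
 * (the constants were merged), 0 when each has its own copy. */
int constants_were_merged(void)
{
    /* The contents are identical either way; only the storage may differ. */
    assert(strcmp(first_label, second_label) == 0);
    return first_label == second_label;
}
```

Comparing the result, or the size of the read-only data section, between builds made with -fmerge-constants and -fno-merge-constants gives a rough idea of what the option does for a given program.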

Graphics Hardware Optimizations

Depending on the specified target architecture, certain other extensions are enabled.
These also can be enabled or disabled explicitly. Options such as -mmmx and
-m3dnow are enabled automatically for architectures that support them.

Other Possibilities

We've discussed many optimizations and compiler options that can increase
performance or decrease size. Let's now look at some fringe optimizations that may
provide a benefit to your application.

The -ffast-math optimization provides transformations likely to result in correct
code, but the result may not adhere strictly to the IEEE floating-point standard. Use
it, but test carefully.
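One reason careful testing is needed is that floating-point addition is not associative, so a compiler that is allowed to regroup operations can change a result. The following sketch uses hypothetical function names and assumes a target that evaluates double arithmetic in IEEE 754 double precision; it shows two groupings of the same sum that a strict compiler must keep apart.

```c
#include <assert.h>

/* (a + b) + c, evaluated in the order written. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* a + (b + c), the same sum with different grouping. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* Returns 1 when the two groupings give different results. */
int grouping_changes_result(double a, double b, double c)
{
    /* NaN inputs would make the comparison meaningless. */
    assert(a == a && b == b && c == c);
    return sum_left(a, b, c) != sum_right(a, b, c);
}
```

For the inputs 1e16, -1e16 and 1.0, the left grouping keeps the 1.0 while the right grouping loses it to rounding, because 1.0 is below the spacing of doubles near 1e16. Options that permit reassociation may make the two functions behave alike, which is the kind of difference worth testing for.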

When global common sub-expression elimination is enabled (-fgcse, level -O2 and
above), two other options may be used to minimize load and store motions.
Optimizations -fgcse-lm and -fgcse-sm can migrate loads and stores outside of loops
to reduce the number of instructions executed within the loop, therefore increasing
the performance of the loop. Both -fgcse-lm (load-motion) and -fgcse-sm (store-
motion) should be specified together.
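The idea behind load and store motion can be sketched by hand. The example below is illustrative only: the function names are hypothetical, and it assumes that total does not alias values or scale, which is what makes the two versions equivalent.

```c
#include <assert.h>
#include <stddef.h>

/* As written: every iteration loads *scale and loads and stores *total. */
void accumulate_in_memory(const int *values, size_t count,
                          const int *scale, long *total)
{
    assert(scale != NULL && total != NULL);
    for (size_t i = 0; i < count; i++)
        *total += (long)values[i] * *scale;
}

/* Hand-moved version: the loads happen once before the loop and the
 * store happens once after it, so the loop body touches only locals. */
void accumulate_hoisted(const int *values, size_t count,
                        const int *scale, long *total)
{
    assert(scale != NULL && total != NULL);
    const long factor = *scale;   /* load moved in front of the loop */
    long running = *total;        /* load moved in front of the loop */

    for (size_t i = 0; i < count; i++)
        running += (long)values[i] * factor;

    *total = running;             /* store moved behind the loop */
}
```

When the compiler can prove that the pointers do not overlap, it is free to perform this kind of transformation itself; writing it by hand only shows what is being saved per iteration.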

The -fforce-addr optimization forces the compiler to move addresses into registers
before performing any arithmetic on them. This is similar to the -fforce-mem option,
which is enabled automatically in optimization levels -O2, -Os and -O3.

A final fringe optimization is -fsched-spec-load, which works with the
-fschedule-insns optimization, enabled at -O2 and above. This optimization permits
the speculative motion of some load instructions to minimize execution stalls due to
data dependencies.

______________________

Comments


hi, follow the "Listing 3.

Submitted by Anonymous (not verified) on Mon, 10/11/2010 - 20:58.

hi, I followed "Listing 3. Simple Example of gprof", but when using -O or -O2, the profile is just "Flat
profile". So how can I resolve it?

my step is:
1: gcc -o test_optimization test_optimization.c -pg -march=i386
2: ./test_optimization
3: gprof --no-graph -b ./test_optimization gmon.out
4: the result is:
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated


  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00     0.00        1     0.00     0.00  factorial


if add -O2 the result is:
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name

single optimization flag without level


Submitted by Anonymous on Sun, 03/21/2010 - 17:36.

Any optimization can be enabled outside of any level simply by specifying its name with the -f prefix, as:
gcc -fdefer-pop -o test test.c

In current versions of GCC this is incorrect (http://gcc.gnu.org/wiki/FAQ#optimization-options). A single
optimization flag without an optimization level doesn't work. I don't know about old versions.

gcc 4.2.3 vs visual c 2005

Submitted by nanjil (not verified) on Thu, 05/08/2008 - 16:52.

hello:
I just compiled some code under gcc (cygwin) and visual c 2005 on a laptop with a dual core intel processor.

The debuggable gcc code was about 2x faster than the visual c++ debuggable code.

However, the situation reversed when I used O3 optimization in gcc and "release" optimization in visual c.

Now the visual c code is 2x faster than gcc.

I did not expect that large a difference; it is HUGE!!

Am I missing something, or has anybody else noticed a similar thing?

visual c++ optimizations

Submitted by Anonymous (http://corporatedrones.wordpress.com) (not verified) on Thu, 02/12/2009 - 06:47.

Apparently, MSVC uses a few insecure optimizations, counting on the developer having written
secure code. Probably that's why its debug build is slower.

I've seen lots of situations where gcc code gives an error right away, promptly showing
me the bug, while MSVC happily executes the code until it finally stumbles upon a non-static field of a class
and finally gives an error. For me, this is simply misleading and that's why I prefer gcc.

Detailed article, that is great!

Submitted by Anonymous (not verified) on Sun, 09/09/2007 - 16:57.

You wrote in great detail!

It is really useful for me right now since I am doing my thesis work on optimization under Linux. Thank
you so much!

Someone should write some


Submitted by Anonymous (not verified) on Thu, 11/02/2006 - 11:57.

Someone should write some "C" code and a few scripts that will enable / disable every compiler option
and then print out which options worked best for _your_ particular system.

A benchmark that would specifically test each option (as opposed to using a single benchmark, and huge)
could be written.

EG: no point in benchmarking if we should use:


gcc -O2 -O3 code.c -- One disables the other

gcc -fno-gcse SSE2_code.c

Benchmarks need to have a 'large' effect on the option that is being switched.

This could be run overnight (or on multiple machines, each doing part of the testing) and results provided on a web page
somewhere.

Experts could put in their two cents and a wiki of snippets could be fed into a code compilator (not compiler, just a
bunch of scripts) that would compilate all the snippets and produce a final program to be compiled on many different
machines.

This way we could figure out that if we had such-and-such a system then "how-often" (what % of the time) would we
simply be better off to use a particular option and when is it more likely based on the TYPE of program we are running
(wordprocessor vs. MultiMedia app).

EG: If you have a Pentium it is ALWAYS (or should be if gcc is correct) best to use the -march=pentium option - BUT - it
is NOT always best to use "-fcrossjumping" (though it _could_ be for certain applications).

The output of all this could simply be a half dozen command line choices for each processor - including a "general purpose
'best'" setting and a "quick compile with great optimization" setting (for intermediate builds).


This is something that a few dozen people need to work on to get the ball rolling and then the rest of us need to pitch in and
compile the resulting test scripts to check for errors. With everyone's help we should have the so-called answer(S) to
"which compilation options should I use for machine-X when compiling application-category Y".

Just a thought ...

Looks like you have a good


Submitted by Anonymous (not verified) on Tue, 11/21/2006 - 12:32.

Looks like you have a good project to setup now.

Got Table?


Submitted by Anonymous (http://www.screwylizardracing.com) (not verified) on Sun, 02/05/2006 - 18:08.

Where can I get a readable copy of Table 1? The copy here is too small to read, and can't be enlarged.

try clicking on it


Submitted by Anonymous (not verified) on Thu, 03/30/2006 - 02:58.

try clicking on it

-O versus -O0


Submitted by Anonymous (not verified) on Wed, 02/02/2005 - 12:19.

Minor comment -- -O defaults to -O1, not to -O0.
