
Optimizing Itanium-Based Applications

Version 1.11

May 16, 2011

Optimizing Itanium-Based Applications

Table of Contents
introduction
six levels of optimization
    level zero
    level one
    level two
    level two -ipo
    level three
    level four (level three ipo)
interprocedural optimizations with -ipo
loop optimizations at +O3 or +O4
advanced optimization options and pragmas
    enabling aggressive optimizations
    removing compilation time limits when optimizing
    limiting the size of optimized code
    controlling the scheduling model
    controlling floating point optimization
    controlling data allocation
    controlling symbol binding
    controlling other optimization features
profile-based optimization
    instrumenting the code
    collecting execution profile data
    performing profile-based optimization
    maintaining profile data files
    merging profile data files
    locking of profile database files
    Itanium- versus PA-RISC profile-based optimization differences
compiler-generated performance advice
putting it together with optimization option recipes
References


introduction
The HP Itanium-based optimizer transforms code so that it runs more efficiently on Itanium-based
HP-UX systems. The optimizer can dramatically improve application performance. However, compile
time and memory use increase with each higher level of optimization, due to the increasingly complex
analysis that is performed.
This document discusses the following topics:

- Six levels of optimization
- Interprocedural optimizations
- High-level loop optimizations
- Advanced optimization options and pragmas
- Profile-based optimization
- Compiler-generated performance advice
- Putting it all together with optimization option recipes

Note that this version applies to the A.06.26 (AR1109) release of the HP compilers. For an overview of the
HP compiler technology, see HP Compilers for HP Integrity Servers [1].

six levels of optimization


There are six levels of optimization. Each level is a superset of the preceding level. Additional parameters
allow the user to control the aggressiveness of optimization, compile time, and the size of the resulting
executable.

level zero
+O0
description:

- Simple register assignment.
- Trivial scheduling (one instruction per cycle, one bundle per cycle).
- Should almost never be used.

benefits:

- Fastest compile time; however, use of this optimization level is strongly discouraged due to the
  poor quality of the resulting code.

level one
+O1 (default)
description:

- Local optimizations over a single basic block, including common subexpression elimination,
  constant folding, and load-store elimination.
- Performs data prefetching of simple array traversals.
- More sophisticated instruction scheduling.
- Register promotion of some scalar locals and C/C++ scalar formals.
- In C++, inlining of calls within a translation unit.

benefits:

- Produces much faster code than +O0, while compiling faster than +O2.


- Debuggability is maintained: breakpoints behave as expected and variables have the
  expected values at breakpoints. See Section 14.27 (Debugging optimized code) in Debugging with
  GDB [2] for more information on this topic.

level two
+O2 or -O
description:

- Performs Level 1 optimizations, plus optimizations performed over entire functions.
- Performs intra-module inlining with tuned-down heuristics, guaranteeing fast compile times in
  addition to potential performance gains.
- Performs global optimizations, code motion, and register promotion.
- Performs loop optimizations such as data prefetching (more aggressive than at Level 1), sum
  reduction, scalar replacement, strength reduction, unrolling, rerolling, fusion, unswitching, and
  post-increment synthesis.
- Performs additional optimizations, including FMA synthesis and dead code elimination.
- Performs optimization of calls to certain library routines if the system headers for the appropriate
  library calls are included. For example, calls to sqrt, sin, and cos, and certain calls to
  memory copies and compares, can be inlined. Commoning of library calls can also occur.

Additionally, the optimizer employs a suite of transformations that take advantage of key Itanium
architectural features to improve the instruction-level parallelism of applications. For example, the
scheduler performs techniques such as predication, control speculation, and data speculation.
Predication converts control flow into conditionally executed instructions, which both
eliminates branch instructions and allows multiple execution paths to execute simultaneously.
Speculation allows code to be executed earlier than it would be under the order specified by the
developer.
In order to perform these scheduling techniques (described in the previous paragraph) effectively and
efficiently, the code is divided into regions that are each optimized as a unit. Innermost loops are
software pipelined whenever possible, utilizing special branches and rotating registers for an efficient
schedule. Predication enables software pipelining of loops with control flow. Both types of speculation
are also supported for modulo scheduled loops.
This level of optimization limits the ability to debug the application. See Section 14.27 (Debugging
optimized code) in Debugging with GDB[2] for more information on this topic.
benefits:

- Significantly faster code than produced at Level 1, due to optimized code and better use of machine
  resources and Itanium architectural features.
- Non-numeric applications can be improved by 50% or more.
- Loop-intensive numeric applications achieve even greater speedups due to optimizations such as
  more aggressive data prefetching and software pipelining.

level two -ipo
+O2 -ipo or -O -ipo
description:

- Performs Level 2 optimizations, plus optimizations across the entire application program.
- Performs interprocedural optimizations (IPO) at link time, including improved range propagation
  and alias analysis, cross-module inlining, interprocedural data prefetching, dead variable and dead
  function removal, variable privatization, short data optimization, data layout optimization, constant
  propagation, and import stub inlining.
- Performs indirect call promotion in whole program mode if dynamic PBO data is available
  (+Oprofile=use).
- Performs inlining of a larger set of math library routines into user code.
- See the chapter on interprocedural optimizations below for more details.
- This level of optimization limits the ability to debug the application. See Section 14.27 (Debugging
  optimized code) in Debugging with GDB [2] for more information on this topic.

benefits:

- Better alias information and inlining improve and enable additional optimizations over Level 2.
- Applications containing many indirect calls or virtual function calls can benefit greatly from
  indirect call promotion.
- Data optimizations improve cache and TLB behavior.
- Code optimizations reduce the number of instructions.

level three
+O3
description:

- Performs Level 2 optimizations, plus optimizations across all functions in a single file.
- Includes inlining and cloning of functions within the same file.
- Performs high-level optimizations, such as loop transformations (interchange, fusion, unrolling,
  and so on). See the section about loop optimization below.
- Performs inlining of a larger set of math library routines into user code.
- Recognizes simple copy loops and replaces them with calls to optimized memory copy routines.
- Recognizes simple manually unrolled loops and rerolls them, enabling better unrolling decisions
  for a given platform later in the loop optimizer.
- This level of optimization limits the ability to debug the application. See Section 14.27 (Debugging
  optimized code) in Debugging with GDB [2] for more information on this topic.

benefits:
Can produce faster code than Level 2. This is particularly true for numerical codes, which tend to
benefit more from the loop transformations, and for codes that frequently call small functions
within the same file or math library functions, which benefit from inlining.

level four (level three ipo)


+O4 or +O3 -ipo
description:

- Performs Level 3 optimizations, plus optimizations across the entire application program.
- Performs interprocedural optimizations at link time; see level two -ipo for a summary.
- This level of optimization limits the ability to debug the application. See Section 14.27 (Debugging
  optimized code) in Debugging with GDB [2] for more information on this topic.

benefits:

- Interprocedural optimizations generally improve application performance (see level two -ipo).
- Better alias information and inlining improve and enable additional loop transformations.

interprocedural optimizations with -ipo


The HP high level optimizer contains an interprocedural optimizer, a high level loop optimizer, and a scalar
optimizer.
The interprocedural optimizer is enabled with the option -ipo at optimization levels two or higher (e.g. +O2
-ipo). Optimization level four (option +O4) implies -ipo.
The high level loop optimizer is fully enabled at optimization levels three or higher (options +O3 and +O4)
and performs optimizations such as loop interchange, loop distribution and loop fusion. Limited high level
loop optimizations are performed at +O2.
The high level scalar optimizer is enabled along with the other high level optimizations and performs
expression simplification and canonicalization, dead code removal, copy propagation, constant
propagation, partial redundancy elimination, partial dead store elimination, as well as control flow
optimizations and basic block cloning.
This chapter focuses on the benefits of the interprocedural optimizer.
The option -ipo can be used to compile some or all of an application's source files. Compiling only some
modules with -ipo enables intermodule optimizations between those files. In this mode, only parts of the
application are analyzed during IPO, so the compiler must make pessimistic assumptions about the rest
of the application. This can result in missed optimization opportunities.
For highest performance, it is beneficial to compile all of an application's source files with -ipo; this is
called whole program mode. In this mode, the compiler can perform precise analysis of the application,
potentially resulting in better performance.
The high level optimizer makes use of PBO information and is more effective when used in combination
with PBO (option +Oprofile=use), for example, PBO data improves function inlining. PBO data can also
reveal the most likely callee at an indirect call site, allowing the high level optimizer to transform the
indirect call into a test and a direct call.
Application performance currently benefits from interprocedural optimization in the following ways:

Interprocedural analysis of memory references and function arguments enables and improves many
optimizations, for example, it yields several additional opportunities for optimizations in the low level
optimizer, including register promotion.
Consider this example:

    void foo( int *x, int *y )
    {
        ... = *x;   // load 1
        *y  = ...;  // store 1
        ... = *x;   // load 2
    }

Without any additional knowledge about the properties of the pointers x and y, the compiler has to
issue a second load instruction (load 2), since the store (store 1) may overwrite the content of the
pointer x.
If, as a result of interprocedural analysis, the compiler is able to determine that x and y never alias
(point to the same memory location), the compiler can promote the value of *x into a register and reuse
that register for the second load (load 2).
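The same no-alias guarantee can be expressed manually in C99 with the restrict qualifier; the following sketch (the function name is ours, for illustration only) shows the contract that permits keeping *x in a register across the store:

```c
/* 'restrict' hands the compiler the same no-alias guarantee that
 * interprocedural analysis can derive automatically: because x and y
 * are declared non-aliasing, the second read of *x may legally reuse
 * the register holding the first load instead of reloading memory. */
int read_store_read(const int *restrict x, int *restrict y)
{
    int a = *x;    /* load 1                            */
    *y = 42;       /* store 1: guaranteed not to hit *x */
    int b = *x;    /* load 2: register reuse is legal   */
    return a + b;
}
```

With -ipo the compiler proves this property itself, so no source annotation is required.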

The compiler interprocedurally propagates information about modified and referenced data items
(mod/ref analysis), which can benefit various other compiler analyses and transformations which need
to consider global side effects.

The compiler also interprocedurally propagates range information for certain entities.

Function inlining exposes traditional benefits, such as the reduction of call overhead, the improvement
of the locality of the executing code and the reduction of the number of branches. More importantly
though, inlining exposes additional optimization opportunities because of the widened scope and
enables better instruction scheduling.
The inliner framework has been designed to scale to very large applications, uses a novel and very fast
underlying algorithm, and employs an elaborate set of new heuristics for its inlining decisions.
Note: The inlining engine is also employed at +O2 for intra-module inlining. At this optimization level
the inliner uses tuned down heuristics in order to guarantee fast compile times in addition to positive
performance effects.

The whole call graph is constructed, enabling indirect call promotion, where an indirect call is
converted to a test and a direct call. Depending on the application characteristics, and in the presence
of PBO data, this can result in significant application speedups (we have observed up to 20%
improvements for certain applications).
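In source terms, indirect call promotion amounts to guarding a direct call with a pointer test. The sketch below uses hypothetical function names; the compiler performs the equivalent rewrite on its intermediate representation:

```c
int hot_target(int v) { return v + 1; }  /* most likely callee per PBO data */

/* The optimizer rewrites the indirect call 'fp(v)' into a compare
 * against the profiled hot target plus a direct (and therefore
 * inlinable) call, keeping the original indirect call as fallback. */
int dispatch(int (*fp)(int), int v)
{
    if (fp == hot_target)       /* test inserted by the optimizer */
        return hot_target(v);   /* direct call, candidate for inlining */
    return fp(v);               /* fallback: original indirect call */
}
```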

Dead variable removal allows the high level optimizer to reduce the total memory requirements of the
application by removing global and static variables that are never referenced.

Recognition of global, static and local variables that are assigned but never used allows the optimizer
to remove dead code (which may result in additional dead variables).

Conversion of global variables that are referenced only within a module allows the high level
optimizer to convert the symbol to a private symbol, guaranteeing that it can only be accessed from
within this module. This gives the low-level optimizer greater freedom in optimizing references to that
variable.

Dead function removal (functions that are never called) and redundant function removal (for example,
duplicate template instantiations) help to reduce compile time and improve the effectiveness of cross
module inlining by reducing the working set. Additionally, as the application's total code size is reduced,
it incurs fewer cache and page misses (resulting in potentially higher performance).

Short data optimizations. Global and static data allocated in the short data area can be accessed with a
more efficient access sequence. In whole program mode (-ipo) the compiler can perform precise
analysis to determine if all global and static data fits into the short data area and allocate it there. If the
data doesn't fit, the compiler can determine the best safe short data size threshold, enabling the
maximum number of data items to be addressed more efficiently.
Note: This is an IPO advantage. At other optimization levels the same optimization can be enabled
with the option +Oshortdata. The option -ipo derives an optimal short data threshold.

For calls to external functions (functions not residing in the binary), the linker introduces a small import
stub. If the compiler knows that a function call is a call to an external function, it can inline the import
stub, resulting in better performance.
The HP compilers support a mechanism that allows annotating function prototypes with a pragma
(#pragma extern) marking those functions as external functions, enabling import stub inlining.
All this is no longer necessary with -ipo in whole program mode. In this mode, the compiler knows
which functions are defined by the application and which are external, and automatically marks
functions appropriately.

The compiler performs interprocedural data layout optimizations, in particular, structure splitting, array
splitting and dead field removal. If the compiler is able to determine that a given record type can be
modified safely and if additionally heuristics find that such type modifications are beneficial, the
compiler may break a record type into a cold part and a hot part with the goal of reducing cache miss
and TLB penalties.


Currently, this optimization is limited to a very restricted set of scenarios. Use +Oinfo to
determine whether this optimization has been performed.

The compiler can also perform non-contiguous array fusion. For some multi-dimensional,
non-contiguous, pointer-based arrays, the compiler will modify the declaration, allocation, and uses of
such arrays to use a contiguous memory layout instead. This transformation both allows more efficient
element access and results in better cache utilization.

Interprocedural constant promotion is performed.

The compiler inserts interprocedural data prefetches before call sites for data accessed in the call chain
rooted at the call site. The inserted prefetches attempt to fetch data accessed via dereferences of
pointer parameters of the call.

The interprocedural analysis phase is also able to expose and warn on additional source problems, for
example, for variables that are declared with incompatible attributes in different source files.
The interprocedural optimization framework has been designed to scale to very large applications.
Fortunately, nothing changes from a user's perspective; in particular, existing build processes do not have
to be modified. However, because IPO and code generation are performed at link time, the link time may
increase significantly.
At the end of the IPO phase, the code generation and low-level optimization phase is started by invoking
multiple parallel processes of the back-end binary, be. The default number of parallel be processes is the
number of processors on the machine. This number can be overridden by setting the environment variable
PARALLEL, for example:
export PARALLEL=4

loop optimizations at +O3 or +O4


The high level loop optimizer performs the following classical loop optimizations based on array access
patterns (the loop optimizer is fully enabled at +O3 or +O4, with a limited subset enabled at +O2). These
optimizations are designed to improve locality of array accesses, improving the utilization of the data
cache.
loop interchange
If the compiler finds a perfect loop nest (no statements before or after nested loops), it will analyze the
memory access patterns, which are implicitly defined by the iteration space, and determine legality and
profitability of interchanging an inner loop with an outer loop. For certain loops, this transformation can
significantly reduce data cache misses.
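For row-major C arrays, the effect of interchange can be seen in a simple nest; this is a hand-written sketch of the result, not compiler output:

```c
#define N 64
/* After interchange, the j loop is innermost, so consecutive
 * iterations touch consecutive memory (unit stride) and each cache
 * line is fully consumed before the traversal moves on. */
double sum_interchanged(double a[N][N])
{
    double s = 0.0;
    for (int i = 0; i < N; i++)       /* originally the inner loop */
        for (int j = 0; j < N; j++)   /* originally the outer loop */
            s += a[i][j];             /* now stride-1 access       */
    return s;
}
```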
loop distribution
Loop distribution seeks to break a single loop into two or more loops. This transformation may remove
loop-carried dependencies, which may result in more efficient code. It is an enabler for loop interchange, as
more perfect loop nests may be generated, and it may also result in more modulo-scheduled loops in the
low level optimizer. Finally, this transformation may alleviate the register pressure for the low level
optimizer.
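A sketch of the distributed form, for a loop whose two statements are independent of each other:

```c
/* Before distribution, one loop mixed two unrelated computations;
 * after it, each loop touches a single output array and can be
 * software pipelined, interchanged, or unrolled independently. */
void distributed(double *a, double *b, const double *c, int n)
{
    for (int i = 0; i < n; i++)    /* loop 1: only a and c */
        a[i] = c[i] * 2.0;
    for (int i = 0; i < n; i++)    /* loop 2: only b and c */
        b[i] = c[i] + 1.0;
}
```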
loop fusion
Loop fusion is the opposite of loop distribution, two loops are merged together into a single loop. This
transformation usually has positive effects on cache utilization when both loops access the same arrays in a
similar order.
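A sketch of the fused form, assuming both original loops walked the same array in the same order:

```c
/* Two loops over the same input are merged; each c[i] is loaded once
 * and feeds both statements while its cache line is still resident. */
void fused(double *a, double *b, const double *c, int n)
{
    for (int i = 0; i < n; i++) {
        double ci = c[i];     /* single load serves both bodies */
        a[i] = ci * 2.0;      /* body of former first loop      */
        b[i] = ci + 1.0;      /* body of former second loop     */
    }
}
```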
loop unswitching
Loop unswitching (also known as if-do promotion) seeks to hoist an if statement out of a loop. If a loop
contains an if statement with a test based on the loop induction variable and a loop invariant value, it can be
beneficial to move the if before the loop and to duplicate the loop body into a first form for which the if test
was always true, and a second form for which the if test was always false. This transformation has the

effect that the if-statement is now executed only when the loop is reached, and no longer on every loop
iteration.
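For an induction-variable test such as (i < k), the transformation can be sketched as a split of the iteration space; the function below is an illustrative hand-written result:

```c
/* The original loop tested (i < k) on every iteration; splitting the
 * iteration space at k resolves that test once, outside both loops,
 * leaving two straight-line loop bodies. */
void unswitched(double *a, int n, int k)
{
    int m = k;
    if (m < 0) m = 0;              /* clamp the split point into [0, n] */
    if (m > n) m = n;
    for (int i = 0; i < m; i++)    /* iterations where i < k held */
        a[i] = a[i] * 2.0;
    for (int i = m; i < n; i++)    /* iterations where it did not */
        a[i] = a[i] + 1.0;
}
```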
loop cloning
Loop cloning seeks to special case loops with variable trip counts with help of profile information. For
example, if a loop iterates from 0 to N, but the profile information hints that the loop most of the time
executes with a constant trip count C, it can be beneficial to special case the loop for C and to check for
this value at runtime to select the proper loop variant. A loop with known trip count can be scheduled most
effectively by the low level optimizer, which can result in dramatic runtime improvements.
loop unrolling
The high level optimizer performs full outer loop unrolling for loops with small trip counts.
loop unroll and jam
The loop unroll and jam transformation performs outer loop unrolling and fusion, which increases
opportunities for scalar replacement. This can reduce the number of memory operations, resulting in better
instruction scheduling.
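A sketch of unroll and jam on a small matrix-vector product (the unroll factor of two and the fixed size are illustrative):

```c
#define NJ 4
/* The outer i loop is unrolled by two and the copies jammed into one
 * inner loop; each b[j] load now feeds two rows, halving the loads
 * and giving the scheduler more independent work per iteration. */
void unroll_and_jam(double a[NJ][NJ], const double b[NJ], double out[NJ])
{
    for (int i = 0; i < NJ; i += 2) {   /* NJ assumed even */
        double s0 = 0.0, s1 = 0.0;
        for (int j = 0; j < NJ; j++) {
            double bj = b[j];           /* one load, two uses */
            s0 += a[i][j] * bj;
            s1 += a[i + 1][j] * bj;
        }
        out[i] = s0;
        out[i + 1] = s1;
    }
}
```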
recognition of memset/memcpy type loops
For loops that essentially copy blocks of data to another memory location, the compiler determines loop
properties, such as the direction of the copy, and then replaces the whole loop with a direct call to a highly
specialized and optimized copy routine.
loop rerolling
Some user code contains manually unrolled loops. This form of manual unrolling usually comes from
tuning efforts on a particular machine; on a different machine, the manually unrolled code may
perform poorly. The compiler tries to identify such unrolled loops and rerolls them by removing the
incremental statements and adjusting the loop bounds and increment. If such a rerolled loop is then passed
through the loop optimizer, better unrolling decisions can be made, depending on machine characteristics.
After loop rerolling, a loop merging pass is run to merge manually unrolled loops and their remainder loops.
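A sketch of the two forms (function names ours):

```c
/* Hand-unrolled copy as often found in tuned source; this shape
 * assumes n % 4 == 0 (the original tuning target's trip counts). */
void copy_unrolled(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        a[i]     = b[i];
        a[i + 1] = b[i + 1];
        a[i + 2] = b[i + 2];
        a[i + 3] = b[i + 3];
    }
}

/* The rerolled form the compiler recovers, leaving the loop optimizer
 * free to choose an unroll factor suited to the actual target: */
void copy_rerolled(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i];
}
```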
loop blocking
Loop blocking is a combination of strip mining and interchange that maximizes data localization. It is
provided primarily to deal with nested loops that manipulate arrays that are too large to fit into the cache.
Under certain circumstances, loop blocking allows reuse of these arrays by transforming the loops that
manipulate them so that they manipulate strips of the arrays that fit into the cache. Effectively, a
blocked loop accesses array elements in sections that are optimally sized to fit in the cache.
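A sketch of a blocked transpose, a classic beneficiary of this transformation (the block size is illustrative; the compiler picks one matched to the cache):

```c
#define BLK 16
/* Each BLK-by-BLK tile of src and dst fits in cache, so every line
 * loaded for a tile is fully reused before it is evicted. For
 * brevity, n is assumed to be a multiple of BLK. */
void blocked_transpose(int n, double *dst, const double *src)
{
    for (int ii = 0; ii < n; ii += BLK)
        for (int jj = 0; jj < n; jj += BLK)
            for (int i = ii; i < ii + BLK; i++)       /* walk one tile */
                for (int j = jj; j < jj + BLK; j++)
                    dst[j * n + i] = src[i * n + j];
}
```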
scalar replacement
The optimizer finds reuses of array locations in a loop and replaces them with uses of scalar temporaries.
These temporaries can be register promoted to reduce memory accesses.
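A sketch of the transformed code (hand-written to show the shape the optimizer produces internally):

```c
/* a[i] was read and written on every j iteration; replacing it with
 * the scalar t turns that memory traffic into register operations,
 * with one load before and one store after the inner loop. */
void scalar_replaced(double *a, double b[][4], int n)
{
    for (int i = 0; i < n; i++) {
        double t = a[i];            /* load a[i] once   */
        for (int j = 0; j < 4; j++)
            t += b[i][j];           /* was: a[i] += ... */
        a[i] = t;                   /* store a[i] once  */
    }
}
```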
loop multiversioning
The loop optimizer may find that some optimizations can be performed on a loop only if certain conditions
are met (e.g., two array references do not overlap), but some of these conditions may not be known at
compile time. The optimizer can clone the loop, introduce runtime checks for these conditions, and optimize
the cloned loop more aggressively.
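A sketch of multiversioning for the overlap case (the restrict-qualified clone stands in for the aggressively optimized version):

```c
/* The no-alias clone; restrict tells the compiler the iterations are
 * independent, so it may pipeline or vectorize this version freely. */
static void add_noalias(double *restrict a, const double *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

void multiversioned_add(double *a, const double *b, int n)
{
    if (b + n <= a || a + n <= b)    /* runtime non-overlap check  */
        add_noalias(a, b, n);        /* optimized clone            */
    else
        for (int i = 0; i < n; i++)  /* conservative original loop */
            a[i] += b[i];
}
```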
malloc combining
The optimizer can combine several small block allocations in a loop into a single large block allocation.
This improves locality and reduces the cost of calling the allocation routine.
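The effect can be sketched in source form; the node type and function name below are hypothetical:

```c
#include <stdlib.h>

typedef struct { double v; } node_t;   /* hypothetical node type */

/* Before: one malloc call per node inside the loop. After: a single
 * arena allocation carved into n node-sized pieces, which places the
 * nodes contiguously and pays the allocator cost only once. */
node_t *alloc_nodes_combined(int n)
{
    node_t *arena = malloc((size_t)n * sizeof *arena);  /* one call */
    if (arena == NULL)
        return NULL;
    for (int i = 0; i < n; i++)   /* was: nodes[i] = malloc(sizeof ...) */
        arena[i].v = 0.0;
    return arena;
}
```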

advanced optimization options and pragmas


The information in the following sections describes several options for enabling optimization.


enabling aggressive optimizations


+Ofast or -fast (-fast is not supported by Fortran)
description:
[Alias for +O2 +Onolimit +Ofltacc=relaxed +FPD +DSnative +Wl,+pi,1M +Wl,+pd,1M +Wl,+mergeseg]
Enables aggressive optimizations at +O2. This option is safe for the vast majority of applications, but
can result in higher compile time or, for codes with strict FP accuracy needs, incorrect output. In
addition to the optimizations performed at +O2, +Ofast performs the following:

- Enables optimizations to employ greater compile time to fully optimize large procedures,
  potentially resulting in non-linear compile times.
- Allows additional FP optimizations that might affect the accuracy of floating point values. See
  +Onofltacc below for more information.
- Enables the flush-to-zero rounding mode on the hardware.
- Aggressively schedules code for the hardware on which the compiler is running. Attempts to
  optimize the code for the resources of that specific processor implementation without regard to the
  potential performance impact on other Itanium-based implementations.
- Enables large 1 MB instruction and data virtual memory page sizes, which can reduce TLB
  misses.
- Causes the dynamic loader to merge all data segments of shared libraries into one block at startup
  time. This allows the kernel to use larger page table entries, which can improve performance.
  However, for short-lived applications, this can add too much startup overhead; it can be
  disabled by adding -Wl,+nomergeseg after +Ofast.

Use +Ofast with stable, well-behaved code that does not rely on FP corner-case values, and that does
not utilize extremely large integer values.
+Ofast might imply +O3 in a future release, rather than +O2.
benefits:

- Safely improves performance for most applications, particularly when the application only runs on
  the type of system on which it was compiled.
- Avoids the need to specify a large number of optimization flags, because it implies a number of
  optimizations that are generally safe and can greatly improve application performance.

+Ofaster
description:
[Alias for +Ofast +O4]
Enables interprocedural optimizations in addition to the advanced optimizations described for +Ofast.
See the descriptions of +Ofast and +O4 for more information.
benefits:

- Combined benefits from +Ofast and +O4.

removing compilation time limits when optimizing


+O[no]limit
+Olimit=[min|default|none]
(default +Olimit=default)
By default, the optimizer is tuned to spend a reasonable amount of time optimizing large programs at
+O2 and above, to avoid non-linear compile times.


Users can remove optimization time restrictions at +O2 and above by using the +Onolimit or
+Olimit=none option. This allows full optimization of large procedures, but can incur significant
compile time increases for very large procedures, especially those with large sequences of straight-line
code. If you are willing to tolerate longer compile times, +Onolimit can result in significant
performance improvements.
Users can limit the amount of time spent optimizing code to completely avoid non-linear compile times
using +Olimit or +Olimit=min.

limiting the size of optimized code


+O[no]size (default +Onosize)
The user can disable optimizations that greatly expand code size at +O2 and above using the +Osize
option. Most optimizations improve code speed and simultaneously reduce code size. However, some
optimizations can greatly increase code size. Loop unrolling is one such optimization, and is disabled
with +Osize.
+Osize also disables inlining. This option can help reduce instruction cache misses. You can use
+Osize with other optimization controls, such as +Onolimit and +Ofast.

controlling the scheduling model


+DS[blended|itanium2|montecito|poulson|native] (default +DSblended)
Different Itanium-based implementations can have vastly different resource constraints, latencies, and
other scheduling criteria. The optimizing scheduler can currently schedule for several Itanium-based
implementations: Intel Itanium 2, Montecito, and Poulson. The user can schedule code to run best on
each of these implementations by using +DSitanium2, +DSmontecito, and +DSpoulson,
respectively. Additionally, use +DSmontecito to obtain the best schedule for Montvale and Tukwila.
Use +DSpoulson to schedule for the future post-Tukwila Poulson implementation.
However, users might also want to optimize code once and have it run reasonably well on different
implementations. The default setting, +DSblended, attempts to do this. Currently, +DSblended is a
combination of +DSmontecito and +DSpoulson. As new IPF implementations are released,
+DSblended scheduling will change so that code will run reasonably well on the different
implementations.
You can simply use +DSnative to schedule code fastest for the implementation on which you are
compiling.

controlling floating point optimization


+Ofltacc=[strict|default|limited|relaxed]
+O[no]fltacc
#pragma STDC FP_CONTRACT [ON/OFF/DEFAULT]
Controls optimizations on floating-point code, so that the expected accuracy of floating-point
computation is not violated. With +Ofltacc=strict (or its equivalent +Ofltacc), all
optimizations that can change result values are prohibited.
By default, or with +Ofltacc=default, the only value-changing optimization allowed is the synthesis
of contractions, such as the floating-point multiply-add instructions and their variants. While these
instructions might change the resulting value relative to a two-instruction multiply-add sequence, the result
is actually more accurate because it is not subject to an intermediate rounding.


On Itanium, the benefit of forming these contractions can be significant. Contractions can be enabled
and disabled in different blocks of code using the FP_CONTRACT pragma. FP_CONTRACT OFF
overrides any prior pragma or +Ofltacc=strict option. FP_CONTRACT ON has no effect other
than undoing a prior FP_CONTRACT OFF, and is overridden by +Ofltacc=strict.
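The pragma's block-level scoping can be sketched in C99 as follows (the function names are illustrative; a compiler without FP_CONTRACT support will warn about and ignore the pragma):

```c
#include <assert.h>

/* Illustrative sketch: FP_CONTRACT controls whether a*b + c may be
 * synthesized into a single fused multiply-add within this block. */
double dot3_contracted(const double *a, const double *b) {
    #pragma STDC FP_CONTRACT ON    /* contractions permitted here */
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

double dot3_uncontracted(const double *a, const double *b) {
    #pragma STDC FP_CONTRACT OFF   /* force separate multiply and add,
                                      each with its own rounding step */
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}
```

For inputs whose products are exactly representable, both versions produce identical results; they can differ in the last bit when an intermediate product would round.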
+Ofltacc=limited enables a small number of other value-changing optimizations in addition to
the contractions. These optimizations can prevent the propagation of Not-a-Numbers (NaNs), infinities,
and the sign of zero. For example, the optimization of 0.0*x => 0.0 prevents the propagation of a
NaN, an infinity, or the sign of zero when x is a NaN, an infinity, or a negative number.
The most aggressive floating-point optimizations are enabled with +Ofltacc=relaxed (or its
equivalent +Onofltacc). For example, faster and more efficient floating-point divide sequences are
enabled under relaxed accuracy.
Additionally, optimizations that reassociate floating-point computation are enabled with
+Ofltacc=relaxed. For example, the sum reduction optimization, which hides floating-point add
latency by computing partial sums, can be enabled in C or C++ with +Ofltacc=relaxed. It also
enables loop optimizations such as fusion, distribution, blocking, unroll and jam, and interchange in
loops with floating-point accesses. For Fortran, these optimizations are already enabled because
reassociation that does not violate explicit parentheses is always legal.
Finally, +Ofltacc=relaxed implies the +Ocxlimitedrange option (described below).
+O[no]sumreduction
Will [dis]allow the sum reduction optimization, regardless of the floating-point accuracy setting. This
can be used to enable optimization of sum reductions via the computation of partial sums for C or C++
without having to specify the more aggressive +Ofltacc=relaxed, which is less safe. Conversely,
+Onosumreduction can be used to disallow the sum reduction optimization under a floating-point
accuracy setting where it is normally allowed (e.g. by default for Fortran, where the language standard
allows this type of reassociation).
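What the sum reduction optimization does can be written out by hand; this illustrative C sketch keeps four partial sums so that independent additions can overlap, at the cost of reassociating the operations:

```c
#include <assert.h>

/* Manual sketch of a sum reduction with four partial sums.  The
 * compiler's transformation is equivalent in spirit: partial sums
 * hide floating-point add latency, but reassociate the additions. */
double sum_partial(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* main unrolled body */
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)                 /* remainder iterations */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);      /* combine the partial sums */
}
```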
+O[no]cxlimitedrange
(default +Onocxlimitedrange for C, +Ocxlimitedrange for Fortran)
#pragma STDC CX_LIMITED_RANGE [ON/OFF/DEFAULT]
You can use this option to obtain faster, complex arithmetic sequences when an application does not
rely on out-of-range floating point values. This option indicates whether out-of-range floating point
values (for example, NaNs and infinities) can occur and must be preserved. With
+Ocxlimitedrange, out-of-range floating-point values might not be preserved. Enabling the limited
range switch results in faster complex arithmetic sequences. The CX_LIMITED_RANGE pragma
enables limited range behavior for specific blocks of code, whereas the option is global.
CX_LIMITED_RANGE ON overrides +Onocxlimitedrange, and CX_LIMITED_RANGE OFF
has no effect except to undo a prior CX_LIMITED_RANGE ON or +Ocxlimitedrange.
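A minimal C99 sketch of the pragma's use (the function name is illustrative; compilers without support warn about and ignore the pragma):

```c
#include <assert.h>
#include <complex.h>

/* With CX_LIMITED_RANGE ON, the compiler may use the short textbook
 * formulas for complex multiply and divide, which are faster but do
 * not preserve out-of-range values such as NaNs and infinities. */
double complex cmul_limited(double complex a, double complex b) {
    #pragma STDC CX_LIMITED_RANGE ON
    return a * b;
}
```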
+O[no]fenvaccess (default +Onofenvaccess)
#pragma STDC FENV_ACCESS [ON/OFF/DEFAULT]
#pragma FLOAT_TRAPS_ON
+Ofenvaccess disables any optimizations that might affect behavior under non-default
floating-point modes (for example, alternate rounding directions or trap enables) or where floating-point
exception flags are queried. It can also be enabled locally using either the FENV_ACCESS or
FLOAT_TRAPS_ON pragmas. FENV_ACCESS ON and FLOAT_TRAPS_ON override
+Onofenvaccess. FENV_ACCESS OFF has no effect other than to undo a prior FENV_ACCESS
ON, FLOAT_TRAPS_ON, or +Ofenvaccess. Enabling fenvaccess, for example, prevents dead
code elimination of instructions that can raise exceptions, results in longer floating-point-to-integer
conversion sequences that explicitly check for out-of-range results, and results in longer floating-point
division sequences.


+O[no]libmerrno
(default +Onolibmerrno; with C's -Aa, c89, or -AC89, the default is +Olibmerrno)
Enables support for errno in libm functions. Different, less optimal versions of libm functions are
invoked under +Olibmerrno. Additionally, the optimizer is prohibited from performing
optimizations of these calls (such as coalescing calls to the same libm function with identical inputs)
because they are no longer side-effect-free.

controlling data allocation


+Olit=[none|const|all] (default +Olit=all)
Not supported by Fortran.
Indicates which constants can be placed in the read-only data section. With +Olit=none, no constants
are placed in read-only memory. With +Olit=const, any const-qualified data that does not require
load-time or run-time initialization is placed in read-only memory, as is any string literal that
appears in a context where a const char * is legal. With +Olit=all, the behavior is the same as
for +Olit=const, except that all string literals are placed in read-only memory.
Placing constants in read-only memory can result in a smaller executable due to coalescing of identical
string literals and can promote data sharing in a multi-user application.
+Oshortdata[=n]
(default n=8)
Not supported by Fortran.
Controls the size of objects placed in the short data area. All objects of size n bytes or smaller are
placed in the short data area. All references to short data assume that it resides in the short data area
(even if it is not defined within the translation unit). Short own data is accessed using a short
gp-relative add, which is more efficient than the access sequences for non-short or non-own
data. Own data is statically allocated data known to be defined in the current load module, including
static, hidden, and protected data, and data whose definition has been seen prior to a reference in
the translation unit.
Valid values of n are decimal numbers between 8 and 4,194,304 (4 MB). When no size is specified, all
data is placed in the short data area (a link error results if the total size is larger than 4MB). Items of
unknown size (for example, extern int a[];) are addressed with short gp-relative
sequences only when no size is specified with +Oshortdata.
This option is unnecessary when compiling with -ipo or +O4, as interprocedural analysis will
automatically perform the optimization on the maximum amount of data possible.

controlling symbol binding


These options allow the user to control the binding of both function and data symbols. The binding controls
how symbols are called and accessed and how the access sequences can be optimized. In the
:filename form, filename refers to a file with a space- or newline-separated list of symbols.
Incorrect use of these options can result in either link-time or run-time errors by changing the number of
visible definitions of each symbol and by changing the way load-time binding occurs.
-Bprotected[=symbol[,symbol]*]
-Bprotected:filename
#pragma protected symbol[,symbol]
#pragma binding protected


You can use this option or pragma to obtain the most optimized access sequences for data and code
symbols. Symbols with the given name(s) are specified as having protected export class. If no symbols
are given, then all symbols, including those referenced but not defined in the translation unit, are
specified as having protected export class. This means that these symbols are not preempted and can be
optimized as such. For example, the compiler can bypass the linkage table for both code and data
references. Additionally, the compiler can omit the saving and restoring of gp around calls to protected
symbols, and can generate a pc-relative call. If the target of the call is not local to the load module, the
linker produces an error. These optimizations are always performed for locally-defined symbols1 unless
the symbols have been named in a -Bextern option list. -Bprotected enables these
optimizations for symbols that are not locally defined.
When -Bprotected is specified with no symbol list, it also implies -Wl,-aarchive_shared,
causing the linker to prefer an archive library to a shared one when one is available. This results in better
performance because accesses to archive libraries are faster than those to shared libraries.
The #pragma binding protected applies to all globally-scoped symbols2 following the
pragma prior to the next #pragma binding.
To avoid linker errors when making calls into shared system libraries, include the system header files
for these routines. The symbols are marked properly in the system headers as being preemptible. If the
header files are not included, and therefore these symbols are not marked properly, the linker issues an
error because they are not defined in the load module. This linker error prevents a run-time error, which
would occur due to incorrect optimization such as omission of gp saves and restores around calls to
these symbols. Similar problems can be encountered when linking with applications or third-party
shared libraries, unless they are decorated with the proper pragmas. Library providers should consult
David Gross's Library Provider's Guide to Symbol Binding [3] on how to enable use of -Bprotected
in user applications.
For application builds, this option can be used with -exec to obtain fastest data access and call
sequences (see -exec and -minshared).
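The scoping of the binding pragmas can be sketched as follows (these are HP-specific pragmas that other compilers warn about and ignore; the symbol names are illustrative):

```c
#include <assert.h>

#pragma binding protected      /* symbols below get protected export class */
int app_counter = 0;           /* locally defined: linkage table bypassed  */
int bump(void) {               /* callable pc-relative, no gp save/restore */
    return ++app_counter;
}
#pragma binding default        /* restore default binding for what follows */
```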
-Bprotected_data
Marks all data symbols as having protected export class, implying the optimizations to data accesses
discussed under -Bprotected. This option can be used when system header files are not included for
shared library calls made by the application, to obtain a subset of the optimizations available with
-Bprotected. However, header files declaring any shared library data being accessed by the
application must be included. For fastest code, users should add the appropriate header file includes,
and compile with -Bprotected. Alternatively, use -Bprotected_data in combination with
either -Bprotected_def or -exec, to obtain optimized access sequences, if modifying source
code to add header file includes is not an option.
-Bprotected_def
Marks locally (non-tentatively) defined symbols as having protected export class. The optimizations
discussed under -Bprotected are applied to these symbols only. This can be used when system header
files are not included for shared library calls or data accesses. For fastest code, users should add the
appropriate header file includes, and compile with -Bprotected.
This option is a subset of -exec.
-Bhidden[=symbol[,symbol]*]
-Bhidden:filename

1 A locally-defined symbol is a global or static symbol with a definition in the compilation set from which it is
referenced. The compilation set is the translation unit without -ipo, and with -ipo it is the collection of translation units
presented to a single linker invocation.
2 A globally-scoped symbol is a symbol that is visible across translation unit boundaries. Examples include simple
globals, static data members, and certain namespace members.


#pragma hidden symbol[,symbol]


#pragma binding hidden
The symbols with the given name or names are specified as having hidden export class. If no symbols
are given with -Bhidden, all symbols, including those referenced but not defined in the translation
unit, are specified as having hidden export class. The #pragma binding hidden applies to all
globally-scoped symbols following the pragma, prior to the next #pragma binding. Hidden export
class implies that the symbols are not exported outside the load module. In an executable, unreferenced
hidden symbols are eliminated when +Oprocelim is specified.
The treatment by the compiler is otherwise the same as for -Bprotected, including the implicit
-Wl,-aarchive_shared when no symbol list is specified with -Bhidden.
-Bhidden_def
Marks locally (non-tentatively) defined symbols as having hidden export class. The optimizations discussed
under -Bhidden are applied to these symbols only. This can be used when system header files are not
included for shared library calls or data accesses. For fastest code, users should add the appropriate header
file includes, and compile with -Bhidden.
-Bdefault=symbol[,symbol]*
-Bdefault:filename
#pragma default_binding symbol[,symbol]
#pragma binding default
The given symbols are specified as having default export class. This means that the symbols can be
imported or exported outside the load module. For tentative symbols, the compiler uses the linkage
table for access. For function calls not local to the translation unit, the compiler saves and restores GP
around the call. However, any accesses to locally (non-tentatively) defined symbols are optimized as
described under -Bprotected. By default, all symbols have default export class. However, this
option can be used to override a global -Bprotected, -Bhidden, or -Bextern option for specific
symbols. For example, if most calls in a translation unit are resolved within the load module, the user
can specify -Bprotected followed by -Bdefault on the list of shared library symbols that are
accessed in the translation unit. The #pragma binding default applies to all globally-scoped
symbols following the pragma, prior to the next #pragma binding.
-Bextern[=symbol[,symbol]*]
-Bextern:filename
#pragma extern symbol[,symbol]
#pragma binding extern
Functionally, this option or pragma is similar to -Bdefault. However, it provides the additional hint
to the compiler that the symbols are likely to reside in a separate load module, and therefore that the
compiler should inline the import stub for the calls to these symbols. Specification of a locally-defined
symbol on the -Bextern option or pragma causes the compiler to mark that symbol with default
export class. Unlike -Bdefault, it also avoids any compile-time binding for this locally defined
symbol, which means that references to the symbol are through the linkage table and gp is saved and
restored around calls to the symbol. Moreover, as with other symbols specified with -Bextern, it
means that calls to the locally defined symbol go through an inlined import stub. Clearly, all of this
results in a performance penalty when accessing these symbols, so -Bextern should be used only for
those symbols that are expected to be external or preemptible.
When specified without a symbol list, -Bextern applies only to undefined and tentatively-defined
symbols. The #pragma binding extern applies to all globally-scoped symbols following the
pragma, prior to the next #pragma binding.


-exec
Asserts that code is being compiled for an executable. Similar to -Bprotected_def, all locally
defined symbols are marked as having protected export class. Additionally, accesses to symbols known
to be defined in the executable can be materialized with absolute addressing, rather than linkage table
accesses.
-minshared
Equivalent to -Bprotected -exec. When building an executable that makes minimal use of shared
libraries, use this option to obtain fastest access sequences to non-shared library code and data.

controlling other optimization features


+Odata_prefetch=[none|direct|indirect] (default +Odata_prefetch=indirect)
+O[no]data_prefetch (+Odata_prefetch is equivalent to +Odata_prefetch=indirect)
Enables data prefetch insertion. Currently, data prefetches are inserted for loops containing inductive
accesses, or certain linked-list traversals. With +Odata_prefetch=direct, prefetches are inserted
for loads and stores that have inductive addresses and are on heavily-executed paths through the loop.
The prefetches are inserted to cover the longest latency possible given the size of the outstanding
request queues in the cache hierarchy and the expected memory latency, and are given the appropriate
cache hint for the data type being accessed. The compiler attempts to minimize the overhead of
prefetching using a number of techniques, which might involve unrolling the loop or utilizing rotating
registers to share a single static prefetch among multiple arrays. By default, with +Odata_prefetch
or +Odata_prefetch=indirect, in addition to the prefetches inserted by
+Odata_prefetch=direct, the compiler inserts prefetches for data that is accessed with an
address that is indirectly dependent on an induction expression in the loop. In other words, the induction
expression is fed through some other intermediate computation to build the data address. Currently, the
types of intermediate computation supported are loads and bit extracts. For example, in the following
code, array A is accessed indirectly using the index loaded from array B:
for (i = 0; i < n; i++)
    x = A[B[i]];   /* read A indirectly through B */
The direct prefetching algorithm would insert prefetches for array B, which has an inductive address.
With indirect prefetching, the compiler detects that array A is accessed indirectly with B[i], and
inserts prefetches appropriately. In order to compute the prefetch address for A, array B is speculatively
loaded. If the prefetch distance is PF, then indirect prefetching inserts the following code into the above
loop:
lfetch B[i+PF*2]
index = ld.s B[i+PF]
(p) lfetch A[index]
Notice that array B is now prefetched at twice the normal prefetch distance, because it must be
speculatively loaded at the prefetch distance in order to prefetch array A at the prefetch distance. A
speculative load is used because the loop can run past the end of array B, and the load of A's
prefetch address must not raise any exceptions. Also notice that the indirect prefetch may be predicated, to
avoid executing it in the last PF-1 iterations of the loop. Because the speculative load may access an
address off the end of the B array, the index used in the indirect prefetch may be junk, potentially
resulting in DTLB misses on the indirect prefetch if it were executed. The accesses to B in the last PF-1
iterations are not likely to cause DTLB misses, since they lie just after the B array in the address
space, particularly when large pages are used.
In certain cases, for small arrays, the compiler may decide to insert a number of straight-line prefetches
before the loop to prefetch the entire array, rather than inserting inductive or indirect prefetches into the
loop body.
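The shape of the transformation can be sketched in C using the GCC-style __builtin_prefetch intrinsic as a stand-in for lfetch (PF, the bounds guards, and the function name are illustrative; the real compiler predicates the indirect prefetch and uses a speculative load instead of guards):

```c
#include <assert.h>

#define PF 8   /* illustrative prefetch distance, in iterations */

/* Hand-written sketch of direct + indirect prefetching for A[B[i]]. */
double gather_sum(const double *A, const int *B, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PF * 2 < n)                     /* direct: B has an inductive address */
            __builtin_prefetch(&B[i + PF * 2]);
        if (i + PF < n)                         /* indirect: A's address is loaded from B */
            __builtin_prefetch(&A[B[i + PF]]);
        sum += A[B[i]];
    }
    return sum;
}
```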


With profile data, the compiler may also insert stride prefetches for linked-list traversals that have
regular runtime address strides. Consider the following source code example:
for (p = ptr; p != 0; p = p->next)
x += p->data;
Normally, the compiler cannot insert prefetches for later iterations of the loop without dereferencing
successive values of the next field. However, profile data may indicate that the values of the p pointer
have a regular address stride in virtual memory. For example, if the values of p on successive iterations
are {8, 16, 24, 32, ...}, then it has a regular stride of 8 bytes. The compiler can then insert a prefetch
using this stride to prefetch later iterations:
for (p = ptr; p != 0; p = p->next) {
x += p->data;
lfetch p + PF*8;
}
In some cases, profile data may indicate that there are multiple dominant strides across the program's
execution. In that case, the compiler may insert a prefetch using a runtime computation of the stride,
such that the stride used in the current iteration's prefetch is the stride between the values of the pointer
in the last two iterations.
Without profile data indicating a regular stride for a linked-list traversal, the compiler will insert a
prefetch of the next field's pointer. For the above example, it would insert the following prefetch:
for (p = ptr; p != 0; p = p->next) {
lfetch p->next->next;
x += p->data;
}
If the loop is reasonably large, this can help hide some of the latency of the subsequent iteration's
dereference of p.
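Written out in C with the GCC-style __builtin_prefetch intrinsic standing in for lfetch, this transformation looks roughly like the following (the struct layout and names are illustrative):

```c
#include <assert.h>

struct node { struct node *next; double data; };

/* Sketch of the pointer-chase prefetch: each iteration prefetches the
 * node two links ahead, hiding some of the next dereference's latency. */
double list_sum(struct node *ptr) {
    double x = 0.0;
    for (struct node *p = ptr; p != 0; p = p->next) {
        if (p->next)                        /* guard the extra dereference */
            __builtin_prefetch(p->next->next);
        x += p->data;
    }
    return x;
}
```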
+Oprefetch_latency=n
Indicates that data prefetches in loops should hide n cycles of memory latency. By default, the compiler
attempts to issue prefetches far enough ahead to just fill the L2 cache outstanding request queue or
cover the expected memory latency. Using this option will override that heuristic, and cause prefetches
to be inserted enough iterations ahead of the corresponding load to cover the n cycles.
+O[no]inline:filename
+O[no]inline=symlist
#pragma no_inline
#pragma inline
#pragma [no]inline_call
Enable or disable inlining for specific functions. The functions can be listed in either a separate file
filename or on the command-line in symlist. By default, the compiler uses heuristics to determine
the profitability of inlining candidates, but these heuristics are overridden by this option. This option
can be used when the user knows that inlining of a certain function is always profitable, or never
profitable. The no_inline pragma can also be used to list those functions that should never be
inlined, and the inline pragma to list those that should always be inlined. Place the appropriate
pragma in the source file that contains the definition of the function that should or should not be inlined.
The [no]inline_call pragma is used to enable or disable inlining of a particular call site. It takes
no arguments and affects the outermost, leftmost call in the next statement. However, the
[no]inline_call pragma is not implemented at first release.


+inline_level n
Fine-tunes the aggressiveness of the inliner. The value of n can be in the range 0.0-9.0 in 0.1
increments. The following values and ranges have special meaning:
0.0: No inlining is done (same as +d).
1.0: Only functions marked with the inline keyword, or implied by the language to be inline, are
considered for inlining.
1.0 < n < 2.0: increasingly aggressive inlining below the default level.
2.0: Default level of inlining for +O2, +O3, and +O4.
2.0 < n < 9.0: increasingly aggressive inlining.
9.0: Most aggressive inlining.

+O[no]procelim (default +Oprocelim)


Enables or disables elimination of procedures that are never called. Those marked with hidden export
class are deleted, in addition to unreferenced non-hidden non-static symbols. This option reduces
executable size, which can improve TLB and instruction cache behavior.
+Otype_safety=[off|limited|ansi|strong]
(default +Otype_safety=off; under some compilation modes the default becomes limited)
These options are only supported for C applications. These options are used to indicate what type of
aliasing guarantees the compiler can assume when optimizing code. Applications might not execute
correctly if the specified option guarantees a degree of type-safety that is not actually followed in the
code. +Otype_safety=off is the most conservative, and says that objects of all types can alias each
other. Using +Otype_safety=limited specifies that the code follows ANSI aliasing rules, and
that unnamed objects should be treated as if they had unknown type. In other words, objects of type
float and of type int can be assumed to touch different memory locations, and accesses to these
differently-typed objects can be freely scheduled across one another. It does not disambiguate accesses
of different types when both accesses can touch unnamed memory. In addition, as ANSI rules specify,
character objects can touch objects of other types, and must be optimized conservatively with respect to
other objects. Using +Otype_safety=ansi specifies that the code follows ANSI aliasing rules, and
that unnamed objects should be treated the same as named objects. The most aggressive disambiguation
is allowed with +Otype_safety=strong, which says that code follows ANSI aliasing rules, except
that accesses through lvalues of character type are not permitted to touch objects of other types, and it is
assumed that structure and union field addresses are not taken.
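The kind of disambiguation ANSI rules permit can be illustrated with a small C fragment (the names are illustrative):

```c
#include <assert.h>

/* Under ANSI aliasing rules (+Otype_safety=ansi), *f and *i have
 * different types, so the compiler may assume the store to *f does
 * not modify *i, and fold the return value without reloading *i. */
int scaled(float *f, int *i) {
    *i = 10;
    *f = 3.0f;      /* assumed not to alias *i */
    return *i * 2;  /* may be folded to 20     */
}
```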
+O[no]ptrs_to_globals (default +Optrs_to_globals)
With +Onoptrs_to_globals, the optimizer assumes that statically-allocated data (including
file-scoped globals, file-scoped statics, and function-scoped statics) is not read or written through
pointers. This allows more aggressive optimization and scheduling of pointer-intensive code.
This option is unnecessary when compiling with -ipo or +O4, as interprocedural analysis will apply it
automatically where legal.
+O[no]cross_region_addressing (default +Onocross_region_addressing)
Enables or disables the use of cross-region addressing. Enabling this option results in more conservative
address computation and a loss in performance. It is required if pointers (including array base pointers)
might point to a different region than the data being accessed at an offset from that pointer. This does not
occur in standard-conforming applications. When the option is enabled, the compiler cannot take
advantage of post-incrementing load and store instructions, because it does not know when an address
might cross into another region. This option has no effect under +DD64.
+O[no]parmsoverlap (default +Oparmsoverlap)
Not applicable to Fortran


With +Onoparmsoverlap, the optimizer assumes that subprogram arguments do not refer to
overlapping memory locations. This allows more aggressive optimization and scheduling of
pointer-intensive code.
+O[no]parminit (default +Onoparminit)
Not supported for Fortran.
When enabled, the optimizer inserts instructions to initialize to zero any unspecified function
parameters at call sites. This avoids NaT values in parameter registers. Enabling this option results in
small performance losses, but might be required for correctness.
+O[no]store_ordering (default +Onostore_ordering)
Not supported for Fortran.
Enabling this option forces the optimizer to preserve the original program order for stores to memory
that is possibly visible to another thread. This does not imply strong ordering. This option can be used
to achieve program ordering of stores without using the more conservative volatile semantics applied to
all accesses to global variables with +Ovolatile.
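A typical pattern that needs +Ostore_ordering is a producer publishing data behind a flag (illustrative names; note the option preserves program order only, it does not insert memory fences):

```c
#include <assert.h>

int value;   /* data written before the flag */
int ready;   /* flag another thread may poll */

/* Without a store-ordering guarantee, the optimizer could reorder
 * these two stores; a second thread could then observe ready == 1
 * while value still holds its old contents. */
void publish(int v) {
    value = v;
    ready = 1;
}
```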
#pragma IF_CONVERT
This block-scoped pragma can be used to indicate that the compiler should employ if-conversion to
eliminate all control flow resulting from conditional code within that scope. If-conversion is the process
by which the compiler uses predicates to eliminate conditional branches. By default at +O2 and higher,
the compiler uses heuristics to determine when it is beneficial to apply if-conversion to eliminate a
conditional branch. This pragma overrides those heuristics and causes the compiler to eliminate all
non-loop control flow within the scope of the pragma. Users can specify this pragma to facilitate software
pipelining of inner loops that contain conditional code, because the compiler can only software pipeline
loops that contain certain types of control flow. When placed within the scope of an inner loop, this
pragma causes the compiler to eliminate all branches except for the loop back branch.
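Typical placement of the pragma inside an inner loop looks like this (the pragma is HP-specific and ignored with a warning by other compilers; the function is illustrative):

```c
#include <assert.h>

/* The conditional add below would normally compile to a branch that
 * blocks software pipelining; IF_CONVERT asks the compiler to turn
 * it into predicated code instead. */
double positive_sum(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        #pragma IF_CONVERT
        if (x[i] > 0.0)
            s += x[i];
    }
    return s;
}
```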
+O[no]loop_unroll[=n] (default +Oloop_unroll)
#pragma UNROLL_FACTOR n
#pragma UNROLL n | (n)
The option indicates how many times the optimizer should attempt to unroll each loop. In most cases,
this will only affect innermost loops. Similarly, the block-scoped UNROLL_FACTOR and UNROLL
pragmas specify that the particular innermost loop should be unrolled n times. The
UNROLL_FACTOR pragma must be placed inside the associated loop, whereas the UNROLL pragma
can be placed just before the specified loop. By default the compiler uses heuristics to determine the
best unroll factor for an inner loop. However, if the user knows that a particular unroll factor is best for
the given loop, or alternatively, that no unrolling should be applied to the loop, the option or pragma
can be used to communicate this information to the compiler. The user-specified unroll factor overrides
the unroll factor computed by the compiler. Specifying n=1 prevents the compiler from unrolling the
loop. Specifying n=0 causes the compiler to use its own heuristics to determine the best unroll factor
(same as not specifying the option or pragma). The pragma is ignored if it decorates a non-innermost
loop.
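Typical placement of the UNROLL pragma, just before the loop it controls (HP-specific; other compilers warn about and ignore it, and the loop is illustrative):

```c
#include <assert.h>

double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    #pragma UNROLL 4          /* request 4x unrolling of the next loop */
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```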
+Ointeger_overflow=[moderate|conservative] (default +Ointeger_overflow=moderate)
Specifies how aggressive the optimizer should be in assuming that integer arithmetic computations do
not overflow. According to C and C++ language standards, signed integer arithmetic overflow in user
code results in undefined behavior. Therefore, by default (under
+Ointeger_overflow=moderate), the compiler assumes that such overflow will not occur. As a
result, the compiler may remove sign extensions of signed integer accumulations within loop bodies,
which enables further analysis and optimizations. Applications that rely on particular signed integer
overflow behavior should use +Ointeger_overflow=conservative.
+Oautopar


When the +Oautopar option is used at optimization levels +O3 and above, the compiler
automatically parallelizes those loops that the loop transformer deems safe and profitable.
This optimization allows the compiled program to take advantage of more than one processor (or core)
when executing parallelizable loops. Most programs that spend a significant percentage of their
execution time in such loops will improve their performance, occasionally dramatically. By contrast,
some programs may experience performance degradations when parallelized, and all parallelized
programs increase their use of system resources, which may slow down other programs running
alongside them.

profile-based optimization
Profile-based optimization (PBO) is a set of performance-improving code transformations that make use of
an execution profile gathered for an application. There are three steps involved in performing this
optimization:
1. Instrumentation - Recompile the program to prepare it for execution profile collection.
2. Data Collection - Run the program with representative data to collect execution profile statistics.
3. Optimization - Generate optimized code based on the profile data.
Invoke profile-based optimization by using the HP compiler +Oprofile=collect and
+Oprofile=use command line options, as described below.

instrumenting the code


To instrument your program, use the +Oprofile=collect option as follows:
cc -Aa +Oprofile=collect -c sample.c          Compile for instrumentation.
cc -o sample.exe +Oprofile=collect sample.o   Link to make instrumented executable.

The first command line compiles the code; the +Oprofile=collect option requests that the compiler
prepare the module for profile collection. The -c option in the first command line suppresses linking and
creates an intermediate object file called sample.o. The second command line uses the -o option to link
sample.o into sample.exe. The +Oprofile=collect option prepares sample.exe with data
collection code.
Note: Instrumented programs run slower than non-instrumented programs. Use instrumented code only
to collect statistics for profile-based optimization.

collecting execution profile data


To collect an execution profile for your application, run the instrumented program with representative data
as follows:
sample.exe < input.file1                      Collect execution profile data.
sample.exe < input.file2


This step creates and logs the profile statistics to a file, by default called flow.data. You can use this
data collection file to store the statistics from multiple test runs of different programs that you might have
instrumented.

performing profile-based optimization


To optimize the program based on the previously collected run-time profile statistics, recompile the
program as follows:
cc -Aa +Oprofile=use -O -c sample.c           Compile for optimization.
cc -o sample.exe +Oprofile=use -O sample.o    Link for optimization.

The +Oprofile=use option is supported at optimization level 2 (-O or +O2) and above.
Note: Profile-based optimization has a greater impact on application performance at each higher level of
optimization. Profile-based optimization should be enabled during the final stages of application
development. To obtain the best performance, re-profile and re-optimize your application after making
source code changes.

maintaining profile data files


Profile-based optimization stores execution profile data in a disk file. By default, this file is called
flow.data and is located in your current working directory. You can override the default name of the
profile data file, which is useful when working on large programs or on projects with many different
program files. To select an alternate profile data file, use the +Oprofile=use:file form of the command
line option. You can also use the FLOW_DATA environment variable to specify the name of the profile
data file for the +Oprofile=use compile. The :file qualifier takes precedence over the FLOW_DATA
environment variable. In the following example, the FLOW_DATA environment variable is set to override
the flow.data file name. The profile data is stored instead in /users/profiles/prog.data.
% cc -Aa -c +Oprofile=collect sample.c
% cc -o sample.exe +Oprofile=collect sample.o
% setenv FLOW_DATA /users/profiles/prog.data
% sample.exe < input.file1
% cc -o sample.exe +Oprofile=use +O3 sample.o
In the next example, the +Oprofile=use:file option uses /users/profiles/prog.data to override
the flow.data file name.
% cc -Aa -c +Oprofile=collect sample.c
% cc -o sample.exe +Oprofile=collect sample.o
% sample.exe < input.file1
% mv flow.data /users/profiles/prog.data
% cc -o sample.exe +Oprofile=use:/users/profiles/prog.data +O3 sample.o

merging profile data files


Execution profile data files from different runs of an instrumented executable can be merged into
a single file. This merging can take place implicitly during data collection, or it can be performed
explicitly with the fdm command. This example shows an implicit merge:
% aCC +O3 -c +Oprofile=collect program.C
% aCC -o program.exe +Oprofile=collect program.o
% setenv FLOW_DATA /tmp/program.flow
% program.exe < A.input

Profile for input A.input written to /tmp/program.flow

% program.exe < B.input

Profile for input B.input merged into /tmp/program.flow

During the second run of the instrumented executable, the execution profile data derived from running the
program on B.input is merged with (added to) the existing profile database /tmp/program.flow. Profile
databases may also be merged explicitly using the tool /opt/langtools/bin/fdm. Here is an
example of an explicit merge:
% unsetenv FLOW_DATA ; rm flow.data


% ./program.exe < A.input


% mv flow.data A.flow
% ./program.exe < B.input
% mv flow.data B.flow
% /opt/langtools/bin/fdm A.flow B.flow -o /tmp/program.flow
The two sequences above (implicit and explicit) will result in the same final profile, modulo sampling
effects.

locking of profile database files


When an instrumented application completes execution and begins writing to the flow.data file to record
its execution profile, it attempts to lock the file in order to obtain exclusive access. This is intended to avoid
cases where two instances of an executable try to update the same file simultaneously. Lock files
take the form <flowfile>.lock and are written to the same directory as the flow file; the
lock persists until the flow file has been completely updated.
If an executable is unable to obtain a lock (perhaps because many processes are all trying to update the same
file), it writes to a temporary flow file flow.XXX, where XXX is a pseudo-random string returned by
the tempnam() library function. If this happens, you can merge the resulting temporary files back into a
single database using fdm.

Itanium- versus PA-RISC profile-based optimization differences


Although the user model is the same, the underlying implementation of profile-based optimization in the
Itanium compilers is substantially different from that in the PA-RISC compilers. When transitioning from
PA-RISC to Itanium, be aware of the following:


The PA-RISC equivalent of the +Oprofile=collect command line option is +I, and the PA-RISC
equivalent of +Oprofile=use is +P; however, the PA-RISC options are honored by
the Itanium compiler.

In the PA-RISC implementation, compiling a module with -c +I or -c +P causes an ISOM
(high-level intermediate) object file to be generated, and actual code generation is postponed until the
final link phase. This is not the case in the Itanium-based implementation, where code is generated
during the -c compile (with either +I or +P).

Instrumented applications are optimized less aggressively than non-instrumented executables. The
PA-RISC compiler is capable of optimizing instrumented code at level +O2, whereas with the
current Itanium-based compilers, profile collection is supported at +O1 optimization (a warning is
issued indicating that the optimization level will drop to +O1 internally for +Oprofile=collect
compiles). This restriction may be lifted in a future release, however.

In the PA-RISC +I implementation, profile counters are 32 bits in size. When selecting input data
sets for runs of instrumented executables, counter saturation can occur if the training run is too
lengthy. On Itanium, profile counters are 64 bits in size, meaning that you can use more lengthy
training runs without concerns about counter saturation.

compiler-generated performance advice


The compiler emits performance-related advice when +wperfadvice[=1|2|3|4] is specified
(+wperfadvice is equivalent to +wperfadvice=2). Level 1 emits the fewest advice messages,
which are the easiest to act on. Higher levels emit more suggestions, and those emitted at levels 3 and 4
may require extensive or complicated source code changes to achieve performance benefits.
The scenarios currently detected by +wperfadvice include, but are not limited to:

Passing large structures by value instead of by reference

Lack of profile information and inability to perform profile-based optimizations

Frequently executed indirect function calls which may perform better as direct calls.

Possibly inadvertent use of #pragma optimize off

Frequently called routines that are not defined in the load module and cannot be inlined by the
compiler

Loops with constant trip counts that may be multi-versioned

Inability to pipeline loops due to recurrence restraints

putting it together with optimization option recipes


There are many available compiler options, some of which are detailed in this document; however, a few
options tend to provide big performance boosts for most applications.

Use optimization level 2 (-O or +O2) at a minimum (+O3 for floating-point applications).

Consider compiling with +O4 if not shipping archive libraries (if +O4 is not an option, consider using
-minshared, +Bprotected_def, and +Oshortdata to attain some of the benefit).

Use PBO (profile-based optimization) for a potentially large improvement in performance (especially
for large commercial applications). PBO provides even bigger improvements on top of +O4.

Use +Ofast, which is safe and effective for the vast majority of programs.

For memory-intensive programs, use large pages via the +pd and +pi linker options or chatr(1).

For floating-point applications, as mentioned above, +O3 should be the minimum optimization level.
Additionally, +Ofltacc=relaxed and +FPD (both included in +Ofast) often provide large
improvements.


Index

A
access sequences, optimized, 14
aggressive optimization, enabling, 10
aggressive optimization, safety of, 10
aggressively schedule code, 10
archive library, 14

C
compilation time limits, removing, 10
controlling optimization, 3
cross-region addressing, enabling/disabling, 18

D
data allocation, controlling, 13
data prefetch insertion, 16
dead code elimination, preventing, 12
debugging, 4

E
executable, compiling code for, 16
execution profile, collecting, 21
export class, default, 15
export class, hidden, 15
export class, protected, 14

F
floating-point code, controlling optimization on, 11
floating-point contractions, 12
floating-point modes, non-default, 12
floating-point optimizations, aggressive, 12
floating-point optimizations, reassociating, 12
floating-point values, out of range, 12
flush-to-zero rounding mode, 10
FP accuracy, 10

I
if-conversion, 19
inlining, enabling or disabling, 17
inlining the import stub, 15
interprocedural optimizations, 6
ipo. See interprocedural optimizations

L
large procedures, 10
level four, 5
level one, 3
level three, 5
level two, 4
level zero, 3
library, shared versus archived, 14
linker errors, avoiding, 14
loop optimizations, 8

N
NaN, preventing propagation of, 12
Not-a-Number. See NaN

O
optimization levels, 3

P
PA-RISC, differences, 22
PBO. See profile-based optimization
performance advice, 23
PGO. See profile-based optimization
prefetch insertion, 16
profile data, maintaining files, 21
profile data, merging files, 22
profile-based optimization, 20
program order for stores, preserving, 19
protected export class, marking all data symbols, 14
protected export class, marking locally defined symbols, 14, 15

S
scheduling model, controlling, 11
shared library, 14
short data area, 13
symbol binding, controlling, 13

U
unrolling, 19

