HP C/HP-UX Online Help

Optimizing HP C Programs

NOTE: See the Compiling & Running HP C Programs section of the HP C Online Help for a quick reference of all HP C compiler options and pragmas. See the Parallel Options & Pragmas section of the HP C Online Help for detailed descriptions of options and pragmas for threads and parallel programming.

Summary of Major Optimization Levels
Supporting Optimization Options
Enabling Basic Optimization
Enabling Different Levels of Optimization
Changing the Aggressiveness of Optimizations
Enabling Only Conservative Optimizations
Enabling Aggressive Optimizations
Removing Compilation Time Limits When Optimizing
Limiting the Size of Optimized Code
Specifying Maximum Optimization
Combining Optimization Parameters
Summary of Optimization Parameters
Profile-Based Optimization
Controlling Specific Optimizer Features
Using Advanced Optimization Options
Level 1 Optimizations
Level 2 Optimizations
Level 3 Optimizations
Level 4 Optimizations
Guidelines for Using the Optimizer
Optimizer Assumptions
Optimizer Pragmas
Aliasing Options
Improving Shared Library Performance
Improving Compile and Link Times

The HP C optimizer transforms programs so machine resources are used more efficiently. The optimizer can dramatically improve application run-time speed. HP C performs only minimal optimizations unless you specify otherwise. You activate additional optimizations using HP C command-line options and pragmas.

There are four major levels of optimization: levels 1, 2, 3, and 4. Level 4 optimization can produce the fastest executable code. Level 4 is a superset of the other levels.

Additional parameters enable you to control the size of the executable program, compile time, and aggressiveness of the optimizations performed.

Compile time memory and CPU usage increase with each higher level of optimization due to the increasingly complex analysis that must be performed. You can control the trade-offs between compile-time penalties and code performance by choosing the level of optimization you desire.

Generally, the optimizer is not used during code development. It is used when compiling production-level code for benchmarking and general use.

Summary of Major Optimization Levels

Table 7: HP C Major Optimization Options summarizes the major optimization options of HP C:
 

Table 7: HP C Major Optimization Options

Option: +O0 (default)
Description: Constant folding and simple register assignment.
Benefits: Compiles fastest.

Option: +O1
Description: Level 0 optimizations plus instruction scheduling and optimizations that can be performed on small sections of code.
Benefits: Produces faster programs than level 0. Compiles faster than level 2.

Option: +O2 or -O
Description: Level 1 optimizations, plus optimizations performed over entire functions in a single file. Optimizes loops in order to reduce pipeline stalls. Performs scalar replacement, and analysis of data-flow, memory usage, loops and expressions.
Benefits: Can produce faster run-time code than level 1 if programs use loops extensively. Compiles faster than level 3. Loop-oriented floating point intensive applications may see run times reduced by 50%. Operating system and interactive applications that use the already optimized system libraries can achieve 30% to 50% additional improvement.

Option: +O3
Description: Level 2 optimizations, plus full optimization across all subprograms within a single file. Includes subprogram inlining.
Benefits: Can produce faster run-time code than level 2 on code that frequently calls small functions. Links faster than level 4.

Option: +O4
Description: Level 3 optimizations, plus full optimizations across the entire application program. Includes global and static variable optimization and inlining across the entire program. Optimizations are performed at link-time.
Benefits: Produces faster run-time code than level 3 if programs use many global variables or if there are many opportunities for inlining procedure calls.

Supporting Optimization Options

Table 8: Other Supporting Optimizations shows optimization options that support the core optimization levels. These optimizations are performed only when specifically invoked. They are available at all optimization levels.
 

Table 8: Other Supporting Optimizations

Option: +ESfic
Description: Generates object code with fast indirect calls. Only correct for programs not using shared libraries.
Benefits: Run-time code is faster.

Option: +ESlit
Description: Places string literals and constants defined with the ANSI C const type qualifier into read-only data storage. Storing to constants with this option will cause segmentation violations.
Benefits: Reduces memory requirements and improves run-time speed in multi-user applications. Can improve data-cache utilization.

Option: +I, +P
Description: Enables all profile-based optimizations. Uses execution profile data to identify the most frequently executed code paths. Repositions functions, basic blocks, and aids other optimizations according to these frequently executed paths.
Benefits: Improves code locality and cache hit rates. Improves efficiency of other optimizations. Benefits most applications, especially large applications with multiple compilation units. May be used at any optimization level.
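
As an illustration of the +ESlit caveat in the table above, the following C sketch shows one way to stay safe when string literals may be read-only: copy the literal into a writable buffer before changing it. The function name make_greeting and the buffer handling are hypothetical, chosen only for this example.

```c
#include <string.h>

/* Hypothetical helper: builds a capitalized greeting in caller storage.
 * The literal is copied into a writable buffer first. Writing through a
 * pointer to the literal itself is undefined in ANSI C and, according
 * to the +ESlit description, faults when literals are read-only. */
void make_greeting(char *buf, size_t buflen)
{
    const char *literal = "hello";   /* may live in read-only data */

    if (buflen == 0)
        return;
    strncpy(buf, literal, buflen - 1);
    buf[buflen - 1] = '\0';          /* strncpy may not terminate */
    if (buf[0] == 'h')
        buf[0] = 'H';                /* safe: buf is writable */
}
```

Code written this way behaves the same with or without +ESlit.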

Enabling Basic Optimization

To enable basic optimizations, use the -O option (equivalent to +O2), as follows:
cc -O sourcefile.c
Basic optimizations do not change the behavior of ANSI C standard-conforming code. They improve run-time speed while increasing compile time and link time only moderately.

Enabling Different Levels of Optimization

There may be times when you want more or less optimization than what is provided with the basic -O option.

Level 1 Optimization

To enable level 1 optimization, use the +O1 option, as follows:
cc +O1 sourcefile.c
Level 1 optimization compiles quickly, but still provides some run-time speedup.

Level 2 Optimization

To enable level 2 optimization, use the +O2 option, as follows:
cc +O2 sourcefile.c
Level 2 (equivalent to -O) takes more time to compile, but produces greatly improved run-time speed.

Level 3 Optimization

To enable level 3 optimization, use the +O3 option, as follows:
cc +O3 sourcefile.c
Level 3 does full optimization of all subprograms within a single file.

Level 4 Optimization

To enable level 4 optimization, use the +O4 option, as follows:
cc +O4 sourcefile.c
Level 4 can potentially produce the greatest improvements in speed by performing optimizations across multiple object files. Level 4 does optimizations at link time, so compiles will be faster, but links will be longer.

Depending on the size and number of the modules, compiling at level 4 can consume a large amount of virtual memory. Level 4 may consume roughly 1.25 megabytes per 1000 lines of non-commented source. When you use level 4 on a large application, it is a good idea to increase the system swap space. For information on increasing system swap space, see the book Managing Systems and Workgroups.

Changing the Aggressiveness of Optimizations

At each level of optimization, you can control the aggressiveness of the optimizations performed.

Use the +Oconservative option at optimization level 2, 3, or 4 if you are not sure if your code conforms to standards. This option provides more safety.

Use the +Oaggressive option at optimization level 2, 3, or 4 for best performance when you are willing to risk changes to the behavior of your programs. Using the +Oaggressive option can cause your program to have compilation or run-time problems that require troubleshooting.

Enabling Only Conservative Optimizations

You can enable conservative optimizations at the second, third, or fourth optimization levels by using the +Oconservative option, as follows:
cc +O2 +Oconservative sourcefile.c
or:
cc +O3 +Oconservative sourcefile.c
or:
cc +O4 +Oconservative sourcefile.c
Conservative optimizations are optimizations that do not change the behavior of code, in most cases, even if the code does not conform to standards.

Use the conservative optimizations provided with level 2, 3, and 4 when your code is non-ANSI.

Enabling Aggressive Optimizations

To enable aggressive optimizations at the second, third, or fourth optimization levels, use the +Oaggressive option as follows:
cc +O2 +Oaggressive sourcefile.c
or:
cc +O3 +Oaggressive sourcefile.c
or:
cc +O4 +Oaggressive sourcefile.c
Aggressive optimizations are new optimizations or are optimizations that can change the behavior of programs. The advanced options that +Oaggressive invokes are listed in Table 9: HP C Optimization Parameters.

Use aggressive optimizations with stable, well-structured, ANSI-conforming code. These types of optimizations give you faster code, but are riskier than the default optimizations.

Removing Compilation Time Limits When Optimizing

You can remove optimization time restrictions at the second, third, or fourth optimization levels by using the +Onolimit option as follows:
cc +O2 +Onolimit sourcefile.c
or:
cc +O3 +Onolimit sourcefile.c
or:
cc +O4 +Onolimit sourcefile.c
By default, the optimizer limits the amount of time spent optimizing large programs at levels 2, 3, and 4. Use this option if longer compile times and greater virtual memory use are acceptable because you want additional optimizations to be performed.

Limiting the Size of Optimized Code

You can disable optimizations that expand code size at the second, third, and fourth optimization levels by using the +Osize option, as follows:
cc +O2 +Osize sourcefile.c
or:
cc +O3 +Osize sourcefile.c
or:
cc +O4 +Osize sourcefile.c
Most optimizations improve execution speed and decrease executable code size. A few optimizations significantly increase code size to gain execution speed. The +Osize option disables these code-expanding optimizations.

Use this option if you have limited main memory, swap space, or disk space.

Specifying Maximum Optimization

To get maximum optimization, use:
cc +Oall sourcefile.c
The +Oall option performs the maximum optimization.

Use +Oall with stable, well-structured, ANSI-conforming code. These types of optimizations give you the fastest code, but are riskier than the default optimizations.

You can use +Oall at optimization levels 2, 3, and 4. The default is +Onoall.

The +Oall option by itself (specified without the +O2, +O3, or +O4 options) combines the +O4 +Oaggressive +Onolimit options. This combination performs aggressive optimizations with unrestricted compile time at the highest level of optimization.

Combining Optimization Parameters

You can combine optimization parameters that affect code size, compile-time, and the aggressiveness of the optimizations with a level of optimization.

For example, to specify conservative optimizations at level 2 and disable code-expanding optimizations, use:

cc +O2 +Oconservative +Osize sourcefile.c
+Olimit and +Osize can be used with either +Oaggressive or +Oconservative.

You cannot use +Oaggressive with +Oconservative.

Summary of Optimization Parameters

Table 9: HP C Optimization Parameters summarizes the HP C optimization parameters:
 

Table 9: HP C Optimization Parameters

Option: +O[no]aggressive
What It Does: The +O[no]aggressive option enables optimizations that can result in significant performance improvement, but that can change a program's behavior. These optimizations include newly released optimizations and the optimizations invoked by the following advanced optimization options:

  • +Osignedpointers
  • +Oregionsched
  • +Oentrysched
  • +Onofltacc
  • +Olibcalls
  • +Onoinitcheck
  • +Ovectorize

See Controlling Specific Optimizer Features for details about advanced optimization options. The default is +Onoaggressive.
Level of Opt: 2, 3, 4

Option: +O[no]all
What It Does: The +Oall option performs maximum optimization, including aggressive optimizations and optimizations that can significantly increase compile time and memory usage. The default is +Onoall.
Level of Opt: 4

Option: +O[no]conservative
What It Does: The +O[no]conservative option causes the optimizer to make conservative assumptions about the code when optimizing it. Use +Oconservative when conservative assumptions are necessary due to the coding style, as with non-standard conforming programs. The +Oconservative option relaxes the optimizer's assumptions about the target program. The default is +Onoconservative.
Level of Opt: 2, 3, 4

Option: +O[no]info
What It Does: +Oinfo displays informational messages about the optimization process. This option supports the core optimization levels, and therefore, can be used at levels 0-4. The default is +Onoinfo.
Level of Opt: 0, 1, 2, 3, 4

Option: +O[no]limit
What It Does: The +Olimit option suppresses optimizations that significantly increase compile-time or that can consume a lot of memory. The +Onolimit option allows optimizations to be performed regardless of their effect on compile-time or memory usage. The default is +Olimit.
Level of Opt: 2, 3, 4

Option: +O[no]size
What It Does: The +Osize option suppresses optimizations that significantly increase code size. The +Onosize option does not prevent optimizations that can increase code size. The default is +Onosize.
Level of Opt: 2, 3, 4

Profile-Based Optimization

Profile-based optimization (PBO) is a set of performance-improving code transformations based on the run-time characteristics of your application.

There are three steps involved in performing this optimization:

  1. Instrumentation - Insert data collection code into the object program.
  2. Data Collection - Run the program with representative data to collect execution profile statistics.
  3. Optimization - Generate optimized code based on the profile data.
Invoke profile-based optimization through HP C by using any level of optimization and the +I and +P options on the cc command line.

When you use PBO, compile times are faster and link times are slower because code generation happens at link time.
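
To make the three steps more concrete, here is a hypothetical fragment of the kind of code that profile data describes well: a loop whose branch goes one way far more often than the other. The names classify and count_printable are assumptions made for this illustration and do not come from any HP example.

```c
#include <stdio.h>

/* Hypothetical routine: most input characters are expected to be
 * printable (the frequently executed path); anything else takes the
 * rarely executed path. */
int classify(int ch, long *printable, long *other)
{
    if (ch >= 32 && ch < 127) {      /* frequently executed path */
        ++*printable;
        return 1;
    }
    ++*other;                        /* infrequently executed path */
    return 0;
}

/* Reads a stream and tallies its characters. Running an instrumented
 * build of a loop like this on representative input records how often
 * each branch in classify() is taken. */
long count_printable(FILE *in)
{
    long printable = 0, other = 0;
    int ch;

    while ((ch = getc(in)) != EOF)
        classify(ch, &printable, &other);
    return printable;
}
```

How much a given program gains depends on how closely the data collection runs resemble real use.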

Instrumenting the Code

To instrument your program, use the +I option as follows:
cc -Aa +I -O -c sample.c            Compile for instrumentation.
cc -o sample.exe +I -O sample.o     Link to make instrumented executable.
The first command line uses the -O option to perform level 2 optimization and instruments the code. The -c option in the first command line suppresses linking and creates an intermediate object file called sample.o. The .o file can be used later in the optimization phase, avoiding a second compile.

The second command line uses the -o option to link sample.o into sample.exe. The +I option instruments sample.exe with data collection code. Note that instrumented programs run slower than non-instrumented programs. Only use instrumented code to collect statistics for profile-based optimization.

Collecting Data for Profiling

To collect execution profile statistics, run your instrumented program with representative data as follows:
sample.exe < input.file1     Collect execution profile data.
sample.exe < input.file2
This step creates and logs the profile statistics to a file, by default called flow.data. You can use this data collection file to store the statistics from multiple test runs of different programs that you may have instrumented.

Performing Profile-Based Optimization

To optimize the program based on the previously collected run-time profile statistics, relink the program as follows:
cc -o sample.exe +P -O sample.o
An alternative to this procedure is to recompile the source file in the optimization step:
cc -o sample.exe +I -O sample.c     instrumentation
sample.exe < input.file1            data collection
cc -o sample.exe +P -O sample.c     optimization

Maintaining Profile Data Files

Profile-based optimization stores execution profile data in a disk file. By default, this file is called flow.data and is located in your current working directory.

You can override the default name of the profile data file. This is useful when working on large programs or on projects with many different program files.

You can use the FLOW_DATA environment variable to specify the name of the profile data file with either the +I or +P options. You can use the +df command-line option to specify the name of the profile data file with the +P option.

The +df option takes precedence over the FLOW_DATA environment variable.

In the following example, the FLOW_DATA environment variable is set to override the flow.data file name. The profile data is stored instead in /users/profiles/prog.data.

% setenv FLOW_DATA /users/profiles/prog.data
% cc -Aa -c +I +O3 sample.c
% cc -o sample.exe +I +O3 sample.o
% sample.exe < input.file1
% cc -o sample.exe +P +O3 sample.o
In the next example, the +df option uses /users/profiles/prog.data to override the flow.data file name.
% cc -Aa -c +I +O3 sample.c
% cc -o sample.exe +I +O3 sample.o
% sample.exe < input.file1
% mv flow.data /users/profiles/prog.data
% cc -o sample.exe +df /users/profiles/prog.data +P +O3 sample.o

Maintaining Instrumented and Optimized Program Files

You can maintain both instrumented and optimized versions of a program. You might keep an instrumented version of the program on hand for development use, and several optimized versions on hand for performance testing and program distribution.

Care must be taken when maintaining different versions of the executable file because the instrumented program file name is used as the key identifier when storing execution profile data in the data file.

The optimizer must know what this key identifier name is in order to find the execution profile data. By default, the key identifier name used to retrieve the profile data is the instrumented program file name used to run the program for data collection.

When you optimize a program file and the optimized program file name is different from the instrumented program file name, you must use the +pgm option. Specify the instrumented program file name with this option. The optimizer uses this value as the key identifier to retrieve execution profile data.

In the following example, the instrumented program file name is sample.inst. The optimized program file name is sample.opt. The +pgm name option is used to pass the instrumented program name to the optimizer:

% cc -Aa -c +I +O3 sample.c
% cc -o sample.inst +I +O3 sample.o
% sample.inst < input.file1
% cc -o sample.opt +P +O3 +pgm sample.inst sample.o

Profile-Based Optimization Notes

For more information on profile-based optimization, see the HP-UX Linker and Libraries Online User Guide.

+Oprofile, option for Profile Based Optimization

This release of the HP C compiler has new profile-based optimization options that improve usability in some common scenarios. Unlike the +I, +P, and +O4 options, the options below do not produce ISOMs (intermediate-code .o files). They therefore rebuild faster than the ISOM-building options, but the program cannot simply be relinked in the +P phase from ISOMs built in the +I phase. The new options also do not support cross-module optimization with the +O4 option. PBO build processes that do not rebuild from source will not work with these new options, but processes that currently use scripts to run ld -r commands on every ISOM to convert it to a SOM can use the cc driver with these new options instead of the scripts.

The new options:

The above new options correspond to (though building SOMs instead of ISOMs):
NOTE
  • The new options take effect only with -c (compile only); otherwise, optimization is performed as before.
  • The new options are available only at optimization levels below +O4; at +O4, +I or +P is used.
  • Old and new options cannot be mixed on the same command line. For example, +Oprofile is incompatible with +I, +P, and +df on the same command line.

Controlling Specific Optimizer Features

Most of the time, specifying optimization level 1, 2, 3, or 4 should provide you with the control over the optimizer that you need. Additional parameters are provided when you require a finer level of control.

At each level, you can turn on and off specific optimizations using the +O[no]optimization option. The optimization parameter is the name of a specific optimization technique. The optional prefix [no] disables the specified optimization.

Detailed information on each advanced optimizer option follows:

+Olevel=name1[,name2,...nameN]

Optimization levels: 1, 2, 3, 4

Default: All functions are optimized at the level specified by the ordinary +Olevel option.

This option lowers optimization to the specified level for one or more named functions. level can be 0, 1, 2, 3, or 4. The name parameters are names of functions in the module being compiled. Use this option when one or more functions do not optimize well or properly. It must be used with an ordinary +Olevel option.

This option works the same as the OPT_LEVEL pragma described under Optimizer Control Pragmas. This option overrides the OPT_LEVEL pragma for the specified functions. As with the pragma, you can only lower the level of optimization; you cannot raise it above the level specified in the ordinary +Olevel option. To avoid confusion, it is best to use either this option or the OPT_LEVEL pragma rather than both.

Examples

The following command optimizes all functions at level 3, except for the functions myfunc1 and myfunc2, which it optimizes at level 1.

$ cc +O3 +O1=myfunc1,myfunc2 funcs.c main.c

The following command optimizes all functions at level 2, except for the functions myfunc1 and myfunc2, which it optimizes at level 0.

$ cc -O +O0=myfunc1,myfunc2 funcs.c main.c

+O[no]autopar

See +O[no]autopar.

+O[no]dataprefetch

Default: +Onodataprefetch

When +Odataprefetch is enabled, the optimizer inserts instructions within innermost loops to explicitly prefetch data from memory into the data cache. Data prefetch instructions will be inserted only for data structures referenced within innermost loops using simple loop varying addresses (that is, in a simple arithmetic progression). It is only available for PA-RISC 2.0 targets.

The math library contains special prefetching versions of vector routines. If you have a PA-RISC 2.0 application that contains operations on arrays larger than 1 megabyte in size, using +Ovectorize in conjunction with +Odataprefetch may improve performance substantially.

Use this option for applications that have high data cache miss overhead.

+O[no]dynsel

See +O[no]dynsel.

+O[no]entrysched

Optimization levels: 1, 2, 3, 4

Default: +Onoentrysched

The +Oentrysched option optimizes instruction scheduling on a procedure's entry and exit sequences. Enabling this option can speed up an application. The option has undefined behavior for applications which handle asynchronous interrupts by examining the sigcontext values of caller stack operands. The option affects unwinding in the entry and exit regions.

At optimization level +O2 and higher (using data flow information), save and restore operations become more efficient.

This option can change the behavior of programs that perform stack unwind-based exception handling or asynchronous interrupt handling. The behavior of setjmp() and longjmp() is not affected.

+O[no]extern[=name1,name2,...nameN]

Optimization levels: 0, 1, 2, 3, 4

Default: +Oextern

This option is available in the LP64 data model only.

The +O[no]extern option allows you to specify which accesses to symbols in an executable or shared library (a load module) can be optimized. Use of +Onoextern creates code that cannot be included in a shared library. Use +Onoextern only to build executables. Only internal symbols (defined in the load module) can be optimized.

If +Onoextern is specified without a name list, the compiler assumes that no symbols are external to the load module being compiled, and any symbol can be optimized. If +Oextern is specified without a name list, the compiler assumes that all symbols are external to the load module being compiled and thus cannot be optimized; this is the default.

If +Oextern is specified with a name list, the compiler treats the specified symbols as external even if +Onoextern without a name list is in effect. The following example indicates that foo and bar are to eventually be imported from another load module (for example, a shared library); all other functions and data items will not be external, since +Onoextern is specified.

+Oextern=foo,bar +Onoextern

When +Onoextern is specified with a name list, the compiler treats the specified symbols as internal even if +Oextern without a name list is in effect. The following example indicates that references to baz and x may be optimized for access in the local load module. All other symbols will be subject to resolution to another load module since +Oextern is the default.

+Onoextern=baz,x

Use this option to precisely control which symbols' accesses may be optimized. Knowledge of the shared libraries used by an application, or the exported interface of a shared library is required.

See also the HP_DEFINED_EXTERNAL pragma. The default is +Oextern with no name list.

+O[no]fail_safe

Optimization levels: 1, 2, 3, 4

Default: +Ofail_safe

The +Ofail_safe option allows compilations with internal optimization errors to continue by issuing a warning message and restarting the compilation at +O0.

You can use +Onofail_safe at optimization levels 1, 2, 3, or 4 when you want the internal optimization errors to abort your build.

This option is disabled when compiling for parallelization.

+O[no]fastaccess

Optimization levels: 0, 1, 2, 3, 4

Default: +Onofastaccess at optimization levels 0, 1, 2 and 3, +Ofastaccess at optimization level 4

The +Ofastaccess option optimizes for fast access to global data items.

Use +Ofastaccess to improve execution speed at the expense of longer compile times.

+O[no]fltacc

Optimization levels: 2, 3, 4

The +Onofltacc option allows the compiler to perform floating-point optimizations that are algebraically correct but that may result in numerical differences. For example, this option may change the order of expression evaluation: if a, b, and c are floating-point variables, the expressions (a + b) + c and a + (b + c) may give slightly different results due to rounding. In general, these differences will be insignificant.
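
The effect of regrouping can be seen without any optimizer by writing both groupings by hand. The sketch below is a deliberately extreme case, not typical of the small differences described above; the function names sum_left and sum_right are hypothetical.

```c
/* Evaluates (a + b) + c. The volatile temporary forces the
 * intermediate sum to be rounded to double in the stated order. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Evaluates a + (b + c), the regrouped form. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 1.0e20, b = -1.0e20, and c = 1.0, sum_left returns 1.0 because the two large values cancel first, while sum_right returns 0.0 because adding 1.0 to -1.0e20 is lost to rounding. Code that depends on one particular grouping is the kind that calls for +Ofltacc.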

The +Onofltacc option also enables the optimizer to generate fused multiply-add (FMA) instructions, FMPYFADD and FMPYNFADD. These instructions improve performance but occasionally produce results that may differ from results produced by code without FMA instructions. In general, the differences are slight. FMA instructions are only available on PA-RISC 2.0 systems.

Specifying +Ofltacc disables the generation of FMA instructions as well as some other floating-point optimizations. Use +Ofltacc if it is important that the compiler evaluate floating-point expressions as it does in unoptimized code. The +Ofltacc option does not allow any optimization that changes the order of expression evaluation and could therefore affect the result.

If you are optimizing code at level 2 or higher and do not specify +Onofltacc or +Ofltacc, the optimizer will use FMA instructions, but will not perform floating-point optimizations that involve expression reordering or other optimizations that potentially impact numerical stability.

The list below identifies the different actions taken by the optimizer according to whether you specify +Ofltacc, +Onofltacc, or neither option.

Optimization        Expression       FMA?
Options             Reordering?
 
+O2                 No               Yes
+O2 +Ofltacc        No               No
+O2 +Onofltacc      Yes              Yes

+O[no]global_ptrs_unique[=name1,name2,...nameN]

Optimization levels: 2, 3, 4

Default: +Onoglobal_ptrs_unique

Use this option to identify unique global pointers, so that the optimizer can generate more efficient code in the presence of unique pointers, for example by using copy propagation and common sub-expression elimination. A global pointer is unique if it does not alias with any variable in the entire program.

This option supports a comma-separated list of unique global pointer variable names.
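
The following sketch suggests why uniqueness matters. The names counter, gp, and read_twice are hypothetical. Because the store through gp might modify counter, a compiler that cannot rule out aliasing has to reload counter for the second read.

```c
int counter;   /* ordinary global variable */
int *gp;       /* global pointer; hypothetical name */

/* Reads counter, stores through gp, then reads counter again.
 * If gp is known never to alias any variable, the second read can
 * reuse the first value; otherwise it must be reloaded from memory. */
int read_twice(void)
{
    int first = counter;

    *gp = 99;
    return first + counter;
}
```

Following the syntax shown above, a command such as cc +O2 +Oglobal_ptrs_unique=gp file.c (file name assumed) would assert that gp is unique. That assertion is only safe if gp never points at counter or any other variable in the program; if gp were set to &counter, the reload would be required for a correct result.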

+O[no]initcheck

Optimization levels: 2, 3, 4

Default: unspecified

The initialization checking feature of the optimizer has three possible states: on, off, or unspecified. When on (+Oinitcheck), the optimizer initializes to zero any local, scalar, non-static variables that are uninitialized with respect to at least one path leading to a use of the variable.

When off (+Onoinitcheck), the optimizer issues warning messages when it discovers definitely uninitialized variables, but does not initialize them.

When unspecified, the optimizer initializes to zero any local, scalar, non-static variables that are definitely uninitialized with respect to all paths leading to a use of the variable.

Use +Oinitcheck to look for variables in a program that may not be initialized.
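
As an illustration of "uninitialized with respect to at least one path," consider the hypothetical functions below. In scaled_unsafe, the local variable scale is assigned on only one path, so reading it when use_scale is zero is an error in the source. scaled_fixed shows the portable repair, an explicit initializer, which does not depend on any optimizer setting.

```c
/* Hypothetical example: 'scale' is set on only one path. */
int scaled_unsafe(int value, int use_scale)
{
    int scale;                /* local, scalar, non-static */

    if (use_scale)
        scale = 4;            /* initialized on this path only */

    /* When use_scale is 0, 'scale' is read without ever being set. */
    return value * scale;
}

/* Portable repair: give the variable a defined value on every path. */
int scaled_fixed(int value, int use_scale)
{
    int scale = 1;

    if (use_scale)
        scale = 4;
    return value * scale;
}
```

Many compilers also warn about the first form, so treating such warnings as defects is usually the simplest policy.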

+O[no]inline[=name1, name2,...nameN]

Optimization levels: 3, 4

Default: +Oinline

When +Oinline is specified without a name list, any function can be inlined. For inlining to succeed, function calls must follow the prototype definitions in the appropriate header file.

When specified with a name list, the named functions are important candidates for inlining. For example, saying

+Oinline=foo,bar +Onoinline
indicates that inlining be strongly considered for foo and bar; all other routines will not be considered for inlining, since +Onoinline is given.

When this option is disabled with a name list, the compiler will not consider the specified routines as candidates for inlining. For example, saying

+Onoinline=baz,x
indicates that inlining should not be considered for baz and x; all other routines will be considered for inlining, since +Oinline is the default.

The +Onoinline option disables inlining for all functions or a specific list of functions.

Use this option when you need to precisely control which subprograms are inlined.
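
A typical inlining candidate is a small function called from inside a loop. The sketch below uses hypothetical names (square, sum_squares) and declares the prototype before any call, in line with the advice above about prototypes.

```c
/* Prototype visible before any call. */
static int square(int x);

/* Small, frequently called function: a natural inlining candidate. */
static int square(int x)
{
    return x * x;
}

/* Calls square() once per element. If the call is inlined, the loop
 * body becomes a plain multiply with no call overhead. */
long sum_squares(const int *v, int n)
{
    long total = 0;
    int i;

    for (i = 0; i < n; i++)
        total += square(v[i]);
    return total;
}
```

Following the syntax shown above, cc +O3 +Oinline=square +Onoinline file.c (file name assumed) would ask the compiler to strongly consider inlining square and nothing else.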

+Oinline_budget=n

Optimization levels: 3, 4

Default: +Oinline_budget=100

n is an integer in the range 1 to 1000000 that specifies the level of aggressiveness.

The +Onolimit and +Osize options also affect inlining. Specifying the +Onolimit option has the same effect as specifying +Oinline_budget=200. The +Osize option has the same effect as +Oinline_budget=1.

Note, however, that the +Oinline_budget=n option takes precedence over both of these options. This means that you can override the effect of +Onolimit or +Osize option on inlining by specifying the +Oinline_budget=n option on the same compile line.

+O[no]libcalls

Optimization levels: 0, 1, 2, 3, 4

Default: +Onolibcalls

Use the +Olibcalls option to increase the runtime performance of code which calls standard library routines in simple contexts. The +Olibcalls option expands the following library calls inline:

Inlining will take place only if the function call follows the prototype definition in the appropriate header file. Fast subprogram linkage is also emitted to tuned millicode versions of the math library functions sin, cos, tan, atan2, log, pow, asin, acos, atan, exp, and log10. (See the HP-UX Floating-Point Guide for the most up-to-date listing of the math library functions.) The calling code must not expect to access errno after the function's return.

A single call to printf() may be replaced by a series of calls to putchar(). Calls to sprintf() and strlen() may be optimized more effectively, including elimination of some calls producing unused results. Calls to setjmp() and longjmp() may be replaced by their equivalents _setjmp() and _longjmp(), which do not manipulate the process's signal mask.
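
To picture the printf() replacement mentioned above, the two hypothetical functions below produce the same output. This is a hand-written equivalent for illustration only, not the code the compiler actually emits.

```c
#include <stdio.h>

/* One formatted-output call with a constant string and no conversions. */
int report_printf(void)
{
    return printf("ok\n");
}

/* Hand-written equivalent: the same characters emitted one at a time.
 * Returns the number of characters successfully written. */
int report_putchar(void)
{
    int n = 0;

    if (putchar('o') != EOF) n++;
    if (putchar('k') != EOF) n++;
    if (putchar('\n') != EOF) n++;
    return n;
}
```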

Use +Olibcalls to improve the performance of selected library routines only when you are not performing error checking for these routines.

Using +Olibcalls with +Ofltacc will give different floating point calculation results than those given using +Ofltacc without +Olibcalls.

The +Olibcalls option replaces the obsolete -J option.

+O[no]loop_block

See +O[no]loop_block.

+O[no]loop_transform

Optimization levels: 3, 4

Default: +Oloop_transform

The +O[no]loop_transform option enables [disables] transformation of eligible loops for improved cache performance. The most important transformation is the reordering of nested loops to make the inner loop unit stride, resulting in fewer cache misses.
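
The loop reordering described above can be sketched by hand. C stores two-dimensional arrays row by row, so the two hypothetical functions below compute the same total but touch memory in different orders. Integer data is used so that the order of additions cannot change the result.

```c
#define ROWS 64
#define COLS 64

/* Before: the inner loop walks down a column, so consecutive
 * iterations touch elements COLS ints apart (non-unit stride). */
long sum_inner_column(int a[ROWS][COLS])
{
    long total = 0;
    int i, j;

    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            total += a[i][j];
    return total;
}

/* After interchange: the inner loop walks along a row, so consecutive
 * iterations touch adjacent elements (unit stride). */
long sum_inner_row(int a[ROWS][COLS])
{
    long total = 0;
    int i, j;

    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            total += a[i][j];
    return total;
}
```

The array size and function names are assumptions; whether the optimizer performs this interchange on a given loop depends on its own eligibility analysis.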

+Onoloop_transform may be a helpful option if you experience any problem while using +Oparallel.

+O[no]loop_unroll[=unroll factor]

Optimization levels: 2, 3, 4

Default: +Oloop_unroll

The +Oloop_unroll option turns on loop unrolling. When you use +Oloop_unroll, you can also use the unroll factor to control the code expansion. The default unroll factor is 4, that is, four copies of the loop body. By experimenting with different factors, you may improve the performance of your program.
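To show what an unroll factor of 4 means, the sketch below unrolls a simple summation loop by hand. It is an illustration only: the function names are hypothetical, and the code the optimizer actually generates may differ (for example, in how it handles leftover iterations).

```c
/* Original loop: one element per iteration. */
long sum_rolled(const int *v, int n)
{
    long total = 0;
    int i;

    for (i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Unrolled by a factor of 4: four copies of the loop body per iteration,
 * followed by a cleanup loop for the remaining 0 to 3 elements. */
long sum_unrolled4(const int *v, int n)
{
    long total = 0;
    int i = 0;

    for (; i + 3 < n; i += 4) {
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    for (; i < n; i++)      /* leftover iterations */
        total += v[i];
    return total;
}
```

The unrolled form executes fewer branch and loop-counter instructions per element, at the cost of a larger code size.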

+O[no]loop_unroll_jam

See +O[no]loop_unroll_jam.

+O[no]moveflops

Optimization levels: 2, 3, 4

Default: +Omoveflops

Allows [or disallows] moving conditional floating point instructions out of loops. The +Onomoveflops option replaces the obsolete +OE option. The behavior of floating-point exception handling may be altered by this option.

Use +Onomoveflops if floating-point traps are enabled and you do not want the behavior of floating-point exceptions to be altered by the relocation of floating-point instructions.
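The following sketch illustrates the kind of relocation this option controls. It is a hand-written approximation, and the function names sum_guarded and sum_hoisted are assumptions. In the first function the division executes only when a guard is true; in the second it has been moved out of the loop and always executes. The two return the same value, but if b is zero and floating-point traps are enabled, only the second can trap when no guard is ever true.

```c
/* Division is conditional: it executes only for iterations where
 * use[i] is nonzero. */
double sum_guarded(const int *use, int n, double a, double b)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        if (use[i])
            sum += a / b;
    }
    return sum;
}

/* Division moved out of the loop: it now executes exactly once,
 * unconditionally, even if no use[i] is ever nonzero. */
double sum_hoisted(const int *use, int n, double a, double b)
{
    double quotient = a / b;
    double sum = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        if (use[i])
            sum += quotient;
    }
    return sum;
}
```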

+O[no]multiprocessor

Optimization levels: 2, 3, 4

Default: +Onomultiprocessor

If +Omultiprocessor is specified, the compiler performs optimizations appropriate for executables or shared libraries that run in several different processes on multiprocessor machines.

If you enable this option inappropriately (for example, for an executable that is only run on a uniprocessor system), performance may be degraded.

+O[no]parallel

See +O[no]parallel.

+O[no]parallel_env

Need to add information on this option.

+O[no]parmsoverlap

Optimization levels: 2, 3, 4

Default: +Oparmsoverlap

The +Oparmsoverlap option optimizes with the assumption that the actual arguments of function calls overlap in memory.

The +Onoparmsoverlap option replaces the obsolete +Om1 option.

Use +Onoparmsoverlap if C programs have been literally translated from FORTRAN programs.
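As an illustration of overlapping actual arguments, consider the sketch below; the function name add_one is hypothetical. When the caller passes pointers into the same array, each store through out changes a value that a later iteration reads through in. Code like this depends on the default +Oparmsoverlap assumption and would presumably not be safe to compile with +Onoparmsoverlap.

```c
/* Stores in[i] + 1 into out[i] for i = 0 .. n-1, strictly in order.
 * Nothing here promises that out and in refer to different memory. */
void add_one(int *out, const int *in, int n)
{
    int i;

    for (i = 0; i < n; i++)
        out[i] = in[i] + 1;
}
```

For example, given int a[4] = {5, 0, 0, 0}, the call add_one(a + 1, a, 3) passes overlapping arguments and leaves a holding 5, 6, 7, 8, because each iteration reads the element that the previous iteration wrote.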

+O[no]pipeline

Optimization levels: 2, 3, 4

Default: +Opipeline

Enables [or disables] software pipelining. The +Onopipeline option replaces the obsolete +Os option.

Use +Onopipeline to conserve code space.

+O[no]procelim

Optimization levels: 0, 1, 2, 3, 4

Default: +Onoprocelim at levels 0-3, +Oprocelim at level 4

When +Oprocelim is specified, procedures that are not referenced by the application are eliminated from the output executable file. The +Oprocelim option reduces the size of the executable file, especially when optimizing at levels 3 and 4, at which inlining may have removed all of the calls to some routines.

When you specify +Onoprocelim, procedures that are not referenced by the application are not eliminated from the output executable file.

The default is +Onoprocelim at levels 0-3, and +Oprocelim at level 4.

If the +Oall option is enabled, the +Oprocelim option is enabled.

+O[no]promote_indirect_calls

Optimization levels: 3, 4 and profile-based optimization

Default: +Onopromote_indirect_calls

This option uses profile data from profile-based optimization and other information to determine the most likely target of indirect calls and promotes them to direct calls. In all cases the optimized code tests to make sure the direct call is being taken and, if not, executes the indirect call. If +Oinline is in effect, the optimizer may also inline the promoted calls. This option can only be used with profile-based optimization, described in Profile-Based Optimization.

The optimizer tries to determine the most likely target of indirect calls. If the profile data is incomplete or ambiguous, the optimizer may not select the best target. If this happens, your code's performance may decrease.

At +O3, this option is only effective if indirect calls from functions within a file are mostly to target functions within the same file. This is because +O3 optimizes only within a file whereas +O4 optimizes across files.
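The effect of promotion can be sketched at the source level as follows. This is an approximation written by hand, not compiler output; the names handler_fn, double_it, negate_it, dispatch_indirect, and dispatch_promoted are assumptions, and double_it stands in for whatever target the profile data shows to be most likely.

```c
typedef int (*handler_fn)(int);

int double_it(int x) { return 2 * x; }   /* assumed most likely target */
int negate_it(int x) { return -x; }      /* a less frequent target */

/* Before promotion: every call goes through the pointer. */
int dispatch_indirect(handler_fn fn, int x)
{
    return fn(x);
}

/* After promotion: test for the likely target and call it directly
 * (making it a candidate for inlining); otherwise fall back to the
 * original indirect call. */
int dispatch_promoted(handler_fn fn, int x)
{
    if (fn == double_it)
        return double_it(x);
    return fn(x);
}
```

If the profile data picks the wrong target, the test fails on most calls and the extra comparison is pure overhead, which is why incomplete profile data can reduce performance.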

+O[no]ptrs_ansi

Optimization levels: 2, 3, 4

Default: +Onoptrs_ansi

Use +Optrs_ansi to make the following two assumptions, which the more aggressive +Optrs_strongly_typed does not make:

When both are specified, +Optrs_ansi takes precedence over +Optrs_strongly_typed.

For more information about type aliasing see Aliasing Options.

    +O[no]ptrs_strongly_typed

    Optimization levels: 2, 3, 4

    Default: +Onoptrs_strongly_typed

    Use +Optrs_strongly_typed when pointers are type-safe. The optimizer can use this information to generate more efficient code.

    Type-safe (that is, strongly-typed) pointers are pointers to a specific type that only point to objects of that type, and not to objects of any other type. For example, a pointer declared as a pointer to an int is considered type-safe if that pointer points to an object only of type int, but not to objects of any other type.

    Based on the type-safe concept, a set of groups are built based on object types. A given group includes all the objects of the same type.

    Type-inferred aliasing means that any pointer of a type in a given group (of objects of the same type) can point only to an object from the same group; it cannot point to a typed object from any other group.

    For more information about type aliasing see Aliasing Options .

    Type casting to a different type violates type-inferred aliasing rules. See Example 2 below.

    Dynamic casting is allowed. See Example 3 below.

    For more details, see Aliasing Options .

    Example 1: How Data Types Interact

    The optimizer generally spills all global data from registers to memory before any modification to global variables or any loads through pointers. However, you can instruct the optimizer on how data types interact so it can generate more efficient code.

    If you have the following:

    1  int *p;
    2  float *q;
    3  int a,b,c,i;
    4  float d,e,f;
    5  foo()
    6  {
    7    for (i=1;i<10;i++) {
    8              d=e;
    9             *p=b;
    10             e=d+f;
    11             f=*q;
    12   }
    13 }
    With +Optrs_strongly_typed turned on, the pointers p and q are assumed to be disjoint because the types they point to are different. Without type-inferred aliasing, the store through *p is assumed to invalidate all the definitions, so d and f, which are used on line 10, have to be loaded from memory. With type-inferred aliasing, the optimizer can propagate the copies of d and f and thus avoid two loads and two stores.

    This option can be used for any application involving the use of pointers, where those pointers are type safe. To specify when a subset of types are type-safe, use the [NO]PTRS_STRONGLY_TYPED pragma. The compiler issues warnings for any incompatible pointer assignments that may violate the type-inferred aliasing rules discussed in Aliasing Options .

    Example 2: Unsafe Type Cast

    Any type cast to a different type violates type-inferred aliasing rules. Do not use +Optrs_strongly_typed with code that has these unsafe type casts. Use the [NO]PTRS_STRONGLY_TYPED pragma to prevent the application of type-inferred aliasing to the unsafe type casts.

    struct foo{
      int a;
      int b;
    } *P;
     
    struct bar {
      float a;
      int b;
      float c;
    } *q;
     
    P = (struct foo *) q;
      /* Incompatible pointer assignment
      through type cast */
    Example 3: Generally Applying Type Aliasing

    Dynamic cast is allowed with +Optrs_strongly_typed or +Optrs_ansi. A pointer dereference is called dynamic cast if a cast is applied on the pointer to a different type.

    In the example below, type-inferred aliasing is applied on P generally, not just to the particular dereference. Type-aliasing will be applied to any other dereferences of P.

    struct s {
      short int a;
      short int b;
      int c;
    } *P;
    * (int *)P = 0;
    For more information about type aliasing, see Aliasing Options .

    +O[no]ptrs_to_globals[=name1, name2, ...nameN]

    Optimization levels: 2, 3, 4

    Default: +Optrs_to_globals

    By default global variables are conservatively assumed to be modified anywhere in the program. Use this option to specify which global variables are not modified through pointers, so that the optimizer can make your program run more efficiently by incorporating copy propagation and common sub-expression elimination.

    This option can be used to specify all global variables as not modified via pointers, or to specify a comma-separated list of global variables as not modified via pointers.

    Note that the on state for this option disables some optimizations, such as aggressive optimizations on the program's global symbols.

    For example, use the command-line option +Onoptrs_to_globals=a,b,c to specify global variables a, b, and c as not being accessed through pointers. No pointer can access these global variables. The optimizer will perform copy propagation and constant folding because storing to *p will not modify a or b.

    int a, b, c;
    float *p;
    foo()
    {
       a = 10;
       b = 20;
      *p = 1.0;
       c = a + b;
    }
    If all global variables are unique, use the following option without listing the global variables:
    +Onoptrs_to_globals
    In the example below, the address of b is taken. This means b can be accessed indirectly through the pointer. You can still use +Onoptrs_to_globals as: +Onoptrs_to_globals +Optrs_to_globals=b.
    long b,c;
    long *p;
     
    p=&b;
     
    foo()
    For more information about type aliasing see Aliasing Options .

    +O[no]regionsched

    Optimization levels: 2, 3, 4

    Default: +Onoregionsched

    Applies aggressive scheduling techniques to move instructions across branches. This option is incompatible with the linker -z option. If used with -z, it may cause a SIGSEGV error at run-time.

    Use +Oregionsched to improve application run-time speed. Compilation time may increase.

    +Oreusedir=directory

    Optimization levels: 4 or with profile-based optimization

    Default: no reuse of object files

    This option specifies a directory where the linker can save object files created from intermediate object files when using +O4 or profile-based optimization. It reduces link time by not recompiling intermediate object files when they don't need to be.

    When you compile with +I, +P, or +O4, the compiler generates intermediate code in the object file. Otherwise, the compiler generates regular object code in the object file. When you link, the linker first compiles the intermediate object code to regular object code, then links the object code. With this option you can reduce link time on subsequent links by avoiding recompiling intermediate object files that have already been compiled to regular object code and have not changed.

    Note that when you do change a source file or command line options and recompile, a new intermediate object file will be created and compiled to regular object code in the specified directory. The previous object file in the directory will not be removed. You should periodically remove this directory since old object files cannot be reused and will not be automatically removed.

    +O[no]regreassoc

    Optimization levels: 2, 3, 4

    Default: +Oregreassoc

    Enables [or disables] register reassociation. Specifying +Onoregreassoc turns register reassociation off.

    Use +Onoregreassoc to disable register reassociation if this optimization hinders the optimized application performance.

    +O[no]report[=report_type]

    See +O[no]report[= report_type].

    +O[no]sharedgra

    See +O[no]sharedgra.

    +O[no]sideeffects[=name1, name2, ...nameN]

    Optimization levels: 2, 3, 4

    Default: assume all subprograms have side effects

    Assume that subprograms specified in the name list might modify global variables. Therefore, when +Osideeffects is enabled the optimizer limits global variable optimization.

    The default is to assume that all subprograms have side effects unless the optimizer can determine that there are none.

    Use +Onosideeffects if you know that the named functions do not modify global variables and you wish to achieve the best possible performance.

    +O[no]signedpointers

    Optimization levels: 0, 1, 2, 3, 4

    Default: +Onosignedpointers

    Perform [or do not perform] optimizations related to treating pointers as signed quantities. Applications that allocate shared memory and that compare a pointer to shared memory with a pointer to private memory may run incorrectly if this optimization is enabled.

    Use +Osignedpointers to improve application run-time speed.

    +O[no]static_prediction

    Optimization levels: 0, 1, 2, 3, 4

    Default: +Onostatic_prediction

    +Ostatic_prediction turns on static branch prediction for PA-RISC 2.0 targets.

    PA-RISC 2.0 has two means of predicting which way conditional branches will go: dynamic branch prediction and static branch prediction. Dynamic branch prediction uses a hardware history mechanism to predict future executions of a branch from its last three executions. It is transparent and quite effective unless the hardware buffers involved are overwhelmed by a large program with poor locality.

    With static branch prediction on, each branch is predicted based on implicit hints encoded in the branch instruction itself; the dynamic branch prediction is not used.

    Static branch prediction's role is to handle large codes with poor locality for which the small dynamic hardware facility will prove inadequate.

    Use +Ostatic_prediction to better optimize large programs with poor instruction locality, such as operating system and database code.

    Use this option only when using PBO, as an amplifier to +P. It is allowed but silently ignored with +I, so makefiles need not change between the +I and +P phases.

    +O[no]vectorize

    Optimization levels: 0, 1, 2, 3, 4

    Default: +Onovectorize

    +Ovectorize allows the compiler to replace certain loops with calls to vector routines.

    Use +Ovectorize to increase the execution speed of loops.

    When +Onovectorize is specified, loops are not replaced with calls to vector routines.

    Because the +Ovectorize option may change the order of operations in an application, it may also change the results of those operations slightly. See the HP-UX Floating-Point Guide for details.

    The math library contains special prefetching versions of vector routines. If you have a PA2.0 application that contains operations on very large arrays (larger than 1 megabyte in size), using +Ovectorize in conjunction with +Odataprefetch may improve performance substantially.

    You may use +Ovectorize at levels 3 and 4. +Onovectorize is also included as part of +Oaggressive and +Oall.

    This option is only valid for PA-RISC 1.1 and 2.0 systems.

    +O[no]volatile

    Optimization levels: 1, 2, 3, 4

    Default: +Onovolatile

    The +Ovolatile option implies that memory references to global variables cannot be removed during optimization.

    The +Onovolatile option implies that no globals are of volatile class. This means that references to global variables can be removed during optimization.

    The +Ovolatile option replaces the obsolete +OV option.

    Use this option to control the volatile semantics for all global variables.
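To show what volatile semantics protect, the sketch below applies the standard C volatile qualifier to a single global flag; +Ovolatile gives every global this treatment. The names stop_requested, request_stop, and wait_for_stop are hypothetical, and the use of SIGINT with raise() is only a convenient way to modify the flag from outside the normal flow of the function.

```c
#include <signal.h>

/* Because the flag is volatile, every test in the loop below must
 * reload it from memory; the read cannot be removed or kept in a
 * register across iterations. */
static volatile sig_atomic_t stop_requested = 0;

static void request_stop(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Installs the handler, delivers the signal, then polls the flag.
 * Returns the number of polls that found the flag still clear. */
int wait_for_stop(void)
{
    int polls = 0;

    stop_requested = 0;
    signal(SIGINT, request_stop);
    raise(SIGINT);                 /* handler runs before raise() returns */
    while (!stop_requested) {
        polls++;
    }
    signal(SIGINT, SIG_DFL);       /* restore the default disposition */
    return polls;
}
```

Without volatile semantics, an optimizer that sees no assignment to the flag inside the loop is free to test it once and never again.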

    +O[no]whole_program_mode

    Optimization level: 4

    Default: +Onowhole_program_mode

    The +Owhole_program_mode option enables the assertion that only the files that are compiled with this option directly reference any global variables and procedures that are defined in these files. In other words, this option asserts that there are no unseen accesses to the globals.

    When this assertion is in effect, the optimizer can hold global variables in registers longer and delete inlined or cloned global procedures.

    All files compiled with +Owhole_program_mode must also be compiled with +O4. If any of the files were compiled with +O4 but were not compiled with +Owhole_program_mode, the linker disables the assertion for all files in the program.

    The default, +Onowhole_program_mode, disables the assertion.

    Use this option to increase performance speed, but only when you are certain that only the files compiled with +Owhole_program_mode directly access any globals that are defined in these files.

    Using Advanced Optimization Options

    Several advanced optimization options can be specified on the same command line. For example, the following command line specifies aggressive level 3 optimizations with unrestricted compile time, disables software pipelining, and disables moving conditional floating-point instructions out of a loop:
    cc +O3 +Oaggressive +Onolimit +Onomoveflops +Onopipeline \
       sourcefile.c
    Specify the level of optimization first (+O1, +O2, +O3, or +O4), followed by any +O[no]optimization options.

    Level 1 Optimization Modules

    The level 1 optimization modules are branch optimization, dead code elimination, faster register allocation, instruction scheduling, and peephole optimizations.

    The examples in this section are shown at the source code level wherever possible. Transformations that cannot be shown at the source level are shown in assembly language. See Table 10: Descriptions of Assembly Language Instructions for descriptions of the assembly language instructions used.

    Branch Optimization

    The branch optimization module traverses the procedure and transforms branch instruction sequences into more efficient sequences where possible. Examples of possible transformations are:
              if(a) {
                .
                .
                .
                 statement 1
              } else {
                 goto L1;
              }
              statement 2
          L1:
    becomes:
              if(!a) {
                  goto L1;
              }
              statement 1
              statement 2
          L1:

    Dead Code Elimination

    The dead code elimination module removes unreachable code that is never executed.

    For example, the code:

         if(0) {
            a = 1;
         } else {
            a = 2;
         }
    becomes:
         a = 2;

    Faster Register Allocation

    The faster register allocation module, used with unoptimized code, analyzes register use faster than the coloring register allocator (a level 2 module).

    This module performs the following:

    Instruction Scheduler

    The instruction scheduler module performs the following: For example, the code:
         LDW     -52(0,sp),r1
         ADDI    3,r1,r31    ;interlock with load of r1
         LDI     10,r19
    becomes:
         LDW     -52(0,sp),r1
         LDI     10,r19
         ADDI    3,r1,r31    ;use of r1 is now separated from load

    Table 10: Descriptions of Assembly Language Instructions

    Instruction                    Description
    LDW offset(sr, base), target   Loads a word from memory into register target.
    ADDI const, reg, target        Adds the constant const to the contents of register reg and puts the result in register target.
    LDI const, target              Loads the constant const into register target.
    LDO const(reg), target         Adds the constant const to the contents of register reg and puts the result in register target.
    AND reg1, reg2, target         Performs a bitwise AND of the contents of registers reg1 and reg2 and puts the result in register target.
    COMIB,cond const, reg, lab     Compares the constant const to the contents of register reg and branches to label lab if the condition cond is true.
    BB,cond reg, num, lab          Tests the bit number num in the contents of register reg and branches to label lab if the condition cond is true.
    COPY reg, target               Copies the contents of register reg to register target.
    STW reg, offset(sr, base)      Stores the word in register reg to memory.

    Peephole Optimizations

    The peephole optimization process involves looking at small windows of machine code for optimization opportunities. Wherever possible, the peephole optimizer replaces assembly language instruction sequences with faster (usually shorter) sequences, and removes redundant register loads and stores.

    For example, the code:

         LDI     32,r3
         AND     r1,r3,r2
         COMIB,= 0,r2,L1
    becomes:
         BB,>=   r1, 26, L1
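The bit number 26 in the replacement instruction follows from the mask value 32, assuming the PA-RISC convention that bit 0 is the most significant bit of a 32-bit word. The sketch below works through that arithmetic; pa_bit_number and branch_taken are hypothetical helper names used only for this illustration.

```c
#include <stdint.h>

/* Returns the bit number, counting from 0 at the most significant bit
 * of a 32-bit word, of a mask with exactly one bit set.
 * Returns -1 if the mask does not have exactly one bit set. */
int pa_bit_number(uint32_t mask)
{
    int bit;

    if (mask == 0 || (mask & (mask - 1)) != 0)
        return -1;
    for (bit = 31; (mask & 1u) == 0; bit--)
        mask >>= 1;
    return bit;
}

/* Source-level meaning shared by both instruction sequences:
 * the branch to L1 is taken when the tested bit of r1 is clear. */
int branch_taken(uint32_t r1)
{
    return (r1 & 32u) == 0;
}
```

The three-instruction sequence builds the mask, applies it, and compares the result with zero; the single BB instruction tests the same bit directly.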

    Level 2 Optimization Modules

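    Level 2 performs optimizations within each procedure. At level 2, the optimizer performs all optimizations performed at the prior level, with the following additions: coloring register allocation; induction variables and strength reduction; local and global common subexpression elimination; constant folding and propagation; loop invariant code motion; store/copy optimization; unused definition elimination; software pipelining; and register reassociation.

    The examples in this section are shown at the source code level wherever possible. Transformations that cannot be shown at the source level are shown in assembly language.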

    Coloring Register Allocation

    The name of this optimization comes from the similarity to map coloring algorithms in graph theory. This optimization determines when and how long commonly used variables and expressions occupy a register. It minimizes the number of references to memory (loads and stores) a code segment makes. This can improve run-time speed.

    You can help the optimizer understand when certain variables are heavily used within a function by declaring these variables with the register qualifier. The first 10 register qualified variables encountered in the source are honored. You should pick the ten most important variables to be most effective.

    The coloring register allocator may override your choices and promote to a register a variable not declared register over one that is, based on estimated speed improvements.
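As a small illustration of the register qualifier, the sketch below marks the two most heavily used variables of a loop as register candidates. The function name dot_product is an assumption, and, as noted above, the allocator treats the qualifier as a hint that it may override.

```c
/* The accumulator and the loop index are referenced on every
 * iteration, so they are the natural candidates for registers. */
long dot_product(const int *a, const int *b, int n)
{
    register long sum = 0;
    register int i;

    for (i = 0; i < n; i++)
        sum += (long)a[i] * b[i];
    return sum;
}
```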

    The following code shows the type of optimization the coloring register allocation module performs. The code:

         LDI     2,r104
         COPY    r104,r103
         LDO     5(r103),r106
         COPY    r106,r105
         LDO     10(r105),r107
    becomes:
         LDI     2,r25
         LDO     5(r25),r26
         LDO     10(r26),r31

    Induction Variables and Strength Reduction

    The induction variables and strength reduction module removes expressions that are linear functions of a loop counter and replaces each of them with a variable that contains the value of the function. Variables of the same linear function are computed only once. This module also simplifies the function by replacing multiplication instructions with addition instructions wherever possible.

    For example, the code:

         for (i=0; i<25; i++) {
              r[i] = i * k;
         }
    becomes:
         t1 = 0;
         for (i=0; i<25; i++) {
              r[i] = t1;
              t1 += k;
         }

    Local and Global Common Subexpression Elimination

    The common subexpression elimination module identifies expressions that appear more than once and have the same result, computes the result, and substitutes the result for each occurrence of the expression. The types of subexpression include instructions that load values from memory, as well as arithmetic evaluation.

    For example, the code:

         a = x + y + z;
         b = x + y + w;
    becomes:
         t1 = x + y;
         a = t1 + z;
         b = t1 + w;

    Constant Folding and Propagation

    Constant folding computes the value of a constant expression at compile time. For example:
    A = 10;
    B = A + 5;
    C = 4 * B;
    can be replaced by:
    A = 10;
    B = 15;
    C = 60;

    Loop Invariant Code Motion

    The loop invariant code motion module recognizes instructions inside a loop whose results do not change and moves them outside the loop. This ensures that the invariant code is only executed once.

    For example, the code:

         x = z;
         for(i=0; i<10; i++)
         {
              a[i] = 4 * x + i;
         }
    becomes:
         x = z;
         t1 = 4 * x;
         for(i=0; i<10; i++)
         {
              a[i] = t1 + i;
         }

    Store/Copy Optimization

    Where possible, the store/copy optimization module substitutes registers for memory locations, by replacing store instructions with copy instructions and deleting load instructions.

    For example, the following HP C code, where a is a local variable:

         a = x + 23;
         return a;
    produces the following code for the unoptimized case:
         LDO     23(r26),r1
         STW     r1,-52(0,sp)
         LDW     -52(0,sp),ret0
    and this code for the optimized case:
         LDO     23(r26),ret0

    Unused Definition Elimination

    The unused definition elimination module removes unused memory location and register definitions. These definitions are often a result of transformations made by other optimization modules.

    For example, the function:

         f(int x)
         {
              int a,b,c;
     
              a = 1;
              b = 2;
              c = x * b;
              return c;
         }
    becomes:
         f(int x)
         {
              int a,b,c;
     
              b = 2;
              c = x * b;
              return c;
         }

    Software Pipelining

    Software pipelining is a code transformation that optimizes program loops. It rearranges the order in which instructions are executed in a loop. It generates code that overlaps operations from different loop iterations. Software pipelining is useful for loops that contain arithmetic operations on floats and doubles.

    The goal of this optimization is to avoid CPU stalls due to memory or hardware pipeline latencies. The software pipelining transformation adds code before and after the loop to achieve a high degree of optimization within the loop.

    Example

    The following pseudo-code fragment shows a loop before and after the software pipelining optimization. Four significant things happen: The following is a C for loop:
    #define SIZ 10000
    float x[SIZ], y[SIZ]; /* Software pipelining works with */
    int i;                /* floats and doubles.            */
    init();
    for (i = 0; i < SIZ; i++)
            {
            x[i] = x[i] / y[i] + 4.00;
            }
    When this loop is compiled with software pipelining, the optimization can be expressed in pseudo-code as follows:
    R1 = 0;                Initialize array index.
    R2 = 4.0;              Load constant value.
    R3 = Y[0];             Load first Y value.
    R4 = X[0];             Load first X value.
    R5 = R4 / R3;          Perform division on first element: n = X[0] / Y[0].
     
    do {                   Begin loop.
          R6 = R1;         Save current array index.
          R1++;            Increment array index.
          R7 = X[R1];      Load current X value.
          R8 = Y[R1];      Load current Y value.
          R9 = R5 + R2;    Perform addition on prior row: X[i] = n + 4.0.
          R10 = R7 / R8;   Perform division on current row: m = X[i+1] / Y[i+1].
          X[R6] = R9;      Save result of operations on prior row.
     
          R6 = R1;         Save current array index.
          R1++;            Increment array index.
          R4 = X[R1];      Load next X value.
          R3 = Y[R1];      Load next Y value.
          R11 = R10 + R2;  Perform addition on current row: X[i+1] = m + 4
          R5 = R4 / R3;    Perform division on next row: n = X[i+2] / Y[i+2]
          X[R6] = R11;     Save result of operations on current row.
    } while (R1 <= 100);   End loop.
     
    R9 = R5 + R2;          Perform addition on last row: X[i+2] = n + 4
    X[R6] = R9;            Save result of operations on last row.
    This transformation stores intermediate results of the division instructions in unique registers (noted as n and m). These registers are not referenced until several instructions after the division operations. This decreases the possibility that the long latency period of the division instructions will stall the instruction pipeline and cause processing delays.
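The same idea can be written in plain C. The following is a simplified sketch that overlaps only two iterations and does not unroll the loop as the pseudo-code does; the function names scale_plain and scale_pipelined are assumptions. It shows the three parts of a pipelined loop: code before the loop that starts the first division, a loop body that starts the next division before finishing the current element, and code after the loop that finishes the last element.

```c
/* Straightforward loop: each iteration divides, then adds, so the
 * addition has to wait for the division it depends on. */
void scale_plain(float *x, const float *y, int n)
{
    int i;

    for (i = 0; i < n; i++)
        x[i] = x[i] / y[i] + 4.0f;
}

/* Pipelined by hand: the division for element i+1 is issued before
 * the addition and store for element i. */
void scale_pipelined(float *x, const float *y, int n)
{
    float quotient, next;
    int i;

    if (n <= 0)
        return;
    quotient = x[0] / y[0];            /* before the loop: first division */
    for (i = 0; i < n - 1; i++) {
        next = x[i + 1] / y[i + 1];    /* start work on the next element */
        x[i] = quotient + 4.0f;        /* finish the current element */
        quotient = next;
    }
    x[n - 1] = quotient + 4.0f;        /* after the loop: last element */
}
```

Both functions store the same results; only the order in which the operations are issued differs.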

    Prerequisites of Pipelining

    Software pipelining is attempted on a loop that meets the following criteria: This optimization produces slightly larger program files and increases compile time. It is most beneficial in programs containing loops that are executed a large number of times. This optimization is not recommended for loops that are executed only a small number of times.

    Use the +Onopipeline option with the +O2, +O3, or +O4 option to suppress software pipelining if program size is more important than execution speed. This will perform level two optimization, but disable software pipelining.

    Register Reassociation

    Array references often require one or more instructions to compute the virtual memory address of the array element specified by the subscript expression. The register reassociation optimization implemented in the PA-RISC compilers tries to reduce the cost of computing the virtual memory address expression for array references found in loops.

    Within loops, the virtual memory address expression can be rearranged and separated into a loop varying term and a loop invariant term. Loop varying terms are those items whose values may change from one iteration of the loop to another. Loop invariant terms are those items whose values are constant throughout all iterations of the loop. The loop varying term corresponds to the difference in the virtual memory address associated with a particular array reference from one iteration of the loop to the next.

    The register reassociation optimization dedicates a register to track the value of the virtual memory address expression for one or more array references in a loop and updates the register appropriately in each iteration of a loop.

    The register is initialized outside the loop to the loop invariant portion of the virtual memory address expression and the register is incremented or decremented within the loop by the loop variant portion of the virtual memory address expression. On PA-RISC, the update of such a dedicated register can often be performed for free using the base-register modification capability of load and store instructions.

    The net result is that array references in loops are converted into equivalent but more efficient pointer dereferences.

    For example:

    int a[10][20][30];
     
    void example (void)
    {
      int i, j, k;
     
      for (k = 0; k < 10; k++)
        for (j = 0; j < 10; j++)
          for (i = 0; i < 10; i++)
          {
              a[i][j][k] = 1;
          }
    }
    after register reassociation is applied to the innermost loop becomes:
    int a[10][20][30];
     
    void example (void)
    {
      int i, j, k;
      register int (*p)[20][30];
     
      for (k = 0; k < 10; k++)
        for (j = 0; j < 10; j++)
          for (p = (int (*)[20][30]) &a[0][j][k], i = 0; i < 10; i++)
          {
              *(p++[0][0]) = 1;
          }
    }
    In the above example, the compiler-generated temporary register variable, p, strides through the array a in the innermost loop. This register pointer variable is initialized outside the innermost loop and auto-incremented within the innermost loop as a side-effect of the pointer dereference.

    Register reassociation can often enable another loop optimization. After performing the register reassociation optimization, the loop variable may be needed only to control the iteration count of the loop. If this is the case, the original loop variable can be eliminated altogether by using the PA-RISC ADDIB and ADDB machine instructions to control the loop iteration count.

    Level 3 Optimizations

    Level 3 optimization includes level 2 optimizations, plus full optimization across all subprograms within a single file. Level 3 also inlines certain subprograms within the input file. Use +O3 to get level 3 optimization.

    Level 3 optimization produces faster run-time code than level 2 on code that frequently calls small functions within a file. Level 3 links faster than level 4.

    Inlining within a Single Source File

    Inlining substitutes function calls with copies of the function's object code. Only functions that meet the optimizer's criteria are inlined. This may result in slightly larger executable files. However, this increase in size is offset by the elimination of time-consuming procedure calls and procedure returns.

    Example of Inlining

    The following is an example of inlining at the source code level. Before inlining, the source file looks like this:
    /* Return the greatest common divisor of two positive integers,  */
    /* int1 and int2, computed using Euclid's algorithm.  (Return 0  */
    /* if either is not positive.)                                   */
    int gcd(int1,int2)
      int int1;
      int int2;
    {
      int inttemp;
     
        if ( ( int1 <= 0 ) || ( int2 <= 0 ) ) {
            return(0);
        }
        do {
            if ( int1 < int2 ) {
                inttemp = int1;
                int1    = int2;
                int2    = inttemp;
            }
            int1 = int1 - int2;
        } while (int1 > 0);
        return(int2);
    }
     
    main()
    {
      int xval,yval,gcdxy;
        /* statements before call to gcd */
        gcdxy = gcd(xval,yval);
        /* statements after call to gcd */
    }
    After inlining, the source file looks like this:
    main()
    {
      int xval,yval,gcdxy;
        /* statements before inlined version of gcd */
        {
          int int1;
          int int2;
     
            int1 = xval;
            int2 = yval;
            {
              int inttemp;
     
                if ( ( int1 <= 0 ) || ( int2 <= 0 ) ) {
                    gcdxy = ( 0 );
                    goto AA003;
                }
                do {
                    if ( int1 < int2 ) {
                        inttemp = int1;
                        int1    = int2;
                        int2    = inttemp;
                    }
                    int1 = int1 - int2;
                } while ( int1 > 0 );
                gcdxy = ( int2 );
            }
        }
    AA003 : ;
        /* statements after inlined version of gcd */
    }

    Level 4 Optimizations

    Level 4 performs optimizations across all files in a program. At level 4, all optimizations of the prior levels are performed. Two additional optimizations are performed: inlining across multiple files, and global and static variable optimization.

    Interprocedural global optimization across all files within a program searches across function boundaries to produce better and faster code sequences. Normally, global optimizations are performed within individual functions or source code files. Interprocedural optimizations look at function interactions within a program and transform particular code sequences into faster code. Since information about every function within a program is required, this level of optimization must be performed at link time.

    Inlining Across Multiple Files

    Inlining at level 4 is performed across all procedures within the program. Inlining at level 3 is done within one file.

    Inlining substitutes function calls with copies of the function's object code. Only functions that meet the optimizer's criteria are inlined. This may result in slightly larger executable files. However, this increase in size is offset by the elimination of time-consuming procedure calls and procedure returns.

    Global and Static Variable Optimization

    Global and static variable optimizations look for ways to reduce the number of instructions required for accessing global and static variables. The compiler normally generates two machine instructions when referencing global variables. Depending on the locality of the global variables, single machine instructions may sometimes be used to access these variables. The linker rearranges the storage location of global and static data to increase the number of variables that can be referenced by single instructions.

    Global Variable Optimization Coding Standards

    Since this optimization rearranges the location and data alignment of global variables, avoid the following programming practices:

    Guidelines for Using the Optimizer

    The following guidelines assist in effectively using the optimizer and writing efficient HP C programs.

    Optimizer Assumptions

    During optimization, the compiler gathers information about the use of variables and passes this information to the optimizer. The optimizer uses this information to ensure that every code transformation maintains the correctness of the program, at least to the extent that the original unoptimized program is correct.

    When gathering this information, the HP C compiler makes the following assumption: while inside a function, the only variables that can be accessed indirectly through a pointer or by another function call are:

    Optimizer Pragmas

    Pragmas give you the ability to:

    Pragmas cannot cross line boundaries and the word pragma must be in lowercase letters. Optimizer pragmas may not appear inside a function.

    Optimizer Control Pragmas

    The OPTIMIZE and OPT_LEVEL pragmas control which functions are optimized, and which set of optimizations is performed. You can place these pragmas before any function definition, and each one overrides any previous pragma. These pragmas cannot raise the optimization level above the level specified in the command line.

    OPT_LEVEL 0, 1, and 2 provide more control over optimization than the +O1 and +O2 compiler options. You use these pragmas to raise or lower optimization at a function level inside the source file, whereas the compiler options apply only to an entire source file. (OPT_LEVEL 3 and 4 can only be used at the beginning of the source file.)

    Table 11: Optimization Level Precedence shows the possible combinations of options and pragmas and the resulting optimization levels. The level at which a function will be optimized is the lower of the two values specified by the command line optimization level and the optimization pragma in force.
     

    Table 11: Optimization Level Precedence
    Command-line Optimization Level   #pragma OPT_LEVEL   Resulting OPT_LEVEL
    none                              OFF                 0
    none                              1                   0
    none                              2                   0
    +O1                               OFF                 0
    +O1                               1                   1
    +O1                               2                   1
    +O1                               3                   1
    +O1                               4                   1
    +O2                               OFF                 0
    +O2                               1                   1
    +O2                               2                   2
    +O2                               3                   2
    +O2                               4                   2
    +O3                               OFF                 0
    +O3                               1                   1
    +O3                               2                   2
    +O3                               3                   3
    +O3                               4                   3
    +O4                               OFF                 0
    +O4                               1                   1
    +O4                               2                   2
    +O4                               3                   3
    +O4                               4                   4
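
    The precedence rule behind the table (the lower of the two requested levels wins) can be written as a small function. This is only an illustration of the rule, not part of HP C; the function name and the encoding of "none" and OFF as 0 are assumptions made for the sketch.

```c
#include <assert.h>

/* Hypothetical encoding: 0 stands for "none" on the command line and for
   OFF in the pragma; 1 through 4 are the optimization levels. */
int resulting_opt_level(int cmdline_level, int pragma_level)
{
    assert(cmdline_level >= 0 && cmdline_level <= 4);
    assert(pragma_level >= 0 && pragma_level <= 4);

    /* The effective level is the lower of the two requests. */
    return cmdline_level < pragma_level ? cmdline_level : pragma_level;
}
```

    For example, resulting_opt_level(3, 2) corresponds to the "+O3, OPT_LEVEL 2" row of the table and yields 2.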

    The values of OPTIMIZE and OPT_LEVEL are summarized in Table 12: Optimizer Control Pragmas.
     

    Table 12: Optimizer Control Pragmas
    Pragma                  Description
    #pragma OPTIMIZE ON     Turns optimization on.
    #pragma OPTIMIZE OFF    Turns optimization off.
    #pragma OPT_LEVEL 1     Optimize only within small blocks of code.
    #pragma OPT_LEVEL 2     Optimize within each procedure.
    #pragma OPT_LEVEL 3     Optimize across all procedures within a source file.
    #pragma OPT_LEVEL 4     Optimize across all procedures within a program.
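
    A minimal sketch of per-function control, assuming the file is compiled with +O2 so that both pragmas stay within the command-line level. The function names are invented for the example, and compilers other than HP C will ignore or warn about these pragmas.

```c
#include <assert.h>

/* Rarely executed setup code: keep it at level 1. */
#pragma OPT_LEVEL 1
void init_table(int *table, int n)
{
    int i;

    assert(table != 0 && n >= 0);
    for (i = 0; i < n; i++)
        table[i] = i * i;
}

/* Hot path: back to level 2 (effective only when the command line
   requests +O2 or higher). */
#pragma OPT_LEVEL 2
long sum_table(const int *table, int n)
{
    long total = 0;
    int i;

    assert(table != 0 && n >= 0);
    for (i = 0; i < n; i++)
        total += table[i];
    return total;
}
```

    Each pragma sits at file scope, before the function definition it is meant to affect, and stays in force until the next OPTIMIZE or OPT_LEVEL pragma.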

    Inlining Pragmas

    When INLINE is specified without a functionname, any function can be inlined. When specified with functionname(s), these functions are candidates for inlining.

    The NOINLINE pragma disables inlining for all functions or specified functionname(s).

    The syntax for performing inlining is:

    #pragma INLINE [functionname(1), ..., functionname(n)]
    
    #pragma NOINLINE [functionname(1), ..., functionname(n)]
    For example, to specify inlining of the two subprograms checkstat and getinput, use:
    #pragma INLINE checkstat, getinput
    To specify that an infrequently called routine should not be inlined when compiling at optimization level 3 or 4, use:
    #pragma NOINLINE opendb
    See also the related +O[no]inline optimization option.

    Alias Pragmas

  • NO_SIDE_EFFECTS Pragma
  • ALLOCS_NEW_MEMORY pragma
  • FLOAT_TRAPS_ON pragma
  • [NO]PTRS_STRONGLY_TYPED Pragma

    The compiler gathers information about each function (such as information about function calls, variables, parameters, and return values) and passes this information to the optimizer. The NO_SIDE_EFFECTS and ALLOCS_NEW_MEMORY pragmas tell the optimizer to make assumptions it cannot normally make, resulting in improved compile-time and run-time speed. They change the default information the compiler collects.
    If used, the NO_SIDE_EFFECTS and ALLOCS_NEW_MEMORY pragmas should appear before the first function defined in a file and are in effect for the entire file. When used appropriately, these optional pragmas provide better optimization.

    NO_SIDE_EFFECTS Pragma

    By default, the optimizer assumes that all functions might modify global variables. To some degree, this assumption limits the extent of optimizations it can perform on global variables. The NO_SIDE_EFFECTS pragma provides a way to override this assumption. If you know for certain that some functions do not modify global variables, you can gain further optimization of code containing calls to these functions by specifying the function names in this pragma.

    NO_SIDE_EFFECTS has the following form:

    #pragma NO_SIDE_EFFECTS functionname(1), ..., functionname(n)
    Each functionname names a function that does not modify the values of global variables. Global variable references can be optimized to a greater extent in the presence of calls to the listed functions. Note that you need the NO_SIDE_EFFECTS pragma in the files where the calls are made, not where the function is defined. This pragma takes effect from the line it first occurs on to the end of the file.
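
    A small sketch of the situation this pragma targets. The names table_size, scale, and sum_scaled are hypothetical, and scale is defined in the same file only to keep the example self-contained; in practice the pragma belongs in whichever files call the function.

```c
int table_size = 16;          /* global read on every loop iteration */

int scale(int x);

/* File scope, before the first function definition.
   Promise: scale() never modifies a global variable. */
#pragma NO_SIDE_EFFECTS scale

int scale(int x)
{
    return 3 * x + 1;
}

int sum_scaled(void)
{
    int i, total = 0;

    /* Without the pragma, the optimizer must assume that each call to
       scale() might change table_size and reload it from memory. */
    for (i = 0; i < table_size; i++)
        total += scale(i);
    return total;
}
```

    If scale() were later changed to write to a global, the pragma would have to be removed, since the optimizer takes the promise on trust.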

    ALLOCS_NEW_MEMORY pragma

    The ALLOCS_NEW_MEMORY pragma states that the function functionname returns a pointer to new memory that is allocated either by the function itself or by a routine that it calls. ALLOCS_NEW_MEMORY has the following form:
    #pragma ALLOCS_NEW_MEMORY functionname(1), ..., functionname(n)
    The new memory must be memory that was either newly allocated or was previously freed and is now reallocated. For example, the standard routines malloc() and calloc() satisfy this requirement.

    Large applications might have routines that are layered above malloc() and calloc(). These interface routines make the calls to malloc() and calloc(), initialize the memory, and return the pointer that malloc() or calloc() returns. For example, in the program below:

    struct_type *get_new_record(void)
       {
       struct_type *p;
     
       if ((p=malloc(sizeof(*p))) == NULL) {
            printf("get_new_record():out of memory\n");
            abort();
            }
       else {
            /* initialize the struct */
            .
            .
            .
            return p;
            }
       }
    the routine get_new_record falls under this category, and can be included in the ALLOCS_NEW_MEMORY pragma.
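
    A complete, compilable variant of the routine above, shown mainly to illustrate where the pragma would go. The layout of struct_type is invented for the sketch (the original leaves it undefined), and zero-filling stands in for the elided initialization.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record layout; the original leaves struct_type undefined. */
typedef struct {
    int  id;
    char name[32];
} struct_type;

struct_type *get_new_record(void);

/* File scope, before the first function definition. */
#pragma ALLOCS_NEW_MEMORY get_new_record

struct_type *get_new_record(void)
{
    struct_type *p = malloc(sizeof(*p));

    if (p == NULL) {
        fprintf(stderr, "get_new_record(): out of memory\n");
        abort();
    }
    memset(p, 0, sizeof(*p));   /* stand-in for "initialize the struct" */
    return p;
}
```

    The pragma is only valid here because every pointer the routine returns comes straight from malloc(); a routine that sometimes hands back a cached or shared record would not qualify.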

    FLOAT_TRAPS_ON pragma

    This pragma informs the compiler that the function(s) may enable floating-point trap handling. When the compiler is so informed, it will not perform loop invariant code motion (LICM) on floating-point operations in the function(s) named in the pragma. This pragma is required for proper code generation when floating-point traps are enabled.

    #pragma FLOAT_TRAPS_ON { functionname,...functionname }
    #pragma FLOAT_TRAPS_ON { _ALL }

    For example:

    #pragma FLOAT_TRAPS_ON xyz,abc
    informs the compiler and optimizer that xyz and abc have floating-point traps turned on and therefore LICM optimization should not be performed.
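
    The following sketch suggests why code motion matters when traps are enabled. The function scale_nonzero is hypothetical, and the comment describes the general hazard of hoisting a floating-point operation rather than documented HP C behavior for this exact loop.

```c
#include <assert.h>

void scale_nonzero(double *v, int n, double num, double den);

#pragma FLOAT_TRAPS_ON scale_nonzero

/* num / den is loop-invariant, but it is evaluated only when a nonzero
   element is found.  Hoisting it above the loop would evaluate it
   unconditionally, which could raise a divide-by-zero trap that the
   code as written never triggers. */
void scale_nonzero(double *v, int n, double num, double den)
{
    int i;

    assert(v != 0 && n >= 0);
    for (i = 0; i < n; i++)
        if (v[i] != 0.0)
            v[i] *= num / den;
}
```

    With an all-zero array and den equal to 0.0, the division is never executed, so a trap handler installed by the program would not be entered as long as the division stays inside the loop.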

    [NO]PTRS_STRONGLY_TYPED Pragma

    The PTRS_STRONGLY_TYPED pragma allows you to specify when a subset of types is type-safe. This provides a finer level of control than +O[no]ptrs_strongly_typed.
    #pragma PTRS_STRONGLY_TYPED BEGIN
     
    #pragma PTRS_STRONGLY_TYPED END
     
    #pragma NOPTRS_STRONGLY_TYPED BEGIN
     
    #pragma NOPTRS_STRONGLY_TYPED END
    Types defined between a PTRS_STRONGLY_TYPED BEGIN-END pair are assumed to be type-safe; types defined between a NOPTRS_STRONGLY_TYPED pair are excluded from that assumption. These pragmas cannot be nested. Each BEGIN must have an associated END in the same compilation unit.

    The pragma takes precedence over the command-line option, although sometimes both are required (see Example 2).

    Example 1

    double *d;
    #pragma PTRS_STRONGLY_TYPED BEGIN
    int *i;
    float *f;
    #pragma PTRS_STRONGLY_TYPED END
    main(){
      .  .  .
    }
    In this example only two types, pointer-to-int and pointer-to-float will be assumed to be type-safe.

    Example 2

    cc +Optrs_strongly_typed foo.c
     
    /*source for Ex.2 */
    double *d;
      ...
    #pragma NOPTRS_STRONGLY_TYPED BEGIN
    int *i;
    float *f;
    #pragma NOPTRS_STRONGLY_TYPED END
      ...
    main(){
      ...
    }
    In this example all types are assumed to be type-safe except the types bracketed by pragma NOPTRS_STRONGLY_TYPED. The command-line option is required because the default option is +Onoptrs_strongly_typed.
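
    To suggest what the optimizer gains from this, here is a sketch that assumes "type-safe" carries the meaning used for +O[no]ptrs_strongly_typed, namely that a pointer to one type is never used to access an object of another type. The variable and function names are made up for the example.

```c
#include <assert.h>

#pragma PTRS_STRONGLY_TYPED BEGIN
int   *ip;
float *fp;
#pragma PTRS_STRONGLY_TYPED END

int store_both(void)
{
    assert(ip != 0 && fp != 0);
    *ip = 1;
    *fp = 2.0f;     /* under the type-safe assumption, cannot change *ip */
    return *ip;     /* so the value 1 may be reused without a reload */
}
```

    The result is the same with or without the pragma as long as ip and fp really point to objects of their own types; the pragma only permits the compiler to skip the second load.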

    Aliasing Options

    To be conservative, the optimizer assumes that a pointer can point to any object in the entire application. If you tell the optimizer how the application actually uses pointers, it can discard some of these false assumptions and generate more efficient code. You communicate such behavior to the optimizer by using the following options: where list is a comma-separated list of global variable names.

    Here are the type-inferred aliasing rules:

    Improving Shared Library Performance

    This section describes the following pragmas to be used for improving shared library performance:

  • HP_NO_RELOCATION Pragma
  • HP_LONG_RETURN Pragma
  • HP_DEFINED_EXTERNAL Pragma

    The pragmas described here can improve performance of shared libraries by reducing the overhead of calling shared library routines. You must be very careful using these pragmas because incorrect use can result in incorrect and unpredictable behavior. See also the HP-UX Linker and Libraries User's Guide for more information on improving shared library performance.

    HP_NO_RELOCATION Pragma

    This pragma improves performance of shared library calls by omitting floating-point parameter relocation stubs in calls to shared library functions. Put this pragma in header files of functions that take floating point parameters or return floating point data and that will be placed in shared libraries. By putting it in the header file and ensuring all calls reference the header file, you ensure that it is specified at the function definition and at all calls.

    WARNING  This pragma must be at the function definition and at all call sites. If the pragma is omitted from the function definition or from any call, the linker will generate parameter relocation code and the application will behave incorrectly since floating point parameters will not be in expected registers.

    Syntax

    #pragma HP_NO_RELOCATION name1[, name2[, ...]]

    where name1, name2, and so forth are names of functions in shared libraries.

    Background

    Parameter relocation stubs are instructions that move (relocate) floating point parameters and function return values between floating point registers and general registers. They are generated for calls to routines in shared libraries. Relocation stubs are generated when passing floating point parameters or using a floating point function return in routines in shared libraries. This pragma prevents this unnecessary relocation from being done.

    NOTE  Do not use this pragma with functions that use the varargs macros. See the HP C/HP-UX Reference Manual or the varargs(5) man page for information on the varargs macros.
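
    A sketch of the header-file arrangement described above. The file names geom.h and geom.c and the function scale_length are hypothetical; header and source are shown in one listing only so that the example compiles on its own.

```c
/* ---- geom.h (hypothetical header shipped with the shared library) ---- */
#ifndef GEOM_H
#define GEOM_H

double scale_length(double len, double factor);

/* Seen by the definition and by every caller that includes this header. */
#pragma HP_NO_RELOCATION scale_length

#endif /* GEOM_H */

/* ---- geom.c (library source; would start with #include "geom.h") ---- */
double scale_length(double len, double factor)
{
    return len * factor;
}
```

    Keeping the pragma next to the prototype means a caller cannot pick up the declaration without also picking up the pragma, which is the condition the WARNING above requires.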

    HP_LONG_RETURN Pragma

    This pragma improves performance of shared library calls by omitting export stubs and using a long return instruction sequence instead. An export stub is a short code segment generated by the linker for a global definition in a shared library. External calls to shared library functions go through the export stub.

    Put this pragma in header files of functions that will go in shared libraries so it is specified at the function definition and at all calls. For functions with floating point parameters or returns, use the HP_NO_RELOCATION pragma along with this pragma.


    WARNING  This pragma must be at the function definition and at all call sites. If the pragma is omitted from the function definition or from any call, the compiler will generate incompatible return code and the application will behave incorrectly.

    Syntax

    #pragma HP_LONG_RETURN name1[, name2[, ...]]

    where name1, name2, and so forth are names of functions in shared libraries.

    Background

    An export stub is generated by default for each function in a shared library. Each call to the function goes through the export stub. The export stub serves two purposes: to relocate parameters and perform an interspace return.

    The HP_LONG_RETURN pragma generates a long return sequence in the export stub instead of an interspace branch. If you also use the HP_NO_RELOCATION pragma (for functions taking floating point parameters) with the HP_LONG_RETURN pragma, all the code in the export stub is omitted, eliminating the export stub entirely. The HP_LONG_RETURN pragma by itself eliminates the need for export stubs for functions taking non-floating-point parameters.


    NOTE  Using HP_LONG_RETURN without HP_NO_RELOCATION on functions with floating point parameters could actually degrade performance by creating export stubs and relocation stubs.

    These pragmas improve performance of calls to shared library functions from outside the shared library. Therefore, do not use this pragma for hidden functions (see the -h and +e linker options) or for functions called only from within the same shared library linked with the -B symbolic linker option; otherwise, this pragma may degrade performance. (See the HP-UX Linker & Libraries User's Guide for information on the above mentioned options.)

    Do not use this pragma if you compile on PA-RISC 2.0 or later or with the +DA2.0 option since the effect is the default. That is, if no relocations are generated, export stubs are not generated on PA-RISC 2.0 and later, and a long return instruction sequence is generated by default, so this pragma has no effect.
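
    One way to combine these rules in a library header is sketched below. TARGET_PA_RISC_1_1 is a made-up macro that a build script would define only for PA-RISC 1.1 targets; it is not a predefined HP C symbol. The file and function names are also hypothetical.

```c
/* ---- counters.h (hypothetical header for a shared library) ---- */
#ifndef COUNTERS_H
#define COUNTERS_H

int    next_id(void);
double average(double sum, int count);

/* Apply the pragmas only where they have an effect. */
#ifdef TARGET_PA_RISC_1_1
#pragma HP_LONG_RETURN next_id, average
#pragma HP_NO_RELOCATION average    /* floating-point parameter and return */
#endif

#endif /* COUNTERS_H */

/* ---- counters.c (would start with #include "counters.h") ---- */
static int last_id;

int next_id(void)
{
    return ++last_id;
}

double average(double sum, int count)
{
    return count > 0 ? sum / count : 0.0;
}
```

    Here next_id takes no floating-point parameters, so HP_LONG_RETURN alone is enough for it, while average is listed in both pragmas to avoid the degraded case mentioned in the NOTE above.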


    HP_DEFINED_EXTERNAL Pragma

    This pragma improves performance of shared library calls by inlining import stubs. Place this pragma at calls to shared library routines along with the HP_NO_RELOCATION pragma (if using floating-point parameters or return values) and the HP_LONG_RETURN pragma.

    WARNING  Do not use this pragma at function definitions, only at function calls. Specifying it at function definitions will result in incorrect behavior.

    On PA-RISC 1.1, use this pragma only when calling a shared library from an executable file. Using it on calls within an executable file will cause the program to abort.


    Syntax

    #pragma HP_DEFINED_EXTERNAL name1[, name2[, ...]]

    where name1, name2, and so forth are names of functions in shared libraries.

    Background

    Import stubs are code sequences generated at calls to shared library routines. The import stub queries the PLT (Procedure Linkage Table) to determine the address of the shared library function and calls it. The HP_DEFINED_EXTERNAL pragma inlines this import stub.

    NOTE  If your function takes floating-point parameters, you should also use the HP_NO_RELOCATION pragma. You should also use the HP_LONG_RETURN pragma with this pragma. If you don't, the import stub may be too large to inline.

    Use this pragma only on calls to functions in shared libraries. On PA-RISC 2.0, it will degrade performance of calls to any other functions.


    Improving Compile and Link Times

    In general, optimization increases the amount of time it takes to compile your program, link your program, or both. However, the following options can help to decrease this time: