Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 7 Controlling optimization > Invoking command-line options
At each optimization level, you can turn specific optimizations on or off using the +O[no]optimization option. The optimization parameter is the name of a specific optimization. The optional prefix [no] disables the specified optimization. The following sections describe the optimizations that are turned on or off, their defaults, and the optimization levels at which they may be used. In syntax descriptions, namelist represents a comma-separated list of names. Optimization level: +O2, +O3, +O4 Default: +Onoaggressive +O[no]aggressive enables or disables optimizations that can result in significant performance improvement, and can change a program's behavior. This includes the optimizations invoked by the following advanced options (these are discussed separately in this chapter):
Optimization level: all Default: +Onoall Equivalent option: +Oall option is equivalent to specifying +O4 +Oaggressive +Onolimit +Oall performs maximum optimization, including aggressive optimizations and optimizations that can significantly increase compile time and memory usage. Optimization level: +O3, +O4 (+Oparallel must also be specified) Default: +Oautopar When used with +Oparallel option, +Oautopar causes the compiler to automatically parallelize loops that are safe to parallelize. A loop is considered safe to parallelize if its iteration count can be determined at runtime before loop invocation. It must also contain no loop-carried dependences, procedure calls, or I/O operations. A loop-carried dependence exists when one iteration of a loop assigns a value to an address that is referenced or assigned on another iteration. When used with +Oparallel, the +Onoautopar option causes the compiler to parallelize only those loops marked by the loop_parallel or prefer_parallel directives or pragmas. Because the compiler does not automatically find parallel tasks or regions, user-specified task and region parallelization is not affected by this option. C pragmas and Fortran directives are used to improve the effect of automatic optimizations and to assist the compiler in locating additional opportunities for parallelization. See “Optimization directives and pragmas” for more information. Optimization level: +O2, +O3, +O4 Default: +Onoconservative Equivalent option: +Oconservative is equivalent to +Onoaggressive +O[no]conservative causes
the optimizer to make or not make conservative assumptions about
the code when optimizing. +Oconservative
is useful in assuming a particular program's coding style,
such as whether it is standard-compliant. Specifying +Onoconservative
disables any optimizations that assume standard-compliant code.

Optimization level: +O2, +O3, +O4
Default: +Onodataprefetch

When +Odataprefetch is used, the
optimizer inserts instructions within innermost loops to explicitly
prefetch data from memory into the data cache. For cache lines containing
data to be written, +Odataprefetch
prefetches the cache lines so that they are valid for both read
and write access. Data prefetch instructions are inserted only for
data referenced within innermost loops using simple loop-varying
addresses in a simple arithmetic progression. It is only available
for PA-RISC 2.0 targets.

The math library libm contains special prefetching versions of vector routines. If you have a PA-RISC 2.0 application containing operations on arrays larger than one megabyte in size, using +Ovectorize in conjunction with +Odataprefetch may substantially improve performance. You can also use the +Odataprefetch option for applications that have high data cache miss overhead.

Optimization level: +O3, +O4 (+Oparallel must also be specified)
Default: +Odynsel

When specified with +Oparallel, +Odynsel enables workload-based dynamic selection. For parallelizable loops whose iteration counts are known at compile time, +Odynsel causes the compiler to generate either a parallel or a serial version of the loop, depending on which is more profitable. For parallelizable loops whose iteration counts are unknown at compile time, this optimization causes the compiler to generate both parallel and serial versions. At runtime, the loop's workload is compared to parallelization overhead, and the parallel version is run only if it is profitable to do so.

The +Onodynsel option disables dynamic selection and tells the compiler that it is profitable to parallelize all parallelizable loops. The dynsel directive and pragma are used to enable dynamic selection for specific loops when +Onodynsel is in effect. See the section “Dynamic selection” for additional information.

Optimization level: +O1, +O2, +O3, +O4
Default: +Onoentrysched

+Oentrysched optimizes instruction scheduling on a procedure's entry and exit sequences by unwinding in the entry and exit regions, which can increase the speed of an application. +O[no]entrysched can also change the behavior of programs that perform exception handling or that handle asynchronous interrupts. The behavior of setjmp() and longjmp() is not affected.

Optimization level: +O1, +O2, +O3, +O4
Default: +Ofail_safe

+Ofail_safe allows your compilations to continue when internal optimization errors are detected.
When an error is encountered, this option issues a warning message and restarts the compilation at +O0. The +Ofail_safe option is disabled when you specify +Oparallel with +O3 or +O4 to compile with parallelization. Using +Onofail_safe aborts your compilation when internal optimization errors are detected. Optimization level: +O0, +O1, +O2, +O3, +O4 Default: +Onofastaccess at
+O0, +O1, +O2, and +O3; +Ofastaccess at +O4

+Ofastaccess performs optimization for fast access to global data items. Use +Ofastaccess to improve execution speed at the expense of longer compile times.

Optimization level: +O2, +O3, +O4
Default: none (See Table 7-2 “+O[no]fltacc and floating-point optimizations”.)

+O[no]fltacc enables or disables optimizations that cause imprecise floating-point results.

+Ofltacc disables optimizations that cause imprecise floating-point results. Specifying +Ofltacc disables the generation of Fused Multiply-Add (FMA) instructions.

+Onofltacc improves execution speed at the expense of floating-point precision. The +Onofltacc option allows the compiler to perform floating-point optimizations that are algebraically correct, but may result in numerical differences. These differences are generally insignificant. The +Onofltacc option also enables the optimizer to generate FMA instructions.

If you optimize code at +O2 or higher, and do not specify +Onofltacc or +Ofltacc, the optimizer uses FMA instructions. However, it does not perform floating-point optimizations that involve expression reordering. FMA is implemented by the PA-8x00 instructions FMPYFADD and FMPYNFADD and improves performance. Occasionally, these instructions may produce results that differ in accuracy from results produced by code without FMA. In general, the differences are slight.

Table 7-2 “+O[no]fltacc and floating-point optimizations” presents a summary of the preceding information.

Table 7-2 +O[no]fltacc and floating-point optimizations
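To make the phrase "algebraically correct, but may result in numerical differences" concrete, here is a minimal C sketch of expression reordering. It is an illustration only, not output from the HP compilers: the function names are hypothetical, and the volatile temporaries are there solely to pin the evaluation order so that the two orders can be compared on any compiler.

```c
/* Sum three doubles in source order: (a + b) + c. */
double sum_left_to_right(double a, double b, double c)
{
    volatile double t = a + b;   /* volatile pins this order of evaluation */
    return t + c;
}

/* The same three values after reassociation: a + (b + c).
 * Algebraically identical, but the intermediate rounding differs. */
double sum_reassociated(double a, double b, double c)
{
    volatile double t = b + c;   /* a small c can be absorbed by a large b */
    return a + t;
}
```

With a = 1e17, b = -1e17, and c = 1.0, the first function returns 1.0 and the second returns 0.0, because 1.0 is smaller than half the spacing between adjacent doubles near 1e17. Code that depends on the first result is the kind of code for which +Ofltacc is worth considering.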
Optimization level: +O2, +O3, +O4
Default: +Onoglobal_ptrs_unique

Using this C compiler option identifies unique global pointers so that the optimizer can generate more efficient code in the presence of unique pointers, such as using copy propagation and common subexpression elimination. A global pointer is unique if it does not alias with any variable in the entire program.

This option supports a comma-separated list of unique global pointer variable names, represented by namelist in +O[no]global_ptrs_unique[=namelist]. If namelist is not specified, +Oglobal_ptrs_unique informs the compiler that all global pointers are unique, and +Onoglobal_ptrs_unique informs it that no global pointers are unique.

The example below states that no global pointers are unique, except a and b:

+Oglobal_ptrs_unique=a,b

The next example says that all global pointers are unique except a and b:

+Onoglobal_ptrs_unique=a,b

Optimization level: +O0, +O1, +O2, +O3, +O4
Default: +Onoinfo

+Oinfo displays informational messages about the optimization process. This option can be used at all optimization levels, but is most useful at +O3 and +O4. For more information about this option, see Chapter 8 “Optimization Report”.

Optimization level: +O2, +O3, +O4
Default: unspecified

+O[no]initcheck performs an initialization check for the optimizer. The optimizer has three possible states that check for initialization: on, off, or unspecified.
Optimization level: +O3, +O4
Default: +Oinline

When +Oinline is specified without a name list, any function can be inlined. For successful inlining, follow the prototype definitions for function calls in the appropriate header files.

When specified with a name list, the named functions are important candidates for inlining. For example, the following statement indicates that inlining be strongly considered for foo and bar:

+Oinline=foo,bar +Onoinline

All other routines are not considered for inlining because +Onoinline is given.

Use the +Onoinline[=namelist] option to exercise precise control over which subprograms are inlined. Use of this option is guided by knowledge of the frequency with which certain routines are called and may be warranted by code size concerns.

When this option is disabled with a name list, the compiler does not consider the specified routines as candidates for inlining. For example, the following statement indicates that inlining should not be considered for baz and x:

+Onoinline=baz,x

All other routines are considered for inlining because +Oinline is the default.

Optimization level: +O3, +O4
Default: +Oinline_budget=100

In +Oinline_budget=n, n is an integer in the range 1 to 1000000 that specifies the level of aggressiveness, as follows:
The +Onolimit and +Osize options also affect inlining. Specifying the +Onolimit option implies specifying +Oinline_budget=200. The +Osize option implies +Oinline_budget=1. However, +Oinline_budget takes precedence over both of these options. This means that you can override the effects on inlining of the +Onolimit and +Osize options, by specifying the +Oinline_budget option on the same compile line. Optimization level: +O0, +O1, +O2, +O3, +O4 Default: +Onolibcalls at
+O0 and +O1; +Olibcalls at +O2, +O3, and +O4

+Olibcalls increases the runtime performance of code that calls standard library routines in simple contexts. The +Olibcalls option expands the following library calls inline:
Inlining takes place only if the function call follows the prototype definition in the appropriate header file. A single call to printf() may be replaced by a series of calls to putchar(). Calls to sprintf() and strlen() may be optimized more effectively, including elimination of some calls producing unused results. Calls to setjmp() and longjmp() may be replaced by their equivalents _setjmp() and _longjmp(), which do not manipulate the process's signal mask.

Using the +Olibcalls option invokes millicode versions of frequently called math functions. Currently, there are millicode versions for the following functions:
See the HP-UX Floating-Point Guide for the most up-to-date listing of the math library functions.

+Olibcalls also improves the performance of selected library routines (when you are not performing error checking for these routines). The calling code must not expect to access errno after the function's return.

Using +Olibcalls with +Ofltacc gives different floating-point calculation results than those given using +Olibcalls without +Ofltacc.

Optimization level: +O2, +O3, +O4
Default: +Olimit

The +Olimit option suppresses optimizations that significantly increase compile time or that can consume a considerable amount of memory. The +Onolimit option allows optimizations to be performed, regardless of their effects on compile time and memory usage. Specifying the +Onolimit option implies specifying +Oinline_budget=200. See the section “+Oinline_budget=n” for more information.

Optimization level: +O3, +O4
Default: +Onoloop_block

+O[no]loop_block enables or disables blocking of eligible loops for improved cache performance. The +Onoloop_block option disables both automatic and directive-specified loop blocking. For more information on loop blocking, see the section “Loop blocking”.

Optimization level: +O3, +O4
Default: +Oloop_transform

+O[no]loop_transform enables or disables transformation of eligible loops for improved cache performance. The most important transformation is the interchange of nested loops to make the inner loop unit stride, resulting in fewer cache misses. The other transformations affected by +O[no]loop_transform are loop distribution, loop blocking, loop fusion, loop unroll, and loop unroll and jam. See Chapter 3 “Optimization levels” for information on loop transformations.

If you experience any problem while using +Oparallel, +Onoloop_transform may be a helpful option.

Optimization level: +O2, +O3, +O4
Default: +Oloop_unroll=4

+Oloop_unroll enables loop unrolling.
When you use +Oloop_unroll, you can also suggest the unroll factor to control the code expansion. The default unroll factor is four, meaning that the loop body is replicated four times. By experimenting with different factors, you may improve the performance of your program. In some cases, the compiler uses its own unroll factor.

The +Onoloop_unroll option disables partial and complete unrolling. Loop unrolling improves efficiency by eliminating loop overhead, and can create opportunities for other optimizations, such as improved register use and more efficient scheduling. See the section “Loop unrolling” for more information on unrolling.

Optimization level: +O3, +O4
Default: +Onoloop_unroll_jam

The +O[no]loop_unroll_jam option enables or disables loop unrolling and jamming. The +Onoloop_unroll_jam option (the default) disables both automatic and directive-specified unroll and jam. Loop unrolling and jamming increases register exploitation. For more information on the unroll and jam optimization, see the section “Loop unroll and jam”.

Optimization level: +O2, +O3, +O4
Default: +Omoveflops

+O[no]moveflops allows or disallows moving conditional floating-point instructions out of loops. The behavior of floating-point exception handling may be altered by this option. Use +Onomoveflops if floating-point traps are enabled and you do not want the behavior of floating-point exceptions to be altered by the relocation of floating-point instructions.

Optimization level: +O2, +O3, +O4
Default: +Onomultiprocessor

Specifying the +Omultiprocessor option at +O2 and above tells the compiler to appropriately optimize several different processes on multiprocessor machines. The optimizations are those appropriate for executables and shared libraries. Enabling this option incorrectly (such as on a uniprocessor machine) may cause performance problems. Specifying +Onomultiprocessor (the default) disables the optimization of more than one process running on multiprocessor machines.
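Returning to the +Oloop_unroll discussion above, the following C sketch shows by hand what "the loop body is replicated four times" means. It is only an illustration under assumed names (sum_rolled and sum_unrolled4 are hypothetical); the code the compiler actually generates for a given loop may look quite different.

```c
#include <stddef.h>

/* Rolled form: one element and one loop test per iteration. */
long sum_rolled(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Hand-unrolled by a factor of four: the body appears four times,
 * so the loop test and increment run about a quarter as often. */
long sum_unrolled4(const int *v, size_t n)
{
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += v[i];
        s += v[i + 1];
        s += v[i + 2];
        s += v[i + 3];
    }
    for (; i < n; i++)   /* cleanup loop for the leftover 0 to 3 elements */
        s += v[i];
    return s;
}
```

Both functions return the same sum for any input. The unrolled form trades code size for fewer branch and index updates, which is the code expansion that the unroll factor controls.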
Optimization level: +O3, +O4 Default: +Onoparallel The +Onoparallel option is the default for all optimization levels. This option disables automatic and directive-specified parallelization. If you compile one or more files in an application using +Oparallel, then the application must be linked (using the compiler driver) with the +Oparallel option to link in the proper start-up files and runtime support. The +Oparallel option causes the compiler to:
The following methods are used to specify the number of processors used in executing your parallel programs:
The +Oparallel option is valid only at optimization level +O3 and above. For information on parallelization, see the section “Levels of parallelism”. Using the +Oparallel option disables +Ofail_safe, which is enabled by default. See the section “+O[no]fail_safe” for more information. Optimization level: +O2, +O3, +O4 Default (Fortran): +Onoparmsoverlap Default (C/C++): +Oparmsoverlap +Oparmsoverlap causes the optimizer to assume that the actual arguments of function calls overlap in memory. Optimization level: +O2, +O3, +O4 Default: +Opipeline +O[no]pipeline enables or disables software pipelining. If program size is more important than execution speed, use +Onopipeline. Software pipelining is particularly useful for loops containing arithmetic operations on REAL or REAL*8 variables in Fortran or on float or double variables in C and C++. Optimization level: +O0, +O1, +O2, +O3, +O4 Default: +Onoprocelim at
+O0, +O1, +O2, and +O3; +Oprocelim at +O4

When +Oprocelim is specified, procedures not referenced by the application are eliminated from the output executable file. The +Oprocelim option reduces the size of the executable file, especially when optimizing at +O3 and +O4, at which inlining may have removed all of the calls to some routines.

When +Onoprocelim is specified, procedures not referenced by the application are not eliminated from the output executable file. If the +Oall option is enabled, the +Oprocelim option is enabled.

Optimization level: +O2, +O3, +O4
Default: +Onoptrs_ansi

The +Optrs_ansi option makes the following two assumptions, which the more aggressive +Optrs_strongly_typed does not:
When both +Optrs_ansi and +Optrs_strongly_typed are specified, +Optrs_ansi takes precedence. Optimization level: +O2, +O3, +O4 Default: +Onoptrs_strongly_typed Use the C compiler option +Optrs_strongly_typed when pointers are type-safe. The optimizer can use this information to generate more efficient code.
Type-safe (strongly typed) pointers are pointers to a specific type that point only to objects of that type. For example, a pointer declared as a pointer to an int is considered type-safe if that pointer points to an object of type int only.

Based on the type-safe concept, a set of groups is built based on object types. A given group includes all the objects of the same type. In type-inferred aliasing, any pointer of a type in a given group (of objects of the same type) can only point to any object from the same group. It cannot point to a typed object from any other group.

Type casting to a different type violates type-inferred aliasing rules. Dynamic casting is, however, allowed, as shown in Example 41.

The optimizer generally spills all global data from registers to memory before any modification to global variables or any loads through pointers. However, the optimizer can generate more efficient code if it knows how various data types interact. Consider the following example (line numbers are provided for reference):
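As a stand-in illustration of the kind of code this discussion has in mind, the minimal C sketch below uses an int pointer p and a float pointer q alongside float globals d, e, and f. It is written for illustration and is not the guide's own listing: the function name run_typed_loop, the variable a, and the initial values are assumptions, and the line numbers cited in the surrounding text do not refer to this sketch.

```c
int a;
float d, e, f;
int *p = &a;     /* p only ever points to int objects   */
float *q = &e;   /* q only ever points to float objects */

/* Hypothetical driver: runs the loop from fixed starting values. */
float run_typed_loop(void)
{
    int i;
    e = 1.0f;
    f = 2.0f;
    for (i = 1; i < 10; i++) {
        d = e;        /* d now holds a copy of e */
        *p = i;       /* int store: with type-inferred aliasing it
                         cannot change the float objects d, e, or f */
        e = d + f;    /* so d and f need not be reloaded from memory */
        f = *q;       /* float load through q */
    }
    return e;
}
```

Because p and q are type-safe here, an optimizer that applies type-inferred aliasing may keep d and f in registers across the store through p. Without that assumption, it would have to treat the store as possibly modifying them.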
With +Optrs_strongly_typed turned on, the pointers p and q are assumed to be disjoint because the types they point to are different. Without type-inferred aliasing, *p is assumed to invalidate all the definitions, so the uses of d and f on line 10 have to be loaded from memory. With type-inferred aliasing, the optimizer can propagate the copy of d and f, thus avoiding two loads and two stores.

This option can be used for any application involving the use of pointers, where those pointers are type-safe. To specify when a subset of types are type-safe, use the ptrs_strongly_typed pragma. The compiler issues warnings for any incompatible pointer assignments that may violate the type-inferred aliasing rules discussed in the section “C aliasing options”.

Any type cast to a different type violates type-inferred aliasing rules. Do not use +Optrs_strongly_typed with code that has these "unsafe" type casts. Use the no_ptrs_strongly_typed pragma to prevent the application of type-inferred aliasing to the unsafe type casts.
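For a sense of what an "unsafe" type cast looks like in practice, the sketch below reads the bit pattern of a float. The cast form appears only in a comment; the function body uses memcpy, which is well defined in standard C. This is an illustration under stated assumptions: the name float_bits is hypothetical, float is assumed to be a 32-bit IEEE 754 value, and this section does not say how the HP optimizer treats memcpy under +Optrs_strongly_typed.

```c
#include <stdint.h>
#include <string.h>

/* Returns the object representation of x as a 32-bit integer.
 * Assumes float is 32 bits wide (IEEE 754 single precision).
 *
 * The "unsafe" form this replaces would be:
 *     return *(uint32_t *)&x;
 * which dereferences a float object through a pointer to a
 * different type and so breaks type-inferred aliasing rules. */
uint32_t float_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);   /* byte copy, no cross-type dereference */
    return u;
}
```

Code that genuinely needs the cast form is a candidate for the no_ptrs_strongly_typed pragma mentioned above.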
Generally applying type aliasing

Dynamic casting is allowed with +Optrs_strongly_typed or +Optrs_ansi. A pointer dereference is called a dynamic cast if a cast is applied on the pointer to a different type.

In the example below, type-inferred aliasing is generally applied on P, not just to the particular dereference. Type-aliasing is applied to any other dereferences of P.
For more information about type aliasing, see the section “C aliasing options”. Optimization level: +O2, +O3, +O4 Default: +Optrs_to_globals By default, global variables are conservatively assumed to be modified anywhere in the program. Use the C compiler option +Onoptrs_to_globals to specify which global variables are not modified through pointers. This allows the optimizer to make the program run more efficiently by incorporating copy propagation and common subexpression elimination.
This option is used to specify all global variables that are not modified using pointers, or to specify a comma-separated list of global variables that are not modified using pointers. The on state for this option disables some optimizations, such as aggressive optimizations on the program's global symbols. For example, use the command-line option +Onoptrs_to_globals=a,b,c to specify global variables a, b, and c to not be accessible through pointers. The result (shown below) is that no pointer can access these global variables. The optimizer performs copy propagation and constant folding because storing to *p does not modify a or b.
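The effect described above can be pictured with a short C sketch. It is illustrative only: the function name store_and_sum and its parameter are hypothetical, and the comments describe what an optimizer is permitted to do under the +Onoptrs_to_globals=a,b,c assertion, not what any particular compiler release emits.

```c
int a, b, c;
float *p;

/* Hypothetical routine: 'target' is any float object the caller owns. */
void store_and_sum(float *target)
{
    p = target;
    a = 10;
    b = 20;
    *p = 1.0f;    /* store through a pointer: asserted not to touch a, b, or c */
    c = a + b;    /* may therefore be folded to c = 30 without reloading a and b */
}
```

Without the assertion, the optimizer must assume that the store through p could have modified a or b, and so it reloads both before the addition.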
If all global variables are unique, use the +Onoptrs_to_globals option without listing the global variables (that is, without using namelist). In the example below, the address of b is taken. This means b is accessed indirectly through the pointer. You can still use +Onoptrs_to_globals as:
For more information about type aliasing, see the section “C aliasing options”. Optimization level: +O2, +O3, +O4 Default: +Oregreassoc +O[no]regreassoc enables or disables register reassociation. This is a technique for folding and eliminating integer arithmetic operations within loops, especially those used for array address computations. This optimization provides a code-improving transformation supplementing loop-invariant code motion and strength reduction. Additionally, when performed in conjunction with software pipelining, register reassociation can also yield significant performance improvement. Optimization level: +O3, +O4 Default: +Onoreport +Oreport[=report_type] specifies the contents of the Optimization Report. Values of report_type and the Optimization Reports they produce are shown in Table 7-3 “Optimization Report contents”. Table 7-3 Optimization Report contents
The Loop Report gives information on optimizations performed on loops and calls. Using +Oreport (without =report_type) also produces the Loop Report. The Privatization Table provides information on loop variables that are privatized by the compiler. +Oreport[=report_type] is active only at +O3 and above. The +Onoreport option does not accept any of the report_type values. For more information about the Optimization Report, see Chapter 8 “Optimization Report”. +Oinfo also displays information on the various optimizations being performed by the compilers. +Oinfo is used at any optimization level, but is most useful at +O3 and above. The default at all optimization levels is +Onoinfo. Optimization level: +O2, +O3, +O4 Default: +Osharedgra The +Onosharedgra option disables global register allocation for shared-memory variables that are visible to multiple threads. This option may help if a variable shared among parallel threads is causing wrong answers. See the section “Global register allocation (GRA)” for more information. Global register allocation (+Osharedgra) is enabled by default at optimization level +O2 and higher. Optimization level: +O2, +O3, +O4 Default: +Onosignedpointers
The C and C++ option +O[no]signedpointers requests that the compiler perform or not perform optimizations related to treating pointers as signed quantities. This helps improve application runtime speed. Applications that allocate shared memory and that compare a pointer to shared memory with a pointer to private memory may run incorrectly if this optimization is enabled.

Optimization level: +O2, +O3, +O4
Default: +Onosize

The +Osize option suppresses optimizations that significantly increase code size. Specifying +Osize implies specifying +Oinline_budget=1. See the section “+Oinline_budget=n” for additional information.

The +Onosize option does not prevent optimizations that can increase code size.

Optimization level: +O0, +O1, +O2, +O3, +O4
Default: +Onostatic_prediction

+Ostatic_prediction turns on static branch prediction for PA-RISC 2.0 targets. PA-RISC 2.0 predicts the direction conditional branches go in one of two ways:
Optimization level: +O3, +O4 Default: +Onovectorize +Ovectorize allows the compiler to replace certain loops with calls to vector routines. Use +Ovectorize to increase the execution speed of loops.
When +Onovectorize is specified, loops are not replaced with calls to vector routines.

Because the +Ovectorize option may change the order of floating-point operations in an application, it may also change the results of those operations slightly. See the HP-UX Floating-Point Guide for more information.

The math library contains special prefetching versions of vector routines. If you have a PA-RISC 2.0 application containing operations on large arrays (larger than one megabyte in size), using +Ovectorize in conjunction with +Odataprefetch may improve performance.

+Ovectorize is also included as part of the +Oaggressive and +Oall options.

Optimization level: +O1, +O2, +O3, +O4
Default: +Onovolatile

The C and C++ option +Ovolatile implies that memory references to global variables cannot be removed during optimization. The +Onovolatile option indicates that no globals are of volatile class. This means that references to global variables can be removed during optimization. Use this option to control the volatile semantics for all global variables.

Optimization level: +O4
Default: +Onowhole_program_mode

Use +Owhole_program_mode to improve performance. This should be used only when you are certain that only the files compiled with +Owhole_program_mode directly access any globals that are defined in these files.
+Owhole_program_mode enables the assertion that only the files that are compiled with this option directly reference any global variables and procedures that are defined in these files. In other words, this option asserts that there are no unseen accesses to the globals. When this assertion is in effect, the optimizer can hold global variables in registers longer and delete inlined or cloned global procedures. All files compiled with +Owhole_program_mode must also be compiled with +O4. If any of the files were compiled with +O4, but were not compiled with +Owhole_program_mode, the linker disables the assertion for all files in the program. The default, +Onowhole_program_mode, disables the assertion noted above. Optimization level: +O0, +O1, +O2, +O3, +O4 Default target value: corresponds to the machine on which you invoke the compiler. This option specifies the target machine architecture for which compilation is to be performed. Using this option causes the compiler to perform architecture-specific optimizations. target takes one of the following values:
This option is valid at all optimization levels. The default target value corresponds to the machine on which you invoke the compiler.

Using the +tm target option implies +DA and +DS settings as described in Table 7-4 “+tm target and +DA/+DS”. +DAarchitecture causes the compiler to generate code for the architecture specified by architecture. +DSmodel causes the compiler to use the instruction scheduler tuned to model. See the f90(1), aCC(1), or cc(1) man page for more information describing the +DA and +DS options.

Table 7-4 +tm target and +DA/+DS
If you specify +DA or +DS on the compiler command line, your setting takes precedence over the setting implied by +tm target.