Summary of Major Optimization Levels
Supporting Optimization Options
Enabling Basic Optimization
Enabling Different Levels of Optimization
Changing the Aggressiveness of Optimizations
Enabling Only Conservative Optimizations
Enabling Aggressive Optimizations
Removing Compilation Time Limits When Optimizing
Limiting the Size of Optimized Code
Specifying Maximum Optimization
Combining Optimization Parameters
Summary of Optimization Parameters
Profile-Based Optimization
Controlling Specific Optimizer Features
Using Advanced Optimization Options
Level 1 Optimizations
Level 2 Optimizations
Level 3 Optimizations
Level 4 Optimizations
Guidelines for Using the Optimizer
Optimizer Assumptions
Optimizer Pragmas
Aliasing Options
Improving Shared Library Performance
Improving Compile and Link Times
The HP C optimizer transforms programs so machine resources are used more efficiently. The optimizer can dramatically improve application run-time speed. HP C performs only minimal optimizations unless you specify otherwise. You activate additional optimizations using HP C command-line options and pragmas.
There are four major levels of optimization: levels 1, 2, 3, and 4. Level 4 optimization can produce the fastest executable code. Level 4 is a superset of the other levels.
Additional parameters enable you to control the size of the executable program, compile time, and aggressiveness of the optimizations performed.
Compile time memory and CPU usage increase with each higher level of optimization due to the increasingly complex analysis that must be performed. You can control the trade-offs between compile-time penalties and code performance by choosing the level of optimization you desire.
Generally, the optimizer is not used during code development. It is used when compiling production-level code for benchmarking and general use.
cc -O sourcefile.c
Basic optimizations do not change the behavior of ANSI C standard-conforming code. They improve run-time speed and increase compile time and link time by only a moderate amount.
cc +O1 sourcefile.c
Level 1 optimization compiles quickly, but still provides some run-time speedup.
cc +O2 sourcefile.c
Level 2 (equivalent to -O) takes more time to compile, but produces greatly improved run-time speed.
cc +O3 sourcefile.c
Level 3 does full optimization of all subprograms within a single file.
cc +O4 sourcefile.c
Level 4 can potentially produce the greatest improvements in speed by performing optimizations across multiple object files. Level 4 does optimizations at link time, so compiles will be faster, but links will be longer.
Depending on the size and number of the modules, compiling at level 4 can consume a large amount of virtual memory. Level 4 may consume roughly 1.25 megabytes per 1000 lines of non-commented source. When you use level 4 on a large application, it is a good idea to increase the system swap space. For information on increasing system swap space, see the book Managing Systems and Workgroups.
Use the +Oconservative option at optimization level 2, 3, or 4 if you are not sure if your code conforms to standards. This option provides more safety.
Use the +Oaggressive option at optimization level 2, 3, or 4 for best performance when you are willing to risk changes to the behavior of your programs. Using the +Oaggressive option can cause your program to have compilation or run-time problems that require troubleshooting.
cc +O2 +Oconservative sourcefile.c
or:
cc +O3 +Oconservative sourcefile.c
or:
cc +O4 +Oconservative sourcefile.c
Conservative optimizations are optimizations that, in most cases, do not change the behavior of code even if the code does not conform to standards.
Use the conservative optimizations provided with level 2, 3, and 4 when your code is non-ANSI.
cc +O2 +Oaggressive sourcefile.c
or:
cc +O3 +Oaggressive sourcefile.c
or:
cc +O4 +Oaggressive sourcefile.c
Aggressive optimizations are new optimizations or are optimizations that can change the behavior of programs. These optimizations may do any of the following:
cc +O2 +Onolimit sourcefile.c
or:
cc +O3 +Onolimit sourcefile.c
or:
cc +O4 +Onolimit sourcefile.c
By default, the optimizer limits the amount of time spent optimizing large programs at levels 2, 3, and 4. Use this option if longer compile times and greater virtual memory use are acceptable because you want additional optimizations to be performed.
cc +O2 +Osize sourcefile.c
or:
cc +O3 +Osize sourcefile.c
or:
cc +O4 +Osize sourcefile.c
Most optimizations improve execution speed and decrease executable code size. A few optimizations significantly increase code size to gain execution speed. The +Osize option disables these code-expanding optimizations.
Use this option if you have limited main memory, swap space, or disk space.
cc +Oall
The +Oall option performs the maximum optimization.
Use +Oall with stable, well-structured, ANSI-conforming code. These types of optimizations give you the fastest code, but are riskier than the default optimizations.
You can use +Oall at optimization levels 2, 3, and 4. The default is +Onoall.
The +Oall option by itself (specified without the +O2, +O3, or +O4 options) combines the +O4 +Oaggressive +Onolimit options. This combination performs aggressive optimizations with unrestricted compile time at the highest level of optimization.
For example, to specify conservative optimizations at level 2 and disable code-expanding optimizations, use:
cc +O2 +Oconservative +Osize sourcefile.c
+Olimit and +Osize can be used with either +Oaggressive or +Oconservative.
You cannot use +Oaggressive with +Oconservative.
| Option | What It Does | Level of Opt |
|---|---|---|
| +O[no]aggressive | The +O[no]aggressive option enables optimizations that can result in significant performance improvement, but that can change a program's behavior. These optimizations include newly released optimizations and the optimizations invoked by advanced optimization options. See Controlling Specific Optimizer Features for details about advanced optimization options. | 2, 3, 4 |
| +O[no]all | The +Oall option performs maximum optimization, including aggressive optimizations and optimizations that can significantly increase compile time and memory usage. The default is +Onoall. | 4 |
| +O[no]conservative | The +O[no]conservative option causes the optimizer to make conservative assumptions about the code when optimizing it. Use +Oconservative when conservative assumptions are necessary due to the coding style, as with non-standard-conforming programs. The +Oconservative option relaxes the optimizer's assumptions about the target program. The default is +Onoconservative. | 2, 3, 4 |
| +O[no]info | +Oinfo displays informational messages about the optimization process. This option supports the core optimization levels, and therefore can be used at levels 0-4. The default is +Onoinfo. | 0, 1, 2, 3, 4 |
| +O[no]limit | The +Olimit option suppresses optimizations that significantly increase compile time or that can consume a lot of memory. The +Onolimit option allows optimizations to be performed regardless of their effect on compile time or memory usage. The default is +Olimit. | 2, 3, 4 |
| +O[no]size | The +Osize option suppresses optimizations that significantly increase code size. The +Onosize option does not prevent optimizations that can increase code size. The default is +Onosize. | 2, 3, 4 |
| +O[no]clone | This option allows the user to turn on [off] the cloning facility of the optimizer. Cloning is on by default. It is mainly provided for users who may see a lot of cloning adversely affecting the performance of their code. If inlining is turned off, cloning is turned off by default. You cannot specify +Onoinline +Oclone. | 3, 4 |
| +O[no]memory[=malloc] | This option enables [disables] memory optimizations. Specifying malloc in the list will enable [disable] optimizations which consolidate memory allocation procedure calls. This option is disabled by default. It is incompatible with +Oopenmp and +Oparallel, and is ignored when these options are in effect. | 3, 4 |
There are three steps involved in performing this optimization:
When you use PBO, compile times are faster and link times are slower because code generation happens at link time.
cc -Aa +I -O -c sample.c          Compile for instrumentation.
cc -o sample.exe +I -O sample.o   Link to make instrumented executable.
The first command line uses the -O option to perform level 2 optimization and instruments the code. The -c option in the first command line suppresses linking and creates an intermediate object file called sample.o. The .o file can be used later in the optimization phase, avoiding a second compile.
The second command line uses the -o option to link sample.o into sample.exe. The +I option instruments sample.exe with data collection code. Note that instrumented programs run slower than non-instrumented programs. Only use instrumented code to collect statistics for profile-based optimization.
sample.exe < input.file1          Collect execution profile data.
sample.exe < input.file2
This step creates and logs the profile statistics to a file, by default called flow.data. You can use this data collection file to store the statistics from multiple test runs of different programs that you may have instrumented.
cc -o sample.exe +P -O sample.o
An alternative to this procedure is to recompile the source file in the optimization step:
cc -o sample.exe +I -O sample.c   instrumentation
sample.exe < input.file1          data collection
cc -o sample.exe +P -O sample.c   optimization
You can override the default name of the profile data file. This is useful when working on large programs or on projects with many different program files.
You can use the FLOW_DATA environment variable to specify the name of the profile data file with either the +I or +P options. You can use the +df command-line option to specify the name of the profile data file with the +P option.
The +df option takes precedence over the FLOW_DATA environment variable.
In the following example, the FLOW_DATA environment variable is set to override the flow.data file name. The profile data is stored instead in /users/profiles/prog.data.
% setenv FLOW_DATA /users/profiles/prog.data
% cc -Aa -c +I +O3 sample.c
% cc -o sample.exe +I +O3 sample.o
% sample.exe < input.file1
% cc -o sample.exe +P +O3 sample.o
In the next example, the +df option uses /users/profiles/prog.data to override the flow.data file name.
% cc -Aa -c +I +O3 sample.c
% cc -o sample.exe +I +O3 sample.o
% sample.exe < input.file1
% mv flow.data /users/profiles/prog.data
% cc -o sample.exe +df /users/profiles/prog.data +P +O3 sample.o
Take care when maintaining different versions of the executable file, because the instrumented program file name is used as the key identifier when storing execution profile data in the data file.
The optimizer must know what this key identifier name is in order to find the execution profile data. By default, the key identifier name used to retrieve the profile data is the instrumented program file name used to run the program for data collection.
When you optimize a program file and the optimized program file name is different from the instrumented program file name, you must use the +pgm option. Specify the instrumented program file name with this option. The optimizer uses this value as the key identifier to retrieve execution profile data.
In the following example, the instrumented program file name is sample.inst. The optimized program file name is sample.opt. The +pgm name option is used to pass the instrumented program name to the optimizer:
% cc -Aa -c +I +O3 sample.c
% cc -o sample.inst +I +O3 sample.o
% sample.inst < input.file1
% cc -o sample.opt +P +O3 +pgm sample.inst sample.o
Previously, the compiler generated intermediate code when the PBO options (+I, +P) were used, and final code generation happened during the link phase unless +Oreusedir= was used. At that stage, the linker calls ucomp. An obvious disadvantage is that even when a single file is changed, code generation for all other files happens during the link phase, which makes the overall compile-link time significantly longer. As an enhancement, the compiler generates PA-RISC machine code (SOM) whenever the newly introduced PBO options are used. Code generation then does not have to happen during the link phase, because the compiler itself converts the intermediate code (ISOM) into machine code (SOM) by calling ucomp.
The following lists the newly introduced PBO options:
At each level, you can turn on and off specific optimizations using the +O[no]optimization option. The optimization parameter is the name of a specific optimization technique. The optional prefix [no] disables the specified optimization.
Below is a list of advanced optimizer options, followed by detailed information on each option:
Default: All functions are optimized at the level specified by the ordinary +Olevel option.
This option lowers optimization to the specified level for one or more named functions. level can be 0, 1, 2, 3, or 4. The name parameters are names of functions in the module being compiled. Use this option when one or more functions do not optimize well or properly. It must be used with an ordinary +Olevel option.
This option works the same as the OPT_LEVEL pragma described under Optimizer Control Pragmas. This option overrides the OPT_LEVEL pragma for the specified functions. As with the pragma, you can only lower the level of optimization; you cannot raise it above the level specified in the ordinary +Olevel option. To avoid confusion, it is best to use either this option or the OPT_LEVEL pragma rather than both.
$ cc +O3 +O1=myfunc1,myfunc2 funcs.c main.c
The following command optimizes all functions at level 2, except for the functions myfunc1 and myfunc2, which it optimizes at level 0.
$ cc -O +O0=myfunc1,myfunc2 funcs.c main.c
When +Odataprefetch is enabled, the optimizer inserts instructions within innermost loops to explicitly prefetch data from memory into the data cache. Data prefetch instructions will be inserted only for data structures referenced within innermost loops using simple loop-varying addresses (that is, in a simple arithmetic progression). It is only available for PA-RISC 2.0 targets.
The math library contains special prefetching versions of vector routines. If you have a PA-RISC 2.0 application that contains operations on arrays larger than 1 megabyte in size, using +Ovectorize in conjunction with +Odataprefetch may improve performance substantially.
Use this option for applications that have high data cache miss overhead.
Default: +Onoentrysched
The +Oentrysched option optimizes instruction scheduling on a procedure's entry and exit sequences. Enabling this option can speed up an application. The option has undefined behavior for applications which handle asynchronous interrupts by examining the sigcontext values of caller stack operands. The option affects unwinding in the entry and exit regions.
At optimization level +O2 and higher (using data flow information), save and restore operations become more efficient.
This option can change the behavior of programs that perform stack unwind-based exception handling or asynchronous interrupt handling. The behavior of setjmp() and longjmp() is not affected.
Default: +Oextern
This option is available in the LP64 data model only.
The +O[no]extern option allows you to specify which accesses to symbols in an executable or shared library (a load module) can be optimized. Use of +Onoextern creates code that cannot be included in a shared library. Use +Onoextern only to build executables. Only internal symbols (defined in the load module) can be optimized.
If +Onoextern is specified without a name list, the compiler assumes that no symbols are external to the load module being compiled, and any symbol can be optimized. If +Oextern is specified without a name list, the compiler assumes that all symbols are external to the load module being compiled and thus cannot be optimized; this is the default.
If +Oextern is specified with a name list, the compiler treats the specified symbols as external even if +Onoextern without a name list is in effect. The following example indicates that foo and bar are to eventually be imported from another load module (for example, a shared library); all other functions and data items will not be external, since +Onoextern is specified.
+Oextern=foo,bar +Onoextern
When +Onoextern is specified with a name list, the compiler treats the specified symbols as internal even if +Oextern without a name list is in effect. The following example indicates that references to baz and x may be optimized for access in the local load module. All other symbols will be subject to resolution to another load module since +Oextern is the default.
+Onoextern=baz,x
Use this option to precisely control which symbols' accesses may be optimized. Knowledge of the shared libraries used by an application, or the exported interface of a shared library, is required. See also the HP_DEFINED_EXTERNAL pragma. The default is +Oextern with no name list.
Default: +Ofail_safe
The +Ofail_safe option allows compilations with internal optimization errors to continue by issuing a warning message and restarting the compilation at +O0.
You can use +Onofail_safe at optimization levels 1, 2, 3, or 4 when you want the internal optimization errors to abort your build.
This option is disabled when compiling for parallelization.
Default: +Onofastaccess at optimization levels 0, 1, 2 and 3, +Ofastaccess at optimization level 4
The +Ofastaccess option optimizes for fast access to global data items.
Use +Ofastaccess to improve execution speed at the expense of longer compile times.
The +Onofltacc option allows the compiler to perform floating-point optimizations that are algebraically correct but that may result in numerical differences. For example, this option may change the order of expression evaluation: if a, b, and c are floating-point variables, the expressions (a + b) + c and a + (b + c) may give slightly different results due to rounding. In general, these differences will be insignificant.
The +Onofltacc option also enables the optimizer to generate fused multiply-add (FMA) instructions, the FMPYFADD and FMPYNFADD. These instructions improve performance but occasionally produce results that may differ from results produced by code without FMA instructions. In general, the differences are slight. FMA instructions are only available on PA-RISC 2.0 systems.
Specifying +Ofltacc disables the generation of FMA instructions as well as some other floating-point optimizations. Use +Ofltacc if it is important that the compiler evaluate floating-point expressions as it does in unoptimized code. The +Ofltacc option does not allow any optimizations that change the order of expression evaluation and therefore may affect the result.
If you are optimizing code at level 2 or higher and do not specify +Onofltacc or +Ofltacc, the optimizer will use FMA instructions, but will not perform floating-point optimizations that involve expression reordering or other optimizations that potentially impact numerical stability.
The table below identifies the different actions taken by the optimizer according to whether you specify +Ofltacc, +Onofltacc, or neither option.
| Optimization Options | Expression Reordering? | FMA? |
|---|---|---|
| +O2 | No | Yes |
| +O2 +Ofltacc | No | No |
| +O2 +Onofltacc | Yes | Yes |
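The effect of expression reordering can be seen in a small, hedged sketch. The function names and the operand values below are illustrative assumptions, not part of the HP C documentation; the values are chosen so that the rounding difference is large rather than slight. The volatile temporaries only keep this demonstration from being regrouped by a compiler.

```c
#include <assert.h>

/* Evaluates (a + b) + c, grouping from the left. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;   /* volatile: force this grouping */
    return t + c;
}

/* Evaluates a + (b + c), the regrouped form a reordering optimizer might pick. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;   /* 1.0 is lost when added to -1e30 */
    return a + t;
}

/* Illustrative self-check: the two groupings disagree for these operands. */
void check_reassociation(void)
{
    assert(sum_left(1e30, -1e30, 1.0) == 1.0);
    assert(sum_right(1e30, -1e30, 1.0) == 0.0);
}
```

Code whose results depend on a particular grouping, as in this sketch, is the kind of code for which the text recommends +Ofltacc.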
Default: +Onoglobal_ptrs_unique
Use this option to identify unique global pointers, so that the optimizer can generate more efficient code in the presence of unique pointers, for example by using copy propagation and common sub-expression elimination. A global pointer is unique if it does not alias with any variable in the entire program.
This option supports a comma-separated list of unique global pointer variable names.
Default: unspecified
The initialization checking feature of the optimizer has three possible states: on, off, or unspecified. When on (+Oinitcheck), the optimizer initializes to zero any local, scalar, non-static variables that are uninitialized with respect to at least one path leading to a use of the variable.
When off (+Onoinitcheck), the optimizer issues warning messages when it discovers definitely uninitialized variables, but does not initialize them.
When unspecified, the optimizer initializes to zero any local, scalar, non-static variables that are definitely uninitialized with respect to all paths leading to a use of the variable.
Use +Oinitcheck to look for variables in a program that may not be initialized.
When +Oinline is specified without a name list, any function can be inlined. For inlining to be successful, follow prototype definitions for function calls in the appropriate header file.
When specified with a name list, the named functions are important candidates for inlining. For example, saying
+Oinline=foo,bar +Onoinline
indicates that inlining be strongly considered for foo and bar; all other routines will not be considered for inlining, since +Onoinline is given.
When this option is disabled with a name list, the compiler will not consider the specified routines as candidates for inlining. For example, saying
+Onoinline=baz,x
indicates that inlining should not be considered for baz and x; all other routines will be considered for inlining, since +Oinline is the default.
The +Onoinline option disables inlining for all functions or a specific list of functions.
Use this option when you need to precisely control which subprograms are inlined.
Default: +Oinline_budget=100
where n is an integer in the range 1 - 1000000 that specifies the level of aggressiveness, as follows:
Note, however, that the +Oinline_budget=n option takes precedence over both of these options. This means that you can override the effect of the +Onolimit or +Osize option on inlining by specifying the +Oinline_budget=n option on the same compile line.
Default: +Onolibcalls
Use the +Olibcalls option to increase the run-time performance of code which calls standard library routines in simple contexts. The +Olibcalls option expands the following library calls inline:
A single call to printf() may be replaced by a series of calls to putchar(). Calls to sprintf() and strlen() may be optimized more effectively, including elimination of some calls producing unused results. Calls to setjmp() and longjmp() may be replaced by their equivalents _setjmp() and _longjmp(), which do not manipulate the process's signal mask.
Use +Olibcalls to improve the performance of selected library routines only when you are not performing error checking for these routines.
Using +Olibcalls with +Ofltacc will give different floating-point calculation results than those given using +Ofltacc without +Olibcalls.
The +Olibcalls option replaces the obsolete -J option.
+Olit can take the values all, const, and none.
+Olit=all places all string variables and all const-qualified variables that do not require load-time or run-time initialization in the read-only data section.
+Olit=const places all string literals appearing in a context where const char * is legal, and all const-qualified variables that do not require load-time or run-time initialization, in the read-only data section.
If +Olit=none is specified, no constants are placed in the read-only data section.
Default: +Oloop_transform
The +O[no]loop_transform option enables [disables] transformation of eligible loops for improved cache performance. The most important transformation is the reordering of nested loops to make the inner loop unit stride, resulting in fewer cache misses.
+Onoloop_transform may be a helpful option if you experience any problem while using +Oparallel.
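The reordering of nested loops described above can be pictured with a hand-written sketch. This is only an illustration of loop interchange on a small array; the function names and the ROWS and COLS sizes are assumptions made for the example, and the code is not output from the HP C optimizer.

```c
#include <assert.h>

#define ROWS 4
#define COLS 3

/* Inner loop runs down a column: successive accesses are COLS ints apart. */
long sum_column_inner(int m[ROWS][COLS])
{
    long total = 0;
    int i, j;
    assert(m != 0);
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            total += m[i][j];
    return total;
}

/* Loops interchanged: inner loop runs along a row, so accesses are unit stride. */
long sum_row_inner(int m[ROWS][COLS])
{
    long total = 0;
    int i, j;
    assert(m != 0);
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            total += m[i][j];
    return total;
}
```

Both functions compute the same sum; only the memory access order differs, which is what matters for cache behavior on large arrays.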
Default: +Oloop_unroll
The +Oloop_unroll option turns on loop unrolling. When you use +Oloop_unroll, you can also use the unroll factor to control the code expansion. The default unroll factor is 4, that is, four copies of the loop body. By experimenting with different factors, you may improve the performance of your program.
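As a rough picture of what an unroll factor of 4 means, the following sketch unrolls a summation loop by hand. The function names are hypothetical, and the code the optimizer actually generates may differ; the sketch only shows four copies of the loop body plus a cleanup loop for leftover iterations.

```c
#include <assert.h>

/* Straightforward loop: one element per iteration. */
long sum_rolled(const int *v, int n)
{
    long total = 0;
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Same loop unrolled by hand with a factor of 4. */
long sum_unrolled4(const int *v, int n)
{
    long total = 0;
    int i = 0;
    assert(n >= 0);
    for (; i + 3 < n; i += 4) {      /* four copies of the loop body */
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    for (; i < n; i++)               /* remaining 0 to 3 elements */
        total += v[i];
    return total;
}
```

The unrolled version performs fewer loop tests and branches per element at the cost of more code, which is the trade-off the unroll factor controls.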
Default: +Omoveflops
Allows [or disallows] moving conditional floating-point instructions out of loops. The +Onomoveflops option replaces the obsolete +OE option. The behavior of floating-point exception handling may be altered by this option.
Use +Onomoveflops if floating-point traps are enabled and you do not want the behavior of floating-point exceptions to be altered by the relocation of floating-point instructions.
Default: +Onomultiprocessor
If +Omultiprocessor is specified, the compiler performs optimizations appropriate for executables or shared libraries to run in several different processes on multiprocessor machines.
If you enable this option inappropriately (for example, for an executable only run on a uniprocessor system), performance may be degraded.
Default: +Oparmsoverlap
The +Oparmsoverlap option optimizes with the assumption that the actual arguments of function calls overlap in memory.
The +Onoparmsoverlap option replaces the obsolete +Om1 option.
Use +Onoparmsoverlap if C programs have been literally translated from FORTRAN programs.
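The following sketch suggests why overlapping arguments matter to the optimizer. The function add_first and its test values are hypothetical and exist only for illustration. When out and in refer to the same array, the store to out[0] changes the value later read as in[0], so under the overlap assumption in[0] cannot simply be loaded once before the loop.

```c
#include <assert.h>

/* Adds the first input element to every element: out[i] = in[i] + in[0].
 * If out and in overlap, writing out[0] also changes in[0]. */
void add_first(int *out, const int *in, int n)
{
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        out[i] = in[i] + in[0];   /* in[0] is re-read on every iteration */
}
```

With distinct arrays, the input {1, 2, 3} yields {2, 3, 4}. Called as add_first(a, a, 3) on the same data, the result is {2, 4, 5}. Code that never passes overlapping arguments, as is typically the case for code translated literally from FORTRAN, does not need this caution.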
Default: +Opipeline
Enables [or disables] software pipelining. The +Onopipeline option replaces the obsolete +Os option.
Use +Onopipeline to conserve code space.
Default: +Onoprocelim at levels 0-3, +Oprocelim at level 4
When +Oprocelim is specified, procedures that are not referenced by the application are eliminated from the output executable file. The +Oprocelim option reduces the size of the executable file, especially when optimizing at levels 3 and 4, at which inlining may have removed all of the calls to some routines.
When you specify +Onoprocelim, procedures that are not referenced by the application are not eliminated from the output executable file.
The default is +Onoprocelim at levels 0-3, and +Oprocelim at level 4.
If the +Oall option is enabled, the +Oprocelim option is enabled.
Default: +Onopromote_indirect_calls
This option uses profile data from profile-based optimization and other information to determine the most likely target of indirect calls and promotes them to direct calls. In all cases the optimized code tests to make sure the direct call is being taken and, if not, executes the indirect call. If +Oinline is in effect, the optimizer may also inline the promoted calls. This option can only be used with profile-based optimization, described in Profile-Based Optimization.
The optimizer tries to determine the most likely target of indirect calls. If the profile data is incomplete or ambiguous, the optimizer may not select the best target. If this happens, your code's performance may decrease.
At +O3, this option is only effective if indirect calls from functions within a file are mostly to target functions within the same file. This is because +O3 optimizes only within a file whereas +O4 optimizes across files.
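A hand-written sketch can show the shape of a promoted call. All names here (handler_t, double_it, negate_it, call_promoted) are hypothetical, and the sketch assumes profile data identified double_it as the most likely target; it is not actual compiler output.

```c
#include <assert.h>

typedef int (*handler_t)(int);

int double_it(int x) { return 2 * x; }
int negate_it(int x) { return -x; }

/* Original form: every call goes through the function pointer. */
int call_indirect(handler_t fp, int x)
{
    assert(fp != 0);
    return fp(x);
}

/* Promoted form, assuming double_it is the most likely target. */
int call_promoted(handler_t fp, int x)
{
    assert(fp != 0);
    if (fp == double_it)
        return double_it(x);   /* direct call, a candidate for inlining */
    return fp(x);              /* otherwise fall back to the indirect call */
}
```

The test on fp is what keeps the transformation safe when the prediction is wrong; the cost of that test is why a poorly chosen target can reduce performance.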
Default: +Onoptrs_ansi
Use +Optrs_ansi to make the following two assumptions, which the more aggressive +Optrs_strongly_typed does not make:
Default: +Onoptrs_strongly_typed
Use +Optrs_strongly_typed when pointers are type-safe. The optimizer can use this information to generate more efficient code.
Type-safe (that is, strongly-typed) pointers are pointers to a specific type that only point to objects of that type, and not to objects of any other type. For example, a pointer declared as a pointer to an int is considered type-safe if that pointer points to an object only of type int, but not to objects of any other type.
Based on the type-safe concept, a set of groups are built based on object types. A given group includes all the objects of the same type.
The term type-inferred aliasing means that any pointer of a type in a given group (of objects of the same type) can only point to an object from the same group; it cannot point to a typed object from any other group.
For more information about type aliasing see Aliasing Options.
Type casting to a different type violates type-inferred aliasing rules. See Example 2 below.
Dynamic casting is allowed. See Example 3 below.
For more details, see Aliasing Options.
Example 1: How Data Types Interact
The optimizer generally spills all global data from registers to memory before any modification to global variables or any loads through pointers. However, you can instruct the optimizer on how data types interact so it can generate more efficient code.
If you have the following:
1  int *p;
2  float *q;
3  int a,b,c,i;
4  float d,e,f;
5  foo()
6  {
7     for (i=1;i<10;i++) {
8        d=e;
9        *p=b;
10       e=d+f;
11       f=*q;
12    }
13 }
With +Optrs_strongly_typed turned on, the pointers p and q will be assumed to be disjoint because the types they point to are different types. Without type-inferred aliasing, *p is assumed to invalidate all the definitions. So, the uses of d and f on line 10 have to be loaded from memory. With type-inferred aliasing, the optimizer can propagate the copy of d and f and thus avoid two loads and two stores.
This option can be used for any application involving the use of pointers, where those pointers are type-safe. To specify when a subset of types are type-safe, use the [NO]PTRS_STRONGLY_TYPED pragma. The compiler issues warnings for any incompatible pointer assignments that may violate the type-inferred aliasing rules discussed in Aliasing Options.
Example 2: Unsafe Type Cast
Any type cast to a different type violates type-inferred aliasing rules. Do not use +Optrs_strongly_typed with code that has these unsafe type casts. Use the [NO]PTRS_STRONGLY_TYPED pragma to prevent the application of type-inferred aliasing to the unsafe type casts.
struct foo {
    int a;
    int b;
} *P;
struct bar {
    float a;
    int b;
    float c;
} *q;
P = (struct foo *) q;   /* Incompatible pointer assignment through type cast */
Example 3: Generally Applying Type Aliasing
Dynamic cast is allowed with +Optrs_strongly_typed or +Optrs_ansi. A pointer dereference is called dynamic cast if a cast is applied on the pointer to a different type.
In the example below, type-inferred aliasing is applied on P generally, not just to the particular dereference. Type-aliasing will be applied to any other dereferences of P.
struct s {
    short int a;
    short int b;
    int c;
} *P;
* (int *)P = 0;
For more information about type aliasing, see Aliasing Options.
Default: +Optrs_to_globals
By default, global variables are conservatively assumed to be modified anywhere in the program. Use this option to specify which global variables are not modified through pointers, so that the optimizer can make your program run more efficiently by incorporating copy propagation and common sub-expression elimination.
This option can be used to specify all global variables as not modified via pointers, or to specify a comma-separated list of global variables as not modified via pointers.
Note that the on state for this option disables some optimizations, such as aggressive optimizations on the program's global symbols.
For example, use the command-line option +Onoptrs_to_globals=a,b,c to specify global variables a, b, and c as not being accessed through pointers. No pointer can access these global variables. The optimizer will perform copy propagation and constant folding because storing to *p will not modify a or b.
int a, b, c;
float *p;
foo()
{
    a = 10;
    b = 20;
    *p = 1.0;
    c = a + b;
}
If all global variables are unique, use the following option without listing the global variables:
+Onoptrs_to_globals
In the example below, the address of b is taken. This means b can be accessed indirectly through the pointer. You can still use +Onoptrs_to_globals as: +Onoptrs_to_globals +Optrs_to_globals=b.
long b,c;
int *p;
p=&b;
foo()
For more information about type aliasing see Aliasing Options.
Default: +Onoregionsched
Applies aggressive scheduling techniques to move instructions across branches. This option is incompatible with the linker -z option. If used with -z, it may cause a SIGSEGV error at run-time.
Use +Oregionsched to improve application run-time speed. Compilation time may increase.
Default: no reuse of object files
This option specifies a directory where the linker can save object files created from intermediate object files when using +O4 or profile-based optimization. It reduces link time by not recompiling intermediate object files when they don't need to be.
When you compile with +I, +P, or +O4, the compiler generates intermediate code in the object file. Otherwise, the compiler generates regular object code in the object file. When you link, the linker first compiles the intermediate object code to regular object code, then links the object code. With this option you can reduce link time on subsequent links by avoiding recompiling intermediate object files that have already been compiled to regular object code and have not changed.
Note that when you do change a source file or command-line options and recompile, a new intermediate object file will be created and compiled to regular object code in the specified directory. The previous object file in the directory will not be removed. You should periodically remove this directory since old object files cannot be reused and will not be automatically removed.
Default: +Oregreassoc
If disabled, this option turns off register reassociation.
Use +Onoregreassoc to disable register reassociation if this optimization hinders the optimized application performance.
Default: assume all subprograms have side effects
Assume that subprograms specified in the name list might modify global variables. Therefore, when +Osideeffects is enabled the optimizer limits global variable optimization.
The default is to assume that all subprograms have side effects unless the optimizer can determine that there are none.
Use +Onosideeffects if you know that the named functions do not modify global variables and you wish to achieve the best possible performance.
Default: +Onosignedpointers
Perform [or do not perform] optimizations related to treating pointers as signed quantities. Applications that allocate shared memory and that compare a pointer to shared memory with a pointer to private memory may run incorrectly if this optimization is enabled.
Use +Osignedpointers to improve application run-time speed.
Default: +Onostatic_prediction
+Ostatic_prediction turns on static branch prediction for PA-RISC 2.0 targets.
PA-RISC 2.0 has two means of predicting which way conditional branches will go: dynamic branch prediction and static branch prediction. Dynamic branch prediction uses a hardware history mechanism to predict future executions of a branch from its last three executions. It is transparent and quite effective unless the hardware buffers involved are overwhelmed by a large program with poor locality.
With static branch prediction on, each branch is predicted based on implicit hints encoded in the branch instruction itself; the dynamic branch prediction is not used.
Static branch prediction's role is to handle large codes with poor locality for which the small dynamic hardware facility will prove inadequate.
Use +Ostatic_prediction to better optimize large programs with poor instruction locality, such as operating system and database code.
Use this option only when using PBO, as an amplifier to +P. It is allowed but silently ignored with +I, so makefiles need not change between the +I and +P phases.
Default: +Onovectorize
+Ovectorize allows the compiler to replace certain loops with calls to vector routines.
Use +Ovectorize to increase the execution speed of loops.
When +Onovectorize is specified, loops are not replaced with calls to vector routines.
Because the +Ovectorize option may change the order of operations in an application, it may also change the results of those operations slightly. See the HP-UX Floating-Point Guide for details.
The math library contains special prefetching versions of vector routines. If you have a PA2.0 application that contains operations on very large arrays (larger than 1 megabyte in size), using +Ovectorize in conjunction with +Odataprefetch may improve performance substantially.
You may use +Ovectorize at levels 3 and 4. +Onovectorize is also included as part of +Oaggressive and +Oall.
This option is only valid for PA-RISC 1.1 and 2.0 systems.
Default: +Onovolatile
The +Ovolatile option implies that memory references to global variables cannot be removed during optimization.
The +Onovolatile option implies that all globals are not of volatile class. This means that references to global variables can be removed during optimization.
The +Ovolatile option replaces the obsolete +OV option.
Use this option to control the volatile semantics for all global variables.
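Where only a few globals need this treatment, the ANSI C volatile qualifier can be applied to individual variables instead of switching the semantics for every global. A minimal sketch, assuming a hypothetical flag named ready that is changed outside the compiler's view; the function count_polls is likewise illustrative and not part of HP C:

```c
/* Hypothetical flag that something outside normal program flow may set.
   Declaring it volatile keeps every read of it inside the loop. */
volatile int ready;

/* Poll the flag up to max_polls times and report how many polls were
   needed before it was seen set. */
int count_polls(int max_polls)
{
    int polls = 0;
    while (!ready && polls < max_polls) {
        polls++;
    }
    return polls;
}
```

Without the qualifier (and under +Onovolatile) the optimizer would be entitled to read ready once and keep the value in a register for the whole loop.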
Default: +Onowhole_program_mode
The +Owhole_program_mode option enables the assertion that only the files that are compiled with this option directly reference any global variables and procedures that are defined in these files. In other words, this option asserts that there are no unseen accesses to the globals.
When this assertion is in effect, the optimizer can hold global variables in registers longer and delete inlined or cloned global procedures.
All files compiled with +Owhole_program_mode must also be compiled with +O4. If any of the files were compiled with +O4 but were not compiled with +Owhole_program_mode, the linker disables the assertion for all files in the program.
The default, +Onowhole_program_mode, disables the assertion.
Use this option to increase performance speed, but only when you are certain that only the files compiled with +Owhole_program_mode directly access any globals that are defined in these files.
cc +O3 +Oaggressive +Onolimit +Onomoveflops +Onopipeline \
    sourcefile.c
Specify the level of optimization first (+O1, +O2, +O3, or +O4), followed by any +O[no]optimization options.
if(a)
{
    .
    .
    .
    statement 1
}
else
{
    goto L1;
}
statement 2
L1:
becomes:
if(!a)
{
    goto L1;
}
statement 1
statement 2
L1:
For example, the code:
if(0)
{
    a = 1;
}
else
{
    a = 2;
}
becomes:
a = 2;
This module performs the following:
LDW   -52(0,30),r1
ADDI  3,r1,r31        ;interlock with load of r1
LDI   10,r19
becomes:
LDW   -52(0,sp),r1
LDI   10,r19
ADDI  3,r1,r31        ;use of r1 is now separated from load
For example, the code:
LDI      32,r3
AND      r1,r3,r2
COMIB,=  0,r2,L1
becomes:
BB,>=    r1,26,L1
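The two instruction sequences above test the same condition. The sketch below restates both in C, assuming the PA-RISC convention that bit 0 is the most significant bit of a 32-bit register, so that bit 26 carries the value 32. The function names are illustrative only.

```c
#include <stdint.h>

/* Mask-and-compare form: true when the bit selected by mask 32 is clear. */
int bit_clear_by_mask(uint32_t r1)
{
    return (r1 & 32u) == 0;
}

/* Single-bit form, assuming bit 0 is the most significant bit of a 32-bit
   register, so bit 26 has the value 1 << (31 - 26) == 32. */
int bit_clear_by_index(uint32_t r1)
{
    return ((r1 >> (31 - 26)) & 1u) == 0;
}
```

Both functions agree for every input, which is why the three-instruction sequence can be collapsed into one branch-on-bit instruction.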
You can help the optimizer understand when certain variables are heavily used within a function by declaring these variables with the register qualifier. The first 10 register-qualified variables encountered in the source are honored. You should pick the ten most important variables to be most effective.
The coloring register allocator may override your choices and promote to a register a variable not declared register over one that is, based on estimated speed improvements.
The following code shows the type of optimization the coloring register allocation module performs. The code:
LDI   2,r104
COPY  r104,r103
LDO   5(r103),r106
COPY  r106,r105
LDO   10(r105),r107
becomes:
LDI   2,r25
LDO   5(r25),r26
LDO   10(r26),r31
For example, the code:
for (i=0; i<25; i++) {
    r[i] = i * k;
}
becomes:
t1 = 0;
for (i=0; i<25; i++) {
    r[i] = t1;
    t1 += k;
}
For example, the code:
a = x + y + z;
b = x + y + w;
becomes:
t1 = x + y;
a = t1 + z;
b = t1 + w;
A = 10;
B = A + 5;
C = 4 * B;
can be replaced by:
A = 10;
B = 15;
C = 60;
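Each of these rewrites must leave the program's results unchanged. As a check on the loop example earlier in this group, where the multiplication i * k is replaced by a running addition, the sketch below writes both forms as complete functions. The names fill_multiply, fill_add, and N are illustrative and do not come from the compiler.

```c
#define N 25

/* Original form: one multiply per iteration. */
void fill_multiply(int r[N], int k)
{
    int i;
    for (i = 0; i < N; i++) {
        r[i] = i * k;
    }
}

/* Strength-reduced form: the multiply becomes a running addition. */
void fill_add(int r[N], int k)
{
    int i;
    int t1 = 0;
    for (i = 0; i < N; i++) {
        r[i] = t1;
        t1 += k;
    }
}
```

Filling two arrays with the same k and comparing them element by element shows the two loops are interchangeable.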
For example, the code:
x = z;
for(i=0; i<10; i++)
{
    a[i] = 4 * x + i;
}
becomes:
x = z;
t1 = 4 * x;
for(i=0; i<10; i++)
{
    a[i] = t1 + i;
}
For example, the following HP C code:
a = x + 23;
return a;
where a is a local variable, produces the following code for the unoptimized case:
LDO  23(r26),r1
STW  r1,-52(0,sp)
LDW  -52(0,sp),ret0
and this code for the optimized case:
LDO  23(r26),ret0
For example, the function:
f(int x)
{
    int a,b,c;
    a = 1;
    b = 2;
    c = x * b;
    return c;
}
becomes:
f(int x)
{
    int a,b,c;
    b = 2;
    c = x * b;
    return c;
}
The goal of software pipelining is to avoid CPU stalls due to memory or hardware pipeline latencies. The software pipelining transformation adds code before and after the loop to achieve a high degree of optimization within the loop.
#define SIZ 10000
float x[SIZ], y[SIZ];    /* Software pipelining works with */
int i;                   /* floats and doubles.            */
init();
for (i = 0; i < SIZ; i++)
{
    x[i] = x[i] / y[i] + 4.00;
}
When this loop is compiled with software pipelining, the optimization can be expressed in pseudo-code as follows:
R1 = 0;                 Initialize array index.
R2 = 4.0;               Load constant value.
R3 = Y[0];              Load first Y value.
R4 = X[0];              Load first X value.
R5 = R4 / R3;           Perform division on first element: n = X[0] / Y[0].
do {                    Begin loop.
    R6 = R1;            Save current array index.
    R1++;               Increment array index.
    R7 = X[R1];         Load current X value.
    R8 = Y[R1];         Load current Y value.
    R9 = R5 + R2;       Perform addition on prior row: X[i] = n + 4.0.
    R10 = R7 / R8;      Perform division on current row: m = X[i+1] / Y[i+1].
    X[R6] = R9;         Save result of operations on prior row.
    R6 = R1;            Save current array index.
    R1++;               Increment array index.
    R4 = X[R1];         Load next X value.
    R3 = Y[R1];         Load next Y value.
    R11 = R10 + R2;     Perform addition on current row: X[i+1] = m + 4.
    R5 = R4 / R3;       Perform division on next row: n = X[i+2] / Y[i+2].
    X[R6] = R11;        Save result of operations on current row.
} while (R1 <= 100);    End loop.
R9 = R5 + R2;           Perform addition on last row: X[i+2] = n + 4.
X[R6] = R9;             Save result of operations on last row.
This transformation stores intermediate results of the division instructions in unique registers (noted as n and m). These registers are not referenced until several instructions after the division operations. This decreases the possibility that the long latency period of the division instructions will stall the instruction pipeline and cause processing delays.
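The same idea can be written by hand in C to make the structure visible: a prologue starts the first division, the loop body overlaps the division for the next element with the addition and store for the previous one, and an epilogue finishes the last element. This is a simplified sketch with one element in flight rather than the compiler's actual schedule; scale_plain and scale_pipelined are hypothetical names.

```c
#include <stddef.h>

/* Straightforward loop: each iteration divides, then adds. */
void scale_plain(float *x, const float *y, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        x[i] = x[i] / y[i] + 4.0f;
    }
}

/* Hand-pipelined sketch: the division for element i+1 is started before
   the addition for element i is completed and stored. */
void scale_pipelined(float *x, const float *y, size_t n)
{
    size_t i;
    float quotient;                     /* division result carried forward */
    float next;

    if (n == 0) {
        return;
    }
    quotient = x[0] / y[0];             /* prologue */
    for (i = 0; i + 1 < n; i++) {
        next = x[i + 1] / y[i + 1];     /* start the next division */
        x[i] = quotient + 4.0f;         /* finish the previous element */
        quotient = next;
    }
    x[n - 1] = quotient + 4.0f;         /* epilogue */
}
```

Both functions produce the same array contents; the pipelined form simply gives each division more time to complete before its result is needed.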
Use the +Onopipeline option with the +O2, +O3, or +O4 option to suppress software pipelining if program size is more important than execution speed. This will perform level two optimization, but disable software pipelining.
Within loops, the virtual memory address expression can be rearranged and separated into a loop varying term and a loop invariant term. Loop varying terms are those items whose values may change from one iteration of the loop to another. Loop invariant terms are those items whose values are constant throughout all iterations of the loop. The loop varying term corresponds to the difference in the virtual memory address associated with a particular array reference from one iteration of the loop to the next.
The register reassociation optimization dedicates a register to track the value of the virtual memory address expression for one or more array references in a loop and updates the register appropriately in each iteration of a loop.
The register is initialized outside the loop to the loop invariant portion of the virtual memory address expression and the register is incremented or decremented within the loop by the loop variant portion of the virtual memory address expression. On PA-RISC, the update of such a dedicated register can often be performed for free using the base-register modification capability of load and store instructions.
The net result is that array references in loops are converted into equivalent but more efficient pointer dereferences.
For example:
int a[10][20][30];
void example (void)
{
    int i, j, k;
    for (k = 0; k < 10; k++)
        for (j = 0; j < 10; j++)
            for (i = 0; i < 10; i++)
            {
                a[i][j][k] = 1;
            }
}
after register reassociation is applied to the innermost loop becomes:
int a[10][20][30];
void example (void)
{
    int i, j, k;
    register int (*p)[20][30];
    for (k = 0; k < 10; k++)
        for (j = 0; j < 10; j++)
            for (p = (int (*)[20][30]) &a[0][j][k], i = 0; i < 10; i++)
            {
                *(p++[0][0]) = 1;
            }
}
In the above example, the compiler-generated temporary register variable, p, strides through the array a in the innermost loop. This register pointer variable is initialized outside the innermost loop and auto-incremented within the innermost loop as a side-effect of the pointer dereference.
Register reassociation can often enable another loop optimization. After performing the register reassociation optimization, the loop variable may be needed only to control the iteration count of the loop. If this is the case, the original loop variable can be eliminated altogether by using the PA-RISC ADDIB and ADDB machine instructions to control the loop iteration count.
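To see that the pointer form touches exactly the same elements as the indexed form, the sketch below runs both against a flat buffer. The flat layout, the IDX macro, and the function names are assumptions made here so that all pointer arithmetic stays inside a single array object; the stride PLANE is the loop-varying term (20 * 30 ints per step of the first index).

```c
#define D1 10
#define D2 20
#define D3 30
#define PLANE (D2 * D3)                 /* ints per step of the first index */
#define IDX(i, j, k) ((i) * PLANE + (j) * D3 + (k))

/* Index form: the full address expression is evaluated every iteration. */
void fill_indexed(int *buf)
{
    int i, j, k;
    for (k = 0; k < 10; k++)
        for (j = 0; j < 10; j++)
            for (i = 0; i < 10; i++)
                buf[IDX(i, j, k)] = 1;
}

/* Pointer form: p is set to the loop-invariant part of the address once,
   then advanced by the loop-varying stride on each iteration. */
void fill_strided(int *buf)
{
    int i, j, k;
    for (k = 0; k < 10; k++)
        for (j = 0; j < 10; j++) {
            int *p = &buf[IDX(0, j, k)];
            for (i = 0; i < 10; i++) {
                *p = 1;
                if (i < 9)
                    p += PLANE;         /* stay inside the buffer */
            }
        }
}
```

In the strided version the index i only counts iterations, which is the situation in which the loop variable itself can be eliminated.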
Level 3 optimization produces faster run-time code than level 2 on code that frequently calls small functions within a file. Level 3 links faster than level 4.
/* Return the greatest common divisor of two positive integers, */
/* int1 and int2, computed using Euclid's algorithm. (Return 0  */
/* if either is not positive.)                                  */
int gcd(int1,int2)
    int int1;
    int int2;
{
    int inttemp;
    if ( ( int1 <= 0 ) || ( int2 <= 0 ) ) {
        return(0);
    }
    do {
        if ( int1 < int2 ) {
            inttemp = int1;
            int1 = int2;
            int2 = inttemp;
        }
        int1 = int1 - int2;
    } while (int1 > 0);
    return(int2);
}
main()
{
    int xval,yval,gcdxy;
    /* statements before call to gcd */
    gcdxy = gcd(xval,yval);
    /* statements after call to gcd */
}
After inlining, the source file looks like this:
main()
{
    int xval,yval,gcdxy;
    /* statements before inlined version of gcd */
    {
        int int1;
        int int2;
        int1 = xval;
        int2 = yval;
        {
            int inttemp;
            if ( ( int1 <= 0 ) || ( int2 <= 0 ) ) {
                gcdxy = ( 0 );
                goto AA003;
            }
            do {
                if ( int1 < int2 ) {
                    inttemp = int1;
                    int1 = int2;
                    int2 = inttemp;
                }
                int1 = int1 - int2;
            } while ( int1 > 0 );
            gcdxy = ( int2 );
        }
    }
AA003 : ;
    /* statements after inlined version of gcd */
}
Inlining substitutes function calls with copies of the function's object code. Only functions that meet the optimizer's criteria are inlined. This may result in slightly larger executable files. However, this increase in size is offset by the elimination of time-consuming procedure calls and procedure returns.
When gathering this information, the HP C compiler makes the following assumption: while inside a function, the only variables that can be accessed indirectly through a pointer or by another function call are:
OPT_LEVEL 0, 1, and 2 provide more control over optimization than the +O1 and +O2 compiler options. You use these pragmas to raise or lower optimization at a function level inside the source file. The compiler options, by contrast, apply only to an entire source file. (OPT_LEVEL 3 and 4 can only be used at the beginning of the source file.)
Table 11: Optimization Level Precedence shows the possible combinations of options and pragmas and the resulting optimization levels. The level at which a function will be optimized is the lower of the two values specified by the command-line optimization level and the optimization pragma in force.
The values of OPTIMIZE and OPT_LEVEL are summarized in Table 12: Optimizer Control Pragmas.
The NOINLINE pragma disables inlining for all functions or for the specified functionname(s).
The syntax for performing inlining is:
#pragma INLINE [functionname(1), ..., functionname(n)]
#pragma NOINLINE [functionname(1), ..., functionname(n)]
For example, to specify inlining of the two subprograms checkstat and getinput, use:
#pragma INLINE checkstat, getinput
To specify that an infrequently called routine should not be inlined when compiling at optimization level 3 or 4, use:
#pragma NOINLINE opendb
See also the related +O[no]inline optimization option.
The compiler gathers information about each function (such as information about function calls, variables, parameters, and return values) and passes this information to the optimizer. The NO_SIDE_EFFECTS and ALLOCS_NEW_MEMORY pragmas tell the optimizer to make assumptions it cannot normally make, resulting in improved compile-time and run-time speed. They change the default information the compiler collects.
If used, the NO_SIDE_EFFECTS and ALLOCS_NEW_MEMORY pragmas should appear before the first function defined in a file and are in effect for the entire file. When used appropriately, these optional pragmas provide better optimization.
NO_SIDE_EFFECTS has the following form:
#pragma NO_SIDE_EFFECTS functionname(1), ..., functionname(n)
All functions in functionname are the names of functions that do not modify the values of global variables. Global variable references can be optimized to a greater extent in the presence of calls to the listed functions. Note that you need the NO_SIDE_EFFECTS pragma in the files where the calls are made, not where the function is defined. This pragma takes effect from the line it first occurs on to the end of the file.
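As a hedged illustration, the fragment below marks a small function as free of side effects in the file that calls it. The names scale, sum_scaled, and limit are hypothetical. The __hpux guard is an assumption made here so that compilers other than HP C skip the pragma; it is not required by the pragma itself.

```c
/* Assumption: the pragma is meaningful only to HP C, so it is guarded by
   the __hpux macro and skipped elsewhere. */
#if defined(__hpux)
#pragma NO_SIDE_EFFECTS scale
#endif

int limit = 100;            /* global that the loop below reads */

/* scale() uses only its arguments and writes no global variables. */
int scale(int value, int factor)
{
    return value * factor;
}

/* Because scale() is declared free of side effects, the optimizer may
   keep limit in a register across the calls instead of reloading it. */
int sum_scaled(int n)
{
    int i, total = 0;
    for (i = 0; i < n && i < limit; i++) {
        total += scale(i, 2);
    }
    return total;
}
```

If scale() were later changed to modify limit, the pragma would no longer be truthful and would have to be removed.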
#pragma ALLOCS_NEW_MEMORY functionname(1), ..., functionname(n)
The new memory must be memory that was either newly allocated or was previously freed and is now reallocated. For example, the standard routines malloc() and calloc() satisfy this requirement.
Large applications might have routines that are layered above malloc() and calloc(). These interface routines make the calls to malloc() and calloc(), initialize the memory, and return the pointer that malloc() or calloc() returns. For example, in the program below:
struct_type *get_new_record(void)
{
    struct_type *p;
    if ((p=malloc(sizeof(*p))) == NULL) {
        printf("get_new_record():out of memory\n");
        abort();
    }
    else {
        /* initialize the struct */
        .
        .
        .
        return p;
    }
}
the routine get_new_record falls under this category, and can be included in the ALLOCS_NEW_MEMORY pragma.
#pragma FLOAT_TRAPS_ON {functionname,...functionname}
#pragma FLOAT_TRAPS_ON { _ALL }
For example:
#pragma FLOAT_TRAPS_ON xyz,abc
informs the compiler and optimizer that xyz and abc have floating-point traps turned on and therefore LICM (loop-invariant code motion) optimization should not be performed.
#pragma PTRS_STRONGLY_TYPED END
#pragma NOPTRS_STRONGLY_TYPED BEGIN
#pragma NOPTRS_STRONGLY_TYPED END
Any types that are defined between the begin-end pair are taken to apply type-safe assumptions. These pragmas are not allowed to nest. For each BEGIN an associated END must be defined in the compilation unit.
The pragma takes precedence over the command-line option, although sometimes both are required (see Example 2).
Example 1
double *d;
#pragma PTRS_STRONGLY_TYPED BEGIN
int *i;
float *f;
#pragma PTRS_STRONGLY_TYPED END
main()
{
    .
    .
    .
}
In this example only two types, pointer-to-int and pointer-to-float, will be assumed to be type-safe.
Example 2
cc +Optrs_strongly_typed foo.c
/* source for Ex.2 */
double *d;
...
#pragma NOPTRS_STRONGLY_TYPED BEGIN
int *i;
float *f;
#pragma NOPTRS_STRONGLY_TYPED END
...
main()
{
    ...
}
In this example all types are assumed to be type-safe except the types bracketed by pragma NOPTRS_STRONGLY_TYPED. The command-line option is required because the default option is +Onoptrs_strongly_typed.
Here are the type-inferred aliasing rules:
int *p, *q;
*p and *q are assumed to alias with any objects of type int. Also *p and *q are assumed to alias with each other.
typedef struct foo_st type_old;
typedef type_old type_new;
Each field of a structure type is placed in a separate equivalent group which is distinct from the equivalent group of the field's base type. (The assumption here is that a pointer to int will not be assigned the address of a structure field whose type is int). The actual type name of a structure type is not considered significant in constructing equivalent groups (e.g., dereferences of a struct foo pointer and a struct bar pointer will be assumed to alias with each other even if struct foo and struct bar have identical field declarations).
| NOTE | Variables declared to be of type void * need to be typecast into a pointer to a specific type before they can be dereferenced. |
| WARNING | This pragma must be at the function definition and at all call sites. If the pragma is omitted from the function definition or from any call, the linker will generate parameter relocation code and the application will behave incorrectly since floating point parameters will not be in expected registers. |
#pragma HP_NO_RELOCATION name1[,name2[, ...]]
where name1, name2, and so forth are names of functions in shared libraries.
| NOTE | Do not use this option with functions that use the varargs macros. See the HP C/HP-UX Reference Manual or the varargs(5) man page for information on the varargs macros. |
Put this pragma in header files of functions that will go in shared libraries so it is specified at the function definition and at all calls. For functions with floating point parameters or returns, use the HP_NO_RELOCATION pragma along with this pragma.
| WARNING | This pragma must be at the function definition and at all call sites. If the pragma is omitted from the function definition or from any call, the compiler will generate incompatible return code and the application will behave incorrectly. |
#pragma HP_LONG_RETURN name1[,name2[, ...]]
where name1, name2, and so forth are names of functions in shared libraries.
The HP_LONG_RETURN pragma generates a long return sequence in the export stub instead of an interspace branch. If you also use the HP_NO_RELOCATION pragma (for functions taking floating point parameters) with the HP_LONG_RETURN pragma, all the code in the export stub is omitted, eliminating the export stub entirely. The HP_LONG_RETURN pragma by itself eliminates the need for export stubs for functions taking non-floating-point parameters.
| NOTE | Using HP_LONG_RETURN without HP_NO_RELOCATION on functions with floating point parameters could actually degrade performance by creating export stubs and relocation stubs. These pragmas improve performance of calls to shared library functions from outside the shared library. Therefore do not use this pragma for hidden functions (see the -h and +e linker options) or for functions called only from within the same shared library linked with the -B symbolic linker option; otherwise this pragma may degrade performance. (See the HP-UX Linker & Libraries User's Guide for information on the above mentioned options.) Do not use this pragma if you compile on PA-RISC 2.0 or later or with the +DA2.0 option since the effect is the default. That is, if no relocations are generated, export stubs are not generated on PA-RISC 2.0 and later, and a long return instruction sequence is generated by default, so this pragma has no effect. |
| WARNING | Do not use this pragma at function definitions, only at function calls. Specifying it at function definitions will result in incorrect behavior. On PA-RISC 1.1, use this pragma only when calling a shared library from an executable file. Using it on calls within an executable file will cause the program to abort. |
#pragma HP_DEFINED_EXTERNAL name1[,name2[, ...]]
where name1, name2, and so forth are names of functions in shared libraries.
| NOTE | If your function takes floating-point parameters, you should also use the HP_NO_RELOCATION pragma. You should also use the HP_LONG_RETURN pragma with this pragma. If you don't, the import stub may be too large to inline. Use this pragma only on calls to functions in shared libraries. On PA-RISC 2.0, it will degrade performance of calls to any other functions. |