HP-UX Floating-Point Guide: HP 9000 Computers > Chapter 7 Performance Tuning

Inefficient Code


The HP-UX compilers are highly optimizing and generally produce extremely efficient code. However, you can control the degree of efficiency and the types of optimizations with compiler options and directives. Particularly important from a performance standpoint are the compiler options that do the following:

  • Optimize your program

  • Specify the architecture type, causing the compiler to generate code for a specific type of machine

  • Cause the compiler to emit debugger information

  • Cause the compiler to produce position-independent code, which generally runs more slowly than absolute code

  • Enable profile-based optimization (PBO)

  • (Fortran only) Cause the compiler to make all local variables static and to initialize all uninitialized static data to zero

The following sections describe each of these options. Many of the options are available both as command-line options and as directives or pragmas that you can place in your source code. For more information and specific syntax, refer to the appropriate HP language reference manual.

If the compiler generates inefficient code even when you use the appropriate options, you may choose to write parts of your program in assembly language. “Writing Routines in Assembly Language” describes the advantages and disadvantages of this choice.

Optimizing Your Program

For a thorough discussion of optimization on HP 9000 systems, see the HP PA-RISC Compiler Optimization Technology White Paper. See the appropriate HP language manual for additional information.

The most important compiler option affecting efficiency is the optimization option, +O, which allows you to optimize your program in several different ways:

Levels of optimization

Optimization levels, numbered from 0 to 4, allow you to select a broad category of optimizations, from minimal optimization to full optimization.

Types of optimization

Optimization types allow you to select groups of optimizations that fall into a particular category. For example, +Osize suppresses optimizations that significantly increase code size. If you specify an optimization type, you must also specify an optimization level.

Specific optimizations

Specific optimizations allow you to turn on or off particular optimizations that may be appropriate or inappropriate for your program. For example, +Opipeline (the default at optimization levels 2, 3, and 4) enables software pipelining. If you select a specific optimization, you must also specify an optimization level.

In general, the higher the optimization level, the more efficient the code. In performing optimizations, however, the compiler often rearranges code and makes assumptions about the way variables will be used in other modules. Choosing a high optimization level therefore carries some risk: the compiler may make invalid assumptions that can cause code to run more slowly. This is particularly true if your code makes frequent use of pointers. It is always a good idea to compile a program at different optimization levels and compare both the run times and the output, to make sure that the optimizations are neither degrading performance nor changing the results. See “Compiler Behavior and Compiler Version” and “Compiler Options” for information about how compiler optimizations can affect program results.
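
One way to compare results across optimization levels is to have each build write its floating-point output to an array or file and then measure how far the two sets of values differ. The following is a minimal sketch in C, assuming the results are available as arrays of double. The function name max_rel_diff and the choice of a relative measure are illustrative assumptions, not part of any HP tool.

```c
#include <stddef.h>

/* Largest relative difference between two result vectors of length n.
   ref holds values from a baseline build, test holds values from a
   build made at another optimization level. (Hypothetical helper.) */
double max_rel_diff(const double *ref, const double *test, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff  = ref[i] > test[i] ? ref[i] - test[i] : test[i] - ref[i];
        double scale = ref[i] < 0.0 ? -ref[i] : ref[i];
        /* Fall back to the absolute difference when the reference is zero. */
        double rel   = scale > 0.0 ? diff / scale : diff;
        if (rel > worst)
            worst = rel;
    }
    return worst;
}
```

A return value of zero means the two builds agree exactly. What counts as an acceptable nonzero value depends on the application.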

The following specific optimizations are particularly relevant to floating-point programs. Most of them are available at optimization levels 2, 3, and 4.

+O[no]dataprefetch

+Odataprefetch, which has an effect only when you use the +DA2.0 option, inserts instructions within innermost loops to fetch data from memory into the data cache ahead of time so that it is already there when it is needed. Data prefetch instructions are inserted only for data structures referenced within innermost loops using simple loop varying addresses (that is, in a simple linear sweep across large amounts of memory). It is useful for applications that have high data cache miss overhead; that is, it improves the performance of operations on arrays that are so large they exceed the size of the cache.

As a general rule of thumb, using +Odataprefetch will probably help performance if your application contains numerous references to arrays, and if the sum of the sizes of all the arrays in your program totals more than a megabyte. It can also help if your application contains only a single pass through an extremely large array (tens of megabytes in size). However, if your program contains very frequent references to small arrays, +Odataprefetch can actually impair performance. Therefore, the only way to find out for sure whether this option will help your program is to try it.

The +Odataprefetch option is effective with both vectorized and unvectorized loops. In fact, if your PA2.0 application uses very large arrays, you may gain considerable performance benefit from using +Odataprefetch in conjunction with +Ovectorize. The math library contains special prefetching versions of the vector routines, which are called if you specify both options.
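
The distinction between a simple linear sweep and other access patterns can be sketched in C as follows. Both functions and their names are hypothetical. The first loop matches the description above (an innermost loop with a simple loop-varying address). The second uses an index array, so its addresses are not a simple function of the loop counter, and it presumably falls outside that description.

```c
#include <stddef.h>

/* Simple linear sweep: the address of a[i] advances by a fixed stride
   on every iteration of the innermost (here, the only) loop. */
double sum_linear(const double *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Indirect access: the address of a[idx[i]] depends on the contents
   of idx, so it is not a simple loop-varying address. */
double sum_indirect(const double *a, const size_t *idx, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += a[idx[i]];
    return total;
}
```

Any benefit would be expected only when the array is far larger than the data cache, so timing the real program with and without the option remains the deciding test.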

+O[no]fltacc

+Ofltacc, which is the default at levels 2, 3, and 4, disables optimizations that are algebraically correct but that may result in numerical differences. (Usually these differences are insignificant.) To enable these optimizations, use +Onofltacc.

On PA2.0 systems at level 2 and higher, if you specify neither +Ofltacc nor +Onofltacc, or if you specify +Onofltacc, the compiler generates FMA (fused multiply-add) instructions (see “Architecture Type of Run-Time System” for details). Specify +Ofltacc to suppress the generation of these instructions.

The +Onofltacc option is invoked by default when you specify the optimization type +Oaggressive; use +Oaggressive +Ofltacc if you want aggressive optimization without sacrificing floating-point accuracy.
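
To see how an algebraically correct transformation can change a numerical result, consider regrouping a three-term sum. The sketch below is a general illustration in C with IEEE double-precision arithmetic. It does not claim that the HP compilers perform this particular regrouping under +Onofltacc, and the function names are hypothetical.

```c
/* Sum evaluated left to right, as written in the source.
   volatile forces the intermediate to be rounded to double. */
double sum_as_written(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Algebraically identical regrouping of the same sum. */
double sum_regrouped(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e16, b = -1e16, and c = 1.0, the first function returns 1.0. In the second, b + c rounds back to -1e16 because adjacent doubles near 1e16 are 2 apart, so the final result is 0.0.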

+O[no]inline, +Oinline_budget=n

+Oinline, which is available at levels 3 and 4 and is the default at those levels, enables inlining of function calls. Inlining can improve performance significantly if your application makes many math library calls. It is especially effective on PA2.0 systems. The +Oinline_budget option, also available at levels 3 and 4, can be used to specify how aggressively you want the compiler to pursue inlining opportunities. The default value of n is 100.

+O[no]libcalls

+Olibcalls, which is available at all optimization levels (0 through 4), invokes millicode versions of several frequently called math library functions. It also inlines the double-precision versions of the sqrt and fabs C functions.

This option is invoked by default when you specify the optimization type +Oaggressive; use +Oaggressive +Onolibcalls if you want aggressive optimization without using millicode routines.

Do not use this option on a C program that depends on the setting of errno by math library functions. See “Millicode Versions of Math Library Functions” for details.

+O[no]moveflops

+Omoveflops, which is the default at levels 2, 3, and 4, moves conditional floating-point instructions out of loops. This option may alter floating-point exception behavior. Use +Onomoveflops if you depend on floating-point exception behavior and you do not want this behavior to be altered by the relocation of floating-point instructions.
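
The effect on exception behavior can be illustrated by hoisting a conditional floating-point operation by hand. This is a sketch of the general kind of transformation described above, not the compiler's actual output, and the function names are hypothetical.

```c
#include <stddef.h>

/* The division executes only for elements whose flag is set. */
void scale_conditional(double *y, const double *x, const int *flag,
                       size_t n, double num, double den)
{
    for (size_t i = 0; i < n; i++)
        if (flag[i])
            y[i] = x[i] * (num / den);
}

/* Hand-hoisted form: the division executes once, before the loop,
   whether or not any flag is set. */
void scale_hoisted(double *y, const double *x, const int *flag,
                   size_t n, double num, double den)
{
    double ratio = num / den;
    for (size_t i = 0; i < n; i++)
        if (flag[i])
            y[i] = x[i] * ratio;
}
```

Both functions store the same values. The difference appears when den is zero and no flag is set: the first version never divides, while the second performs a division by zero that the original program would not have executed.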

+O[no]vectorize

+Ovectorize, available with the Fortran and C compilers only, replaces eligible loops with calls to vector routines in the math library. The +Ovectorize option is invoked by default when you specify the optimization type +Oaggressive; use +Oaggressive +Onovectorize if you want aggressive optimization without vector calls.

Any files that were compiled with +Ovectorize must also be linked with +Ovectorize (this happens automatically when the compiler invokes the linker).

This option can be used at optimization levels 3 and 4. The default is +Onovectorize. This option is valid only when you compile for PA1.1 and PA2.0 systems.

If your PA2.0 application uses very large arrays, you may gain considerable performance benefit from using +Odataprefetch in conjunction with +Ovectorize. The math library contains special prefetching versions of the vector routines, which are called if you specify both options.

Specifying the Architecture Type

All HP 9000 compilers support the +DA option, which specifies a particular target architecture type, either PA-RISC 1.1 or PA-RISC 2.0. Use of this option causes the compiler to produce architecture-specific instructions and calls to special architecture-specific run-time libraries.

Specifying the architecture type of the systems on which your code will run will probably improve the performance of your code if it makes substantial use of floating-point arithmetic or math library calls. See “Selecting Different Versions of the Math Libraries”, “Architecture Type of Run-Time System”, and “BLAS Library Versions” for more information.

Use of the +DA2.0 option to generate PA2.0 code will improve the performance of your application even more if the source provides opportunities for the compiler to generate FMA (fused multiply-add) instructions (see “Architecture Type of Run-Time System” for details). For example, if two statements like

c = a * b

and

e = c - d

are separated by intervening statements in your program, you may want to place them one right after the other or to combine them into

e = a * b - d

This kind of rearrangement will be most effective if done within loops.
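
Inside a loop, the rearrangement might look like the following C sketch. The function names and the scratch array that stands in for the intervening statements are hypothetical. Whether a given compiler version needs this help to pair the multiply with the subtract is not something this guide states, so treat it as an experiment to measure.

```c
#include <stddef.h>

/* Multiply and subtract separated by unrelated work. */
void mul_sub_separated(double *e, double *scratch, const double *a,
                       const double *b, const double *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double c = a[i] * b[i];
        scratch[i] = d[i] + 1.0;      /* intervening statement */
        e[i] = c - d[i];
    }
}

/* Multiply and subtract combined into a single expression. */
void mul_sub_combined(double *e, double *scratch, const double *a,
                      const double *b, const double *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        scratch[i] = d[i] + 1.0;
        e[i] = a[i] * b[i] - d[i];    /* candidate for one FMA */
    }
}
```

Because an FMA instruction rounds once rather than twice, the combined form can differ from the separated form in the last bit when FMA instructions are generated.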

The +DS option also has a significant effect on performance, because it specifies an architecture-specific instruction scheduler. If your code must be portable across all HP 9000 architectures, you must compile with +DA1.1, but you may compile with either +DS1.1 or +DS2.0. Use +DS2.0 if you want to achieve the best possible performance on PA2.0 systems. See the appropriate HP language reference manual for more information about this option.

Including Debugging Information

All HP 9000 compilers allow you to include debugging information in the object file at optimization levels 0, 1, and 2. Debugging information increases the size of the object code. The debugging option is extremely useful during program development, but for the final product you should compile without it.

Producing Position-Independent Code

By default, compilers produce absolute code for HP 9000 systems. You can produce position-independent code (PIC) for use in building shared libraries. In general, absolute code is faster than PIC because addressing calculations are simpler and shorter. Consult Programming on HP-UX for more information about absolute and position-independent code. See “Shared Libraries versus Archive Libraries” for more information on the performance impact of shared libraries.

Using Profile-Based Optimization

HP C, HP C++, HP FORTRAN/9000, and HP Pascal support profile-based optimization (PBO) on HP 9000 systems. PBO can improve the performance of programs that are branch-intensive and that exhibit poor instruction memory locality. These tend not to be issues in floating-point-intensive applications, but if you suspect that they are degrading the performance of your program, you can use PBO to minimize their impact. Under PBO, the compiler and linker use profile data collected from a typical data set to produce an executable file that incurs fewer instruction cache misses, Translation Lookaside Buffer (TLB) misses, and memory page faults. For information about PBO, see the HP-UX Linker and Libraries Online User Guide and the appropriate compiler documentation.

Creating and Zeroing Static Data (Fortran only)

HP Fortran 90 provides an option, +save, that forces static storage for all local variables and that forces the compiler to initialize all uninitialized static variables to zero. HP FORTRAN/9000 provides an equivalent option, -K; the +e option also automatically saves all local variables, if possible.

Use these options judiciously. They are costly from a performance standpoint and also from a software engineering perspective because they change the semantics of an entire module rather than altering specific problem areas.

The optimization option +Oinitcheck performs initialization in a more selective way that has less impact on the performance of your program. Use this option in Fortran 90 programs. See the f90(1) or f77(1) man page for details.

See “Static Variables” for more information about static data.
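
Although these options apply only to Fortran, the semantic change they make can be illustrated with a C analogue. The sketch below uses the C static keyword as a stand-in. It is not a description of how +save or -K is implemented, and the function names are hypothetical.

```c
/* Automatic local: a fresh variable on every call. */
int calls_automatic(void)
{
    int count = 0;
    count++;
    return count;
}

/* Static local: initialized to zero once, and its value
   persists from one call to the next. */
int calls_static(void)
{
    static int count;   /* implicitly initialized to 0 */
    count++;
    return count;
}
```

The first function returns 1 on every call. The second returns 1, then 2, and so on. Forcing every local variable in a module to behave like the second case is the module-wide change in semantics that the caution above refers to.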

Writing Routines in Assembly Language

If you have compiled with all of the correct compiler options and you are still not satisfied with the program's performance, you may want to examine the generated code to see exactly what is happening. To get an expanded listing, specify the -S option. You can also code parts of your program directly in assembly language. Assembly language is useful if performance is critical and portability is not.

When deciding whether to write something in assembly language, keep in mind that the HP 9000 compilers are highly optimizing. If the code section is large, the compiler can probably generate code as good as or better than an assembly language program. Good candidates for assembly language are short, frequently called routines. However, using the +Oinline compiler option may improve the performance of these routines enough to make it unnecessary to rewrite them in assembly language.

© 1997 Hewlett-Packard Development Company, L.P.