Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP
More options
HP.com home
Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide > Chapter 3 Compiler optimizations

+O2 level optimizations

» 

Technical documentation

Complete book in PDF
» Feedback
Content starts here

 » Table of Contents

 » Glossary

At optimization level +O2, the compiler performs optimizations on a routine level. The compiler continues to perform the optimizations performed at +O1, with the following additions:

Global register allocation

Scalar variables can often be stored in registers, eliminating the need for costly memory accesses. Global register allocation (GRA) attempts to store commonly referenced scalar variables in registers throughout the code in which they are most frequently accessed.

The compiler automatically determines which scalar variables are the best candidates for GRA and allocates registers accordingly.

GRA can sometimes cause problems when parallel threads attempt to update a shared variable that has been allocated a register. In this case, each parallel thread will allocate a register for the shared variable; it is then unlikely that the copy in memory will be updated correctly as each thread executes.

Parallel assignments to the same shared variables from multiple threads make sense only if the assignments are contained inside critical or ordered sections, or are executed conditionally based on thread ID. GRA will not allocate registers for shared variables that are assigned within critical or ordered sections, as long as the sections are implemented using compiler directives or sync_routine-defined functions (refer to Chapter 6, "Chapter 6 “Advanced shared-memory programming”"). However, for conditional assignments based on thread ID, GRA may allocate registers that may cause wrong answers when stored.

In such cases, GRA can be disabled only for shared variables that are visible to multiple threads by specifying the +Onosharedgra compiler option.

In procedures with large numbers of loops, GRA can contribute to long compile times; therefore, GRA is only performed if the number of loops in the procedure is below a predetermined limit. You can remove this limit (and possibly increase compile time) by specifying the +Onolimit compiler option.

This optimization is also known as coloring register allocation because of the similarity to map-coloring algorithms in graph theory.

Register allocation in C and C++

In C and C++, you can help the optimizer understand when certain variables are heavily used within a function by declaring these variables with the register qualifier.

The global register allocator may override your choices and promote a variable not declared register to a register over a variable that is declared register, based on estimated speed improvements.

Strength reduction of induction variables
and constants

This optimization removes expressions that are linear functions of a loop counter and replaces each of them with a variable that contains the value of the function. Variables of the same linear function are computed only once. This optimization also replaces multiplication instructions with addition instructions wherever possible.

For example, in the following C/C++ code:

for (i=0; i<25; i++) {
r[i] = i * k;
}

becomes:

t1 = 0;
for (i=0; i<25; i++) {
r[i] = t1;
t1 += k;
}

Common subexpression elimination

The common subexpression elimination optimization identifies expressions that appear more than once and have the same result, computes the result, and substitutes the result for each occurrence of the expression. The subexpression types include instructions that load values from memory, as well as arithmetic evaluation.

In Fortran, for example, the code:

A = X + Y + Z
B = X + Y + W

becomes:

T1 = X + Y
A = T1 + Z
B = T1 + W

Advanced constant folding and propagation

Constant folding computes the value of a constant expression at compile time. Constant propagation is the automatic compile-time replacement of variable references with a constant value previously assigned to that variable.

For example, consider the following C/C++ code:

a = 10;
b = a + 5;
c = 4 * b;

Once a is assigned, its value is propagated to the statement where b is assigned so that the assignment reads:

b = 10 + 5;

The expression 10 + 5 can then be folded. Now that b has been assigned a constant, the value of b is propagated to the statement where c is assigned. After all the folding and propagation, the original code is replaced by:

a = 10;
b = 15;
c = 60;

Loop-invariant code motion

The loop-invariant code motion optimization recognizes instructions inside a loop whose results do not change and then moves the instructions outside the loop. This optimization ensures that the invariant code is only executed once.

For example, the C/C++ code:

x = z;
for(i=0; i<10; i++)
a[i] = 4 * x + i;

becomes:

x = z;
t1 = 4 * x;
for(i=0; i<10; i++)
a[i] = t1 + i;

Store/copy optimization

Where possible, the store/copy optimization substitutes registers for memory locations, by replacing store instructions with copy instructions and deleting load instructions.

Unused definition elimination

The unused definition elimination optimization removes unused memory location and register definitions. These definitions are often a result of transformations made by other optimizations.

For example, the function:

f(int x){
int a,b,c;

a = 1;
b = 2;
c = x * b;
return c;
}

becomes:

f(int x) {
int a,b,c;

c = x * 2;
return c;
}

The assignment a = 1 is removed because a is not used after it is defined. Due to another +O2 optimization (constant propagation), the c = x * b statement becomes c = x * 2. The assignment b = 2 is then removed as well.

Software pipelining

Software pipelining is a code transformation that optimizes program loops. It rearranges the order in which instructions are executed in a loop. It generates code that overlaps operations from different loop iterations. Software pipelining is particularly useful for loops that contain arithmetic operations on REAL*4 and REAL*8 data in Fortran or on float and double data in C or C++.

The goal of this optimization is to avoid processor stalls due to memory or hardware pipeline latencies. The software pipelining transformation partially unrolls a loop and adds code before and after the loop to achieve a high degree of optimization within the loop.

You can enable [disable] software pipelining using the +O[no]pipeline command-line option at +O2 and above. The default is +Opipeline. Use +Onopipeline if a smaller program size and/or faster compile time is more important than faster execution speed.

The following pseudo-code shows a loop before and after the software pipelining optimization. Four significant things happen:

  • A portion of the first iteration of the loop is performed before the loop.

  • A portion of the last iteration of the loop is performed after the loop.

  • The loop is unrolled twice.

  • Operations from different loop iterations are interleaved with each other.

Consider the following C/C++ for loop:

#define SIZ 10000
float x[SIZ], y[SIZ];
int i;
init();
for (i = 0;i<= SIZ;i++)
x[i] = x[i] / y[i] + 4.00;

When this loop is compiled with software pipelining, the optimization can be expressed in pseudo-code as follows:

R1 = 0;Initialize array index
R2 = 4.00;Load constant value
R3 = X[0];Load first X value
R4 = Y[0];Load first Y value
R5 = R3 / R4;Perform division on first element: n = X[0]/Y[0]
do {Begin loop
R6 = R1;Save current array index
R1++;Increment array index
R7 = X[R1];Load current X value
R8 = Y[R1];Load current Y value
R9 = R5 + R2;Perform addition on prior row: X[i] = n + 4.00
R10 = R7 / R8;Perform division on current row: m = X[i+1]/Y[i+1]
X[R6] = R9;Save result of operations on prior row
R6 = R1;Save current array index
R1++;Increment array index
R3 = X[R1];Load next X value
R4 = Y[R1];Load next Y value
R11 = R10 + R2;Perform addition on current row: X[i+1] = m + 4.00
R5 = R3 / R4;Perform division on next row: n = X[i+2]/Y[i+2]
X[R6] = R11;Save result of operations on current row
} while (R1 <= 100);End loop
R9 = R5 + R2;Perform addition on last row: X[i+2] = n + 4.00
X[R6] = R9;Save result of operations on last row

This transformation stores intermediate results of the division instructions in unique registers (noted as n and m). These registers are not referenced until several instructions after the division operations. This decreases the possibility that the long latency period of the division instructions will stall the instruction pipeline and cause processing delays.

Prerequisites of Pipelining

Software pipelining is attempted on a loop that meets the following criteria:

  • It is the innermost loop

  • There are no branches or function calls within the loop

  • The loop is of moderate size

This optimization produces slightly larger program files and increases compile time. It is most beneficial in programs containing loops that are executed a large number of times.

Register reassociation

Array references often require one or more instructions to compute the virtual memory address of the array element specified by the subscript expression. The register reassociation optimization implemented in PA-RISC compilers tries to reduce the cost of computing the virtual memory address expression for array references found in loops.

Within loops, the virtual memory address expression can be rearranged and separated into a loop-variant term and a loop-invariant term. Loop-variant terms are those items whose values may change from one iteration of the loop to another. Loop-invariant terms are those items whose values are constant throughout all iterations of the loop. The loop-variant term corresponds to the difference in the virtual memory address associated with a particular array reference from one iteration of the loop to the next.

The register reassociation optimization dedicates a register to track the value of the virtual memory address expression for one or more array references in a loop and updates the register appropriately in each iteration of a loop.

The register is initialized outside the loop to the loop-invariant portion of the virtual memory address expression, and the register is incremented or decremented within the loop by the loop-variant portion of the virtual memory address expression.

The net result is that array references in loops are converted into equivalent, but more efficient, pointer dereferences.

Consider the following C/C++ code:

int a[10][20][30];

void example (void)
{
int i, j, k;

for (k = 0; k < 10; k++)
for (j = 0; j < 10;j++)
for (i = 0; i < 10; i++)
a[i][j][k] = 1;
}

After register reassociation is applied, the innermost loop becomes:

int a[10][20][30];

void example (void)
{
int i, j, k;
register int (*p)[20][30];

for (k = 0; k < 10; k++)
for (j = 0; j < 10; j++)
for (p = (int (*)[20][30]) &a[0][j][k], i = 0; i < 10; i++)
*(p++[0][0]) = 1;
}

In the above example, the compiler-generated temporary register variable, p, strides through the array a in the innermost loop. This register pointer variable is initialized outside the innermost loop and auto-incremented within the innermost loop as a side-effect of the pointer dereference.

Register reassociation can often enable another loop optimization. After performing the register reassociation optimization, the loop variable may be needed only to control the iteration count of the loop. If this is the case, the original loop variable can be eliminated altogether by using the PA-RISC ADDIB and ADDB machine instructions to control the loop iteration count.

You can enable [disable] register reassociation using the +O[no]regreassoc command-line option at +O2 and above. The default is +Oregreassoc.

Loop unrolling

Loop unrolling increases a loop's step value and replicates the loop body. Each replication is appropriately offset from the induction variable so that all iterations are performed, given the new step.

Unrolling can be total or partial. Total unrolling involves eliminating the loop structure completely by replicating the loop body a number of times equal to the iteration count and replacing the iteration variable with constants. This makes sense only for loops with small iteration counts.

Consider the following Fortran example:

SUBROUTINE FOO(A,B)
REAL A(10,10), B(10,10)
DO J=1, 4
DO I=1, 4
A(I,J) = B(I,J)
ENDDO
ENDDO
END

This loop nest is completely unrolled as shown below:

A(1,1) = B(1,1)
A(2,1) = B(2,1)
A(3,1) = B(3,1)
A(4,1) = B(4,1)

A(1,2) = B(1,2)
A(2,2) = B(2,2)
A(3,2) = B(3,2)
A(4,2) = B(4,2)

A(1,3) = B(1,3)
A(2,3) = B(2,3)
A(3,3) = B(3,3)
A(4,3) = B(4,3)

A(1,4) = B(1,4)
A(2,4) = B(2,4)
A(3,4) = B(3,4)
A(4,4) = B(4,4)

Partial unrolling is performed on loops with larger or unknown iteration counts. It retains the loop structure, but replicates the body a number of times equal to the unroll factor and adjusts references to the iteration variable accordingly.

Consider the following Fortran example:

DO I = 1, 100
A(I) = B(I) + C(I)
ENDDO

This example can be unrolled to a depth of four as shown below:

DO I = 1, 100, 4
A(I) = B(I) + C(I)
A(I+1) = B(I+1) + C(I+1)
A(I+2) = B(I+2) + C(I+2)
A(I+3) = B(I+3) + C(I+3)
ENDDO

Each iteration of the loop now computes four values of A instead of one value. The compiler also generates code for the case where the range is not evenly divisible by the unroll factor.

Loop unrolling and the unroll factor can be controlled using the +O[no]loop_unroll[=unroll_factor] option. See Appendix D, "Appendix D “Optimization options”," for more information on this option.

Some loop transformations cause loops to be fully or partially replicated. Because unlimited loop replication can significantly increase compile times, loop replication is limited by default. You can increase this limit (and possibly increase your program's compile time and code size) by specifying the +Onosize and +Onolimit compiler options.

Printable version
Privacy statement Using this site means you accept its terms Feedback to webmaster
© Hewlett-Packard Development Company, L.P.