| United States-English |
|
|
|
![]() |
Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 5 Loop and cross-module optimization featuresLoop unroll and jam |
|
The loop unroll and jam transformation is primarily intended to increase register exploitation and decrease memory loads and stores per operation within an iteration of a nested loop. Improved register usage decreases the need for main memory accesses and allows better exploitation of certain machine instructions. Unroll and jam involves partially unrolling one or more loops higher in the nest than the innermost loop, and fusing ("jamming") the resulting loops back together. For unroll and jam to be effective, a loop must be nested and must contain data references that are temporally reused with respect to some loop other than the innermost (temporal reuse is described in “Data reuse”). The unroll and jam optimization is automatically applied only to those loops that consist strictly of a basic block. Loop unroll and jam takes place at +O3 and above and is not enabled by default in the HP compilers. To enable loop unroll and jam on the command line, use the +Oloop_unroll_jam option. This allows both automatic and directive-specified unroll and jam. Specifying +Onoloop_transform disables loop unroll and jam, loop distribution, loop interchange, loop blocking, loop fusion, and loop unroll. The unroll_and_jam directive and pragma also enables this transformation. The no_unroll_and_jam directive and pragma is used to disable loop unroll and jam for an individual loop. The forms of these directives and pragmas are shown in Table 5-7 “Forms of unroll_and_jam, no_unroll_and_jam directives and pragmas”. Table 5-7 Forms of unroll_and_jam, no_unroll_and_jam directives and pragmas
where
Unroll and jam Consider the following matrix multiply loop:
Here, the compiler can exploit a maximum of 3 registers: one for A(I,J), one for B(I,K), and one for C(K,J). Register exploitation is vastly increased on this loop by unrolling and jamming the I and J loops. First, the compiler unrolls the I loop. To simplify the illustration, an unrolling factor of 2 for I is used. This is the number of times the contents of the loop are replicated. The following Fortran example shows this replication:
The "jam" part of unroll and jam occurs when the loops are fused back together, to create the following:
This new loop can exploit registers for two additional references: A(I,J) and A(I+1,J). However, the compiler still has the J loop to unroll and jam. An unroll factor of 4 for the J loop is used, in which case unrolling gives the following:
Fusing (jamming) the unrolled loop results in the following:
This new loop exploits more registers and requires fewer loads and stores than the original. Recall that the original loop could use no more than 3 registers. This unrolled-and-jammed loop can use 14, one for each of the following references:
Fewer loads and stores per operation are required because all of the registers containing these elements are referenced at least twice. This particular example can also benefit from the PA-RISC FMPYFADD instruction, which is available with PA-8x00 processors. This instruction doubles the speed of the operations in the body of the loop by simultaneously performing related adds and multiplies. This is a very simplified example. In reality, the compiler attempts to exploit as many of the PA-RISC processor's registers as possible. For the matrix multiply algorithm used here, the compiler would select a larger unrolling factor, creating a much larger K loop body. This would result in increased register exploitation and fewer loads and stores per operation.
You should attempt to select unroll factor values that align data references in the innermost loop on cache boundaries. As a result, references to the consecutive memory regions in the innermost loop can have very high cache hit ratios. Unroll factors of 5 or 7 may not be good choices because most array element sizes are either 4 bytes or 8 bytes and the cache line size is 32 bytes. Therefore, an unroll factor of 2 or 4 is more likely to effectively exploit cache line reuse for the references that access consecutive memory regions. As with all optimizations that replicate code, the number of new loops created when the compiler performs the unroll and jam optimization is limited by default to ensure reasonable compile times. To increase the replication limit and possibly increase your compile time and code size, specify the +Onosize and +Onolimit compiler options. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||