Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers
Chapter 9 Parallel programming techniques: Parallelizing loops

The HP compilers automatically exploit loop parallelism in dependence-free loops. The prefer_parallel, loop_parallel, and parallel directives and pragmas allow you to increase parallelization opportunities and to control many aspects of parallelization manually.

The prefer_parallel and loop_parallel directives and pragmas apply to the immediately following loop. Data privatization is necessary when using loop_parallel; this is achieved by using the loop_private directive, discussed in Chapter 10 “Data privatization”. Manual data privatization using memory classes is discussed in Chapter 11 “Memory classes” and Chapter 12 “Parallel synchronization”.

The parallel directives and pragmas should only be used on Fortran DO and C for loops whose iteration counts are determined prior to loop invocation at runtime.

The prefer_parallel directive and pragma cause the compiler to automatically parallelize the immediately following loop if it is free of dependences and other parallelization inhibitors. The compiler automatically privatizes any loop variables that must be privatized. prefer_parallel requires less manual intervention than the loop_parallel directive and pragma, but it is also less powerful. See “prefer_parallel, loop_parallel attributes” for a description of attributes for this directive.

prefer_parallel can also be used to indicate the preferred loop in a nest to parallelize, as shown in the following Fortran code:
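As a hedged C counterpart to a nest of this kind, the sketch below places prefer_parallel immediately before the inner loop. The `#pragma _CNX` spelling, the function name `add_matrices`, the array names, and the size `N` are assumptions made for illustration and are not taken from this section. A compiler that does not recognize the pragma ignores it and runs the nest serially with the same result.

```c
#include <assert.h>

#define N 100   /* illustrative size, not taken from the guide */

/* Adds b and c element by element into a. No iteration of the inner loop
 * reads a value written by another iteration, so it carries no dependences. */
void add_matrices(double a[N][N], double b[N][N], double c[N][N])
{
    int i, j;

    assert(a && b && c);   /* precondition: all three arrays must exist */

    for (j = 0; j < N; j++) {
        /* Assumed HP C spelling: marks this loop as the preferred one in the
         * nest; the compiler still checks it for dependences itself. */
#pragma _CNX prefer_parallel
        for (i = 0; i < N; i++) {
            a[j][i] = b[j][i] + c[j][i];
        }
    }
}
```

Because prefer_parallel leaves dependence checking to the compiler, a mistake in placing it costs at most a missed parallelization, not a wrong answer.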
In this code, PREFER_PARALLEL causes the compiler to choose the innermost loop for parallelization, provided it is free of dependences. PREFER_PARALLEL does not inhibit loop interchange.

The ordered attribute in a prefer_parallel directive is only useful if the loop contains synchronized dependences. The ordered attribute is most useful in the loop_parallel directive, described in the next section.

The loop_parallel directive forces parallelization of the immediately following loop. The compiler does not check for data dependences, perform variable privatization, or perform parallelization analysis. You must synchronize any dependences manually and privatize loop data manually as necessary. loop_parallel defaults to thread parallelization. See “prefer_parallel, loop_parallel attributes” for a description of attributes for this directive.

loop_parallel(ordered) is useful for manually parallelizing loops that contain ordered dependences. This is described in Chapter 12 “Parallel synchronization”.

loop_parallel is also useful for manually parallelizing loops that contain procedure calls. This is shown in the following Fortran code:
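A minimal C sketch of a loop of this shape follows. The `#pragma _CNX` spelling is assumed, and `func` is a hypothetical stand-in for FUNC written so that it plainly has no side effects; `apply_func` and its parameters are likewise illustrative names. The C form of loop_parallel needs `ivar`, as described later in this section.

```c
#include <assert.h>

/* Hypothetical stand-in for FUNC: it reads only its argument, writes no
 * global or static data, and performs no I/O. */
static double func(double x)
{
    return 2.0 * x + 1.0;
}

/* Stores func(b[i]) in a[i] for i in [0, n). The caller must pass arrays
 * that do not overlap, because loop_parallel turns off the compiler's own
 * dependence checking. */
void apply_func(double *a, const double *b, int n)
{
    int i;

    assert(n >= 0);

    /* Assumed HP C spelling; ivar names the primary induction variable. */
#pragma _CNX loop_parallel(ivar = i)
    for (i = 0; i < n; i++) {
        a[i] = func(b[i]);
    }
}
```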
The call to FUNC in this loop would normally prevent it from parallelizing. To verify that FUNC has no side effects, review the following conditions. A function does not have side effects if:
If you are sure that FUNC has no side effects and is compiled for reentrancy (the default), this loop can be safely parallelized.
The compiler does not parallelize any loop whose iteration count cannot be determined prior to loop invocation at execution time, even when loop_parallel is specified. This is shown in the following Fortran code:
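The C sketch below illustrates the same restriction with a loop whose exit test depends on the data. The function names and the `#pragma _CNX` spelling are assumptions. The second function shows one possible restructuring, which is not taken from the guide: a serial scan fixes the trip count first, so the loop that follows has a bound known on entry.

```c
#include <assert.h>

/* Doubles elements until the first non-positive value is met. The trip count
 * depends on the contents of a, so it is unknown when the loop is entered and
 * the loop stays serial even with loop_parallel in front of it. */
int double_until_nonpositive(double *a, int n)
{
    int i;

    assert(n >= 0);
#pragma _CNX loop_parallel(ivar = i)
    for (i = 0; i < n && a[i] > 0.0; i++) {
        a[i] = 2.0 * a[i];
    }
    return i;   /* number of elements processed */
}

/* Same result, restructured: a serial scan finds the count, then a loop with
 * a fixed bound does the work and becomes a candidate for parallelization. */
int double_prefix_countable(double *a, int n)
{
    int i, count = 0;

    assert(n >= 0);
    while (count < n && a[count] > 0.0) {
        count++;
    }
#pragma _CNX loop_parallel(ivar = i)
    for (i = 0; i < count; i++) {
        a[i] = 2.0 * a[i];
    }
    return count;
}
```

The restructured version only pays off when the work per element is large compared with the cost of the extra serial scan.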
In general, the compiler has no way to determine this loop's iteration count prior to loop invocation, so the loop cannot be parallelized.

The prefer_parallel and loop_parallel directives and pragmas accept the same attributes. The forms of these directives and pragmas are shown in Table 9-2 “Forms of prefer_parallel and loop_parallel directives and pragmas”.

Table 9-2 Forms of prefer_parallel and loop_parallel directives and pragmas
where

ivar = indvar specifies that the primary loop induction variable is indvar. ivar = indvar is optional in Fortran, but required in C. Use only with loop_parallel.

attribute-list can contain one of the case-insensitive attributes noted in Table 9-3 “Attributes for loop_parallel, prefer_parallel”.
Table 9-3 Attributes for loop_parallel, prefer_parallel
Any loop under the influence of loop_parallel(dist) or prefer_parallel(dist) appears in the Optimization Report as serial. This is because it is already inside a parallel region. You can generate an Optimization Report by specifying the +Oreport option. For more information, see Chapter 8 “Optimization Report”.

Table 9-3 “Attributes for loop_parallel, prefer_parallel” shown above describes the acceptable combinations of loop_parallel and prefer_parallel attributes. In such combinations, the attributes can be listed in any order.

The loop_parallel C pragma requires the ivar = indvar attribute, which specifies the primary loop induction variable. If this is not present, the compiler issues a warning and ignores the pragma. ivar should specify only the primary induction variable. Any other loop induction variables should be a function of this variable and should be declared loop_private.

In Fortran, ivar is optional for DO loops. If it is not provided, the compiler picks the primary induction variable for the loop. ivar is required for DO WHILE and customized loops in Fortran.
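The rule about secondary induction variables can be sketched in C as follows. The `#pragma _CNX` spelling, the placement of loop_private on its own line after loop_parallel, and the names `gather_strided`, `dst`, `src`, and `stride` are all assumptions; Chapter 10 “Data privatization” is the authority on loop_private.

```c
#include <assert.h>

/* Copies every stride-th element of src into dst.
 * i is the primary induction variable. k is a secondary one: it is computed
 * from i in each iteration, not incremented from one iteration to the next,
 * so every thread can hold its own private copy. */
void gather_strided(double *dst, const double *src, int n, int stride)
{
    int i, k;

    assert(n >= 0 && stride > 0);

    /* Assumed HP C spellings for both pragmas. */
#pragma _CNX loop_parallel(ivar = i)
#pragma _CNX loop_private(k)
    for (i = 0; i < n; i++) {
        k = i * stride;
        dst[i] = src[k];
    }
}
```

Writing `k += stride` instead would carry a value from one iteration to the next, which is exactly the kind of dependence loop_parallel does not check for.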
The prefer_parallel and loop_parallel directives and pragmas are used to parallelize loops. Table 9-4 “Comparison of loop_parallel and prefer_parallel” provides an overview of the differences between the two pragmas/directives. See the sections “prefer_parallel” and “loop_parallel” for more information.

Table 9-4 Comparison of loop_parallel and prefer_parallel
Stride-based parallelism differs from the default strip-based parallelism in that:
For chunk_size = n, with n > 1, the distribution is round-robin. However, it is not the same as specifying the ordered attribute. For example, using the same loop as above, specifying chunk_size = 5 produces the distribution shown in Table 9-6 “Iteration distribution using chunk_size = 5”.

Table 9-6 Iteration distribution using chunk_size = 5
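The round-robin rule can be expressed as a small C helper. This is a sketch of the mapping only, not of the compiler's implementation: the function name is hypothetical, and it assumes iterations are numbered from 1 and threads from 0, as in the Figure 9-1 description later in this section. It makes no claim about the loop bounds or thread count behind Table 9-6.

```c
#include <assert.h>

/* Returns the 0-based thread that runs 1-based iteration iter when chunks of
 * chunk_size consecutive iterations are dealt out round-robin to num_threads
 * threads. */
int thread_for_iteration(int iter, int chunk_size, int num_threads)
{
    int chunk_index;

    assert(iter >= 1 && chunk_size >= 1 && num_threads >= 1);

    chunk_index = (iter - 1) / chunk_size;   /* which chunk holds iter */
    return chunk_index % num_threads;        /* chunks wrap around threads */
}
```

With 100 iterations, chunk_size = 4, and eight threads, the helper assigns iterations 1:4 to thread 0, 29:32 to thread 7, 33:36 back to thread 0, and the leftover chunk 97:100 to thread 0, which agrees with the Figure 9-1 description.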
For more information and examples on using the chunk_size = n attribute, see Chapter 13 “Troubleshooting”.

prefer_parallel, loop_parallel

The following Fortran example uses the PREFER_PARALLEL directive, but applies to LOOP_PARALLEL as well:
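A C sketch of a loop of this kind is given below, with 100 iterations and chunk_size = 4 to match the discussion that follows. The `#pragma _CNX` spelling, the function name, and the loop body are assumptions for illustration.

```c
#include <assert.h>

#define NITER 100   /* iteration count used in the surrounding discussion */

/* Adds b and c element by element into a, asking for chunks of four
 * consecutive iterations per thread. */
void add_arrays_chunked(double *a, const double *b, const double *c)
{
    int i;

    assert(a && b && c);

    /* Assumed HP C spelling of the attribute form described above. */
#pragma _CNX prefer_parallel(chunk_size = 4)
    for (i = 0; i < NITER; i++) {
        a[i] = b[i] + c[i];
    }
}
```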
In this example, the loop is parallelized by parcelling out chunks of four iterations to each available thread. Figure 9-1 “Stride-parallelized loop” uses Fortran 90 array syntax to illustrate the iterations performed by each thread, assuming eight available threads.

Figure 9-1 “Stride-parallelized loop” shows that the 100 iterations of I are parcelled out in chunks of four iterations to each of the eight available threads. After the chunks are distributed evenly to all threads, there is one chunk left over (iterations 97:100), which executes on thread 0.

prefer_parallel, loop_parallel

The chunk_size = n attribute is most useful on loops in which the amount of work increases or decreases as a function of the iteration count. These loops are also known as triangular loops. The following Fortran example shows such a loop. As with the previous example, PREFER_PARALLEL is used here, but the concept also applies to LOOP_PARALLEL.
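A triangular loop of this kind might look as follows in C. This is a sketch under assumptions: the `#pragma _CNX` spelling, the size `TRI_N`, the chunk size of 4, and the loop body are all illustrative choices. The inner loop starts at j, so its trip count shrinks as j grows.

```c
#include <assert.h>

#define TRI_N 64   /* illustrative size */

/* Fills the upper triangle (column index >= row index) of a. Row j needs
 * TRI_N - j inner iterations, so early values of j cost more than late ones. */
void fill_upper_triangle(double a[TRI_N][TRI_N])
{
    int i, j;

    assert(a);

    /* Small chunks dealt out round-robin give each thread a mix of
     * expensive early rows and cheap late rows. */
#pragma _CNX prefer_parallel(chunk_size = 4)
    for (j = 0; j < TRI_N; j++) {
        for (i = j; i < TRI_N; i++) {
            a[j][i] = (double)(i + j);
        }
    }
}
```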
Here, the work of the I loop decreases as J increases. By specifying a chunk_size for the J loop, the load is more evenly balanced across the threads executing the loop. If this loop were strip-mined in the traditional manner, the amount of work contained in the strips would decrease with each successive strip. The threads performing early iterations of J would do substantially more work than those performing later iterations.

The critical_section and end_critical_section directives and pragmas allow you to specify sections of code in parallel loops or tasks that must be executed by only one thread at a time. These directives cannot be used for ordered synchronization within a loop_parallel(ordered) loop, but are suitable for simple synchronization in any other loop_parallel loops. Use the ordered_section and end_ordered_section directives or pragmas for ordered synchronization within a loop_parallel(ordered) loop.

A critical_section directive or pragma and its associated end_critical_section must appear in the same procedure and under the same control flow. They do not have to appear in the same procedure as the parallel construct in which they are used. For instance, the pair can appear in a procedure called from a parallel loop.

The forms of these directives and pragmas are shown in Table 9-7 “Forms of critical_section/end_critical_section directives and pragmas”.

Table 9-7 Forms of critical_section/end_critical_section directives and pragmas
The critical_section directive/pragma can take an optional gate attribute that allows the declaration of multiple critical sections. This is described in “Using gates and barriers”. Only simple critical sections are discussed in this section. Consider the following Fortran example:
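In C, a loop of this shape could be sketched as below. The `#pragma _CNX` spellings, the loop_private line, and the names `func`, `term`, and `sum_of_func` are assumptions; `func` is a hypothetical side-effect-free stand-in for FUNC. The call sits outside the critical section so that only the update of `sum` is serialized.

```c
#include <assert.h>

/* Hypothetical stand-in for FUNC: depends only on its argument. */
static double func(double x)
{
    return x * x;
}

/* Returns the sum of func(x[i]) for i in [0, n). */
double sum_of_func(const double *x, int n)
{
    int i;
    double term;
    double sum = 0.0;

    assert(n >= 0);

    /* Assumed HP C spellings. term is private to each thread; sum is shared
     * and is updated only inside the critical section. */
#pragma _CNX loop_parallel(ivar = i)
#pragma _CNX loop_private(term)
    for (i = 0; i < n; i++) {
        term = func(x[i]);          /* runs in parallel */
#pragma _CNX critical_section
        sum += term;                /* one thread at a time */
#pragma _CNX end_critical_section
    }
    return sum;
}
```

When the loop does run in parallel, the additions into `sum` can occur in a different order from run to run, so floating-point results may differ in the last bits from the serial sum.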
Because FUNC has no side effects and can be called in parallel, the I loop can be parallelized as long as the SUM variable is updated by only one thread at a time. The critical section created around SUM ensures this behavior.

The LOOP_PARALLEL directive and the critical section directive are required to parallelize this loop because the call to FUNC would normally inhibit parallelization. If this call were not present, and if the loop did not contain other parallelization inhibitors, the compiler would automatically parallelize the reduction of SUM as described in the section “Reductions”. However, the presence of the call necessitates the LOOP_PARALLEL directive, which prevents the compiler from automatically handling the reduction. This, in turn, requires using either a critical section directive or the reduction directive.

Placing the call to FUNC outside of the critical section allows FUNC to be called in parallel, decreasing the amount of serial work within the critical section.

To justify the cost of the compiler-generated synchronization code associated with critical sections, loops that contain them must also contain a large amount of parallelizable (non-critical section) code. If you are unsure whether a critical section is profitable for a certain loop, time the loop with and without the critical section. This helps to determine whether parallelization justifies the overhead of the critical section.

For this particular example, the reduction directive or pragma could have been used in place of the critical_section, end_critical_section combination. For more information, see the section “Reductions”.

You can disable automatic loop thread-parallelization by specifying the +Onoautopar option on the compiler command line. +Onoautopar is only meaningful when specified with the +Oparallel option at +O3 or +O4.
This option causes the compiler to parallelize only those loops that are immediately preceded by prefer_parallel or loop_parallel. Because the compiler does not automatically find parallel tasks or regions, user-specified task and region parallelization is not affected by this option.