Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 9 Parallel programming techniques

Parallelizing loops


The HP compilers automatically exploit loop parallelism in dependence-free loops. The prefer_parallel, loop_parallel, and parallel directives and pragmas let you increase parallelization opportunities and manually control many aspects of loop parallelization.

The prefer_parallel and loop_parallel directives and pragmas apply to the immediately following loop. Data privatization is necessary when using loop_parallel; this is achieved by using the loop_private directive, discussed in Chapter 10 “Data privatization”. Manual data privatization using memory classes is discussed in Chapter 11 “Memory classes” and Chapter 12 “Parallel synchronization”.

Use the parallel directives and pragmas only on Fortran DO loops and C for loops whose iteration counts are determined at run time before the loop is invoked.

prefer_parallel

The prefer_parallel directive and pragma causes the compiler to automatically parallelize the immediately following loop if it is free of dependences and other parallelization inhibitors. The compiler automatically privatizes any loop variables that must be privatized. prefer_parallel requires less manual intervention. However, it is less powerful than the loop_parallel directive and pragma.

See “prefer_parallel, loop_parallel attributes” for a description of attributes for this directive.

prefer_parallel can also be used to indicate the preferred loop in a nest to parallelize, as shown in the following Fortran code:

      DO J = 1, 100
C$DIR PREFER_PARALLEL
        DO I = 1, 100
        .
        .
        .
        ENDDO
      ENDDO

In this code, PREFER_PARALLEL causes the compiler to choose the inner I loop for parallelization, provided that loop is free of dependences. PREFER_PARALLEL does not inhibit loop interchange.
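The same idea can be sketched in C using the pragma form listed in Table 9-2. This is a minimal illustration, not code from the guide: the function name scale_matrix, the ROWS and COLS constants, and the loop body are hypothetical. A compiler that does not recognize _CNX pragmas ignores the pragma and runs the nest serially.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 100
#define COLS 100

/* Hypothetical kernel: scale every element of a ROWS x COLS matrix. */
void scale_matrix(double a[ROWS][COLS], double factor)
{
    int i, j;

    assert(a != NULL);

    for (j = 0; j < ROWS; j++) {
        /* Mark the inner i loop as the preferred loop to parallelize.
           The compiler still checks it for dependences first. */
#pragma _CNX prefer_parallel
        for (i = 0; i < COLS; i++) {
            a[j][i] *= factor;
        }
    }
}
```

Because each iteration of the i loop touches a different element of a, the loop carries no dependences, which is the condition prefer_parallel relies on.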

The ordered attribute in a prefer_parallel directive is only useful if the loop contains synchronized dependences. The ordered attribute is most useful in the loop_parallel directive, described in the next section.

loop_parallel

The loop_parallel directive forces parallelization of the immediately following loop. The compiler does not check for data dependences, perform variable privatization, or perform parallelization analysis. You must synchronize any dependences manually and manually privatize loop data as necessary. loop_parallel defaults to thread parallelization.

See “prefer_parallel, loop_parallel attributes” for a description of attributes for this directive.

loop_parallel(ordered) is useful for manually parallelizing loops that contain ordered dependences. This is described in Chapter 12 “Parallel synchronization”.

Parallelizing loops with calls

loop_parallel is useful for manually parallelizing loops containing procedure calls.

This is shown in the following Fortran code:

C$DIR LOOP_PARALLEL
      DO I = 1, N
        X(I) = FUNC(I)
      ENDDO

The call to FUNC in this loop would normally prevent it from parallelizing. To verify that FUNC has no side effects, review the following conditions. A function does not have side effects if:

  • It does not modify its arguments.

  • It does not modify the same memory location from one call to the next.

  • It performs no I/O.

  • It does not call any procedures that have side effects.

If FUNC does have side effects or is not reentrant, this loop may yield wrong answers. If you are sure that FUNC has no side effects and is compiled for reentrancy (the default), this loop can be safely parallelized.

NOTE: In some cases, global register allocation can interfere with the routine being called. Refer to “Global register allocation (GRA)” for more information.
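A C version of the loop with a call might look like the following sketch. The names func and fill_from_func are hypothetical; func is written so that it meets the conditions listed above (it reads only its argument, writes no shared memory, and performs no I/O). The pragma uses the loop_parallel form from Table 9-2, where ivar is required in C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical side-effect-free function: depends only on its argument. */
static double func(int i)
{
    return 0.5 * (double)i * (double)i;
}

void fill_from_func(double *x, int n)
{
    int i;

    assert(x != NULL && n >= 0);

    /* Force parallelization. The compiler does no dependence checking
       here, so the programmer vouches that func() is safe to call
       concurrently. */
#pragma _CNX loop_parallel(ivar = i)
    for (i = 0; i < n; i++) {
        x[i] = func(i);
    }
}
```

If func wrote to a static or global variable, the same pragma would still force parallel execution and the results could be wrong.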

Unparallelizable loops

The compiler does not parallelize a loop whose iteration count cannot be determined before the loop is invoked at execution time, even when loop_parallel is specified.

This is shown in the following Fortran code:

C$DIR LOOP_PARALLEL
      DO WHILE (A(I) .GT. 0)   ! WILL NOT PARALLELIZE
        .
        .
        A(I) = ...
        .
        .
      ENDDO

In general, there is no way the compiler can determine the loop's iteration count prior to loop invocation here, so the loop cannot be parallelized.

prefer_parallel, loop_parallel attributes

The prefer_parallel and loop_parallel directives and pragmas accept the same attributes. The forms of these directives and pragmas are shown in Table 9-2 “Forms of prefer_parallel and loop_parallel directives and pragmas”.

Table 9-2 Forms of prefer_parallel and loop_parallel directives and pragmas

Language    Form

Fortran     C$DIR PREFER_PARALLEL[(attribute-list)]
            C$DIR LOOP_PARALLEL[(attribute-list)]

C           #pragma _CNX prefer_parallel[(attribute-list)]
            #pragma _CNX loop_parallel(ivar = indvar[, attribute-list])

 

where

ivar = indvar

specifies that the primary loop induction variable is indvar. ivar = indvar is optional in Fortran, but required in C. Use only with loop_parallel.

attribute-list

can contain one of the case-insensitive attributes or attribute combinations noted in Table 9-3 “Attributes for loop_parallel, prefer_parallel”.

NOTE: The values of n and m must be compile-time constants for the loop parallelization attributes in which they appear.

Table 9-3 Attributes for loop_parallel, prefer_parallel

Attribute    Description

dist

Causes the compiler to distribute the iterations of a loop across active threads instead of spawning new threads. This significantly reduces parallelization overhead.

Must be used with prefer_parallel or loop_parallel inside a parallel/end_parallel region.

Can be used with any prefer_parallel or loop_parallel attribute, except threads.

ordered

Causes the iterations of the loop to be initiated in iteration order across the processors. This is useful only in loops with manually-synchronized dependences, constructed using loop_parallel.

To achieve ordered parallelism, dependences must be synchronized within ordered sections, constructed using the ordered_section and end_ordered_section directives.

max_threads = m

Restricts execution of the specified loop to no more than m threads if specified alone. m must be an integer constant.

max_threads = m is useful when you know the maximum number of threads your loop runs on efficiently.

If specified with the chunk_size = n attribute, the chunks are parallelized across no more than m threads.

chunk_size = n

Divides the loop into chunks of n or fewer iterations, which are then used to strip mine the loop for parallelization. n must be an integer constant.

If chunk_size = n is present alone, chunks of n or fewer loop iterations are distributed round-robin to each available thread until there are no remaining iterations. This is shown in Table 9-5 “Iteration distribution using chunk_size = 1” and Table 9-6 “Iteration distribution using chunk_size = 5”.

If the number of threads does not evenly divide the number of chunks, some threads perform one fewer chunk than others.

dist, ordered

Causes ordered invocation of each iteration across existing threads.

dist, max_threads = m

Causes thread-parallelism on no more than m existing threads.

ordered, max_threads = m

Causes ordered parallelism on no more than m threads.

dist, chunk_size = n

Causes thread-parallelism by chunks.

dist, ordered, max_threads = m

Causes ordered thread-parallelism on no more than m existing threads.

chunk_size = n, max_threads = m

Causes chunk parallelism on no more than m threads.

dist, chunk_size = n, max_threads = m

Causes thread-parallelism by chunks on no more than m existing threads.

 

Any loop under the influence of loop_parallel(dist) or prefer_parallel(dist) appears in the Optimization Report as serial. This is because it is already inside a parallel region. You can generate an Optimization Report by specifying the +Oreport option. For more information, see Chapter 8 “Optimization Report”.

Combining the attributes

Table 9-3 “Attributes for loop_parallel, prefer_parallel” describes the acceptable combinations of loop_parallel and prefer_parallel attributes. In such combinations, the attributes can be listed in any order.

The loop_parallel C pragma requires the ivar = indvar attribute, which specifies the primary loop induction variable. If this is not present, the compiler issues a warning and ignores the pragma. ivar should specify only the primary induction variable. Any other loop induction variables should be a function of this variable and should be declared loop_private.

In Fortran, ivar is optional for DO loops. If it is not provided, the compiler picks the primary induction variable for the loop. ivar is required for DO WHILE and customized loops in Fortran.

NOTE: prefer_parallel does not take ivar. The compiler issues an error if ivar is specified with prefer_parallel.
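The rule about secondary induction variables can be illustrated with a short C sketch. The function gather_even and its arguments are hypothetical. The loop_private pragma spelling is an assumption based on the directive named in this section; its exact C form and placement are covered in Chapter 10 and should be checked there.

```c
#include <assert.h>
#include <stddef.h>

/* Copy every second element of src into dst.
   src must hold at least 2*n - 1 elements when n > 0. */
void gather_even(double *dst, const double *src, int n)
{
    int i;   /* primary induction variable, named in ivar */
    int j;   /* secondary induction variable */

    assert(dst != NULL && src != NULL && n >= 0);

#pragma _CNX loop_parallel(ivar = i)
#pragma _CNX loop_private(j)   /* assumed C form of loop_private */
    for (i = 0; i < n; i++) {
        j = 2 * i;             /* recomputed from i in every iteration */
        dst[i] = src[j];
    }
}
```

Writing j = 2 * i rather than incrementing j by 2 at the end of each iteration removes the dependence between iterations, so each thread can compute its own private j from the i values it is given.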

Comparing prefer_parallel, loop_parallel

The prefer_parallel and loop_parallel directives and pragmas are used to parallelize loops. Table 9-4 “Comparison of loop_parallel and prefer_parallel” provides an overview of the differences between the two pragmas/directives. See the sections “prefer_parallel” and “loop_parallel” for more information.

Table 9-4 Comparison of loop_parallel and prefer_parallel

Description

  prefer_parallel:

    Requests compiler to perform parallelization analysis on the following loop then parallelize the loop if it is safe to do so.

    When used with the +Oautopar option (the default), it overrides the compiler heuristic for picking which loop in a loop nest to parallelize.

    When used with +Onoautopar, the compiler only performs directive-specified parallelization. No heuristic is used to pick the loop in a nest to parallelize. In such cases, prefer_parallel requests loop parallelization.

  loop_parallel:

    Forces the compiler to parallelize the following loop, assuming the iteration count can be determined prior to loop invocation.

Advantages

  prefer_parallel:

    Compiler automatically performs parallelization analysis and variable privatization.

  loop_parallel:

    Allows you to parallelize loops that the compiler is not able to automatically parallelize because it cannot determine dependences or side effects.

Disadvantages

  prefer_parallel:

    Loop may or may not execute in parallel.

  loop_parallel:

    Requires you to:

    - Check for and synchronize any data dependences

    - Perform variable privatization

 

Stride-based parallelism

Stride-based parallelism differs from the default strip-based parallelism in that:

  • Strip-based parallelism divides the loop's iterations into a number of contiguous chunks equal to the number of available threads, and each thread computes one chunk.

  • Stride-based parallelism, set by the chunk_size=n attribute, allows each thread to do several noncontiguous chunks.

    Specifying chunk_size = ((number of iterations - 1) / number of threads) + 1 is similar to default strip mining for parallelization.

    Using chunk_size = 1 distributes individual iterations cyclically across the processors. For example, if a loop has 1000 iterations to be distributed among 4 processors, specifying chunk_size=1 causes the distribution shown in Table 9-5 “Iteration distribution using chunk_size = 1”.

Table 9-5 Iteration distribution using chunk_size = 1

            CPU0  CPU1  CPU2  CPU3
Iterations  1     2     3     4
            5     6     7     ...

 

For chunk_size=n, with n > 1, the distribution is round-robin. However, it is not the same as specifying the ordered attribute. For example, using the same loop as above, specifying chunk_size=5 produces the distribution shown in Table 9-6 “Iteration distribution using chunk_size = 5”.

Table 9-6 Iteration distribution using chunk_size = 5

            CPU0                CPU1                CPU2                CPU3
Iterations  1, 2, 3, 4, 5       6, 7, 8, 9, 10      11, 12, 13, 14, 15  16, 17, 18, 19, 20
            21, 22, 23, 24, 25  26, 27, 28, 29, 30  31, 32, 33, 34, 35  ...

 

For more information and examples on using the chunk_size = n attribute, see Chapter 13 “Troubleshooting”.
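The distributions in Table 9-5 and Table 9-6 follow a simple rule that can be written down in a few lines of C. This is a sketch of the arithmetic only, not of how the compiler implements it; the function names chunk_owner and strip_chunk_size are hypothetical. Iterations are numbered from 1 and CPUs from 0, as in the tables.

```c
#include <assert.h>

/* CPU (0-based) that executes 1-based iteration `iter` when chunks of
   `chunk_size` iterations are dealt round-robin to `nthreads` CPUs. */
int chunk_owner(int iter, int chunk_size, int nthreads)
{
    assert(iter >= 1 && chunk_size >= 1 && nthreads >= 1);
    return ((iter - 1) / chunk_size) % nthreads;
}

/* Chunk size that approximates default strip mining:
   ((iterations - 1) / threads) + 1, the ceiling of iterations / threads. */
int strip_chunk_size(int iterations, int nthreads)
{
    assert(iterations >= 1 && nthreads >= 1);
    return (iterations - 1) / nthreads + 1;
}
```

For example, chunk_owner(5, 1, 4) is 0 and chunk_owner(21, 5, 4) is 0, matching the second row of each table, and strip_chunk_size(1000, 4) is 250, so each of the 4 CPUs would receive a single contiguous chunk.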

prefer_parallel, loop_parallel

The following Fortran example uses the PREFER_PARALLEL directive, but applies to LOOP_PARALLEL as well:

C$DIR PREFER_PARALLEL(CHUNK_SIZE = 4)
      DO I = 1, 100
        A(I) = B(I) + C(I)
      ENDDO

In this example, the loop is parallelized by parcelling out chunks of four iterations to each available thread. Figure 9-1 “Stride-parallelized loop” uses Fortran 90 array syntax to illustrate the iterations performed by each thread, assuming eight available threads.

Figure 9-1 “Stride-parallelized loop” shows that the 100 iterations of I are parcelled out in chunks of four iterations to each of the eight available threads. After the chunks are distributed evenly to all threads, there is one chunk left over (iterations 97:100), which executes on thread 0.

Figure 9-1 Stride-parallelized loop
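The chunk counts described for this example can be checked with a small C helper. This is a sketch of the arithmetic under the round-robin rule stated in the text; the function name chunks_for_thread is hypothetical and threads are numbered from 0.

```c
#include <assert.h>

/* Number of chunks that 0-based thread `tid` executes when `iterations`
   loop iterations are split into chunks of `chunk_size` and the chunks
   are dealt round-robin to `nthreads` threads. */
int chunks_for_thread(int iterations, int chunk_size, int nthreads, int tid)
{
    int total_chunks;

    assert(iterations >= 0 && chunk_size >= 1 && nthreads >= 1);
    assert(tid >= 0 && tid < nthreads);

    /* Ceiling division: a final partial chunk still counts as a chunk. */
    total_chunks = (iterations + chunk_size - 1) / chunk_size;

    /* Every thread gets total/nthreads chunks; the first
       total % nthreads threads each get one extra chunk. */
    return total_chunks / nthreads + (tid < total_chunks % nthreads ? 1 : 0);
}
```

With 100 iterations, a chunk size of 4, and 8 threads there are 25 chunks, so chunks_for_thread(100, 4, 8, 0) is 4 (the extra chunk holds iterations 97:100) and every other thread gets 3.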


prefer_parallel, loop_parallel

The chunk_size = n attribute is most useful on loops in which the amount of work increases or decreases as a function of the iteration count. These loops are also known as triangular loops. The following Fortran example shows such a loop. As with the previous example, PREFER_PARALLEL is used here, but the concept also applies to LOOP_PARALLEL.

C$DIR PREFER_PARALLEL(CHUNK_SIZE = 4)
      DO J = 1,N
        DO I = J, N
          A(I,J) = ...
          .
          .
          .
        ENDDO
      ENDDO

Here, the work of the I loop decreases as J increases. By specifying a chunk_size for the J loop, the load is more evenly balanced across the threads executing the loop.

If this loop were strip-mined in the traditional manner, the amount of work contained in the strips would decrease with each successive strip. The threads performing early iterations of J would do substantially more work than those performing later iterations.
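The load imbalance can be quantified with a small model in C. This is an illustrative sketch, not a measurement: it assumes the cost of outer iteration J is just the inner trip count N - J + 1, and it models traditional strip mining as one chunk per thread, with chunk size ((N - 1) / threads) + 1. The names inner_work and max_thread_work are hypothetical.

```c
#include <assert.h>

/* Inner-loop trip count for 1-based outer iteration j of the
   triangular nest DO J = 1,N / DO I = J,N. */
static long inner_work(int j, int n)
{
    return (long)(n - j + 1);
}

/* Largest per-thread work when chunks of `chunk_size` outer iterations
   are dealt round-robin to `nthreads` threads. */
long max_thread_work(int n, int chunk_size, int nthreads)
{
    long max_work = 0;
    int tid, j;

    assert(n >= 1 && chunk_size >= 1 && nthreads >= 1);

    for (tid = 0; tid < nthreads; tid++) {
        long work = 0;
        for (j = 1; j <= n; j++) {
            /* Same ownership rule as round-robin chunking. */
            if (((j - 1) / chunk_size) % nthreads == tid) {
                work += inner_work(j, n);
            }
        }
        if (work > max_work) {
            max_work = work;
        }
    }
    return max_work;
}
```

For N = 64 on 4 threads, the total work in this model is 2080 inner iterations, or 520 per thread if perfectly balanced. max_thread_work(64, 16, 4), the strip-mined case, is 904, while max_thread_work(64, 4, 4) is 616, so the busiest thread does noticeably less work when CHUNK_SIZE = 4 is used.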

critical_section, end_critical_section

The critical_section and end_critical_section directives and pragmas allow you to specify sections of code in parallel loops or tasks that must be executed by only one thread at a time. These directives cannot be used for ordered synchronization within a loop_parallel(ordered) loop, but are suitable for simple synchronization in any other loop_parallel loops. Use the ordered_section and end_ordered_section directives or pragmas for ordered synchronization within a loop_parallel(ordered) loop.

A critical_section directive or pragma and its associated end_critical_section must appear in the same procedure and under the same control flow. They do not have to appear in the same procedure as the parallel construct in which they are used. For instance, the pair can appear in a procedure called from a parallel loop.

The forms of these directives and pragmas are shown in Table 9-7 “Forms of critical_section/end_critical_section directives and pragmas”.

Table 9-7 Forms of critical_section/end_critical_section directives and pragmas

Language    Form

Fortran     C$DIR CRITICAL_SECTION [(gate)]
            C$DIR END_CRITICAL_SECTION

C           #pragma _CNX critical_section [(gate)]
            #pragma _CNX end_critical_section

 

The critical_section directive/pragma can take an optional gate attribute that allows the declaration of multiple critical sections. This is described in “Using gates and barriers”. Only simple critical sections are discussed in this section.

critical_section

Consider the following Fortran example:

C$DIR LOOP_PARALLEL, LOOP_PRIVATE(FUNCTEMP)
      DO I = 1, N   ! LOOP IS PARALLELIZABLE
        .
        .
        .
        FUNCTEMP = FUNC(X(I))
C$DIR CRITICAL_SECTION
        SUM = SUM + FUNCTEMP
C$DIR END_CRITICAL_SECTION
        .
        .
        .
      ENDDO

Because FUNC has no side effects, it can be called in parallel, and the I loop can be parallelized as long as the SUM variable is updated by only one thread at a time. The critical section created around SUM ensures this behavior.

The LOOP_PARALLEL directive and the critical section directive are required to parallelize this loop because the call to FUNC would normally inhibit parallelization. If this call were not present, and if the loop did not contain other parallelization inhibitors, the compiler would automatically parallelize the reduction of SUM as described in the section “Reductions”. However, the presence of the call necessitates the LOOP_PARALLEL directive, which prevents the compiler from automatically handling the reduction.

This, in turn, requires using either a critical section directive or the reduction directive. Placing the call to FUNC outside of the critical section allows FUNC to be called in parallel, decreasing the amount of serial work within the critical section.

In order to justify the cost of the compiler-generated synchronization code associated with the use of critical sections, loops that contain them must also contain a large amount of parallelizable (non-critical section) code. If you are unsure of the profitability of using a critical section to help parallelize a certain loop, time the loop with and without the critical section. This helps to determine if parallelization justifies the overhead of the critical section.

For this particular example, the reduction directive or pragma could have been used in place of the critical_section, end_critical_section combination. For more information, see the section “Reductions”.
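A C counterpart of the critical-section example might look like the sketch below. The names func and sum_of_func are hypothetical, and func stands in for any side-effect-free routine. The critical_section pragmas follow Table 9-7 and loop_parallel follows Table 9-2; the loop_private spelling and its placement as a separate pragma are assumptions to be checked against Chapter 10.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical side-effect-free function applied to each element. */
static double func(double v)
{
    return v * v;
}

double sum_of_func(const double *x, int n)
{
    double sum = 0.0;
    double functemp;
    int i;

    assert(x != NULL && n >= 0);

#pragma _CNX loop_parallel(ivar = i)
#pragma _CNX loop_private(functemp)   /* assumed C form of loop_private */
    for (i = 0; i < n; i++) {
        /* The call stays outside the critical section so that it can
           run concurrently on all threads. */
        functemp = func(x[i]);
#pragma _CNX critical_section
        sum = sum + functemp;         /* only one thread at a time */
#pragma _CNX end_critical_section
    }
    return sum;
}
```

Keeping only the update of sum between the two pragmas limits the serialized work to a single addition per iteration. On a compiler that ignores _CNX pragmas the function runs serially and returns the same result.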

Disabling automatic loop thread-parallelization

You can disable automatic loop thread-parallelization by specifying the +Onoautopar option on the compiler command line. +Onoautopar is only meaningful when specified with the +Oparallel option at +O3 or +O4.

This option causes the compiler to parallelize only those loops that are immediately preceded by prefer_parallel or loop_parallel. Because the compiler does not automatically find parallel tasks or regions, user-specified task and region parallelization is not affected by this option.

© Hewlett-Packard Development Company, L.P.