Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide > Chapter 4 Basic shared-memory programming

Simple manual loop, task, and region parallelization


The Exemplar compilers automatically exploit strip-based loop parallelism in loops that are clearly dependence-free, as described in Chapter 3, “Compiler optimizations.” The prefer_parallel, loop_parallel, and parallel directives and pragmas allow you to increase parallelization opportunities and to manually control many aspects of parallelization.

The compiler cannot automatically locate task parallelism, but the tasking directives and pragmas mentioned in Chapter 3, “Compiler optimizations” (and discussed here), allow you to specify consecutive blocks of code that can be run in parallel. Similarly, the parallel and end_parallel directives and pragmas allow you to specify a code region that can be run in its entirety on several processors.

The subsections that follow discuss specifying simple, unordered loop, task, and region parallelism using the prefer_parallel, loop_parallel, begin_tasks/next_task/end_tasks, and parallel/end_parallel directives and pragmas. These directives and pragmas can be nested in any order as long as node-parallelism is outside thread-parallelism.

Critical sections that do not rely on ordered execution are also covered here. Any necessary variable privatization is provided by the loop_private, task_private, and parallel_private directives and pragmas, which are described in detail in the “Loop-specific, task-specific, and region-specific data privatization” section of this chapter.

For a detailed discussion of ordered parallelism, parallel synchronization, and the effective use of memory classes, refer to Chapter 5, “Memory classes,” and Chapter 6, “Advanced shared-memory programming.”

Loop parallelization

This section discusses simple uses of the prefer_parallel and loop_parallel directives and pragmas, which, when specified, apply to the immediately following loop. The data privatization necessary when using loop_parallel is illustrated in this chapter's examples using the loop_private directive, which is discussed in the section loop_private. Manual data privatization using memory classes is discussed in Chapter 5, “Memory classes,” and Chapter 6, “Advanced shared-memory programming.”

NOTE: Use these directives and pragmas only on Fortran DO and C for loops that have iteration counts that can be determined prior to loop invocation at runtime.

prefer_parallel and loop_parallel generally take the same attributes. The threads attribute is the default attribute for both prefer_parallel and loop_parallel. In Fortran, these directives have the following form:

C$DIR PREFER_PARALLEL[(attribute-list)]

and

C$DIR LOOP_PARALLEL[(attribute-list)]

In C, they have the form:

#pragma _CNX prefer_parallel[(attribute-list)]

and

#pragma _CNX loop_parallel(ivar = indvar[,attribute-list])

where

ivar = indvar

specifies that the primary loop induction variable is indvar. ivar = indvar is optional in Fortran, but required in C. Use it only with loop_parallel.

the optional attribute-list

can contain one of the following case-insensitive attributes.

NOTE: The values of n and m must be compile-time constants for all of the following attributes in which they appear.

threads

Causes thread-parallelism. This is the default.

nodes

Causes thread-based node-parallelism. See the section nodes for additional information.

dist

Causes the compiler to distribute the iterations of a loop across active threads instead of spawning new threads. Use dist with prefer_parallel or loop_parallel inside a parallel/end_parallel region. The level of parallelism is determined by using either the threads or nodes attribute in the parallel directive or pragma. See “Region parallelization” for more information.

ordered

Causes ordered invocation of each loop iteration; provides no automatic synchronization. Designed for use with the loop_parallel directive and pragma on loops containing ordered sections.

max_threads = m

Allows no more than m threads to be allocated to the execution of the loop. m must be an integer constant.

chunk_size = n

Divides the loop into chunks of n or fewer iterations, and distributes the chunks round-robin to the processors, as shown in Table 4-1, “Iteration distribution using chunk_size = 1,” and Table 4-2, “Iteration distribution using chunk_size = 5.” n must be an integer constant.

threads, ordered

Causes ordered invocation of each iteration across threads.

nodes, ordered

Causes ordered invocation of each iteration across hypernodes.

dist, ordered

Causes ordered invocation of each iteration across threads or nodes, as specified in the attribute list to the parallel directive.

threads, max_threads = m

Causes thread-parallelism on no more than m threads.

nodes, max_threads = m

Causes node-parallelism on no more than m nodes; this starts one thread per node on no more than m hypernodes.

dist, max_threads = m

Causes thread-parallelism or node-parallelism (as determined by the attribute list to the parallel directive) on no more than m threads (if thread-parallelism) or nodes (if node-parallelism).

ordered, max_threads = m

Causes ordered parallelism on no more than m threads.

threads, chunk_size = n

Causes thread-parallelism by chunks.

nodes, chunk_size = n

Causes node-parallelism by chunks.

dist, chunk_size = n

Causes thread-parallelism or node-parallelism (as determined by the attribute list to the parallel directive) by chunks.

threads, ordered, max_threads = m

Causes ordered thread-parallelism on no more than m threads.

nodes, ordered, max_threads = m

Causes ordered node-parallelism on no more than m hypernodes.

dist, ordered, max_threads = m

Causes ordered thread-parallelism on no more than m threads, or ordered node-parallelism on no more than m hypernodes—depending on the attribute list used with the parallel directive.

chunk_size = n, max_threads = m

Causes chunk parallelism on no more than m threads.

threads, chunk_size = n, max_threads = m

Causes thread-parallelism by chunks of size n on no more than m threads.

nodes, chunk_size = n, max_threads = m

Causes node-parallelism by chunks of size n on no more than m hypernodes.

dist, chunk_size = n, max_threads = m

Causes thread-parallelism by chunks on no more than m threads, or node-parallelism by chunks on no more than m hypernodes—depending on the attribute list used with the parallel directive.

Combining the attributes

The allowed combinations of attributes are those combinations listed in the preceding section. In such combinations the attributes can be listed in any order.

The loop_parallel C pragma requires the ivar = indvar attribute, which specifies the primary loop induction variable. If this is not present, the compiler will issue a warning and ignore the pragma. ivar should specify only the primary induction variable; any other loop induction variables should be a function of this variable and should be declared loop_private.
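
As a rough illustration of what "a function of this variable" can mean in practice, the following C sketch rewrites a secondary induction variable k, which a serial loop advances by 2 on every trip, as a closed-form expression of the primary induction variable i. The function names, array names, and stride are hypothetical. The pragmas appear only inside a comment, as assumed usage, so that the fragment compiles with any C compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Serial form: k is carried from one iteration to the next,
 * so the iterations cannot be executed independently. */
void gather_serial(const double *src, double *dst, size_t n)
{
    size_t i, k = 0;
    for (i = 0; i < n; i++) {
        dst[i] = src[k];
        k += 2;                 /* secondary induction variable */
    }
}

/* Parallel-ready form: k is recomputed from i in every iteration.
 * On an Exemplar compiler the loop would be preceded by something like
 * (assumed usage, not verified here):
 *   #pragma _CNX loop_parallel(ivar = i)
 *   #pragma _CNX loop_private(k)
 */
void gather_closed_form(const double *src, double *dst, size_t n)
{
    size_t i, k;
    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++) {
        k = 2 * i;              /* depends on the primary variable only */
        dst[i] = src[k];
    }
}
```

Both functions produce the same dst array; only the second leaves each iteration independent of the one before it.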

In Fortran, ivar is optional for DO loops; if not provided, the compiler will pick the primary induction variable for the loop. ivar is required for DO WHILE and hand-rolled loops in Fortran.

prefer_parallel does not take ivar; the compiler issues an error if ivar is specified with prefer_parallel.

Using the attributes

The attributes associated with the prefer_parallel and loop_parallel directives and pragmas are explained in the following sections.

threads

The optional threads attribute causes parallelization across threads; this is the default for loop_parallel and prefer_parallel. If the threads attribute appears in a parallelization directive on the outermost loop in a nest, the loop will go parallel on all the threads available to the process. If the threads attribute appears in a parallelization directive nested within a node-parallel construct, the specified loop will go thread-parallel on the threads available within each parallel hypernode.

nodes

The nodes attribute causes parallelization across hypernodes in a multinode, scalable SMP system. In this case, a single thread on each available hypernode will execute a portion of the specified loop. A node-parallel construct cannot exist inside a thread-parallel construct. See the section “Node-parallelism vs. thread-parallelism” for a comparison of the two levels of parallelism.

dist

The dist attribute tells the compiler to distribute the iterations of a loop across the currently active threads—instead of spawning new threads. Using currently active threads significantly reduces the parallelization overhead. loop_parallel(dist) and prefer_parallel(dist) should be used inside a parallel/end_parallel region. The level of parallelism is determined by using either the threads or nodes attribute to the parallel directive (or pragma). See “Region parallelization” for information on the attributes available to the parallel directive and pragma.

The dist attribute can be used with any prefer_parallel or loop_parallel attribute, except the nodes or threads attributes.

NOTE: Any loop under the influence of loop_parallel(dist) or of prefer_parallel(dist) will appear in the Optimization Report (generated by specifying the +Oreport option) as being serial, because it is already inside a parallel region. For more information on the Optimization Report, see Appendix E, “Optimization Report.”

In the following example, threads are spawned when the parallel directive is used. No additional threads are spawned until the loop_parallel directive is used without the dist attribute:

C$DIR PARALLEL (NODES, MAX_THREADS = 4), PARALLEL_PRIVATE(A, C)
C     SPAWN ONE THREAD PER NODE, UP TO A MAXIMUM OF 4

      A = B         ! THIS STATEMENT WILL BE EXECUTED BY ALL 4 NODE-WAY THREADS

C$DIR LOOP_PARALLEL(DIST, MAX_THREADS = 3)
      DO I = 1, 10000
                    ! THIS LOOP WILL BE DISTRIBUTED TO AT MOST 3 OF
        X(I) = Y(I) ! THE 4 ACTIVE NODE-WAY THREADS; THIS MEANS THAT
                    ! EACH NODE-WAY THREAD EXECUTES
                    ! ABOUT 10000/3 ITERATIONS
      ENDDO

      C = X(1)      ! THIS STATEMENT WILL BE EXECUTED BY ALL
                    ! NODE-WAY THREADS

C$DIR LOOP_PARALLEL(DIST)
      DO J = 1, 10000
                    ! THIS LOOP WILL BE DISTRIBUTED TO THE 4 ACTIVE
        Y(J) = X(J) ! NODE-WAY THREADS, MEANING THAT EACH THREAD
                    ! EXECUTES 10000/4 ITERATIONS
      ENDDO

C$DIR LOOP_PARALLEL
      DO K = 1, 10000
                    ! SPAWN ADDITIONAL THREADS ON EACH NODE, UP TO THE
        W(K, MY_NODE()) = X(K) ! MAXIMUM AVAILABLE (TYPICALLY 16) AND
                    ! ON EACH NODE, DISTRIBUTE THE WORK ACROSS
                    ! ALL THE THREADS SO THAT
                    ! EACH THREAD EXECUTES 10000/16 ITERATIONS
      ENDDO
C$DIR END_PARALLEL

loop_parallel and loop_parallel(dist) directives can be nested as long as node-parallel loops are outside all thread-parallel loops. The compiler will pick the loop that is most appropriate for the directive or pragma being processed (the loop picked is usually the outermost parallel loop).

ordered

The ordered attribute causes the iterations of the loop to be initiated in loop order across the processors. It is useful only in loops with manually-synchronized dependences, so it is only useful with the loop_parallel directive. To achieve ordered parallelism, dependences must be synchronized within ordered sections, such as those constructed using the ordered_section and end_ordered_section directives. Using loop_parallel(ordered) and its associated synchronization directives is covered in Chapter 6, “Advanced shared-memory programming.”

max_threads = m

The max_threads = m attribute restricts execution of the specified loop to no more than m threads if specified alone or with the threads attribute; if specified with the nodes attribute, execution is restricted to m nodes running one thread each. If specified with the chunk_size = n attribute, the chunks are parallelized across no more than m threads. max_threads = m is useful when you know the maximum number of threads your loop will run on efficiently.

chunk_size = n

The chunk_size = n attribute specifies a number of iterations by which to strip mine the loop for parallelization. If this attribute is present alone or with the threads attribute, chunks of n or fewer loop iterations are distributed round-robin to each available thread until there are no remaining iterations, as shown in Table 4-1, “Iteration distribution using chunk_size = 1,” and Table 4-2, “Iteration distribution using chunk_size = 5.” If chunk_size = n is combined with the nodes attribute, the chunks are distributed round-robin to each available hypernode until there are no remaining chunks. If the number of chunks is not a multiple of the number of threads, some threads will perform one fewer chunk than others. n must be a compile-time integer constant.

This stride-based parallelism differs from the default strip-based parallelism described in Chapter 3, “Compiler optimizations,” which divides the loop's iterations into a number of contiguous chunks equal to the number of available threads, with each thread computing one chunk. The chunk_size = n attribute allows each thread to do several noncontiguous chunks.

Specifying chunk_size = ((number of iterations - 1) / number of threads) + 1 is similar to default strip mining for parallelization.
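
The formula above is an integer ceiling division. The following C sketch, with hypothetical function names, evaluates it and counts the chunks that result, which can be handy for working out the constant by hand. It is only arithmetic: the chunk_size value given to the directive or pragma must still be a compile-time constant.

```c
#include <assert.h>

/* Chunk size that mimics default strip mining: the ceiling of
 * iterations / threads, using the integer formula from the text. */
int strip_chunk_size(int iterations, int threads)
{
    assert(iterations > 0 && threads > 0);
    return ((iterations - 1) / threads) + 1;
}

/* Number of chunks produced when a loop of `iterations` trips is cut
 * into chunks of `chunk_size` iterations (the last chunk may be short). */
int chunk_count(int iterations, int chunk_size)
{
    assert(iterations > 0 && chunk_size > 0);
    return ((iterations - 1) / chunk_size) + 1;
}
```

For 1000 iterations on 4 threads this gives a chunk size of 250 and exactly 4 chunks; for 100 iterations on 8 threads it gives 13 and 8 chunks, the last holding 9 iterations. Because the chunk count never exceeds the thread count, each thread receives at most one contiguous chunk, which is why the result resembles strip mining.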

Using chunk_size = 1 distributes individual iterations cyclically across the processors. For example, if a loop has 1000 iterations to be distributed among 4 processors, specifying chunk_size = 1 causes the distribution shown in Table 4-1, “Iteration distribution using chunk_size = 1.”

Table 4-1 Iteration distribution using chunk_size = 1

             CPU0    CPU1    CPU2    CPU3
Iterations   1       2       3       4
             5       6       7       ...

 

For chunk_size = n, with n > 1, the distribution is round-robin; however, it is not the same as specifying the ordered attribute. For example, using the same loop as above, specifying chunk_size = 5 produces the distribution shown in Table 4-2, “Iteration distribution using chunk_size = 5.”

Table 4-2 Iteration distribution using chunk_size = 5

             CPU0                CPU1                CPU2                CPU3
Iterations   1, 2, 3, 4, 5       6, 7, 8, 9, 10      11, 12, 13, 14, 15  16, 17, 18, 19, 20
             21, 22, 23, 24, 25  26, 27, 28, 29, 30  31, 32, 33, 34, 35  ...
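
The two tables follow one rule, sketched below in C: number the chunks from zero and deal them to the processors in turn. The function name is hypothetical, and the mapping is inferred from the tables rather than taken from the compiler's runtime.

```c
#include <assert.h>

/* CPU (0-based) that executes 1-based iteration `iter` when chunks of
 * `chunk_size` iterations are dealt round-robin to `ncpus` processors. */
int chunk_owner(int iter, int chunk_size, int ncpus)
{
    assert(iter >= 1 && chunk_size >= 1 && ncpus >= 1);
    return ((iter - 1) / chunk_size) % ncpus;
}
```

With chunk_size = 1 and 4 processors, iteration 7 maps to CPU2, as in Table 4-1. With chunk_size = 5, iteration 21 maps back to CPU0 and iteration 31 to CPU2, as in Table 4-2.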

 

Consider the following Fortran example, which uses the PREFER_PARALLEL directive, but applies to LOOP_PARALLEL as well:

C$DIR PREFER_PARALLEL(CHUNK_SIZE = 4)
      DO I = 1, 100
        A(I) = B(I) + C(I)
      ENDDO

In this example, the loop is parallelized by parcelling out chunks of 4 iterations to each available thread. Figure 4-1 “Stride-parallelized loop” uses Fortran 90 array syntax to illustrate the iterations performed by each thread, assuming 8 available threads.

Figure 4-1 “Stride-parallelized loop” shows that the 100 iterations of I are parcelled out in chunks of 4 iterations to each of the 8 available threads; after the chunks are distributed evenly to all threads, there is one chunk left over (iterations 97:100), which executes on thread 0.

Figure 4-1 Stride-parallelized loop

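
The distribution described for Figure 4-1 can be approximated with the C sketch below, which counts the chunks each thread receives and prints each thread's share as Fortran 90 array sections. The function names are hypothetical, and the sketch assumes that leftover chunks go to the lowest-numbered threads, consistent with the statement that iterations 97:100 execute on thread 0.

```c
#include <assert.h>
#include <stdio.h>

/* Number of chunks thread `tid` executes for a loop of `iterations`
 * trips cut into chunks of `chunk_size`, dealt round-robin to `nthreads`. */
int chunks_for_thread(int tid, int iterations, int chunk_size, int nthreads)
{
    assert(tid >= 0 && tid < nthreads && iterations > 0 && chunk_size > 0);
    int total = (iterations - 1) / chunk_size + 1;  /* total chunks */
    int base = total / nthreads;                    /* complete rounds */
    /* Assumption: leftover chunks go to the lowest thread IDs. */
    return base + (tid < total % nthreads ? 1 : 0);
}

/* Print each thread's work as Fortran 90 array sections, e.g. A(1:4). */
void print_distribution(int iterations, int chunk_size, int nthreads)
{
    for (int tid = 0; tid < nthreads; tid++) {
        printf("thread %d:", tid);
        for (int lo = tid * chunk_size + 1; lo <= iterations;
             lo += chunk_size * nthreads) {
            int hi = lo + chunk_size - 1;
            if (hi > iterations)
                hi = iterations;        /* final chunk may be short */
            printf(" A(%d:%d)", lo, hi);
        }
        printf("\n");
    }
}
```

Calling print_distribution(100, 4, 8) lists A(1:4), A(33:36), A(65:68), and A(97:100) for thread 0, and three sections for each of the other seven threads.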

An analogous C example follows:

#pragma _CNX prefer_parallel(chunk_size = 4)
for(i=0;i<100;i++)
    a[i] = b[i] + c[i];

The chunk_size = n attribute is most useful on loops in which the amount of work increases or decreases as a function of the iteration count. (These loops are also known as triangular loops.) The following Fortran example shows such a loop. Again, PREFER_PARALLEL is used here, but the concept applies to LOOP_PARALLEL also.

C$DIR PREFER_PARALLEL(CHUNK_SIZE = 4)
      DO J = 1,N
        DO I = J, N
          A(I,J) = ...
          .
          .
          .
        ENDDO
      ENDDO

Here, the work of the I loop decreases as J increases. By specifying a chunk_size for the J loop, we more evenly balance the load across the threads executing the loop. If this loop were strip mined in the traditional manner, the amount of work contained in the strips would decrease with each successive strip; the threads performing early iterations of J would do substantially more work than those performing later iterations.

An analogous C example follows:

#pragma _CNX prefer_parallel(chunk_size = 4)
for(j=0;j<n;j++)
    for(i=j;i<n;i++) {
        a[i][j] = ...
        .
        .
        .
    }
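
To put numbers on the load-balancing argument, the following C sketch counts the inner-loop trips each thread would execute for the triangular nest above under a given chunk size. It is a serial model with a hypothetical function name, not a measurement, and it assumes the same round-robin dealing of chunks shown in Tables 4-1 and 4-2.

```c
#include <assert.h>

/* Inner-loop trips executed by thread `tid` for the triangular nest
 *   for (j = 0; j < n; j++) for (i = j; i < n; i++) ...
 * when outer iterations are dealt round-robin in chunks of `chunk_size`. */
long triangular_work(int tid, int n, int chunk_size, int nthreads)
{
    assert(tid >= 0 && tid < nthreads && n > 0 && chunk_size > 0);
    long work = 0;
    for (int j = 0; j < n; j++)
        if ((j / chunk_size) % nthreads == tid)
            work += n - j;      /* outer iteration j runs n - j inner trips */
    return work;
}
```

For n = 64 on 4 threads, one contiguous strip per thread (chunk size 16) gives thread 0 a total of 904 inner trips and thread 3 only 136. With a chunk size of 4 the same threads get 616 and 424, a much narrower spread.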

For more information and examples on using the chunk_size = n attribute, see the sections “Distributing iterations on cache line boundaries” and “Triangular loops”.

prefer_parallel

The prefer_parallel directive and pragma cause the compiler to parallelize the immediately following loop if it is free of dependences and other parallelization inhibitors. The compiler automatically privatizes any loop variables that must be privatized. prefer_parallel requires less manual intervention and is less forceful than the loop_parallel directive and pragma.

On multihypernode systems, when prefer_parallel is specified without a nodes or threads attribute, the compiler will determine if opportunities for parallelism exist within the loop and, if possible, parallelize the loop across threads. If the threads attribute is specified, the compiler attempts to find and exploit thread-parallelism within the loop. If the nodes attribute is specified, the compiler tries to locate and exploit node-parallelism within the loop.

prefer_parallel can also be used to indicate the preferred loop in a nest to parallelize, as shown in the following Fortran example:

      DO J = 1, 100
C$DIR PREFER_PARALLEL
        DO I = 1, 100
        .
        .
        .
        ENDDO
      ENDDO

In this example, PREFER_PARALLEL causes the compiler to choose the innermost loop for parallelization, provided it is free of dependences. PREFER_PARALLEL does not inhibit loop interchange.

An analogous C example follows:

for(j=0;j<100;j++)
#pragma _CNX prefer_parallel
    for(i=0;i<100;i++) {
    .
    .
    .
    }

Do not use the ordered attribute in a prefer_parallel directive, as it is only useful if the loop contains synchronized dependences, and prefer_parallel will not parallelize a loop containing any loop-carried dependences. The ordered attribute is useful in the loop_parallel directive, as described in Chapter 6, “Advanced shared-memory programming.”

loop_parallel

The loop_parallel directive forces parallelization of the immediately following loop. The compiler does not check for data dependences, perform variable privatization, or perform parallelization analysis. You must synchronize any dependences manually and manually privatize loop data as necessary. In the absence of a nodes or threads attribute, loop_parallel defaults to thread parallelization.

The section “Critical sections” contains an example of using loop_parallel to parallelize a loop with a dependence; the dependence is manually handled in a critical section.

The threads, nodes, chunk_size = n, and max_threads = m attributes and combinations of these attributes have exactly the same effect as explained for prefer_parallel. loop_parallel(ordered) is useful for manually parallelizing loops containing manually-ordered dependences, as described in Chapter 6, “Advanced shared-memory programming.”

Parallelizing loops with calls

loop_parallel can be useful for manually parallelizing loops containing procedure calls.

Consider the following Fortran example:

C$DIR LOOP_PARALLEL
      DO I = 1, N
        X(I) = FUNC(I)
      ENDDO

The call to FUNC in this loop would normally prevent it from parallelizing. However, if you are sure that FUNC has no side effects and is compiled for reentrancy (the default on Exemplar compilers), this loop can be safely parallelized as shown. (A function does not have side effects if it does not modify its arguments, it does not modify the same memory location from one call to the next, it performs no I/O, and it does not call any procedures that have side effects.) If FUNC does have side effects or is not reentrant, this loop may yield wrong answers.

An analogous C example follows:

#pragma _CNX loop_parallel(ivar=i)
for(i=0;i<n;i++)
    x[i] = func(i);
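
The side-effect conditions listed above can be made concrete with two small C functions. Both names are hypothetical and neither comes from the guide. The first meets every condition; the second breaks the "same memory location from one call to the next" condition by updating a static counter.

```c
#include <assert.h>

/* No side effects: the result depends only on the argument, and nothing
 * outside the function is read or written. */
double halve(int i)
{
    assert(i >= 0);
    return 0.5 * (double)i;
}

/* Side effect: every call updates the same static location, so calls
 * made concurrently from different threads would race on `calls`, and
 * the value returned depends on how many calls came before. */
double halve_counted(int i)
{
    static long calls = 0;      /* shared by all calls and all threads */
    calls++;
    return 0.5 * (double)i + (double)calls;
}
```

A loop that calls halve gives the same answers in any iteration order. A loop that calls halve_counted does not, even when run serially in a different order, so it is not a candidate for loop_parallel without added synchronization.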

NOTE: In some cases, global register allocation can interfere with loop_parallel loops that contain procedure calls. Refer to the “Global register allocation” section of Chapter 3, “Compiler optimizations,” for more information.

Unparallelizable loops

The compiler will not parallelize any loop that does not have a number of iterations that can be determined prior to loop invocation at execution time, even when loop_parallel is specified.

Consider the following Fortran example:

C$DIR LOOP_PARALLEL
      DO WHILE(A(I) .GT. 0) ! WILL NOT PARALLELIZE
      .
      .
        A(I) = ...
      .
      .
      ENDDO

There is no way the compiler can determine the loop's iteration count prior to loop invocation here, so the loop cannot be parallelized.
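
By contrast, a counted loop has a trip count that follows from its bounds and step alone. The C sketch below, with a hypothetical function name, shows the kind of expression that can be evaluated before the loop is entered. No equivalent expression exists for the DO WHILE above, because its exit test depends on values of A that the loop body itself assigns.

```c
#include <assert.h>

/* Trip count of  for (i = lo; i <= hi; i += step)  with step > 0,
 * computed only from values known before the loop starts. */
long trip_count(long lo, long hi, long step)
{
    assert(step > 0);
    if (hi < lo)
        return 0;               /* zero-trip loop */
    return (hi - lo) / step + 1;
}
```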

Comparing prefer_parallel and loop_parallel

The prefer_parallel and loop_parallel directives (and pragmas) are both used in parallelizing loops. Table 4-3, “Comparison of prefer_parallel and loop_parallel,” gives an overview of the differences between the two directives (pragmas). See the sections prefer_parallel and loop_parallel for more information.

Table 4-3 Comparison of prefer_parallel and loop_parallel

Directive/pragma: prefer_parallel

  Requests compiler to perform parallelization analysis on the following loop, then parallelize the loop if it is safe to do so.

  When used with the +Oautopar option (the default), prefer_parallel overrides the compiler heuristic for picking the loop in a loop nest to parallelize.

  When used with +Onoautopar, the compiler only performs directive-specified parallelization (no heuristic is used to pick the loop in a nest to parallelize); in such cases, prefer_parallel requests loop parallelization.

  Advantages: Compiler performs parallelization analysis and variable privatization for you.

  Disadvantages: Loop may or may not execute in parallel.

Directive/pragma: loop_parallel

  Forces compiler to parallelize the following loop, assuming the iteration count can be determined prior to loop invocation.

  Advantages: Allows you to parallelize loops that the compiler is not able to automatically parallelize because it cannot determine dependences or side effects.

  Disadvantages: You are responsible for:

  • Checking for and synchronizing data dependences

  • Performing variable privatization

 

Task parallelization

The compiler does not automatically parallelize code outside a loop, but you can use tasking directives and pragmas to instruct the compiler to parallelize such code. The begin_tasks directive and pragma tells the compiler to begin parallelizing a series of tasks. The next_task directive and pragma marks the end of a task and the start of the next task. The end_tasks directive and pragma marks the end of a series of tasks to be parallelized and prevents execution from continuing until all tasks have completed. The sections of code delimited by these directives are referred to as a task list.

Within a task list, the compiler does not check for data dependences, perform variable privatization, or perform parallelization analysis. You must manually synchronize any dependences between tasks and manually privatize data as necessary. In the absence of a nodes or threads attribute, begin_tasks defaults to thread parallelization.

The Fortran tasking directives have the following forms:

C$DIR BEGIN_TASKS[(attribute-list)]

C$DIR NEXT_TASK

C$DIR END_TASKS

The C tasking pragmas have the forms:

#pragma _CNX begin_tasks[(attribute-list)]

#pragma _CNX next_task

#pragma _CNX end_tasks

The optional attribute-list can contain one of the following attribute combinations (m is an integer constant):

  • threads

  • nodes

  • dist

  • ordered

  • max_threads = m

  • threads, ordered

  • nodes, ordered

  • dist, ordered

  • threads, max_threads = m

  • nodes, max_threads = m

  • dist, max_threads = m

  • ordered, max_threads = m

  • threads, ordered, max_threads = m

  • nodes, ordered, max_threads = m

  • dist, ordered, max_threads = m

The threads attribute causes the tasks to run thread-parallel, and is the default. As with parallel loops, node-parallelism cannot be nested within thread-parallelism in task lists.

The nodes attribute causes the tasks to run node-parallel, on one thread per available hypernode.

The dist attribute tells the compiler to distribute the tasks across the currently active threads—instead of spawning new threads. Use the dist attribute (along with other valid attributes) to begin_tasks inside a parallel/end_parallel region. begin_tasks and parallel/end_parallel must appear inside the same function. The attribute list to the parallel directive (or pragma) determines the level of parallelism. See “Region parallelization” for information on the attributes available to the parallel directive and pragma.

The ordered attribute causes the tasks to be initiated in their lexical order; that is, the first task in the sequence begins to run on its respective thread before the second, and so on. In the absence of the ordered argument, the starting order will be indeterminate. While this argument ensures an ordered starting sequence, it does not provide any synchronization between tasks, and does not guarantee any particular ending order. You can manually synchronize the tasks as described in Chapter 6, “Advanced shared-memory programming,” if necessary.

Attribute combinations that specify max_threads = m cause the tasks to run on no more than m threads, where m is an integer constant whose value is known at compile time. As shown, these combinations can include thread- or node-parallel and ordered or unordered execution.

The ordered, nodes and ordered, threads attributes cause the tasks to run ordered node-parallel and ordered thread-parallel, respectively.

NOTE: Do not use tasking directives or pragmas unless you ensure that dependences do not exist or you insert your own synchronization code, if necessary, in the code delimited by the tasking directives or pragmas. The compiler performs no dependence checking or synchronization on the code in these regions. Synchronization is discussed in Chapter 6, “Advanced shared-memory programming.”

The following Fortran example shows how to insert tasking directives into a section of code containing three tasks that can be run in parallel:

C$DIR BEGIN_TASKS

parallel task 1

C$DIR NEXT_TASK

parallel task 2

C$DIR NEXT_TASK

parallel task 3

C$DIR END_TASKS

The example above specifies thread-parallelism by default. The compiler transforms the code into a parallel loop and creates machine code equivalent to the following Fortran code:

C$DIR LOOP_PARALLEL(THREADS)
      DO 40 I = 1,3
      GOTO (10,20,30)I
 10   parallel task 1
      GOTO 40
 20   parallel task 2
      GOTO 40
 30   parallel task 3
      GOTO 40
 40   CONTINUE

If there are more tasks than available threads, some threads will execute multiple tasks; if there are more threads than tasks, some threads will not execute tasks.

The END_TASKS directive and pragma acts as a barrier; all parallel tasks must complete before the code following END_TASKS can execute.
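
The same transformation can be pictured in C as a loop with one iteration per task and a switch that selects the task body. The sketch below is a serial stand-in with hypothetical task functions. The pragma is shown only in a comment, as assumed usage, and the code the compiler actually generates may differ.

```c
#include <assert.h>

enum { NUM_TASKS = 3 };

static int done[NUM_TASKS];     /* records which tasks have run */

static void task1(void) { done[0] = 1; }
static void task2(void) { done[1] = 1; }
static void task3(void) { done[2] = 1; }

/* Serial C rendering of the loop built from a three-task list.
 * On an Exemplar compiler the loop would carry something like
 * (assumed usage, not verified here):
 *   #pragma _CNX loop_parallel(threads, ivar = i)
 */
int run_task_list(void)
{
    int i, completed = 0;
    assert(NUM_TASKS == 3);
    for (i = 0; i < NUM_TASKS; i++) {
        switch (i) {            /* plays the role of the computed GOTO */
        case 0: task1(); break;
        case 1: task2(); break;
        case 2: task3(); break;
        }
    }
    /* Leaving the loop corresponds to passing the END_TASKS barrier. */
    for (i = 0; i < NUM_TASKS; i++)
        completed += done[i];
    return completed;
}
```

Viewed this way, a task list is a parallel loop whose trip count equals the number of tasks, which is why tasks and threads pair up as described above.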

Examples

The following Fortran example illustrates how to use these directives to specify simple task-parallelism:

C$DIR BEGIN_TASKS
      DO I = 1, N - 1
        A(I) = A(I+1) + B(I)
      ENDDO
C$DIR NEXT_TASK
      CALL TSUB(X,Y)
C$DIR NEXT_TASK
      C(1:1000:2) = D(1:500)
C$DIR END_TASKS

In this example, one thread executes the DO I loop, another thread executes the CALL TSUB(X,Y), and a third thread assigns the elements of the array D to every other element of C. These threads execute in parallel, but their starting and ending orders are indeterminate.

Unless the nodes attribute is supplied with the BEGIN_TASKS directive, the tasks are thread-parallelized. This means that there is no room for nested parallelization within the individual parallel tasks of this example, so the forward loop-carried dependence (LCD) on the DO I loop is inconsequential; there is no way for the loop to run but serially. The Fortran 90 array assignment in the last task will not parallelize either, even though it is technically parallelizable.

An analogous C example follows:

#pragma _CNX begin_tasks, task_private(i)
for(i=0;i<n-1;i++)
    a[i] = a[i+1] + b[i];
#pragma _CNX next_task
tsub(x,y);
#pragma _CNX next_task
for(i=0;i<500;i++)
    c[i*2] = d[i];
#pragma _CNX end_tasks

The loop induction variable i must be manually privatized here because it is used to control loops in two different tasks. If i were not private, both tasks would modify it, causing wrong answers. This is not necessary in the Fortran example because the second loop is implemented as a Fortran 90 array assignment, for which the compiler generates an independent induction variable. The task_private directive and pragma is described in detail in the section task_private.

Nested task parallelism is also possible. In order to nest any parallelism on an X2000 server, thread-parallelism must be nested within node-parallelism; when nesting tasking directives or pragmas, begin_tasks(nodes) must enclose begin_tasks(threads). Also, if a node-parallel task contains a parallel loop, the loop cannot go node-parallel. Thread-parallelism nested within node-parallelism can only run on the threads of the hypernode it is contained within.

The following Fortran example is more involved and exploits two-dimensional parallelism:

C$DIR BEGIN_TASKS(NODES)
C$DIR LOOP_PARALLEL(THREADS)
      DO I = 1,N
        IF(B(I) .NE. 0) THEN
          A(I) = B(I)*C(I)
        ELSE
          A(I) = C(I)*D(I)
        ENDIF
      ENDDO
C$DIR NEXT_TASK
C$DIR BEGIN_TASKS(THREADS)
      CALL T1SUB()
C$DIR NEXT_TASK
      CALL T2SUB()
C$DIR NEXT_TASK
      CALL T3SUB()
C$DIR END_TASKS !(THREADS)
C$DIR NEXT_TASK
      X(1:1000) = Y(1:1000)
C$DIR END_TASKS !(NODES)

Here, the first node-parallel task contains a LOOP_PARALLEL(THREADS) loop that goes parallel on the threads of the hypernode on which this task is running. The second node-parallel task contains a task list of three subroutine calls, each of which runs on a separate thread within the hypernode. The third node-parallel task contains a Fortran 90 array section assignment which is a candidate for parallelization.

An analogous C example follows:

#pragma _CNX begin_tasks(nodes)
#pragma _CNX loop_parallel(threads, ivar=i)
for(i=0;i<n;i++)
    if(b[i] != 0)
        a[i] = b[i]*c[i];
    else
        a[i] = c[i]*d[i];
#pragma _CNX next_task
#pragma _CNX begin_tasks(threads)
t1sub();
#pragma _CNX next_task
t2sub();
#pragma _CNX next_task
t3sub();
#pragma _CNX end_tasks /* (threads) */
#pragma _CNX next_task
for(j=0;j<1000;j++)
    x[j] = y[j];
#pragma _CNX end_tasks /* (nodes) */

Task parallelism can become even more involved, as described in Chapter 6, “Advanced shared-memory programming.”

Region parallelization

A parallel region is a single block of code that is written to run replicated on several (or many) threads. The idea is that any scalar code within the parallel region is run by each thread in preparation for work-sharing parallel constructs such as prefer_parallel(dist), loop_parallel(dist), or begin_tasks(dist). The scalar code typically assigns data into parallel_private variables so that subsequent references to the data have a high cache hit rate. Within a parallel region, code execution can be restricted to subsets of threads by using conditional blocks that test the thread ID. For an example of how to use the dist attribute, see the section “Using the attributes”.
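
The replicated-execution model can be pictured with the serial C sketch below, in which a hypothetical region body is written as a function of the thread ID and then invoked once per thread. Every name here is illustrative. On a real system each copy would run on its own thread and the thread ID would come from the runtime rather than from a loop counter.

```c
#include <assert.h>

enum { MAX_THREADS = 8 };

/* Body of a hypothetical parallel region, written as a function of the
 * thread ID. Every thread runs the scalar setup; only thread 0 runs the
 * block guarded by the thread-ID test. */
void region_body(int tid, double weight[], int *guarded_runs)
{
    assert(tid >= 0 && tid < MAX_THREADS);
    weight[tid] = 1.0 + tid;    /* replicated scalar code, per-thread data */
    if (tid == 0)
        *guarded_runs += 1;     /* restricted to a single thread */
}

/* Serial stand-in for the runtime: invoke the region once per thread. */
int run_region(int nthreads, double weight[])
{
    int guarded_runs = 0;
    for (int tid = 0; tid < nthreads; tid++)
        region_body(tid, weight, &guarded_runs);
    return guarded_runs;
}
```

After run_region(8, weight), each of the eight entries of weight has been filled by its own copy of the region, while the guarded block has executed exactly once.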

Region parallelism differs from task parallelism in that parallel tasks are separate, contiguous blocks of code; when parallelized using the tasking directives and pragmas, each block generally runs on a separate thread, whereas a single parallel region runs on several threads. Specifying parallel tasks is also typically less time consuming because each thread's work is implicitly defined by the task boundaries; in region parallelism, you must manually modify the region to identify thread-specific code. However, region parallelism can reduce parallelization overhead, as discussed in the section explaining the dist attribute.

The beginning of a parallel region is denoted by the parallel directive or pragma; the end is denoted by the end_parallel directive or pragma. end_parallel also prevents execution from continuing until all copies of the parallel region have completed.

Within a parallel region, the compiler does not check for data dependences, perform variable privatization, or perform parallelization analysis; you must manually synchronize any dependences between copies of the region and manually privatize data as necessary. In the absence of a nodes or threads attribute, parallel defaults to thread parallelization.

The parallel/end_parallel Fortran directives have the following form:

C$DIR PARALLEL[(attribute-list)]

C$DIR END_PARALLEL

The C pragmas have the form:

#pragma _CNX parallel(attribute-list)

#pragma _CNX end_parallel

The optional attribute-list can contain one of the following attributes (m is an integer constant):

  • threads

  • nodes

  • max_threads = m

  • threads, max_threads = m

  • nodes, max_threads = m

The threads attribute causes the region to run thread-parallel and is the default. As with parallel loops, node-parallelism cannot be nested within thread-parallelism in regions.

The nodes attribute causes the region to run node-parallel, on one thread per available hypernode.

The max_threads = m attribute causes the region to run on no more than m threads, where m is an integer constant. As the list above shows, max_threads can be specified alone or combined with either the threads or the nodes attribute.

NOTE: Use the parallel region directives or pragmas only if you have ensured that no dependences exist or you have inserted your own synchronization code where it is needed in the region. The compiler performs no dependence checking or synchronization on the code delimited by the parallel region directives and pragmas. Synchronization is discussed in Chapter 6, “Advanced shared-memory programming.”

Consider the following Fortran example:

      REAL A(1000,8), B(1000,8), C(1000,8), RDONLY(1000), SUM(8)
      INTEGER MYTID
      .
      .
      .
C FIRST INITIALIZATION OF RDONLY IN SERIAL CODE:
      CALL INIT1(RDONLY)
      IF(NUM_THREADS() .LT. 8) STOP "NOT ENOUGH THREADS; EXITING"
C$DIR PARALLEL(MAX_THREADS = 8), PARALLEL_PRIVATE(I, J, K, MYTID)
      MYTID = MY_THREAD() + 1 !ADD 1 FOR PROPER SUBSCRIPTING
      DO I = 1, 1000
        A(I, MYTID) = B(I, MYTID) * RDONLY(I)
      ENDDO
      IF(MYTID .EQ. 1) THEN ! ONLY THREAD 0 EXECUTES SECOND
        CALL INIT2(RDONLY) ! INITIALIZATION
      ENDIF
      DO J = 1, 1000
        B(J, MYTID) = B(J, MYTID) * RDONLY(J)
        C(J, MYTID) = A(J, MYTID) * B(J, MYTID)
      ENDDO
      DO K = 1, 1000
        SUM(MYTID) = SUM(MYTID) + A(K,MYTID) + B(K,MYTID) + C(K,MYTID)
      ENDDO
C$DIR END_PARALLEL

In this example, every array that is written to in the parallel code has a second dimension whose extent (8) matches the anticipated number of parallel threads. Each thread therefore works on disjoint data: no two threads can update the same element of these arrays, so no explicit synchronization is needed for those writes. The RDONLY array is one-dimensional; the parallel loops only read it, and its single reinitialization inside the region is performed by thread 0 alone. Before the parallel region, RDONLY is initialized in serial code.

The PARALLEL_PRIVATE directive is used to privatize the induction variables used in the parallel region. This must be done so that the various threads processing the region do not attempt to write to the same shared induction variables. PARALLEL_PRIVATE is covered in more detail in the section parallel_private.

Just before the parallel region, the NUM_THREADS() intrinsic, which is described in detail in Chapter 6, “Advanced shared-memory programming,” is called to ensure that the expected number of threads is available. At the start of the region, each thread calls the MY_THREAD() intrinsic, which is also described in Chapter 6, to determine its thread ID; all subsequent code in the region is executed based on this ID. In the I loop, each thread computes one column of A using RDONLY and the corresponding column of B.

RDONLY is then reinitialized in a subroutine call that only thread 0 executes, before it is used again in the J loop. There, each thread first updates its column of B and then computes the corresponding column of C.

Finally, the K loop adds each thread's column of A, B, and C into that thread's element of the SUM array. No synchronization is necessary here because each thread runs the entire loop serially and assigns into a distinct element of SUM.

An analogous C example follows:

float a[8][1000], b[8][1000], c[8][1000], rdonly[1000], sum[8];
int i, j, k, mytid;
.
.
.
/* first initialization of rdonly in serial code: */
init1(rdonly);
if(num_threads() < 8) {
    fprintf(stderr, "not enough threads; exiting\n");
    exit(2);
}
#pragma _CNX parallel(max_threads = 8), parallel_private(i,j,k,mytid)
mytid = my_thread();
for(i=0; i<1000; i++)
    a[mytid][i] = b[mytid][i] * rdonly[i];
if(mytid == 0) init2(rdonly);
for(j=0; j<1000; j++) {
    b[mytid][j] = b[mytid][j] * rdonly[j];
    c[mytid][j] = a[mytid][j] * b[mytid][j];
}
for(k=0; k<1000; k++)
    sum[mytid] = sum[mytid] + a[mytid][k] + b[mytid][k] + c[mytid][k];
#pragma _CNX end_parallel

Critical sections

The critical_section and end_critical_section directives and pragmas allow you to specify sections of code in parallel loops or tasks that must be executed by only one thread at a time. These directives cannot be used for ordered synchronization within a loop_parallel(ordered) loop, but are suitable for simple synchronization in any other loop_parallel loops. (Use the ordered_section and end_ordered_section directives or pragmas for ordered synchronization within a loop_parallel(ordered) loop.)

A critical_section directive or pragma and its associated end_critical_section must appear in the same procedure and under the same control flow. They do not have to appear in the same procedure as the parallel construct in which they are used. In other words, the pair can appear in a procedure called from a parallel loop.
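As a sketch of the last point, the fragment below places the critical section in a procedure that is called from a parallel loop. The pragma spellings are the ones used in this chapter; the names shared_sum, add_to_sum, and sum_of_squares are hypothetical. A compiler that does not recognize the _CNX pragmas ignores them and runs the loop serially, which makes the logic easy to check elsewhere.

```c
/* Shared accumulator, updated by one thread at a time. */
static double shared_sum = 0.0;

/* The critical_section / end_critical_section pair lives entirely in
   this procedure, under the same control flow. */
void add_to_sum(double value)
{
#pragma _CNX critical_section
    shared_sum = shared_sum + value;
#pragma _CNX end_critical_section
}

/* The parallel construct is in a different procedure from the
   critical section it relies on. */
double sum_of_squares(const double *x, int n)
{
    int i;

    shared_sum = 0.0;
#pragma _CNX loop_parallel(ivar=i)
    for (i = 0; i < n; i++)
        add_to_sum(x[i] * x[i]);
    return shared_sum;
}
```

The multiplication happens in the caller, outside the critical section, so only the update of the shared accumulator is serialized.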

As discussed in this chapter, these directives have the following form in Fortran:

C$DIR CRITICAL_SECTION
C$DIR END_CRITICAL_SECTION

The C pragmas have the form:

#pragma _CNX critical_section
#pragma _CNX end_critical_section

The critical_section directive and pragma can take an optional gate attribute that allows the declaration of multiple critical sections, as described in Chapter 6, “Advanced shared-memory programming”; however, only simple critical sections are discussed here.

Consider the following Fortran example:

C$DIR LOOP_PARALLEL, LOOP_PRIVATE(FUNCTEMP)
      DO I = 1, N ! LOOP IS PARALLELIZABLE
        .
        .
        .
        FUNCTEMP = FUNC(X(I))
C$DIR CRITICAL_SECTION
        SUM = SUM + FUNCTEMP
C$DIR END_CRITICAL_SECTION
        .
        .
        .
      ENDDO

Because FUNC has no side effects and can be called in parallel, the I loop can be parallelized as long as the SUM variable is only updated by one thread at a time. The critical section created around SUM ensures this behavior.

The LOOP_PARALLEL directive and the critical section are required to parallelize this loop because the call to FUNC would normally inhibit parallelization. If this call were not present, and if the loop did not contain other parallelization inhibitors, the compiler would automatically parallelize the reduction of SUM as described in the section “Reductions”. However, the presence of the call necessitates the LOOP_PARALLEL directive, which prevents the compiler from automatically handling the reduction; this, in turn, requires using either a critical section or the reduction directive. We will use a critical section for this example. Placing the call to FUNC outside of the critical section allows FUNC to be called in parallel, decreasing the amount of serial work within the critical section.

An analogous C example follows:

#pragma _CNX loop_parallel(ivar=i)
#pragma _CNX loop_private(functemp)
for(i=0;i<n;i++) { /* loop is parallelizable */
    .
    .
    .
    functemp = func(x[i]);
#pragma _CNX critical_section
    sum = sum + functemp;
#pragma _CNX end_critical_section
    .
    .
    .
}

In order to justify the cost of the compiler-generated synchronization code associated with the use of critical sections, loops that contain them must also contain a large amount of parallelizable (non-critical section) code. If you are unsure of the profitability of using a critical section to help parallelize a certain loop, time the loop with and without the critical section to see if parallelization justifies the overhead of the critical section.
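One way to carry out such a comparison is sketched below. It is an illustration under stated assumptions, not a prescribed method: wall_seconds, time_variant, sum_serial, and sum_critical are hypothetical names, func is a stand-in for the real per-iteration work, and the timer uses the C11 timespec_get routine. Wall-clock time is measured because CPU time would add together the work of all threads.

```c
#include <time.h>

/* Wall-clock time in seconds (C11 timespec_get). */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for the real per-iteration work. */
static double func(double v) { return v * v; }

/* Variant 1: plain serial loop. */
double sum_serial(const double *x, int n)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < n; i++)
        sum = sum + func(x[i]);
    return sum;
}

/* Variant 2: parallel loop with a critical section around the update. */
double sum_critical(const double *x, int n)
{
    double sum = 0.0, functemp;
    int i;
#pragma _CNX loop_parallel(ivar=i)
#pragma _CNX loop_private(functemp)
    for (i = 0; i < n; i++) {
        functemp = func(x[i]);
#pragma _CNX critical_section
        sum = sum + functemp;
#pragma _CNX end_critical_section
    }
    return sum;
}

/* Runs one variant, stores its result, and returns the elapsed seconds. */
double time_variant(double (*variant)(const double *, int),
                    const double *x, int n, double *result)
{
    double start = wall_seconds();
    *result = variant(x, n);
    return wall_seconds() - start;
}
```

Calling time_variant once with sum_serial and once with sum_critical on the same data gives two elapsed times to compare. Repeating each measurement several times on realistic input sizes makes the comparison more trustworthy.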

Again, for this particular example, the reduction directive or pragma could have been used in place of the critical_section, end_critical_section combination. For more information, see the section “Reductions”.
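A further manual alternative, borrowing the one-element-per-thread layout of the SUM array in the region example earlier in this section, is to let each thread accumulate into its own slot and combine the slots afterward. The sketch below is not the reduction directive and not Exemplar code: accumulate_share, sum_by_partials, MAX_THREADS, and func are hypothetical names, and the threads are simulated by a serial loop so that the arithmetic can be checked on any C compiler.

```c
#define MAX_THREADS 8

/* Hypothetical stand-in for the FUNC/func call in the example. */
static double func(double v) { return v * v; }

/* Work of one thread: add a cyclic share of the iterations into this
   thread's own slot, so no critical section is needed. */
void accumulate_share(int tid, int nthreads, const double *x, int n,
                      double *partial)
{
    int i;
    for (i = tid; i < n; i += nthreads)
        partial[tid] = partial[tid] + func(x[i]);
}

/* Serial stand-in for the parallel phase plus the final combination. */
double sum_by_partials(const double *x, int n, int nthreads)
{
    double partial[MAX_THREADS] = {0.0};
    double sum = 0.0;
    int tid;

    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    if (nthreads < 1) nthreads = 1;

    for (tid = 0; tid < nthreads; tid++)   /* would run in parallel */
        accumulate_share(tid, nthreads, x, n, partial);
    for (tid = 0; tid < nthreads; tid++)   /* serial combination */
        sum = sum + partial[tid];
    return sum;
}
```

This approach changes the order in which the terms are added, so a floating-point result can differ in its last bits from that of the serial loop.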

+Onoautopar compiler option

You can disable automatic loop thread-parallelism by specifying the +Onoautopar option on the compiler command line. +Onoautopar is only meaningful when specified with the +Oparallel option at +O3 or +O4.

This option causes the compiler to parallelize only those loops that are immediately preceded by a loop_parallel or prefer_parallel directive or pragma; all other loops, even if they could normally be automatically parallelized, are not analyzed for parallelization. Because the compiler does not automatically find parallel tasks or regions, user-specified task and region parallelization is not affected by this option.
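To make the effect concrete, the sketch below contains two dependence-free loops, provided the three arrays do not overlap. It assumes the #pragma _CNX prefer_parallel spelling with no attributes; the function name scale_and_copy is hypothetical. Under +Oparallel with +Onoautopar, only the first loop remains a candidate for parallelization.

```c
void scale_and_copy(float *a, const float *b, float *c, int n)
{
    int i, j;

    /* Immediately preceded by a pragma: still considered for
       parallelization under +Onoautopar. */
#pragma _CNX prefer_parallel
    for (i = 0; i < n; i++)
        a[i] = 2.0f * b[i];

    /* No pragma: not analyzed for parallelization under +Onoautopar,
       even though it is dependence-free. */
    for (j = 0; j < n; j++)
        c[j] = b[j];
}
```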

+O[no]nodepar compiler option

By default, loop, task, and region node-parallelism is disabled. In other words, +Ononodepar is the default. The +O[no]nodepar option is only meaningful when specified with the +Oparallel option at +O3 or +O4.

The +Ononodepar option causes the compiler to generate code for a single-node machine. When this option is used, serial code is generated for node-parallel constructs; thus, node-parallelism is not implemented. Thread-parallelism—both automatic and directive-specified—is still implemented.

Use the +Onodepar option to enable directive-specified node-parallelism when compiling with +Oparallel at +O3 or +O4 on a multinode, scalable SMP.

Reentrant compilation

Exemplar compilers compile for reentrancy by default: the compiler itself does not introduce static or global references beyond those that exist in the original code. Reentrant compilation causes procedures to store uninitialized local variables on the stack, so no local can carry a value from one invocation of the procedure to the next (unless the variable appears in a Fortran COMMON block or a DATA or SAVE statement, or is declared static in C/C++). This allows loops containing procedure calls to be manually parallelized, assuming no other inhibitors of parallelization exist.

When procedures are called in parallel, each thread receives a private stack on which to allocate local variables. This allows each parallel copy of the procedure to manipulate its local variables without interfering with any other copy's locals of the same name. When the procedure returns and the parallel threads join, all values on the stack are lost.
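The difference between stack-allocated and static locals can be sketched as follows. The function names mean3, mean3_static, and smooth are hypothetical, and the loop_parallel pragma is ignored by compilers that do not recognize it, in which case the loop simply runs serially.

```c
/* Reentrant: the scratch array is an automatic local, so each parallel
   caller gets its own copy on its own stack. */
double mean3(const double *x, int i)
{
    double w[3];
    w[0] = x[i - 1];
    w[1] = x[i];
    w[2] = x[i + 1];
    return (w[0] + w[1] + w[2]) / 3.0;
}

/* Not safe to call in parallel: the static scratch array is a single
   object shared by every caller, so concurrent calls can overwrite
   each other's values. */
double mean3_static(const double *x, int i)
{
    static double w[3];
    w[0] = x[i - 1];
    w[1] = x[i];
    w[2] = x[i + 1];
    return (w[0] + w[1] + w[2]) / 3.0;
}

/* A manually parallelized loop that calls the reentrant version. */
void smooth(const double *x, double *y, int n)
{
    int i;
#pragma _CNX loop_parallel(ivar=i)
    for (i = 1; i < n - 1; i++)
        y[i] = mean3(x, i);
}
```

Run serially, both versions return the same values; the static version only misbehaves when several threads call it at once, which is why such bugs are easy to miss in serial testing.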

Default stack size

Thread 0's stack can grow to the size specified in the maxssiz configurable kernel parameter. Refer to the Managing Systems and Workgroups manual for more information on configurable kernel parameters.

Any threads your program spawns (as the result of loop_parallel or tasking directives or pragmas, for example) receive a default stack size of 80 Mbytes. This means that if:

  • A parallel construct declares more than 80 Mbytes of loop_private, task_private, or parallel_private data, or

  • A subprogram with more than 80 Mbytes of local data is called in parallel, or

  • The cumulative size of all local variables in a chain of subprograms called in parallel exceeds 80 Mbytes,

you must modify the stack size of the spawned threads via the CPS_STACK_SIZE environment variable. Under csh, this can be done with the following command:

setenv CPS_STACK_SIZE size_in_kbytes

where size_in_kbytes is the desired stack size in kbytes. This value is read at program startup; it cannot be changed during execution.

For example, the following command sets the thread stack size to 100 Mbytes:

setenv CPS_STACK_SIZE 102400
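The arithmetic behind such a setting can be sketched in a few lines of C. This is a rough estimate under stated assumptions: the helper names are hypothetical, 1 kbyte is taken to be 1024 bytes and 1 Mbyte to be 1024 kbytes (consistent with the 102400 value above), and only the private data itself is counted, not call frames or other stack use.

```c
#include <stddef.h>

/* Default stack size of spawned threads, from the text: 80 Mbytes. */
#define DEFAULT_STACK_KBYTES (80UL * 1024UL)

/* Converts Mbytes to the kbyte units that CPS_STACK_SIZE expects. */
unsigned long mbytes_to_kbytes(unsigned long mbytes)
{
    return mbytes * 1024UL;
}

/* Kbytes needed to hold the given number of bytes, rounded up. */
unsigned long bytes_to_kbytes(size_t bytes)
{
    return (unsigned long)((bytes + 1023) / 1024);
}

/* Nonzero if the private data alone exceeds the default thread stack. */
int exceeds_default_stack(size_t private_bytes)
{
    return bytes_to_kbytes(private_bytes) > DEFAULT_STACK_KBYTES;
}
```

For instance, assuming 8-byte doubles, a parallel_private array of 12,000,000 doubles occupies 93,750 kbytes, which exceeds the 81,920-kbyte default, so CPS_STACK_SIZE would need to be raised, with some headroom for the rest of the stack.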

© Hewlett-Packard Development Company, L.P.