Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide > Chapter 4 Basic shared-memory programming
Simple manual loop, task, and region parallelization
The Exemplar compilers automatically exploit strip-based loop parallelism in loops that are clearly dependence-free, as described in Chapter 3, “Compiler optimizations.” The prefer_parallel, loop_parallel, and parallel directives and pragmas allow you to increase parallelization opportunities and to manually control many aspects of parallelization. The compiler cannot automatically locate task parallelism, but the tasking directives and pragmas mentioned in Chapter 3 (and discussed here) allow you to specify consecutive blocks of code that can be run in parallel. Similarly, the parallel and end_parallel directives and pragmas allow you to specify a code region that can be run in its entirety on several processors.
The subsections that follow discuss specifying simple, unordered loop, task, and region parallelism using the prefer_parallel, loop_parallel, begin_tasks/next_task/end_tasks, and parallel/end_parallel directives and pragmas. These directives and pragmas can be nested in any order as long as node-parallelism is outside thread-parallelism. Critical sections that do not rely on ordered execution are also covered here. Any necessary variable privatization is provided by the loop_private, task_private, and parallel_private directives and pragmas, which are described in detail in the section “Loop-specific, task-specific, and ...”.
For a detailed discussion of ordered parallelism, parallel synchronization, and the effective use of memory classes, refer to Chapter 5, “Memory classes,” and Chapter 6, “Advanced shared-memory programming.”
This section discusses simple uses of the prefer_parallel and loop_parallel directives and pragmas, which, when specified, apply to the immediately following loop. The data privatization necessary when using loop_parallel is illustrated in this chapter's examples using the loop_private directive, which is discussed in the section “loop_private”. Manual data privatization using memory classes is discussed in Chapter 5, “Memory classes,” and Chapter 6, “Advanced shared-memory programming.”
prefer_parallel and loop_parallel generally take the same attributes. The threads attribute is the default attribute for both prefer_parallel and loop_parallel. In Fortran, these directives have the following form:
C$DIR PREFER_PARALLEL[(attribute-list)]
C$DIR LOOP_PARALLEL[(attribute-list)]
In C, they have the form:
#pragma _CNX prefer_parallel[(attribute-list)]
#pragma _CNX loop_parallel(ivar = indvar[,attribute-list])
where attribute-list is a list of the attributes explained in the following sections, and ivar = indvar names the primary loop induction variable.
The allowed combinations of attributes are those combinations listed in the preceding section. In such combinations the attributes can be listed in any order.
The loop_parallel C pragma requires the ivar = indvar attribute, which specifies the primary loop induction variable. If this attribute is not present, the compiler issues a warning and ignores the pragma. ivar should specify only the primary induction variable; any other loop induction variables should be a function of this variable and should be declared loop_private.
In Fortran, ivar is optional for DO loops; if it is not provided, the compiler picks the primary induction variable for the loop. ivar is required for DO WHILE and hand-rolled loops in Fortran.
prefer_parallel does not require ivar, and the compiler issues an error if it encounters ivar in a prefer_parallel directive or pragma.
The attributes associated with the prefer_parallel and loop_parallel directives and pragmas are explained in the following sections.
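The rule about secondary induction variables can be illustrated with a minimal C sketch. The function name `spread_even`, its parameters, and the placement of `loop_private` on its own line directly before the loop are assumptions made for illustration; they are not taken from the guide. A compiler that does not recognize `_CNX` pragmas ignores them and runs the loop serially, which gives the same result.

```c
/* Hypothetical helper: copy in[0..n-1] into the even slots of out.
   out must hold at least 2*n elements. */
void spread_even(const double *in, double *out, int n)
{
    int i;  /* primary induction variable, named in ivar        */
    int k;  /* secondary index, recomputed from i each iteration */

#pragma _CNX loop_parallel(ivar = i)
#pragma _CNX loop_private(k)
    for (i = 0; i < n; i++) {
        k = 2 * i;       /* a function of i, not carried between iterations */
        out[k] = in[i];
    }
}
```

Because `k` is computed from `i` inside every iteration instead of being incremented across iterations, each iteration is independent of the others, which is what makes privatizing `k` sufficient.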
The prefer_parallel directive and pragma cause the compiler to parallelize the immediately following loop if it is free of dependences and other parallelization inhibitors. The compiler automatically privatizes any loop variables that must be privatized. prefer_parallel requires less manual intervention and is less forceful than the loop_parallel directive and pragma.
On multihypernode systems, when prefer_parallel is specified without a nodes or threads attribute, the compiler determines whether opportunities for parallelism exist within the loop and, if possible, parallelizes the loop across threads. If the threads attribute is specified, the compiler attempts to find and exploit thread-parallelism within the loop. If the nodes attribute is specified, the compiler tries to locate and exploit node-parallelism within the loop.
prefer_parallel can also be used to indicate the preferred loop in a nest to parallelize, as shown in the following Fortran example:
In this example, PREFER_PARALLEL causes the compiler to choose the innermost loop for parallelization, provided it is free of dependences. PREFER_PARALLEL does not inhibit loop interchange. An analogous C example follows:
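The original listing is not preserved in this copy. The following is a minimal sketch of the pattern, assuming a hypothetical two-level loop nest; the function name `add_grids` and the array shapes are illustrative only. The pragma is placed directly before the inner loop to mark it as the preferred one. On a compiler without `_CNX` support the pragma is ignored and the nest runs serially.

```c
#define ROWS 4
#define COLS 8

/* Hypothetical kernel: a[i][j] = b[i][j] + c[i][j]. */
void add_grids(double a[ROWS][COLS], double b[ROWS][COLS],
               double c[ROWS][COLS])
{
    int i, j;

    for (i = 0; i < ROWS; i++) {
        /* Ask the compiler to prefer the inner loop, if it is dependence-free. */
#pragma _CNX prefer_parallel
        for (j = 0; j < COLS; j++) {
            a[i][j] = b[i][j] + c[i][j];
        }
    }
}
```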
Do not use the ordered attribute in a prefer_parallel directive: it is useful only if the loop contains synchronized dependences, and prefer_parallel will not parallelize a loop containing any loop-carried dependences. The ordered attribute is useful in the loop_parallel directive, as described in Chapter 6, “Advanced shared-memory programming.”
The loop_parallel directive forces parallelization of the immediately following loop. The compiler does not check for data dependences, perform variable privatization, or perform parallelization analysis. You must synchronize any dependences manually and manually privatize loop data as necessary. In the absence of a nodes or threads attribute, loop_parallel defaults to thread parallelization. The section “Critical sections” contains an example of using loop_parallel to parallelize a loop with a dependence; the dependence is manually handled in a critical section.
The threads, nodes, chunk_size = n, and max_threads = m attributes and combinations of these attributes have exactly the same effect as explained for prefer_parallel. loop_parallel(ordered) is useful for manually parallelizing loops containing manually-ordered dependences, as described in Chapter 6, “Advanced shared-memory programming.”
loop_parallel can be useful for manually parallelizing loops containing procedure calls. Consider the following Fortran example:
The call to FUNC in this loop would normally prevent it from parallelizing. However, if you are sure that FUNC has no side effects and is compiled for reentrancy (the default on Exemplar compilers), this loop can be safely parallelized as shown. (A function does not have side effects if it does not modify its arguments, it does not modify the same memory location from one call to the next, it performs no I/O, and it does not call any procedures that have side effects.) If FUNC does have side effects or is not reentrant, this loop may yield wrong answers. An analogous C example follows:
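The original listing is not preserved in this copy. The following is a minimal sketch, assuming a hypothetical side-effect-free `func` and a hypothetical wrapper `apply_func`; only the pragma form comes from the text above. Without `_CNX` support the pragma is ignored and the loop runs serially.

```c
/* Hypothetical side-effect-free function: takes its argument by value,
   touches no global or static state, and performs no I/O. */
static double func(double v)
{
    return v * v + 1.0;
}

/* Each iteration writes a distinct x[i], so the iterations are independent
   provided func really has no side effects. */
void apply_func(const double *y, double *x, int n)
{
    int i;

#pragma _CNX loop_parallel(ivar = i)
    for (i = 0; i < n; i++) {
        x[i] = func(y[i]);
    }
}
```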
The compiler will not parallelize any loop that does not have a number of iterations that can be determined prior to loop invocation at execution time, even when loop_parallel is specified. Consider the following Fortran example:
There is no way the compiler can determine the loop's iteration count prior to loop invocation here, so the loop cannot be parallelized.
The prefer_parallel and loop_parallel directives (and pragmas) are both used in parallelizing loops. Table 4-3 “Comparison of prefer_parallel and loop_parallel” gives an overview of the differences between the two directives (pragmas). See the sections “prefer_parallel” and “loop_parallel” for more information.
Table 4-3 Comparison of prefer_parallel and loop_parallel
The compiler does not automatically parallelize code outside a loop, but you can use tasking directives and pragmas to instruct the compiler to parallelize such code. The begin_tasks directive and pragma tell the compiler to begin parallelizing a series of tasks. The next_task directive and pragma mark the end of a task and the start of the next task. The end_tasks directive and pragma mark the end of a series of tasks to be parallelized and prevent execution from continuing until all tasks have completed.
The sections of code delimited by these directives are referred to as a task list. Within a task list, the compiler does not check for data dependences, perform variable privatization, or perform parallelization analysis. You must manually synchronize any dependences between tasks and manually privatize data as necessary. In the absence of a nodes or threads attribute, begin_tasks defaults to thread parallelization.
The Fortran tasking directives have the following forms:
C$DIR BEGIN_TASKS[(attribute-list)]
C$DIR NEXT_TASK
C$DIR END_TASKS
The C tasking pragmas have the forms:
#pragma _CNX begin_tasks[(attribute-list)]
#pragma _CNX next_task
#pragma _CNX end_tasks
The optional attribute-list can contain one of the following attribute combinations (m is an integer constant):
The threads attribute causes the tasks to run thread-parallel, and is the default. As with parallel loops, node-parallelism cannot be nested within thread-parallelism in task lists.
The nodes attribute causes the tasks to run node-parallel, on one thread per available hypernode.
The dist attribute tells the compiler to distribute the tasks across the currently active threads instead of spawning new threads. Use the dist attribute (along with other valid attributes) on begin_tasks inside a parallel/end_parallel region. begin_tasks and parallel/end_parallel must appear inside the same function. The attribute list to the parallel directive (or pragma) determines the level of parallelism. See “Region parallelization” for information on the attributes available to the parallel directive and pragma.
The ordered attribute causes the tasks to be initiated in their lexical order; that is, the first task in the sequence begins to run on its respective thread before the second, and so on. In the absence of the ordered attribute, the starting order is indeterminate. While this attribute ensures an ordered starting sequence, it does not provide any synchronization between tasks, and does not guarantee any particular ending order. You can manually synchronize the tasks as described in Chapter 6, “Advanced shared-memory programming,” if necessary.
The attributes specifying max_threads = m cause the tasks to run on no more than m threads, where m is an integer constant whose value is known at compile time. As shown, these attributes can include any combination of thread- or node-parallel, ordered or unordered execution.
The ordered, nodes and ordered, threads attributes cause the tasks to run ordered node-parallel and ordered thread-parallel, respectively.
The following Fortran example shows how to insert tasking directives into a section of code containing three tasks that can be run in parallel:
C$DIR BEGIN_TASKS
      parallel task 1
C$DIR NEXT_TASK
      parallel task 2
C$DIR NEXT_TASK
      parallel task 3
C$DIR END_TASKS
The example above specifies thread-parallelism by default. The compiler transforms the code into a parallel loop and creates machine code equivalent to the following Fortran code:
C$DIR LOOP_PARALLEL(THREADS)
      DO 40 I = 1,3
      GOTO (10,20,30)I
   10 parallel task 1
      GOTO 40
   20 parallel task 2
      GOTO 40
   30 parallel task 3
      GOTO 40
   40 CONTINUE
If there are more tasks than available threads, some threads will execute multiple tasks; if there are more threads than tasks, some threads will not execute tasks.
The END_TASKS directive and pragma act as a barrier; all parallel tasks must complete before the code following END_TASKS can execute.
The following Fortran example illustrates how to use these directives to specify simple task-parallelism:
In this example, one thread executes the DO I loop, another thread executes the CALL TSUB(X,Y), and a third thread assigns the elements of the array D to every other element of C. These threads execute in parallel, but their starting and ending orders are indeterminate. Unless the nodes attribute is supplied with the BEGIN_TASKS directive, the tasks are thread-parallelized. This means that there is no room for nested parallelization within the individual parallel tasks of this example, so the forward LCD on the DO I loop is inconsequential; there is no way for the loop to run but serially. The Fortran 90 array assignment in the last task will not parallelize either, even though it is technically parallelizable. An analogous C example follows:
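The original listing is not preserved in this copy. The following is a minimal sketch of three such tasks in C. The function names `run_tasks` and `tsub`, the array sizes, and the same-line `begin_tasks, task_private(i)` placement are assumptions; the authoritative syntax is in the section “task_private”. Without `_CNX` support the pragmas are ignored and the three tasks simply run one after another.

```c
#define N 8

/* Hypothetical stand-in for the independent procedure run by the second task. */
static void tsub(double *x, double *y)
{
    *y = 2.0 * *x;
}

void run_tasks(double a[N], const double b[N],
               double c[2 * N], const double d[N],
               double *x, double *y)
{
    int i;  /* controls loops in two different tasks, so it is privatized */

#pragma _CNX begin_tasks, task_private(i)
    /* Task 1: loop with a forward loop-carried dependence; it runs
       serially on whichever thread executes this task. */
    for (i = 0; i < N - 1; i++)
        a[i] = a[i + 1] + b[i];
#pragma _CNX next_task
    /* Task 2: an independent procedure call. */
    tsub(x, y);
#pragma _CNX next_task
    /* Task 3: copy d into every other element of c. */
    for (i = 0; i < N; i++)
        c[2 * i] = d[i];
#pragma _CNX end_tasks
}
```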
The loop induction variable i must be manually privatized here because it is used to control loops in two different tasks. If i were not private, both tasks would modify it, causing wrong answers. This is not necessary in the Fortran example because the second loop is implemented as a Fortran 90 array assignment, for which the compiler generates an independent induction variable. The task_private directive and pragma are described in detail in the section “task_private”.
Nested task parallelism is also possible. In order to nest any parallelism on an X2000 server, thread-parallelism must be nested within node-parallelism; when nesting tasking directives or pragmas, begin_tasks(nodes) must enclose begin_tasks(threads). Also, if a node-parallel task contains a parallel loop, the loop cannot go node-parallel. Thread-parallelism nested within node-parallelism can only run on the threads of the hypernode it is contained within.
The following Fortran example is more involved and exploits two-dimensional parallelism:
Here, the first node-parallel task contains a LOOP_PARALLEL(THREADS) loop that goes parallel on the threads of the hypernode on which this task is running. The second node-parallel task contains a task list of three subroutine calls, each of which runs on a separate thread within the hypernode. The third node-parallel task contains a Fortran 90 array section assignment which is a candidate for parallelization. An analogous C example follows:
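The original listing is not preserved in this copy. The following is a minimal sketch of the same nesting in C, with a plain copy loop standing in for the Fortran 90 array section assignment. The function names (`run_nested`, `sub1`, `sub2`, `sub3`) and array sizes are hypothetical, and the exact attribute spelling should be checked against the attribute tables for each pragma. Each loop variable is used by only one node-parallel task. Without `_CNX` support the pragmas are ignored and everything runs serially.

```c
#define N 8

/* Hypothetical independent procedures; each writes only its own result. */
static void sub1(double *v) { *v = 1.0; }
static void sub2(double *v) { *v = 2.0; }
static void sub3(double *v) { *v = 3.0; }

void run_nested(double a[N], const double b[N], double r[3],
                double c[N], const double d[N])
{
    int i;  /* used only by the first node-parallel task */
    int j;  /* used only by the third node-parallel task */

#pragma _CNX begin_tasks(nodes)
    /* Node task 1: thread-parallel loop on this task's hypernode. */
#pragma _CNX loop_parallel(ivar = i, threads)
    for (i = 0; i < N; i++)
        a[i] = 2.0 * b[i];
#pragma _CNX next_task
    /* Node task 2: a nested task list of three calls, one per thread. */
#pragma _CNX begin_tasks(threads)
    sub1(&r[0]);
#pragma _CNX next_task
    sub2(&r[1]);
#pragma _CNX next_task
    sub3(&r[2]);
#pragma _CNX end_tasks
#pragma _CNX next_task
    /* Node task 3: element-wise copy, a candidate for parallelization. */
    for (j = 0; j < N; j++)
        c[j] = d[j];
#pragma _CNX end_tasks
}
```

The node-level task list is the outer construct and the thread-level constructs sit inside it, matching the rule that node-parallelism must enclose thread-parallelism.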
Task parallelism can become even more involved, as described in Chapter 6, “Advanced shared-memory programming.”
A parallel region is a single block of code that is written to run replicated on several (or many) threads. The idea is that any scalar code within the parallel region is run by each thread in preparation for work-sharing parallel constructs such as prefer_parallel(dist), loop_parallel(dist), or begin_tasks(dist). The scalar code typically assigns data into parallel_private variables so that subsequent references to the data have a high cache hit rate. Within a parallel region, code execution can be restricted to subsets of threads by using conditional blocks that test the thread ID. For an example of how to use the dist attribute, see the section “Using the attributes”.
Region parallelism differs from task parallelism in that parallel tasks are separate, contiguous blocks of code; when parallelized using the tasking directives and pragmas, each block generally runs on a separate thread, whereas a single parallel region runs on several threads. Specifying parallel tasks is also typically less time consuming because each thread's work is implicitly defined by the task boundaries; in region parallelism, you must manually modify the region to identify thread-specific code. However, region parallelism can reduce parallelization overhead, as discussed in the section explaining the dist attribute.
The beginning of a parallel region is denoted by the parallel directive or pragma; the end is denoted by the end_parallel directive or pragma. end_parallel also prevents execution from continuing until all copies of the parallel region have completed. Within a parallel region, the compiler does not check for data dependences, perform variable privatization, or perform parallelization analysis; you must manually synchronize any dependences between copies of the region and manually privatize data as necessary. In the absence of a nodes or threads attribute, parallel defaults to thread parallelization.
The parallel/end_parallel Fortran directives have the following form:
C$DIR PARALLEL[(attribute-list)]
C$DIR END_PARALLEL
The C pragmas have the form:
#pragma _CNX parallel(attribute-list)
#pragma _CNX end_parallel
The optional attribute-list can contain one of the following attributes (m is an integer constant):
The threads attribute causes the region to run thread-parallel and is the default. As with parallel loops, node-parallelism cannot be nested within thread-parallelism in regions. The nodes attribute causes the region to run node-parallel, on one thread per available hypernode. The max_threads = m attribute will cause the region to run on no more than m threads, where m is an integer constant. As shown, these attributes can include any combination of thread- or node-parallel execution.
Consider the following Fortran example:
In this example, all arrays that are written to in the parallel code have one dimension for each of the anticipated number of parallel threads; each thread can work on disjoint data, there is no chance of two threads attempting to update the same element, and, therefore, there is no need for explicit synchronization. The RDONLY array is one-dimensional, but it is never written to by parallel threads. Before the parallel region, RDONLY is initialized in serial code.
The PARALLEL_PRIVATE directive is used to privatize the induction variables used in the parallel region. This must be done so that the various threads processing the region do not attempt to write to the same shared induction variables. PARALLEL_PRIVATE is covered in more detail in the section “parallel_private”.
At the beginning of the parallel region, the NUM_THREADS() intrinsic, which is described in detail in Chapter 6, “Advanced shared-memory programming,” is called to ensure that the expected number of threads is available. Then the MY_THREAD() intrinsic, which is also described in Chapter 6, is called by each thread to determine its thread ID; all subsequent code in the region is executed based on this ID. In the I loop, each thread computes one row of A using RDONLY and the corresponding row of B.
RDONLY is reinitialized in a subroutine call that is only executed by thread 0 before it is used again in the computation of B in the J loop, where again each thread computes a row. The J loop similarly computes C. Finally, the K loop sums each dimension of A, B, and C into the SUM array. No synchronization is necessary here because each thread is running the entire loop serially and assigning into a discrete element of SUM. An analogous C example follows:
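The original listing is not preserved in this copy. The following is a much-reduced sketch of the same idea in C: each thread works only on the row and the sum element selected by its thread ID. The function name `region`, the sizes, and the placement of `parallel_private` after `parallel` are assumptions. `my_thread` is defined here as a serial stand-in so the sketch builds without the Exemplar runtime; on an Exemplar system the real intrinsic described in Chapter 6 would be used instead and the stand-in removed.

```c
#define MAXT 4   /* anticipated number of threads (assumed for this sketch) */
#define LEN  16

/* Serial stand-in for the thread-ID intrinsic; NOT the Exemplar runtime. */
static int my_thread(void) { return 0; }

void region(double a[MAXT][LEN], double b[MAXT][LEN],
            const double rdonly[LEN], double sum[MAXT])
{
    int i, k, tid;

#pragma _CNX parallel(max_threads = 4)
#pragma _CNX parallel_private(i, k, tid)
    tid = my_thread();
    if (tid < MAXT) {
        /* Each thread fills its own row of a; rdonly is only read. */
        for (i = 0; i < LEN; i++)
            a[tid][i] = rdonly[i] + b[tid][i];
        /* Each thread sums its own row into its own element of sum,
           so no synchronization is needed. */
        sum[tid] = 0.0;
        for (k = 0; k < LEN; k++)
            sum[tid] += a[tid][k];
    }
#pragma _CNX end_parallel
}
```

With the serial stand-in, only row 0 and `sum[0]` are written; under a real parallel runtime each thread would handle the row matching its own ID.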
The critical_section and end_critical_section directives and pragmas allow you to specify sections of code in parallel loops or tasks that must be executed by only one thread at a time. These directives cannot be used for ordered synchronization within a loop_parallel(ordered) loop, but are suitable for simple synchronization in any other loop_parallel loops. (Use the ordered_section and end_ordered_section directives or pragmas for ordered synchronization within a loop_parallel(ordered) loop.) A critical_section directive or pragma and its associated end_critical_section must appear in the same procedure and under the same control flow. They do not have to appear in the same procedure as the parallel construct in which they are used. In other words, the pair can appear in a procedure called from a parallel loop. As discussed in this chapter, these directives have the following form in Fortran:
The C pragmas have the form:
The critical_section directive and pragma can take an optional gate attribute that allows the declaration of multiple critical sections, as described in Chapter 6, “Advanced shared-memory programming”; however, we will only discuss simple critical sections here. Consider the following Fortran example:
Because FUNC has no side effects and can be called in parallel, the I loop can be parallelized as long as the SUM variable is only updated by one thread at a time. The critical section created around SUM ensures this behavior. The LOOP_PARALLEL directive and the critical section are required to parallelize this loop because the call to FUNC would normally inhibit parallelization. If this call were not present, and if the loop did not contain other parallelization inhibitors, the compiler would automatically parallelize the reduction of SUM as described in the section “Reductions”. However, the presence of the call necessitates the LOOP_PARALLEL directive, which prevents the compiler from automatically handling the reduction; this, in turn, requires using either a critical section or the reduction directive. We will use a critical section for this example. Placing the call to FUNC outside of the critical section allows FUNC to be called in parallel, decreasing the amount of serial work within the critical section. An analogous C example follows:
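The original listing is not preserved in this copy. The following is a minimal sketch of the pattern in C. The names `func`, `sum_func`, and `functemp` are hypothetical, and the spelling `#pragma _CNX critical_section` / `#pragma _CNX end_critical_section` is assumed from the `_CNX` pattern used by the other pragmas in this chapter, since the C form is not shown in this copy. Without `_CNX` support the pragmas are ignored and the loop runs serially.

```c
/* Hypothetical side-effect-free function that can be called in parallel. */
static double func(double v)
{
    return 2.0 * v;
}

double sum_func(const double *x, int n)
{
    double sum = 0.0;   /* shared accumulator, updated one thread at a time */
    double functemp;    /* per-iteration temporary, privatized below */
    int i;

#pragma _CNX loop_parallel(ivar = i)
#pragma _CNX loop_private(functemp)
    for (i = 0; i < n; i++) {
        /* The call stays outside the critical section so it can overlap
           across threads. */
        functemp = func(x[i]);
#pragma _CNX critical_section
        sum = sum + functemp;
#pragma _CNX end_critical_section
    }
    return sum;
}
```

Only the single addition is serialized; the work done in `func` is the part that benefits from running in parallel.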
In order to justify the cost of the compiler-generated synchronization code associated with the use of critical sections, loops that contain them must also contain a large amount of parallelizable (non-critical section) code. If you are unsure of the profitability of using a critical section to help parallelize a certain loop, time the loop with and without the critical section to see if parallelization justifies the overhead of the critical section.
Again, for this particular example, the reduction directive or pragma could have been used in place of the critical_section, end_critical_section combination. For more information, see the section “Reductions”.
You can disable automatic loop thread-parallelism by specifying the +Onoautopar option on the compiler command line. +Onoautopar is only meaningful when specified with the +Oparallel option at +O3 or +O4. This option causes the compiler to parallelize only those loops that are immediately preceded by a loop_parallel or prefer_parallel directive or pragma; all other loops, even if they could normally be automatically parallelized, are not analyzed for parallelization. Because the compiler does not automatically find parallel tasks or regions, user-specified task and region parallelization is not affected by this option.
By default, loop, task, and region node-parallelism is disabled. In other words, +Ononodepar is the default. The +O[no]nodepar option is only meaningful when specified with the +Oparallel option at +O3 or +O4.
The +Ononodepar option causes the compiler to generate code for a single-node machine. When this option is used, serial code is generated for node-parallel constructs; thus, node-parallelism is not implemented. Thread-parallelism, both automatic and directive-specified, is still implemented.
Use the +Onodepar option to enable directive-specified node-parallelism when compiling with +Oparallel at +O3 or +O4 on a multinode, scalable SMP.
Exemplar compilers compile for reentrancy by default in that the compiler itself does not introduce static or global references beyond what exist in the original code. Reentrant compilation causes procedures to store uninitialized local variables on the stack; no locals can carry values from one invocation of the procedure to the next (unless the variables appear in Fortran COMMON blocks or DATA or SAVE statements or in C/C++ static statements). This allows loops containing procedure calls to be manually parallelized, assuming no other inhibitors of parallelization exist.
When procedures are called in parallel, each thread receives a private stack on which to allocate local variables. This allows each parallel copy of the procedure to manipulate its local variables without interfering with any other copy's locals of the same name. When the procedure returns and the parallel threads join, all values on the stack are lost.
Thread 0's stack can grow to the size specified in the maxssiz configurable kernel parameter. Refer to the Managing Systems and Workgroups manual for more information on configurable kernel parameters.
Any threads your program spawns (as the result of loop_parallel or tasking directives or pragmas, for example) receive a default stack size of 80 Mbytes. This means that if:
you must modify the stack size of the spawned threads via the CPS_STACK_SIZE environment variable. Under csh, this can be done with the following command:
setenv CPS_STACK_SIZE size_in_kbytes
where size_in_kbytes is the desired stack size in Kbytes.
For example, the following command sets the thread stack size to 100 Mbytes (100 x 1024 = 102400 Kbytes):
setenv CPS_STACK_SIZE 102400