Misused directives and pragmas are a common cause of wrong answers. Some of the more common misuses of directives and pragmas involve the following:

- Loop-carried dependences
- Reductions
- Nondeterminism of parallel execution
- Hidden ordered sections

Descriptions of the items listed above, and methods for avoiding them, are in the sections below.

Loop-carried dependences

Forcing parallelization of a loop that contains a call is safe only if the called routine contains no dependences.

Do not assume that it is always safe to parallelize a loop whose data is safe to localize. You can safely localize loop data in loops that do not contain a loop-carried dependence (LCD) of the form shown in the following Fortran loop:

      DO I = 2, M
        DO J = 1, N
          A(I,J) = A(I+IADD,J+JADD) + B(I,J)
        ENDDO
      ENDDO

where one of IADD and JADD is negative and the other is positive. This is explained in detail in the section “Inhibitors of localization”.

You cannot safely parallelize a loop that contains any kind of LCD, except by using ordered sections around the LCDs as described in the section “Ordered sections”.

Also see the section “Inhibitors of parallelization”.

The MAIN section of the Fortran program that follows initializes the index array IN, calls CALC, and outputs the new array values. In subroutine CALC, the indirect index used in A(IN(I)) introduces a potential dependence that prevents the compiler from parallelizing CALC's I loop.

      PROGRAM MAIN
      REAL A(1025)
      INTEGER IN(1025)
      COMMON /DATA/ A
      DO I = 1, 1025
        IN(I) = I
      ENDDO
      CALL CALC(IN)
      CALL OUTPUT(A)
      END

      SUBROUTINE CALC(IN)
      INTEGER IN(1025)
      REAL A(1025)
      COMMON /DATA/ A
      DO I = 1, 1025
        A(I) = A(IN(I))
      ENDDO
      RETURN
      END

An analogous C example follows:

    float arra[1025];

    void calc(int in[])
    {
        int i,j;
        for(i = 0; i < 1025; i++)
            arra[i] = arra[in[i]];
    }

    main()
    {
        int i,j,in[1025];
        for(i = 0; i < 1025; i++)
            in[i] = i;
        calc(in);
        output(arra);
    }

Because you know that IN(I) = I, you can use the NO_LOOP_DEPENDENCE directive, as shown below. This directive allows the compiler to ignore the apparent dependence and parallelize the loop when compiling with +O3 +Oparallel.

      SUBROUTINE CALC(IN)
      INTEGER IN(1025)
      REAL A(1025)
      COMMON /DATA/ A
C$DIR NO_LOOP_DEPENDENCE(A)
      DO I = 1, 1025
        A(I) = A(IN(I))
      ENDDO
      RETURN
      END

In C:

    void calc(int in[])
    {
        int i,j;
    #pragma _CNX no_loop_dependence(arra)
        for(i = 0; i < 1025; i++)
            arra[i] = arra[in[i]];
    }

Reductions

Reductions are a special class of dependence that the compiler can parallelize. An apparent LCD can prevent the compiler from parallelizing a loop containing a reduction.

The loop in the following Fortran example is not parallelized because of an apparent dependence between the references to A(I) on line 6 and the assignment to A(JA(J)) on line 7. The compiler does not realize that the values of the elements of JA never coincide with the values of I, and so, assuming that they might, conservatively avoids parallelizing the loop.

      DO I = 1,100
        JA(I) = I + 10
      ENDDO
      DO I = 1, 100
        DO J = I, 100
          A(I) = A(I) + B(J) * C(J)     !LINE 6
          A(JA(J)) = B(J) + C(J)        !LINE 7
        ENDDO
      ENDDO

NOTE: In this example, as well as in the examples that follow, the apparent dependence becomes real if any of the values of the elements of JA are equal to the values iterated over by I.

A no_loop_dependence
directive or pragma placed before the J
loop tells the compiler that the indirect subscript does not cause
a true dependence. Because reductions are a form of dependence,
this directive also tells the compiler to ignore the reduction on
A(I), which it would normally handle.
Ignoring this reduction causes the compiler to generate incorrect
code for the assignment on line 6; the apparent dependence on line
7 is properly handled because of the directive.
The
resulting code runs fast but produces incorrect answers.

In the following analogous C example, the apparent dependence is between the reference to a[i] on line 5 and a[ja[j]] on line 6:

    for (i=0;i<100;i++)
        ja[i] = i + 10;
    for (i=0; i<100; i++)
        for (j=i; j<100; j++) {
            a[i] += b[j] * c[j];        /* line 5 */
            a[ja[j]] = b[j] + c[j];     /* line 6 */
        }

To solve this problem, distribute the J loop, isolating the reduction from the other statements, as shown in the following Fortran example:

      DO I = 1, 100
        DO J = I, 100
          A(I) = A(I) + B(J) * C(J)
        ENDDO
      ENDDO
C$DIR NO_LOOP_DEPENDENCE(A)
      DO I = 1, 100
        DO J = I, 100
          A(JA(J)) = B(J) + C(J)
        ENDDO
      ENDDO

And in C:

    for (i=0; i<100; i++)
        for (j=i; j<100; j++)
            a[i] += b[j] * c[j];
    #pragma _CNX no_loop_dependence(a)
    for (i=0; i<100; i++)
        for (j=i; j<100; j++)
            a[ja[j]] = b[j] + c[j];

The apparent dependence is removed, and both loops can be optimized.

Nondeterminism of parallel execution

In a parallel program, threads do not execute in a predictable or determined order. If you force the compiler to parallelize a loop when a dependence exists, the results are unpredictable and can vary from one execution to the next.

Consider the following Fortran example:

      DO I = 1, N-1
        A(I) = A(I+1) * B(I)
        . . .
      ENDDO

The compiler will not parallelize this code as written because of the dependence on A. This dependence requires that the original value of A(I+1) be available for the computation of A(I). If this code were parallelized, some values of A would be assigned by some processors before they were used by others, resulting in incorrect assignments. Because the results depend on the order in which statements execute, the errors are nondeterministic. The loop must therefore execute in iteration order to ensure that all values of A are computed correctly.

The analogous C code follows:

    for(i=0;i<n-1;i++) {
        a[i] = a[i+1] * b[i];
        . . .
    }

Loops containing dependences can sometimes be manually parallelized using the LOOP_PARALLEL(ORDERED) directive as described in Chapter 6, “Advanced shared-memory programming”. Otherwise, unless you are sure that no loop-carried dependence exists, it is safest to let the compiler choose which loops to parallelize.

Hidden ordered sections

While it is legal and sometimes useful to place ordered sections in separate routines from their parent ordered loops, this practice can cause runtime deadlock in some situations.

Consider the following Fortran example:

      PROGRAM SEPMAIN
      REAL A(100)
      . . .
C$DIR BEGIN_TASKS(NODES)
      CALL SUB1(A)
      . . .
C$DIR NEXT_TASK
      CALL SUBN
      . . .
C$DIR END_TASKS
      . . .
      END

      SUBROUTINE SUB1(A)
      COMMON LOCK
      REAL A(100)
C$DIR GATE(LOCK)
      LK = ALLOC_GATE(LOCK)
C$DIR LOOP_PARALLEL(ORDERED)
      DO I = 2, 100
        CALL SUB2(LOCK,A,I)
      ENDDO
      LK = FREE_GATE(LOCK)
      END

      SUBROUTINE SUB2(LOCK,A,I)
C$DIR GATE(LOCK)
      REAL A(100)
      INTEGER I
C$DIR ORDERED_SECTION(LOCK)
      A(I) = A(I-1)
C$DIR END_ORDERED_SECTION
      END

The ordered section in SUB2
should be associated with the loop_parallel
construct. However, if for any reason that construct does not execute
in parallel, the ordered section is then associated with the closest
parallel construct; in this case, the begin_tasks
construct. The ordered section in SUB2 expects
that it will be executed by all parallel threads. Only thread 0
from the node-level spawn is executing SUB1,
because only the first task calls it; thread 0 therefore runs the
DO loop in SUB1
and passes through the ordered section in SUB2.
After this, the ordered section will wait for thread 1 to enter
before allowing thread 0 back in on the next iteration
of the I loop. Thread 1 never calls
SUB1, so it never has the opportunity
to enter the ordered section. This can cause the program to deadlock
in the ordered section.

If you encounter this kind of problem, try moving the ordered section into the same routine as its parent loop. Also, you should apply the NO_DYNSEL directive to the loop_parallel loop or use the +Onodynsel command-line option so that the loop does not run serially.

The analogous C code follows:

    void sub2(gate_t lock, float *a, int i)
    {
    #pragma _CNX ordered_section(lock)
        a[i] = a[i-1];
    #pragma _CNX end_ordered_section
    }

    void sub1(float *a)
    {
        static gate_t lock;
        int i, lk;

        lk = alloc_gate(&lock);
    #pragma _CNX loop_parallel(ordered, ivar=i)
        for(i=1;i<100;i++)
            sub2(lock,a,i);
        lk = free_gate(&lock);
    }

    main()
    {
        float a[100];
        . . .
    #pragma _CNX begin_tasks(nodes)
        sub1(a);
        . . .
    #pragma _CNX next_task
        subn();
        . . .
    #pragma _CNX end_tasks
        . . .
    }