Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide > Chapter 8 Programming conventions for optimal code

Misused directives and pragmas

Misused directives and pragmas are a common cause of wrong answers. Some of the more common misuses of directives and pragmas involve the following:

  • Loop-carried dependences

  • Reductions

  • Nondeterminism of parallel execution

  • Hidden ordered sections

The sections below describe each of these misuses and explain how to avoid them.

Loop-carried dependences

Forcing parallelization of a loop that contains a call is safe only if the called routine contains no dependences.

Do not assume that it is always safe to parallelize a loop whose data is safe to localize. You can safely localize loop data in loops that do not contain a loop-carried dependence (LCD) of the form shown in the following Fortran loop:

      DO I = 2, M
        DO J = 1, N
          A(I,J) = A(I+IADD,J+JADD) + B(I,J)
        ENDDO
      ENDDO

where one of IADD and JADD is negative and the other is positive. This is explained in detail in the section “Inhibitors of localization”.

You cannot safely parallelize a loop that contains any kind of LCD, except by using ordered sections around the LCDs as described in the section “Ordered sections”. Also see the section “Inhibitors of parallelization”.

The MAIN section of the Fortran program that follows initializes the index array IN, calls CALC, and outputs the new values of A. In subroutine CALC, the indirect index used in A(IN(I)) introduces a potential dependence that prevents the compiler from parallelizing CALC's I loop.

      PROGRAM MAIN
      REAL A(1025)
      INTEGER IN(1025)
      COMMON /DATA/ A
      DO I = 1, 1025
        IN(I) = I
      ENDDO
      CALL CALC(IN)
      CALL OUTPUT(A)
      END

      SUBROUTINE CALC(IN)
      INTEGER IN(1025)
      REAL A(1025)
      COMMON /DATA/ A
      DO I = 1, 1025
        A(I) = A(IN(I))
      ENDDO
      RETURN
      END

An analogous C example follows:

float arra[1025];

void output(float a[]);   /* defined elsewhere */

void calc(int in[])
{
    int i;

    for (i = 0; i < 1025; i++)
        arra[i] = arra[in[i]];
}

int main(void)
{
    int i, in[1025];

    for (i = 0; i < 1025; i++)
        in[i] = i;
    calc(in);
    output(arra);
    return 0;
}

Because you know that IN(I) = I, you can use the NO_LOOP_DEPENDENCE directive, as shown below. When you compile with +O3 +Oparallel, this directive allows the compiler to ignore the apparent dependence and parallelize the loop.

      SUBROUTINE CALC(IN)
      INTEGER IN(1025)
      REAL A(1025)
      COMMON /DATA/ A
C$DIR NO_LOOP_DEPENDENCE(A)
      DO I = 1, 1025
        A(I) = A(IN(I))
      ENDDO
      RETURN
      END

In C:

void calc(int in[])
{
    int i;
#pragma _CNX no_loop_dependence(arra)
    for (i = 0; i < 1025; i++)
        arra[i] = arra[in[i]];
}
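
The directive is only as safe as the claim behind it. As a minimal sketch, assuming a hypothetical helper named index_is_identity (it is not part of the HP compilers or libraries), the property IN(I) = I could be checked at run time before relying on a loop marked with no_loop_dependence:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an HP API: returns 1 when in[i] == i for
   every i in [0, n), the property that makes arra[i] = arra[in[i]]
   free of loop-carried dependences; returns 0 otherwise. */
int index_is_identity(const int *in, size_t n)
{
    size_t i;

    assert(in != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (in[i] < 0 || (size_t)in[i] != i)
            return 0;
    }
    return 1;
}
```

A debug build might call index_is_identity(in, 1025) before calc(in) and report an error when it returns 0.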

Reductions

Reductions are a special class of dependence that the compiler can parallelize. An apparent LCD can prevent the compiler from parallelizing a loop containing a reduction. The loop in the following Fortran example is not parallelized because of an apparent dependence between the references to A(I) on line 6 and the assignment to A(JA(J)) on line 7. The compiler does not realize that the values of the elements of JA never coincide with the values of I, and so, assuming that they might, conservatively avoids parallelizing the loop.

      DO I = 1,100
        JA(I) = I + 10
      ENDDO
      DO I = 1, 100
        DO J = I, 100
          A(I) = A(I) + B(J) * C(J)      !LINE 6
          A(JA(J)) = B(J) + C(J)         !LINE 7
        ENDDO
      ENDDO

NOTE: In this example as well as the examples that follow, the apparent dependence becomes real if any of the values of the elements of JA are equal to the values iterated over by I.

A no_loop_dependence directive or pragma placed before the J loop tells the compiler that the indirect subscript does not cause a true dependence. Because reductions are a form of dependence, this directive also tells the compiler to ignore the reduction on A(I), which it would normally handle. Ignoring this reduction causes the compiler to generate incorrect code for the assignment on line 6; the apparent dependence on line 7 is properly handled because of the directive. The resulting code runs fast but produces incorrect answers.
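
Why an ignored reduction produces wrong answers can be modeled without any threads. The sketch below is an illustration only, not the code the HP compilers generate. It assumes two hypothetical threads that split the j range in half, and it replays one possible interleaving in which both threads read the accumulator before either writes it back. All function names are invented for this example.

```c
#include <assert.h>

/* Sum of b[j] * c[j] for j in [lo, hi). */
static double partial_sum(const double *b, const double *c, int lo, int hi)
{
    double s = 0.0;
    int j;

    for (j = lo; j < hi; j++)
        s += b[j] * c[j];
    return s;
}

/* Model of the update treated as an ordinary store: both threads load
   the same old value of the accumulator, so the first write-back is
   overwritten and its contribution is lost. */
double unprotected_update(double a_i, const double *b, const double *c, int n)
{
    double seen0, seen1;

    assert(n >= 0);
    seen0 = a_i;                                  /* thread 0 reads      */
    seen1 = a_i;                                  /* thread 1 reads      */
    a_i = seen0 + partial_sum(b, c, 0, n / 2);    /* thread 0 writes     */
    a_i = seen1 + partial_sum(b, c, n / 2, n);    /* thread 1 overwrites */
    return a_i;
}

/* Model of a properly handled reduction: each thread keeps a private
   partial sum and the partial sums are combined exactly once. */
double reduction_update(double a_i, const double *b, const double *c, int n)
{
    double p0, p1;

    assert(n >= 0);
    p0 = partial_sum(b, c, 0, n / 2);
    p1 = partial_sum(b, c, n / 2, n);
    return a_i + p0 + p1;
}
```

With a starting value of 10, b = {1, 2, 3, 4}, and c all ones, the reduction model returns 20, while the unprotected model returns 17 because the contribution of the first half is lost.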

In the following analogous C example, the apparent dependence is between the reference to a[i] on line 5 and a[ja[j]] on line 6:

for (i = 0; i < 100; i++)
    ja[i] = i + 10;
for (i = 0; i < 100; i++)
    for (j = i; j < 100; j++) {
        a[i] += b[j] * c[j];       /* line 5 */
        a[ja[j]] = b[j] + c[j];    /* line 6 */
    }

To solve this problem, distribute the J loop, isolating the reduction from the other statements, as shown in the following Fortran example:

      DO I = 1, 100
        DO J = I, 100
          A(I) = A(I) + B(J) * C(J)
        ENDDO
      ENDDO
C$DIR NO_LOOP_DEPENDENCE(A)
      DO I = 1, 100
        DO J = I, 100
          A(JA(J)) = B(J) + C(J)
        ENDDO
      ENDDO

And in C:

for (i = 0; i < 100; i++)
    for (j = i; j < 100; j++)
        a[i] += b[j] * c[j];
#pragma _CNX no_loop_dependence(a)
for (i = 0; i < 100; i++)
    for (j = i; j < 100; j++)
        a[ja[j]] = b[j] + c[j];

The apparent dependence is removed, and both loops can be optimized.
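
Distribution preserves the serial result only when the condition in the note holds, that is, when no element of ja equals a value taken by i. The sketch below checks this on a small case. The size N, the offset ja[j] = j + N (chosen here so that the two index ranges cannot overlap), and both function names are assumptions made for this illustration.

```c
#include <assert.h>

#define N 8   /* illustrative size, not taken from the examples above */

/* Original form: the reduction and the indirect store share one loop
   nest. a must hold 2*N elements because ja[j] lies in [N, 2*N). */
void fused(double *a, const double *b, const double *c, const int *ja)
{
    int i, j;

    for (i = 0; i < N; i++)
        for (j = i; j < N; j++) {
            assert(ja[j] >= N && ja[j] < 2 * N);
            a[i] += b[j] * c[j];
            a[ja[j]] = b[j] + c[j];
        }
}

/* Distributed form: the reduction runs first in its own loop nest,
   followed by the indirect stores in a second loop nest. */
void distributed(double *a, const double *b, const double *c, const int *ja)
{
    int i, j;

    for (i = 0; i < N; i++)
        for (j = i; j < N; j++)
            a[i] += b[j] * c[j];
    for (i = 0; i < N; i++)
        for (j = i; j < N; j++) {
            assert(ja[j] >= N && ja[j] < 2 * N);
            a[ja[j]] = b[j] + c[j];
        }
}
```

Running both functions on identical copies of the arrays and comparing a element by element gives a quick serial check that the transformation did not change the answer under the stated assumption.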

Nondeterminism of parallel execution

In a parallel program, threads do not execute in a predictable or determined order. If you force the compiler to parallelize a loop when a dependence exists, the results are unpredictable and can vary from one execution to the next.

Consider the following Fortran example:

      DO I = 1, N-1
        A(I) = A(I+1) * B(I)
        .
        .
        .
      ENDDO

The compiler will not parallelize this code as written because of the dependence on A(I). This dependence requires that the original value of A(I+1) be available for the computation of A(I). If this code were parallelized, some values of A would be assigned by some processors before they were used by others, resulting in incorrect assignments. Because the results depend on the order in which statements execute, the errors are nondeterministic. The loop must therefore execute in iteration order to ensure that all values of A are computed correctly.

The analogous C code follows:

for (i = 0; i < n-1; i++) {
    a[i] = a[i+1] * b[i];
    .
    .
    .
}
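
The effect of losing iteration order can be reproduced serially. In the sketch below, which is a model and not actual parallel code, run_forward executes the loop body in iteration order and run_reverse executes the same body in the opposite order, standing in for one of the many schedules a forced parallelization might produce. Both function names are invented for this example.

```c
#include <assert.h>

/* Iteration order: every a[i] is computed from the original a[i+1]. */
void run_forward(double *a, const double *b, int n)
{
    int i;

    assert(n >= 1);
    for (i = 0; i < n - 1; i++)
        a[i] = a[i + 1] * b[i];
}

/* Same body, opposite order: for every i below n-2, a[i+1] has already
   been overwritten by the time a[i] is computed. */
void run_reverse(double *a, const double *b, int n)
{
    int i;

    assert(n >= 1);
    for (i = n - 2; i >= 0; i--)
        a[i] = a[i + 1] * b[i];
}
```

With a = {1, 2, 3, 4} and b all ones, the forward order yields {2, 3, 4, 4}, while the reverse order yields {4, 4, 4, 4}.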

Loops containing dependences can sometimes be manually parallelized using the LOOP_PARALLEL(ORDERED) directive, as described in Chapter 6, “Advanced shared-memory programming”. Otherwise, unless you are sure that no loop-carried dependence exists, it is safest to let the compiler choose which loops to parallelize.

Hidden ordered sections

While it is legal and sometimes useful to place ordered sections in separate routines from their parent ordered loops, this practice can cause runtime deadlock in some situations. Consider the following Fortran example:

      PROGRAM SEPMAIN
      REAL A(100)
      .
      .
      .
C$DIR BEGIN_TASKS(NODES)
      CALL SUB1(A)
      .
      .
      .
C$DIR NEXT_TASK
      CALL SUBN
      .
      .
      .
C$DIR END_TASKS
      .
      .
      .
      END

      SUBROUTINE SUB1(A)
      COMMON LOCK
      REAL A(100)
C$DIR GATE(LOCK)
      LK = ALLOC_GATE(LOCK)
C$DIR LOOP_PARALLEL(ORDERED)
      DO I = 2, 100
        CALL SUB2(LOCK,A,I)
      ENDDO
      LK = FREE_GATE(LOCK)
      END

      SUBROUTINE SUB2(LOCK,A,I)
C$DIR GATE(LOCK)
      REAL A(100)
      INTEGER I
C$DIR ORDERED_SECTION(LOCK)
      A(I) = A(I-1)
C$DIR END_ORDERED_SECTION
      END

The ordered section in SUB2 should be associated with the loop_parallel construct. However, if for any reason that construct does not execute in parallel, the ordered section is then associated with the closest parallel construct; in this case, the begin_tasks construct.

The ordered section in SUB2 expects that it will be executed by all parallel threads. Only thread 0 from the node-level spawn is executing SUB1, because only the first task calls it; thread 0 therefore runs the DO loop in SUB1 and passes through the ordered section in SUB2. After this, the ordered section will wait for thread 1 to enter before allowing thread 0 back in on the next iteration of the I loop. Thread 1 never calls SUB1, so it never has the opportunity to enter the ordered section. This can cause the program to deadlock in the ordered section.
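
The stall can be pictured with a toy model of the behavior just described, in which the ordered section admits threads strictly in turn. This is an illustration only and makes no claim about how the HP runtime implements ordered sections; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Toy ordered section: admits thread ids 0, 1, ..., nthreads-1 in turn. */
struct ordered_model {
    int nthreads;   /* threads the section believes are participating */
    int next;       /* id of the only thread allowed in right now     */
};

/* Returns 1 and advances the turn if tid may enter; returns 0 if tid
   would have to wait for another thread to go first. */
int try_enter(struct ordered_model *m, int tid)
{
    assert(m->nthreads > 0);
    if (tid != m->next)
        return 0;
    m->next = (m->next + 1) % m->nthreads;
    return 1;
}

/* Counts how many loop iterations thread 0 completes when it is the
   only thread that ever reaches the section. A result smaller than
   niters means thread 0 is left waiting for a thread that never comes. */
int iterations_before_stall(int nthreads, int niters)
{
    struct ordered_model m;
    int done = 0;

    m.nthreads = nthreads;
    m.next = 0;
    while (done < niters && try_enter(&m, 0))
        done++;
    return done;
}
```

In this model, iterations_before_stall(2, 99) returns 1: thread 0 passes through once and is then refused because the section is waiting for thread 1. With a single expected thread, all 99 iterations complete.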

If you encounter this kind of problem, try moving the ordered section into the same routine as its parent loop. In addition, apply the NO_DYNSEL directive to the loop_parallel loop, or use the +Onodynsel command-line option, so that the loop does not run serially.

The analogous C code follows:

void sub2(gate_t lock, float *a, int i) {
#pragma _CNX ordered_section(lock)
    a[i] = a[i-1];
#pragma _CNX end_ordered_section
}

void sub1(float *a) {
    static gate_t lock;
    int i, lk;

    lk = alloc_gate(&lock);
#pragma _CNX loop_parallel(ordered, ivar=i)
    for (i = 1; i < 100; i++)
        sub2(lock, a, i);
    lk = free_gate(&lock);
}

int main(void) {
    float a[100];
    .
    .
    .
#pragma _CNX begin_tasks(nodes)
    sub1(a);
    .
    .
    .
#pragma _CNX next_task
    subn();
    .
    .
    .
#pragma _CNX end_tasks
    .
    .
    .
}

© Hewlett-Packard Development Company, L.P.