Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide > Chapter 4 Basic shared-memory programming

Loop-specific, task-specific, and region-specific data privatization


Once assigned, the memory classes discussed in detail in Chapter 5, “Memory classes,” are in effect throughout your entire program. Any loops that manipulate variables that have been explicitly assigned a memory class must be manually parallelized, and once a variable is assigned a class, its class cannot change. While very efficient programs can be written using these memory classes, they also require a great deal of manual intervention.

To get around these problems, the Exemplar Fortran 90, Exemplar Fortran 77, and Exemplar C compilers support the loop_private, task_private, and parallel_private directives and pragmas. The save_last directive and pragma is provided to save the value of loop_private data objects assigned in the last iteration of the loop. These directives and pragmas allow you to easily privatize parallel loop, task, or region data temporarily; when used with prefer_parallel, they do so without inhibiting any automatic compiler optimizations. They can help you further increase the performance of your shared-memory program with less extra work than is required when using the standard memory classes accompanying manual parallelization and synchronization.

You can use the above directives on local variables and arrays of any type, but they should not be used on data assigned one of the static or dynamic memory classes (thread_private, node_private, near_shared, far_shared or block_shared).

In some cases, data declared loop_private, task_private, or parallel_private is stored on the stacks of the spawned threads. Spawned thread stacks default to 80 Mbytes in size; if the amount of loop_private, task_private or parallel_private data declared exceeds this, you can use the CPS_STACK_SIZE environment variable to increase the default. Refer to the section “Default stack size” for more information.

loop_private

The loop_private directive and pragma declares a list of variables and/or arrays private to the immediately following Fortran DO or C for loop. The compiler assumes that data objects declared to be loop_private have no loop-carried dependences with respect to the parallel loops in which they are used. If dependences exist, you must handle them manually using the synchronization directives and techniques described in Chapter 6, “Advanced shared-memory programming.”

loop_private array dimensions must be determinable at compile-time.

Each parallel thread of execution receives a private copy of the loop_private data object for the duration of the loop; no starting values can be assumed for the data, and unless a save_last directive or pragma is specified (as described in a following section), no ending value can be assumed. If a loop_private data object is referenced within an iteration of the loop, it must have been assigned a value previously on that same iteration.

In Fortran, the LOOP_PRIVATE directive has the following form:

C$DIR LOOP_PRIVATE(namelist)

In C, the pragma has the form:

#pragma _CNX loop_private(namelist)

where

namelist

is a comma-separated list of variables and/or arrays that are to be private to the immediately following loop. namelist cannot contain structures, dynamic arrays, allocatable arrays, or automatic arrays.

Consider the following Fortran example:

C$DIR LOOP_PRIVATE(S)
      DO I = 1, N
C       S IS ONLY CORRECTLY PRIVATE IF AT LEAST
C       ONE IF TEST PASSES ON EACH ITERATION:
        IF(A(I) .GT. 0) S = A(I)
        IF(U(I) .LT. V(I)) S = V(I)
        IF(X(I) .LE. Y(I)) S = Z(I)
        B(I) = S * C(I) + D(I)
      ENDDO

An apparent loop-carried dependence (LCD) on S exists in this example; if none of the IF tests is true on a given iteration, the value of S must wrap around from the previous iteration. The LOOP_PRIVATE(S) directive asserts to the compiler that S does, in fact, get assigned on every iteration, and that it is therefore safe to parallelize this loop.

If on any iteration none of the IF tests pass, an actual LCD exists and privatizing S will result in wrong answers.

An analogous C example follows:

#pragma _CNX loop_private(s)
for(i=0;i<n;i++) {
  /* s is only private if at least one if
     test passes on each iteration: */
  if(a[i] > 0) s = a[i];
  if(u[i] < v[i]) s = v[i];
  if(x[i] <= y[i]) s = z[i];
  b[i] = s * c[i] + d[i];
}
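
When the algorithm can tolerate a fallback value on iterations where no test passes, one way to guarantee that s is assigned on every iteration is to assign it unconditionally at the top of the loop body. The following is a minimal sketch, not taken from the original example: the function name compute_b and the fallback parameter are hypothetical, and the pragma appears only in a comment so that the code compiles as plain C on any compiler.

```c
#include <stddef.h>

/* Hypothetical helper: fills b[0..n-1]. The fallback value is used for s
   on any iteration where none of the three tests passes. */
void compute_b(size_t n, const float *a, const float *u, const float *v,
               const float *x, const float *y, const float *z,
               const float *c, const float *d, float *b, float fallback)
{
    float s;
    size_t i;

    /* On the Exemplar C compiler, this loop would be preceded by:
       #pragma _CNX loop_private(s) */
    for (i = 0; i < n; i++) {
        s = fallback;               /* unconditional: no value wraps around */
        if (a[i] > 0)     s = a[i];
        if (u[i] < v[i])  s = v[i];
        if (x[i] <= y[i]) s = z[i];
        b[i] = s * c[i] + d[i];
    }
}
```

Note that this changes the serial behavior: the original loop would reuse the previous iteration's s when no test passes, so the rewrite is appropriate only if that wrap-around was never intended.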

Using loop_private with loop_parallel

Because the compiler does not automatically perform variable privatization in loop_parallel loops, you must manually privatize loop data requiring privatization. This can be easily done using the loop_private directive or pragma.

Consider the following Fortran example:

      SUBROUTINE PRIV(X,Y,Z)
      REAL X(1000), Y(4,1000), Z(1000)
      REAL XMFIED(1000)
C$DIR LOOP_PARALLEL, LOOP_PRIVATE(XMFIED, J)
      DO I = 1, 4
C       INITIALIZE XMFIED; MFY MUST NOT WRITE TO X:
        CALL MFY(X, XMFIED)
        DO J = 1, 999
          IF (XMFIED(J) .GE. Y(I,J)) THEN
            Y(I,J) = XMFIED(J) * Z(J)
          ELSE
            XMFIED(J+1) = XMFIED(J)
          ENDIF
        ENDDO
      ENDDO
      END

Here, the LOOP_PARALLEL directive is required to parallelize the I loop because of the call to MFY. The X, Y, and Z arrays are in shared memory by default. X and Z are not written to, and the portions of Y written to in the J loop's IF statement are disjoint, so these shared arrays require no special attention. The local array XMFIED, however, is written to. But because XMFIED carries no values into or out of the I loop, it can be privatized using LOOP_PRIVATE. This gives each thread running the I loop its own private copy of XMFIED, eliminating the need for expensive synchronized access to XMFIED. Note that a loop-carried dependence exists for XMFIED in the J loop, but because this loop runs serially on each processor, this dependence is safe.

J is privatized as discussed in the section “Privatizing induction variables in nested loops”.

An analogous C example follows:

void priv(float x[1000], float y[4][1000], float z[1000]) {
  float xmfied[1000];
  int i,j;
#pragma _CNX loop_parallel(ivar=i), loop_private(xmfied,j)
  for(i=0;i<4;i++) {
    mfy(x,xmfied);
    for(j=0;j<999;j++) {
      if(xmfied[j] >= y[i][j]) y[i][j] = xmfied[j]*z[j];
      else xmfied[j+1] = xmfied[j];
    }
  }
}

Denoting induction variables in parallel loops

To safely parallelize a loop with the loop_parallel directive or pragma, the compiler must be able to correctly determine the loop's primary induction variable.

The compiler can find primary Fortran DO loop induction variables; it may, however, have trouble with DO WHILE or hand-rolled Fortran loops, and with all loop_parallel loops in C. Therefore, when you use the loop_parallel directive or pragma to manually parallelize a loop other than an explicit Fortran DO loop, you should indicate the loop's primary induction variable using the IVAR=indvar attribute to loop_parallel.

Consider the following Fortran example:

      I = 1
C$DIR LOOP_PARALLEL(IVAR = I)
   10 A(I) = ...
      .
      . ! ASSUME NO DEPENDENCES
      .
      I = I + 1
      IF(I .LE. N) GOTO 10

This is a hand-rolled loop that uses I as its primary induction variable. To ensure parallelization, the LOOP_PARALLEL directive has been placed immediately before the start of the loop, and the induction variable, I, has been specified.

Primary induction variables in C loops can be difficult for the compiler to find, so ivar is required in all loop_parallel C loops. Its use is shown in the following example:

#pragma _CNX loop_parallel(ivar=i)
for(i=0; i<n; i++) {
  a[i] = ...;
  .
  . /* assume no dependences */
  .
}

Secondary induction variables are variables used to track loop iterations even though they do not appear in the Fortran DO statement. In C, they cannot appear alongside the primary induction variable in the for statement of a loop_parallel loop. Such variables must be a function of the primary loop induction variable; they cannot be independent. Secondary induction variables must also either be assigned a memory class manually (as described in Chapter 5, “Memory classes”) or be declared loop_private.

The following Fortran example contains an incorrectly incremented secondary induction variable:

C WARNING: INCORRECT EXAMPLE!!!!
      J = 1
C$DIR LOOP_PARALLEL
      DO I = 1, N
        J = J + 2 ! WRONG!!!

Here, J will not produce expected values in each iteration because multiple threads are overwriting its value with no synchronization. The compiler cannot privatize J because J carries a loop-carried dependence (LCD): its value on each iteration depends on its value from the previous iteration. This example can be corrected by privatizing J and making it a function of I, as shown below.

C CORRECT EXAMPLE:
      J = 1
C$DIR LOOP_PARALLEL
C$DIR LOOP_PRIVATE(J) ! J IS PRIVATE
      DO I = 1, N
        J = (2*I)+1 ! J IS PRIVATE

Here, J will be assigned correct values on each iteration because it is a function of I, and can be safely privatized.

In C, secondary induction variables are sometimes included in for statements, as shown in the following example:

/* warning: unparallelizable code follows */
#pragma _CNX loop_parallel(ivar=i)
for(i=j=0; i<n;i++,j+=2) {
  a[i] = ...;
  .
  .
  .
}

Because secondary induction variables must be private to the loop and must be a function of the primary induction variable, this example cannot be safely parallelized using loop_parallel(ivar=i). In the presence of this directive, the secondary induction variable will not be recognized. To manually parallelize this loop, you must remove j from the for statement and either privatize it and make it a function of i, or declare j to be shared (which is the default storage class), specify the ordered attribute on the loop_parallel directive, and increment it within an ordered critical section inside the loop. This latter method is costly in terms of synchronization overhead and may degrade the performance of the loop.

The following example demonstrates how to restructure the loop so that j is a valid secondary induction variable:

#pragma _CNX loop_parallel(ivar=i)
#pragma _CNX loop_private(j)
for(i=0; i<n; i++) {
  j = 2*i;
  a[i] = ...;
  .
  .
  .
}

This method runs faster than placing j in a critical section because it requires no synchronization overhead, and the private copy of j used here can typically be more quickly accessed than a shared variable.
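
The closed form is safe because the value of j depends only on i, so iterations can run in any order and on any thread. The following sketch illustrates this in plain C with no parallel directives; the function names fill_running and fill_closed_form are hypothetical. The second function deliberately walks the iterations backward to mimic an arbitrary schedule, yet it produces the same values as the serial running increment.

```c
#include <stddef.h>

/* Serial reference: j is carried from one iteration to the next. */
void fill_running(int *out, size_t n)
{
    size_t i;
    int j = 0;

    for (i = 0; i < n; i++) {
        out[i] = j;
        j += 2;
    }
}

/* Closed form: j is recomputed from i on every iteration, so the
   iteration order does not matter. The loop runs from n-1 down to 0. */
void fill_closed_form(int *out, size_t n)
{
    size_t i;

    for (i = n; i-- > 0; ) {
        int j = 2 * (int)i;   /* private to the iteration */
        out[i] = j;
    }
}
```

Calling both functions on arrays of the same length yields identical contents, which is the property the compiler relies on when j is declared loop_private.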

Privatizing induction variables in nested loops

The induction variables of nonparallel loops that are contained within parallel loops must be declared loop_private with respect to their closest enclosing parallel loop.

Consider the following Fortran example:

C$DIR LOOP_PARALLEL(THREADS)
C$DIR LOOP_PRIVATE(J)
      DO I = 1, N ! I LOOP GOES PARALLEL
        DO J = 1, M ! J LOOP IS SERIAL
          .
          .
          .
        ENDDO
      ENDDO

Here, LOOP_PARALLEL causes the I loop to be parallelized across threads. The J loop, then, runs serially. J must be private with respect to the I loop so that the threads that run the I loop do not attempt to update the same copy of J. If the loop is automatically parallelized by the compiler, or parallelized due to the presence of a PREFER_PARALLEL directive, this privatization will be automatic. But the presence of the LOOP_PARALLEL directive requires manual privatization.

An analogous C example follows:

#pragma _CNX loop_parallel(threads, ivar=i)
#pragma _CNX loop_private(j)
for(i=0;i<50;i++) {
  for(j=0;j<50;j++) {
    .
    .
    .
  }
}

This also applies to nested parallel outer loops. In this case, loop variables contained within a parallel construct—even if they are used in a parallel loop themselves—must be declared private with respect to the innermost enclosing parallel loop.

Consider the following Fortran example:

C$DIR LOOP_PARALLEL(NODES), LOOP_PRIVATE(J)
      DO I = 1, N ! I LOOP GOES NODE PAR
C$DIR LOOP_PARALLEL(THREADS)
C$DIR LOOP_PRIVATE(K)
        DO J = 1, M ! J LOOP GOES THREAD PAR
          DO K = 1, L ! K LOOP IS SERIAL
            .
            .
            .
          ENDDO
        ENDDO
      ENDDO

Here, LOOP_PARALLEL is used to parallelize the I loop across hypernodes, and the J loop across processors on each hypernode. K must be declared private to the J loop to ensure that the thread-parallel threads do not interfere with each other in updating it. J must be declared private to the I loop to ensure that each node-parallel thread gets its own copy.

An analogous C example follows:

#pragma _CNX loop_parallel(nodes, ivar=i), loop_private(j)
for(i=0;i<n;i++) {
#pragma _CNX loop_parallel(threads, ivar=j)
#pragma _CNX loop_private(k)
  for(j=0;j<m;j++) {
    for(k=0;k<l;k++) {
      .
      .
      .
    }
  }
}

task_private

The task_private directive declares a list of variables and/or arrays private to the immediately following tasks; it serves the same purpose for parallel tasks that loop_private serves for loops.

The task_private directive must immediately precede or appear on the same line as its corresponding begin_tasks directive. The compiler assumes that data objects declared to be task_private have no dependences between the tasks in which they are used. If dependences exist, you must handle them manually using the synchronization directives and techniques described in Chapter 6, “Advanced shared-memory programming.”

Each parallel thread of execution receives a private copy of the task_private data object for the duration of the tasks; no starting or ending values can be assumed for the data. If a task_private data object is referenced within a task, it must have been assigned a value previously in that task.

In Fortran, the TASK_PRIVATE directive has the following form:

C$DIR TASK_PRIVATE(namelist)

In C, the pragma has the form:

#pragma _CNX task_private(namelist)

where

namelist

is a comma-separated list of variables and/or arrays that are to be private to the immediately following tasks. namelist cannot contain dynamic, allocatable, or automatic arrays.

Consider the following Fortran example:

      REAL*8 A(1000), B(1000), WRK(1000)
      .
      .
      .
C$DIR BEGIN_TASKS, TASK_PRIVATE(WRK)
      DO I = 1, N
        WRK(I) = A(I)
      ENDDO
      DO I = 1, N
        A(I) = WRK(N+1-I)
        .
        .
        .
      ENDDO
C$DIR NEXT_TASK
      DO J = 1, M
        WRK(J) = B(J)
      ENDDO
      DO J = 1, M
        B(J) = WRK(M+1-J)
        .
        .
        .
      ENDDO
C$DIR END_TASKS

Here, the WRK array is used in the first task to temporarily hold the A array so that its order can be reversed. It serves the same purpose for the B array in the second task. WRK is assigned before it is used in each task.

An analogous C example follows:

float a[1000], b[1000], wrk[1000];
.
.
.
#pragma _CNX task_private(wrk)
#pragma _CNX begin_tasks
for(i=0;i<n;i++)
  wrk[i] = a[i];
for(i=0;i<n;i++) {
  a[i] = wrk[n-1-i];
  .
  .
  .
}
#pragma _CNX next_task
for(j=0;j<m;j++)
  wrk[j] = b[j];
for(j=0;j<m;j++) {
  b[j] = wrk[m-1-j];
  .
  .
  .
}
#pragma _CNX end_tasks
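
The effect of task_private on wrk can be pictured in plain C as each task owning its own scratch buffer. The following sketch is an illustration only, not the compiler's implementation: the helper reverse_in_place and the constant WRK_MAX are hypothetical, and the buffer is an ordinary local array, so two concurrent calls could never interfere through it.

```c
#include <stddef.h>

#define WRK_MAX 1000   /* assumed upper bound on n, matching wrk[1000] */

/* Hypothetical helper: reverses v[0..n-1] through a buffer that is local
   to the call. Requires n <= WRK_MAX. */
void reverse_in_place(float *v, size_t n)
{
    float wrk[WRK_MAX];   /* private to this call, like a task_private copy */
    size_t i;

    for (i = 0; i < n; i++)
        wrk[i] = v[i];             /* wrk is assigned before it is read */
    for (i = 0; i < n; i++)
        v[i] = wrk[n - 1 - i];
}
```

In this picture, the first task corresponds to reverse_in_place(a, n) and the second to reverse_in_place(b, m). Neither call sees a starting value in wrk or leaves one behind, which matches the rule that task_private data carries no values into or out of the tasks.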

parallel_private

The parallel_private directive declares a list of variables and/or arrays private to the immediately following parallel region; it serves the same purpose for parallel regions that task_private serves for tasks.

The parallel_private directive must immediately precede or appear on the same line as its corresponding parallel directive. Using parallel_private asserts that there are no dependences in the parallel region; do not use this directive if there are dependences.

Each parallel thread of execution receives a private copy of the parallel_private data object for the duration of the region; no starting or ending values can be assumed for the data. If a parallel_private data object is referenced within a region, it must have been assigned a value previously in the region.

In Fortran, the PARALLEL_PRIVATE directive has the form:

C$DIR PARALLEL_PRIVATE(namelist)

In C, the pragma has the form:

#pragma _CNX parallel_private(namelist)

where

namelist

is a comma-separated list of variables and/or arrays that are to be private to the immediately following parallel region. namelist cannot contain dynamic, allocatable, or automatic arrays.

Consider the following Fortran example:

      REAL A(1000,8), B(1000,8), C(1000,8), AWORK(1000), SUM(8)
      INTEGER MYTID
      .
      .
      .
C$DIR PARALLEL(MAX_THREADS = 8), PARALLEL_PRIVATE(I,J,K,L,M,AWORK,MYTID)
      IF(NUM_THREADS() .LT. 8) STOP "NOT ENOUGH THREADS; EXITING"
      MYTID = MY_THREAD() + 1 !ADD 1 FOR PROPER SUBSCRIPTING
      DO I = 1, 1000
        AWORK(I) = A(I, MYTID)
      ENDDO
      DO J = 1, 1000
        A(J, MYTID) = AWORK(J) + B(J, MYTID)
      ENDDO
      DO K = 1, 1000
        B(K, MYTID) = B(K, MYTID) * AWORK(K)
        C(K, MYTID) = A(K, MYTID) * B(K, MYTID)
      ENDDO
      DO L = 1, 1000
        SUM(MYTID) = SUM(MYTID) + A(L,MYTID) + B(L,MYTID) + C(L,MYTID)
      ENDDO
      DO M = 1, 1000
        A(M, MYTID) = AWORK(M)
      ENDDO
C$DIR END_PARALLEL

This example is similar to the one presented in the section “Region parallelization” in the way it checks for a certain number of threads and divides up the work among those threads. However, the parallel_private variable AWORK is introduced.

Each thread initializes its private copy of AWORK to the values contained in a dimension of the array A at the beginning of the parallel region; this allows the threads to reference AWORK without regard to thread ID, because no thread can access any other thread's copy of AWORK. Note that AWORK cannot carry values into or out of the region, so it must be initialized within the region.

All induction variables contained in a parallel region must be privatized. Remember that the code contained in the region runs on all available threads, so failing to privatize an induction variable would allow each thread to update the same shared variable, creating indeterminate loop counts on every thread.

In the J loop after AWORK is initialized, AWORK is effectively used in a reduction on A (since at this point its contents are identical to the MYTID dimension of A). After A is modified here and used in the K and L loops, each thread restores a dimension of A's original values from its private copy of AWORK, which carried the appropriate dimension through the region unaltered.

An analogous C example follows:

float a[8][1000], b[8][1000], c[8][1000], awork[1000], sum[8];
int i, j, k, l, m, mytid;
.
.
.
#pragma _CNX parallel(max_threads = 8)
#pragma _CNX parallel_private(i,j,k,l,m,awork,mytid)
if(num_threads() < 8) {
  fprintf(stderr, "not enough threads; exiting\n");
  exit(2);
}
mytid = my_thread();
for(i=0; i<1000; i++)
  awork[i] = a[mytid][i];
for(j=0; j<1000; j++)
  a[mytid][j] = awork[j] + b[mytid][j];
for(k=0; k<1000; k++) {
  b[mytid][k] = b[mytid][k] * awork[k];
  c[mytid][k] = a[mytid][k] * b[mytid][k];
}
for(l=0; l<1000; l++)
  sum[mytid] = sum[mytid] + a[mytid][l] + b[mytid][l] + c[mytid][l];
for(m=0; m<1000; m++)
  a[mytid][m] = awork[m];
#pragma _CNX end_parallel
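
The requirement that every induction variable in a parallel region be privatized can be illustrated with a small deterministic model. The sketch below is an assumption-laden simulation, not real threading: the hypothetical function simulate_two_threads lets two simulated threads take turns executing one iteration of the same for loop, either through a single shared counter or through one counter per thread. Real thread interleaving is nondeterministic; the round-robin schedule here is chosen only to make the outcome reproducible.

```c
/* Counts how many loop bodies each of two simulated threads executes when
   both run "for (i = 0; i < n; i++)" and take turns, one iteration each. */
void simulate_two_threads(int n, int share_counter, int iterations_run[2])
{
    int i_shared = 0;            /* the single i used when the counter is shared */
    int i_private[2] = {0, 0};   /* one i per thread when it is privatized */
    int finished[2] = {0, 0};
    int t;

    iterations_run[0] = iterations_run[1] = 0;
    while (!finished[0] || !finished[1]) {
        for (t = 0; t < 2; t++) {              /* round-robin turn */
            int *i = share_counter ? &i_shared : &i_private[t];
            if (finished[t])
                continue;
            if (*i < n) {
                iterations_run[t]++;           /* loop body */
                (*i)++;                        /* loop increment */
            } else {
                finished[t] = 1;               /* loop test failed */
            }
        }
    }
}
```

With n = 10 and a shared counter, each simulated thread runs only 5 of the 10 iterations it was supposed to run; with private counters, each runs all 10. On real hardware the split would vary from run to run, which is the indeterminate loop count described above.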

save_last[(list)]

The save_last directive and pragma allow you to save the final value of loop_private data objects assigned in the last iteration of the immediately following loop. If list (the optional, comma-separated list of loop_private data objects) is specified, only the final values of those data objects in list are saved. If list is not specified, the final values of all loop_private data objects assigned in the last loop iteration are saved.

The values must be assigned in the last iteration; if the assignment is executed conditionally, it is your responsibility to ensure that the condition is met and the assignment executes. Incorrect answers can result if the assignment does not execute on the last iteration. For loop_private arrays, only those elements of the array assigned on the last iteration will be saved.

In Fortran, the SAVE_LAST directive has the form:

C$DIR SAVE_LAST[(list)]

In C, the pragma has the form:

#pragma _CNX save_last[(list)]

save_last must appear immediately before or after the associated loop_private directive or pragma, or on the same line.

A save_last directive or pragma causes the thread that executes the last iteration of the loop to write back the private (or local) copy of the variable into the global reference.

Consider the following Fortran example:

C$DIR LOOP_PARALLEL
C$DIR LOOP_PRIVATE(ATEMP, X, Y)
C$DIR SAVE_LAST(ATEMP, X)
      DO I = 1, N
        IF (I .EQ. D(I)) ATEMP = A(I)
        IF (I .EQ. E(I)) ATEMP = B(I)
        IF (I .EQ. F(I)) ATEMP = C(I)
        A(I) = B(I) + C(I)
        B(I) = ATEMP
        X = ATEMP * A(I)
        Y = ATEMP * C(I)
      ENDDO
      .
      .
      .
      IF(ATEMP .GT. AMAX) THEN
      .
      .
      .

Here, the LOOP_PRIVATE variable ATEMP is conditionally assigned in the loop; for ATEMP to be truly private, you must be sure that at least one of the conditions is met on every iteration so that ATEMP is always assigned. When the loop terminates, the SAVE_LAST directive ensures that ATEMP and X contain the values they were assigned on the last iteration. These values can then be used later in the program. The value of Y, however, is not available once the loop finishes because Y is not specified as an argument to SAVE_LAST.

An analogous C example follows:

#pragma _CNX loop_parallel(ivar=i)
#pragma _CNX loop_private(atemp, x, y)
#pragma _CNX save_last(atemp, x)
for(i=0;i<n;i++) {
  if(i==d[i]) atemp = a[i];
  if(i==e[i]) atemp = b[i];
  if(i==f[i]) atemp = c[i];
  a[i] = b[i] + c[i];
  b[i] = atemp;
  x = atemp * a[i];
  y = atemp * c[i];
}
.
.
.
if(atemp > amax) {
.
.
.

Note that the save_last directive can be misleading in certain loop contexts.

Consider the following Fortran example:

C$DIR LOOP_PARALLEL
C$DIR LOOP_PRIVATE(S)
C$DIR SAVE_LAST
      DO I = 1, N
        IF(G(I) .GT. 0) THEN
          S = G(I) * G(I)
        ENDIF
      ENDDO

While it may appear that the last value of S assigned (on whatever iteration) is saved in this example, you must remember that the SAVE_LAST directive applies only to the last (Nth) iteration, without regard for any conditionals contained in the loop. For SAVE_LAST to be valid here, G(N) must be greater than 0 so that the assignment to S takes place on the final iteration. Obviously, if this condition can be predicted, the loop can be more efficiently written to exclude the IF test, so the presence of a SAVE_LAST in such a loop is suspect.
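
The difference between "the last value assigned" and "the value assigned on the last iteration" can be made concrete with a small model in plain C. This is a sketch under stated assumptions, not the compiler's behavior: the function names are hypothetical, and the model simply reports failure when the final iteration does not assign s, since the guide says only that incorrect answers can result in that case.

```c
#include <stddef.h>

/* Serial semantics: s ends up holding the value from whichever iteration
   assigned it most recently. */
float serial_last_assigned(const float *g, size_t n, float s)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (g[i] > 0)
            s = g[i] * g[i];
    return s;
}

/* Model of loop_private(s) with save_last: only the private copy held by
   the thread that runs the final iteration is written back. Returns 1 and
   stores the saved value in *s_out when that iteration assigns s; returns
   0 when it does not, meaning the saved value cannot be relied on. */
int save_last_model(const float *g, size_t n, float *s_out)
{
    if (n == 0 || !(g[n - 1] > 0))
        return 0;
    *s_out = g[n - 1] * g[n - 1];
    return 1;
}
```

For g = {3, -1}, the serial loop leaves s equal to 9, but the model reports that save_last has no reliable value to write back because the final element is not greater than 0. For g = {-1, 2}, both agree on 4.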

© Hewlett-Packard Development Company, L.P.