Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide > Chapter 8 Programming conventions for optimal code

Misused memory classes


While manually assigned memory classes can substantially boost performance when coupled with manual parallelization, assigning the wrong memory class to data can cause wrong answers and in some cases degrade performance. This section discusses some common misuses of memory classes.

Improper dynamic allocations

Dynamically allocating thread_private memory from serial code can give unexpected results if the memory is later accessed from parallel code.

Consider the following incorrect Fortran example:

C INCORRECT EXAMPLE FOLLOWS!!!!
      REAL*8 WRONGTP(:)
C$DIR THREAD_PRIVATE(WRONGTP)
      ALLOCATABLE WRONGTP
      .
      .
      .
C THE FOLLOWING ALLOCATE ONLY ALLOCATES
C WRONGTP(N) FOR THREAD 0:
      ALLOCATE(WRONGTP(N))
C$DIR LOOP_PARALLEL(THREADS, IVAR = I)
C$DIR LOOP_PRIVATE(J)
      DO I = 1, NUM_THREADS()
        DO J = 1, N
          WRONGTP(J) = ...    ! ONLY EXISTS FOR
          .                   ! THREAD 0
          .
          .
        ENDDO
      ENDDO

Here, the array WRONGTP is allocated, but because the allocation takes place in serial code, which is run by thread 0, only thread 0 allocates the array. When the other threads attempt to access the array in the J loop, their copies do not exist. To fix this, allocate the array inside the thread-parallel I loop, as discussed in Chapter 5, “Memory classes.”

An analogous C example follows:

/* INCORRECT EXAMPLE FOLLOWS!!! */
static thread_private double *wrongtp;
.
.
.
/* the following memory_class_malloc only allocates wrongtp
   for thread 0 */
wrongtp = (double *)memory_class_malloc(sizeof(double)*n,
                                        THREAD_PRIVATE_MEM);
#pragma _CNX loop_parallel(threads, ivar=i)
#pragma _CNX loop_private(j)
for(i=0;i<num_threads();i++) {
  for(j=0;j<n;j++) {
    wrongtp[j] = ... /* only exists for thread 0 */
    .
    .
    .
  }
}
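
The fix described above, allocating the private array inside the parallel loop, can be sketched in portable C. This is only an illustration under stated assumptions: it uses OpenMP pragmas (threadprivate and parallel for) as a stand-in for the Exemplar thread_private class and loop_parallel directive, plain malloc in place of memory_class_malloc, and the hypothetical names tp and fill_private_arrays. It is not the Exemplar API. Compiled without OpenMP support, the pragmas are ignored and the loop runs serially with the same result.

```c
#include <stdlib.h>

/* Assumption: with OpenMP enabled, each thread gets its own copy of
   this pointer, roughly like a thread_private pointer in the text. */
static double *tp;
#pragma omp threadprivate(tp)

/* Runs `iterations` loop iterations, each of which allocates, fills,
   checks, and frees its own n-element array (n >= 1). Returns the
   number of iterations whose array was usable. */
int fill_private_arrays(int iterations, int n)
{
    int ok = 0;

    #pragma omp parallel for reduction(+:ok)
    for (int i = 0; i < iterations; i++) {
        /* The allocation happens inside the parallel loop, so the
           thread running this iteration sets its own copy of tp. */
        tp = malloc(sizeof(double) * (size_t)n);
        if (tp == NULL)
            continue;

        for (int j = 0; j < n; j++)
            tp[j] = (double)(i + j);

        if (tp[n - 1] == (double)(i + n - 1))
            ok += 1;

        free(tp);
        tp = NULL;
    }
    return ok;
}
```

Calling fill_private_arrays(4, 8) should report four usable arrays, because no iteration depends on an allocation made by serial code.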

In general, memory of classes other than thread_private should be dynamically allocated in serial code. Allocating node_private, near_shared, far_shared and block_shared memory from within parallel code will create wasteful redundant copies.

Consider the following incorrect Fortran example:

C INCORRECT EXAMPLE FOLLOWS!!!
      REAL*8 WRONGNP(:)
C$DIR NODE_PRIVATE(WRONGNP)
C$DIR FAR_SHARED_POINTER(WRONGNP)
      ALLOCATABLE WRONGNP
      .
      .
      .
      N = NUM_NODES
C$DIR LOOP_PARALLEL(NODES, IVAR = I)
      DO I = 1, N
        ALLOCATE(WRONGNP(M))
        .
        .
        .
      ENDDO

Recall from Chapter 5, “Memory classes,” that when a node_private array is allocated, a physical copy is created on each hypernode on which the program is running. Here, each iteration of the I loop executes the ALLOCATE statement (or the memory_class_malloc function in C), so the array is allocated N times when a single allocation would suffice; N-1 of those allocations are redundant. To complicate matters further, node_private arrays manipulated in parallel code must be accessed through shared pointers, which is why the Fortran example includes a FAR_SHARED_POINTER directive. In the code above, this pointer is overwritten every time an iteration of the I loop executes the ALLOCATE statement (or memory_class_malloc in C), so only the most recently allocated copy is accessible. Because the hypernodes do not execute the loop code in perfect synchronization, the memory actually accessed by a reference such as WRONGNP(I) varies depending on which hypernode was last to perform the allocation.

An analogous C example follows:

/* INCORRECT EXAMPLE FOLLOWS!!! */
static far_shared double *wrongnp;
.
.
.
n = numnodes();
#pragma _CNX loop_parallel(nodes, ivar=i)
for(i=0;i<n;i++) {
  wrongnp = (double *)memory_class_malloc(sizeof(double)*m,
                                          NODE_PRIVATE_MEM);
  .
  .
  .
}

While dynamically allocated near_shared, far_shared and block_shared arrays do not normally require special pointer types, they suffer from the same redundant-copy problem. Allocating any shared-memory arrays from within parallel code will create as many copies of the data as there are hypernodes (or threads) executing the ALLOCATE (or memory_class_malloc) statement. As with the node_private example above, the actual memory accessed will depend on which hypernode most recently executed the ALLOCATE statement. After all hypernodes have executed the ALLOCATE, the memory allocated by all but the last will be lost. Such lost arrays are not only unusable, they cannot be deallocated.

To avoid such redundancy problems, follow the allocation examples in Chapter 5, “Memory classes,” and allocate memory from within parallel constructs only as described there.
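
As a rough illustration of the rule for shared data, the sketch below allocates one array in serial code and only then enters a parallel loop, so the pointer is assigned exactly once and nothing is leaked. It is written in portable C with an OpenMP pragma standing in for the Exemplar loop_parallel directive and plain malloc standing in for memory_class_malloc; hypernodes and memory classes are not modeled, and the function name sum_after_parallel_fill is hypothetical.

```c
#include <stdlib.h>

/* Allocates one shared array in serial code, fills it in a parallel
   loop, and returns the sum of its elements (or -1.0 if the
   allocation fails). */
double sum_after_parallel_fill(int m)
{
    /* Serial code: the only allocation, so the pointer is set once. */
    double *shared = malloc(sizeof(double) * (size_t)m);
    if (shared == NULL)
        return -1.0;

    /* Parallel code: iterations write disjoint elements and never
       reassign the pointer. */
    #pragma omp parallel for
    for (int j = 0; j < m; j++)
        shared[j] = (double)j;

    /* Serial code again: read the data back and release the one copy. */
    double sum = 0.0;
    for (int j = 0; j < m; j++)
        sum += shared[j];

    free(shared);
    return sum;
}
```

Because the allocation sits outside the loop, there is a single array to deallocate afterward, in contrast to the lost copies described above.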

Incorrect array pointers

As mentioned in the previous section, it is sometimes necessary to access dynamically allocated arrays using pointers of a different memory class. For example, when accessing node_private arrays from node-parallel code, far_shared pointers must be used (refer to Chapter 5, “Memory classes”). Failing to do this renders the copies of the arrays on all but logical hypernode 0 inaccessible.

Consider the following incorrect Fortran example:

C INCORRECT EXAMPLE FOLLOWS!!!!
      REAL*8 N0NP(:)
C$DIR NODE_PRIVATE(N0NP)
      ALLOCATABLE N0NP
      .
      .
      .
      ALLOCATE(N0NP(M))
      N = NUM_NODES
C$DIR LOOP_PARALLEL(NODES), LOOP_PRIVATE(J)
      DO I = 1, N
C$DIR LOOP_PARALLEL(THREADS)
        DO J = 1, M
          N0NP(J) = ...
          .
          .
          .
        ENDDO
      ENDDO

While the N0NP array is correctly allocated in serial code here, it is not explicitly given a shared pointer, so the arrays created are accessed through the default node_private pointer. A physical copy of N0NP is created on every hypernode, but the node_private pointer through which these copies are accessed is initialized only on logical hypernode 0, because that is the only hypernode that executes the ALLOCATE statement (or memory_class_malloc in C). The node_private pointers on the other hypernodes are uninitialized, so their contents are indeterminate. When these other hypernodes attempt to access N0NP in the J loop, which is nested inside the node-parallel I loop, they do so through the garbage contents of their uninitialized pointers, typically causing a runtime error.

An analogous C example follows:

/* INCORRECT EXAMPLE FOLLOWS!!! */
static node_private double *n0np;
.
.
.
n0np = (double *)memory_class_malloc(sizeof(double)*m,
                                     NODE_PRIVATE_MEM);
n = numnodes();
#pragma _CNX loop_parallel(nodes, ivar=i), loop_private(j)
for(i=0;i<n;i++) {
#pragma _CNX loop_parallel(threads, ivar=j)
  for(j=0;j<m;j++) {
    n0np[j] = ...
    .
    .
    .
  }
}

Chapter 5, “Memory classes,” covers correct pointer/data combinations and explains the situations in which nondefault pointers should be used. To avoid uninitialized pointer problems such as the one described above, follow the recommendations of Chapter 5 carefully.

Hidden dependences

Improperly accessing a shared variable from parallel threads can create a dependence that is not apparent in the calling code and that can cause wrong answers.

Consider the following Fortran code:

      PROGRAM HOLDER
      REAL HOLD
C$DIR FAR_SHARED(HOLD)
C$DIR TASK_PRIVATE(X,Y)
C$DIR BEGIN_TASKS
      X = ...
      .
      .
      .
      CALL ADDHOLD(HOLD, X)
C$DIR NEXT_TASK
      Y = ...
      .
      .
      .
      CALL ADDHOLD(HOLD,Y)
C$DIR END_TASKS
      END

      SUBROUTINE ADDHOLD(HOLD,Z)
      REAL HOLD, Z
      HOLD = HOLD+Z
      END

Here, the far_shared variable HOLD is updated as a function of itself in the subroutine ADDHOLD, which is called from the potentially parallel tasks. If HOLD were updated within the tasks rather than in a subroutine, the dependence would be more obvious to the programmer, who may not have ready access to the source of ADDHOLD.

Isolating the assignment to HOLD inside a critical section would allow the tasks to safely parallelize, whether the assignment took place in a subroutine or inside the tasks themselves.

An analogous C example follows:

void addhold(float *hold, float z) {
  *hold = *hold + z;
}
main() {
  static far_shared float hold;
  static float x,y;
#pragma _CNX task_private(x,y)
#pragma _CNX begin_tasks
  x = ...;
  .
  .
  .
  addhold(&hold,x);
#pragma _CNX next_task
  y = ...;
  .
  .
  .
  addhold(&hold,y);
#pragma _CNX end_tasks
}

Always use caution when parallelizing a call to a procedure that passes the same shared variable from every thread.
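
The critical-section remedy mentioned earlier in this section can be sketched as follows. This is a portable approximation, not the Exemplar syntax: OpenMP sections stand in for the begin_tasks/next_task/end_tasks directives, an OpenMP named critical region stands in for an Exemplar critical section, and the names run_tasks and hold_update are hypothetical. The values of x and y are passed in as parameters because the original listing elides how they are computed.

```c
/* Same update as addhold in the text, with the read-modify-write of
   the shared variable isolated in a critical section. */
static void addhold(float *hold, float z)
{
    #pragma omp critical(hold_update)
    {
        *hold = *hold + z;
    }
}

/* Runs the two tasks, possibly in parallel, and returns the final
   value of the shared accumulator. */
float run_tasks(float x, float y)
{
    float hold = 0.0f;   /* shared by both tasks */

    #pragma omp parallel sections shared(hold)
    {
        #pragma omp section
        {
            addhold(&hold, x);
        }
        #pragma omp section
        {
            addhold(&hold, y);
        }
    }
    return hold;
}
```

With the update serialized, the result does not depend on which task reaches addhold first; without the critical region, the two read-modify-write sequences could interleave and drop one of the additions.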

© Hewlett-Packard Development Company, L.P.