HP aC++ Online Programmer's Guide
Hewlett-Packard
Parallel Processing
HP aC++ generates efficient parallel code by default. For applications running on HP 9000 K-Class and V-Class servers, you can increase the amount of code the compiler can parallelize on multiprocessor systems by using options and supporting library calls.

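This section discusses the following topics:

  • Getting Started with Parallelizing HP aC++ Programs
  • Guidelines for Parallelizing HP aC++ Programs
  • Memory Classes
  • Synchronization Functions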

For more information, refer to the Parallel Programming Guide for HP-UX Systems.

Getting Started with Parallelizing HP aC++ Programs
Here are some basic tasks to help you get started with parallelizing HP aC++ programs.

Transforming Loops for Parallel Execution (+Oparallel)

The +Oparallel option causes the compiler to transform eligible loops for parallel execution on multiprocessor machines.

The following command lines compile (without linking) three source files: x.c, y.c, and z.c. The files x.c and y.c are compiled for parallel execution. The file z.c is compiled for serial execution, even though its object file will be linked with x.o and y.o.


aCC +O3 +Oparallel -c x.c y.c 

aCC +O3 -c z.c

The following command line links the three object files, producing the executable file para_prog:


aCC +O3 +Oparallel -o para_prog x.o y.o z.o

As this command line implies, if you link and compile separately, you must use aCC, not ld. The command line to link must also include the +Oparallel and +O3 options in order to link in the right startup files and runtime support.


Setting the Number of Threads Used in Parallel

Use the MP_NUMBER_OF_THREADS environment variable to set the number of processors that are to execute your program in parallel. If you do not set this variable, it defaults to the number of processors on the executing machine.

From the C shell, the following command sets MP_NUMBER_OF_THREADS to indicate that programs compiled for parallel execution can execute on two processors:

setenv MP_NUMBER_OF_THREADS 2

If you use the Korn shell, the command is:

export MP_NUMBER_OF_THREADS=2
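The default rule described above can be sketched in portable C++. This is only an illustration of the documented behavior, not the HP runtime's own code: resolve_thread_count is a hypothetical helper, and treating an empty, non-numeric, or non-positive value as "unset" is an assumption.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not part of any HP library): returns the thread count
// a program would use, given the raw value of MP_NUMBER_OF_THREADS
// (nullptr if the variable is unset) and the number of processors.
unsigned resolve_thread_count(const char *env_value, unsigned processors) {
    assert(processors > 0);
    if (env_value == nullptr) return processors;   // unset: documented default
    char *end = nullptr;
    long n = std::strtol(env_value, &end, 10);
    // Assumption: empty, non-numeric, or non-positive values count as unset.
    if (end == env_value || *end != '\0' || n <= 0) return processors;
    return static_cast<unsigned>(n);
}
```

A caller might pass std::getenv("MP_NUMBER_OF_THREADS") as the first argument and the machine's processor count as the second.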


Determining Idle Thread States

Use the MP_IDLE_THREADS_WAIT environment variable to determine how threads wait. Idle threads can be suspended or can spin-wait.

This variable takes an integer value n. For n less than 0, the threads spin-wait. For n equal to or greater than 0, the threads spin-wait for n milliseconds before being suspended.

By default, idle threads spin-wait briefly after creation or a join. They then suspend themselves if they receive no work.
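The mapping from n to waiting behavior can be written down as a small model. The IdlePolicy type and idle_policy function are hypothetical names used only for illustration; they describe the rule stated above and are not part of the runtime.

```cpp
#include <cassert>

// Illustrative model of MP_IDLE_THREADS_WAIT = n (hypothetical type).
struct IdlePolicy {
    bool spin_forever;  // true when n < 0: threads spin-wait and never suspend
    int  spin_millis;   // spin time before suspending; used when !spin_forever
};

IdlePolicy idle_policy(int n) {
    if (n < 0) return {true, 0};
    return {false, n};  // spin-wait for n milliseconds, then suspend
}
```

With this model, a value of 0 means that idle threads are suspended without spinning first.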


Accessing the Pthreads Library

Pthreads (POSIX threads) refers to the Pthreads library of thread-management routines. For information on Pthread routines see the pthread(3t) man page. To use the Pthread routines, your program must include the <pthread.h> header file and the Pthreads library must be explicitly linked to your program. For example:

aCC -D_POSIX_C_SOURCE=199506L prog.c -lpthread -D_REENTRANT

The -D_POSIX_C_SOURCE=199506L string specifies the appropriate POSIX revision level. In this case, the level is 199506L.
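As a minimal sketch of what such a program might contain, the following creates one thread with pthread_create and waits for it with pthread_join. The function names double_value and run_one_thread are hypothetical; only the pthread calls themselves come from the Pthreads library.

```cpp
#include <cassert>
#include <pthread.h>

// Thread entry point: doubles the int that arg points to.
extern "C" void *double_value(void *arg) {
    int *value = static_cast<int *>(arg);
    *value *= 2;
    return nullptr;
}

// Creates one thread, waits for it to finish, and returns the doubled value,
// or -1 if the thread could not be created or joined.
int run_one_thread(int input) {
    pthread_t tid;
    int value = input;
    if (pthread_create(&tid, nullptr, double_value, &value) != 0) return -1;
    if (pthread_join(tid, nullptr) != 0) return -1;
    return value;  // safe to read: the join orders the thread's write before this
}
```

The join is what makes reading value safe here; without it the calling thread could return before the new thread has run.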


Profiling Parallelized Programs

Profiling a program that has been compiled for parallel execution is performed in much the same way as it is for non-parallel programs:

  1. Compile the program with the option -G.
  2. Run the program to produce profiling data.
  3. Run gprof against the program.
  4. View the output from gprof.

    The differences are:

    • Running the program in Step 2 produces a gmon.out file for the master process and gmon.out.1, gmon.out.2, and so on for each of the slave processes. If your program executes on two processors, Step 2 produces two files, gmon.out and gmon.out.1.
    • The flat profile that you view in Step 4 indicates loops that were parallelized with the following notation:
      
      routine_name##pr_line_0123
      
      

      where routine_name is the name of the routine containing the loop, pr (parallel region) indicates that the loop was parallelized, and 0123 is the line number of the beginning of the loop or loops that are parallelized.
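If you post-process gprof output, the notation can be split back into its parts. The sketch below assumes the name always has exactly the form shown above; parse_profile_name and ParallelRegion are hypothetical names, not part of gprof or HP aC++.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical result type for one entry of the flat profile.
struct ParallelRegion {
    bool        is_parallel;  // true if the name carries the ##pr_line_ marker
    std::string routine;      // routine containing the loop
    int         line;         // line where the parallelized loop begins
};

// Splits "routine_name##pr_line_0123" into routine name and line number.
// Names without the marker are returned unchanged with is_parallel == false.
ParallelRegion parse_profile_name(const std::string &name) {
    const std::string marker = "##pr_line_";
    std::string::size_type pos = name.find(marker);
    if (pos == std::string::npos) return {false, name, 0};
    std::string digits = name.substr(pos + marker.size());
    return {true, name.substr(0, pos), std::atoi(digits.c_str())};
}
```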

Guidelines for Parallelizing HP aC++ Programs
To ensure the best performance from a parallel program, do not run more than one parallel program on a multiprocessor machine at the same time. Running two or more parallel programs simultaneously, or running one parallel program on a heavily loaded system, will slow performance.

You should run a parallel-executing program at a higher priority than any other user program; see rtprio(1) for information about setting real-time priorities.


Conditions Inhibiting Loop Parallelization

The following sections describe conditions that can inhibit parallelization.


Calling Routines with Side Effects

The compiler will not parallelize any loop containing a call to a routine that has side effects. A routine has side effects if it does any of the following:

  • Modifies its arguments.
  • Modifies an extern, static, or global variable.
  • Redefines variables that are local to the calling routine.
  • Performs I/O.
  • Calls another subroutine or function that does any of the above.
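The following sketch contrasts a call without side effects with one that modifies a global variable. All names are hypothetical, and whether a particular compiler can actually prove that a callee is free of side effects (for example, by inlining it) is an assumption that depends on the optimization level and on what it can see of the callee.

```cpp
#include <cassert>

int call_count = 0;  // global state touched by scale_and_count

// No side effects: the result depends only on the argument.
double scale(double x) { return 2.0 * x; }

// Side effect: modifies a global variable on every call.
double scale_and_count(double x) {
    ++call_count;
    return 2.0 * x;
}

// Candidate for parallelization: the only call has no side effects.
void fill_pure(double *out, const double *in, int n) {
    for (int i = 0; i < n; i++) out[i] = scale(in[i]);
}

// Not a candidate under the rule above: the callee updates a global.
void fill_counting(double *out, const double *in, int n) {
    for (int i = 0; i < n; i++) out[i] = scale_and_count(in[i]);
}
```

Both loops compute the same array; only the second also changes program state that other iterations can observe.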

Indeterminate Iteration Counts

If the compiler cannot predict what the runtime loop iteration count is before the loop executes, it does not parallelize the loop. The reason for this precaution is that the runtime code must know the iteration count in order to know how many iterations to distribute to the different processors for execution.

The following conditions can prevent the compiler from determining a runtime iteration count:

  • The loop is an infinite loop.
  • A conditional break statement or goto out of the loop appears in the loop.
  • The loop modifies either the loop-control or loop-limit variable.
  • The loop is a while construct and the condition being tested is defined within the loop.
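As an illustration of the second condition, the two hypothetical functions below differ only in a conditional break. In the first, the iteration count n is fixed before the loop starts; in the second, it depends on the data.

```cpp
#include <cassert>

// Countable: the iteration count (n) is known before the loop executes.
void double_all(int *v, int n) {
    for (int i = 0; i < n; i++) v[i] *= 2;
}

// Not countable: the conditional break makes the iteration count depend on
// the contents of v. Returns the number of elements processed.
int double_until_negative(int *v, int n) {
    int i;
    for (i = 0; i < n; i++) {
        if (v[i] < 0) break;
        v[i] *= 2;
    }
    return i;
}
```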

Data Dependence

When a loop is parallelized, the iterations are executed independently on different processors, and the order of execution differs from the serial order that occurs on a single processor. This reordering is not a problem when the iterations can be executed in any order with no effect on the results. Consider the following loop:

for (i=0; i<5; i++) a[i] = a[i] * b[i];

In this example, the array a would always end up with the same data regardless of whether the order of execution were 0-1-2-3-4, 4-3-2-1-0, 3-1-4-0-2, or any other order. The independence of each iteration from the others makes the loop an eligible candidate for parallelization.

Such is not the case in the following:

for (i=1; i<5; i++) a[i] = a[i-1] * b[i];

In this loop, the order of execution does matter. The data used in iteration i depends on the data produced in the previous iteration (i-1). The array a would end up with very different data if the order of execution were any other than 1-2-3-4. The data dependence in this loop thus makes it ineligible for parallelization.
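The difference between the two loops can be checked by running each one in an explicit iteration order. This is a serial sketch that only simulates reordering; run_independent and run_dependent are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Independent iterations: a[i] = a[i] * b[i], visited in the given order.
std::vector<int> run_independent(std::vector<int> a, const std::vector<int> &b,
                                 const std::vector<int> &order) {
    for (int i : order) a[i] = a[i] * b[i];
    return a;
}

// Dependent iterations: a[i] = a[i-1] * b[i], visited in the given order.
// Every index in order must be at least 1.
std::vector<int> run_dependent(std::vector<int> a, const std::vector<int> &b,
                               const std::vector<int> &order) {
    for (int i : order) {
        assert(i >= 1);
        a[i] = a[i - 1] * b[i];
    }
    return a;
}
```

With a = {1, 2, 3, 4, 5} and every element of b equal to 2, the independent loop yields {2, 4, 6, 8, 10} in any order. The dependent loop yields {1, 2, 4, 8, 16} in the order 1-2-3-4 but {1, 2, 4, 6, 8} in the order 4-3-2-1.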

Not all data dependences must inhibit parallelization. The following paragraphs discuss some of the exceptions.

Nested Loops and Matrices

Some nested loops that operate on matrices may have a data dependence in the inner loop only, allowing the outer loop to be parallelized. Consider the following:


for(i=0; i<10; i++)
  for(j=1; j<100; j++)
    a[i][j]=a[i][j-1] + 1;

The data dependence in this nested loop occurs in the inner (j) loop: each access of a[i][j] depends upon the preceding element a[i][j-1] of the same row having been assigned in the previous iteration. If the iterations of the j loop were to execute in any other order than the one in which they would execute on a single processor, the matrix would be assigned different values. The inner loop, therefore, must not be parallelized.

But no such data dependence appears in the outer loop: each row is independent of every other row. Consequently, the compiler can safely distribute entire rows of the matrix to execute on different processors; the data assignments will be the same regardless of the order in which the rows are executed, so long as the elements within each row are assigned in serial order.
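The claim about the outer loop can be checked with a serial sketch that visits the values of the outer index i in an arbitrary order while keeping the inner j loop in serial order. Matrix, fill_row, and fill_rows are hypothetical names.

```cpp
#include <cassert>
#include <vector>

const int ROWS = 10, COLS = 100;
typedef std::vector<std::vector<int> > Matrix;

// Runs the inner (j) loop in serial order for one value of the outer index i.
void fill_row(Matrix &a, int i) {
    for (int j = 1; j < COLS; j++) a[i][j] = a[i][j - 1] + 1;
}

// Visits the outer index in the given order. Because no iteration of the
// outer loop reads data written by another, the result does not depend
// on that order.
Matrix fill_rows(Matrix a, const std::vector<int> &outer_order) {
    for (int i : outer_order) fill_row(a, i);
    return a;
}
```

Calling fill_rows with the outer indices in ascending and in descending order produces identical matrices, which is the property that lets each processor take a different value of i.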


Assumed Dependencies

When analyzing a loop, the compiler errs on the safe side: it assumes that what looks like a data dependence really is one, and so it does not parallelize the loop. Consider the following:


for (i=100; i<200; i++) a[i] = a[i-k];

The compiler assumes that a data dependence exists in this loop because it appears that data that has been defined in a previous iteration is being used in a later iteration. However, if the value of k is 100, the dependence is assumed rather than real because a[i-k] is defined outside the loop.
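The effect of k can be checked by comparing the in-place loop with a version that reads every value from an untouched copy of the array, which is what the iterations would see if none depended on another. Both function names are hypothetical, and the array is assumed to hold at least 200 elements.

```cpp
#include <cassert>
#include <vector>

// Runs a[i] = a[i-k] for i in [100, 200) in place, as in the loop above.
std::vector<int> shift_in_place(std::vector<int> a, int k) {
    assert(k >= 0 && k <= 100 && a.size() >= 200);  // keep a[i-k] in bounds
    for (int i = 100; i < 200; i++) a[i] = a[i - k];
    return a;
}

// Same assignment, but all reads come from the original, unmodified array.
std::vector<int> shift_from_snapshot(const std::vector<int> &a, int k) {
    assert(k >= 0 && k <= 100 && a.size() >= 200);
    std::vector<int> out = a;
    for (int i = 100; i < 200; i++) out[i] = a[i - k];
    return out;
}
```

For k equal to 100 the two functions agree, because every value read lies in a[0] through a[99], which the loop never writes. For k equal to 1 they differ, because each iteration reads the element written by the one before it.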

Memory Classes

In order to use memory classes in C++ programs, you must include the header file /usr/include/spp_prog_model.h. Memory classes are described in the Parallel Programming Guide for HP-UX Systems.

In C++, the general form for assigning memory is:

#include <spp_prog_model.h>

        .   .  .

[storage_class_specifier] memory_class_name type_specifier namelist

where:

storage_class_specifier
specifies a non-automatic storage class

memory_class_name
is thread_private or node_private

type_specifier
is a data type (for example, int or float)

namelist
is a comma-separated list of variables and/or arrays of type type_specifier

Data objects that are assigned a memory class must have static storage duration. If the object is declared within a function, it must have the storage class extern or static. Data objects declared at file scope and assigned a memory class need not specify a storage class.

A hypernode is a set of processors and physical memory organized as a symmetric multiprocessor (SMP) running a single image of the operating system microkernel.


node_private namelist

This memory class causes the variables and arrays specified in namelist to be replicated in the physical memory of each hypernode on which the process is executing. While each data object has a single image in virtual memory, it maps to a different physical location on each hypernode. The threads of a process within a hypernode all share access to the copy on their hypernode and cannot access the copies on other hypernodes.


thread_private namelist

This memory class causes the variables and arrays specified in namelist to be treated as thread_private. These data objects map to unique node_private addresses for each thread of a process. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

Synchronization Functions

HP aC++ provides functions that can be used with pragmas to achieve synchronization.

Gates allow you to restrict execution of a block of code to a single thread. They can be allocated, locked, unlocked, or deallocated. They can also be used with the ordered or critical section pragmas, which automate the locking and unlocking functions.

Barriers block further execution until all executing threads reach the barrier.

You declare gates and barriers by using the following type definitions:

gate_t namelist
declares variables to use in a critical section or ordered section, or to pass as arguments to the synchronization functions

barrier_t namelist
declares a list of synchronization variables for the barrier routines

namelist is a comma-separated list of one or more gate or barrier names.

Gates and barriers should only appear in definition and declaration statements, and as formal and actual arguments.


Allocate Functions

These functions allocate memory for a gate or barrier. When memory is first allocated, gate variables are unlocked.

int alloc_gate(gate_t *gate_p);

int alloc_barrier(barrier_t *barrier_p);

gate_p and barrier_p are pointers of the indicated type, which have been previously declared as described above.


Deallocate Functions

These functions free the memory assigned to the specified gate or barrier variable. These functions have the following declarations:

int free_gate(gate_t *gate_p);

int free_barrier(barrier_t *barrier_p);

where gate_p and barrier_p are pointers of the indicated type. Always free gates and barriers when you are done using them.

 


Locking Functions

These functions acquire a gate for exclusive access. If the gate cannot be immediately acquired, the calling thread waits for it. The conditional locking functions, which are prefixed with COND_ or cond_, acquire a gate if doing so does not require a wait. If the gate is acquired, the functions return 0; if not, they return -1.

The functions have the following declarations:

int lock_gate(gate_t *gate_p);

int cond_lock_gate(gate_t *gate_p);

where gate_p is a pointer of the indicated type.


Unlocking Function

This function releases a gate from exclusive access. Gates are typically released by the thread that locks them, unless a gate was locked by thread 0 in serial code. In that case it might be unlocked by a single different thread in a parallel construct.

The function has the following declaration:

int unlock_gate(gate_t *gate_p);

where gate_p is a pointer of the indicated type.
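The locking protocol can be illustrated without the HP library by using a portable stand-in. SketchGate and the sketch_ functions below are hypothetical and are built on std::atomic; they are not the HP gate_t type or its functions and only mirror the semantics described above, including the 0 and -1 return values of the conditional lock.

```cpp
#include <atomic>
#include <cassert>

// Stand-in for a gate (NOT the HP gate_t type). Starts unlocked.
struct SketchGate {
    std::atomic<bool> locked{false};
};

// Conditional lock: returns 0 if the gate was acquired without waiting,
// -1 if another holder already has it.
int sketch_cond_lock(SketchGate *g) {
    return g->locked.exchange(true, std::memory_order_acquire) ? -1 : 0;
}

// Blocking lock: spins until the gate is acquired.
int sketch_lock(SketchGate *g) {
    while (g->locked.exchange(true, std::memory_order_acquire)) { /* spin */ }
    return 0;
}

// Releases the gate so that another thread can acquire it.
int sketch_unlock(SketchGate *g) {
    g->locked.store(false, std::memory_order_release);
    return 0;
}
```

A typical pattern is to call the conditional lock first and do other useful work when it returns -1, rather than blocking in the unconditional lock.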


Wait Function

This function uses a barrier to cause the calling thread to wait until the specified number of threads call the function, at which point all threads are released from the function simultaneously. The function has the following declaration:

int wait_barrier(barrier_t *barrier_p, const int *nthr);

where barrier_p is a pointer of the indicated type and nthr is a pointer referencing the number of threads calling the routine.

You can use a barrier variable in multiple calls to the wait_barrier() function, as long as you ensure that two barriers are not active at the same time. Also, make sure that nthr reflects the correct number of threads.
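The behavior of a reusable barrier can be sketched with standard C++ threads. SketchBarrier and run_barrier_demo are hypothetical names; the class is a stand-in built on a mutex and a condition variable, not the HP barrier_t type, and it takes the thread count by value rather than through a pointer.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for a reusable barrier (NOT the HP barrier_t type).
class SketchBarrier {
public:
    // Blocks until nthr threads have called wait(), then releases them all.
    void wait(int nthr) {
        std::unique_lock<std::mutex> lock(m_);
        unsigned long my_generation = generation_;
        if (++arrived_ == nthr) {
            arrived_ = 0;       // reset so the barrier can be used again
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return generation_ != my_generation; });
        }
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int arrived_ = 0;
    unsigned long generation_ = 0;
};

// Each thread writes its own slot, waits at the barrier, then reads its
// neighbor's slot. Returns true if every thread saw its neighbor's write.
bool run_barrier_demo(int nthr) {
    SketchBarrier barrier;
    std::vector<int> slots(nthr, 0), seen(nthr, 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < nthr; t++) {
        threads.emplace_back([&, t] {
            slots[t] = t + 1;
            barrier.wait(nthr);   // no thread reads before all have written
            seen[t] = slots[(t + 1) % nthr];
        });
    }
    for (std::thread &th : threads) th.join();
    for (int t = 0; t < nthr; t++)
        if (seen[t] != (t + 1) % nthr + 1) return false;
    return true;
}
```

Passing a thread count that does not match the number of callers would leave the waiting threads blocked, which is why the count must be correct.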