HP C/HP-UX Online Help



NOTE: See the Compiling & Running HP C Programs section of the HP C Online Help for a quick reference of all HP C compiler options and pragmas. See the Optimizing HP C Programs section of the HP C Online Help for a detailed description of HP C optimization options and pragmas.

Parallel Options and Pragmas

Getting Started with Parallelizing C Programs
Guidelines for Parallelizing C Programs
Parallel Processing Options
Parallel Processing Pragmas
OpenMP Pragmas
Memory Classes
Synchronization Functions

HP C generates efficient parallel code by default. You can increase the amount of code the compiler can parallelize on multiprocessor systems by using options, pragmas, and supporting library calls. Applications running on HP 9000 K-Class and V-Class servers can benefit from the parallelization features described in this section.

For detailed information and examples, see the Parallel Programming Guide for HP-UX Systems.

Getting Started with Parallelizing C Programs

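This section describes the basic tasks required to get started with parallelizing C programs: transforming loops for parallel execution, setting the number of threads, determining idle thread states, accessing the Pthreads library, and profiling parallelized programs.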

Transforming Loops for Parallel Execution (+Oparallel)

The +Oparallel option causes the compiler to transform eligible loops for parallel execution on multiprocessor machines.

The following command lines compile (without linking) three source files: x.c, y.c, and z.c. The files x.c and y.c are compiled for parallel execution. The file z.c is compiled for serial execution, even though its object file will be linked with x.o and y.o.

cc +O3 +Oparallel -c x.c y.c
cc +O3 -c z.c

The following command line links the three object files, producing the executable file para_prog:

cc +O3 +Oparallel -o para_prog x.o y.o z.o

As this command line implies, if you link and compile separately, you must use cc, not ld. The command line to link must also include the +Oparallel and +O3 options in order to link in the right startup files and runtime support.

Setting the Number of Threads Used in Parallel

Use the MP_NUMBER_OF_THREADS environment variable to set the number of processors that are to execute your program in parallel. If you do not set this variable, it defaults to the number of processors on the executing machine.

From the C shell, the following command sets MP_NUMBER_OF_THREADS to indicate that programs compiled for parallel execution can execute on two processors:

setenv MP_NUMBER_OF_THREADS 2

If you use the Korn shell, the command is:

export MP_NUMBER_OF_THREADS=2

Determining Idle Thread States

Use the MP_IDLE_THREADS_WAIT environment variable to determine how threads wait. Idle threads can be suspended or can spin-wait.

This variable takes an integer value n. For n less than 0, the threads spin-wait. For n equal to or greater than 0, the threads spin-wait for n milliseconds before being suspended.

By default, idle threads spin-wait briefly after creation or a join. They then suspend themselves if they receive no work.

Accessing the Pthreads Library

Pthreads (POSIX threads) refers to the Pthreads library of thread-management routines. For information on Pthread routines see the pthread(3t) man page.

To use the Pthread routines, your program must include the <pthread.h> header file, and the Pthreads library must be explicitly linked to your program. For example:

% cc -D_POSIX_C_SOURCE=199506L prog.c -lpthread

The -D_POSIX_C_SOURCE=199506L string specifies the appropriate POSIX revision level. In this case, the level is 199506L.

Profiling Parallelized Programs

Profiling a program that has been compiled for parallel execution is performed in much the same way as it is for non-parallel programs:
  1. Compile the program with the option -G.
  2. Run the program to produce profiling data.
  3. Run gprof against the program.
  4. View the output from gprof.
The differences are:

Guidelines for Parallelizing C Programs

This section describes guidelines for parallelizing C programs.

To ensure the best performance from a parallel program, do not run more than one parallel program on a multiprocessor machine at the same time. Running two or more parallel programs simultaneously, or running one parallel program on a heavily loaded system, will slow performance.

You should run a parallel-executing program at a higher priority than any other user program; see rtprio(1) for information about setting real-time priorities.

Conditions Inhibiting Loop Parallelization

The following sections describe different conditions that can inhibit parallelization.

Calling Routines with Side Effects

The compiler will not parallelize any loop containing a call to a routine that has side effects. A routine has side effects if it does any of the following:

Indeterminate Iteration Counts

If the compiler cannot determine what the runtime loop iteration count is before the loop executes, it does not parallelize the loop. The reason for this limitation is that the runtime code must know the iteration count in order to know how many iterations to distribute to the different processors for execution.

The following conditions can prevent a runtime count:

Data Dependence

When a loop is parallelized, the iterations are executed independently on different processors, and the order of execution differs from the serial order that occurs on a single processor. This effect of parallelization is not a problem when the iterations could be executed in any order with no effect on the results. Consider the following loop:
for (i=0; i<5; i++)
    a[i] = a[i] * b[i];
In this example, the array a would always end up with the same data regardless of whether the order of execution were 0-1-2-3-4, 4-3-2-1-0, 3-1-4-0-2, or any other order. The independence of each iteration from the others makes the loop an eligible candidate for parallelization.

Such is not the case in the following:

for (i=1; i<5; i++)
    a[i] = a[i-1] * b[i];
In this loop, the order of execution does matter. The data used in iteration i is dependent upon the data that was produced in the previous iteration [i-1]. a would end up with very different data if the order of execution were any other than 1-2-3-4. The data dependence in this loop thus makes it ineligible for parallelization.

Not all data dependences must inhibit parallelization. The following paragraphs discuss some of the exceptions.

Nested Loops and Matrices

Some nested loops that operate on matrices may have a data dependence in the inner loop only, allowing the outer loop to be parallelized. Consider the following:
for (i=0; i<10; i++)
    for (j=1; j<100; j++)
         a[i][j] = a[i][j-1] + 1;
The data dependence in this nested loop occurs in the inner [j] loop: each element a[i][j] depends upon the preceding element a[i][j-1] of the same row having been assigned in the previous iteration. If the iterations of the [j] loop were to execute in any other order than the one in which they would execute on a single processor, the matrix would be assigned different values. The inner loop, therefore, must not be parallelized.

But no such data dependence appears in the outer loop: each row is independent of every other row. Consequently, the compiler can safely distribute entire rows of the matrix to execute on different processors; the data assignments will be the same regardless of the order in which the rows are executed, so long as the elements within each row are assigned in serial order.

Assumed Dependencies

When analyzing a loop, the compiler errs on the safe side: it assumes that what looks like a data dependence really is one, and so it does not parallelize the loop. Consider the following:
for (i=100; i<200; i++)
    a[i] = a[i-k];
The compiler assumes that a data dependence exists in this loop because it appears that data that has been defined in a previous iteration is being used in a later iteration. However, if the value of k is 100, the dependence is assumed rather than real because a[i-k] is defined outside the loop.

Parallel Processing Options

HP C provides the following optimization options for parallelizing C programs:

+O[no]autopar

Optimization level(s): 3, 4

Default: +Oautopar if +Oparallel is enabled

When used with +Oparallel, the +Onoautopar option causes the compiler to parallelize only those loops marked by the loop_parallel or prefer_parallel pragmas. Because the compiler does not automatically find parallel tasks or regions, user-specified task and region parallelization is not affected by this option.

A loop is safe to parallelize if it has an iteration count that can be determined at runtime before loop invocation, and contains no loop-carried dependences, procedure calls, or I/O operations. A loop-carried dependence exists when one iteration of a loop assigns a value to an address that is referenced or assigned on another iteration.

+O[no]dynsel

Optimization level(s): 3, 4

Default: +Odynsel if +Oparallel is enabled

When specified with +Oparallel, +Odynsel (the default) enables workload-based dynamic selection. For parallelizable loops whose iteration counts are known at compile time, +Odynsel causes the compiler to generate either a parallel or a serial version of the loop, depending on which is more profitable.

This optimization also causes the compiler to generate both parallel and serial versions of parallelizable loops whose iteration counts are unknown at compile time. At runtime, the loop workload is compared to parallelization overhead, and the parallel version is run only if it is profitable to do so.

The +Onodynsel option disables dynamic selection and tells the compiler that it is profitable to parallelize all parallelizable loops. The dynsel pragma can be used to enable dynamic selection for specific loops when +Onodynsel is in effect.

See Also: dynsel[(trip_count=n)]

+O[no]loop_block

Optimization level(s): 3, 4

Default: +Onoloop_block

The +O[no]loop_block option enables [disables] blocking of eligible loops for improved cache performance. The +Onoloop_block option disables automatic and directive-specified loop blocking. For more information on loop blocking, see the Parallel Programming Guide for HP-UX Systems.

+O[no]loop_unroll_jam

Optimization level(s): 3, 4

Default: +Onoloop_unroll_jam

The +O[no]loop_unroll_jam option enables [disables] loop unrolling and jamming. The +Onoloop_unroll_jam option disables both automatic and directive-specified unroll and jam. Loop unrolling and jamming increases register exploitation. For more information on the unroll and jam optimization, see the Parallel Programming Guide for HP-UX Systems.

+O[no]parallel

Optimization level(s): 3, 4

Default: +Onoparallel

The +Oparallel option optimizes the time it takes to execute a single process running on a multiprocessor system.


NOTE  If you compile one or more files in an application using +Oparallel, then the application must be linked (using the compiler driver) with the +Oparallel option to link in the proper start-up files and runtime support.

The +Oparallel option causes the compiler to:

The following methods can be used to specify the number of processors used in executing your parallel programs:

The +Oparallel option disables +Ofailsafe.

See Also: Transforming Loops for Parallel Execution (+Oparallel).

+O[no]report[=report_type]

Optimization level(s): 3, 4

Default: +Onoreport

This option causes the compiler to display various optimization reports. +Onoreport is the default. The value of report_type determines which report is displayed, as described below.

+Oreport=loop produces the Loop Report. This report gives information on optimizations performed on loops and calls. Using +Oreport (without =report_type) also produces the Loop Report.

+Oreport=private produces the Loop Report and the Privatization Table, which provides information on loop variables that are privatized by the compiler.

+Oreport=all produces all reports.

The +Oreport[=report_type] option is active only at +O3 and above. The +Onoreport option does not accept any of the report_type values. See the Parallel Programming Guide for HP-UX Systems for more information on the optimization reports.

+O[no]sharedgra

Optimization level(s): 2, 3, 4

Default: +Osharedgra

The +Onosharedgra option disables global register allocation for shared-memory variables that are visible to multiple threads. This option can help if a variable shared among parallel threads is causing wrong answers. See the Parallel Programming Guide for HP-UX Systems for more information.

Parallel Processing Pragmas

This section describes the following Parallel Processing Pragmas:


The syntax of a parallel processing pragma is:

#pragma [_CNX] pragma-list

where:

See Specifying Task Parallelism for an example on using these pragmas.

In the sections that follow, namelist represents a comma-separated list of variables or arrays. The occurrence of a lowercase n or m is used to indicate an integer constant. Occurrences of gate_var are for variables that have been, or are being, defined as gates.

begin_tasks[(attribute_list)]

This pragma defines the beginning of sections of code (see next_task) that are to be executed as independent, parallel tasks. Each task is executed by a separate thread. begin_tasks must have an accompanying end_tasks in the same program unit.

The optional attribute_list can be any of the following legal combinations (m is an integer constant):

Attributes may be listed in any order. The compiler flags any attribute combinations other than those listed above with a warning and ignores the pragma.

Refer to the Parallel Programming Guide for HP-UX Systems for a complete discussion of parallel tasking.

block_loop[(block_factor=n)]

This pragma indicates a specific loop to block, and optionally, the block factor n (n must be an integer constant greater than or equal to 2) that is to be used in the compiler's internal computation of loop nest based data reuse. If no block_factor is specified, the compiler uses a heuristic to determine the block_factor. Refer to the Parallel Programming Guide for HP-UX Systems for more information on blocking.

critical_section[(gate_var)]

This pragma defines the beginning of a code block in which only one thread may be executing at a time. The end of the code block must be indicated by an end_critical_section pragma, which must appear in the same flow of control within the same program unit. The optional gate_var can be used to implement a critical section which is not contiguous at the source level. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

dynsel[(trip_count=n)]

This pragma enables workload-based dynamic selection for the immediately following loop. trip_count represents either the thread_trip_count or node_trip_count attribute, and n is an integer constant.

When thread_trip_count=n is specified, the serial version of the loop is run if the iteration count is less than n; otherwise, the thread-parallel version is run. When node_trip_count=n is specified, the serial version of the loop is run if the iteration count is less than n; otherwise, the node-parallel version is run, assuming +Onodepar is specified.

end_critical_section

This pragma defines the end of the critical section that was begun with the critical_section pragma. critical_section and end_critical_section must appear as a pair. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

end_ordered_section

This pragma defines the end of the ordered section that was begun with the ordered_section pragma. ordered_section and end_ordered_section must appear as a pair. Refer to the Parallel Programming Guide for HP-UX Systems for more information on ordered sections.

end_parallel

This pragma signifies the end of a parallel region. The parallel pragma signifies the beginning of a parallel region. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

end_tasks

This pragma terminates the specification of parallel tasks indicated by begin_tasks and next_task. It must appear at the end of the last section of parallel code defined by these pragmas. All of these must appear in the same program unit. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

loop_parallel[(attribute_list)]

This pragma is an explicit instruction to the compiler to parallelize the immediately following loop. The loop iterations are run in an indeterminate order unless the optional ordered attribute appears. You are responsible for any required data privatization and loop synchronization, as described in the Parallel Programming Guide for HP-UX Systems.

The optional attribute_list can be any of the following combinations (n and m are integer constants):

The ivar=indvar attribute is:

Attributes may be listed in any order. The compiler flags any attribute combinations other than those listed above with a warning and ignores the pragma.

Refer to the Parallel Programming Guide for HP-UX Systems for more information.

loop_private(namelist)

This pragma declares a list of variables and/or arrays private to the immediately following loop. No values may be carried into the loop by loop_private variables. To be loop private, the variables and/or arrays must be assigned before they are used on each iteration of the immediately following loop. These private data items should be treated as distinct objects from the shared items of the same name that exist outside the loop. Values assigned to loop_private variables on the final iteration (that is, the nth iteration of a loop with n iterations) may be saved into the shared variables of the same name if the save_last pragma also appears on this loop. If save_last is not used, then the value of any shared variable declared to be loop_private is undefined at loop termination. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

next_task

This pragma starts a block of code following a begin_tasks block that will be executed as a parallel task. The end of the code block is marked by another next_task or by an end_tasks pragma.

This pragma must appear within a begin_tasks and end_tasks pair. There is no limit on the number of next_task pragmas that can appear. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

no_block_loop

This pragma disables loop blocking on the immediately following loop. Refer to the Parallel Programming Guide for HP-UX Systems for more information on loop blocking.

no_distribute

This pragma disables loop distribution for the immediately following loop. Refer to the Parallel Programming Guide for HP-UX Systems for more information on loop distribution.

no_dynsel

This pragma disables workload-based dynamic selection for the immediately following loop. Refer to the Parallel Programming Guide for HP-UX Systems for more information on dynamic selection.

no_loop_dependence(namelist)

This pragma informs the compiler that the arrays in namelist do not have any dependencies for iterations of the immediately following loop. Use no_loop_dependence for arrays only; use loop_private to indicate dependence-free scalar variables.

This pragma causes the compiler to ignore any dependences that it perceives to exist. This can enhance the compiler's ability to optimize the loop, including the possibility of parallelization.

Refer to the Parallel Programming Guide for HP-UX Systems for more information.

no_loop_transform

This pragma prevents the compiler from performing reordering transformations on the following loop. The compiler does not distribute, fuse, block, interchange, unroll, unroll and jam, or parallelize a loop on which this pragma appears. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

no_parallel

This pragma prevents the compiler from generating parallel code for the immediately following loop. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

no_side_effects(funclist)

This pragma (#pragma _CNX no_side_effects) informs the compiler that the functions appearing in funclist have no side effects wherever they appear lexically following the pragma. Side effects include modifying a function argument, performing I/O, or calling another routine that does any of the above. The compiler can sometimes eliminate calls to procedures that have no side effects; also, the compiler may be able to parallelize loops with calls when informed that the called routines do not have side effects.

ordered_section(gate_var)

This pragma defines the beginning of an ordered section. An ordered section is the same as a critical section (a code block in which only one thread may be executing at a time) with the additional restriction that the threads must pass through the ordered section in iteration order of the most recently initiated parallelized loop. The end of the code block must be indicated by an end_ordered_section pragma. Ordered sections must appear within the control flow of a loop_parallel (ordered) pragma. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

parallel[(attribute_list)]

This pragma signifies the beginning of a parallel region of code. All code up to the following end_parallel pragma will be run on all available threads. No loop transformations, data privatization, or parallelization analysis will be performed by the compiler on the region.

The optional attribute_list can be any of the following legal combinations (m is an integer constant):

Attributes may be listed in any order. The compiler flags any attribute combinations other than those listed above with a warning and ignores the pragma.

Refer to the Parallel Programming Guide for HP-UX Systems for more information.

parallel_private(namelist)

This pragma declares a list of variables or arrays private to the immediately following parallel region. It serves the same purpose for parallel regions that task_private serves for tasks. The privatized variables and arrays will not carry their values beyond the end_parallel pragma. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

prefer_parallel[(attribute_list)]

This pragma instructs the compiler to parallelize the following loop, but only if it is safe to do so. A loop is safe to parallelize if it has an iteration count that can be determined at runtime before loop invocation and contains no loop-carried dependences, procedure calls, or I/O operations. (A loop-carried dependence exists when one iteration of a loop assigns a value to an address that is referenced or assigned on another iteration.) Refer to the Parallel Programming Guide for HP-UX Systems for more information.

The optional attribute_list can be any of the following combinations (n and m are integer constants):

Attributes may be listed in any order. The compiler flags any attribute combinations other than those listed above with a warning and ignores the pragma.

save_last[(list)]

This pragma specifies that the variables in the comma-separated list that are also named in an associated loop_private(namelist) pragma must have their last values saved into the "shared" variable of the same name at loop termination. (A variable's last value in a loop of n iterations is the value it is assigned in the nth iteration.)

If the optional list is not used, save_last specifies that all variables named in an associated loop_private(namelist) pragma must have their last values saved into the "shared" variable of the same name at loop termination.

If save_last is not specified then the values in any privatized variables or arrays are indeterminate at loop termination. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

scalar

This pragma prevents the compiler from performing reordering transformations on the following loop. The compiler does not distribute, fuse, block, interchange, unroll, unroll and jam, or parallelize a loop on which this pragma appears.

The no_loop_transform pragma provides the same functionality as the scalar pragma and is recommended in place of the scalar pragma.

task_private(namelist)

This pragma privatizes the variables and arrays specified in namelist for each task specified in the immediately following begin_tasks/end_tasks block. If a task_private data object is referenced within a task, it must have been assigned a value previously in that task. The privatized variables and arrays do not carry their values beyond the end_tasks pragma. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

Specifying Task Parallelism

The following example uses the begin_tasks, task_private, next_task, and end_tasks pragmas to specify simple task-parallelism:
/* one thread executes the for loop */
#pragma begin_tasks, task_private(i)
 
for(i=0;i<n-1;i++)
    a[i] = a[i +1] + b[i];
 
/* another thread executes the function call */
#pragma next_task
 
tsub(x,y);
 
/* a third thread assigns elements of array d to every
   other element of c */
#pragma next_task
 
for(i=0;i<500;i++)
    c[i*2]=d[i];
 
#pragma end_tasks
The loop induction variable i is manually privatized because it is used to control loops in two different tasks. If i were not private, both tasks would modify it, causing wrong answers. The task_private pragma is described in task_private(namelist).

OpenMP Pragmas

OpenMP is an industry-standard parallel programming model which implements a fork-join model of parallel execution. The HP C OpenMP pragmas included in this release are based on the OpenMP Standard for C, version 1.0.

For details about the standard, including usage, syntax, and values, see the OpenMP website. The C/C++ Version 1.0 OpenMP standard can be downloaded from that website in PostScript (ps) or Adobe Acrobat (PDF) format.

OpenMP pragmas have the following options:


+Oopenmp command line option

The OpenMP driver option +Oopenmp is added to this release of the HP C compiler.


NOTE  The +Oopenmp option is accepted at all optimization levels. However, most of the OpenMP pragmas need a minimum optimization level of +O3. To ensure that OpenMP pragmas are recognized, you must specify +O3 on the command line.

When +Oopenmp is seen in the command line, +Onodynsel, +Oparallel, +Onofailsafe, and +Onoautopar are passed by default to the cc driver.


NOTE  +Oopenmp overrides +Odynsel, +Ofailsafe and +Onoparallel.

When +Oopenmp is used, most of the HP Programming Model (HPPM) pragmas are not accepted. The following HPPM pragmas are accepted by the HP C compiler when +Oopenmp is issued.

New Header File

Every C program that contains OpenMP pragmas is to be compiled for the current version of HP-UX and must include the header file <omp.h>. If it does not, the OpenMP pragmas will be ignored. The default path for <omp.h> is /usr/include.

OpenMP macro _OPENMP

The _OPENMP macro name is defined by an OpenMP-compliant implementation as the decimal constant yyyymm, which is the year and month of the approved specification. This macro must not be the subject of a #define or #undef preprocessing directive.
#ifdef _OPENMP
iam = omp_get_thread_num() + index;
#endif

OpenMP Pragmas

The following work sharing and synchronization pragmas along with the listed clauses are available with HP C compiler.

Work sharing pragmas:

Synchronization pragmas:

A directive that controls the data environment during execution of parallel regions is:

Each of the pragmas available in this release of the HP C compiler is discussed in brief below.

OMP PARALLEL

The OMP PARALLEL pragma defines a parallel region, which is a region of the program that is executed by multiple threads in parallel. This is the fundamental construct that starts parallel execution.
#pragma OMP PARALLEL [clause1, clause2,...] new-line
        structured-block
where [clause1, clause2,...] indicates that the clauses are optional. There can be zero (0) or more clauses, where clause may be one of the following:

OMP FOR

The OMP FOR pragma for HP C identifies a construct that specifies a region in which the iterations of the associated loop should be executed in parallel. The iterations of the loop are distributed across threads that already exist.
#pragma OMP FOR [clause1, clause2, ...] new-line
where [clause1, clause2,...] indicates that the clauses are optional. There can be zero (0) or more clauses, where clause may be one of the following:
NOTE: chunksize should be an integer constant. Expressions in place of chunksize are not supported. The schedule type can be static, dynamic, guided, or runtime.

OMP SECTION/OMP SECTIONS

The OMP SECTION/SECTIONS pragmas identify a construct that specifies a set of constructs to be divided among threads in a team. Each section is executed by one of the threads in the team.
#pragma OMP SECTIONS [clause1, clause2, ...] new-line
{
#pragma OMP SECTION new-line
           structured-block
#pragma OMP SECTION new-line
           structured-block
.
.
.
}
where [clause1, clause2, ...] indicates that the clauses are optional. There can be zero (0) or more clauses, where clause may be one of the following:

OMP PARALLEL FOR

The OMP PARALLEL FOR pragma for HP C is a shortcut for an OMP PARALLEL region that contains a single OMP FOR pragma.
#pragma OMP PARALLEL FOR [clause1, clause2, ...] new-line
        for-loop
OMP PARALLEL FOR admits all the allowable clauses of the OMP PARALLEL pragma and the OMP FOR pragma.

OMP PARALLEL SECTIONS

The OMP PARALLEL SECTIONS pragma for HP C is a shortcut for specifying a parallel region containing a single OMP SECTIONS pragma.
#pragma OMP PARALLEL SECTIONS [clause1, clause2, ...] new-line
{
[#pragma OMP SECTION new-line]
         structured-block
[#pragma OMP SECTION new-line
         structured-block
.
.
.]
}
OMP PARALLEL SECTIONS admits all the allowable clauses of the OMP PARALLEL pragma and the OMP SECTIONS pragma. The PRIVATE clause is supported.

OMP SINGLE

The OMP SINGLE directive identifies a construct that specifies the associated structured block is executed by only one thread in the team (not necessarily the master thread).
#pragma OMP SINGLE [clause[clause] . . .] new-line 
                         structured-block
where [clause] is one of the following:
NOTE: SINGLE does not take any associated clauses in this release of HP C compiler.

OMP CRITICAL

The OMP CRITICAL pragma identifies a construct that restricts the execution of the associated structured block to one thread at a time.
#pragma OMP CRITICAL [ (name)] new-line
                structured-block
The critical section name parameter is optional. All unnamed critical sections globally map to a single name; this is provided by the HP C compiler.

OMP BARRIER

The OMP BARRIER pragma synchronizes all the threads in a team. When encountered, each thread waits until all the threads in the team have reached that point.
#pragma OMP BARRIER new-line

OMP ORDERED

The OMP ORDERED pragma indicates that the following structured block should be executed in the same order in which iterations would be executed in a sequential loop.
#pragma OMP ORDERED new-line
           structured-block
The OMP ORDERED pragma must appear within an OMP FOR or OMP PARALLEL FOR loop. When the ORDERED clause is used with a SCHEDULE clause that has a chunksize, the compiler ignores the chunksize.

OMP ATOMIC

The OMP ATOMIC directive ensures that a specific memory location is updated atomically, rather than exposing it to the possibility of multiple simultaneous writing threads.
#pragma OMP ATOMIC new-line 
          expression stmt
where expression stmt must have one of the following forms:

where, in the above expressions:

OMP FLUSH

The OMP FLUSH directive, whether explicit or implied, specifies a cross-thread sequence point at which the implementation is required to ensure that all threads in a team have a consistent view of certain objects in the memory.
#pragma OMP FLUSH [(list)] new-line
A FLUSH directive without a (list) is implied for the following directives: The directive is not implied if a NOWAIT clause is present.

OMP MASTER

The OMP MASTER pragma for HP C directs that the structured block following it should be executed by the master thread (thread 0) of the team.
#pragma OMP MASTER new-line
       structured-block

Other threads in the team do not execute the associated block.

OMP NOWAIT

The NOWAIT clause, when used, removes the implicit barrier synchronization at the end of a FOR or SECTIONS construct.
#pragma omp for nowait
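
A minimal sketch of where nowait is safe: the two loops below touch different arrays, so a thread that finishes its share of the first loop can start on the second without waiting for the rest of the team. The function name scale_two and its parameters are hypothetical.

```c
/* Multiplies every element of a[0..na-1] and b[0..nb-1] by k. */
void scale_two(double *a, int na, double *b, int nb, double k)
{
    #pragma omp parallel
    {
        int i;   /* declared inside the region, so private to each thread */

        /* No barrier here: the second loop does not depend on a[]. */
        #pragma omp for nowait
        for (i = 0; i < na; i++)
            a[i] *= k;

        /* The implicit barrier at the end of this loop is kept. */
        #pragma omp for
        for (i = 0; i < nb; i++)
            b[i] *= k;
    }
}
```

If the second loop read values written by the first, removing the barrier would introduce a race, and nowait should not be used.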

OMP THREADPRIVATE

The OMP THREADPRIVATE directive makes file-scope variables local to a thread. It makes the file-scope or namespace-scope variables named in the list private to a thread, but file-scope visible within that thread.
#pragma OMP THREADPRIVATE (list) new-line

NOTE: THREADPRIVATE variables are not supported in this release of HP C compiler.

Caveats

Observe these known restrictions while you use OpenMP pragmas:

Environment Variables in OpenMP

The OpenMP environment variables available in the HP C compiler control the execution of parallel code. The environment variable names are case sensitive and must be in uppercase. The following environment variables are available in the HP C compiler:

OMP_SCHEDULE

This environment variable applies to for and parallel for directives that have the schedule type runtime. The schedule type and chunk size for all such loops can be set at run time by setting this environment variable to any of the recognized schedule types and an optional chunk_size.
setenv OMP_SCHEDULE "dynamic"
The default value of the environment variable is implementation dependent. If the optional chunk_size is set, the value must be positive. If chunk_size is not set, a value of 1 is assumed, except for a static schedule. For a static schedule, the default chunk_size is the loop iteration space divided by the number of threads applied to the loop.


NOTE: OMP_SCHEDULE is ignored for for and parallel for directives that have a schedule type other than runtime.
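
The following sketch shows a loop whose schedule is deferred to run time, so that OMP_SCHEDULE takes effect. The function name sum_of_squares is hypothetical, and the lowercase omp spelling follows the OpenMP C specification.

```c
/* Returns 1*1 + 2*2 + ... + n*n, with the schedule chosen at run time. */
long sum_of_squares(int n)
{
    long total = 0;
    int i;

    /* schedule(runtime) makes this loop honor OMP_SCHEDULE. */
    #pragma omp parallel for schedule(runtime)
    for (i = 1; i <= n; i++) {
        long sq = (long)i * i;

        #pragma omp atomic
        total += sq;
    }
    return total;
}
```

Under the OpenMP specification's type[,chunk_size] format, a setting such as setenv OMP_SCHEDULE "dynamic,4" would request a dynamic schedule with a chunk size of 4 for this loop; the result is the same under any schedule.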


OMP_NUM_THREADS

The value of OMP_NUM_THREADS must be positive. Its effect depends on whether dynamic adjustment of the number of threads is enabled. If dynamic adjustment is disabled, the value of this environment variable is the number of threads to use for each parallel region until that number is explicitly changed during execution. If dynamic adjustment of the number of threads is enabled, the value of the environment variable is interpreted as the maximum number of threads to use.
setenv OMP_NUM_THREADS 16

OMP_DYNAMIC

The OMP_DYNAMIC environment variable enables or disables dynamic adjustment of the number of threads available for execution of parallel regions. Its value must be TRUE or FALSE. If the value is set to TRUE, the number of threads that are used for executing parallel regions may be adjusted by the runtime environment to best utilize system resources. If the value is set to FALSE, dynamic adjustment is disabled.
setenv OMP_DYNAMIC TRUE

OMP_NESTED

The OMP_NESTED environment variable enables or disables nested parallelism. Its value must be TRUE or FALSE. If the value is set to TRUE, nested parallelism is enabled; if the value is set to FALSE, nested parallelism is disabled. The default value is FALSE.
setenv OMP_NESTED FALSE

Runtime Library Functions

This section describes the OpenMP C run-time library functions. The header <omp.h> declares two types, several functions that can be used to control and query the parallel execution environment, and lock functions that can be used to synchronize access to data.

The type omp_lock_t is an object type capable of representing either that a lock is available or that a thread owns a lock. These locks are referred to as simple locks.

The type omp_nest_lock_t is an object type capable of representing either that a lock is available, or both the identity of the thread that owns the lock and a nesting count. These locks are referred to as nestable locks.

The library functions are external functions.

The descriptions of library functions are divided into the following topics:

Execution environment functions

The functions described in this section affect and monitor threads, processors, and the parallel environment:
omp_set_num_threads
The omp_set_num_threads function sets the number of threads to use for subsequent parallel regions. The format is as follows:
#include <omp.h>
void omp_set_num_threads(int num_threads);
The value of the parameter num_threads must be positive. Its effect depends upon whether dynamic adjustment of the number of threads is enabled. If dynamic adjustment is disabled, the value is used as the number of threads for all subsequent parallel regions prior to the next call to this function; otherwise, the value is the maximum number of threads that will be used. This function has effect only when called from serial portions of the program. If it is called from a portion of the program where the omp_in_parallel function returns non-zero, the behavior of this function is undefined. For more information on this subject, see the omp_set_dynamic and omp_get_dynamic functions. This call has precedence over the OMP_NUM_THREADS environment variable.

omp_get_num_threads
The omp_get_num_threads function returns the number of threads currently in the team executing the parallel region from which it is called. The format is as follows:
#include <omp.h>
int omp_get_num_threads(void);
The omp_set_num_threads function and the OMP_NUM_THREADS environment variable control the number of threads in a team. If the number of threads has not been explicitly set by the user, the default is implementation dependent. This function binds to the closest enclosing omp parallel directive. If called from a serial portion of a program, or from a nested parallel region that is serialized, this function returns 1.

omp_get_max_threads
The omp_get_max_threads function returns the maximum value that can be returned by calls to omp_get_num_threads. The format is as follows:
#include <omp.h>
int omp_get_max_threads(void);
If omp_set_num_threads is used to change the number of threads, subsequent calls to this function will return the new value. A typical use of this function is to determine the size of an array for which all thread numbers are valid indices, even when omp_set_dynamic is set to non-zero.

This function returns the maximum value whether executing within a serial region or a parallel region.

omp_get_thread_num
The omp_get_thread_num function returns the thread number, within its team, of the thread executing the function. The thread number lies between 0 and omp_get_num_threads()-1, inclusive. The master thread of the team is thread 0. The format is as follows:
#include <omp.h>
int omp_get_thread_num(void);
If called from a serial region, omp_get_thread_num returns 0. If called from within a nested parallel region that is serialized, this function returns 0.
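
The sketch below combines omp_get_max_threads and omp_get_thread_num in the way described above: it sizes an array of partial sums so that every thread number is a valid index, then lets each thread accumulate into its own slot. The function name slot_sum is hypothetical. The serial stand-ins under #else are an assumption added so that the sketch also builds when OpenMP is not enabled.

```c
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption: used only when OpenMP is not enabled). */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Sums v[0..n-1] into *out using one slot per thread.
   Returns 0 on success, -1 if the slot array cannot be allocated. */
int slot_sum(const int *v, int n, long *out)
{
    int nslots = omp_get_max_threads();   /* every thread number fits */
    long *partial = calloc((size_t)nslots, sizeof *partial);
    long total = 0;
    int i, t;

    if (partial == NULL)
        return -1;

    #pragma omp parallel for
    for (i = 0; i < n; i++)
        partial[omp_get_thread_num()] += v[i];   /* own slot, no race */

    for (t = 0; t < nslots; t++)
        total += partial[t];

    free(partial);
    *out = total;
    return 0;
}
```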

omp_get_num_procs
The omp_get_num_procs function returns the maximum number of processors that could be assigned to the program. The format is as follows:
#include <omp.h>
int omp_get_num_procs(void);

omp_in_parallel
The omp_in_parallel function returns non-zero if it is called within the dynamic extent of a parallel region executing in parallel; otherwise, it returns 0. The format is as follows:
#include <omp.h>
int omp_in_parallel(void);
This function returns non-zero from within a region executing in parallel, regardless of nested regions that are serialized.

omp_set_dynamic
The omp_set_dynamic function enables or disables dynamic adjustment of the number of threads available for execution of parallel regions. The format is as follows:
#include <omp.h>
void omp_set_dynamic(int dynamic_threads);
This function has effect only when called from serial portions of the program. If it is called from a portion of the program where the omp_in_parallel function returns non-zero, the behavior of the function is undefined. If dynamic_threads evaluates to non-zero, the number of threads that are used for executing subsequent parallel regions may be adjusted automatically by the run-time environment to best utilize system resources. As a consequence, the number of threads specified by the user is the maximum thread count. The number of threads always remains fixed over the duration of each parallel region and is reported by the omp_get_num_threads function.

If dynamic_threads evaluates to 0, dynamic adjustment is disabled. A call to omp_set_dynamic has precedence over the OMP_DYNAMIC environment variable.

The default for the dynamic adjustment of threads is implementation dependent. As a result, user codes that depend on a specific number of threads for correct execution should explicitly disable dynamic threads. Implementations are not required to provide the ability to dynamically adjust the number of threads, but they are required to provide the interface in order to support portability across all platforms.
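
For code that depends on a specific team size, one possible pattern is to disable dynamic adjustment and request the thread count from serial code, then confirm the size inside the region. The function name team_size and the parameter wanted are hypothetical. The serial stand-ins under #else are an assumption so that the sketch also builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption: used only when OpenMP is not enabled). */
static void omp_set_dynamic(int d)     { (void)d; }
static void omp_set_num_threads(int n) { (void)n; }
static int  omp_get_num_threads(void)  { return 1; }
#endif

/* Requests a fixed team of `wanted` threads and returns the size
   of the team that actually executed the parallel region. */
int team_size(int wanted)
{
    int size = 0;

    /* Both calls are made from the serial part of the program. */
    omp_set_dynamic(0);
    omp_set_num_threads(wanted);

    #pragma omp parallel
    {
        /* Only thread 0 records the team size. */
        #pragma omp master
        size = omp_get_num_threads();
    }
    return size;
}
```

The returned value should be checked rather than assumed, since an implementation may still provide fewer threads than requested.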

omp_get_dynamic
The omp_get_dynamic function returns non-zero if dynamic thread adjustment is enabled and returns 0 otherwise. For a description of dynamic thread adjustment, see omp_set_dynamic. The format is as follows:
#include <omp.h>
int omp_get_dynamic(void);
If the implementation does not implement dynamic adjustment of the number of threads, this function always returns 0.

omp_set_nested
The omp_set_nested function enables or disables nested parallelism. The format is as follows:
#include <omp.h>
void omp_set_nested(int nested);
If nested evaluates to 0, which is the default, nested parallelism is disabled, and nested parallel regions are serialized and executed by the current thread. If nested evaluates to non-zero, nested parallelism is enabled, and parallel regions that are nested may deploy additional threads to form the team.

This call has precedence over the OMP_NESTED environment variable. When nested parallelism is enabled, the number of threads used to execute nested parallel regions is implementation dependent. As a result, OpenMP-compliant implementations are allowed to serialize nested parallel regions even when nested parallelism is enabled.

omp_get_nested
The omp_get_nested function returns non-zero if nested parallelism is enabled and 0 if it is disabled. The format is as follows:
#include <omp.h>
int omp_get_nested(void);
If an implementation does not implement nested parallelism, this function always returns 0.

Lock functions

The functions described in this section manipulate locks used for synchronization.

For the following functions, the lock variable must have type omp_lock_t. This variable must only be accessed through these functions. All lock functions require an argument that has a pointer to omp_lock_t type.

For the following functions, the lock variable must have type omp_nest_lock_t. This variable must only be accessed through these functions. All nestable lock functions require an argument that has a pointer to omp_nest_lock_t type.

omp_init_lock and omp_init_nest_lock Functions
These functions provide the only means of initializing a lock. Each function initializes the lock associated with the parameter lock for use in subsequent calls. The format is as follows:
#include <omp.h>
void omp_init_lock(omp_lock_t *lock);
void omp_init_nest_lock(omp_nest_lock_t *lock);
The initial state is unlocked (that is, no thread owns the lock). For a nestable lock, the initial nesting count is zero.

omp_destroy_lock and omp_destroy_nest_lock Functions
These functions ensure that the lock variable pointed to by lock is uninitialized. The format is as follows:
#include <omp.h>
void omp_destroy_lock(omp_lock_t *lock);
void omp_destroy_nest_lock(omp_nest_lock_t *lock);
The argument to these functions must point to an initialized lock variable that is unlocked.

omp_set_lock and omp_set_nest_lock Functions
Each of these functions blocks the thread executing the function until the specified lock is available and then sets the lock. A simple lock is available if it is unlocked. A nestable lock is available if it is unlocked or if it is already owned by the thread executing the function. The format is as follows:
 
#include <omp.h>
void omp_set_lock(omp_lock_t *lock);
void omp_set_nest_lock(omp_nest_lock_t *lock);
For a simple lock, the argument to the omp_set_lock function must point to an initialized lock variable. Ownership of the lock is granted to the thread executing the function.

For a nestable lock, the argument to the omp_set_nest_lock function must point to an initialized lock variable. The nesting count is incremented, and the thread is granted, or retains, ownership of the lock.

omp_unset_lock and omp_unset_nest_lock Functions
These functions provide the means of releasing ownership of a lock. The format is as follows:
#include <omp.h>
void omp_unset_lock(omp_lock_t *lock);
void omp_unset_nest_lock(omp_nest_lock_t *lock);
The argument to each of these functions must point to an initialized lock variable owned by the thread executing the function. The behavior is undefined if the thread does not own that lock.

For a simple lock, the omp_unset_lock function releases the thread executing the function from ownership of the lock.

For a nestable lock, the omp_unset_nest_lock function decrements the nesting count, and releases the thread executing the function from ownership of the lock if the resulting count is zero.

omp_test_lock and omp_test_nest_lock Functions
These functions attempt to set a lock but do not block execution of the thread. The format is as follows:
#include <omp.h>
int omp_test_lock(omp_lock_t *lock);
int omp_test_nest_lock(omp_nest_lock_t *lock);
The argument must point to an initialized lock variable. These functions attempt to set a lock in the same manner as omp_set_lock and omp_set_nest_lock, except that they do not block execution of the thread.

For a simple lock, the omp_test_lock function returns non-zero if the lock is successfully set; otherwise, it returns zero.

For a nestable lock, the omp_test_nest_lock function returns the new nesting count if the lock is successfully set; otherwise, it returns zero.
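
The life cycle of a simple lock (initialize, set, unset, destroy) can be sketched as follows. The function name count_even is hypothetical. The explicit flush directives are an assumption for implementations whose lock functions do not imply a flush, and the stand-ins under #else are an assumption so that the sketch also builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption: used only when OpenMP is not enabled). */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
#endif

/* Counts the even values in v[0..n-1], guarding the shared
   counter with a simple lock. */
int count_even(const int *v, int n)
{
    omp_lock_t lock;
    int count = 0;
    int i;

    omp_init_lock(&lock);                /* initial state: unlocked */

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        if (v[i] % 2 == 0) {
            omp_set_lock(&lock);         /* blocks until available */
            #pragma omp flush(count)
            count++;
            #pragma omp flush(count)
            omp_unset_lock(&lock);       /* release ownership */
        }
    }

    omp_destroy_lock(&lock);             /* lock is unlocked here */
    return count;
}
```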

Memory Classes

In order to use memory classes in C programs, you must include the header file /usr/include/spp_prog_model.h. Memory classes are described in the Parallel Programming Guide for HP-UX Systems. The memory classes described in this section are node_private and thread_private.

In C, the general form for assigning memory is:

#include <spp_prog_model.h>

. . .

[storage_class_specifier] memory_class_name type_specifier namelist

where:

Data objects that are assigned a memory class must have a static storage duration. If the object is declared within a function, it must have the storage class extern or static. Data objects declared at file scope and assigned a memory class need not specify a storage class.

A hypernode is a set of processors and physical memory organized as a symmetric multiprocessor (SMP) running a single image of the operating system microkernel.

node_private

This storage class specifier causes the variables and arrays specified in namelist to be replicated in the physical memory of each hypernode on which the process is executing. While each data object has a single image in virtual memory, it maps to a different physical location on each hypernode. The threads of a process within a hypernode all share access to the copy on their hypernode and cannot access the copies on other hypernodes.

thread_private

This storage class specifier causes the variables and arrays to be treated as thread_private. These data objects map to unique node_private addresses for each thread of a process. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

Synchronization Functions

HP C provides functions that can be used with pragmas to achieve synchronization. Those discussed in this section include the allocate, deallocate, locking, unlocking, and wait functions.

Gates allow you to restrict execution of a block of code to a single thread. They can be allocated, locked, unlocked, or deallocated. Alternatively, they can be used with the ordered or critical section pragmas, which automate the locking and unlocking functions.

Barriers block further execution until all executing threads reach the barrier.

You declare gates and barriers by using the following type definitions:

Gates and barriers should only appear in definition and declaration statements, and as formal and actual arguments.

Allocate Functions

These functions allocate memory for a gate or barrier. When memory is first allocated, gate variables are unlocked.
int alloc_gate(gate_t *gate_p);
 
int alloc_barrier(barrier_t *barrier_p);
gate_p and barrier_p are pointers of the indicated type, which have been previously declared as described above.

Deallocate Functions

These functions free the memory assigned to the specified gate or barrier variable.

These functions have the following declarations:

int free_gate(gate_t *gate_p);
 
int free_barrier(barrier_t *barrier_p);
where gate_p and barrier_p are pointers of the indicated type. Always free gates and barriers when you are done using them.

Locking Functions

These functions acquire a gate for exclusive access. If the gate cannot be immediately acquired, the calling thread waits for it. The conditional locking functions, which are prefixed with COND_ or cond_, acquire a gate if doing so does not require a wait. If the gate is acquired, the functions return 0; if not, they return -1.

The functions have the following declarations:

int lock_gate(gate_t *gate_p);
 
int cond_lock_gate(gate_t *gate_p);
where gate_p is a pointer of the indicated type.

Unlocking Function

This function releases a gate from exclusive access. Gates are typically released by the thread that locks them, unless a gate was locked by thread 0 in serial code. In that case it might be unlocked by a single different thread in a parallel construct.

The function has the following declaration:

int unlock_gate(gate_t *gate_p);
where gate_p is a pointer of the indicated type.

Wait Function

This function uses a barrier to cause the calling thread to wait until the specified number of threads call the function, at which point all threads are released from the function simultaneously.

The function has the following declaration:

int wait_barrier(barrier_t *barrier_p, const int *nthr);
where barrier_p is a pointer of the indicated type and nthr is a pointer referencing the number of threads calling the routine.

You can use a barrier variable in multiple calls to the wait_barrier function, as long as you ensure that two barriers are not active at the same time. Also, check that nthr reflects the correct number of threads.