HP C/HP-UX Online Help
NOTE: See the Compiling & Running
HP C Programs section of the HP C Online Help for a quick reference
of all HP C compiler options and pragmas. See the Optimizing
HP C Programs section of the HP C Online Help for a detailed description
of HP C optimization options and pragmas.
Parallel Options and Pragmas
Getting Started with Parallelizing C Programs
Guidelines for Parallelizing C Programs
Parallel Processing Options
Parallel Processing Pragmas
OpenMP Pragmas
Memory Classes
Synchronization Functions
HP C generates efficient parallel code by default. You can increase
the amount of code the compiler can parallelize on multiprocessor systems
by using options, pragmas, and supporting library calls. Applications running
on HP 9000 K-Class and V-Class servers can benefit from the parallelization
features described in this section.
For detailed information and examples, see the Parallel Programming
Guide for HP-UX Systems.
Getting Started with Parallelizing C Programs
This section describes the basic tasks required to get started
with parallelizing C programs.
Transforming Loops for Parallel Execution (+Oparallel)
The +Oparallel option causes the compiler to transform eligible
loops for parallel execution on multiprocessor machines.
The following command lines compile (without linking) three source files:
x.c, y.c, and z.c. The files x.c and y.c are compiled for parallel
execution. The file z.c is compiled for serial execution,
even though its object file will be linked with x.o and y.o.
cc +O3 +Oparallel -c x.c y.c
cc +O3 -c z.c
The following command line links the three object files, producing the
executable file para_prog:
cc +O3 +Oparallel -o para_prog x.o y.o z.o
As this command line implies, if you link and compile separately, you
must use cc, not ld. The command line to link must also
include the +Oparallel and +O3 options in order to link
in the right startup files and runtime support.
Setting the Number of Threads Used in Parallel
Use the MP_NUMBER_OF_THREADS environment variable to set the number
of processors that are to execute your program in parallel. If you do not
set this variable, it defaults to the number of processors on the executing
machine.
From the C shell, the following command sets MP_NUMBER_OF_THREADS
to indicate that programs compiled for parallel execution can execute on
two processors:
setenv MP_NUMBER_OF_THREADS 2
If you use the Korn shell, the command is:
export MP_NUMBER_OF_THREADS=2
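The parallel runtime reads MP_NUMBER_OF_THREADS itself, so no application code is required. The following sketch only models the documented rule (use the variable when it is set, otherwise use the number of processors). The helper name effective_thread_count is hypothetical, and the fallback for invalid or out-of-range values is an assumption, not documented behavior.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of the HP runtime: returns the thread count
 * a parallel program would use, given the raw value of MP_NUMBER_OF_THREADS
 * (NULL when unset) and the number of processors on the machine. */
int effective_thread_count(const char *env_value, int processors)
{
    char *end;
    long n;

    if (env_value == NULL)
        return processors;      /* unset: default to all processors */

    n = strtol(env_value, &end, 10);
    if (end == env_value || *end != '\0' || n <= 0 || n > processors)
        return processors;      /* assumption: fall back on invalid input */
    return (int)n;
}
```

A caller might evaluate effective_thread_count(getenv("MP_NUMBER_OF_THREADS"), 4) on a four-processor system.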
Determining Idle Thread States
Use the MP_IDLE_THREADS_WAIT environment variable to determine
how threads wait. Idle threads can be suspended or can spin-wait.
This variable takes an integer value n. For n less than
0, the threads spin-wait. For n equal to or greater than 0, the
threads spin-wait for n milliseconds before being suspended.
By default, idle threads spin-wait briefly after creation or a join.
They then suspend themselves if they receive no work.
Accessing the Pthreads Library
Pthreads (POSIX threads) refers to the Pthreads library of thread-management
routines. For information on Pthread routines see the pthread(3t)
man page.
To use the Pthread routines, your program must include the <pthread.h>
header file, and the Pthreads library must be explicitly linked to your
program. For example:
% cc -D_POSIX_C_SOURCE=199506L prog.c -lpthread
The -D_POSIX_C_SOURCE=199506L string specifies the appropriate
POSIX revision level. In this case, the level is 199506L.
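As a minimal sketch of direct Pthreads use, the following function starts a fixed number of threads with pthread_create, waits for them with pthread_join, and combines their results. The names square_worker and sum_of_squares_threaded are illustrative only, and error handling is reduced to a single check.

```c
#include <pthread.h>

#define NUM_WORKERS 4

/* Each worker squares, in place, the integer it is handed. */
static void *square_worker(void *arg)
{
    int *value = arg;
    *value = *value * *value;
    return NULL;
}

/* Starts NUM_WORKERS threads, waits for all of them, and returns
 * 1*1 + 2*2 + ... + NUM_WORKERS*NUM_WORKERS.
 * Returns -1 if a thread cannot be created. */
int sum_of_squares_threaded(void)
{
    pthread_t threads[NUM_WORKERS];
    int values[NUM_WORKERS];
    int i, sum = 0;

    for (i = 0; i < NUM_WORKERS; i++) {
        values[i] = i + 1;
        if (pthread_create(&threads[i], NULL, square_worker, &values[i]) != 0)
            return -1;
    }
    for (i = 0; i < NUM_WORKERS; i++) {
        pthread_join(threads[i], NULL);   /* wait before reading the result */
        sum += values[i];
    }
    return sum;
}
```

Each thread writes only to its own element of values, so no additional synchronization is needed beyond the join.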
Profiling Parallelized Programs
Profiling a program that has been compiled for parallel execution is performed
in much the same way as it is for non-parallel programs:
1. Compile the program with the option -G.
2. Run the program to produce profiling data.
3. Run gprof against the program.
4. View the output from gprof.
The differences are:
-
Running the program in Step 2 produces a gmon.out file for the
master process and gmon.out.1, gmon.out.2, and so on
for each of the slave processes. If your program executes on two processors,
Step 2 produces two files, gmon.out and gmon.out.1.
-
The flat profile that you view in Step 4 indicates loops that were parallelized
with the following notation:
routine_name #pr_line_0123
where routine_name is the name of the routine containing the loop,
pr (parallel region) indicates that the loop was parallelized, and 0123
is the line number of the beginning of the loop or loops that are parallelized.
Guidelines for Parallelizing C Programs
This section describes the following guidelines for parallelizing C programs:
To ensure the best performance from a parallel program, do not run more
than one parallel program on a multiprocessor machine at the same time.
Running two or more parallel programs simultaneously, or running one parallel
program on a heavily loaded system, slows performance.
You should run a parallel-executing program at a higher priority than
any other user program; see rtprio(1) for information about setting
real-time priorities.
Conditions Inhibiting Loop Parallelization
The following sections describe different conditions that can inhibit parallelization.
Calling Routines with Side Effects
The compiler will not parallelize any loop containing a call to a routine
that has side effects. A routine has side effects if it does any of the
following:
-
Modifies its arguments.
-
Modifies an extern, static, or global variable.
-
Redefines variables that are local to the calling routine.
-
Performs I/O.
-
Calls another subroutine or function that does any of the above.
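The following sketch contrasts a routine without side effects and one with a side effect of the kind listed above. Both function names are hypothetical and chosen only for illustration.

```c
/* No side effects: the result depends only on the argument. */
double scale(double x)
{
    return 2.0 * x;
}

static double running_total = 0.0;

/* Side effect: every call modifies a static variable, so the value
 * returned by one call depends on all the calls made before it. */
double accumulate(double x)
{
    running_total += x;
    return running_total;
}
```

A loop that calls scale on each element gives the same results in any iteration order. A loop that calls accumulate does not, which is why such a call inhibits parallelization.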
Indeterminate Iteration Counts
If the compiler cannot determine what the runtime loop iteration count
is before the loop executes, it does not parallelize the loop. The reason
for this limitation is that the runtime code must know the iteration count
in order to know how many iterations to distribute to the different processors
for execution.
The following conditions can prevent the compiler from determining a runtime
iteration count:
-
The loop is an infinite loop.
-
A conditional break statement or goto out of the loop appears
in the loop.
-
The loop modifies either the loop-control or loop-limit variable.
-
The loop is a while construct and the condition being tested is
defined within the loop.
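The two loops below illustrate the distinction, as a sketch with hypothetical function names. In the first, the iteration count n is known before the loop starts. In the second, a conditional break makes the number of iterations depend on the data.

```c
/* Countable: exactly n iterations, known before the loop is entered. */
void double_all(int *v, int n)
{
    int i;
    for (i = 0; i < n; i++)
        v[i] = 2 * v[i];
}

/* Not countable: the conditional break ends the loop at a point that
 * depends on the contents of v. Returns n if no element is negative. */
int index_of_first_negative(const int *v, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (v[i] < 0)
            break;
    }
    return i;
}
```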
Data Dependence
When a loop is parallelized, the iterations are executed independently
on different processors, and the order of execution differs from the serial
order that occurs on a single processor. This reordering is not a problem
if the iterations can be executed in any order with no
effect on the results. Consider the following loop:
for (i=0; i<5; i++)
a[i] = a[i] * b[i];
In this example, the array a would always end up with the same
data regardless of whether the order of execution were 0-1-2-3-4, 4-3-2-1-0,
3-1-4-0-2, or any other order. The independence of each iteration from
the others makes the loop an eligible candidate for parallelization.
Such is not the case in the following:
for (i=1; i<5; i++)
a[i] = a[i-1] * b[i];
In this loop, the order of execution does matter. The data used in iteration
i depends on the data that was produced in the previous iteration, i-1.
The array a would end up with very different data if the
order of execution were any other than 1-2-3-4. The data dependence in
this loop thus makes it ineligible for parallelization.
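The effect of execution order can be checked directly. The following sketch runs each loop body in an order supplied by the caller; the function names and the order array are illustrative devices, not part of any HP interface.

```c
/* Independent iterations: a[i] = a[i] * b[i], visited in the given order. */
void multiply_in_order(int *a, const int *b, const int *order, int count)
{
    int k;
    for (k = 0; k < count; k++) {
        int i = order[k];
        a[i] = a[i] * b[i];
    }
}

/* Loop-carried dependence: a[i] = a[i-1] * b[i], visited in the given
 * order. Every index in order must be at least 1. */
void chain_in_order(int *a, const int *b, const int *order, int count)
{
    int k;
    for (k = 0; k < count; k++) {
        int i = order[k];
        a[i] = a[i - 1] * b[i];
    }
}
```

With a = {1, 2, 3, 4, 5} and every element of b equal to 2, multiply_in_order produces {2, 4, 6, 8, 10} for any order. chain_in_order produces {1, 2, 4, 8, 16} in the order 1-2-3-4 but {1, 2, 4, 6, 8} in the order 4-3-2-1.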
Not all data dependences must inhibit parallelization. The following
paragraphs discuss some of the exceptions.
Nested Loops and Matrices
Some nested loops that operate on matrices may have a data dependence in
the inner loop only, allowing the outer loop to be parallelized. Consider
the following:
for (i=0; i<10; i++)
for (j=1; j<100; j++)
a[i][j] = a[i][j-1] + 1;
The data dependence in this nested loop occurs in the inner (j) loop:
each element a[i][j] depends upon the preceding element a[i][j-1]
having been assigned in the previous iteration. If the iterations of the
j loop were to execute in any other order than the one in which they would
execute on a single processor, the matrix would be assigned different values.
The inner loop, therefore, must not be parallelized.
But no such data dependence appears in the outer (i) loop: each row
is independent of every other row. Consequently, the compiler
can safely distribute entire rows of the matrix to execute on different
processors; the data assignments will be the same regardless of the order
in which the rows are executed, so long as the elements within each row
are assigned in serial order.
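The following sketch makes the same point in code: the inner loop always runs in serial order, while the outer-loop iterations can be visited in either direction without changing the result. The names fill_row and fill_matrix and the step parameter are hypothetical.

```c
#define ROWS 10
#define COLS 100

/* Inner loop, always in serial order: row[j] = row[j-1] + 1. */
static void fill_row(int row[COLS])
{
    int j;
    for (j = 1; j < COLS; j++)
        row[j] = row[j - 1] + 1;
}

/* Outer loop: visits the first index in ascending order when step is 1
 * and in descending order when step is -1. */
void fill_matrix(int a[ROWS][COLS], int step)
{
    int i = (step > 0) ? 0 : ROWS - 1;
    int n;
    for (n = 0; n < ROWS; n++, i += step)
        fill_row(a[i]);
}
```

Two matrices with identical first elements end up identical whether fill_matrix is called with a step of 1 or -1.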
Assumed Dependencies
When analyzing a loop, the compiler errs on the safe side: it assumes that
what looks like a data dependence really is one, and so it does not parallelize
the loop. Consider the following:
for (i=100; i<200; i++)
a[i] = a[i-k];
The compiler assumes that a data dependence exists in this loop because
it appears that data that has been defined in a previous iteration is being
used in a later iteration. However, if the value of k is 100,
the dependence is assumed rather than real because a[i-k] is defined
outside the loop.
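When you know that such a dependence is not real, the no_loop_dependence pragma described under Parallel Processing Pragmas can tell the compiler so. The following is a sketch only: the function name copy_shifted is hypothetical, and the pragma placement assumes the #pragma _CNX form given in that section. Compilers that do not recognize the pragma ignore it and run the loop serially.

```c
/* The caller must pass k == 100, so that a[i-k] reads only elements
 * 0 through 99, which this loop never writes. */
void copy_shifted(int a[200], int k)
{
    int i;
#pragma _CNX no_loop_dependence(a)
    for (i = 100; i < 200; i++)
        a[i] = a[i - k];
}
```

The pragma is a promise made by the programmer. If k were smaller than 100, the dependence would be real and the assertion would be incorrect.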
Parallel Processing Options
HP C provides the following optimization options for parallelizing C programs:
+O[no]autopar
Optimization level(s): 3, 4
Default: +Oautopar if +Oparallel is enabled
When used with +Oparallel, the +Onoautopar option
causes the compiler to parallelize only those loops marked by the loop_parallel
or prefer_parallel pragmas. Because the compiler does not automatically
find parallel tasks or regions, user-specified task and region parallelization
is not affected by this option.
A loop is safe to parallelize if it has an iteration count that can
be determined at runtime before loop invocation, and contains no loop-carried
dependences, procedure calls, or I/O operations. A loop-carried dependence
exists when one iteration of a loop assigns a value to an address that
is referenced or assigned on another iteration.
+O[no]dynsel
Optimization level(s): 3, 4
Default: +Odynsel if +Oparallel is enabled
When specified with +Oparallel, +Odynsel (the default)
enables workload-based dynamic selection. For parallelizable loops whose
iteration counts are known at compile time, +Odynsel causes the
compiler to generate either a parallel or a serial version of the loop,
depending on which is more profitable.
This optimization also causes the compiler to generate both parallel
and serial versions of parallelizable loops whose iteration counts are
unknown at compile time. At runtime, the loop workload is compared to parallelization
overhead, and the parallel version is run only if it is profitable to do
so.
The +Onodynsel option disables dynamic selection and tells
the compiler that it is profitable to parallelize all parallelizable loops.
The dynsel pragma can be used to enable dynamic selection for
specific loops when +Onodynsel is in effect.
See Also: dynsel[(trip_count=n)]
+O[no]loop_block
Optimization level(s): 3, 4
Default: +Onoloop_block
The +O[no]loop_block option enables [disables] blocking of
eligible loops for improved cache performance. The +Onoloop_block
option disables automatic and directive-specified loop blocking. For more
information on loop blocking, see the Parallel Programming Guide for
HP-UX Systems.
+O[no]loop_unroll_jam
Optimization level(s): 3, 4
Default: +Onoloop_unroll_jam
The +O[no]loop_unroll_jam option enables [disables] loop unrolling
and jamming. The +Onoloop_unroll_jam option disables both automatic
and directive-specified unroll and jam. Loop unrolling and jamming increases
register exploitation. For more information on the unroll and jam optimization,
see the Parallel Programming Guide for HP-UX Systems.
+O[no]parallel
Optimization level(s): 3, 4
Default: +Onoparallel
The +Oparallel option optimizes the time it takes to execute
a single process running on a multiprocessor system.
NOTE: If you compile one or more files in an application using +Oparallel,
then the application must be linked (using the compiler driver) with the
+Oparallel option to link in the proper start-up files and runtime support.
The +Oparallel option causes the compiler to:
-
Recognize the directives and pragmas that involve parallelism, such as
begin_tasks, loop_parallel, and prefer_parallel
-
Look for opportunities for parallel execution in loops
The following methods can be used to specify the number of processors used
in executing your parallel programs:
-
loop_parallel(max_threads=m) pragma
-
prefer_parallel(max_threads=m) pragma
-
MP_NUMBER_OF_THREADS environment variable, which is read at runtime
by your program. If this variable is set to a positive integer n,
your program executes on n processors. n must be less than
or equal to the number of processors on the system where the program is
executing. See Setting the Number of Threads Used in Parallel
for an example.
The +Oparallel option disables +Ofailsafe.
See Also: Transforming Loops for Parallel Execution (+Oparallel).
+O[no]report[=report_type]
Optimization level(s): 3, 4
Default: +Onoreport
This option causes the compiler to display various optimization reports.
+Onoreport
is the default. The value of report_type determines which report
is displayed, as described below.
+Oreport=loop produces the Loop Report. This report gives information
on optimizations performed on loops and calls. Using +Oreport
(without =report_type) also produces the Loop Report.
+Oreport=private produces the Loop Report and the Privatization
Table, which provides information on loop variables that are privatized
by the compiler.
+Oreport=all produces all reports.
The +Oreport[=report_type] option is active
only at +O3 and above. The +Onoreport option does not
accept any of the report_type values. See the Parallel Programming
Guide for HP-UX Systems for more information on the optimization reports.
+O[no]sharedgra
Optimization level(s): 2, 3, 4
Default: +Osharedgra
The +Onosharedgra option disables global register allocation
for shared-memory variables that are visible to multiple threads. This
option can help if a variable shared among parallel threads is causing
wrong answers. See the Parallel Programming Guide for HP-UX Systems
for more information.
Parallel Processing Pragmas
This section describes the following Parallel Processing Pragmas:
The syntax of a parallel processing pragma is:
#pragma [_CNX] pragma-list
where:
-
pragma-list is a comma-separated list of pragmas described in this
section.
See Specifying Task Parallelism for an example
on using these pragmas.
In the sections that follow, namelist represents a comma-separated
list of variables or arrays. The occurrence of a lowercase n or m
is used to indicate an integer constant. Occurrences of gate_var
are for variables that have been, or are being, defined as gates.
begin_tasks[(attribute_list)]
This pragma defines the beginning of sections of code (see next_task)
that are to be executed as independent, parallel tasks. Each task is executed
by a separate thread. begin_tasks must have an accompanying end_tasks
in the same program unit.
The optional attribute_list can be any of the following legal
combinations (m is an integer constant):
-
threads (default)
-
dist
-
ordered
-
max_threads=m
-
threads, ordered
-
dist, ordered
-
threads, max_threads=m
-
dist, max_threads=m
-
ordered, max_threads=m
-
threads, ordered, max_threads=m
-
dist, ordered, max_threads=m
Attributes may be listed in any order. The compiler flags any attribute
combinations other than those listed above with a warning and ignores the
pragma.
Refer to the Parallel Programming Guide for HP-UX Systems for
a complete discussion of parallel tasking.
block_loop[(block_factor=n)]
This pragma indicates a specific loop to block, and optionally, the block
factor n (n must be an integer constant greater than or equal
to 2) that is to be used in the compiler's internal computation of loop
nest based data reuse. If no block_factor is specified, the compiler
uses a heuristic to determine the block_factor. Refer to the Parallel
Programming Guide for HP-UX Systems for more information on blocking.
critical_section[(gate_var)]
This pragma defines the beginning of a code block in which only one thread
may be executing at a time. The end of the code block must be indicated
by an end_critical_section pragma, which must appear in the same
flow of control within the same program unit. The optional gate_var
can be used to implement a critical section which is not contiguous at
the source level. Refer to the Parallel Programming Guide for HP-UX
Systems for more information.
dynsel[(trip_count=n)]
This pragma enables workload-based dynamic selection for the immediately
following loop. trip_count represents either the thread_trip_count
or node_trip_count attribute, and
n is an integer constant.
When thread_trip_count=n is specified, the serial version
of the loop is run if the iteration count is less than n; otherwise,
the thread-parallel version is run. When node_trip_count=n
is specified, the serial version of the loop is run if the iteration count
is less than n; otherwise, the node-parallel version is run, assuming
+Onodepar
is specified.
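As a sketch of how the pragma might be applied, the following hypothetical function requests dynamic selection with a thread trip count of 1000. It assumes the #pragma _CNX form described at the start of this section; the threshold value is arbitrary.

```c
/* Multiplies each of the n elements of v by factor. With the pragma in
 * effect, the serial version is expected to run when n is less than 1000. */
void scale_vector(double *v, int n, double factor)
{
    int i;
#pragma _CNX dynsel(thread_trip_count=1000)
    for (i = 0; i < n; i++)
        v[i] *= factor;
}
```

The results are the same whichever version of the loop is selected, because the iterations are independent.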
end_critical_section
This pragma defines the end of the critical section that was begun with
the critical_section pragma. critical_section and end_critical_section
must appear as a pair. Refer to the Parallel Programming Guide for HP-UX
Systems for more information.
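The following sketch shows one way a critical section might protect a shared counter inside a loop marked with loop_parallel. The function name count_matches is hypothetical, the optional gate_var is omitted, and the pragma spelling assumes the #pragma _CNX form described at the start of this section.

```c
/* Counts the elements of v that equal target. The shared counter is
 * updated only inside the critical section, one thread at a time. */
long count_matches(const int *v, int n, int target)
{
    long matches = 0;
    int i;
#pragma _CNX loop_parallel(ivar=i)
    for (i = 0; i < n; i++) {
        if (v[i] == target) {
#pragma _CNX critical_section
            matches++;
#pragma _CNX end_critical_section
        }
    }
    return matches;
}
```

Both pragmas of the pair appear in the same flow of control, as the descriptions above require.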
end_ordered_section
This pragma defines the end of the ordered section that was begun with
the ordered_section pragma.
ordered_section and end_ordered_section
must appear as a pair. Refer to the Parallel Programming Guide for HP-UX
Systems for more information on ordered sections.
end_parallel
This pragma signifies the end of a parallel region. The parallel
pragma signifies the beginning of a parallel region. Refer to the
Parallel
Programming Guide for HP-UX Systems for more information.
end_tasks
This pragma terminates the specification of parallel tasks indicated by
begin_tasks
and next_task. It must appear at the end of the last section of
parallel code defined by these pragmas. All of these must appear in the
same program unit. Refer to the Parallel Programming Guide for HP-UX
Systems for more information.
loop_parallel[(attribute_list)]
This pragma is an explicit instruction to the compiler to parallelize the
immediately following loop. The loop iterations are run in an indeterminate
order unless the optional ordered attribute appears. You are responsible
for any required data privatization and loop synchronization, as described
in the Parallel Programming Guide for HP-UX Systems. The optional
attribute_list
can be any of the following combinations (n and m are integer
constants):
-
threads (default)
-
dist
-
ordered
-
max_threads=m
-
chunk_size=n
-
threads, ordered
-
dist, ordered
-
threads, max_threads=m
-
dist, max_threads=m
-
ordered, max_threads=m
-
threads, chunk_size=n
-
dist, chunk_size=n
-
threads, ordered, max_threads=m
-
dist, ordered, max_threads=m
-
chunk_size=n, max_threads=m
-
threads, chunk_size=n, max_threads=m
-
dist, chunk_size=n, max_threads=m
-
ivar=indvar
The ivar=indvar attribute is:
-
Required for all loops in C
-
Compatible with any other attribute
Attributes may be listed in any order. The compiler flags any attribute
combinations other than those listed above with a warning and ignores the
pragma.
Refer to the Parallel Programming Guide for HP-UX Systems for
more information.
loop_private(namelist)
This pragma declares a list of variables and/or arrays private to the immediately
following loop. No values may be carried into the loop by loop_private
variables. To be loop private, the variables and/or arrays must be assigned
before they are used on each iteration of the immediately following loop.
These private data items should be treated as distinct objects from the
shared items of the same name that exist outside the loop. Values assigned
to loop_private variables on the final iteration (that is, the nth
iteration of a loop with n iterations) may be saved into the shared
variables of the same name if the save_last pragma also appears
on this loop. If save_last is not used, then the value of any
shared variable declared to be loop_private is undefined at loop
termination. Refer to the Parallel Programming Guide for HP-UX Systems
for more information.
next_task
This pragma starts a block of code following a begin_tasks block
that will be executed as a parallel task. The end of the code block is
marked by another next_task or by an end_tasks pragma.
This pragma must appear within a begin_tasks and end_tasks
pair. There is no limit on the number of next_task pragmas that
can appear. Refer to the Parallel Programming Guide for HP-UX Systems
for more information.
no_block_loop
This pragma disables loop blocking on the immediately following loop. Refer
to the Parallel Programming Guide for HP-UX Systems for more information
on loop blocking.
no_distribute
This pragma disables loop distribution for the immediately following loop.
Refer to the Parallel Programming Guide for HP-UX Systems for more
information on loop distribution.
no_dynsel
This pragma disables workload-based dynamic selection for the immediately
following loop. Refer to the Parallel Programming Guide for HP-UX Systems
for more information on dynamic selection.
no_loop_dependence(namelist)
This pragma informs the compiler that the arrays in namelist do
not have any dependencies for iterations of the immediately following loop.
Use no_loop_dependence for arrays only; use loop_private
to indicate dependence-free scalar variables.
This pragma causes the compiler to ignore any dependences that it perceives
to exist. This can enhance the compiler's ability to optimize the loop,
including the possibility of parallelization.
Refer to the Parallel Programming Guide for HP-UX Systems for
more information.
no_loop_transform
This pragma prevents the compiler from performing reordering transformations
on the following loop. The compiler does not distribute, fuse, block, interchange,
unroll, unroll and jam, or parallelize a loop on which this pragma appears.
Refer to the Parallel Programming Guide for HP-UX Systems for more
information.
no_parallel
This pragma prevents the compiler from generating parallel code for the
immediately following loop. Refer to the Parallel Programming Guide
for HP-UX Systems for more information.
no_side_effects(funclist)
This pragma (#pragma _CNX no_side_effects) informs the compiler
that the functions appearing in funclist have no side effects wherever
they appear lexically following the pragma. Side effects include modifying
a function argument, performing I/O, or calling another routine that does
any of the above. The compiler can sometimes eliminate calls to procedures
that have no side effects; also, the compiler may be able to parallelize
loops with calls when informed that the called routines do not have side
effects.
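The following sketch applies the pragma to a small function so that a loop calling it can be considered for parallelization. The function names square and square_all are hypothetical.

```c
/* Depends only on its argument and modifies nothing else. */
static double square(double x)
{
    return x * x;
}

#pragma _CNX no_side_effects(square)

/* Stores the square of each of the n elements of in into out. */
void square_all(const double *in, double *out, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = square(in[i]);
}
```

The pragma is placed before the loop because it applies to calls that lexically follow it.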
ordered_section(gate_var)
This pragma defines the beginning of an ordered section. An ordered section
is the same as a critical section (a code block in which only one thread
may be executing at a time) with the additional restriction that the threads
must pass through the ordered section in iteration order of the most recently
initiated parallelized loop. The end of the code block must be indicated
by an end_ordered_section pragma. Ordered sections must appear
within the control flow of a loop_parallel (ordered) pragma. Refer
to the Parallel Programming Guide for HP-UX Systems for more information.
parallel[(attribute_list)]
This pragma signifies the beginning of a parallel region of code. All code
up to the following end_parallel pragma will be run on all available
threads. No loop transformations, data privatization, or parallelization
analysis will be performed by the compiler on the region.
The optional attribute_list can be any of the following legal
combinations (m is an integer constant):
-
threads (default)
-
max_threads=m
-
threads,max_threads=m
Attributes may be listed in any order. The compiler flags any attribute
combinations other than those listed above with a warning and ignores the
pragma.
Refer to the Parallel Programming Guide for HP-UX Systems for
more information.
parallel_private(namelist)
This pragma declares a list of variables or arrays private to the immediately
following parallel region. It serves the same purpose for parallel regions
that task_private serves for tasks. The privatized variables and
arrays will not carry their values beyond the end_parallel pragma.
Refer to the Parallel Programming Guide for HP-UX Systems for more
information.
prefer_parallel[(attribute_list)]
This pragma instructs the compiler to parallelize the following loop, but
only if it is safe to do so. A loop is safe to parallelize if it has an
iteration count that can be determined at runtime before loop invocation
and contains no loop-carried dependences, procedure calls, or I/O operations.
(A loop-carried dependence exists when one iteration of a loop assigns
a value to an address that is referenced or assigned on another iteration.)
Refer to the
Parallel Programming Guide for HP-UX Systems for more
information.
The optional attribute_list can be any of the following combinations
(n and m are integer constants):
-
threads (default)
-
dist
-
max_threads=m
-
chunk_size=n
-
threads, max_threads=m
-
dist, max_threads=m
-
threads, chunk_size=n
-
dist, chunk_size=n
-
chunk_size=n, max_threads=m
-
threads, chunk_size=n, max_threads=m
-
dist, chunk_size=n, max_threads=m
Attributes may be listed in any order. The compiler flags any attribute
combinations other than those listed above with a warning and ignores the
pragma.
save_last[(list)]
This pragma specifies that the variables in the comma-separated list
that are also named in an associated loop_private(namelist) pragma
must have their last values saved into the "shared" variable of the same
name at loop termination. (A variable's last value in a loop of n
iterations is the value it is assigned in the nth iteration.)
If the optional list is not used, save_last specifies
that all variables named in an associated loop_private(namelist)
pragma must have their last values saved into the "shared" variable of
the same name at loop termination.
If save_last is not specified, then the values in any privatized
variables or arrays are indeterminate at loop termination. Refer to the
Parallel Programming Guide for HP-UX Systems for more information.
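As a sketch, the following hypothetical function combines loop_parallel, loop_private, and save_last in one comma-separated pragma list. The temporary t is assigned before it is used on every iteration, as loop_private requires. The function assumes n is at least 1.

```c
/* Writes in[i] * 0.5 + 1.0 to out[i] and returns the value that the
 * temporary t held on the final iteration. */
double halve_and_shift(const double *in, double *out, int n)
{
    double t = 0.0;
    int i;
#pragma _CNX loop_parallel(ivar=i), loop_private(t), save_last(t)
    for (i = 0; i < n; i++) {
        t = in[i] * 0.5;      /* t is assigned before it is used */
        out[i] = t + 1.0;
    }
    return t;                 /* value from iteration n-1 */
}
```

Without save_last, the value of t after the loop would be undefined when the loop runs in parallel.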
scalar
This pragma prevents the compiler from performing reordering transformations
on the following loop. The compiler does not distribute, fuse, block, interchange,
unroll, unroll and jam, or parallelize a loop on which this pragma appears.
The no_loop_transform pragma provides the same functionality
as the scalar pragma and is recommended in place of the scalar
pragma.
task_private(namelist)
This pragma privatizes the variables and arrays specified in namelist for
each task specified in the immediately following begin_tasks/end_tasks
block. If a task_private data object is referenced within a task,
it must have been assigned a value previously in that task. The privatized
variables and arrays do not carry their values beyond the end_tasks
pragma. Refer to the Parallel Programming Guide for HP-UX Systems
for more information.
Specifying Task Parallelism
The following example uses the begin_tasks, task_private, next_task,
and end_tasks pragmas to specify simple task parallelism:
/* one thread executes the for loop */
#pragma begin_tasks, task_private(i)
for (i = 0; i < n - 1; i++)
    a[i] = a[i + 1] + b[i];
/* another thread executes the function call */
#pragma next_task
tsub(x, y);
/* a third thread assigns elements of array d to every
   other element of c */
#pragma next_task
for (i = 0; i < 500; i++)
    c[i * 2] = d[i];
#pragma end_tasks
The loop induction variable i is manually privatized because it
is used to control loops in two different tasks. If i were not
private, both tasks would modify it, causing wrong answers. The task_private
pragma is described in task_private(namelist).
OpenMP Pragmas
OpenMP is an industry-standard parallel programming model which implements
a fork-join model of parallel execution. The HP C OpenMP pragmas included
in this release are based on the OpenMP Standard for C, version 1.0.
To view details about the standard and details about usage, syntax and
values, please go to OpenMP.
You can download either a postscript (ps) or Adobe Acrobat (PDF) version
of the C/C++ Version 1.0 OpenMP standard from this website.
OpenMP pragmas have the following options:
+Oopenmp command line option
The OpenMP driver option +Oopenmp is added to this release of the
HP C compiler.
NOTE: The +Oopenmp option is accepted at all optimization levels. However,
most of the OpenMP pragmas need a minimum optimization level
of +O3. To ensure that OpenMP pragmas are recognized, you must specify
+O3 on the command line.
When +Oopenmp is seen in the command line, +Onodynsel, +Oparallel, +Onofailsafe,
and +Onoautopar are passed by default to the cc driver.
NOTE: +Oopenmp overrides +Odynsel, +Ofailsafe and +Onoparallel.
When +Oopenmp is used, most of the HP Programming Model (HPPM) pragmas
are not accepted. The following HPPM pragmas are accepted by the HP C
compiler when +Oopenmp is specified:
-
BLOCK_LOOP
-
NO_BLOCK_LOOP
-
NO_DISTRIBUTE
-
NO_DYNSEL
-
NO_PARALLEL
-
NO_LOOP_DEPENDENCE
-
NO_LOOP_TRANSFORM
-
NO_UNROLL_AND_JAM
-
OPTIONS
-
SCALAR
-
UNROLL_AND_JAM
New Header File
Every C program that contains OpenMP pragmas must be compiled for the
current version of HP-UX and must include the header file <omp.h>. If
it does not, the OpenMP pragmas are ignored. The default path for <omp.h>
is /usr/include.
OpenMP macro _OPENMP
The _OPENMP macro name is defined by OpenMP-compliant implementations
as the decimal constant yyyymm, which is the year and month of the
approved specification. This macro must not be the subject of a #define
or #undef preprocessing directive.
#ifdef _OPENMP
iam = omp_get_thread_num() + index;
#endif
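The macro also makes it possible to write source that builds with or without OpenMP support. The following is a minimal sketch; the function name current_thread_id is hypothetical, and the serial fallback value of 0 is a design choice for this example.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the calling thread's number within its team, or 0 when the
 * program is compiled without OpenMP support. */
int current_thread_id(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}
```

Outside a parallel region the function returns 0 in both builds, because omp_get_thread_num reports 0 for the master thread.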
OpenMP Pragmas
The following work-sharing and synchronization pragmas, along with the listed
clauses, are available with the HP C compiler.
Work-sharing pragmas:
Synchronization pragmas:
A directive that controls the data environment during execution of parallel
regions is:
Each of the pragmas available in this release of the HP C compiler is discussed
in brief below.
OMP PARALLEL
The OMP PARALLEL pragma defines a parallel region, which is a region of
the program that is executed by multiple threads in parallel. This is the
fundamental construct that starts parallel execution.
#pragma OMP PARALLEL [clause1, clause2,...] new-line
structured-block
where [clause1, clause2,...] indicates that the clauses are optional.
There can be zero (0) or more clauses, where clause may be one of the following:
-
PRIVATE (list)
private declares the variables in the (list) to be private to each
thread in a team. A new object with automatic storage duration is allocated
within the associated structured block.
-
DEFAULT (shared | none)
Specifying default(shared) is equivalent to explicitly listing each
currently visible variable in a shared clause. A variable referenced in
the scope of default(none) should be explicitly qualified by a private or
shared clause.
-
SHARED (list)
The shared clause shares the variables that appear in the list
among all threads in a team. All threads within a team access the
same storage area for the shared variables.
-
REDUCTION(operator: list)
This clause performs a reduction on the scalar variables that appear
in the list, with the operator op. The syntax of the reduction is: reduction(op: list)
-
LASTPRIVATE(list)
When lastprivate is specified on a loop or section, the value of
each lastprivate variable from the sequentially last iteration of the associated
loop, or the lexically last section directive, is assigned to the variable's
original object.
-
SCHEDULE(runtime)
schedule is used only with the FOR directive.
Example:
#pragma omp [for/parallel for] schedule(runtime)
-
IF(scalar-expression)
if(expr) is one of the clauses that can be used along with #pragma
OMP Parallel. The associated block of code will be executed in parallel
if the (expr) evaluates to a non-zero value, else no parallelization happens
and it is executed sequentially.
Example:
#pragma omp parallel private(x) if (a>b) reduction(+:p)
{
//code to be conditionally parallelized
}
-
FIRSTPRIVATE(list)
The firstprivate clause provides a superset of the functionality provided
by the private clause.
Variables specified in the list have private clause semantics described
earlier. The new private object is initialized as if there were an implied
declaration inside the structured block and the initializer is the value
of the original object.
NOTE: private and firstprivate do not work for globals and aggregate types.
-
COPYIN(list)
The copyin clause copies the master thread's copy of each threadprivate variable
in the list to all other threads at the beginning of the parallel region. This clause
can only be used with the PARALLEL directive.
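The data-sharing clauses above can be combined on one parallel region. The following is a minimal sketch rather than HP's own example; the function name team_total and its variables are hypothetical. It uses only scalar local variables, in keeping with the note that private and firstprivate do not work for globals and aggregate types.

```c
/* Returns (base + 1) multiplied by the number of threads in the team.
 * When compiled without OpenMP support the pragma is ignored and the
 * block runs once, so the result is simply base + 1. */
int team_total(int base)
{
    int offset = base;   /* copied into each thread by firstprivate */
    int total = 0;       /* combined across threads by reduction */

    #pragma omp parallel default(none) firstprivate(offset) reduction(+:total)
    {
        offset += 1;     /* changes only this thread's private copy */
        total += offset; /* each thread contributes base + 1 */
    }
    return total;
}
```

Because default(none) is specified, every variable referenced in the region appears in an explicit clause.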
OMP FOR
The OMP FOR pragma for HP C identifies a construct that specifies a region
in which the iterations of the associated loop should be executed in parallel.
The iterations of the loop are distributed across threads that already
exist.
#pragma OMP FOR [clause1, clause2, ...] new-line
for-loop
where [clause1, clause2,...] indicates that the clauses are optional.
There can be zero (0) or more clauses, where clause may be one of the following:
-
LASTPRIVATE (list)
-
REDUCTION (op: list)
-
ORDERED
-
SCHEDULE (kind[, chunksize])
NOTE: chunksize should be an integer constant. Expressions in place of chunksize
are not supported. kind can be static, dynamic, guided, or runtime.
-
FIRSTPRIVATE
-
PRIVATE
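As an illustration of OMP FOR inside an existing parallel region, here is a small sketch. The function scale_in_place is a hypothetical name, and the chunk size is written as an integer constant because expressions are not supported in its place.

```c
/* Multiplies every element of a[0..n-1] by factor.
 * The threads are created by the parallel region; the for pragma only
 * divides the iterations among the threads that already exist. */
void scale_in_place(int *a, int n, int factor)
{
    #pragma omp parallel
    {
        int i;  /* declared inside the region, so each thread has its own */

        #pragma omp for schedule(static, 4)
        for (i = 0; i < n; i++) {
            a[i] *= factor;
        }
    }
}
```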
OMP_SECTION/OMP_SECTIONS
The OMP SECTION/SECTIONS pragmas identify a construct that specifies a
set of constructs to be divided among threads in a team. Each section is
executed by one of the threads in the team.
#pragma OMP SECTIONS [clause1, clause2, ...] new-line
{
#pragma OMP SECTION new-line
structured-block
#pragma OMP SECTION new-line
structured-block
.
.
.
}
where [clause1, clause2, ...] indicates that the clauses are optional.
There can be zero (0) or more clauses, where clause may be one of the following:
-
LASTPRIVATE
-
REDUCTION
-
PRIVATE
-
FIRSTPRIVATE
-
NOWAIT
OMP PARALLEL FOR
The OMP PARALLEL FOR pragma for HP C is a shortcut for an OMP PARALLEL
region that contains a single OMP FOR pragma.
#pragma OMP PARALLEL FOR [clause1, clause2, ...] new-line
for-loop
OMP PARALLEL FOR admits all the allowable clauses of the OMP PARALLEL pragma
and the OMP FOR pragma.
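A short sketch of the combined form follows, using a reduction clause to accumulate a sum safely. The function name sum_of_squares is hypothetical and chosen only for illustration.

```c
/* Returns 1*1 + 2*2 + ... + n*n.
 * Each thread accumulates into a private copy of sum; the copies are
 * added to the original variable at the end of the loop. */
long sum_of_squares(int n)
{
    long sum = 0;
    int i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= n; i++) {
        sum += (long)i * i;
    }
    return sum;
}
```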
OMP PARALLEL SECTIONS
The OMP PARALLEL SECTIONS pragma for HP C is a shortcut for specifying
a parallel region containing a single OMP SECTIONS pragma.
#pragma OMP PARALLEL SECTIONS [clause1, clause2, ...] new-line
{
[#pragma OMP SECTION new-line]
structured-block
[#pragma OMP SECTION new-line
structured-block
.
.
.]
}
OMP PARALLEL SECTIONS admits all the allowable clauses of the OMP PARALLEL
pragma and the OMP SECTIONS pragma. The PRIVATE clause is supported.
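The following sketch shows two independent tasks expressed as sections. It is an illustration under assumed names (min_max, lo, hi), not code from the HP documentation. Each section writes a different shared variable, so no further synchronization is needed.

```c
/* Finds the smallest and largest values in a[0..n-1], with n >= 1.
 * One section scans for the minimum and the other for the maximum;
 * each section is executed by one thread of the team. */
void min_max(const int *a, int n, int *out_min, int *out_max)
{
    int lo = a[0];
    int hi = a[0];

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            int i;
            for (i = 1; i < n; i++) {
                if (a[i] < lo) lo = a[i];
            }
        }
        #pragma omp section
        {
            int i;
            for (i = 1; i < n; i++) {
                if (a[i] > hi) hi = a[i];
            }
        }
    }
    *out_min = lo;
    *out_max = hi;
}
```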
OMP SINGLE
The OMP SINGLE directive identifies a construct that specifies the associated
structured block is executed by only one thread in the team (not necessarily
the master thread).
#pragma OMP SINGLE [clause [clause] ...] new-line
structured-block
where [clause] is one of the following:
-
PRIVATE (list)
-
FIRSTPRIVATE (list)
-
NOWAIT
NOTE: SINGLE does not take any associated clauses in this release of the HP C
compiler.
OMP CRITICAL
The OMP CRITICAL pragma identifies a construct that restricts
the execution of the associated structured block to one thread at a time.
#pragma OMP CRITICAL [ (name)] new-line
structured-block
The critical section name parameter is optional. All unnamed critical sections
map to the same global name, which the HP C compiler supplies.
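As a sketch of a named critical section, the function below protects a compare-and-update of a shared variable. The names parallel_max and update_best are hypothetical. Serializing every comparison is slow, so this only illustrates the construct.

```c
/* Returns the largest value in a[0..n-1], with n >= 1.
 * The comparison and the assignment must happen as one unit, so both
 * are placed inside the critical section. */
int parallel_max(const int *a, int n)
{
    int best = a[0];
    int i;

    #pragma omp parallel for
    for (i = 1; i < n; i++) {
        #pragma omp critical (update_best)
        {
            if (a[i] > best) {
                best = a[i];
            }
        }
    }
    return best;
}
```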
OMP BARRIER
The OMP BARRIER pragma synchronizes all the threads in a team. When encountered,
each thread waits until all the threads in the team have reached that point.
#pragma OMP BARRIER new-line
OMP ORDERED
The OMP ORDERED pragma indicates that the following structured block should
be executed in the same order in which iterations would be executed in
a sequential loop.
#pragma OMP ORDERED new-line
structured-block
The OMP ORDERED pragma must be used within an OMP FOR or OMP PARALLEL FOR loop
that specifies the ORDERED clause. When the ORDERED clause is used with a SCHEDULE
clause that has a chunksize, the compiler ignores the chunksize.
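A minimal sketch of an ordered loop follows; ordered_squares and its variables are hypothetical names. The squaring can proceed in parallel, while the append to the output buffer happens in sequential iteration order.

```c
/* Writes 0*0, 1*1, ..., (n-1)*(n-1) into out[0..n-1].
 * The shared index next is touched only inside the ordered block,
 * so the values are appended in the same order as a serial loop. */
void ordered_squares(int *out, int n)
{
    int next = 0;
    int i;

    #pragma omp parallel for ordered
    for (i = 0; i < n; i++) {
        int sq = i * i;          /* may be computed by any thread */

        #pragma omp ordered
        {
            out[next] = sq;      /* executed in iteration order */
            next++;
        }
    }
}
```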
OMP ATOMIC
The OMP ATOMIC directive ensures that a specific memory location is updated
atomically, rather than exposing it to the possibility of multiple simultaneous
writing threads.
#pragma OMP ATOMIC new-line
expression-stmt
where expression-stmt must have one of the following forms:
-
x binop= expr
-
x++
-
++x
-
x--
-
--x
where, in the above expressions:
-
x is an lvalue expression with scalar type.
-
expr is an expression with scalar type that does not reference the
object designated by x.
-
binop is not an overloaded operator and is one of +, *, -, /, &, ^, |,
<<, or >>.
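The x++ form can be sketched as a shared counter updated from a parallel loop. The function name count_even is hypothetical.

```c
/* Returns how many elements of a[0..n-1] are even.
 * count is shared by all threads; the atomic pragma makes each
 * increment indivisible. */
int count_even(const int *a, int n)
{
    int count = 0;
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        if (a[i] % 2 == 0) {
            #pragma omp atomic
            count++;
        }
    }
    return count;
}
```

Only the update of count is atomic; the test a[i] % 2 == 0 is evaluated outside the atomic statement.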
OMP FLUSH
The OMP FLUSH directive, whether explicit or implied, specifies a cross-thread
sequence point at which the implementation is required to ensure that all
threads in a team have a consistent view of certain objects in memory.
#pragma OMP FLUSH [(list)] new-line
A FLUSH directive without a (list) is implied for the following directives:
-
barrier
-
at entry to and exit from CRITICAL
-
at entry to and exit from ORDERED
-
at exit from PARALLEL
-
at exit from FOR
-
at exit from SECTIONS
-
at exit from SINGLE
The directive is not implied if a NOWAIT clause is present.
OMP MASTER
The OMP MASTER pragma for HP C directs that the structured block following
it should be executed by the master thread (thread 0) of the team.
#pragma OMP MASTER new-line
structured-block
Other threads in the team do not execute the associated block.
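The sketch below pairs OMP MASTER with OMP BARRIER, on the assumption (standard in OpenMP) that MASTER has no implied barrier of its own. The function broadcast_check and its variables are hypothetical names.

```c
/* The master thread stores a value, the barrier makes every thread wait
 * until that store is done, and then each thread checks the value.
 * Returns the number of threads that observed the expected value. */
int broadcast_check(int seed)
{
    int shared_value = 0;
    int seen = 0;

    #pragma omp parallel reduction(+:seen)
    {
        #pragma omp master
        {
            shared_value = seed * 2;   /* executed by thread 0 only */
        }

        #pragma omp barrier            /* also implies a flush */

        if (shared_value == seed * 2) {
            seen += 1;
        }
    }
    return seen;
}
```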
OMP NOWAIT
The NOWAIT clause removes the implicit barrier synchronization
at the end of a FOR or SECTIONS construct.
#pragma omp for nowait
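A minimal sketch of where nowait is safe: the two loops below write to different arrays, so a thread that finishes its share of the first loop can start the second immediately. The function name fill_two is hypothetical.

```c
/* Fills a[i] = i and b[i] = 2 * i for i in 0..n-1.
 * The first loop drops its closing barrier; the second keeps it, so
 * both arrays are complete when the parallel region ends. */
void fill_two(int *a, int *b, int n)
{
    #pragma omp parallel
    {
        int i;

        #pragma omp for nowait
        for (i = 0; i < n; i++) {
            a[i] = i;
        }

        #pragma omp for
        for (i = 0; i < n; i++) {
            b[i] = 2 * i;      /* does not read a[], so no wait is needed */
        }
    }
}
```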
OMP THREADPRIVATE
The OMP THREADPRIVATE directive is provided to make file-scope variables
local to a thread. The THREADPRIVATE directive makes the named file-scope
or namespace-scope variables specified in the list private to a thread
but file-scope visible within the thread.
#pragma OMP THREADPRIVATE (list) new-line
NOTE: THREADPRIVATE variables are not supported in this release of the HP C compiler.
Caveats
Observe these known restrictions while you use OpenMP pragmas:
-
An array in a firstprivate clause is treated as private.
-
The clauses firstprivate, private, and nowait
are not supported with the SINGLE directive at this stage.
Environment Variables in OpenMP
The OpenMP environment variables available in the HP C compiler control the
execution of parallel code. The environment variable names are case sensitive
and must be in uppercase. The following environment variables are
available in the HP C compiler:
OMP_SCHEDULE
This environment variable applies to for and parallel for directives that
have the schedule type runtime. The schedule type and chunk size for
all such loops can be set at run time by setting this environment variable
to any of the recognized schedule types and, optionally, a chunk_size.
setenv OMP_SCHEDULE "dynamic"
The default value of the environment variable is implementation dependent.
If the optional chunk_size is set, the value must be positive. If chunk_size
is not set, a value of 1 is assumed, except for static schedule. For a
static schedule, the default chunk_size is set to the loop iteration space
divided by the number of threads applied to the loop.
NOTE: OMP_SCHEDULE is ignored for for and parallel for directives
that have a schedule type other than runtime.
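As a sketch of a loop whose schedule is deferred to OMP_SCHEDULE, consider the following function. The name count_multiples is hypothetical; the result does not depend on which schedule is chosen, only the distribution of iterations does.

```c
/* Returns how many integers in 1..n are multiples of k (k > 0).
 * schedule(runtime) leaves the schedule type and chunk size to the
 * OMP_SCHEDULE environment variable at run time. */
int count_multiples(int n, int k)
{
    int count = 0;
    int i;

    #pragma omp parallel for schedule(runtime) reduction(+:count)
    for (i = 1; i <= n; i++) {
        if (i % k == 0) {
            count++;
        }
    }
    return count;
}
```

Running the program after setenv OMP_SCHEDULE "dynamic" would apply a dynamic schedule to this loop.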
OMP_NUM_THREADS
The value of OMP_NUM_THREADS must be positive. Its effect
depends on whether dynamic adjustment of the number of threads is enabled.
If dynamic adjustment is disabled, the value of this environment variable
is the number of threads to use for each parallel region until that number
is explicitly changed during execution. If dynamic adjustment of the number
of threads is enabled, the value of the environment variable is interpreted
as the maximum number of threads to use.
setenv OMP_NUM_THREADS 16
OMP_DYNAMIC
The OMP_DYNAMIC environment variable enables or disables dynamic
adjustment of the number of threads available for execution of parallel
regions. Its value must be TRUE or FALSE. If the value is set to TRUE,
the number of threads that are used for executing parallel regions may
be adjusted by the runtime environment to best utilize system resources.
If the value is set to FALSE, dynamic adjustment is disabled.
setenv OMP_DYNAMIC TRUE
OMP_NESTED
The OMP_NESTED environment variable enables or disables nested
parallelism. Its value must be TRUE or FALSE. If the value is set to TRUE,
nested parallelism is enabled and if the value is set to FALSE, the nested
parallelism is disabled. The default value is FALSE.
setenv OMP_NESTED FALSE
Runtime Library Functions
This section describes the OpenMP C run-time library functions. The header
<omp.h>
declares two types, several functions that can be used to control and query
the parallel execution environment, and lock functions that can be used
to synchronize access to data.
The type omp_lock_t is an object type capable of representing
either that a lock is available or that a thread owns a lock. These locks are referred
to as simple locks.
The type omp_nest_lock_t is an object type capable of representing
either that a lock is available, or both the identity of the thread that
owns the lock and a nesting count. These locks are referred to as nestable
locks.
The library functions are external functions.
The descriptions of library functions are divided into the following
topics:
Execution environment functions
The functions described in this section affect and monitor threads, processors,
and the parallel environment:
omp_set_num_threads
The omp_set_num_threads function sets the number of threads to
use for subsequent parallel regions. The format is as follows:
#include <omp.h>
void omp_set_num_threads(int num_threads);
The value of the parameter num_threads must be positive. Its effect
depends upon whether dynamic adjustment of the number of threads is enabled.
If dynamic adjustment is disabled, the value is used as the number of threads
for all subsequent parallel regions prior to the next call to this function;
otherwise, the value is the maximum number of threads that will be used.
This function has effect only when called from serial portions of the program.
If it is called from a portion of the program where the omp_in_parallel
function returns non-zero, the behavior of this function is undefined.
For more information on this subject, see the omp_set_dynamic
and omp_get_dynamic functions. This call has precedence over the
OMP_NUM_THREADS
environment variable.
omp_get_num_threads
The omp_get_num_threads function returns the number of threads
currently in the team executing the parallel region from which it is called.
The format is as follows:
#include <omp.h>
int omp_get_num_threads(void);
The omp_set_num_threads function and the OMP_NUM_THREADS
environment variable control the number of threads in a team. If the number
of threads has not been explicitly set by the user, the default is implementation
dependent. This function binds to the closest
enclosing omp parallel directive. If called from a serial portion of
a program, or from a nested parallel region that is serialized, this function
returns 1.
omp_get_max_threads
The omp_get_max_threads function returns the maximum value that
can be returned by calls to omp_get_num_threads. The format is
as follows:
#include <omp.h>
int omp_get_max_threads(void);
If omp_set_num_threads is used to change the number of threads,
subsequent calls to this function will return the new value. A typical
use of this function is to determine the size of an array for which all
thread numbers are valid indices, even when omp_set_dynamic is
set to non-zero.
This function returns the maximum value whether executing within a serial
region or a parallel region.
omp_get_thread_num
The omp_get_thread_num function returns the thread number, within
its team, of the thread executing the function. The thread number lies
between 0 and omp_get_num_threads()-1, inclusive. The master thread
of the team is thread 0. The format is as follows:
#include <omp.h>
int omp_get_thread_num(void);
If called from a serial region, omp_get_thread_num returns 0.
If called from within a nested parallel region that is serialized, this
function returns 0.
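The use of omp_get_max_threads to size a per-thread array can be sketched as follows. The function sum_with_slots is a hypothetical name, and the #else branch supplies serial stand-ins (an assumption of this sketch, not part of the library) so that the code also builds when OpenMP support is not enabled.

```c
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins used only when OpenMP is not enabled. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
#endif

/* Sums a[0..n-1] using one accumulator slot per thread.
 * Returns -1 if the slot array cannot be allocated. */
long sum_with_slots(const int *a, int n)
{
    int nslots = omp_get_max_threads();   /* every thread number fits */
    long *partial = calloc((size_t)nslots, sizeof *partial);
    long total = 0;
    int t;

    if (partial == NULL) {
        return -1;
    }

    #pragma omp parallel
    {
        int me = omp_get_thread_num();    /* 0 .. team size - 1 */
        int j;

        #pragma omp for
        for (j = 0; j < n; j++) {
            partial[me] += a[j];
        }
    }

    for (t = 0; t < nslots; t++) {
        total += partial[t];              /* combine in serial code */
    }
    free(partial);
    return total;
}
```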
omp_get_num_procs
The omp_get_num_procs function returns the maximum number of processors
that could be assigned to the program. The format is as follows:
#include <omp.h>
int omp_get_num_procs(void);
omp_in_parallel
The omp_in_parallel function returns non-zero if it is called
within the dynamic extent of a parallel region executing in parallel; otherwise,
it returns 0. The format is as follows:
#include <omp.h>
int omp_in_parallel(void);
This function returns non-zero from within a region executing in parallel,
regardless of nested regions that are serialized.
omp_set_dynamic
The omp_set_dynamic function enables or disables dynamic adjustment
of the number of threads available for execution of parallel regions. The
format is as follows:
#include <omp.h>
void omp_set_dynamic(int dynamic_threads);
This function has effect only when called from serial portions of the program.
If it is called from a portion of the program where the omp_in_parallel
function returns non-zero, the behavior of the function is undefined. If dynamic_threads
evaluates to non-zero, the number of threads that are used for executing
subsequent parallel regions may be adjusted automatically by the run-time
environment to best utilize system resources. As a consequence, the number
of threads specified by the user is the maximum thread count. The number
of threads always remains fixed over the duration of each parallel region
and is reported by the omp_get_num_threads function.
If dynamic_threads evaluates to 0, dynamic adjustment is disabled.
A call to omp_set_dynamic has precedence over the OMP_DYNAMIC
environment variable.
The default for the dynamic adjustment of threads is implementation
dependent. As a result, user codes that depend on a specific number of
threads for correct execution should explicitly disable dynamic threads.
Implementations are not required to provide the ability to dynamically
adjust the number of threads, but they are required to provide the interface
in order to support portability across all platforms.
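Code that depends on a specific team size can disable dynamic adjustment before requesting a thread count, as in this sketch. The function run_with_fixed_team is a hypothetical name, and the #else branch provides serial stand-ins (an assumption of this sketch) for builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins used only when OpenMP is not enabled. */
static void omp_set_dynamic(int dynamic_threads) { (void)dynamic_threads; }
static void omp_set_num_threads(int num_threads) { (void)num_threads; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Requests a team of the given size and returns the size actually used.
 * Both setter calls are made from the serial part of the program. */
int run_with_fixed_team(int requested)
{
    int team_size = 0;

    omp_set_dynamic(0);                /* disable dynamic adjustment */
    omp_set_num_threads(requested);    /* applies to later parallel regions */

    #pragma omp parallel
    {
        #pragma omp master
        team_size = omp_get_num_threads();
    }
    return team_size;
}
```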
omp_get_dynamic
The omp_get_dynamic function returns non-zero if dynamic thread
adjustment is enabled and returns 0 otherwise. For a description of dynamic
thread adjustment, see omp_set_dynamic. The format is as follows:
#include <omp.h>
int omp_get_dynamic(void);
If the implementation does not implement dynamic adjustment of the number
of threads, this function always returns 0.
omp_set_nested
The omp_set_nested function enables or disables nested parallelism.
The format is as follows:
#include <omp.h>
void omp_set_nested(int nested);
If nested evaluates to 0, which is the default, nested parallelism is disabled,
and nested parallel regions are serialized and executed by the current
thread. If nested evaluates to non-zero, nested parallelism is enabled,
and parallel regions that are nested may deploy additional threads to form
the team.
This call has precedence over the OMP_NESTED environment variable.
When nested parallelism is enabled, the number of threads used to execute
nested parallel regions is implementation dependent. As a result, OpenMP-compliant
implementations are allowed to serialize nested parallel regions even when
nested parallelism is enabled.
omp_get_nested
The omp_get_nested function returns non-zero if nested parallelism
is enabled and 0 if it is disabled. The format is as follows:
#include <omp.h>
int omp_get_nested(void);
If an implementation does not implement nested parallelism, this function
always returns 0.
Lock functions
The functions described in this section manipulate locks used for synchronization.
For the following functions, the lock variable must have type omp_lock_t.
This variable must only be accessed through these functions. All lock functions
require an argument that has a pointer to omp_lock_t type.
For the following functions, the lock variable must have type omp_nest_lock_t.
This variable must only be accessed through these
functions. All nestable lock functions require an argument that has
a pointer to omp_nest_lock_t type.
omp_init_lock and omp_init_nest_lock Functions
These functions provide the only means of initializing a lock. Each function
initializes the lock associated with the parameter lock for use in subsequent
calls. The format is as follows:
#include <omp.h>
void omp_init_lock(omp_lock_t *lock);
void omp_init_nest_lock(omp_nest_lock_t *lock);
The initial state is unlocked (that is, no thread owns the lock). For a
nestable lock, the initial nesting count is zero.
omp_destroy_lock and omp_destroy_nest_lock Functions
These functions ensure that the lock variable pointed to by lock is uninitialized.
The format is as follows:
#include <omp.h>
void omp_destroy_lock(omp_lock_t *lock);
void omp_destroy_nest_lock(omp_nest_lock_t *lock);
The argument to these functions must point to an initialized lock variable
that is unlocked.
omp_set_lock and omp_set_nest_lock Functions
Each of these functions blocks the thread executing the function until
the specified lock is available and then sets the lock. A simple lock is
available if it is unlocked. A nestable lock is available if it is unlocked
or if it is already owned by the thread executing the function. The format
is as follows:
#include <omp.h>
void omp_set_lock(omp_lock_t *lock);
void omp_set_nest_lock(omp_nest_lock_t *lock);
For a simple lock, the argument to the omp_set_lock function must
point to an initialized lock variable. Ownership of the lock is granted
to the thread executing the function.
For a nestable lock, the argument to the omp_set_nest_lock
function must point to an initialized lock variable. The nesting count
is incremented, and the thread is granted, or retains, ownership of the
lock.
omp_unset_lock and omp_unset_nest_lock Functions
These functions provide the means of releasing ownership of a lock. The
format is as follows:
#include <omp.h>
void omp_unset_lock(omp_lock_t *lock);
void omp_unset_nest_lock(omp_nest_lock_t *lock);
The argument to each of these functions must point to an initialized lock
variable owned by the thread executing the function. The behavior is undefined
if the thread does not own that lock.
For a simple lock, the omp_unset_lock function releases the
thread executing the function from ownership of the lock.
For a nestable lock, the omp_unset_nest_lock function decrements
the nesting count, and releases the thread executing the function from
ownership of the lock if the resulting count is zero.
omp_test_lock and omp_test_nest_lock Functions
These functions attempt to set a lock but do not block execution of the
thread. The format is as follows:
#include <omp.h>
int omp_test_lock(omp_lock_t *lock);
int omp_test_nest_lock(omp_nest_lock_t *lock);
The argument must point to an initialized lock variable. These functions
attempt to set a lock in the same manner as omp_set_lock and omp_set_nest_lock,
except that they do not block execution of the thread.
For a simple lock, the omp_test_lock function returns non-zero
if the lock is successfully set; otherwise, it returns zero.
For a nestable lock, the omp_test_nest_lock function returns
the new nesting count if the lock is successfully set; otherwise, it returns
zero.
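The life cycle of a simple lock (initialize, set, unset, destroy) can be sketched as below. The function locked_sum is a hypothetical name, and the #else branch defines serial stand-ins (an assumption of this sketch, not the real library types) so that the code builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins used only when OpenMP is not enabled. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *lock) { *lock = 0; }
static void omp_set_lock(omp_lock_t *lock) { *lock = 1; }
static void omp_unset_lock(omp_lock_t *lock) { *lock = 0; }
static void omp_destroy_lock(omp_lock_t *lock) { (void)lock; }
#endif

/* Sums a[0..n-1], guarding each update of the shared total with a
 * simple lock. The lock is initialized before the parallel loop and
 * destroyed after it, while it is unlocked. */
long locked_sum(const int *a, int n)
{
    omp_lock_t lock;
    long total = 0;
    int i;

    omp_init_lock(&lock);

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        omp_set_lock(&lock);      /* blocks until the lock is available */
        total += a[i];
        omp_unset_lock(&lock);    /* released by the owning thread */
    }

    omp_destroy_lock(&lock);
    return total;
}
```

A reduction clause would normally be the better tool for a sum; the lock is used here only to show the call sequence.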
Memory Classes
In order to use memory classes in C programs, you must include the header
file /usr/include/spp_prog_model.h. Memory classes are described
in the Parallel Programming Guide for HP-UX Systems. The memory
classes described in this section include:
In C, the general form for assigning memory is:
#include <spp_prog_model.h>
. . .
[storage_class_specifier] memory_class_name type_specifier namelist
where:
-
storage_class_specifier specifies a non-automatic storage class
-
memory_class_name is thread_private or node_private
-
type_specifier is a data type (for example, int or float)
-
namelist is a comma-separated list of variables and/or arrays of
type type_specifier
Data objects that are assigned a memory class must have a static storage
duration. If the object is declared within a function, it must have the
storage class extern or static. Data objects declared
at file scope and assigned a memory class need not specify a storage class.
A hypernode is a set of processors and physical memory organized as
a symmetric multiprocessor (SMP) running a single image of the operating
system microkernel.
node_private
This storage class specifier causes the variables and arrays specified
in namelist to be replicated in the physical memory of each hypernode
on which the process is executing. While each data object has a single
image in virtual memory, it maps to a different physical location on each
hypernode. The threads of a process within a hypernode all share access
to the copy on their hypernode and cannot access the copies on other hypernodes.
thread_private
This storage class specifier causes the variables and arrays to be treated
as thread_private. These data objects map to unique node_private
addresses for each thread of a process. Refer to the Parallel Programming
Guide for HP-UX Systems for more information.
Synchronization Functions
HP C provides functions that can be used with pragmas to achieve synchronization.
Those discussed in this section include:
Gates allow you to restrict execution of a block of code to a single thread.
They can be allocated, locked, unlocked or deallocated. Or, they can be
used with the ordered or critical section pragmas, which automate the locking
and unlocking functions.
Barriers block further execution until all executing threads reach
the barrier.
You declare gates and barriers by using the following type definitions:
-
gate_t namelist declares variables to use in a critical
section, ordered section, or passed as arguments to the synchronization
functions
-
barrier_t namelist declares a list of synchronization variables
for the barrier routine
namelist is a comma-separated list of one
or more gate or barrier names.
Gates and barriers should only appear in definition and declaration statements,
and as formal and actual arguments.
Allocate Functions
These functions allocate memory for a gate or barrier. When memory is first
allocated, gate variables are unlocked.
int alloc_gate(gate_t *gate_p);
int alloc_barrier(barrier_t *barrier_p);
gate_p and barrier_p are pointers of the indicated type,
which have been previously declared as described above.
Deallocate Functions
These functions free the memory assigned to the specified gate or barrier
variable.
These functions have the following declarations:
int free_gate(gate_t *gate_p);
int free_barrier(barrier_t *barrier_p);
where gate_p and barrier_p are pointers of the indicated
type. Always free gates and barriers when you are done using them.
Locking Functions
These functions acquire a gate for exclusive access. If the gate cannot
be immediately acquired, the calling thread waits for it. The conditional
locking functions, which are prefixed with COND_ or cond_,
acquire a gate if doing so does not require a wait. If the gate is acquired,
the functions return 0; if not, they return -1.
The functions have the following declarations:
int lock_gate(gate_t *gate_p);
int cond_lock_gate(gate_t *gate_p);
where gate_p is a pointer of the indicated type.
Unlocking Function
This function releases a gate from exclusive access. Gates are typically
released by the thread that locks them, unless a gate was locked by thread
0 in serial code. In that case it might be unlocked by a single different
thread in a parallel construct.
The function has the following declaration:
int unlock_gate(gate_t *gate_p);
where gate_p is a pointer of the indicated type.
Wait Function
This function uses a barrier to cause the calling thread to wait until
the specified number of threads call the function, at which point all threads
are released from the function simultaneously.
The function has the following declaration:
int wait_barrier(barrier_t *barrier_p, const int *nthr);
where barrier_p is a pointer of the indicated type and nthr
is a pointer referencing the number of threads calling the routine.
You can use a barrier variable in multiple calls to the wait_barrier()
function, as long as you ensure that two barriers are not active at the
same time. Also, check that nthr reflects the correct number of
threads.