HP C/HP-UX Online Help


NOTE: See the Compiling & Running HP C Programs section of the HP C Online Help for a quick reference of all HP C compiler options and pragmas. See the Optimizing HP C Programs section of the HP C Online Help for a detailed description of HP C optimization options and pragmas.

Parallel Options and Pragmas

Getting Started with Parallelizing C Programs
Guidelines for Parallelizing C Programs
Parallel Processing Options
Parallel Processing Pragmas
OpenMP Pragmas
Memory Classes
Synchronization Functions

HP C generates efficient parallel code by default. You can increase the amount of code the compiler can parallelize on multiprocessor systems by using options, pragmas, and supporting library calls. Applications running on HP 9000 K-Class and V-Class servers can benefit from the parallelization features described in this section.

For detailed information and examples, see the Parallel Programming Guide for HP-UX Systems.

Getting Started with Parallelizing C Programs

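This section describes the basic tasks required to help you get started with parallelizing C programs, including:

Transforming Loops for Parallel Execution (+Oparallel)
Setting the Number of Threads Used in Parallel
Determining Idle Thread States
Accessing the Pthreads Library
Profiling Parallelized Programs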

Transforming Loops for Parallel Execution (+Oparallel)

The +Oparallel option causes the compiler to transform eligible loops for parallel execution on multiprocessor machines.

The following command lines compile (without linking) three source files: x.c, y.c, and z.c. The files x.c and y.c are compiled for parallel execution. The file z.c is compiled for serial execution, even though its object file will be linked with x.o and y.o.

cc +O3 +Oparallel -c x.c y.c
cc +O3 -c z.c

The following command line links the three object files, producing the executable file para_prog:

cc +O3 +Oparallel -o para_prog x.o y.o z.o

As this command line implies, if you link and compile separately, you must use cc, not ld. The command line to link must also include the +Oparallel and +O3 options in order to link in the right startup files and runtime support.

Setting the Number of Threads Used in Parallel

Use the MP_NUMBER_OF_THREADS environment variable to set the number of processors that are to execute your program in parallel. If you do not set this variable, it defaults to the number of processors on the executing machine.

From the C shell, the following command sets MP_NUMBER_OF_THREADS to indicate that programs compiled for parallel execution can execute on two processors:

setenv MP_NUMBER_OF_THREADS 2

If you use the Korn shell, the command is:

export MP_NUMBER_OF_THREADS=2

Determining Idle Thread States

Use the MP_IDLE_THREADS_WAIT environment variable to determine how threads wait. Idle threads can be suspended or can spin-wait.

This variable takes an integer value n. For n less than 0, the threads spin-wait. For n equal to or greater than 0, the threads spin-wait for n milliseconds before being suspended.

By default, idle threads spin-wait briefly after creation or a join. They then suspend themselves if they receive no work.

Accessing the Pthreads Library

Pthreads (POSIX threads) refers to the Pthreads library of thread-management routines. For information on Pthread routines, see the pthread(3t) man page.

To use the Pthread routines, your program must include the <pthread.h> header file, and the Pthreads library must be explicitly linked to your program. For example:

% cc -D_POSIX_C_SOURCE=199506L prog.c -lpthread

The -D_POSIX_C_SOURCE=199506L string specifies the appropriate POSIX revision level. In this case, the level is 199506L.
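
As a minimal sketch of the Pthread routines in use, the following code starts a few threads with pthread_create and waits for them with pthread_join. The names worker, NUM_WORKERS, and sum_of_squares_with_threads are hypothetical and chosen only for this illustration; the sketch assumes it is compiled and linked with a command line like the one shown above.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 4   /* hypothetical thread count for this sketch */

/* Thread entry point: replaces the integer it is given with its square. */
static void *worker(void *arg)
{
    int *slot = (int *)arg;
    *slot = (*slot) * (*slot);
    return NULL;
}

/* Starts NUM_WORKERS threads, joins them, and returns the sum of the
   squares of 0..NUM_WORKERS-1, or -1 if a Pthread call fails. */
int sum_of_squares_with_threads(void)
{
    pthread_t tid[NUM_WORKERS];
    int slot[NUM_WORKERS];
    int started = 0, ok = 1, sum = 0, i;

    for (i = 0; i < NUM_WORKERS; i++) {
        slot[i] = i;
        if (pthread_create(&tid[i], NULL, worker, &slot[i]) != 0) {
            ok = 0;
            break;
        }
        started++;
    }
    /* Join every thread that was started before reading its result. */
    for (i = 0; i < started; i++) {
        if (pthread_join(tid[i], NULL) != 0)
            ok = 0;
        sum += slot[i];
    }
    return ok ? sum : -1;
}
```

With four workers, a call to sum_of_squares_with_threads() returns 0 + 1 + 4 + 9 = 14.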

Profiling Parallelized Programs

Profiling a program that has been compiled for parallel execution is performed in much the same way as it is for non-parallel programs:
  1. Compile the program with the option -G.
  2. Run the program to produce profiling data.
  3. Run gprof against the program.
  4. View the output from gprof.
The differences are:

Guidelines for Parallelizing C Programs

This section describes guidelines for parallelizing C programs.

To ensure the best performance from a parallel program, do not run more than one parallel program on a multiprocessor machine at the same time. Running two or more parallel programs simultaneously, or running one parallel program on a heavily loaded system, will slow performance.

You should run a parallel-executing program at a higher priority than any other user program; see rtprio(1) for information about setting real-time priorities.

Conditions Inhibiting Loop Parallelization

The following sections describe different conditions that can inhibit parallelization.

Calling Routines with Side Effects

The compiler will not parallelize any loop containing a call to a routine that has side effects. A routine has side effects if it does any of the following:

Indeterminate Iteration Counts

If the compiler cannot determine what the runtime loop iteration count is before the loop executes, it does not parallelize the loop. The reason for this limitation is that the runtime code must know the iteration count in order to know how many iterations to distribute to the different processors for execution.

The following conditions can prevent a runtime count:

Data Dependence

When a loop is parallelized, the iterations are executed independently on different processors, and the order of execution differs from the serial order that occurs on a single processor. This effect of parallelization is not a problem when the iterations could be executed in any order with no effect on the results. Consider the following loop:
for (i=0; i<5; i++)
    a[i] = a[i] * b[i];
In this example, the array a would always end up with the same data regardless of whether the order of execution were 0-1-2-3-4, 4-3-2-1-0, 3-1-4-0-2, or any other order. The independence of each iteration from the others makes the loop an eligible candidate for parallelization.

Such is not the case in the following:

for (i=1; i<5; i++)
    a[i] = a[i-1] * b[i];
In this loop, the order of execution does matter. The data used in iteration i is dependent upon the data that was produced in the previous iteration [i-1]. The array a would end up with very different data if the order of execution were any other than 1-2-3-4. The data dependence in this loop thus makes it ineligible for parallelization.
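
The difference between the two loops can be checked with a small sketch that runs each loop body in forward and in reverse iteration order and compares the results. The helper names run_loop and order_changes_result, and the sample data, are hypothetical; reversing the order on one processor only stands in for the reordering that parallel execution might cause.

```c
#include <string.h>

#define N 5

/* Runs one of the two loop bodies from the text, visiting the iterations
   in the order given by order[]. */
static void run_loop(int a[N], const int b[N], const int order[N], int dependent)
{
    int k;
    for (k = 0; k < N; k++) {
        int i = order[k];
        if (dependent) {
            if (i >= 1)                 /* the dependent loop starts at i = 1 */
                a[i] = a[i - 1] * b[i];
        } else {
            a[i] = a[i] * b[i];
        }
    }
}

/* Returns 1 if forward and reverse execution produce different arrays. */
int order_changes_result(int dependent)
{
    static const int b[N] = {2, 3, 4, 5, 6};
    static const int forward[N] = {0, 1, 2, 3, 4};
    static const int backward[N] = {4, 3, 2, 1, 0};
    int x[N] = {1, 1, 1, 1, 1};
    int y[N] = {1, 1, 1, 1, 1};

    run_loop(x, b, forward, dependent);
    run_loop(y, b, backward, dependent);
    return memcmp(x, y, sizeof x) != 0;
}
```

Here order_changes_result(0) returns 0 for the independent loop, and order_changes_result(1) returns 1 for the loop with the a[i-1] dependence.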

Not all data dependences must inhibit parallelization. The following paragraphs discuss some of the exceptions.

Nested Loops and Matrices

Some nested loops that operate on matrices may have a data dependence in the inner loop only, allowing the outer loop to be parallelized. Consider the following:
for (i=0; i<10; i++)
    for (j=1; j<100; j++)
        a[i][j] = a[i][j-1] + 1;
The data dependence in this nested loop occurs in the inner [j] loop: each element a[i][j] depends upon the preceding element a[i][j-1] having been assigned in the previous iteration. If the iterations of the [j] loop were to execute in any other order than the one in which they would execute on a single processor, the matrix would be assigned different values. The inner loop, therefore, must not be parallelized.

But no such data dependence appears in the outer loop: each row of the matrix (each value of i) is independent of every other row. Consequently, the compiler can safely distribute entire rows of the matrix to execute on different processors; the data assignments will be the same regardless of the order in which the rows are executed, so long as the elements within each row are assigned in serial order.
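
The following sketch illustrates the point for the nested loop above: the inner [j] loop is kept serial, while the outer [i] loop is run once in ascending and once in descending order. The names fill_row and outer_order_is_irrelevant and the initial values are hypothetical, and "row" is used here in the C sense of a[i].

```c
#include <string.h>

#define ROWS 10
#define COLS 100

/* Inner loop, kept in serial order: each element depends on its
   left-hand neighbor. */
static void fill_row(int row[COLS])
{
    int j;
    for (j = 1; j < COLS; j++)
        row[j] = row[j - 1] + 1;
}

/* Returns 1 if visiting the rows in descending order gives the same
   matrix as visiting them in ascending order. */
int outer_order_is_irrelevant(void)
{
    static int a[ROWS][COLS], r[ROWS][COLS];
    int i;

    for (i = 0; i < ROWS; i++) {
        a[i][0] = i;            /* the same starting value in both copies */
        r[i][0] = i;
    }
    for (i = 0; i < ROWS; i++)
        fill_row(a[i]);
    for (i = ROWS - 1; i >= 0; i--)
        fill_row(r[i]);
    return memcmp(a, r, sizeof a) == 0;
}
```

Because no a[i] reads from any other a[i], outer_order_is_irrelevant() returns 1.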

Assumed Dependencies

When analyzing a loop, the compiler errs on the safe side and assumes that what looks like a data dependence really is one, and so it does not parallelize the loop. Consider the following:
for (i=100; i<200; i++)
    a[i] = a[i-k];
The compiler assumes that a data dependence exists in this loop because it appears that data that has been defined in a previous iteration is being used in a later iteration. However, if the value of k is 100, the dependence is assumed rather than real because a[i-k] is defined outside the loop.
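
As a hedged sketch of how such an assumed dependence might be handled, the following code applies the no_loop_dependence pragma (described under Parallel Processing Pragmas) to the loop above. The function names shift_copy and shift_copy_check are hypothetical, the array size of 200 is an assumption, and compilers other than HP C are expected to ignore the pragma.

```c
#include <assert.h>

#define SIZE 200

/* Copies a[i-k] into a[i] for i = 100..199. With k == 100 every read
   comes from a[0..99], which the loop never writes. */
void shift_copy(int a[SIZE], int k)
{
    int i;

    assert(k == 100);   /* the only k for which the claim holds in bounds */
    /* HP-specific pragma; other compilers are expected to ignore it. */
#pragma _CNX no_loop_dependence(a)
    for (i = 100; i < 200; i++)
        a[i] = a[i - k];
}

/* Returns 1 if a[100..199] ends up as a copy of a[0..99]. */
int shift_copy_check(void)
{
    int a[SIZE];
    int i;

    for (i = 0; i < SIZE; i++)
        a[i] = i;
    shift_copy(a, 100);
    for (i = 100; i < SIZE; i++)
        if (a[i] != i - 100)
            return 0;
    return 1;
}
```

The pragma is a promise made by the programmer: it is only correct here because k is known to be 100.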

Parallel Processing Options

HP C provides the following optimization options for parallelizing C programs:

+O[no]autopar

Optimization level(s): 3, 4

Default: +Oautopar if +Oparallel is enabled

When used with +Oparallel, the +Onoautopar option causes the compiler to parallelize only those loops marked by the loop_parallel or prefer_parallel pragmas. Because the compiler does not automatically find parallel tasks or regions, user-specified task and region parallelization is not affected by this option.

A loop is safe to parallelize if it has an iteration count that can be determined at runtime before loop invocation, and contains no loop-carried dependences, procedure calls, or I/O operations. A loop-carried dependence exists when one iteration of a loop assigns a value to an address that is referenced or assigned on another iteration.

+O[no]dynsel

Optimization level(s): 3, 4

Default: +Odynsel if +Oparallel is enabled

When specified with +Oparallel, +Odynsel (the default) enables workload-based dynamic selection. For parallelizable loops whose iteration counts are known at compile time, +Odynsel causes the compiler to generate either a parallel or a serial version of the loop, depending on which is more profitable.

This optimization also causes the compiler to generate both parallel and serial versions of parallelizable loops whose iteration counts are unknown at compile time. At runtime, the loop workload is compared to parallelization overhead, and the parallel version is run only if it is profitable to do so.

The +Onodynsel option disables dynamic selection and tells the compiler that it is profitable to parallelize all parallelizable loops. The dynsel pragma can be used to enable dynamic selection for specific loops when +Onodynsel is in effect.

See Also: dynsel[(trip_count=n)]

+O[no]loop_block

Optimization level(s): 3, 4

Default: +Onoloop_block

The +O[no]loop_block option enables [disables] blocking of eligible loops for improved cache performance. The +Onoloop_block option disables automatic and directive-specified loop blocking. For more information on loop blocking, see the Parallel Programming Guide for HP-UX Systems.

+O[no]loop_unroll_jam

Optimization level(s): 3, 4

Default: +Onoloop_unroll_jam

The +O[no]loop_unroll_jam option enables [disables] loop unrolling and jamming. The +Onoloop_unroll_jam option disables both automatic and directive-specified unroll and jam. Loop unrolling and jamming increases register exploitation. For more information on the unroll and jam optimization, see the Parallel Programming Guide for HP-UX Systems.

+O[no]parallel

Optimization level(s): 3, 4

Default: +Onoparallel

The +Oparallel option optimizes the time it takes to execute a single process running on a multiprocessor system.


NOTE: If you compile one or more files in an application using +Oparallel, then the application must be linked (using the compiler driver) with the +Oparallel option to link in the proper start-up files and runtime support.

The +Oparallel option causes the compiler to:

The following methods can be used to specify the number of processors used in executing your parallel programs:

The +Oparallel option disables +Ofailsafe.

See Also: Transforming Loops for Parallel Execution (+Oparallel).

+O[no]report[=report_type]

Optimization level(s): 3, 4

Default: +Onoreport

This option causes the compiler to display various optimization reports. +Onoreport is the default. The value of report_type determines which report is displayed, as described below.

+Oreport=loop produces the Loop Report. This report gives information on optimizations performed on loops and calls. Using +Oreport (without =report_type) also produces the Loop Report.

+Oreport=private produces the Loop Report and the Privatization Table, which provides information on loop variables that are privatized by the compiler.

+Oreport=all produces all reports.

The +Oreport[=report_type] option is active only at +O3 and above. The +Onoreport option does not accept any of the report_type values. See the Parallel Programming Guide for HP-UX Systems for more information on the optimization reports.

+O[no]sharedgra

Optimization level(s): 2, 3, 4

Default: +Osharedgra

The +Onosharedgra option disables global register allocation for shared-memory variables that are visible to multiple threads. This option can help if a variable shared among parallel threads is causing wrong answers. See the Parallel Programming Guide for HP-UX Systems for more information.

Parallel Processing Pragmas

This section describes the following Parallel Processing Pragmas:


The syntax of a parallel processing pragma is:

#pragma [_CNX] pragma-list

where:

See Specifying Task Parallelism for an example of using these pragmas.

In the sections that follow, namelist represents a comma-separated list of variables or arrays. The occurrence of a lowercase n or m is used to indicate an integer constant. Occurrences of gate_var are for variables that have been, or are being, defined as gates.

begin_tasks[(attribute_list)]

This pragma defines the beginning of sections of code (see next_task) that are to be executed as independent, parallel tasks. Each task is executed by a separate thread. begin_tasks must have an accompanying end_tasks in the same program unit.

The optional attribute_list can be any of the following legal combinations (m is an integer constant):

Attributes may be listed in any order. The compiler flags any attribute combinations other than those listed above with a warning and ignores the pragma.

Refer to the Parallel Programming Guide for HP-UX Systems for a complete discussion of parallel tasking.

block_loop[(block_factor=n)]

This pragma indicates a specific loop to block and, optionally, the block factor n (n must be an integer constant greater than or equal to 2) that is to be used in the compiler's internal computation of loop-nest-based data reuse. If no block_factor is specified, the compiler uses a heuristic to determine the block_factor. Refer to the Parallel Programming Guide for HP-UX Systems for more information on blocking.

critical_section[(gate_var)]

This pragma defines the beginning of a code block in which only one thread may be executing at a time. The end of the code block must be indicated by an end_critical_section pragma, which must appear in the same flow of control within the same program unit. The optional gate_var can be used to implement a critical section which is not contiguous at the source level. Refer to the Parallel Programming Guide for HP-UX Systems for more information.
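
A minimal sketch of a critical section inside a manually parallelized loop might look as follows. The function names guarded_sum and guarded_sum_demo are hypothetical, and the exact attribute spelling loop_parallel(ivar=i) is an assumption based on the loop_parallel description in this section; compilers other than HP C are expected to ignore the pragmas and run the loop serially.

```c
/* Adds up values[0..n-1]. The update of the shared variable total is
   placed in a critical section so only one thread performs it at a time. */
long guarded_sum(const int *values, int n)
{
    long total = 0;
    int i;

    /* HP-specific pragmas; the attribute list shown is an assumption. */
#pragma _CNX loop_parallel(ivar=i)
    for (i = 0; i < n; i++) {
#pragma _CNX critical_section
        total += values[i];
#pragma _CNX end_critical_section
    }
    return total;
}

/* Small driver for the sketch: sums 1..5. */
long guarded_sum_demo(void)
{
    int v[5] = {1, 2, 3, 4, 5};
    return guarded_sum(v, 5);
}
```

guarded_sum_demo() returns 15 whether the loop runs serially or in parallel, because every update of total is serialized.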

dynsel[(trip_count=n)]

This pragma enables workload-based dynamic selection for the immediately following loop. trip_count represents either the thread_trip_count or node_trip_count attribute, and n is an integer constant.

When thread_trip_count=n is specified, the serial version of the loop is run if the iteration count is less than n; otherwise, the thread-parallel version is run. When node_trip_count=n is specified, the serial version of the loop is run if the iteration count is less than n; otherwise, the node-parallel version is run, assuming +Onodepar is specified.

end_critical_section

This pragma defines the end of the critical section that was begun with the critical_section pragma. critical_section and end_critical_section must appear as a pair. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

end_ordered_section

This pragma defines the end of the ordered section that was begun with the ordered_section pragma. ordered_section and end_ordered_section must appear as a pair. Refer to the Parallel Programming Guide for HP-UX Systems for more information on ordered sections.

end_parallel

This pragma signifies the end of a parallel region. The parallel pragma signifies the beginning of a parallel region. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

end_tasks

This pragma terminates the specification of parallel tasks indicated by begin_tasks and next_task. It must appear at the end of the last section of parallel code defined by these pragmas. All of these must appear in the same program unit. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

loop_parallel[(attribute_list)]

This pragma is an explicit instruction to the compiler to parallelize the immediately following loop. The loop iterations are run in an indeterminate order unless the optional ordered attribute appears. You are responsible for any required data privatization and loop synchronization, as described in the Parallel Programming Guide for HP-UX Systems. The optional attribute_list can be any of the following combinations (n and m are integer constants):

The ivar=indvar attribute is:

Attributes may be listed in any order. The compiler flags any attribute combinations other than those listed above with a warning and ignores the pragma.

Refer to the Parallel Programming Guide for HP-UX Systems for more information.

loop_private(namelist)

This pragma declares a list of variables and/or arrays private to the immediately following loop. No values may be carried into the loop by loop_private variables. To be loop private, the variables and/or arrays must be assigned before they are used on each iteration of the immediately following loop. These private data items should be treated as distinct objects from the shared items of the same name that exist outside the loop. Values assigned to loop_private variables on the final iteration (that is, the nth iteration of a loop with n iterations) may be saved into the shared variables of the same name if the save_last pragma also appears on this loop. If save_last is not used, then the value of any shared variable declared to be loop_private is undefined at loop termination. Refer to the Parallel Programming Guide for HP-UX Systems for more information.
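
The following sketch shows one way loop_private and save_last might be combined on a loop with a scalar temporary. The function names scale_shift and scale_shift_demo are hypothetical, and placing the two pragmas on consecutive lines before the loop is an assumption; compilers other than HP C are expected to ignore the pragmas.

```c
/* Writes 2*in[i] + 1 into out[i] and returns the value the temporary t
   holds after the final iteration. */
double scale_shift(const double *in, double *out, int n)
{
    double t = 0.0;
    int i;

    /* HP-specific pragmas: t is private to each iteration, and its value
       from the last iteration is saved into the shared t. */
#pragma _CNX loop_private(t)
#pragma _CNX save_last(t)
    for (i = 0; i < n; i++) {
        t = in[i] * 2.0;        /* t is assigned before it is used */
        out[i] = t + 1.0;
    }
    return t;                   /* defined only because of save_last */
}

/* Small driver for the sketch; returns 1 if the results are as expected. */
int scale_shift_demo(void)
{
    double in[3] = {1.0, 2.0, 3.0};
    double out[3];
    double last = scale_shift(in, out, 3);

    return last == 6.0 && out[0] == 3.0 && out[2] == 7.0;
}
```

Without save_last, the return value of scale_shift would be undefined when the loop is parallelized, as described above.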

next_task

This pragma starts a block of code following a begin_tasks block that will be executed as a parallel task. The end of the code block is marked by another next_task or by an end_tasks pragma.

This pragma must appear within a begin_tasks and end_tasks pair. There is no limit on the number of next_task pragmas that can appear. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

no_block_loop

This pragma disables loop blocking on the immediately following loop. Refer to the Parallel Programming Guide for HP-UX Systems for more information on loop blocking.

no_distribute

This pragma disables loop distribution for the immediately following loop. Refer to the Parallel Programming Guide for HP-UX Systems for more information on loop distribution.

no_dynsel

This pragma disables workload-based dynamic selection for the immediately following loop. Refer to the Parallel Programming Guide for HP-UX Systems for more information on dynamic selection.

no_loop_dependence(namelist)

This pragma informs the compiler that the arrays in namelist do not have any dependencies for iterations of the immediately following loop. Use no_loop_dependence for arrays only; use loop_private to indicate dependence-free scalar variables.

This pragma causes the compiler to ignore any dependences that it perceives to exist. This can enhance the compiler's ability to optimize the loop, including the possibility of parallelization.

Refer to the Parallel Programming Guide for HP-UX Systems for more information.

no_loop_transform

This pragma prevents the compiler from performing reordering transformations on the following loop. The compiler does not distribute, fuse, block, interchange, unroll, unroll and jam, or parallelize a loop on which this pragma appears. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

no_parallel

This pragma prevents the compiler from generating parallel code for the immediately following loop. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

no_side_effects(funclist)

This pragma (#pragma _CNX no_side_effects) informs the compiler that the functions appearing in funclist have no side effects wherever they appear lexically following the pragma. Side effects include modifying a function argument, performing I/O, or calling another routine that does any of the above. The compiler can sometimes eliminate calls to procedures that have no side effects; also, the compiler may be able to parallelize loops with calls when informed that the called routines do not have side effects.
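
As a rough illustration, the pragma might be applied to a small helper that is called from a loop. The names half, halve_all, and halve_all_demo are hypothetical; compilers other than HP C are expected to ignore the pragma.

```c
/* A helper with no side effects: it reads its argument and returns a
   value, and does nothing else. */
static double half(double x)
{
    return x * 0.5;
}

/* HP-specific pragma; it applies to calls that lexically follow it. */
#pragma _CNX no_side_effects(half)

/* The call to half() no longer has to prevent this loop from being
   considered for parallelization. */
void halve_all(double *v, int n)
{
    int i;
    for (i = 0; i < n; i++)
        v[i] = half(v[i]);
}

/* Small driver for the sketch; returns 1 if the results are as expected. */
int halve_all_demo(void)
{
    double v[3] = {2.0, 4.0, 8.0};

    halve_all(v, 3);
    return v[0] == 1.0 && v[1] == 2.0 && v[2] == 4.0;
}
```

The pragma is a promise by the programmer; naming a function that does perform I/O or modify its arguments would make the promise false.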

ordered_section(gate_var)

This pragma defines the beginning of an ordered section. An ordered section is the same as a critical section (a code block in which only one thread may be executing at a time) with the additional restriction that the threads must pass through the ordered section in iteration order of the most recently initiated parallelized loop. The end of the code block must be indicated by an end_ordered_section pragma. Ordered sections must appear within the control flow of a loop_parallel(ordered) pragma. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

parallel[(attribute_list)]

This pragma signifies the beginning of a parallel region of code. All code up to the following end_parallel pragma will be run on all available threads. No loop transformations, data privatization, or parallelization analysis will be performed by the compiler on the region.

The optional attribute_list can be any of the following legal combinations (m is an integer constant):

Attributes may be listed in any order. The compiler flags any attribute combinations other than those listed above with a warning and ignores the pragma.

Refer to the Parallel Programming Guide for HP-UX Systems for more information.

parallel_private(namelist)

This pragma declares a list of variables or arrays private to the immediately following parallel region. It serves the same purpose for parallel regions that task_private serves for tasks. The privatized variables and arrays will not carry their values beyond the end_parallel pragma. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

prefer_parallel[(attribute_list)]

This pragma instructs the compiler to parallelize the following loop, but only if it is safe to do so. A loop is safe to parallelize if it has an iteration count that can be determined at runtime before loop invocation and contains no loop-carried dependences, procedure calls, or I/O operations. (A loop-carried dependence exists when one iteration of a loop assigns a value to an address that is referenced or assigned on another iteration.) Refer to the Parallel Programming Guide for HP-UX Systems for more information.

The optional attribute_list can be any of the following combinations (n and m are integer constants):

Attributes may be listed in any order. The compiler flags any attribute combinations other than those listed above with a warning and ignores the pragma.

save_last[(list)]

This pragma specifies that the variables in the comma-separated list that are also named in an associated loop_private(namelist) pragma must have their last values saved into the "shared" variable of the same name at loop termination. (A variable's last value in a loop of n iterations is the value it is assigned in the nth iteration.)

If the optional list is not used, save_last specifies that all variables named in an associated loop_private(namelist) pragma must have their last values saved into the "shared" variable of the same name at loop termination.

If save_last is not specified, then the values in any privatized variables or arrays are indeterminate at loop termination. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

scalar

This pragma prevents the compiler from performing reordering transformations on the following loop. The compiler does not distribute, fuse, block, interchange, unroll, unroll and jam, or parallelize a loop on which this pragma appears.

The no_loop_transform pragma provides the same functionality as the scalar pragma and is recommended in place of the scalar pragma.

task_private(namelist)

This pragma privatizes the variables and arrays specified in namelist for each task specified in the immediately following begin_tasks/end_tasks block. If a task_private data object is referenced within a task, it must have been assigned a value previously in that task. The privatized variables and arrays do not carry their values beyond the end_tasks pragma. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

Specifying Task Parallelism

The following example uses the begin_tasks, task_private, next_task, and end_tasks pragmas to specify simple task-parallelism:
/* one thread executes the for loop */
#pragma begin_tasks, task_private(i)
for(i=0;i<n-1;i++)
    a[i] = a[i+1] + b[i];
/* another thread executes the function call */
#pragma next_task
tsub(x,y);
/* a third thread assigns elements of array d to every
   other element of c */
#pragma next_task
for(i=0;i<500;i++)
    c[i*2]=d[i];
#pragma end_tasks
The loop induction variable i is manually privatized because it is used to control loops in two different tasks. If i were not private, both tasks would modify it, causing wrong answers. The task_private pragma is described in task_private(namelist).

OpenMP Pragmas

OpenMP is an industry-standard parallel programming model which implements a fork-join model of parallel execution. The HP C OpenMP pragmas included in this release are based on the OpenMP Standard for C, version 1.0.

To view details about the standard and details about usage, syntax, and values, please go to the OpenMP website. You can download either a PostScript (ps) or Adobe Acrobat (PDF) version of the C/C++ Version 1.0 OpenMP standard from this website.

OpenMP pragmas have the following options:


+Oopenmp command line option

The OpenMP driver options +Oopenmp and +Onoopenmp are added to this release of the HP C compiler.


NOTE: The +Oopenmp option is accepted at all optimization levels. However, most of the OpenMP pragmas need a minimum optimization level of +O3. To ensure that OpenMP pragmas are recognized, you must specify +O3 on the command line.

When +Oopenmp is seen on the command line, +Onodynsel, +Oparallel, +Onofailsafe, and +Onoautopar are passed by default to the cc driver.


NOTE: +Oopenmp overrides +Odynsel, +Ofailsafe, and +Onoparallel.

When +Oopenmp is used, most of the HP Programming Model (HPPM) pragmas are not accepted. The following HPPM pragmas are accepted by the HP C compiler when +Oopenmp is issued.

Using the +Onoopenmp option causes all OpenMP directives to be silently ignored.

New Header File

Every C program that contains OpenMP pragmas and is to be compiled for the current version of HP-UX must include the header file <omp.h>. If it does not, the OpenMP pragmas will be ignored. The default path for <omp.h> is /usr/include.

OpenMP macro _OPENMP

The _OPENMP macro name is defined by OpenMP-compliant implementations as the decimal constant yyyymm, which will be the year and month of the approved specification. This macro must not be the subject of a #define or #undef preprocessing directive.
#ifdef _OPENMP
iam = omp_get_thread_num() + index;
#endif
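
A slightly fuller sketch of the same idea guards both the header and the library call with _OPENMP, so that the source also compiles when OpenMP is not enabled. The function name current_thread_index is hypothetical; omp_get_thread_num() is the OpenMP library routine used in the fragment above.

```c
#ifdef _OPENMP
#include <omp.h>        /* only available when OpenMP is enabled */
#endif

/* Returns the calling thread's OpenMP thread number, or 0 when the file
   is compiled without OpenMP support. */
int current_thread_index(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}
```

Called outside any parallel region, current_thread_index() returns 0 in both configurations.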

OpenMP Pragmas

The following work-sharing and synchronization pragmas, along with the listed clauses, are available with the HP C compiler.

Work-sharing pragmas:

Synchronization pragmas:

A directive to control the data environment during execution of parallel regions is:

Each of the pragmas available in this release of the HP C compiler is discussed in brief below.

OMP PARALLEL

The OMP PARALLEL pragma defines a parallel region, which is a region of the program that is executed by multiple threads in parallel. This is the fundamental construct that starts parallel execution.
#pragma OMP PARALLEL [clause1, clause2, ...] new-line
       structured-block
where [clause1, clause2, ...] indicates that the clauses are optional. There can be zero (0) or more clauses, where clause may be one of the following:

OMP FOR

The OMP FOR pragma for HP C identifies a construct that specifies a region in which the iterations of the associated loop should be executed in parallel. The iterations of the loop are distributed across threads that already exist.
#pragma OMP FOR [clause1, clause2, ...] new-line
where [clause1, clause2, ...] indicates that the clauses are optional. There can be zero (0) or more clauses, where clause may be one of the following:
NOTE: chunksize should be an integer constant. Expressions in place of chunksize are not supported. The schedule type can be static, dynamic, guided, or runtime.

OMP SECTION/OMP SECTIONS

The OMP SECTION/SECTIONS pragmas identify a construct that specifies a set of constructs to be divided among threads in a team. Each section is executed by one of the threads in the team.
#pragma OMP SECTIONS [clause1, clause2, ...] new-line
{
#pragma OMP SECTION new-line
           structured-block
#pragma OMP SECTION new-line
           structured-block
...
}
where [clause1, clause2, ...] indicates that the clauses are optional. There can be zero (0) or more clauses, where clause may be one of the following:

OMP PARALLEL FOR

The OMP PARALLEL FOR pragma for HP C is a shortcut for an OMP PARALLEL region that contains a single OMP FOR pragma.
#pragma OMP PARALLEL FOR [clause1, clause2, ...] new-line
        for-loop
OMP PARALLEL FOR admits all the allowable clauses of the OMP PARALLEL pragma and the OMP FOR pragma.
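
A minimal sketch of a combined parallel loop follows. It uses the lowercase omp spelling defined by the OpenMP C/C++ specification; the function names add_vectors and add_vectors_demo are hypothetical. When OpenMP is not enabled, the pragma is expected to be ignored and the loop runs serially with the same result.

```c
/* Element-wise sum: every iteration writes a different z[i], so the
   iterations are independent and can be shared among threads. */
void add_vectors(const int *x, const int *y, int *z, int n)
{
    int i;

#pragma omp parallel for
    for (i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}

/* Small driver for the sketch: returns the sum of the result elements. */
int add_vectors_demo(void)
{
    int x[4] = {1, 2, 3, 4};
    int y[4] = {10, 20, 30, 40};
    int z[4];

    add_vectors(x, y, z, 4);
    return z[0] + z[1] + z[2] + z[3];   /* 11 + 22 + 33 + 44 */
}
```

add_vectors_demo() returns 110 regardless of how many threads execute the loop.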

OMP PARALLEL SECTIONS

The OMP PARALLEL SECTIONS pragma for HP C is a shortcut for specifying a parallel region containing a single OMP SECTIONS pragma.
#pragma OMP PARALLEL SECTIONS [clause1, clause2, ...] new-line
{
[#pragma OMP SECTION new-line]
         structured-block
[#pragma OMP SECTION new-line
         structured-block
...]
}
OMP PARALLEL SECTIONS admits all the allowable clauses of the OMP PARALLEL pragma and the OMP SECTIONS pragma. The PRIVATE clause is supported.
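
The following sketch shows two independent pieces of work expressed as sections. It uses the lowercase omp spelling from the OpenMP C/C++ specification; the function names fill_even_odd and fill_even_odd_demo are hypothetical. Without OpenMP enabled, the pragmas are expected to be ignored and the two blocks simply run one after the other.

```c
/* Fills evens[] and odds[] in two sections that touch different arrays,
   so either section can run on any thread of the team. */
void fill_even_odd(int *evens, int *odds, int n)
{
#pragma omp parallel sections
    {
#pragma omp section
        {
            int i;
            for (i = 0; i < n; i++)
                evens[i] = 2 * i;
        }
#pragma omp section
        {
            int j;
            for (j = 0; j < n; j++)
                odds[j] = 2 * j + 1;
        }
    }
}

/* Small driver for the sketch; returns 1 if the arrays are as expected. */
int fill_even_odd_demo(void)
{
    int evens[3], odds[3];

    fill_even_odd(evens, odds, 3);
    return evens[0] == 0 && evens[2] == 4 && odds[0] == 1 && odds[2] == 5;
}
```

The loop counters are declared inside each section, so each section has its own copy and no PRIVATE clause is needed in this sketch.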

OMP SINGLE

The OMP SINGLE directive identifies a construct that specifies the associated structured block is executed by only one thread in the team (not necessarily the master thread).
#pragma OMP SINGLE [clause [clause] ...] new-line
                          structured-block
where [clause] is one of the following:
NOTE: SINGLE does not take any associated clauses in this release of the HP C compiler.

OMP CRITICAL

The OMP CRITICAL pragma identifies a construct that restricts the execution of the associated structured block to one thread at a time.
#pragma OMP CRITICAL [(name)] new-line
       structured-block
The critical section name parameter is optional. All unnamed critical sections globally map to a single name; this is provided by the HP C compiler.
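
As a sketch, an unnamed critical section can protect a shared counter that several threads update. The example uses the lowercase omp spelling from the OpenMP C/C++ specification; the function names count_positive and count_positive_demo are hypothetical. Without OpenMP enabled, the pragmas are expected to be ignored.

```c
/* Counts the positive entries of v[0..n-1]. The increment of the shared
   counter is the only statement that needs mutual exclusion. */
int count_positive(const int *v, int n)
{
    int count = 0;
    int i;

#pragma omp parallel for
    for (i = 0; i < n; i++) {
        if (v[i] > 0) {
#pragma omp critical
            {
                count++;
            }
        }
    }
    return count;
}

/* Small driver for the sketch. */
int count_positive_demo(void)
{
    int v[6] = {3, -1, 0, 7, -5, 2};
    return count_positive(v, 6);
}
```

count_positive_demo() returns 3; without the critical section, concurrent increments could be lost.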

OMP BARRIER

The OMP BARRIER pragma synchronizes all the threads in a team. When encountered, each thread waits until all the threads in the team have reached that point.
#pragma OMP BARRIER new-line

OMP ORDERED

The OMP ORDERED pragma indicates that the following structured block should be executed in the same order in which iterations would be executed in a sequential loop.
#pragma OMP ORDERED new-line
           structured-block
The OMP ORDERED pragma must be used within an OMP FOR or OMP PARALLEL FOR loop. When the ORDERED clause is used with a SCHEDULE clause that has a chunksize, the chunksize is ignored by the compiler.
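
The sketch below records loop indices from inside an ordered block. It uses the lowercase omp spelling from the OpenMP C/C++ specification, with the ordered clause on the loop pragma and the ordered pragma on the block; the function names record_in_order and record_in_order_demo are hypothetical. Without OpenMP enabled, the pragmas are expected to be ignored.

```c
/* Appends each loop index to log[] from inside an ordered block and
   returns the number of entries written. */
int record_in_order(int *log, int n)
{
    int next = 0;       /* shared write position, touched only in the block */
    int i;

#pragma omp parallel for ordered
    for (i = 0; i < n; i++) {
#pragma omp ordered
        {
            log[next] = i;
            next++;
        }
    }
    return next;
}

/* Small driver for the sketch; returns 1 if log[] holds 0, 1, ..., 7. */
int record_in_order_demo(void)
{
    int log[8];
    int i;

    if (record_in_order(log, 8) != 8)
        return 0;
    for (i = 0; i < 8; i++)
        if (log[i] != i)
            return 0;
    return 1;
}
```

Because ordered blocks execute one at a time and in sequential iteration order, the shared position next needs no further protection here.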

OMP ATOMIC

The OMP ATOMIC directive ensures that a specific memory location is updated atomically, rather than exposing it to the possibility of multiple simultaneous writing threads.
#pragma OMP ATOMIC new-line
           expression stmt
where expression stmt must have one of the following forms:

where, in the above expressions:

OMP FLUSH

The OMP FLUSH directive, whether explicit or implied, specifies a cross-thread sequence point at which the implementation is required to ensure that all threads in a team have a consistent view of certain objects in the memory.
#pragma OMP FLUSH [(list)] new-line
A FLUSH directive without a (list) is implied for the following directives:

The directive is not implied if a NOWAIT clause is present.

OMP MASTER

The OMP MASTER pragma for HP C directs that the structured block following it should be executed by the master thread (thread 0) of the team.
#pragma OMP MASTER new-line
       structured-block

Other threads in the team do not execute the associated block.
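A minimal sketch of the master construct follows; the function name run_with_master is hypothetical and the lowercase omp spelling is assumed. Because only thread 0 executes the block, the counter ends at 1 regardless of the team size.

```c
/* Returns how many times the MASTER block ran in one parallel region. */
int run_with_master(void)
{
    int master_runs = 0;

    #pragma omp parallel
    {
        /* Executed by thread 0 only; the other threads skip the block. */
        #pragma omp master
        {
            master_runs++;
        }
    }
    return master_runs;
}
```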

OMP NOWAIT

The NOWAIT clause, when used, removes the implicit barrier synchronization at the end of a FOR or SECTIONS construct.
#pragma omp for nowait
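The sketch below shows a case where dropping the barrier is safe: the two loops write to different arrays, so a thread that finishes its share of the first loop can start the second one immediately. The function name fill_independent is hypothetical.

```c
/* Two independent loops; NOWAIT lets threads start the second loop early. */
void fill_independent(int *a, int *b, int n)
{
    #pragma omp parallel
    {
        int i;

        /* No barrier after this loop: b[] does not depend on a[]. */
        #pragma omp for nowait
        for (i = 0; i < n; i++)
            a[i] = 2 * i;

        #pragma omp for
        for (i = 0; i < n; i++)
            b[i] = i + 100;
    }
}
```

If the second loop read a[], the nowait clause would have to be removed or an explicit barrier added.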

OMP THREADPRIVATE

The OMP THREADPRIVATE directive is provided to make file-scope variables local to a thread. The THREADPRIVATE directive makes the named file-scope or namespace-scope variables specified in the list private to a thread but file-scope visible within the thread.
#pragma OMP THREADPRIVATE (list) new-line

NOTE: THREADPRIVATE variables are not supported in this release of the HP C compiler.

Caveats

Observe these known restrictions while you use OpenMP pragmas:

Environment Variables in OpenMP

The OpenMP environment variables available in the HP C compiler control the execution of parallel code. The environment variable names are case sensitive and must be in uppercase. The following environment variables are available in the HP C compiler:

OMP_SCHEDULE

This environment variable applies to for and parallel for directives that have the schedule type runtime. The schedule type and chunk size for all such loops can be set at run time by setting this environment variable to any of the recognized schedule types and to an optional chunk_size.
setenv OMP_SCHEDULE "dynamic"
The default value of the environment variable is implementation dependent. If the optional chunk_size is set, the value must be positive. If chunk_size is not set, a value of 1 is assumed, except for a static schedule. For a static schedule, the default chunk_size is set to the loop iteration space divided by the number of threads applied to the loop.


NOTE: OMP_SCHEDULE is ignored for for and parallel for directives that have a schedule type other than runtime.


OMP_NUM_THREADS

The value of OMP_NUM_THREADS must be positive. Its effect depends on whether dynamic adjustment of the number of threads is enabled. If dynamic adjustment is disabled, the value of this environment variable is the number of threads to use for each parallel region until that number is explicitly changed during execution. If dynamic adjustment of the number of threads is enabled, the value of the environment variable is interpreted as the maximum number of threads to use.
setenv OMP_NUM_THREADS 16

OMP_DYNAMIC

The OMP_DYNAMIC environment variable enables or disables dynamic adjustment of the number of threads available for execution of parallel regions. Its value must be TRUE or FALSE. If the value is set to TRUE, the number of threads that are used for executing parallel regions may be adjusted by the runtime environment to best utilize system resources. If the value is set to FALSE, dynamic adjustment is disabled.
setenv OMP_DYNAMIC TRUE

OMP_NESTED

The OMP_NESTED environment variable enables or disables nested parallelism. Its value must be TRUE or FALSE. If the value is set to TRUE, nested parallelism is enabled; if the value is set to FALSE, nested parallelism is disabled. The default value is FALSE.
setenv OMP_NESTED FALSE

Runtime Library Functions

This section describes the OpenMP C run-time library functions. The header <omp.h> declares two types, several functions that can be used to control and query the parallel execution environment, and lock functions that can be used to synchronize access to data.

The type omp_lock_t is an object type capable of representing that a lock is available, or that a thread owns a lock. These locks are referred to as simple locks.

The type omp_nest_lock_t is an object type capable of representing either that a lock is available, or both the identity of the thread that owns the lock and a nesting count. These locks are referred to as nestable locks.

The library functions are external functions.

The descriptions of library functions are divided into the following topics:

Execution environment functions

The functions described in this section affect and monitor threads, processors, and the parallel environment:
omp_set_num_threads
The omp_set_num_threads function sets the number of threads to use for subsequent parallel regions. The format is as follows:
#include <omp.h>
void omp_set_num_threads(int num_threads);
The value of the parameter num_threads must be positive. Its effect depends upon whether dynamic adjustment of the number of threads is enabled. If dynamic adjustment is disabled, the value is used as the number of threads for all subsequent parallel regions prior to the next call to this function; otherwise, the value is the maximum number of threads that will be used. This function has effect only when called from serial portions of the program. If it is called from a portion of the program where the omp_in_parallel function returns non-zero, the behavior of this function is undefined. For more information on this subject, see the omp_set_dynamic and omp_get_dynamic functions. This call has precedence over the OMP_NUM_THREADS environment variable.

omp_get_num_threads
The omp_get_num_threads function returns the number of threads currently in the team executing the parallel region from which it is called. The format is as follows:
#include <omp.h>
int omp_get_num_threads(void);
The omp_set_num_threads function and the OMP_NUM_THREADS environment variable control the number of threads in a team. If the number of threads has not been explicitly set by the user, the default is implementation dependent. This function binds to the closest enclosing omp parallel directive. If called from a serial portion of a program, or from a nested parallel region that is serialized, this function returns 1.

omp_get_max_threads
The omp_get_max_threads function returns the maximum value that can be returned by calls to omp_get_num_threads. The format is as follows:
#include <omp.h>
int omp_get_max_threads(void);
If omp_set_num_threads is used to change the number of threads, subsequent calls to this function will return the new value. A typical use of this function is to determine the size of an array for which all thread numbers are valid indices, even when omp_set_dynamic is set to non-zero.

This function returns the maximum value whether executing within a serial region or a parallel region.
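The array-sizing use described above can be sketched as follows. The function name sum_per_thread is hypothetical. The #else branch supplies serial stand-ins, an assumption of this sketch and not part of <omp.h>, so that the example also builds when OpenMP support is switched off.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (an assumption of this sketch) for non-OpenMP builds. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Sum 1..n using one partial-sum slot per possible thread number.
 * Returns -1 if the slot array cannot be allocated. */
long sum_per_thread(int n)
{
    int slots = omp_get_max_threads();
    long *partial = calloc((size_t)slots, sizeof *partial);
    long total = 0;
    int i, t;

    if (partial == NULL)
        return -1;

    #pragma omp parallel for
    for (i = 1; i <= n; i++)
        partial[omp_get_thread_num()] += i;  /* index is always < slots */

    for (t = 0; t < slots; t++)
        total += partial[t];

    free(partial);
    return total;
}
```

Sizing the array with omp_get_max_threads keeps every thread number a valid index even if the run-time environment uses fewer threads than the maximum.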

omp_get_thread_num
The omp_get_thread_num function returns the thread number, within its team, of the thread executing the function. The thread number lies between 0 and omp_get_num_threads()-1, inclusive. The master thread of the team is thread 0. The format is as follows:
#include <omp.h>
int omp_get_thread_num(void);
If called from a serial region, omp_get_thread_num returns 0. If called from within a nested parallel region that is serialized, this function returns 0.
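A short sketch can check the range stated above from inside a parallel region. The function name thread_numbers_in_range is hypothetical, and the serial stand-ins in the #else branch are an assumption added so the example builds without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (an assumption of this sketch) for non-OpenMP builds. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Returns 1 if every thread saw a thread number in 0 .. team size - 1. */
int thread_numbers_in_range(void)
{
    int ok = 1;

    #pragma omp parallel
    {
        int id   = omp_get_thread_num();
        int team = omp_get_num_threads();

        if (id < 0 || id >= team) {
            #pragma omp critical
            ok = 0;
        }
    }
    return ok;
}
```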

omp_get_num_procs
The omp_get_num_procs function returns the maximum number of processors that could be assigned to the program. The format is as follows:
#include <omp.h>
int omp_get_num_procs(void);
omp_in_parallel
The omp_in_parallel function returns non-zero if it is called within the dynamic extent of a parallel region executing in parallel; otherwise, it returns 0. The format is as follows:
#include <omp.h>
int omp_in_parallel(void);
This function returns non-zero from within a region executing in parallel, regardless of nested regions that are serialized.

omp_set_dynamic
The omp_set_dynamic function enables or disables dynamic adjustment of the number of threads available for execution of parallel regions. The format is as follows:
#include <omp.h>
void omp_set_dynamic(int dynamic_threads);
This function has effect only when called from serial portions of the program. If it is called from a portion of the program where the omp_in_parallel function returns non-zero, the behavior of the function is undefined. If dynamic_threads evaluates to non-zero, the number of threads that are used for executing subsequent parallel regions may be adjusted automatically by the run-time environment to best utilize system resources. As a consequence, the number of threads specified by the user is the maximum thread count. The number of threads always remains fixed over the duration of each parallel region and is reported by the omp_get_num_threads function.

If dynamic_threads evaluates to 0, dynamic adjustment is disabled. A call to omp_set_dynamic has precedence over the OMP_DYNAMIC environment variable.

The default for the dynamic adjustment of threads is implementation dependent. As a result, user codes that depend on a specific number of threads for correct execution should explicitly disable dynamic threads. Implementations are not required to provide the ability to dynamically adjust the number of threads, but they are required to provide the interface in order to support portability across all platforms.
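The advice to disable dynamic threads before requesting a specific team size might be sketched like this. The function name team_size_with_fixed_threads is hypothetical, and the #else branch holds serial stand-ins that are an assumption of this sketch.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (an assumption of this sketch) for non-OpenMP builds. */
static void omp_set_dynamic(int d)     { (void)d; }
static void omp_set_num_threads(int n) { (void)n; }
static int  omp_get_num_threads(void)  { return 1; }
#endif

/* Disable dynamic adjustment, request a team size from serial code,
 * and return the team size actually observed. */
int team_size_with_fixed_threads(int wanted)
{
    int team = 0;

    omp_set_dynamic(0);            /* must be called from serial code */
    omp_set_num_threads(wanted);

    #pragma omp parallel
    {
        #pragma omp master
        team = omp_get_num_threads();
    }
    return team;
}
```

The returned value is worth checking, since a system short on resources, or a build without OpenMP support, may still provide fewer threads than requested.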

omp_get_dynamic
The omp_get_dynamic function returns non-zero if dynamic thread adjustment is enabled and returns 0 otherwise. For a description of dynamic thread adjustment, see omp_set_dynamic. The format is as follows:
#include <omp.h>
int omp_get_dynamic(void);
If the implementation does not implement dynamic adjustment of the number of threads, this function always returns 0.

omp_set_nested
The omp_set_nested function enables or disables nested parallelism. The format is as follows:
#include <omp.h>
void omp_set_nested(int nested);
If nested evaluates to 0, which is the default, nested parallelism is disabled, and nested parallel regions are serialized and executed by the current thread. If nested evaluates to non-zero, nested parallelism is enabled, and parallel regions that are nested may deploy additional threads to form the team.

This call has precedence over the OMP_NESTED environment variable. When nested parallelism is enabled, the number of threads used to execute nested parallel regions is implementation dependent. As a result, OpenMP-compliant implementations are allowed to serialize nested parallel regions even when nested parallelism is enabled.

omp_get_nested
The omp_get_nested function returns non-zero if nested parallelism is enabled and 0 if it is disabled. The format is as follows:
#include <omp.h>
int omp_get_nested(void);
If an implementation does not implement nested parallelism, this function always returns 0.
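The effect of disabling nested parallelism can be sketched as follows: with nesting off, an inner parallel region is serialized, so omp_get_num_threads reports a team of one inside it. The function name inner_team_size_without_nesting is hypothetical, and the serial stand-ins in the #else branch are an assumption of this sketch.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (an assumption of this sketch) for non-OpenMP builds. */
static void omp_set_nested(int n)     { (void)n; }
static int  omp_get_num_threads(void) { return 1; }
#endif

/* With nesting disabled, the inner region runs on the current thread only. */
int inner_team_size_without_nesting(void)
{
    int inner = 0;

    omp_set_nested(0);   /* called from serial code, before any region */

    #pragma omp parallel
    {
        #pragma omp parallel
        {
            /* Every thread stores the same value; critical avoids a race. */
            #pragma omp critical
            inner = omp_get_num_threads();
        }
    }
    return inner;
}
```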

Lock functions

The functions described in this section manipulate locks used for synchronization.

For the following functions, the lock variable must have type omp_lock_t. This variable must only be accessed through these functions. All lock functions require an argument that has a pointer to omp_lock_t type.

For the following functions, the lock variable must have type omp_nest_lock_t. This variable must only be accessed through these functions. All nestable lock functions require an argument that has a pointer to omp_nest_lock_t type.
omp_init_lock and omp_init_nest_lock Functions
These functions provide the only means of initializing a lock. Each function initializes the lock associated with the parameter lock for use in subsequent calls. The format is as follows:
#include <omp.h>
void omp_init_lock(omp_lock_t *lock);
void omp_init_nest_lock(omp_nest_lock_t *lock);
The initial state is unlocked (that is, no thread owns the lock). For a nestable lock, the initial nesting count is zero.

omp_destroy_lock and omp_destroy_nest_lock Functions
These functions ensure that the pointed-to lock variable lock is uninitialized. The format is as follows:
#include <omp.h>
void omp_destroy_lock(omp_lock_t *lock);
void omp_destroy_nest_lock(omp_nest_lock_t *lock);
The argument to these functions must point to an initialized lock variable that is unlocked.

omp_set_lock and omp_set_nest_lock Functions
Each of these functions blocks the thread executing the function until the specified lock is available and then sets the lock. A simple lock is available if it is unlocked. A nestable lock is available if it is unlocked or if it is already owned by the thread executing the function. The format is as follows:
#include <omp.h>
void omp_set_lock(omp_lock_t *lock);
void omp_set_nest_lock(omp_nest_lock_t *lock);
For a simple lock, the argument to the omp_set_lock function must point to an initialized lock variable. Ownership of the lock is granted to the thread executing the function.

For a nestable lock, the argument to the omp_set_nest_lock function must point to an initialized lock variable. The nesting count is incremented, and the thread is granted, or retains, ownership of the lock.

omp_unset_lock and omp_unset_nest_lock Functions
These functions provide the means of releasing ownership of a lock. The format is as follows:
#include <omp.h>
void omp_unset_lock(omp_lock_t *lock);
void omp_unset_nest_lock(omp_nest_lock_t *lock);
The argument to each of these functions must point to an initialized lock variable owned by the thread executing the function. The behavior is undefined if the thread does not own that lock.

For a simple lock, the omp_unset_lock function releases the thread executing the function from ownership of the lock.

For a nestable lock, the omp_unset_nest_lock function decrements the nesting count, and releases the thread executing the function from ownership of the lock if the resulting count is zero.

omp_test_lock and omp_test_nest_lock Functions
These functions attempt to set a lock but do not block execution of the thread. The format is as follows:
#include <omp.h>
int omp_test_lock(omp_lock_t *lock);
int omp_test_nest_lock(omp_nest_lock_t *lock);
The argument must point to an initialized lock variable. These functions attempt to set a lock in the same manner as omp_set_lock and omp_set_nest_lock, except that they do not block execution of the thread.

For a simple lock, the omp_test_lock function returns non-zero if the lock is successfully set; otherwise, it returns zero.

For a nestable lock, the omp_test_nest_lock function returns the new nesting count if the lock is successfully set; otherwise, it returns zero.
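The life cycle of a simple lock (initialize, set, unset, test, destroy) can be sketched as below. The function names locked_sum and try_lock_once are hypothetical. The #else branch defines serial stand-ins, an assumption of this sketch and not the real <omp.h> types, so the example also builds without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (an assumption of this sketch) for non-OpenMP builds. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { *l = -1; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static int  omp_test_lock(omp_lock_t *l)
{
    if (*l != 0)
        return 0;
    *l = 1;
    return 1;
}
#endif

/* Add 1..n to a shared total, guarding each update with a simple lock. */
long locked_sum(int n)
{
    omp_lock_t lock;
    long total = 0;
    int i;

    omp_init_lock(&lock);          /* initial state: unlocked */

    #pragma omp parallel for
    for (i = 1; i <= n; i++) {
        omp_set_lock(&lock);       /* blocks until the lock is available */
        total += i;
        omp_unset_lock(&lock);     /* release ownership */
    }

    omp_destroy_lock(&lock);       /* the lock must be unlocked here */
    return total;
}

/* Returns 1 if omp_test_lock acquires a free lock without blocking. */
int try_lock_once(void)
{
    omp_lock_t lock;
    int acquired;

    omp_init_lock(&lock);
    acquired = (omp_test_lock(&lock) != 0);
    if (acquired)
        omp_unset_lock(&lock);
    omp_destroy_lock(&lock);
    return acquired;
}
```

A thread that cannot afford to wait would call omp_test_lock in place of omp_set_lock and do other work when it returns zero.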

Memory Classes

In order to use memory classes in C programs, you must include the header file /usr/include/spp_prog_model.h. Memory classes are described in the Parallel Programming Guide for HP-UX Systems. The memory classes described in this section include:

In C, the general form for assigning memory is:

#include <spp_prog_model.h>

. . .

[storage_class_specifier] memory_class_name type_specifier namelist

where:

Data objects that are assigned a memory class must have a static storage duration. If the object is declared within a function, it must have the storage class extern or static. Data objects declared at file scope and assigned a memory class need not specify a storage class.

A hypernode is a set of processors and physical memory organized as a symmetric multiprocessor (SMP) running a single image of the operating system microkernel.

node_private

This storage class specifier causes the variables and arrays specified in namelist to be replicated in the physical memory of each hypernode on which the process is executing. While each data object has a single image in virtual memory, it maps to a different physical location on each hypernode. The threads of a process within a hypernode all share access to the copy on their hypernode and cannot access the copies on other hypernodes.

thread_private

This storage class specifier causes the variables and arrays to be treated as thread_private. These data objects map to unique node_private addresses for each thread of a process. Refer to the Parallel Programming Guide for HP-UX Systems for more information.

Synchronization Functions

HP C provides functions that can be used with pragmas to achieve synchronization. Those discussed in this section include:

Gates allow you to restrict execution of a block of code to a single thread. They can be allocated, locked, unlocked, or deallocated. They can also be used with the ordered or critical section pragmas, which automate the locking and unlocking functions.

Barriers block further execution until all executing threads reach the barrier.

You declare gates and barriers by using the following type definitions:

Gates and barriers should only appear in definition and declaration statements, and as formal and actual arguments.

Allocate Functions

These functions allocate memory for a gate or barrier. When memory is first allocated, gate variables are unlocked.
int alloc_gate(gate_t *gate_p);
int alloc_barrier(barrier_t *barrier_p);
gate_p and barrier_p are pointers of the indicated type, which have been previously declared as described above.

Deallocate Functions

These functions free the memory assigned to the specified gate or barrier variable.

These functions have the following declarations:

int free_gate(gate_t *gate_p);
int free_barrier(barrier_t *barrier_p);
where gate_p and barrier_p are pointers of the indicated type. Always free gates and barriers when you are done using them.

Locking Functions

These functions acquire a gate for exclusive access. If the gate cannot be immediately acquired, the calling thread waits for it. The conditional locking functions, which are prefixed with COND_ or cond_, acquire a gate if doing so does not require a wait. If the gate is acquired, the functions return 0; if not, they return -1.

The functions have the following declarations:

int lock_gate(gate_t *gate_p);
int cond_lock_gate(gate_t *gate_p);
where gate_p is a pointer of the indicated type.

Unlocking Function

This function releases a gate from exclusive access. Gates are typically released by the thread that locks them, unless a gate was locked by thread 0 in serial code. In that case it might be unlocked by a single different thread in a parallel construct.

The function has the following declaration:

int unlock_gate(gate_t *gate_p);
where gate_p is a pointer of the indicated type.

Wait Function

This function uses a barrier to cause the calling thread to wait until the specified number of threads call the function, at which point all threads are released from the function simultaneously.

The function has the following declaration:

int wait_barrier(barrier_t *barrier_p, const int *nthr);
where barrier_p is a pointer of the indicated type and nthr is a pointer referencing the number of threads calling the routine.

You can use a barrier variable in multiple calls to the wait_barrier function, as long as you ensure that two barriers are not active at the same time. Also, check that nthr reflects the correct number of threads.