Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 9 Parallel programming techniques

Parallelizing regions


A parallel region is a single block of code that is written to run replicated on several threads. Certain scalar code within the parallel region is run by each thread in preparation for work-sharing parallel constructs such as prefer_parallel(dist), loop_parallel(dist), or begin_tasks(dist). The scalar code typically assigns data into parallel_private variables so that subsequent references to the data have a high cache hit rate. Within a parallel region, code execution can be restricted to subsets of threads by using conditional blocks that test the thread ID.

Region parallelization differs from task parallelization in that parallel tasks are separate, contiguous blocks of code. When parallelized using the tasking directives and pragmas, each block generally runs on a separate thread. By contrast, a single parallel region runs on several threads.

Specifying parallel tasks is also typically less time-consuming because each thread's work is implicitly defined by the task boundaries. In region parallelization, you must manually modify the region to identify thread-specific code. However, region parallelism can reduce parallelization overhead, as discussed in the section explaining the dist attribute.
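
The task model described above can be sketched in portable C. This is only an illustration under stated assumptions: it uses POSIX threads in place of the HP tasking pragmas, and the names task_data, task_sum, task_max, and run_tasks are hypothetical. Each task is a separate block of code, so the work each thread performs is defined by the task boundary alone and no thread ID tests are needed.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

/* Hypothetical data shared by the two tasks. */
typedef struct {
    int n;              /* number of input elements, must be >= 1 */
    const double *in;   /* read-only input */
    double sum;         /* written only by task_sum */
    double max;         /* written only by task_max */
} task_data;

/* Task 1: a separate block of code that sums the input. */
static void *task_sum(void *p)
{
    task_data *d = p;
    double s = 0.0;
    for (int i = 0; i < d->n; i++)
        s += d->in[i];
    d->sum = s;
    return NULL;
}

/* Task 2: a different block of code that finds the maximum. */
static void *task_max(void *p)
{
    task_data *d = p;
    double m = d->in[0];
    for (int i = 1; i < d->n; i++)
        if (d->in[i] > m)
            m = d->in[i];
    d->max = m;
    return NULL;
}

/* Runs each task on its own thread; returns 0 on success, -1 on failure. */
int run_tasks(task_data *d)
{
    pthread_t t1, t2;

    assert(d != NULL && d->n >= 1);
    if (pthread_create(&t1, NULL, task_sum, d) != 0)
        return -1;
    if (pthread_create(&t2, NULL, task_max, d) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);   /* wait for both tasks before continuing */
    pthread_join(t2, NULL);
    return 0;
}
```

The two tasks write to different members of task_data, so they need no synchronization beyond the final joins.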

The beginning of a parallel region is denoted by the parallel directive or pragma. The end is denoted by the end_parallel directive or pragma. end_parallel also prevents execution from continuing until all copies of the parallel region have completed.

Within a parallel region, the compiler does not check for data dependences, perform variable privatization, or perform parallelization analysis. You must manually synchronize any dependences between copies of the region and manually privatize data as necessary. In the absence of a threads attribute, parallel defaults to thread parallelization.
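
The behavior of a parallel region can be approximated in portable C. The following is a minimal sketch, assuming POSIX threads as a stand-in for the HP pragmas; region_arg, region_body, run_region, and the MAX_THREADS bound are hypothetical names chosen for illustration. Every thread runs the same region body, keeps its scalar setup in a private local variable, and tests its thread ID to decide whether to execute a restricted block. The final joins play the role of end_parallel.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 8   /* illustrative upper bound, not an HP limit */

/* Hypothetical per-thread argument block. */
typedef struct {
    int tid;       /* 0-based thread ID, playing the role of my_thread() */
    int *result;   /* shared array; copy t writes only result[t] */
} region_arg;

/* The region body: every thread executes this same code. */
static void *region_body(void *p)
{
    region_arg *arg = p;
    int value = arg->tid * 10;   /* scalar setup kept in a thread-private local */

    if (arg->tid == 0)           /* conditional block restricted to thread 0 */
        value += 1;

    arg->result[arg->tid] = value;
    return NULL;
}

/* Runs nthreads copies of the region; returns 0 on success, -1 on failure. */
int run_region(int nthreads, int *result)
{
    pthread_t th[MAX_THREADS];
    region_arg args[MAX_THREADS];
    int started = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    for (int t = 0; t < nthreads; t++) {
        args[t].tid = t;
        args[t].result = result;
        if (pthread_create(&th[t], NULL, region_body, &args[t]) != 0)
            break;
        started++;
    }
    /* Like end_parallel: do not continue until every copy has finished. */
    for (int t = 0; t < started; t++)
        pthread_join(th[t], NULL);
    return started == nthreads ? 0 : -1;
}
```

Because each copy writes only its own element of result, the copies have no dependences on one another and the sketch needs no synchronization beyond the joins.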

The forms of the regional parallelization directives and pragmas are shown in Table 9-10 “Forms of region parallelization directives and pragmas”.

Table 9-10 Forms of region parallelization directives and pragmas

Language    Form

Fortran     C$DIR PARALLEL[(attribute-list)]
            C$DIR END_PARALLEL

C           #pragma _CNX parallel(attribute-list)
            #pragma _CNX end_parallel

The optional attribute-list can contain one of the following attributes (m is an integer constant).

Table 9-11 Attributes for region parallelization

Attribute          Description

max_threads = m    Restricts execution of the specified region to no more than m threads if specified alone or with the threads attribute. m must be an integer constant.

                   Can include any combination of ordered or unordered execution.

WARNING! Do not use the parallel region directives or pragmas unless you have ensured that no dependences exist or have inserted your own synchronization code in the region where it is needed. The compiler performs no dependence checking or synchronization on the code delimited by the parallel region directives and pragmas. Synchronization is discussed in Chapter 12 “Parallel synchronization”.
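
The kind of manual synchronization this warning calls for can be sketched in portable C. This is an illustration only: it assumes POSIX threads and a POSIX mutex rather than the HP synchronization facilities covered in Chapter 12, and shared_total, sync_arg, sum_region, and parallel_sum are hypothetical names. Every copy of the region updates the same shared total, which is a dependence between copies, so the update is protected by a lock that the programmer supplies.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 8   /* illustrative upper bound */

/* Hypothetical shared state: one total updated by every copy of the region. */
typedef struct {
    pthread_mutex_t lock;
    long total;
} shared_total;

/* Hypothetical per-thread argument block. */
typedef struct {
    int tid, nthreads, n;
    shared_total *shared;
} sync_arg;

static void *sum_region(void *p)
{
    sync_arg *a = p;
    long partial = 0;   /* private to this thread, so no locking is needed here */

    /* Cyclic distribution: thread t handles t+1, t+1+nthreads, ... up to n. */
    for (int i = a->tid + 1; i <= a->n; i += a->nthreads)
        partial += i;

    /* The shared update is a dependence between copies: guard it manually. */
    pthread_mutex_lock(&a->shared->lock);
    a->shared->total += partial;
    pthread_mutex_unlock(&a->shared->lock);
    return NULL;
}

/* Returns 1 + 2 + ... + n computed by nthreads threads, or -1 on failure. */
long parallel_sum(int nthreads, int n)
{
    pthread_t th[MAX_THREADS];
    sync_arg args[MAX_THREADS];
    shared_total shared;
    int started = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    shared.total = 0;
    if (pthread_mutex_init(&shared.lock, NULL) != 0)
        return -1;

    for (int t = 0; t < nthreads; t++) {
        args[t].tid = t;
        args[t].nthreads = nthreads;
        args[t].n = n;
        args[t].shared = &shared;
        if (pthread_create(&th[t], NULL, sum_region, &args[t]) != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(th[t], NULL);
    pthread_mutex_destroy(&shared.lock);
    return started == nthreads ? shared.total : -1;
}
```

Each thread accumulates into a private variable and takes the lock only once, which keeps the synchronized section short.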

Region parallelization

The following Fortran example provides an implementation of region parallelization using the PARALLEL directive:

      REAL A(1000,8), B(1000,8), C(1000,8), RDONLY(1000), SUM(8)
      INTEGER MYTID
      .
      .
      .
C FIRST INITIALIZATION OF RDONLY IN SERIAL CODE:
      CALL INIT1(RDONLY)
      IF(NUM_THREADS() .LT. 8) STOP "NOT ENOUGH THREADS; EXITING"
C$DIR PARALLEL(MAX_THREADS = 8), PARALLEL_PRIVATE(I, J, K, MYTID)
      MYTID = MY_THREAD() + 1 !ADD 1 FOR PROPER SUBSCRIPTING
      DO I = 1, 1000
        A(I, MYTID) = B(I, MYTID) * RDONLY(I)
      ENDDO
      IF(MYTID .EQ. 1) THEN   ! ONLY THREAD 0 EXECUTES SECOND
        CALL INIT2(RDONLY)    ! INITIALIZATION
      ENDIF
      DO J = 1, 1000
        B(J, MYTID) = B(J, MYTID) * RDONLY(J)
        C(J, MYTID) = A(J, MYTID) * B(J, MYTID)
      ENDDO
      DO K = 1, 1000
        SUM(MYTID) = SUM(MYTID) + A(K,MYTID) + B(K,MYTID) + C(K,MYTID)
      ENDDO
C$DIR END_PARALLEL

In this example, every array written to in the parallel loops has a second dimension with one entry for each anticipated parallel thread. Because each thread works on disjoint data, no two threads can attempt to update the same element, so the writes to these arrays need no explicit synchronization. The RDONLY array is one-dimensional, but the parallel loops only read it. Before the parallel region, RDONLY is initialized in serial code.

The PARALLEL_PRIVATE directive is used to privatize the induction variables I, J, and K and the thread ID variable MYTID used in the parallel region. This must be done so that the various threads processing the region do not attempt to write to the same shared variables. PARALLEL_PRIVATE is covered in more detail in the section parallel_private.

Immediately before the parallel region, the NUM_THREADS() intrinsic is called to ensure that the expected number of threads is available. At the beginning of the region, each thread calls the MY_THREAD() intrinsic to determine its thread ID. All subsequent code in the region is executed based on this ID. In the I loop, each thread computes one column of A using RDONLY and the corresponding column of B.

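RDONLY is reinitialized in a subroutine call that is executed only by thread 0, before the array is used again in the computation of B in the J loop. Because the region contains no synchronization, this is safe only if the other threads have finished reading RDONLY in the I loop before the call and do not begin the J loop until it returns; code that cannot guarantee this must add synchronization as described in Chapter 12. In the J loop, each thread again computes one column of B, and then computes the corresponding column of C.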

Finally, the K loop sums each thread's column of A, B, and C into the SUM array. No synchronization is necessary here because each thread runs the entire loop serially and assigns into its own distinct element of SUM.
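
For readers without access to the HP compilers, the structure of this example can be approximated in portable C. The following is a sketch under stated assumptions, not a translation the HP toolchain would produce: it uses POSIX threads in place of the PARALLEL directive, init1 and init2 are hypothetical stand-ins for INIT1 and INIT2, and the test data assigned to b is arbitrary. It also adds two barrier waits around the reinitialization of rdonly, which do not appear in the Fortran listing, so that the result does not depend on thread timing. The sum is accumulated in a private local and stored once, so the output array need not be initialized beforehand.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

#define N 1000
#define NTHREADS 8

/* x[t][i] in C plays the role of X(I, T+1) in the Fortran listing. */
static double a[NTHREADS][N], b[NTHREADS][N], c[NTHREADS][N];
static double rdonly[N], sum[NTHREADS];
static pthread_barrier_t barrier;

/* Hypothetical stand-ins for INIT1 and INIT2. */
static void init1(double *r) { for (int i = 0; i < N; i++) r[i] = 1.0; }
static void init2(double *r) { for (int i = 0; i < N; i++) r[i] = 2.0; }

static void *region(void *p)
{
    int t = *(int *)p;   /* 0-based thread ID, so no +1 adjustment is needed */

    for (int i = 0; i < N; i++)
        a[t][i] = b[t][i] * rdonly[i];

    /* Added synchronization (an assumption, not in the original listing). */
    pthread_barrier_wait(&barrier);   /* every thread has finished reading rdonly */
    if (t == 0)
        init2(rdonly);                /* only thread 0 reinitializes */
    pthread_barrier_wait(&barrier);   /* new rdonly values are ready for all threads */

    for (int j = 0; j < N; j++) {
        b[t][j] = b[t][j] * rdonly[j];
        c[t][j] = a[t][j] * b[t][j];
    }

    double s = 0.0;                   /* private accumulator */
    for (int k = 0; k < N; k++)
        s += a[t][k] + b[t][k] + c[t][k];
    sum[t] = s;                       /* each thread writes its own element */
    return NULL;
}

/* Runs the region on NTHREADS threads and copies the sums to out. */
int run_example(double out[NTHREADS])
{
    pthread_t th[NTHREADS];
    int tid[NTHREADS];

    for (int t = 0; t < NTHREADS; t++)
        for (int i = 0; i < N; i++)
            b[t][i] = t + 1;          /* arbitrary test data */
    init1(rdonly);                    /* first initialization, in serial code */

    if (pthread_barrier_init(&barrier, NULL, NTHREADS) != 0)
        return -1;
    for (int t = 0; t < NTHREADS; t++) {
        tid[t] = t;
        int rc = pthread_create(&th[t], NULL, region, &tid[t]);
        /* A missing thread would leave the others waiting at the barrier
           forever, so this sketch treats a creation failure as fatal. */
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);    /* corresponds to END_PARALLEL */
    pthread_barrier_destroy(&barrier);

    for (int t = 0; t < NTHREADS; t++)
        out[t] = sum[t];
    return 0;
}
```

With the test data shown, thread t starts with b equal to t + 1, so its column of a holds t + 1, its column of b becomes 2(t + 1), and its column of c becomes 2(t + 1)^2 before the K loop adds them up.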

© Hewlett-Packard Development Company, L.P.