Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP
More options
HP.com home
Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide > Appendix F Compiler Parallel Support Library

Examples

» 

Technical documentation

Complete book in PDF
» Feedback
Content starts here

 » Table of Contents

 » Glossary

The examples presented here demonstrate various constructs that can be programmed using the CPSlib functions described in the previous sections.

Symmetric parallelism

There are two forms of symmetric parallelism: block parallelism, and cyclic parallelism. The CPSlib functions used in the examples that follow are described in detail in the section “CPS library functions”.

Block parallelism

Block parallelism is the most commonly used form of parallelism; it is the form generated by default by Exemplar compilers. It involves splitting up the iterations of a loop into iteration blocks of similar size, and running each block on a separate processor.

A simple Fortran example that uses CPSlib to implement block parallelism follows. The CPSlib functions used here are described in detail in the "“CPS library functions”" section.

      PROGRAM CPSBLOCK
REAL X(1000), Y(1000), Z(1000)
INTEGER PARGS(4), CPS_PPCALLN, NTHR, CPS_NODE_CPUS
C$DIR SYNC_ROUTINE(CPS_PPCALLN, CPS_NODE_CPUS)
EXTERNAL PARBLK ! REQUIRED BECAUSE PARBLK IS AN ARGUMENT
C INITIALIZE PARGS ARRAY:
PARGS(1) = -2 ! ALLOCATE THREADS ON CALLING THREAD'S NODE
PARGS(2) = 2 ! MINIMUM OF 2 THREADS
PARGS(3) = CPS_NODE_CPUS() ! MAXIMUM # OF THREADS
PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER HYPERNODE
C SPAWN THREADS:
ITHREAD = CPS_PPCALLN(PARGS, PARBLK, 4, X, Y, Z, NTHR)
C IF SPAWN FAILS, REPORT:
IF (ITHREAD .LT. 0) PRINT *,'PPCALLN FAILED'
.
. ! SERIAL CODE
.
END

SUBROUTINE PARBLK (X, Y, Z, NTHR)
REAL X(1000), Y(1000), Z(1000)
INTEGER CPS_NSTHREADS, CPS_STID, STID, NTHR
C$DIR SYNC_ROUTINE(CPS_NSTHREADS, CPS_STID)
STID = CPS_STID() ! GET MY STID
NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNED
ITPERPROC = 1000/NTHR ! COMPUTE ITERATIONS PER THREAD
IEXCESS = 1000-ITPERPROC*NTHR ! COMPUTE EXCESS ITERATIONS
C COMPUTE LOOP START AND END FOR CASES OF NO EXCESS OR FOR THREADS
C THAT DO NOT HANDLE EXCESS:
IF(STID .GE. IEXCESS) THEN
MYSTART = STID*ITPERROC + IEXCESS + 1
MYEND = MYSTART + ITPERPROC - 1
C COMPUTE LOOP START AND END FOR THREADS THAT HANDLE EXCESS:
ELSE
MYSTART = STID * (ITPERPROC+1) + 1
MYEND = MYSTART + (ITPERPROC+1) - 1
ENDIF
C ACTUAL COMPUTATION:
DO J = MYSTART, MYEND
Z(J) = X(J) + Y(J)
ENDDO
RETURN
END

This example calls CPS_NODE_CPUS to find the number of available threads, then calls CPS_PPCALLN to spawn parallel threads to run the subroutine PARBLK.

PARBLK then determines the number of iterations necessary per processor; if the number of processors does not integrally divide the number of iterations, it automatically adjusts some blocks to handle the extra iterations. Finally, the loop in PARBLK performs its body in parallel, with each thread operating on the appropriate iteration range.

Note the error trap immediately after the call to CPS_PPCALLN; this is important, as it provides the only means of knowing if the spawn failed.

Cyclic parallelism

Cyclic parallelism distributes consecutive iterations of a loop to separate processors. It is similar to the parallelism achieved through use of the loop_parallel(ordered) directive and pragma, but it does not order the iterations automatically; you must handle any necessary ordering manually. loop_parallel(ordered) is discussed in the section loop_parallel(ordered) in Chapter 6, "Chapter 6 “Advanced shared-memory programming”."

A simple Fortran example that uses CPSlib to implement cyclic parallelism follows. The CPSlib functions used here are described in detail in the "“CPS library functions”" section.

      PROGRAM CPSCYCLE
REAL X(1000), Y(1000), Z(1000), SUM
INTEGER PARGS(4), CPS_PPCALLN, NTHR, CPS_NODE_CPUS
C$DIR SYNC_ROUTINE(CPS_PPCALLN, CPS_NODE_CPUS)
EXTERNAL PARCYC ! PARCYC IS AN ARGUMENT
READ *, NPROCSC
C INITIALIZE PARGS ARRAY:
PARGS(1) = -2 ! ALLOCATE THREADS ON CALLING THREAD'S NODE
PARGS(2) = 2 ! MINIMUM OF 2 THREADS
PARGS(3) = CPS_NODE_CPUS() ! MAXIMUM # OF THREADS
PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER HYPERNODE
C SPAWN THREADS:
ITHREAD = CPS_PPCALLN (PARGS, PARCYC, 4, X, Y, Z, NTHR)
C IF SPAWN FAILS, REPORT:
IF (ITHREAD .LT. 0) PRINT *,'PPCALLN FAILED'
.
. ! SERIAL CODE
.
END

SUBROUTINE PARCYC (X, Y, Z, NTHR)
REAL X(1000), Y(1000), Z(1000)
INTEGER CPS_NSTHREADS, CPS_STID, STID, NTHR
C$DIR SYNC_ROUTINE(CPS_NSTHREADS, CPS_STID)
STID = CPS_STID() ! GET MY STID
NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNED
C ACTUAL COMPUTATION:
DO J = 1+STID, 1000, NTHR ! STEP BY NUMBER OF THREADS
Z(J) = X(J) + Y(J)
ENDDO
RETURN
END

This example works exactly like the block parallelism example, except the loop in PARCYC, its parallel subroutine, is cyclically parallel. Cyclic parallelism is accomplished here by offsetting the loop start value by spawn thread ID and stepping the loop by the number of parallel threads. This ensures that each thread computes a unique array element on every step of the loop; NTHR elements are computed per step. Contiguous STIDs compute contiguous elements.

Asymmetric parallelism

A simple Fortran program that implements asymmetric parallelism follows. The CPSlib functions used here are described in detail in the "“CPS library functions”" section.

      PROGRAM ASYM
REAL X1(1000), X2(1000), Y1(1000), Y2(1000), Z(1000)
INTEGER CPS_THREAD_CREATE, CPS_THREAD_WAIT
C$DIR SYNC_ROUTINE(CPS_THREAD_CREATE, CPS_THREAD_WAIT)
COMMON /POINTS/ X1, X2, Y1, Y2
EXTERNAL DISTANCE ! DISTANCE IS AN ARGUMENT
.
. ! SERIAL CODE
. ! EXAMPLE CONTINUED
C SPAWN ASYMMETRIC THREAD TO EXECUTE SUBROUTINE DISTANCE:
ITHREAD = CPS_THREAD_CREATE(-2, DISTANCE, Z)
IF (ITHREAD .LT. 0) PRINT*, "THREAD_CREATE FAILED IN MAIN"
.
. ! THIS CODE RUNS IN PARALLEL WITH DISTANCE
.
IWAIT = CPS_THREAD_WAIT(1) ! WAIT FOR ALL ASYMMETRIC
! THREADS TO TERMINATE
IF(IWAIT .LT. 0) PRINT*, "CPS_THREAD_WAIT FAILED"
.
. ! THIS CODE RUNS SERIALLY AFTER PARALLEL THREADS
. ! TERMINATE
END
      SUBROUTINE DISTANCE(Z)
REAL X1(1000), X2(1000), Y1(1000), Y2(1000), Z(1000)
REAL X3(1000), Y3(1000)
INTEGER CPS_THREAD_CREATE, CPS_THREAD_WAIT
C$DIR SYNC_ROUTINE(CPS_THREAD_CREATE, CPS_THREAD_WAIT)
COMMON /POINTS/ X1, X2, Y1, Y2
EXTERNAL FINDX ! FINDX IS AN ARGUMENT
C SPAWN ASYMMETRIC THREAD TO EXECUTE SUBROUTINE FINDX:
JTHREAD = CPS_THREAD_CREATE(-2, FINDX, X3)
IF (JTHREAD .LT. 0) PRINT*, "THREAD_CREATE FAILED IN DISTANCE"
DO I = 1, 1000 ! COMPUTE Y3 IN PARALLEL WITH FINDX
Y3(I) = (Y2(I) - Y1(I))**2
ENDDO
10 IWAIT = CPS_THREAD_WAIT(0) ! FIND NUMBER OF ASYM THREADS
IF(IWAIT .LT. 0) PRINT*, "CPS_THREAD_WAIT FAILED"
IF(IWAIT .GT. 1) GOTO 10 ! SPIN UNTIL ONLY THIS THREAD
! IS ACTIVE
DO I = 1, 1000 ! COMPUTE Z SERIALLY AFTER X3 AND Y3
Z(I) = SQRT(X3(I) + Y3(I))
ENDDO
RETURN
END

SUBROUTINE FINDX(X3) ! RUNS IN PARALLEL WITH COMPUTATION
! OF Y3
REAL X1(1000), X2(1000), Y1(1000), Y2(1000), X3(1000)
COMMON /POINTS/ X1, X2, Y1, Y2
DO I = 1, 1000
X3(I) = (X2(I) - X1(I))**2
ENDDO
RETURN
END

In this example, the arrays X3 and Y3 must be computed before the array Z can be computed. The main program spawns an asymmetric parallel thread to run DISTANCE, which spawns an asymmetric thread to run FINDX. DISTANCE then computes Y3 while FINDX computes X3; all the while, the main program can be doing other work in parallel with both subroutines. When DISTANCE is done with Y3, it waits until FINDX is done with X3, then computes Z. The main program waits until DISTANCE is done, then proceeds with more work.

Note the way in which CPS_THREAD_WAIT is used when called from DISTANCE; this is explained further in the "“CPS library functions”" section.

Synchronization using high-level functions

This section demonstrates how to use barriers and mutexes to synchronize symmetrically parallel code.

Barriers

Remember that, when you use cps_ppcall() to spawn symmetric parallelism, before the function returns, a join operation takes place after all spawned threads terminate. This join is an implicit barrier, since thread 0 cannot proceed until all parallel threads terminate. In many cases, this is the only barrier synchronization you will require.

However, the cps_barrier() high-level synchronization functions allow you to explicitly create barriers if necessary.

The following Fortran example is similar to the symmetric parallelism example in the section “Block parallelism” except that instead of relying on the implicit barrier contained in the call to CPS_PPCALL(), it contains an explicit CPS_BARRIER() in the subroutine SUMMER.

      PROGRAM BAR
REAL A(1000)
REAL SUM(:), TOTSUM
INTEGER PARGS(4), SUMBAR, CPS_NODE_CPUS, CPS_PPCALLN
INTEGER CPS_BARRIER_ALLOC, CPS_BARRIER_FREE
C$DIR SYNC_ROUTINE(CPS_BARRIER_ALLOC,CPS_BARRIER_FREE)
C$DIR SYNC_ROUTINE(CPS_NODE_CPUS,CPS_PPCALLN)
EXTERNAL SUMMER
ALLOCATABLE SUM
NCPUS = CPS_NODE_CPUS()
ALLOCATE(SUM(0:NCPUS-1)) ! ONE ELEMENT FOR EACH CPU
PARGS(1) = -2 ! ALLOCATE THREADS ON CALLING THREAD'S NODE
PARGS(2) = 2 ! MINIMUM OF 2 THREADS PER NODE
PARGS(3) = NCPUS ! MAXIMUM # OF THREADS PER NODE
PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER NODE
DO I = 0, NCPUS-1 ! INITIALIZE SUM
SUM(I) = 0.0
ENDDO
.
. ! SERIAL CODE
.
IERR = CPS_BARRIER_ALLOC(SUMBAR) ! ALLOCATE BARRIER
IF (IERR .LT. 0) PRINT*, "BARRIER ALLOCATION FAILED"
C SPAWN PARALLEL THREADS:
IERR = CPS_PPCALLN(PARGS,SUMMER,5,A,SUM,TOTSUM,SUMBAR,NCPUS)
IF (IERR .LT. 0) PRINT*, "PPCALL FAILED"
IERR = CPS_BARRIER_FREE(SUMBAR) ! FREE BARRIER
IF (IERR .LT. 0) PRINT*, "BARRIER FREE FAILED"
.
. ! SERIAL CODE
.
END



SUBROUTINE SUMMER(A,SUM,TOTSUM,SUMBAR,NCPUS)
INTEGER STID, NTHR, SUMBAR
INTEGER CPS_STID, CPS_NSTHREADS, CPS_BARRIER
C$DIR SYNC_ROUTINE(CPS_STID, CPS_NSTHREADS, CPS_BARRIER)
REAL A(1000), SUM(0:NCPUS-1), TOTSUM
STID = CPS_STID() ! GET MY STID
NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNED
ITPERPROC = 1000/NTHR ! COMPUTE ITERATIONS PER THREAD
IEXCESS = 1000-ITPERPROC*NTHR ! COMPUTE EXCESS ITERATIONS
C COMPUTE LOOP START AND END FOR CASES OF NO EXCESS OR FOR THREADS
C THAT DO NOT HANDLE EXCESS:
IF(STID .GE. IEXCESS) THEN
MYSTART = STID*ITPERROC + IEXCESS + 1
MYEND = MYSTART + ITPERPROC - 1
C COMPUTE LOOP START AND END FOR THREADS THAT HANDLE EXCESS:
ELSE
MYSTART = STID * (ITPERPROC+1) + 1
MYEND = MYSTART + (ITPERPROC+1) - 1
ENDIF
C ACTUAL COMPUTATION:
DO J = MYSTART, MYEND ! EACH THREAD COMPUTES LOCAL SUM
SUM(STID) = SUM(STID) + A(J)
ENDDO
C WAIT UNTIL ALL THREADS ARE DONE COMPUTING THEIR PORTION OF SUM:
IERR = CPS_BARRIER(SUMBAR, NTHR)
IF (IERR .LT. 0) PRINT*, "BARRIER FAILED"
IF(STID .EQ. 0) THEN ! THREAD 0 COMPUTES TOTAL SUM
DO I = 0, NTHR-1
TOTSUM = TOTSUM + SUM(I)
ENDDO
ENDIF
RETURN
END

Here, the subroutine SUMMER is called in parallel to compute the sum of the elements of array A. Each parallel thread computes its own sum in an element of the array SUM. The CPS_BARRIER function is used to prevent execution of any further code until all threads have finished computing their portion of SUM. When CPS_BARRIER returns, thread 0 computes TOTSUM, and SUMMER returns.

Mutexes

CPSlib mutexes allow you to limit access to the sections of code they delimit to one thread at a time, allowing you to construct critical sections similar to those discussed in Chapter 4, "Chapter 4 “Basic shared-memory
programming”
."

In the following Fortran example, the routine SUMMER performs the same task it did in the preceding barrier example. However, here access to the TOTSUM computation takes place in fully parallel code; it is limited to one thread at a time by the mutex SUMMUTEX. This eliminates the need for each thread to compute independent SUM arrays as in the preceding barrier example.

      PROGRAM MUT
REAL A(1000)
REAL TOTSUM
INTEGER PARGS(4), SUMMUTEX
INTEGER CPS_NODE_CPUS,CPS_PPCALLN
INTEGER CPS_MUTEX_ALLOC,CPS_MUTEX_FREE
C$DIR SYNC_ROUTINE(CPS_NODE_CPUS,CPS_PPCALLN)
C$DIR SYNC_ROUTINE(CPS_MUTEX_ALLOC,CPS_MUTEX_FREE)
EXTERNAL SUMMER
NCPUS = CPS_NODE_CPUS()
PARGS(1) = -2 ! ALLOCATE THREADS ON CALLING THREAD'S NODE
PARGS(2) = 2 ! MINIMUM OF 2 THREADS PER NODE
PARGS(3) = NCPUS ! MAXIMUM # OF THREADS PER NODE
PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER NODE
TOTSUM = 0.0 ! INITIALIZE TOTSUM
.
. ! SERIAL CODE
.
IERR = CPS_MUTEX_ALLOC(SUMMUTEX) ! ALLOCATE MUTEX
IF (IERR .LT. 0) PRINT*, "MUTEX ALLOCATION FAILED"
C SPAWN PARALLEL THREADS:
IERR = CPS_PPCALLN(PARGS,SUMMER,3,A,TOTSUM,SUMMUTEX)
IF (IERR .LT. 0) PRINT*, "PPCALL FAILED"
IERR = CPS_MUTEX_FREE(SUMMUTEX) ! FREE MUTEX
IF (IERR .LT. 0) PRINT*, "MUTEX FREE FAILED"
.
. ! SERIAL CODE
.
END
      SUBROUTINE SUMMER(A,TOTSUM,SUMMUTEX)
INTEGER STID, NTHR, SUMMUTEX
INTEGER CPS_STID, CPS_NSTHREADS
INTEGER CPS_MUTEX_LOCK, CPS_MUTEX_UNLOCK
C$DIR SYNC_ROUTINE(CPS_STID, CPS_NSTHREADS)
C$DIR SYNC_ROUTINE(CPS_MUTEX_LOCK, CPS_MUTEX_UNLOCK)
REAL A(1000),TOTSUM
STID = CPS_STID() ! GET MY STID
NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNED
ITPERPROC = 1000/NTHR ! COMPUTE ITERATIONS PER THREAD
IEXCESS = 1000-ITPERPROC*NTHR ! COMPUTE EXCESS ITERATIONS
C COMPUTE LOOP START AND END FOR CASES OF NO EXCESS OR FOR THREADS
C THAT DO NOT HANDLE EXCESS:
IF(STID .GE. IEXCESS) THEN
MYSTART = STID*ITPERROC + IEXCESS + 1
MYEND = MYSTART + ITPERPROC - 1
C COMPUTE LOOP START AND END FOR THREADS THAT HANDLE EXCESS:
ELSE
MYSTART = STID * (ITPERPROC+1) + 1
MYEND = MYSTART + (ITPERPROC+1) - 1
ENDIF
C ACTUAL COMPUTATION:
DO J = MYSTART, MYEND
C MUTEX LIMITS ACCESS TO TOTSUM TO ONE THREAD AT A TIME:
IERR = CPS_MUTEX_LOCK(SUMMUTEX)
IF (IERR .LT. 0) PRINT*, "MUTEX LOCK FAILED"
TOTSUM = TOTSUM + A(J)
IERR = CPS_MUTEX_UNLOCK(SUMMUTEX)
IF(IERR .LT. 0) PRINT*, "MUTEX UNLOCK FAILED"
ENDDO
RETURN
END

Here, as in the barrier example, SUMMER is called in parallel. Each parallel thread then waits until it can lock SUMMUTEX before updating TOTSUM.

Synchronization using low-level functions

This section demonstrates how to use semaphores to synchronize symmetrically parallel code.

Critical sections

Critical sections like the one in the preceding mutex example can be implemented in a similar fashion using cache-based or memory-based semaphores.

The following Fortran example is identical to the mutex example, but implements the critical section using a memory-based semaphore instead of a mutex:

      PROGRAM SEM
REAL A(1000)
REAL TOTSUM
INTEGER PARGS(4), SUMSEM, SEMCNT
INTEGER CPS_NODE_CPUS, CPS_PPCALLN, M_INIT32, M_FREE32
C$DIR SYNC_ROUTINE(CPS_NODE_CPUS, CPS_PPCALLN, M_INIT32, M_FREE32)
EXTERNAL SUMMER
NCPUS = CPS_NODE_CPUS()
PARGS(1) = -2 ! ALLOCATE THREADS ON CALLING THREAD'S NODE
PARGS(2) = 2 ! MINIMUM OF 2 THREADS PER NODE
PARGS(3) = NCPUS ! MAXIMUM # OF THREADS PER NODE
PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER NODE
TOTSUM = 0.0 ! INITIALIZE TOTSUM
SEMCNT = 0 ! COUNTER FOR SEMAPHORE; VALUE IS IRRELEVANT
.
. ! SERIAL CODE
.
IERR = M_INIT32(SUMSEM,SEMCNT) ! ALLOCATE SUMSEM
IF (IERR .LT. 0) PRINT*, "SEMAPHORE ALLOCATION FAILED"
IERR = CPS_PPCALLN(PARGS,SUMMER,3,A,TOTSUM,SUMSEM)
IF (IERR .LT. 0) PRINT*, "PPCALL FAILED"
IERR = M_FREE32(SUMSEM)


IF (IERR .LT. 0) PRINT*, "SEMAPHORE FREE FAILED"
.
. ! SERIAL CODE
.
END
      SUBROUTINE SUMMER(A,TOTSUM,SUMSEM)
INTEGER STID, NTHR, SUMSEM
INTEGER CPS_STID, CPS_NSTHREADS, M_LOCK, M_UNLOCK
REAL A(1000),TOTSUM


C$DIR SYNC_ROUTINE(CPS_STID, CPS_NSTHREADS, M_LOCK, M_UNLOCK)
STID = CPS_STID() ! GET MY STID
NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNED
ITPERPROC = 1000/NTHR ! COMPUTE ITERATIONS PER THREAD
IEXCESS = 1000-ITPERPROC*NTHR ! COMPUTE EXCESS ITERATIONS
C COMPUTE LOOP START AND END FOR CASES OF NO EXCESS OR FOR THREADS
C THAT DO NOT HANDLE EXCESS:
IF(STID .GE. IEXCESS) THEN
MYSTART = STID*ITPERROC + IEXCESS + 1
MYEND = MYSTART + ITPERPROC - 1
C COMPUTE LOOP START AND END FOR THREADS THAT HANDLE EXCESS:
ELSE
MYSTART = STID * (ITPERPROC+1) + 1
MYEND = MYSTART + (ITPERPROC+1) - 1
ENDIF
C ACTUAL COMPUTATION:
DO J = MYSTART, MYEND
C SEMAPHORE LIMITS ACCESS TO TOTSUM TO ONE THREAD AT A TIME:
IERR = M_LOCK(SUMSEM)
IF (IERR .LT. 0) PRINT*, "SEMAPHORE LOCK FAILED"
TOTSUM = TOTSUM + A(J)
IERR = M_UNLOCK(SUMSEM)
IF(IERR .LT. 0) PRINT*, "SEMAPHORE UNLOCK FAILED"
ENDDO
RETURN
END

Here as in the mutex example, SUMMER is called in parallel. Each parallel thread then waits until it can lock the memory-based semaphore SUMSEM before updating TOTSUM.

Ordered sections

Semaphores can also be used to construct ordered sections such as those constructed using the loop_parallel(ordered), begin_ordered_section and end_ordered_section directives and pragmas, which are described in Chapter 6, "Chapter 6 “Advanced shared-memory programming”."

The parallel loop in the following Fortran example contains a backward LCD, which is isolated using low-level synchronization functions so that the threads must execute the LCD in iteration order.

      PROGRAM ORDERED ! DEMONSTRATES ORDERED SECTIONS USING CPS 
! LOW LEVEL SYNCHRONIZATION
REAL X(1000), Y(1000)
INTEGER PARGS(4), CPS_PPCALLN, NTHR, CPS_NODE_CPUS
INTEGER M_INIT32, M_FREE32
C$DIR SYNC_ROUTINE(CPS_PPCALLN, CPS_NODE_CPUS, M_INIT32, M_FREE32)
INTEGER ORDSEM, SEMCNT
EXTERNAL ORDWORK
PARGS(1) = -2 ! ALLOCATE THREADS CALLING THREAD"S NODE
PARGS(2) = 2 ! MINIMUM OF 2 THREADS
PARGS(3) = CPS_NODE_CPUS() ! MAXIMUM OF NPROCS THREADS
PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER HYPERNODE
SEMCNT = 0
.
. ! SERIAL CODE
.
IERR = M_INIT32(ORDSEM,SEMCNT) ! ALLOCATE ORDSEM
IF (IERR .LT. 0) PRINT*, "SEMAPHORE ALLOCATION FAILED"
C SPAWN THREADS:
ITHREAD = CPS_PPCALLN(PARGS,ORDWORK,5,X,Y,NTHR,ORDSEM,SEMCNT)
IF (ITHREAD .LT. 0) PRINT *,"PPCALLN FAILED"
IERR = M_FREE32(ORDSEM)
IF (IERR .LT. 0) PRINT*, "SEMAPHORE FREE FAILED"
.
. ! SERIAL CODE
.
END
      SUBROUTINE ORDWORK (X, Y, NTHR,ORDSEM,SEMCNT)
REAL X(1000), Y(1000)
INTEGER CPS_NSTHREADS, CPS_STID, M_FETCH32
INTEGER ORDSEM,SEMCNT,STID, NTHR, CNTVAL
INTEGER M_FETCH_AND_INC32, M_FETCH_AND_CLEAR32
C$DIR SYNC_ROUTINE(CPS_NSTHREADS, CPS_STID, M_FETCH32)
C$DIR SYNC_ROUTINE(M_FETCH_AND_INC32, M_FETCH_AND_CLEAR32)
STID = CPS_STID() ! GET MY STID
NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNEDC ACTUAL COMPUTATION:
DO J = 2+STID, 1000, NTHR ! CYCLIC DECOMPOSITION
.
. ! DEPENDENCE-FREE PARALLEL CODE
.
10 CNTVAL = M_FETCH32(ORDSEM) ! GET SEMAPHORE COUNTER VALUE
IF(CNTVAL .EQ. STID) THEN ! IF IT"S MY STID"S TURN
C PERFORM LCD COMPUTATION:
X(J) = X(J-1) + Y(J)
IF(CNTVAL .GE. NTHR-1) THEN ! HIGHEST STID
IERR = M_FETCH_AND_CLEAR32(ORDSEM) ! RESETS COUNTER
IF(IERR .LT. 0) PRINT*, "FETCH-CLEAR FAILED"
ELSE ! ALL OTHER STIDS INCREMENT COUNTER:
IERR = M_FETCH_AND_INC32(ORDSEM)
IF(IERR .LT. 0) PRINT*, "FETCH-INC FAILED"
ENDIF
ELSE
GOTO 10 ! LOOP AND TRY AGAIN IF CNTVAL .NE. STID
ENDIF
ENDDO
RETURN
END

This example uses a cyclic decomposition in the parallel J loop because, by definition, ordered sections must be executed in iteration order, and this is impossible using a block decomposition.

As in the example in the "“Cyclic parallelism”" section, here the starting index is offset according to spawn thread ID and the loop steps by the number of parallel threads. This ensures that each thread computes a unique array element on every step of the loop; NTHR elements are computed per step. Contiguous STIDs compute contiguous elements.

The loop is ordered by the first IF statement in the loop, which only allows the body of the loop (including the LCD) to execute if the counter associated with the semaphore ORDSEM is equal to the current STID. This counter is incremented (or reset when the highest STID is reached) in the body of the loop, forcing the threads to execute in iteration order. The counter associated with ORDSEM controls access to the LCD; no explicit semaphore lock is needed.

Substantial nonordered work must be present in this loop to make the overhead of the ordered section worthwhile. Assuming this condition is met, once all the threads pass through the ordered section once, their execution of the nonordered code will be staggered such that they will stay busy while they are outside the ordered code.

The ordered parallelism described here is similar to that achieved through use of compiler directives in the "“Ordered sections”" section of Chapter 6, "Chapter 6 “Advanced shared-memory programming”."

Printable version
Privacy statement Using this site means you accept its terms Feedback to webmaster
© Hewlett-Packard Development Company, L.P.