 |
» |
|
|
 |
The examples presented here demonstrate various constructs
that can be programmed using the CPSlib functions described in the
previous sections. Symmetric parallelism |  |
There are two forms of
symmetric parallelism: block parallelism, and cyclic parallelism.
The CPSlib functions used in the examples that follow are described
in detail in the section “CPS library functions”. Block parallelism
is
the most commonly used form of parallelism;
it is the form generated by default by Exemplar compilers. It involves
splitting up the iterations of a loop into iteration blocks of similar
size, and running each block on a separate processor. A simple Fortran example that uses CPSlib to implement block
parallelism follows. The CPSlib functions used here are described
in detail in the "“CPS library functions”"
section.  |
PROGRAM CPSBLOCK REAL X(1000), Y(1000), Z(1000) INTEGER PARGS(4), CPS_PPCALLN, NTHR, CPS_NODE_CPUS C$DIR SYNC_ROUTINE(CPS_PPCALLN, CPS_NODE_CPUS) EXTERNAL PARBLK ! REQUIRED BECAUSE PARBLK IS AN ARGUMENT C INITIALIZE PARGS ARRAY: PARGS(1) = -2 ! ALLOCATE THREADS ON CALLING THREAD'S NODE PARGS(2) = 2 ! MINIMUM OF 2 THREADS PARGS(3) = CPS_NODE_CPUS() ! MAXIMUM # OF THREADS PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER HYPERNODE C SPAWN THREADS: ITHREAD = CPS_PPCALLN(PARGS, PARBLK, 4, X, Y, Z, NTHR) C IF SPAWN FAILS, REPORT: IF (ITHREAD .LT. 0) PRINT *,'PPCALLN FAILED' . . ! SERIAL CODE . END SUBROUTINE PARBLK (X, Y, Z, NTHR) REAL X(1000), Y(1000), Z(1000) INTEGER CPS_NSTHREADS, CPS_STID, STID, NTHR C$DIR SYNC_ROUTINE(CPS_NSTHREADS, CPS_STID) STID = CPS_STID() ! GET MY STID NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNED ITPERPROC = 1000/NTHR ! COMPUTE ITERATIONS PER THREAD IEXCESS = 1000-ITPERPROC*NTHR ! COMPUTE EXCESS ITERATIONS C COMPUTE LOOP START AND END FOR CASES OF NO EXCESS OR FOR THREADS C THAT DO NOT HANDLE EXCESS: IF(STID .GE. IEXCESS) THEN MYSTART = STID*ITPERROC + IEXCESS + 1 MYEND = MYSTART + ITPERPROC - 1 C COMPUTE LOOP START AND END FOR THREADS THAT HANDLE EXCESS: ELSE MYSTART = STID * (ITPERPROC+1) + 1 MYEND = MYSTART + (ITPERPROC+1) - 1 ENDIF C ACTUAL COMPUTATION: DO J = MYSTART, MYEND Z(J) = X(J) + Y(J) ENDDO RETURN END |
 |
This example calls CPS_NODE_CPUS
to find the number of available threads, then calls CPS_PPCALLN
to spawn parallel threads to run the subroutine PARBLK. PARBLK then determines the
number of iterations necessary per processor; if the number of processors
does not integrally divide the number of iterations, it automatically
adjusts some blocks to handle the extra iterations. Finally, the
loop in PARBLK performs its body
in parallel, with each thread operating on the appropriate iteration
range. Note the error trap immediately after the call to CPS_PPCALLN;
this is important, as it provides the only means of knowing if the
spawn failed. Cyclic parallelism
distributes consecutive iterations of a loop to
separate processors. It is similar to the parallelism achieved through
use of the loop_parallel(ordered)
directive and pragma, but it does not order the iterations automatically;
you must handle any necessary ordering manually. loop_parallel(ordered)
is discussed in the section “loop_parallel(ordered)” in Chapter 6, "Chapter 6 “Advanced shared-memory programming”." A simple Fortran example that uses CPSlib to implement cyclic parallelism
follows. The CPSlib functions used here are described in detail
in the "“CPS library functions”"
section.  |
PROGRAM CPSCYCLE REAL X(1000), Y(1000), Z(1000), SUM INTEGER PARGS(4), CPS_PPCALLN, NTHR, CPS_NODE_CPUS C$DIR SYNC_ROUTINE(CPS_PPCALLN, CPS_NODE_CPUS) EXTERNAL PARCYC ! PARCYC IS AN ARGUMENT READ *, NPROCSC C INITIALIZE PARGS ARRAY: PARGS(1) = -2 ! ALLOCATE THREADS ON CALLING THREAD'S NODE PARGS(2) = 2 ! MINIMUM OF 2 THREADS PARGS(3) = CPS_NODE_CPUS() ! MAXIMUM # OF THREADS PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER HYPERNODE C SPAWN THREADS: ITHREAD = CPS_PPCALLN (PARGS, PARCYC, 4, X, Y, Z, NTHR) C IF SPAWN FAILS, REPORT: IF (ITHREAD .LT. 0) PRINT *,'PPCALLN FAILED' . . ! SERIAL CODE . END SUBROUTINE PARCYC (X, Y, Z, NTHR) REAL X(1000), Y(1000), Z(1000) INTEGER CPS_NSTHREADS, CPS_STID, STID, NTHR C$DIR SYNC_ROUTINE(CPS_NSTHREADS, CPS_STID) STID = CPS_STID() ! GET MY STID NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNED C ACTUAL COMPUTATION: DO J = 1+STID, 1000, NTHR ! STEP BY NUMBER OF THREADS Z(J) = X(J) + Y(J) ENDDO RETURN END |
 |
This example works exactly like the block parallelism example,
except the loop in PARCYC, its
parallel subroutine, is cyclically parallel. Cyclic parallelism
is accomplished here by offsetting the loop start value by spawn
thread ID and stepping the loop by the number of parallel threads.
This ensures that each thread computes a unique array element on
every step of the loop; NTHR elements
are computed per step. Contiguous STIDs
compute contiguous elements. Asymmetric parallelism |  |
A simple Fortran program that
implements asymmetric
parallelism follows. The CPSlib functions used here are described
in detail in the "“CPS library functions”"
section. PROGRAM ASYM REAL X1(1000), X2(1000), Y1(1000), Y2(1000), Z(1000) INTEGER CPS_THREAD_CREATE, CPS_THREAD_WAIT C$DIR SYNC_ROUTINE(CPS_THREAD_CREATE, CPS_THREAD_WAIT) COMMON /POINTS/ X1, X2, Y1, Y2 EXTERNAL DISTANCE ! DISTANCE IS AN ARGUMENT . . ! SERIAL CODE . ! EXAMPLE CONTINUED C SPAWN ASYMMETRIC THREAD TO EXECUTE SUBROUTINE DISTANCE: ITHREAD = CPS_THREAD_CREATE(-2, DISTANCE, Z) IF (ITHREAD .LT. 0) PRINT*, "THREAD_CREATE FAILED IN MAIN" . . ! THIS CODE RUNS IN PARALLEL WITH DISTANCE . IWAIT = CPS_THREAD_WAIT(1) ! WAIT FOR ALL ASYMMETRIC ! THREADS TO TERMINATE IF(IWAIT .LT. 0) PRINT*, "CPS_THREAD_WAIT FAILED" . . ! THIS CODE RUNS SERIALLY AFTER PARALLEL THREADS . ! TERMINATE END |
 |
 |
SUBROUTINE DISTANCE(Z) REAL X1(1000), X2(1000), Y1(1000), Y2(1000), Z(1000) REAL X3(1000), Y3(1000) INTEGER CPS_THREAD_CREATE, CPS_THREAD_WAIT C$DIR SYNC_ROUTINE(CPS_THREAD_CREATE, CPS_THREAD_WAIT) COMMON /POINTS/ X1, X2, Y1, Y2 EXTERNAL FINDX ! FINDX IS AN ARGUMENT C SPAWN ASYMMETRIC THREAD TO EXECUTE SUBROUTINE FINDX: JTHREAD = CPS_THREAD_CREATE(-2, FINDX, X3) IF (JTHREAD .LT. 0) PRINT*, "THREAD_CREATE FAILED IN DISTANCE" DO I = 1, 1000 ! COMPUTE Y3 IN PARALLEL WITH FINDX Y3(I) = (Y2(I) - Y1(I))**2 ENDDO 10 IWAIT = CPS_THREAD_WAIT(0) ! FIND NUMBER OF ASYM THREADS IF(IWAIT .LT. 0) PRINT*, "CPS_THREAD_WAIT FAILED" IF(IWAIT .GT. 1) GOTO 10 ! SPIN UNTIL ONLY THIS THREAD ! IS ACTIVE DO I = 1, 1000 ! COMPUTE Z SERIALLY AFTER X3 AND Y3 Z(I) = SQRT(X3(I) + Y3(I)) ENDDO RETURN END SUBROUTINE FINDX(X3) ! RUNS IN PARALLEL WITH COMPUTATION ! OF Y3 REAL X1(1000), X2(1000), Y1(1000), Y2(1000), X3(1000) COMMON /POINTS/ X1, X2, Y1, Y2 DO I = 1, 1000 X3(I) = (X2(I) - X1(I))**2 ENDDO RETURN END |
 |
In this example, the arrays X3
and Y3 must be computed before
the array Z can be computed. The
main program spawns an asymmetric parallel thread to run DISTANCE,
which spawns an asymmetric thread to run FINDX.
DISTANCE then computes Y3
while FINDX computes X3;
all the while, the main program can be doing other work in parallel
with both subroutines. When DISTANCE
is done with Y3, it waits until
FINDX is done with X3,
then computes Z. The main program
waits until DISTANCE is done, then
proceeds with more work. Note the way in which CPS_THREAD_WAIT
is used when called from DISTANCE;
this is explained further in the "“CPS library functions”" section. Synchronization using high-level functions |  |
This section demonstrates how to use barriers and
mutexes to synchronize symmetrically parallel code. Remember
that, when you use cps_ppcall()
to spawn symmetric parallelism, before the function returns, a join
operation takes place after all spawned threads terminate. This
join is an
implicit barrier, since thread 0 cannot proceed until all parallel
threads terminate. In many cases, this is the only barrier synchronization
you will require. However, the cps_barrier()
high-level synchronization functions allow you to explicitly create
barriers if necessary. The following Fortran example is similar to the symmetric
parallelism example in the section “Block parallelism” except that instead of relying on the
implicit barrier contained in the call to CPS_PPCALL(),
it contains an explicit CPS_BARRIER()
in the subroutine SUMMER.  |
PROGRAM BAR REAL A(1000) REAL SUM(:), TOTSUM INTEGER PARGS(4), SUMBAR, CPS_NODE_CPUS, CPS_PPCALLN INTEGER CPS_BARRIER_ALLOC, CPS_BARRIER_FREE C$DIR SYNC_ROUTINE(CPS_BARRIER_ALLOC,CPS_BARRIER_FREE) C$DIR SYNC_ROUTINE(CPS_NODE_CPUS,CPS_PPCALLN) EXTERNAL SUMMER ALLOCATABLE SUM NCPUS = CPS_NODE_CPUS() ALLOCATE(SUM(0:NCPUS-1)) ! ONE ELEMENT FOR EACH CPU PARGS(1) = -2 ! ALLOCATE THREADS ON CALLING THREAD'S NODE PARGS(2) = 2 ! MINIMUM OF 2 THREADS PER NODE PARGS(3) = NCPUS ! MAXIMUM # OF THREADS PER NODE PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER NODE DO I = 0, NCPUS-1 ! INITIALIZE SUM SUM(I) = 0.0 ENDDO . . ! SERIAL CODE . IERR = CPS_BARRIER_ALLOC(SUMBAR) ! ALLOCATE BARRIER IF (IERR .LT. 0) PRINT*, "BARRIER ALLOCATION FAILED" C SPAWN PARALLEL THREADS: IERR = CPS_PPCALLN(PARGS,SUMMER,5,A,SUM,TOTSUM,SUMBAR,NCPUS) IF (IERR .LT. 0) PRINT*, "PPCALL FAILED" IERR = CPS_BARRIER_FREE(SUMBAR) ! FREE BARRIER IF (IERR .LT. 0) PRINT*, "BARRIER FREE FAILED" . . ! SERIAL CODE . END SUBROUTINE SUMMER(A,SUM,TOTSUM,SUMBAR,NCPUS) INTEGER STID, NTHR, SUMBAR INTEGER CPS_STID, CPS_NSTHREADS, CPS_BARRIER C$DIR SYNC_ROUTINE(CPS_STID, CPS_NSTHREADS, CPS_BARRIER) REAL A(1000), SUM(0:NCPUS-1), TOTSUM STID = CPS_STID() ! GET MY STID NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNED ITPERPROC = 1000/NTHR ! COMPUTE ITERATIONS PER THREAD IEXCESS = 1000-ITPERPROC*NTHR ! COMPUTE EXCESS ITERATIONS C COMPUTE LOOP START AND END FOR CASES OF NO EXCESS OR FOR THREADS C THAT DO NOT HANDLE EXCESS: IF(STID .GE. IEXCESS) THEN MYSTART = STID*ITPERROC + IEXCESS + 1 MYEND = MYSTART + ITPERPROC - 1 C COMPUTE LOOP START AND END FOR THREADS THAT HANDLE EXCESS: ELSE MYSTART = STID * (ITPERPROC+1) + 1 MYEND = MYSTART + (ITPERPROC+1) - 1 ENDIF C ACTUAL COMPUTATION: DO J = MYSTART, MYEND ! EACH THREAD COMPUTES LOCAL SUM SUM(STID) = SUM(STID) + A(J) ENDDO C WAIT UNTIL ALL THREADS ARE DONE COMPUTING THEIR PORTION OF SUM: IERR = CPS_BARRIER(SUMBAR, NTHR) IF (IERR .LT. 0) PRINT*, "BARRIER FAILED" IF(STID .EQ. 0) THEN ! THREAD 0 COMPUTES TOTAL SUM DO I = 0, NTHR-1 TOTSUM = TOTSUM + SUM(I) ENDDO ENDIF RETURN END |
 |
Here, the subroutine SUMMER
is called in parallel to compute the sum of the elements of array
A. Each parallel thread computes
its own sum in an element of the array SUM.
The CPS_BARRIER function is used
to prevent execution of any further code until all threads have
finished computing their portion of SUM.
When CPS_BARRIER returns, thread
0 computes TOTSUM, and SUMMER
returns. CPSlib
mutexes allow you to limit access to the sections of code
they delimit to one thread at a time, allowing you to construct
critical sections similar to those discussed in Chapter 4, "Chapter 4 “Basic shared-memory programming”
." In the following Fortran example, the routine SUMMER
performs the same task it did in the preceding barrier example.
However, here access to the TOTSUM
computation takes place in fully parallel code; it is limited to
one thread at a time by the mutex SUMMUTEX.
This eliminates the need for each thread to compute independent
SUM arrays as in the preceding
barrier example. PROGRAM MUT REAL A(1000) REAL TOTSUM INTEGER PARGS(4), SUMMUTEX INTEGER CPS_NODE_CPUS,CPS_PPCALLN INTEGER CPS_MUTEX_ALLOC,CPS_MUTEX_FREE C$DIR SYNC_ROUTINE(CPS_NODE_CPUS,CPS_PPCALLN) C$DIR SYNC_ROUTINE(CPS_MUTEX_ALLOC,CPS_MUTEX_FREE) EXTERNAL SUMMER NCPUS = CPS_NODE_CPUS() PARGS(1) = -2 ! ALLOCATE THREADS ON CALLING THREAD'S NODE PARGS(2) = 2 ! MINIMUM OF 2 THREADS PER NODE PARGS(3) = NCPUS ! MAXIMUM # OF THREADS PER NODE PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER NODE TOTSUM = 0.0 ! INITIALIZE TOTSUM . . ! SERIAL CODE . IERR = CPS_MUTEX_ALLOC(SUMMUTEX) ! ALLOCATE MUTEX IF (IERR .LT. 0) PRINT*, "MUTEX ALLOCATION FAILED"
|
C SPAWN PARALLEL THREADS: IERR = CPS_PPCALLN(PARGS,SUMMER,3,A,TOTSUM,SUMMUTEX) IF (IERR .LT. 0) PRINT*, "PPCALL FAILED" IERR = CPS_MUTEX_FREE(SUMMUTEX) ! FREE MUTEX IF (IERR .LT. 0) PRINT*, "MUTEX FREE FAILED" . . ! SERIAL CODE . END |
 |
 |
SUBROUTINE SUMMER(A,TOTSUM,SUMMUTEX) INTEGER STID, NTHR, SUMMUTEX INTEGER CPS_STID, CPS_NSTHREADS INTEGER CPS_MUTEX_LOCK, CPS_MUTEX_UNLOCK C$DIR SYNC_ROUTINE(CPS_STID, CPS_NSTHREADS) C$DIR SYNC_ROUTINE(CPS_MUTEX_LOCK, CPS_MUTEX_UNLOCK) REAL A(1000),TOTSUM STID = CPS_STID() ! GET MY STID NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNED ITPERPROC = 1000/NTHR ! COMPUTE ITERATIONS PER THREAD IEXCESS = 1000-ITPERPROC*NTHR ! COMPUTE EXCESS ITERATIONS C COMPUTE LOOP START AND END FOR CASES OF NO EXCESS OR FOR THREADS C THAT DO NOT HANDLE EXCESS: IF(STID .GE. IEXCESS) THEN MYSTART = STID*ITPERROC + IEXCESS + 1 MYEND = MYSTART + ITPERPROC - 1 C COMPUTE LOOP START AND END FOR THREADS THAT HANDLE EXCESS: ELSE MYSTART = STID * (ITPERPROC+1) + 1 MYEND = MYSTART + (ITPERPROC+1) - 1 ENDIF C ACTUAL COMPUTATION: DO J = MYSTART, MYEND C MUTEX LIMITS ACCESS TO TOTSUM TO ONE THREAD AT A TIME: IERR = CPS_MUTEX_LOCK(SUMMUTEX) IF (IERR .LT. 0) PRINT*, "MUTEX LOCK FAILED" TOTSUM = TOTSUM + A(J) IERR = CPS_MUTEX_UNLOCK(SUMMUTEX) IF(IERR .LT. 0) PRINT*, "MUTEX UNLOCK FAILED" ENDDO RETURN END |
 |
Here, as in the barrier example, SUMMER
is called in parallel. Each parallel thread then waits until it
can lock SUMMUTEX before updating
TOTSUM. Synchronization using low-level functions |  |
This section demonstrates how to use semaphores to
synchronize symmetrically parallel code. Critical sections
like the one in the preceding mutex example can
be implemented in a similar fashion using cache-based or memory-based
semaphores. The following Fortran example is identical to the mutex example,
but implements the critical section using a memory-based semaphore
instead of a mutex: PROGRAM SEM REAL A(1000) REAL TOTSUM INTEGER PARGS(4), SUMSEM, SEMCNT INTEGER CPS_NODE_CPUS, CPS_PPCALLN, M_INIT32, M_FREE32 C$DIR SYNC_ROUTINE(CPS_NODE_CPUS, CPS_PPCALLN, M_INIT32, M_FREE32) EXTERNAL SUMMER NCPUS = CPS_NODE_CPUS() PARGS(1) = -2 ! ALLOCATE THREADS ON CALLING THREAD'S NODE PARGS(2) = 2 ! MINIMUM OF 2 THREADS PER NODE PARGS(3) = NCPUS ! MAXIMUM # OF THREADS PER NODE PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER NODE TOTSUM = 0.0 ! INITIALIZE TOTSUM SEMCNT = 0 ! COUNTER FOR SEMAPHORE; VALUE IS IRRELEVANT . . ! SERIAL CODE . IERR = M_INIT32(SUMSEM,SEMCNT) ! ALLOCATE SUMSEM IF (IERR .LT. 0) PRINT*, "SEMAPHORE ALLOCATION FAILED" IERR = CPS_PPCALLN(PARGS,SUMMER,3,A,TOTSUM,SUMSEM) IF (IERR .LT. 0) PRINT*, "PPCALL FAILED" IERR = M_FREE32(SUMSEM) IF (IERR .LT. 0) PRINT*, "SEMAPHORE FREE FAILED" . . ! SERIAL CODE . END |
 |
 |
SUBROUTINE SUMMER(A,TOTSUM,SUMSEM) INTEGER STID, NTHR, SUMSEM INTEGER CPS_STID, CPS_NSTHREADS, M_LOCK, M_UNLOCK REAL A(1000),TOTSUM C$DIR SYNC_ROUTINE(CPS_STID, CPS_NSTHREADS, M_LOCK, M_UNLOCK) STID = CPS_STID() ! GET MY STID NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNED ITPERPROC = 1000/NTHR ! COMPUTE ITERATIONS PER THREAD IEXCESS = 1000-ITPERPROC*NTHR ! COMPUTE EXCESS ITERATIONS C COMPUTE LOOP START AND END FOR CASES OF NO EXCESS OR FOR THREADS C THAT DO NOT HANDLE EXCESS: IF(STID .GE. IEXCESS) THEN MYSTART = STID*ITPERROC + IEXCESS + 1 MYEND = MYSTART + ITPERPROC - 1 C COMPUTE LOOP START AND END FOR THREADS THAT HANDLE EXCESS: ELSE MYSTART = STID * (ITPERPROC+1) + 1 MYEND = MYSTART + (ITPERPROC+1) - 1 ENDIF C ACTUAL COMPUTATION: DO J = MYSTART, MYEND C SEMAPHORE LIMITS ACCESS TO TOTSUM TO ONE THREAD AT A TIME: IERR = M_LOCK(SUMSEM) IF (IERR .LT. 0) PRINT*, "SEMAPHORE LOCK FAILED" TOTSUM = TOTSUM + A(J) IERR = M_UNLOCK(SUMSEM) IF(IERR .LT. 0) PRINT*, "SEMAPHORE UNLOCK FAILED" ENDDO RETURN END |
 |
Here as in the mutex example, SUMMER
is called in parallel. Each parallel thread then waits until it
can lock the memory-based semaphore SUMSEM
before updating TOTSUM. Semaphores
can
also be used to construct ordered sections such as
those constructed using the loop_parallel(ordered),
begin_ordered_section and end_ordered_section
directives and pragmas, which are described in Chapter 6, "Chapter 6 “Advanced shared-memory programming”." The parallel loop in the following Fortran example contains
a backward LCD, which is isolated using low-level synchronization
functions so that the threads must execute the LCD in iteration
order.  |
PROGRAM ORDERED ! DEMONSTRATES ORDERED SECTIONS USING CPS ! LOW LEVEL SYNCHRONIZATION REAL X(1000), Y(1000) INTEGER PARGS(4), CPS_PPCALLN, NTHR, CPS_NODE_CPUS INTEGER M_INIT32, M_FREE32 C$DIR SYNC_ROUTINE(CPS_PPCALLN, CPS_NODE_CPUS, M_INIT32, M_FREE32) INTEGER ORDSEM, SEMCNT EXTERNAL ORDWORK PARGS(1) = -2 ! ALLOCATE THREADS CALLING THREAD"S NODE PARGS(2) = 2 ! MINIMUM OF 2 THREADS PARGS(3) = CPS_NODE_CPUS() ! MAXIMUM OF NPROCS THREADS PARGS(4) = 1 ! ALLOCATE MULTIPLE THREADS PER HYPERNODE SEMCNT = 0 . . ! SERIAL CODE . IERR = M_INIT32(ORDSEM,SEMCNT) ! ALLOCATE ORDSEM IF (IERR .LT. 0) PRINT*, "SEMAPHORE ALLOCATION FAILED" C SPAWN THREADS: ITHREAD = CPS_PPCALLN(PARGS,ORDWORK,5,X,Y,NTHR,ORDSEM,SEMCNT) IF (ITHREAD .LT. 0) PRINT *,"PPCALLN FAILED" IERR = M_FREE32(ORDSEM) IF (IERR .LT. 0) PRINT*, "SEMAPHORE FREE FAILED" . . ! SERIAL CODE . END |
 |
 |
SUBROUTINE ORDWORK (X, Y, NTHR,ORDSEM,SEMCNT) REAL X(1000), Y(1000) INTEGER CPS_NSTHREADS, CPS_STID, M_FETCH32 INTEGER ORDSEM,SEMCNT,STID, NTHR, CNTVAL INTEGER M_FETCH_AND_INC32, M_FETCH_AND_CLEAR32 C$DIR SYNC_ROUTINE(CPS_NSTHREADS, CPS_STID, M_FETCH32) C$DIR SYNC_ROUTINE(M_FETCH_AND_INC32, M_FETCH_AND_CLEAR32) STID = CPS_STID() ! GET MY STID NTHR = CPS_NSTHREADS() ! GET NUMBER OF THREADS SPAWNEDC ACTUAL COMPUTATION: DO J = 2+STID, 1000, NTHR ! CYCLIC DECOMPOSITION . . ! DEPENDENCE-FREE PARALLEL CODE . 10 CNTVAL = M_FETCH32(ORDSEM) ! GET SEMAPHORE COUNTER VALUE IF(CNTVAL .EQ. STID) THEN ! IF IT"S MY STID"S TURN C PERFORM LCD COMPUTATION: X(J) = X(J-1) + Y(J) IF(CNTVAL .GE. NTHR-1) THEN ! HIGHEST STID IERR = M_FETCH_AND_CLEAR32(ORDSEM) ! RESETS COUNTER IF(IERR .LT. 0) PRINT*, "FETCH-CLEAR FAILED" ELSE ! ALL OTHER STIDS INCREMENT COUNTER: IERR = M_FETCH_AND_INC32(ORDSEM) IF(IERR .LT. 0) PRINT*, "FETCH-INC FAILED" ENDIF ELSE GOTO 10 ! LOOP AND TRY AGAIN IF CNTVAL .NE. STID ENDIF ENDDO RETURN END |
 |
This example uses a cyclic decomposition in the parallel J
loop because, by definition, ordered sections must be executed in
iteration order, and this is impossible using a block decomposition. As in the example in the "“Cyclic parallelism”" section, here the starting index
is offset according to spawn thread ID and the loop steps by the
number of parallel threads. This ensures that each thread computes
a unique array element on every step of the loop; NTHR
elements are computed per step. Contiguous STIDs
compute contiguous elements. The loop is ordered by the first IF
statement in the loop, which only allows the body of the loop (including
the LCD) to execute if the counter associated with the semaphore
ORDSEM is equal to the current
STID. This counter is incremented
(or reset when the highest STID
is reached) in the body of the loop, forcing the threads to execute
in iteration order. The counter associated with ORDSEM
controls access to the LCD; no explicit semaphore lock is needed. Substantial nonordered work must be present in this loop to
make the overhead of the ordered section worthwhile. Assuming this
condition is met, once all the threads pass through the ordered
section once, their execution of the nonordered code will be staggered
such that they will stay busy while they are outside the ordered
code. The ordered parallelism described here is similar to that
achieved through use of compiler directives in the "“Ordered sections”" section of
Chapter 6, "Chapter 6 “Advanced shared-memory programming”."
|