 |
» |
|
|
 |
The Alternating Direction Iterative (ADI) method is often
used to solve differential equations. In this example, multi_par.f,
a compiler that supports OPENMP directives is required in order
to achieve multi-level parallelism. multi_par.f implements the following
logic for a 2-dimensional compute region: DO J=1,JMAX DO I=2,IMAX A(I,J)=A(I,J)+A(I-1,J) ENDDO ENDDO DO J=2,JMAX DO I=1,IMAX A(I,J)=A(I,J)+A(I,J-1) ENDDO ENDDO |
There are loop carried dependencies on the first dimension
(array's row) in the first innermost DO loop and the second dimension
(array's column) in the second outermost DO loop. A simple method for parallelizing the fist outer-loop implies
a partitioning of the array in column blocks, while another for
the second outer-loop implies a partitioning of the array in row
blocks. With message-passing programming, such a method will require
massive data exchange among processes because of the partitioning
change. "Twisted data layout" partitioning is better in this case
because the partitioning used for the parallelization of the first
outer-loop can accommodate the other of the second outer-loop. The
partitioning of the array is shown in Figure A-1 “Array
partitioning”. In this sample program, the rank n process is assigned to the partition n at distribution initialization. Because these partitions
are not contiguous-memory regions, MPI's derived datatype
is used to define the partition layout to the MPI system. Each process starts with computing summations in row-wise
fashion. For example, the rank 2 process starts with the block that
is on the 0th-row block and 2nd-column block (denoted
as [0,2]). The block computed in the second step is [1,3]. Computing
the first row elements in this block requires the last row elements
in the [0,3] block (computed in the first step in the
rank 3 process). Thus, the rank 2 process receives the data from
the rank 3 process at the beginning of the second step. Note that
the rank 2 process also sends the last row elements of the [0,2]
block to the rank 1 process that computes [1,2] in the second step.
By repeating these steps, all processes finish summations in row-wise
fashion (the first outer-loop in the illustrated program). The second outer-loop (the summations in column-wise fashion)
is done in the same manner. For example, at the beginning of the
second step for the column-wise summations, the rank 2 process receives
data from the rank 1 process that computed the [3,0] block. The
rank 2 process also sends the last column of the [2,0] block to
the rank 3 process. Note that each process keeps the same blocks
for both of the outer-loop computations. This approach is good for distributed memory architectures
on which repartitioning requires massive data communications that
are expensive. However, on shared memory architectures, the partitioning
of the compute region does not imply data distribution. The row-
and column-block partitioning method requires just one
synchronization at the end of each outer loop. For distributed shared-memory architectures, the mix of the
two methods can be effective. The sample program implements the twisted-data
layout method with MPI and the row- and column-block partitioning
method with OPENMP
thread directives. In the first case, the data dependency is easily
satisfied as each thread computes down a different set of columns.
In the second case we still want to compute down the columns for
cache reasons, but to satisfy the data dependency, each thread computes
a different portion of the same column and the threads work left
to right across the rows together.  |
implicit none include 'mpif.h' integer nrow ! # of rows integer ncol ! # of columns parameter(nrow=1000,ncol=1000) double precision array(nrow,ncol) ! compute region integer blk ! block iteration counter integer rb ! row block number integer cb ! column block number integer nrb ! next row block number integer ncb ! next column block number integer rbs(:) ! row block start subscripts integer rbe(:) ! row block end subscripts integer cbs(:) ! column block start subscripts integer cbe(:) ! column block end subscripts integer rdtype(:) ! row block communication datatypes integer cdtype(:) ! column block communication datatypes integer twdtype(:) ! twisted distribution datatypes integer ablen(:) ! array of block lengths integer adisp(:) ! array of displacements integer adtype(:) ! array of datatypes allocatable rbs,rbe,cbs,cbe,rdtype,cdtype,twdtype,ablen,adisp, * adtype integer rank ! rank iteration counter integer comm_size ! number of MPI processes integer comm_rank ! sequential ID of MPI process integer ierr ! MPI error code integer mstat(mpi_status_size) ! MPI function status integer src ! source rank integer dest ! destination rank integer dsize ! size of double precision in bytes double precision startt,endt,elapsed ! time keepers external compcolumn,comprow ! subroutines execute in threads c c MPI initialization c call mpi_init(ierr) call mpi_comm_size(mpi_comm_world,comm_size,ierr) call mpi_comm_rank(mpi_comm_world,comm_rank,ierr) c c Data initialization and start up c if (comm_rank.eq.0) then write(6,*) 'Initializing',nrow,' x',ncol,' array...' call getdata(nrow,ncol,array) write(6,*) 'Start computation' endif call mpi_barrier(MPI_COMM_WORLD,ierr) startt=mpi_wtime() c c Compose MPI datatypes for row/column send-receive c c Note that the numbers from rbs(i) to rbe(i) are the indices c of the rows belonging to the i'th block of rows. These indices c specify a portion (the i'th portion) of a column and the c datatype rdtype(i) is created as an MPI contiguous datatype c to refer to the i'th portion of a column. Note this is a c contiguous datatype because fortran arrays are stored c column-wise. c c For a range of columns to specify portions of rows, the situation c is similar: the numbers from cbs(j) to cbe(j) are the indices c of the columns belonging to the j'th block of columns. These c indices specify a portion (the j'th portion) of a row, and the c datatype cdtype(j) is created as an MPI vector datatype to refer c to the j'th portion of a row. Note this a vector datatype c because adjacent elements in a row are actually spaced nrow c elements apart in memory. c allocate(rbs(0:comm_size-1),rbe(0:comm_size-1),cbs(0:comm_size-1), * cbe(0:comm_size-1),rdtype(0:comm_size-1), * cdtype(0:comm_size-1),twdtype(0:comm_size-1)) do blk=0,comm_size-1 call blockasgn(1,nrow,comm_size,blk,rbs(blk),rbe(blk)) call mpi_type_contiguous(rbe(blk)-rbs(blk)+1, * mpi_double_precision,rdtype(blk),ierr) call mpi_type_commit(rdtype(blk),ierr) call blockasgn(1,ncol,comm_size,blk,cbs(blk),cbe(blk)) call mpi_type_vector(cbe(blk)-cbs(blk)+1,1,nrow, * mpi_double_precision,cdtype(blk),ierr) call mpi_type_commit(cdtype(blk),ierr) enddo c Compose MPI datatypes for gather/scatter c c Each block of the partitioning is defined as a set of fixed length c vectors. Each process'es partition is defined as a struct of such c blocks. c allocate(adtype(0:comm_size-1),adisp(0:comm_size-1), * ablen(0:comm_size-1)) call mpi_type_extent(mpi_double_precision,dsize,ierr) do rank=0,comm_size-1 do rb=0,comm_size-1 cb=mod(rb+rank,comm_size) call mpi_type_vector(cbe(cb)-cbs(cb)+1,rbe(rb)-rbs(rb)+1, * nrow,mpi_double_precision,adtype(rb),ierr) call mpi_type_commit(adtype(rb),ierr) adisp(rb)=((rbs(rb)-1)+(cbs(cb)-1)*nrow)*dsize ablen(rb)=1 enddo call mpi_type_struct(comm_size,ablen,adisp,adtype, * twdtype(rank),ierr) call mpi_type_commit(twdtype(rank),ierr) do rb=0,comm_size-1 call mpi_type_free(adtype(rb),ierr) enddo enddo deallocate(adtype,adisp,ablen) c Scatter initial data with using derived datatypes defined above c for the partitioning. MPI_send() and MPI_recv() will find out the c layout of the data from those datatypes. This saves application c programs to manually pack/unpack the data, and more importantly, c gives opportunities to the MPI system for optimal communication c strategies. c if (comm_rank.eq.0) then do dest=1,comm_size-1 call mpi_send(array,1,twdtype(dest),dest,0,mpi_comm_world, * ierr) enddo else call mpi_recv(array,1,twdtype(comm_rank),0,0,mpi_comm_world, * mstat,ierr) endif c c Computation c c Sum up in each column. c Each MPI process, or a rank, computes blocks that it is assigned. c The column block number is assigned in the variable 'cb'. The c starting and ending subscripts of the column block 'cb' are c stored in 'cbs(cb)' and 'cbe(cb)', respectively. The row block c number is assigned in the variable 'rb'. The starting and ending c subscripts of the row block 'rb' are stored in 'rbs(rb)' and c 'rbe(rb)', respectively, as well. src=mod(comm_rank+1,comm_size) dest=mod(comm_rank-1+comm_size,comm_size) ncb=comm_rank do rb=0,comm_size-1 cb=ncb c c Compute a block. The function will go thread-parallel if the c compiler supports OPENMP directives. c call compcolumn(nrow,ncol,array, * rbs(rb),rbe(rb),cbs(cb),cbe(cb)) if (rb.lt.comm_size-1) then c c Send the last row of the block to the rank that is to compute the c block next to the computed block. Receive the last row of the c block that the next block being computed depends on. c nrb=rb+1 ncb=mod(nrb+comm_rank,comm_size) call mpi_sendrecv(array(rbe(rb),cbs(cb)),1,cdtype(cb),dest, * 0,array(rbs(nrb)-1,cbs(ncb)),1,cdtype(ncb),src,0, * mpi_comm_world,mstat,ierr) endif enddo c c Sum up in each row. c The same logic as the loop above except rows and columns are c switched. c src=mod(comm_rank-1+comm_size,comm_size) dest=mod(comm_rank+1,comm_size) do cb=0,comm_size-1 rb=mod(cb-comm_rank+comm_size,comm_size) call comprow(nrow,ncol,array, * rbs(rb),rbe(rb),cbs(cb),cbe(cb)) if (cb.lt.comm_size-1) then ncb=cb+1 nrb=mod(ncb-comm_rank+comm_size,comm_size) call mpi_sendrecv(array(rbs(rb),cbe(cb)),1,rdtype(rb),dest, * 0,array(rbs(nrb),cbs(ncb)-1),1,rdtype(nrb),src,0, * mpi_comm_world,mstat,ierr) endif enddo c c Gather computation results c call mpi_barrier(MPI_COMM_WORLD,ierr) endt=mpi_wtime() if (comm_rank.eq.0) then do src=1,comm_size-1 call mpi_recv(array,1,twdtype(src),src,0,mpi_comm_world, * mstat,ierr) enddo elapsed=endt-startt write(6,*) 'Computation took',elapsed,' seconds' else call mpi_send(array,1,twdtype(comm_rank),0,0,mpi_comm_world, * ierr) endif c c Dump to a file c c if (comm_rank.eq.0) then c print*,'Dumping to adi.out...' c open(8,file='adi.out') c write(8,*) array c close(8,status='keep') c endif c c Free the resources c do rank=0,comm_size-1 call mpi_type_free(twdtype(rank),ierr) enddo do blk=0,comm_size-1 call mpi_type_free(rdtype(blk),ierr) call mpi_type_free(cdtype(blk),ierr) enddo deallocate(rbs,rbe,cbs,cbe,rdtype,cdtype,twdtype) c c Finalize the MPI system c call mpi_finalize(ierr) end c********************************************************************** subroutine blockasgn(subs,sube,blockcnt,nth,blocks,blocke) c c This subroutine: c is given a range of subscript and the total number of blocks in c which the range is to be divided, assigns a subrange to the caller c that is n-th member of the blocks. c implicit none integer subs ! (in) subscript start integer sube ! (in) subscript end integer blockcnt ! (in) block count integer nth ! (in) my block (begin from 0) integer blocks ! (out) assigned block start subscript integer blocke ! (out) assigned block end subscript c integer d1,m1 c d1=(sube-subs+1)/blockcnt m1=mod(sube-subs+1,blockcnt) blocks=nth*d1+subs+min(nth,m1) blocke=blocks+d1-1 if(m1.gt.nth)blocke=blocke+1 end c c********************************************************************** subroutine compcolumn(nrow,ncol,array,rbs,rbe,cbs,cbe) c c This subroutine: c does summations of columns in a thread. c implicit none integer nrow ! # of rows integer ncol ! # of columns double precision array(nrow,ncol) ! compute region integer rbs ! row block start subscript integer rbe ! row block end subscript integer cbs ! column block start subscript integer cbe ! column block end subscript c c Local variables c integer i,j c c The OPENMP directive below allows the compiler to split the c values for "j" between a number of threads. By making i and j c private, each thread works on its own range of columns "j", c and works down each column at its own pace "i". c c Note no data dependency problems arise by having the threads all c working on different columns simultaneously. c C$OMP PARALLEL DO PRIVATE(i,j) do j=cbs,cbe do i=max(2,rbs),rbe array(i,j)=array(i-1,j)+array(i,j) enddo enddo C$OMP END PARALLEL DO end c********************************************************************** subroutine comprow(nrow,ncol,array,rbs,rbe,cbs,cbe) c c This subroutine: c does summations of rows in a thread. c implicit none integer nrow ! # of rows integer ncol ! # of columns double precision array(nrow,ncol) ! compute region integer rbs ! row block start subscript integer rbe ! row block end subscript integer cbs ! column block start subscript integer cbe ! column block end subscript c c Local variables c integer i,j c c The OPENMP directives below allow the compiler to split the c values for "i" between a number of threads, while "j" moves c forward lock-step between the threads. By making j shared c and i private, all the threads work on the same column "j" at c any given time, but they each work on a different portion "i" c of that column. c c This is not as efficient as found in the compcolumn subroutine, c but is necessary due to data dependencies. c C$OMP PARALLEL PRIVATE(i) do j=max(2,cbs),cbe C$OMP DO do i=rbs,rbe array(i,j)=array(i,j-1)+array(i,j) enddo C$OMP END DO enddo C$OMP END PARALLEL end c c********************************************************************** subroutine getdata(nrow,ncol,array) c c Enter dummy data c integer nrow,ncol double precision array(nrow,ncol) c do j=1,ncol do i=1,nrow array(i,j)=(j-1.0)*ncol+i enddo enddo end |
 |
multi_par.f
output |  |
The output from running the multi_par.f executable is shown
below. The application was run with -np1. Initializing 1000 x 1000 array... Start computation Computation took 4.088211059570312E-02 seconds |
|