Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP
More options
HP.com home
HP-MPI User's Guide > Appendix A Example applications

multi_par.f

» 

Technical documentation

Complete book in PDF
» Feedback
Content starts here

 » Table of Contents

 » Glossary

 » Index

The Alternating Direction Iterative (ADI) method is often used to solve differential equations. In this example, multi_par.f, a compiler that supports OPENMP directives is required in order to achieve multi-level parallelism. multi_par.f implements the following logic for a 2-dimensional compute region:

      DO J=1,JMAX
DO I=2,IMAX
A(I,J)=A(I,J)+A(I-1,J)
ENDDO
ENDDO

      DO J=2,JMAX
DO I=1,IMAX
A(I,J)=A(I,J)+A(I,J-1)
ENDDO
ENDDO

There are loop carried dependencies on the first dimension (array's row) in the first innermost DO loop and the second dimension (array's column) in the second outermost DO loop.

A simple method for parallelizing the fist outer-loop implies a partitioning of the array in column blocks, while another for the second outer-loop implies a partitioning of the array in row blocks.

With message-passing programming, such a method will require massive data exchange among processes because of the partitioning change. "Twisted data layout" partitioning is better in this case because the partitioning used for the parallelization of the first outer-loop can accommodate the other of the second outer-loop. The partitioning of the array is shown in Figure A-1 “Array partitioning”.

Figure A-1 Array partitioning

Array partitioning

In this sample program, the rank n process is assigned to the partition n at distribution initialization. Because these partitions are not
contiguous-memory regions, MPI's derived datatype is used to define the partition layout to the MPI system.

Each process starts with computing summations in row-wise fashion. For example, the rank 2 process starts with the block that is on the
0th-row block and 2nd-column block (denoted as [0,2]).

The block computed in the second step is [1,3]. Computing the first row elements in this block requires the last row elements in the [0,3] block
(computed in the first step in the rank 3 process). Thus, the rank 2 process receives the data from the rank 3 process at the beginning of the second step. Note that the rank 2 process also sends the last row elements of the [0,2] block to the rank 1 process that computes [1,2] in the second step. By repeating these steps, all processes finish summations in row-wise fashion (the first outer-loop in the illustrated program).

The second outer-loop (the summations in column-wise fashion) is done in the same manner. For example, at the beginning of the second step for the column-wise summations, the rank 2 process receives data from the rank 1 process that computed the [3,0] block. The rank 2 process also sends the last column of the [2,0] block to the rank 3 process. Note that each process keeps the same blocks for both of the outer-loop computations.

This approach is good for distributed memory architectures on which repartitioning requires massive data communications that are expensive. However, on shared memory architectures, the partitioning of the compute region does not imply data distribution. The row- and
column-block partitioning method requires just one synchronization at the end of each outer loop.

For distributed shared-memory architectures, the mix of the two methods can be effective. The sample program implements the twisted-data layout method with MPI and the row- and column-block partitioning method with OPENMP thread directives. In the first case, the data dependency is easily satisfied as each thread computes down a different set of columns. In the second case we still want to compute down the columns for cache reasons, but to satisfy the data dependency, each thread computes a different portion of the same column and the threads work left to right across the rows together.

      implicit none
      include 'mpif.h'
      integer nrow ! # of rows
      integer ncol ! # of columns
      parameter(nrow=1000,ncol=1000)
      double precision array(nrow,ncol) ! compute region
      integer blk ! block iteration counter
      integer rb ! row block number
      integer cb ! column block number
      integer nrb ! next row block number
integer ncb ! next column block number
integer rbs(:) ! row block start subscripts
integer rbe(:) ! row block end subscripts
integer cbs(:) ! column block start subscripts
integer cbe(:) ! column block end subscripts
integer rdtype(:) ! row block communication datatypes
integer cdtype(:) ! column block communication datatypes
integer twdtype(:) ! twisted distribution datatypes
integer ablen(:) ! array of block lengths
integer adisp(:) ! array of displacements
integer adtype(:) ! array of datatypes
allocatable rbs,rbe,cbs,cbe,rdtype,cdtype,twdtype,ablen,adisp,
* adtype
integer rank ! rank iteration counter
integer comm_size ! number of MPI processes
integer comm_rank ! sequential ID of MPI process
integer ierr ! MPI error code
integer mstat(mpi_status_size) ! MPI function status
integer src ! source rank
integer dest ! destination rank
integer dsize ! size of double precision in bytes
double precision startt,endt,elapsed ! time keepers
external compcolumn,comprow ! subroutines execute in threads

c
c MPI initialization
c
call mpi_init(ierr)
call mpi_comm_size(mpi_comm_world,comm_size,ierr)
call mpi_comm_rank(mpi_comm_world,comm_rank,ierr)

c
c Data initialization and start up
c
if (comm_rank.eq.0) then
write(6,*) 'Initializing',nrow,' x',ncol,' array...'
call getdata(nrow,ncol,array)
write(6,*) 'Start computation'
endif
call mpi_barrier(MPI_COMM_WORLD,ierr)
startt=mpi_wtime()
c
c Compose MPI datatypes for row/column send-receive
c
c Note that the numbers from rbs(i) to rbe(i) are the indices
c of the rows belonging to the i'th block of rows. These indices
c specify a portion (the i'th portion) of a column and the
c datatype rdtype(i) is created as an MPI contiguous datatype
c to refer to the i'th portion of a column. Note this is a
c contiguous datatype because fortran arrays are stored
c column-wise.
c
c For a range of columns to specify portions of rows, the situation
c is similar: the numbers from cbs(j) to cbe(j) are the indices
c of the columns belonging to the j'th block of columns. These
c indices specify a portion (the j'th portion) of a row, and the
c datatype cdtype(j) is created as an MPI vector datatype to refer
c to the j'th portion of a row. Note this a vector datatype
c because adjacent elements in a row are actually spaced nrow
c elements apart in memory.
c
allocate(rbs(0:comm_size-1),rbe(0:comm_size-1),cbs(0:comm_size-1),
* cbe(0:comm_size-1),rdtype(0:comm_size-1),
* cdtype(0:comm_size-1),twdtype(0:comm_size-1))
do blk=0,comm_size-1
call blockasgn(1,nrow,comm_size,blk,rbs(blk),rbe(blk))
call mpi_type_contiguous(rbe(blk)-rbs(blk)+1,
* mpi_double_precision,rdtype(blk),ierr)
call mpi_type_commit(rdtype(blk),ierr)
call blockasgn(1,ncol,comm_size,blk,cbs(blk),cbe(blk))
call mpi_type_vector(cbe(blk)-cbs(blk)+1,1,nrow,
* mpi_double_precision,cdtype(blk),ierr)
call mpi_type_commit(cdtype(blk),ierr)
enddo



c Compose MPI datatypes for gather/scatter
c
c Each block of the partitioning is defined as a set of fixed length
c vectors. Each process'es partition is defined as a struct of such
c blocks.
c
allocate(adtype(0:comm_size-1),adisp(0:comm_size-1),
* ablen(0:comm_size-1))
call mpi_type_extent(mpi_double_precision,dsize,ierr)
do rank=0,comm_size-1
do rb=0,comm_size-1
cb=mod(rb+rank,comm_size)
call mpi_type_vector(cbe(cb)-cbs(cb)+1,rbe(rb)-rbs(rb)+1,
* nrow,mpi_double_precision,adtype(rb),ierr)
call mpi_type_commit(adtype(rb),ierr)
adisp(rb)=((rbs(rb)-1)+(cbs(cb)-1)*nrow)*dsize
ablen(rb)=1
enddo
call mpi_type_struct(comm_size,ablen,adisp,adtype,
* twdtype(rank),ierr)
call mpi_type_commit(twdtype(rank),ierr)
do rb=0,comm_size-1
call mpi_type_free(adtype(rb),ierr)
enddo
enddo
deallocate(adtype,adisp,ablen)

c Scatter initial data with using derived datatypes defined above
c for the partitioning. MPI_send() and MPI_recv() will find out the
c layout of the data from those datatypes. This saves application
c programs to manually pack/unpack the data, and more importantly,
c gives opportunities to the MPI system for optimal communication
c strategies.
c
if (comm_rank.eq.0) then
do dest=1,comm_size-1
call mpi_send(array,1,twdtype(dest),dest,0,mpi_comm_world,
* ierr)
enddo
else
call mpi_recv(array,1,twdtype(comm_rank),0,0,mpi_comm_world,
* mstat,ierr)
endif

c
c Computation
c
c Sum up in each column.
c Each MPI process, or a rank, computes blocks that it is assigned.
c The column block number is assigned in the variable 'cb'. The
c starting and ending subscripts of the column block 'cb' are
c stored in 'cbs(cb)' and 'cbe(cb)', respectively. The row block
c number is assigned in the variable 'rb'. The starting and ending
c subscripts of the row block 'rb' are stored in 'rbs(rb)' and
c 'rbe(rb)', respectively, as well.
src=mod(comm_rank+1,comm_size)
dest=mod(comm_rank-1+comm_size,comm_size)
ncb=comm_rank
do rb=0,comm_size-1
cb=ncb
c
c Compute a block. The function will go thread-parallel if the
c compiler supports OPENMP directives.
c
call compcolumn(nrow,ncol,array,
* rbs(rb),rbe(rb),cbs(cb),cbe(cb))
if (rb.lt.comm_size-1) then
c
c Send the last row of the block to the rank that is to compute the
c block next to the computed block. Receive the last row of the
c block that the next block being computed depends on.
c
nrb=rb+1
ncb=mod(nrb+comm_rank,comm_size)
call mpi_sendrecv(array(rbe(rb),cbs(cb)),1,cdtype(cb),dest,
* 0,array(rbs(nrb)-1,cbs(ncb)),1,cdtype(ncb),src,0,
* mpi_comm_world,mstat,ierr)
endif
enddo
c
c Sum up in each row.
c The same logic as the loop above except rows and columns are
c switched.
c
src=mod(comm_rank-1+comm_size,comm_size)
dest=mod(comm_rank+1,comm_size)
do cb=0,comm_size-1
rb=mod(cb-comm_rank+comm_size,comm_size)
call comprow(nrow,ncol,array,
* rbs(rb),rbe(rb),cbs(cb),cbe(cb))
if (cb.lt.comm_size-1) then
ncb=cb+1
nrb=mod(ncb-comm_rank+comm_size,comm_size)
call mpi_sendrecv(array(rbs(rb),cbe(cb)),1,rdtype(rb),dest,
* 0,array(rbs(nrb),cbs(ncb)-1),1,rdtype(nrb),src,0,
* mpi_comm_world,mstat,ierr)
endif
enddo
c
c Gather computation results
c
call mpi_barrier(MPI_COMM_WORLD,ierr)
endt=mpi_wtime()

if (comm_rank.eq.0) then
do src=1,comm_size-1
call mpi_recv(array,1,twdtype(src),src,0,mpi_comm_world,
* mstat,ierr)
enddo

elapsed=endt-startt
write(6,*) 'Computation took',elapsed,' seconds'
else
call mpi_send(array,1,twdtype(comm_rank),0,0,mpi_comm_world,
* ierr)
endif
c
c Dump to a file
c
c if (comm_rank.eq.0) then
c print*,'Dumping to adi.out...'
c open(8,file='adi.out')
c write(8,*) array
c close(8,status='keep')
c endif
c
c Free the resources
c
do rank=0,comm_size-1
call mpi_type_free(twdtype(rank),ierr)
enddo
do blk=0,comm_size-1
call mpi_type_free(rdtype(blk),ierr)
call mpi_type_free(cdtype(blk),ierr)
enddo
deallocate(rbs,rbe,cbs,cbe,rdtype,cdtype,twdtype)
c
c Finalize the MPI system
c
call mpi_finalize(ierr)
end
c**********************************************************************
subroutine blockasgn(subs,sube,blockcnt,nth,blocks,blocke)
c
c This subroutine:
c is given a range of subscript and the total number of blocks in
c which the range is to be divided, assigns a subrange to the caller
c that is n-th member of the blocks.
c
implicit none
integer subs ! (in) subscript start
integer sube ! (in) subscript end
integer blockcnt ! (in) block count
integer nth ! (in) my block (begin from 0)
integer blocks ! (out) assigned block start subscript
integer blocke ! (out) assigned block end subscript
c
integer d1,m1
c
d1=(sube-subs+1)/blockcnt
m1=mod(sube-subs+1,blockcnt)
blocks=nth*d1+subs+min(nth,m1)
blocke=blocks+d1-1
if(m1.gt.nth)blocke=blocke+1
      end
c
c**********************************************************************
subroutine compcolumn(nrow,ncol,array,rbs,rbe,cbs,cbe)
c
c This subroutine:
c does summations of columns in a thread.
c
implicit none

integer nrow ! # of rows
integer ncol ! # of columns
double precision array(nrow,ncol) ! compute region
integer rbs ! row block start subscript
integer rbe ! row block end subscript
integer cbs ! column block start subscript
integer cbe ! column block end subscript

c
c Local variables
c
integer i,j

c
c The OPENMP directive below allows the compiler to split the
c values for "j" between a number of threads. By making i and j
c private, each thread works on its own range of columns "j",
c and works down each column at its own pace "i".
c
c Note no data dependency problems arise by having the threads all
c working on different columns simultaneously.
c

C$OMP PARALLEL DO PRIVATE(i,j)
do j=cbs,cbe
do i=max(2,rbs),rbe
array(i,j)=array(i-1,j)+array(i,j)
enddo
enddo
C$OMP END PARALLEL DO
end

c**********************************************************************
subroutine comprow(nrow,ncol,array,rbs,rbe,cbs,cbe)
c
c This subroutine:
c does summations of rows in a thread.
c
implicit none

integer nrow ! # of rows
integer ncol ! # of columns
double precision array(nrow,ncol) ! compute region
integer rbs ! row block start subscript
integer rbe ! row block end subscript
integer cbs ! column block start subscript
integer cbe ! column block end subscript

c
c Local variables
c
integer i,j

c
c The OPENMP directives below allow the compiler to split the
c values for "i" between a number of threads, while "j" moves
c forward lock-step between the threads. By making j shared
c and i private, all the threads work on the same column "j" at
c any given time, but they each work on a different portion "i"
c of that column.
c
c This is not as efficient as found in the compcolumn subroutine,
c but is necessary due to data dependencies.
c


C$OMP PARALLEL PRIVATE(i)
do j=max(2,cbs),cbe
C$OMP DO
do i=rbs,rbe
array(i,j)=array(i,j-1)+array(i,j)
enddo
C$OMP END DO
enddo
C$OMP END PARALLEL

end
c
c**********************************************************************
       subroutine getdata(nrow,ncol,array)
c
c Enter dummy data
c
integer nrow,ncol
double precision array(nrow,ncol)
c
do j=1,ncol
do i=1,nrow
array(i,j)=(j-1.0)*ncol+i
enddo
enddo
end

multi_par.f output

The output from running the multi_par.f executable is shown below. The application was run with -np1.

Initializing 1000  x 1000  array...
Start computation
Computation took 4.088211059570312E-02 seconds
Printable version
Privacy statement Using this site means you accept its terms Feedback to webmaster
© 1979-2007 Hewlett-Packard Development Company, L.P.