Interconnect support of MPI-2 functionality
HP-MPI has been tested on InfiniBand clusters with as many
as 2048 ranks using the VAPI protocol. Most HP-MPI features function
in a scalable manner. However, a few are still subject to significant
resource growth as the job size grows.
Table 3-4 Scalability
| Feature | Affected Interconnect/Protocol | Scalability Impact |
|---|---|---|
| spawn | All | Forces use of pairwise socket connections between all mpids (typically one mpid per machine). |
| one-sided shared lock/unlock | All except VAPI and IBV | Only VAPI and IBV provide low-level calls to efficiently implement shared lock/unlock. All other interconnects require mpids to satisfy this feature. |
| one-sided exclusive lock/unlock | All except VAPI, IBV, and Elan | VAPI, IBV, and Elan provide low-level calls that allow HP-MPI to efficiently implement exclusive lock/unlock. All other interconnects require mpids to satisfy this feature. |
| one-sided other | TCP/IP | All interconnects other than TCP/IP allow HP-MPI to efficiently implement the remainder of the one-sided functionality. Only when using TCP/IP are mpids required to satisfy this feature. |
Resource usage of TCP/IP communication
HP-MPI has also been tested on large Linux TCP/IP clusters
with as many as 2048 ranks. Because each HP-MPI rank creates a socket connection
to each other remote rank, the number of socket descriptors required
increases with the number of ranks. On many Linux systems, this
requires increasing the operating system limit on per-process and system-wide
file descriptors.
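As a rough illustration of the descriptor requirement described above, the following Python sketch estimates how many sockets a single rank might open and compares that figure with the current per-process limit. It assumes that "remote rank" means a rank on a different host; the function names and the `headroom` parameter are hypothetical and are not part of HP-MPI.

```python
import resource  # Unix-only standard library module


def sockets_needed(total_ranks: int, ranks_per_host: int) -> int:
    """Estimate sockets opened by one rank: one per rank on other hosts.

    Assumption: ranks on the same host do not need a socket.
    """
    return max(total_ranks - ranks_per_host, 0)


def fd_limit_is_sufficient(needed: int, headroom: int = 64) -> bool:
    """Check the soft per-process descriptor limit against an estimate.

    `headroom` is a hypothetical allowance for stdin/stdout, files,
    and other descriptors the application opens itself.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= needed + headroom


if __name__ == "__main__":
    needed = sockets_needed(total_ranks=2048, ranks_per_host=8)
    print("estimated sockets per rank:", needed)
    print("current soft limit sufficient:", fd_limit_is_sufficient(needed))
```

This only inspects the per-process soft limit; the system-wide limit mentioned above has to be checked separately with the operating system's own tools.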
The number of sockets used by HP-MPI can be reduced on some
systems at the cost of performance by using daemon communication.
In this case, the processes on a host use shared memory to send
messages to and receive messages from the daemon. The daemon, in
turn, uses a socket connection to communicate with daemons on other
hosts. Using this option, the maximum number of sockets opened by
any HP-MPI process grows with the number of hosts used by the MPI
job rather than the number of total ranks.
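The difference in growth can be sketched with a small calculation. The Python example below is only an estimate under stated assumptions: one daemon per host, every daemon connected to every other daemon, and direct mode opening one socket per rank on another host. The function names are hypothetical.

```python
def max_sockets_direct(total_ranks: int, ranks_per_host: int) -> int:
    """Sockets per rank without daemon communication (assumed model):
    one connection to each rank located on another host."""
    return max(total_ranks - ranks_per_host, 0)


def max_sockets_daemon(num_hosts: int) -> int:
    """Sockets per daemon with daemon communication (assumed model):
    one connection to the daemon on each other host."""
    return max(num_hosts - 1, 0)


if __name__ == "__main__":
    total_ranks, ranks_per_host = 2048, 8
    num_hosts = total_ranks // ranks_per_host
    print("direct:", max_sockets_direct(total_ranks, ranks_per_host))
    print("daemon:", max_sockets_daemon(num_hosts))
```

Under these assumptions, a 2048-rank job with 8 ranks per host needs on the order of 2040 sockets per rank in direct mode, but only about 255 per daemon.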
To use daemon communication, specify the -commd option
in the mpirun command. Once you have set the -commd option,
you can use the MPI_COMMD environment variable to
specify the number of shared-memory fragments used for inbound and
outbound messages. Refer to “mpirun” and “MPI_COMMD” for
more information. Daemon communication can result in lower application performance.
Therefore, it should only be used to scale an application to a large
number of ranks when it is not possible to increase the operating system
file descriptor limits to the required values.
Resource usage of RDMA communication modes
When using InfiniBand or GM, a certain amount of memory is
pinned, which means it is locked to physical memory and cannot be
paged out. The amount of pre-pinned memory HP-MPI uses can be adjusted
using several tunables, such as MPI_RDMA_MSGSIZE, MPI_RDMA_NENVELOPE, MPI_RDMA_NSRQRECV,
and MPI_RDMA_NFRAGMENT.
By default, when the number of ranks is less than or equal
to 512, each rank pre-pins 256 KB per remote rank, so
each rank pins up to 128 MB. If the number of ranks is above 512 but
less than or equal to 1024, each rank pre-pins only 96 KB per
remote rank, so each rank pins up to 96 MB. If the number of
ranks is over 1024, the 'shared receiving queue' option is
used, which reduces the amount of pre-pinned memory for each
rank to a fixed 64 MB regardless of how many ranks are used.
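The default tiers described above can be expressed as a short calculation. This Python sketch reproduces only the arithmetic in the text; it uses the total rank count as the upper bound on remote ranks, and it does not model the effect of the tunables. The function name is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def prepinned_bytes_per_rank(num_ranks: int) -> int:
    """Upper bound on default pre-pinned memory for one rank, in bytes.

    Uses num_ranks as the bound on the number of remote ranks,
    matching the "up to" figures quoted in the text.
    """
    if num_ranks <= 512:
        return num_ranks * 256 * KB   # 256 KB per remote rank
    if num_ranks <= 1024:
        return num_ranks * 96 * KB    # 96 KB per remote rank
    return 64 * MB                    # shared receiving queue: fixed size


if __name__ == "__main__":
    for n in (512, 1024, 2048):
        print(n, "ranks:", prepinned_bytes_per_rank(n) // MB, "MB")
```

For 512, 1024, and 2048 ranks this gives 128 MB, 96 MB, and 64 MB respectively, which matches the figures above.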
HP-MPI also has safeguard variables, MPI_PHYSICAL_MEMORY and MPI_PIN_PERCENTAGE, which
set an upper bound on the total amount of memory an HP-MPI job will
pin. An error will be reported during startup if this total is not
large enough to accommodate the pre-pinned memory.