HP-MPI User's Guide > Chapter 3 Understanding HP-MPI

Scalability

Interconnect support of MPI-2 functionality

HP-MPI has been tested on InfiniBand clusters with as many as 2048 ranks using the VAPI protocol. Most HP-MPI features function in a scalable manner. However, a few are still subject to significant resource growth as the job size grows.

Table 3-4 Scalability

Feature | Affected Interconnect/Protocol | Scalability Impact
spawn | All | Forces use of pairwise socket connections between all mpids (typically one mpid per machine).
one-sided shared lock/unlock | All except VAPI and IBV | Only VAPI and IBV provide low-level calls to efficiently implement shared lock/unlock. All other interconnects require mpids to satisfy this feature.
one-sided exclusive lock/unlock | All except VAPI, IBV, and Elan | VAPI, IBV, and Elan provide low-level calls which allow HP-MPI to efficiently implement exclusive lock/unlock. All other interconnects require mpids to satisfy this feature.
one-sided other | TCP/IP | All interconnects other than TCP/IP allow HP-MPI to efficiently implement the remainder of the one-sided functionality. Only when using TCP/IP are mpids required to satisfy this feature.

 

Resource usage of TCP/IP communication

HP-MPI has also been tested on large Linux TCP/IP clusters with as many as 2048 ranks. Because each HP-MPI rank creates a socket connection to each other remote rank, the number of socket descriptors required increases with the number of ranks. On many Linux systems, this requires increasing the operating system limit on per-process and system-wide file descriptors.
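
The descriptor requirement can be estimated before a job is launched. The following Python sketch is an illustration only and is not part of HP-MPI. It assumes one socket per rank located on another host, plus a small hypothetical allowance (reserved) for standard I/O and other open files, and compares the estimate with the per-process soft limit reported by the operating system. The function names are invented for this example.

```python
import resource  # Unix-only standard library module


def sockets_needed_direct(total_ranks, ranks_on_this_host):
    """Estimate sockets one rank opens: one per rank on another host (assumption)."""
    if not 1 <= ranks_on_this_host <= total_ranks:
        raise ValueError("ranks_on_this_host must be between 1 and total_ranks")
    return total_ranks - ranks_on_this_host


def fits_in_limit(needed, soft_limit, reserved=16):
    """Return True if `needed` sockets plus `reserved` other descriptors fit.

    `reserved` is a hypothetical allowance, not an HP-MPI value.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return needed + reserved <= soft_limit


if __name__ == "__main__":
    # Query the per-process file descriptor limits of the current shell/session.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = sockets_needed_direct(total_ranks=2048, ranks_on_this_host=8)
    print(f"estimated sockets per rank: {needed}")
    print(f"soft limit: {soft}, hard limit: {hard}")
    print("fits" if fits_in_limit(needed, soft) else "limit must be raised")
```

The check covers only the per-process limit. The system-wide descriptor limit mentioned above must be verified separately.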

The number of sockets used by HP-MPI can be reduced on some systems at the cost of performance by using daemon communication. In this case, the processes on a host use shared memory to send messages to and receive messages from the daemon. The daemon, in turn, uses a socket connection to communicate with daemons on other hosts. Using this option, the maximum number of sockets opened by any HP-MPI process grows with the number of hosts used by the MPI job rather than the number of total ranks.
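
The difference in growth can be illustrated with a short Python sketch. It is a simplified model, not HP-MPI code: it assumes the same number of ranks on every host, one socket per remote rank in the default mode, and one socket from each daemon to every other daemon in daemon mode. The function name is hypothetical.

```python
def max_sockets_per_process(num_hosts, ranks_per_host, use_daemon):
    """Estimate the largest socket count held by any single process.

    Simplified model (assumption): uniform ranks per host.
    """
    if num_hosts < 1 or ranks_per_host < 1:
        raise ValueError("num_hosts and ranks_per_host must be at least 1")
    if use_daemon:
        # Ranks talk to the local daemon through shared memory;
        # the daemon holds one socket per other host.
        return num_hosts - 1
    # Default mode: each rank holds one socket per rank on every other host.
    return (num_hosts - 1) * ranks_per_host


if __name__ == "__main__":
    for hosts, per_host in [(16, 8), (256, 8)]:
        direct = max_sockets_per_process(hosts, per_host, use_daemon=False)
        daemon = max_sockets_per_process(hosts, per_host, use_daemon=True)
        print(f"{hosts * per_host} ranks on {hosts} hosts: "
              f"direct={direct}, daemon={daemon}")
```

Under these assumptions, a 2048-rank job on 256 hosts needs about 2040 sockets per rank in the default mode but only 255 per daemon with daemon communication.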

Figure 3-2 Daemon communication

To use daemon communication, specify the -commd option in the mpirun command. Once you have set the -commd option, you can use the MPI_COMMD environment variable to specify the number of shared-memory fragments used for inbound and outbound messages. Refer to mpirun and MPI_COMMD for more information.

Daemon communication can result in lower application performance. Therefore, use it only to scale an application to a large number of ranks when it is not possible to increase the operating system file descriptor limits to the required values.
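
As a minimal sketch of a launch with daemon communication enabled, the following Python helper assembles an mpirun command line that includes -commd. The rank count, the -np option usage, and the program name ./a.out are assumptions chosen for illustration, and the helper name is hypothetical. The script only prints the command and does not run it. MPI_COMMD is deliberately left unset here; see the MPI_COMMD reference for its value format.

```python
import shlex


def build_mpirun_argv(num_ranks, program, program_args=()):
    """Return an mpirun argument list with daemon communication enabled."""
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    # -commd comes from the text above; -np and the program are illustrative.
    return ["mpirun", "-commd", "-np", str(num_ranks), program, *program_args]


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection; nothing is executed.
    print(shlex.join(build_mpirun_argv(4, "./a.out")))
```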

Resource usage of RDMA communication modes

When using InfiniBand or GM, a certain amount of memory is pinned, which means it is locked to physical memory and cannot be paged out. The amount of pre-pinned memory HP-MPI uses can be adjusted using several tunables, such as MPI_RDMA_MSGSIZE, MPI_RDMA_NENVELOPE, MPI_RDMA_NSRQRECV, and MPI_RDMA_NFRAGMENT.

By default, when the number of ranks is less than or equal to 512, each rank pre-pins 256 KB per remote rank, so each rank pins up to 128 MB. If the number of ranks is above 512 but less than or equal to 1024, each rank pre-pins only 96 KB per remote rank, so each rank pins up to 96 MB. If the number of ranks is over 1024, the shared receiving queue option is used, which reduces the pre-pinned memory for each rank to a fixed 64 MB regardless of how many ranks are used.
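
The default tiers can be written out as a small calculation. The following Python sketch reproduces the arithmetic in the preceding paragraph and nothing more: it returns an upper bound that treats every rank in the job as remote, it ignores any changes made through the tunables, and the function name is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def prepinned_upper_bound(total_ranks):
    """Upper bound in bytes on default pre-pinned memory per rank.

    Treats all ranks as remote, matching the "up to" figures in the text.
    """
    if total_ranks < 1:
        raise ValueError("total_ranks must be at least 1")
    if total_ranks <= 512:
        return total_ranks * 256 * KB   # 256 KB per remote rank
    if total_ranks <= 1024:
        return total_ranks * 96 * KB    # 96 KB per remote rank
    return 64 * MB                      # shared receiving queue: fixed size


if __name__ == "__main__":
    for ranks in (256, 512, 513, 1024, 2048):
        print(f"{ranks} ranks: {prepinned_upper_bound(ranks) / MB:.1f} MB")
```

Note that the bound is not monotonic: at 513 ranks the per-rank figure drops to about 48 MB because the smaller 96 KB allocation applies.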

HP-MPI also has two safeguard variables, MPI_PHYSICAL_MEMORY and MPI_PIN_PERCENTAGE, which set an upper bound on the total amount of memory an HP-MPI job will pin. An error is reported during startup if this bound is not large enough to accommodate the pre-pinned memory.

© 1979-2007 Hewlett-Packard Development Company, L.P.