HP-MPI provides multi-rail support on OpenFabric through
the MPI_IB_MULTIRAIL environment variable. This environment
variable is ignored on all other interconnects. In multi-rail mode,
a rank can use up to all the cards on its node, but each connection is
limited to the number of cards on the node to which it is connecting.
For example, if rank A has three cards, rank B has two cards,
and rank C has three cards, then connection A--B uses two cards, connection
B--C uses two cards, and connection A--C uses three cards. Long messages
are striped among all the cards on that connection to improve bandwidth.
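
The per-connection limit described above amounts to taking the smaller of the two card counts. The following is a minimal Python sketch of that rule only; the function name cards_per_connection and the dictionary of card counts are illustrative and are not part of HP-MPI.

```python
def cards_per_connection(cards_local: int, cards_remote: int) -> int:
    """Cards usable on one connection: limited by the node with fewer cards."""
    return min(cards_local, cards_remote)


# Card counts from the example in the text (rank name -> cards on its node).
cards = {"A": 3, "B": 2, "C": 3}

for left, right in [("A", "B"), ("B", "C"), ("A", "C")]:
    used = cards_per_connection(cards[left], cards[right])
    print(f"connection {left}--{right} uses {used} cards")
```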
By default, multi-card message striping is off. To turn it on,
specify -e MPI_IB_MULTIRAIL=N where N is the
number of cards used by a rank:
If N <= 1, message striping is not used.
If N is greater than the maximum number of cards M
on that node, all M cards are used.
If 1 < N <= M, message striping is used on N
cards or fewer.
If you specify -e MPI_IB_MULTIRAIL without a value,
the maximum possible number of cards is used.
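
The rules above can be summarized as a small function that returns the upper bound on the number of cards a rank stripes across. This is only a sketch of the documented behavior, not HP-MPI code: the name max_striping_cards is hypothetical, None stands for specifying the variable without a value, and a result of 1 is used here to mean that striping is off.

```python
from typing import Optional


def max_striping_cards(n: Optional[int], m: int) -> int:
    """Upper bound on cards used by a rank on a node with m cards.

    n is the value given to MPI_IB_MULTIRAIL; None models the case
    where the variable is specified without a value (assumption of
    this sketch). A return value of 1 means striping is not used.
    """
    if n is None:
        return m          # no value given: use as many cards as possible
    if n <= 1:
        return 1          # N <= 1: message striping is not used
    return min(n, m)      # 1 < N <= M gives N; N > M is capped at M


for value in (None, 0, 1, 2, 3, 8):
    print(value, "->", max_striping_cards(value, m=3))
```

The actual number of cards on a given connection can still be lower than this bound, because it also depends on the number of cards on the remote node.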
On a host, each rank selects all the cards in sequence, with each
rank starting from a different card. For example, given 4 cards and
4 ranks per host:
rank 0 will use cards 0, 1, 2, 3
rank 1 will use cards 1, 2, 3, 0
rank 2 will use cards 2, 3, 0, 1
rank 3 will use cards 3, 0, 1, 2
The order is important in SRQ mode because only the first card
is used for short messages. The selection approach allows short RDMA
messages to use all the cards in a balanced way.
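
The card ordering in the 4-card, 4-rank example is a rotation of the card list by the rank's position on the host. The Python sketch below reproduces that pattern. The function name card_order and the use of a modulo to generalize beyond the 4x4 case are assumptions made for illustration; the text only documents the 4x4 example.

```python
def card_order(local_rank: int, num_cards: int) -> list[int]:
    """Card sequence for a rank: the card list rotated by the rank's index.

    local_rank is the rank's position on its host (0-based). Wrapping
    with modulo for local_rank >= num_cards is an assumption of this
    sketch, not documented behavior.
    """
    return [(local_rank + i) % num_cards for i in range(num_cards)]


num_cards = 4
for rank in range(4):
    order = card_order(rank, num_cards)
    print(f"rank {rank} will use cards {', '.join(map(str, order))}")
```

Because each of the four ranks starts on a different card, the first card in each sequence (the one that carries short messages in SRQ mode) is distinct across the ranks on the host.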
For HP-MPI 2.2.5.1 and older, all cards must be on the same
fabric.