The mpirun option -cpu_bind binds a rank to an
ldom (locality domain) to prevent a process from moving to a different ldom after
startup. The binding occurs before the MPI application is executed.
To accomplish this, a shared library is loaded at startup
which does the following for each rank:
Spins for a short time in a tight loop to
let the operating system distribute processes to CPUs evenly. This
duration can be changed by setting the MPI_CPU_SPIN environment
variable which controls the number of spins in the initial loop.
Default is 3 seconds.
Determines the current CPU and ldom.
Checks with other ranks in
the MPI job on the host for oversubscription by using a "shm" segment
created by mpirun and a lock to communicate with other ranks. If no oversubscription
occurs on the current CPU, the process is locked to the ldom of
that CPU. If a rank is already reserved on the current CPU,
a new CPU is chosen from the least loaded free CPUs and the
process is locked to the ldom of that CPU.
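The placement decision in the last step can be modeled with a short Python sketch. This is only an illustration under stated assumptions, not the code of the shared library: the function name choose_cpu, the load metric, and the fallback when every CPU is already reserved are all hypothetical, and a plain Python set stands in for the "shm" segment and its lock.

```python
def choose_cpu(current_cpu, reserved, load, cpu_to_ldom):
    """Pick the CPU (and its ldom) for a rank currently running on current_cpu.

    reserved    -- set of CPUs already claimed by other ranks on this host
                   (stands in for the shared "shm" segment; no lock is needed
                   here because ranks are processed one at a time)
    load        -- dict mapping CPU number to a load value (lower is better);
                   the real load metric is not described in the text
    cpu_to_ldom -- dict mapping CPU number to ldom number
    """
    if current_cpu not in reserved:
        # No oversubscription: stay where the scheduler put the process.
        cpu = current_cpu
    else:
        free = [c for c in cpu_to_ldom if c not in reserved]
        if free:
            # Least loaded free CPU; ties are broken by CPU number.
            cpu = min(free, key=lambda c: (load.get(c, 0.0), c))
        else:
            # Assumption: with no free CPU left, the rank stays put.
            cpu = current_cpu
    reserved.add(cpu)
    return cpu, cpu_to_ldom[cpu]


# Hypothetical host: 4 CPUs, 2 ldoms.
cpu_to_ldom = {0: 0, 1: 0, 2: 1, 3: 1}
load = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.2}
reserved = set()

print(choose_cpu(0, reserved, load, cpu_to_ldom))  # (0, 0): CPU 0 was free
print(choose_cpu(0, reserved, load, cpu_to_ldom))  # (1, 0): CPU 0 taken, CPU 1 least loaded
```

In the second call the rank also starts on CPU 0, finds it reserved, and is moved to the least loaded free CPU, which mirrors the oversubscription case described above.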
Similar results can be achieved with "mpsched", but the
procedure outlined above has the advantage of distributing ranks according to load,
and it works well in psets and across multiple machines.
HP-MPI supports CPU binding with a variety of binding strategies
(see below). The option -cpu_bind is supported
in appfile, command line, and srun modes.
% mpirun -cpu_bind[_mt]=[v,][option][,v] -np 4 a.out
where _mt requests thread-aware CPU binding; v, and ,v request verbose
information on the binding of threads to CPUs; and [option] is one of:
rank — Schedule ranks on CPUs according
to packed rank id.
map_cpu — Schedule ranks on CPUs
in cyclic distribution through MAP variable.
mask_cpu — Schedule ranks on CPU
masks in cyclic distribution through MAP variable.
ll — Least loaded (ll). Bind each
rank to the CPU it is currently running on.
For NUMA-based systems, the following options are also available:
ldom — Schedule ranks on ldoms according
to packed rank id.
cyclic — Cyclic distribution on each ldom
according to packed rank id.
block — Block distribution on each ldom
according to packed rank id.
rr — Round robin (rr). Same as cyclic,
but consider ldom load average.
fill — Same as block, but consider
ldom load average.
packed — Bind all ranks to same
ldom as lowest rank.
slurm — slurm binding.
ll — Least loaded (ll). Bind each
rank to the ldom it is currently running on.
map_ldom — Schedule ranks on ldoms
in cyclic distribution through MAP variable.
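The difference between the cyclic and block options is easiest to see with a small Python sketch. It uses the generic meaning of the two terms and assumes that ranks are numbered consecutively from 0; how HP-MPI derives the packed rank id and handles uneven rank counts is not described here, so the function names and the ceiling-division rule are illustrative assumptions only.

```python
def cyclic_ldom(rank, num_ldoms):
    """Cyclic distribution: consecutive ranks go to consecutive ldoms."""
    return rank % num_ldoms


def block_ldom(rank, num_ranks, num_ldoms):
    """Block distribution: consecutive ranks fill one ldom before the next.

    Assumption: each ldom receives ceil(num_ranks / num_ldoms) ranks.
    """
    ranks_per_ldom = -(-num_ranks // num_ldoms)  # ceiling division
    return rank // ranks_per_ldom


num_ranks, num_ldoms = 8, 4
print([cyclic_ldom(r, num_ldoms) for r in range(num_ranks)])
# [0, 1, 2, 3, 0, 1, 2, 3]
print([block_ldom(r, num_ranks, num_ldoms) for r in range(num_ranks)])
# [0, 0, 1, 1, 2, 2, 3, 3]
```

Under these assumptions, rr and fill would follow the same two patterns but adjust the choice of ldom using its load average.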
To list the currently supported options:
% mpirun -cpu_bind=help ./a.out
Environment variables for CPU binding:
MPI_BIND_MAP specifies
the integer CPU numbers, ldom numbers, or CPU masks
as a list of integers separated by commas (,).
MPI_CPU_AFFINITY is
an alternative to using -cpu_bind on the
command line for specifying the binding strategy. The possible settings
are LL, RANK, MAP_CPU, MASK_CPU, LDOM, CYCLIC, BLOCK, RR, FILL,
PACKED, SLURM, and MAP_LDOM.
MPI_CPU_SPIN selects
the spin value. The default is 2 seconds. During this time the
processes busy spin so that the operating system can schedule
them onto processors. Then the processes bind themselves to the
appropriate processor, core, or ldom.
For example, the following selects a 4 second spin
period to allow 32 MPI ranks (processes) to settle into place and
then bind to the appropriate processor/core/ldom.
% mpirun -e MPI_CPU_SPIN=4 -cpu_bind -np 32 ./linpack
MPI_FLUSH_FCACHE can
be set to a threshold percentage of memory (0-100). If the file
cache currently in use meets or exceeds this threshold, a flush is attempted
after binding and essentially before the user’s MPI program
starts. Refer to “MPI_FLUSH_FCACHE” for more information.
MPI_THREAD_AFFINITY controls
thread affinity. Possible values are:
none — Schedule threads to
run on all cores/ldoms. This is the default.
cyclic — Schedule threads on ldoms
in cyclic manner starting after parent.
cyclic_cpu — Schedule threads on
cores in cyclic manner starting after parent.
block — Schedule threads on ldoms
in block manner starting after parent.
packed — Schedule threads on same
ldom as parent.
empty — No changes to thread affinity
are made.
MPI_THREAD_IGNSELF, when
set to 'yes', excludes the parent from scheduling consideration
of threads across the remaining cores/ldoms. This method of thread control
can be used for explicit pthreads or OpenMP threads.
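When launches are scripted, the environment variables above and the -cpu_bind option can be combined into one command line. The following Python sketch only assembles the argument list and does not run mpirun. The helper name build_mpirun_command is hypothetical, and the sketch uses only the flags that appear in this section (-e, -cpu_bind, -np).

```python
import shlex


def build_mpirun_command(num_ranks, program, bind_option=None, env=None):
    """Return the mpirun argument list for a CPU-bound launch.

    bind_option -- text after "-cpu_bind=", or None for a bare -cpu_bind
    env         -- dict of environment variables passed with -e NAME=VALUE
    """
    args = ["mpirun"]
    for name, value in (env or {}).items():
        args += ["-e", f"{name}={value}"]
    args.append("-cpu_bind" if bind_option is None else f"-cpu_bind={bind_option}")
    args += ["-np", str(num_ranks), program]
    return args


# Reproduces the MPI_CPU_SPIN example: 4 second spin, 32 ranks.
cmd = build_mpirun_command(32, "./linpack", env={"MPI_CPU_SPIN": 4})
print(shlex.join(cmd))
# mpirun -e MPI_CPU_SPIN=4 -cpu_bind -np 32 ./linpack
```

The list form can be passed directly to subprocess.run on a system where mpirun is installed; shlex.join is used here only to display the command.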
Three -cpu_bind options require the specification
of a map/mask description. This allows for very explicit binding
of ranks to processors. The three options are map_ldom, map_cpu,
and mask_cpu.
Syntax:
-cpu_bind=[map_ldom,map_cpu,mask_cpu] [:<settings>, =<settings>, -e MPI_BIND_MAP=<settings>]
Examples:
-cpu_bind=MAP_LDOM -e MPI_BIND_MAP=0,2,1,3
# map rank 0 to ldom 0, rank 1 to ldom 2,
# rank 2 to ldom 1 and rank 3 to ldom 3.
-cpu_bind=MAP_LDOM=0,2,3,1
# map rank 0 to ldom 0, rank 1 to ldom 2,
# rank 2 to ldom 3 and rank 3 to ldom 1.
-cpu_bind=MAP_CPU:0,6,5
# map rank 0 to cpu 0, rank 1 to cpu 6,
# rank 2 to cpu 5.
-cpu_bind=MASK_CPU:1,4,6
# map rank 0 to cpu 0 (0001), rank 1 to
# cpu 2 (0100), rank 2 to cpu 1 or 2 (0110).
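The map and mask examples above can be checked with a short Python sketch. It assumes that the settings are decimal integers, that ranks beyond the end of the list wrap around (the "cyclic distribution through MAP variable" mentioned earlier), and that bit n of a mask stands for CPU n. The function names are hypothetical and are not part of HP-MPI.

```python
def parse_map(settings):
    """Split a comma-separated MAP string into integers (decimal assumed)."""
    return [int(token) for token in settings.split(",")]


def entry_for_rank(rank, entries):
    """Map entry used by a rank; ranks wrap around the list cyclically."""
    return entries[rank % len(entries)]


def cpus_in_mask(mask):
    """CPU numbers whose bit is set in the mask (bit 0 is CPU 0)."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


ldoms = parse_map("0,2,1,3")
print([entry_for_rank(r, ldoms) for r in range(6)])
# [0, 2, 1, 3, 0, 2]

for rank, mask in enumerate(parse_map("1,4,6")):
    print(rank, format(mask, "04b"), cpus_in_mask(mask))
# 0 0001 [0]
# 1 0100 [2]
# 2 0110 [1, 2]
```

The mask output matches the MASK_CPU:1,4,6 example: rank 2 may run on CPU 1 or CPU 2 because both bits are set in 0110.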
On a clustered system, rank binding uses the number of ranks
together with the number of nodes to determine
the CPU binding. Cyclic or blocked launch is taken into account.
On a cell-based system with multiple users, the LL strategy
is recommended rather than RANK. LL allows the operating system
to schedule the computational ranks. Then the -cpu_bind capability
locks the ranks to the CPU as selected by the operating system scheduler.