HP-MPI User's Guide > Chapter 3 Understanding HP-MPI

Running applications on HP-UX and Linux

This section introduces the methods to run your HP-MPI application on HP-UX and Linux. Using one of the mpirun methods is required. The examples below demonstrate six basic methods. Refer to mpirun for all the mpirun command line options.

HP-MPI includes -mpi32 and -mpi64 options for the launch utility mpirun on Opteron and Intel®64. These options should be used to indicate the bitness of the application to be invoked so that the availability of interconnect libraries can be properly determined by the HP-MPI utilities mpirun and mpid. The default is -mpi64.

There are six methods you can use to start your application, depending on what kind of system you are using:

  • Use mpirun with the -np # option and the name of your program. For example,

    % $MPI_ROOT/bin/mpirun -np 4 hello_world

    starts an executable file named hello_world with four processes. This is the recommended method to run applications on a single host with a single executable file.

  • Use mpirun with an appfile. For example,

    % $MPI_ROOT/bin/mpirun -f appfile

    where -f appfile specifies a text file (appfile) that is parsed by mpirun and contains process counts and a list of programs. Although you can use an appfile when you run a single executable file on a single host, it is best used when a job runs across a cluster of machines that has no dedicated launching method such as srun or prun (described below), or when you use multiple executables. For details about building your appfile, refer to “Creating an appfile”.

  • Use mpirun with -prun using the Quadrics Elan communication processor on Linux. For example,

    % $MPI_ROOT/bin/mpirun [mpirun options] -prun \
    <prun options> <program> <args>

    This method is only supported when linking with shared libraries.

    Some features like mpirun -stdio processing are unavailable.

    Rank assignments within HP-MPI are determined by the way prun chooses mapping at runtime.

    The -np option is not allowed with -prun. The following mpirun options are allowed with -prun:

    % $MPI_ROOT/bin/mpirun [-help] [-version] [-jv] [-i <spec>] [-universe_size=#] [-sp <paths>] [-T] [-prot] [-spawn] [-1sided] [-tv] [-e var[=val]] -prun <prun options> <program> [<args>]

    For more information on prun usage:

    % man prun

    The following examples assume the system has the Quadrics Elan interconnect and is a collection of 2-CPU nodes.

    % $MPI_ROOT/bin/mpirun -prun -N4 ./a.out

    will run a.out with 4 ranks, one per node; ranks are cyclically allocated.

    n00 rank1
    n01 rank2
    n02 rank3
    n03 rank4

    % $MPI_ROOT/bin/mpirun -prun -n4 ./a.out

    (assuming nodes have 2 processors/cores each) will run a.out with 4 ranks, 2 ranks per node; ranks are block allocated. Two nodes are used.

    n00 rank1
    n00 rank2
    n01 rank3
    n01 rank4

    Other forms of usage include allocating the nodes you wish to use, which creates a subshell. Then jobsteps can be launched within that subshell until the subshell is exited.

    % $MPI_ROOT/bin/mpirun -prun -A -N6

    This allocates 6 nodes and creates a subshell.

    % $MPI_ROOT/bin/mpirun -prun -n4 -m block ./a.out

    This uses 4 ranks on 4 nodes from the existing allocation. Note that we asked for block.

    n00 rank1
    n00 rank2
    n02 rank3
    n03 rank4

  • Use mpirun with -srun on HP XC clusters. For example,

    % $MPI_ROOT/bin/mpirun <mpirun options> -srun \
    <srun options> <program> <args>

    Some features like mpirun -stdio processing are unavailable.

    The -np option is not allowed with -srun. The following options are allowed with -srun:

    % $MPI_ROOT/bin/mpirun [-help] [-version] [-jv] [-i <spec>] [-universe_size=#] [-sp <paths>] [-T] [-prot] [-spawn] [-tv] [-1sided] [-e var[=val]] -srun <srun options> <program> [<args>]

    For more information on srun usage:

    % man srun

    The following examples assume the system has the Quadrics Elan interconnect, SLURM is configured to use Elan, and the system is a collection of 2-CPU nodes.

    % $MPI_ROOT/bin/mpirun -srun -N4 ./a.out

    will run a.out with 4 ranks, one per node; ranks are cyclically allocated.

    n00 rank1
    n01 rank2
    n02 rank3
    n03 rank4

    % $MPI_ROOT/bin/mpirun -srun -n4 ./a.out

    will run a.out with 4 ranks, 2 ranks per node; ranks are block allocated. Two nodes are used.

    Other forms of usage include allocating the nodes you wish to use, which creates a subshell. Then jobsteps can be launched within that subshell until the subshell is exited.

    % srun -A -n4

    This allocates 2 nodes with 2 ranks each and creates a subshell.

    % $MPI_ROOT/bin/mpirun -srun ./a.out

    This runs on the previously allocated 2 nodes cyclically.

    n00 rank1
    n00 rank2
    n01 rank3
    n01 rank4

  • Use XC LSF and HP-MPI

    HP-MPI jobs can be submitted using LSF. LSF uses the SLURM srun launching mechanism. Because of this, HP-MPI jobs need to specify the -srun option whether LSF is used or srun is used.

    % bsub -I -n2 $MPI_ROOT/bin/mpirun -srun ./a.out

    LSF creates an allocation of 2 processors and srun attaches to it.

    % bsub -I -n12 $MPI_ROOT/bin/mpirun -srun -n6 \
    -N6 ./a.out

    LSF creates an allocation of 12 processors and srun uses 1 CPU per node (6 nodes). Here, we assume 2 CPUs per node.

    LSF jobs can be submitted without the -I (interactive) option.

    An alternative mechanism for achieving one rank per node uses the -ext option to LSF:

    % bsub -I -n3 -ext "SLURM[nodes=3]" \
    $MPI_ROOT/bin/mpirun -srun ./a.out

    The -ext option can also be used to specifically request a node. The command line would look something like the following:

    % bsub -I -n2 -ext "SLURM[nodelist=n10]" mpirun -srun \
    ./hello_world

    Job <1883> is submitted to default queue <interactive>.
    <<Waiting for dispatch ...>>
    <<Starting on lsfhost.localdomain>>
    Hello world! I'm 0 of 2 on n10
    Hello world! I'm 1 of 2 on n10

    Including and excluding specific nodes can be accomplished by passing arguments to SLURM as well. For example, to make sure a job includes a specific node and excludes others, use something like the following. In this case, n9 is a required node and n10 is specifically excluded:

    % bsub -I -n8 -ext "SLURM[nodelist=n9;exclude=n10]" \
    mpirun -srun ./hello_world

    Job <1892> is submitted to default queue <interactive>.
    <<Waiting for dispatch ...>>
    <<Starting on lsfhost.localdomain>>
    Hello world! I'm 0 of 8 on n8
    Hello world! I'm 1 of 8 on n8
    Hello world! I'm 6 of 8 on n12
    Hello world! I'm 2 of 8 on n9
    Hello world! I'm 4 of 8 on n11
    Hello world! I'm 7 of 8 on n12
    Hello world! I'm 3 of 8 on n9
    Hello world! I'm 5 of 8 on n11

    In addition to displaying interconnect selection information, the mpirun -prot option can be used to verify that application ranks have been allocated in the desired manner:

    % bsub -I -n12 $MPI_ROOT/bin/mpirun -prot -srun \
    -n6 -N6 ./a.out

    Job <1472> is submitted to default queue <interactive>.
    <<Waiting for dispatch ...>>
    <<Starting on lsfhost.localdomain>>
    Host 0 -- ip 172.20.0.8 -- ranks 0
    Host 1 -- ip 172.20.0.9 -- ranks 1
    Host 2 -- ip 172.20.0.10 -- ranks 2
    Host 3 -- ip 172.20.0.11 -- ranks 3
    Host 4 -- ip 172.20.0.12 -- ranks 4
    Host 5 -- ip 172.20.0.13 -- ranks 5

     host | 0     1     2     3     4     5
    ======|===============================
        0 : SHM   VAPI  VAPI  VAPI  VAPI  VAPI
        1 : VAPI  SHM   VAPI  VAPI  VAPI  VAPI
        2 : VAPI  VAPI  SHM   VAPI  VAPI  VAPI
        3 : VAPI  VAPI  VAPI  SHM   VAPI  VAPI
        4 : VAPI  VAPI  VAPI  VAPI  SHM   VAPI
        5 : VAPI  VAPI  VAPI  VAPI  VAPI  SHM

    Hello world! I'm 0 of 6 on n8
    Hello world! I'm 3 of 6 on n11
    Hello world! I'm 5 of 6 on n13
    Hello world! I'm 4 of 6 on n12
    Hello world! I'm 2 of 6 on n10
    Hello world! I'm 1 of 6 on n9

  • Use LSF on non-XC systems

    On non-XC systems, use the following syntax to invoke the Parallel Application Manager (PAM) feature of LSF for applications where all processes execute the same program on the same host:

    % bsub <lsf_options> pam -mpi mpirun \
    <mpirun_options> program <args>

    In this case, LSF assigns a host to the MPI job.

    For example:

    % bsub pam -mpi $MPI_ROOT/bin/mpirun -np 4 compute_pi

    requests a host assignment from LSF and runs the compute_pi application with four processes.

    The load-sharing facility (LSF) allocates one or more hosts to run an MPI job. In general, LSF improves resource utilization for MPI jobs that run in multihost environments. LSF handles the job scheduling and the allocation of the necessary hosts and HP-MPI handles the task of starting up the application's processes on the hosts selected by LSF.

    By default mpirun starts the MPI processes on the hosts specified by the user, in effect handling the direct mapping of host names to IP addresses. When you use LSF to start MPI applications, the host names, specified to mpirun or implicit when the -h option is not used, are treated as symbolic variables that refer to the IP addresses that LSF assigns. Use LSF to do this mapping by specifying a variant of mpirun to execute your job.

    To invoke LSF for applications that run on multiple hosts:

    % bsub [lsf_options] pam -mpi mpirun [mpirun_options] -f appfile [-- extra_args_for_appfile]

    In this case, each host specified in the appfile is treated as a symbolic name, referring to the host that LSF assigns to the MPI job.

    For example:

    % bsub pam -mpi $MPI_ROOT/bin/mpirun -f my_appfile

    runs an appfile named my_appfile and requests host assignments for all remote and local hosts specified in my_appfile. If my_appfile contains the following items:

    -h voyager -np 10 send_receive
    -h enterprise -np 8 compute_pi

    host assignments are returned for the two symbolic names voyager and enterprise.

    When requesting a host from LSF, you must ensure that the path to your executable file is accessible by all machines in the resource pool.

More information about appfile runs

This example teaches you to run the hello_world.c application that you built in “Examples of building on HP-UX and Linux” (above) using two hosts to achieve four-way parallelism. For this example, the local host is named jawbone and a remote host is named wizard. To run hello_world.c on two hosts, use the following procedure, replacing jawbone and wizard with the names of your machines:

  1. Edit the .rhosts file on jawbone and wizard.

    Add an entry for wizard in the .rhosts file on jawbone and an entry for jawbone in the .rhosts file on wizard. In addition to the entries in the .rhosts file, ensure that the correct commands and permissions are set up on all hosts so that you can start your remote processes. Refer to “Setting shell” for details.

  2. Ensure that the executable is accessible from each host either by placing it in a shared directory or by copying it to a local directory on each host.

  3. Create an appfile.

    An appfile is a text file that contains process counts and a list of programs. In this example, create an appfile named my_appfile containing the following two lines:

    -h jawbone -np 2 /path/to/hello_world
    -h wizard -np 2 /path/to/hello_world

    The appfile should contain a separate line for each host. Each line specifies the name of the executable file and the number of processes to run on the host. The -h option is followed by the name of the host where the specified processes must be run. Instead of using the host name, you may use its IP address.

  4. Run the hello_world executable file:

    % $MPI_ROOT/bin/mpirun -f my_appfile

    The -f option specifies that the filename that follows it is an appfile. mpirun parses the appfile, line by line, for the information to run the program. In this example, mpirun runs the hello_world program with two processes on the local machine, jawbone, and two processes on the remote machine, wizard, as dictated by the -np 2 option on each line of the appfile.

  5. Analyze hello_world output.

    HP-MPI prints the output from running the hello_world executable in non-deterministic order. The following is an example of the output:

    Hello world! I'm 2 of 4 on wizard
    Hello world! I'm 0 of 4 on jawbone
    Hello world! I'm 3 of 4 on wizard
    Hello world! I'm 1 of 4 on jawbone

    Notice that processes 0 and 1 run on jawbone, the local host, while processes 2 and 3 run on wizard. HP-MPI guarantees that the ranks of the processes in MPI_COMM_WORLD are assigned and sequentially ordered according to the order the programs appear in the appfile. The appfile in this example, my_appfile, describes the local host on the first line and the remote host on the second line.
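The appfile in step 3 can also be generated rather than typed by hand. The following is a minimal sketch, assuming every host runs the same executable with the same process count. The function name make_appfile is hypothetical and is not part of HP-MPI; it only prints lines in the "-h host -np # program" form described above.

```shell
#!/bin/sh
# make_appfile NP PROGRAM HOST...
# Hypothetical helper (not an HP-MPI command): prints one appfile line
# per host, each requesting NP processes of PROGRAM on that host.
make_appfile() {
    np=$1
    program=$2
    shift 2
    for host in "$@"; do
        printf '%s\n' "-h $host -np $np $program"
    done
}

# Prints the two lines used in step 3 of the procedure above.
make_appfile 2 /path/to/hello_world jawbone wizard
```

Redirecting the output to a file, for example my_appfile, gives a file that can be passed to mpirun with the -f option. Because ranks follow appfile order, the order of the host arguments determines which host receives the lowest ranks.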

Running MPMD applications

A multiple program multiple data (MPMD) application uses two or more separate programs to functionally decompose a problem. This style can be used to simplify the application source and reduce the size of spawned processes. Each process can execute a different program.

MPMD with appfiles

To run an MPMD application, the mpirun command must reference an appfile that contains the list of programs to be run and the number of processes to be created for each program.

A simple invocation of an MPMD application looks like this:

% $MPI_ROOT/bin/mpirun -f appfile

where appfile is the text file parsed by mpirun and contains a list of programs and process counts.

Suppose you decompose the poisson application into two source files: poisson_master (uses a single master process) and poisson_child (uses four child processes).

The appfile for the example application contains the two lines shown below (refer to “Creating an appfile” for details).

-np 1 poisson_master

-np 4 poisson_child

To build and run the example application, use the following command sequence:

% $MPI_ROOT/bin/mpicc -o poisson_master poisson_master.c

% $MPI_ROOT/bin/mpicc -o poisson_child poisson_child.c

% $MPI_ROOT/bin/mpirun -f appfile

See “Creating an appfile” for more information about using appfiles.

MPMD with prun

prun also supports running applications with MPMD using procfiles. Please refer to the prun documentation at http://www.quadrics.com.

MPMD with srun

MPMD is not directly supported with srun. However, users can write custom wrapper scripts for their application to emulate this functionality. This can be accomplished by using the environment variables SLURM_PROCID and SLURM_NPROCS as keys for selecting the appropriate executable.
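A wrapper of this kind might look like the following sketch. It is an illustration only, not an HP-MPI utility: the function name select_program and the rule that task 0 runs the master are assumptions, and the program names are borrowed from the poisson example above.

```shell
#!/bin/sh
# Hypothetical MPMD wrapper for srun launches (a sketch, not part of HP-MPI).
# SLURM_PROCID is this task's index and SLURM_NPROCS is the task count.

# select_program PROCID NPROCS
# Prints the executable that the given task should run.
# Assumption: task 0 is the master and every other task is a child.
select_program() {
    procid=$1
    nprocs=$2
    if [ "$procid" -ge "$nprocs" ]; then
        echo "task index $procid out of range (0..$((nprocs - 1)))" >&2
        return 1
    fi
    if [ "$procid" -eq 0 ]; then
        echo ./poisson_master
    else
        echo ./poisson_child
    fi
}

# Inside a SLURM job, replace this shell with the selected program.
# Outside a SLURM job the variables are unset and nothing is executed.
if [ -n "${SLURM_PROCID:-}" ] && [ -n "${SLURM_NPROCS:-}" ]; then
    prog=$(select_program "$SLURM_PROCID" "$SLURM_NPROCS") || exit 1
    exec "$prog" "$@"
fi
```

If the script were saved under a name such as mpmd_wrapper.sh (a hypothetical name), it would be given to mpirun in place of the program, for example after -srun -n5, so that one task runs poisson_master and four run poisson_child.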

Modules on Linux

Modules are a convenient tool for managing environment settings for various packages. HP-MPI for Linux provides an hp-mpi module at /opt/hpmpi/modulefiles/hp-mpi which sets MPI_ROOT and adds to PATH and MANPATH. To use it, either copy this file to a system-wide module directory, or append /opt/hpmpi/modulefiles to the MODULEPATH environment variable.

Some useful module-related commands are:

% module avail

list the modules that can be loaded

% module load hp-mpi

load the hp-mpi module

% module list

list currently loaded modules

% module unload hp-mpi

unload the hp-mpi module

Modules are only supported on Linux.

NOTE: On XC Linux, the HP-MPI module is named mpi/hp/default and can be abbreviated 'mpi'.
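Appending /opt/hpmpi/modulefiles to MODULEPATH, as described above, can be done so that repeated logins do not add the directory twice. The following is a minimal sketch; add_modulepath is a hypothetical helper name and is not part of HP-MPI or of the Modules package.

```shell
#!/bin/sh
# add_modulepath DIR
# Hypothetical helper: appends DIR to MODULEPATH unless it is already listed.
add_modulepath() {
    dir=$1
    case ":${MODULEPATH:-}:" in
        *":$dir:"*) ;;                                     # already present
        *) MODULEPATH="${MODULEPATH:+$MODULEPATH:}$dir" ;;  # append, no leading colon
    esac
    export MODULEPATH
}

add_modulepath /opt/hpmpi/modulefiles
echo "$MODULEPATH"
```

After this runs in a shell start-up file, module avail should be able to find the hp-mpi module, provided the Modules package itself is installed.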

Runtime utility commands

HP-MPI provides a set of utility commands to supplement the MPI library routines. These commands are listed below and described in the following sections:

  • mpirun

  • mpirun.all (see restrictions under mpirun.all)

  • mpiexec

  • mpijob

  • mpiclean

mpirun

This section includes a discussion of mpirun syntax formats, mpirun options, appfiles, the multipurpose daemon process, and generating multihost instrumentation profiles.

The HP-MPI start-up mpirun requires that MPI be installed in the same directory on every execution host. The default is the location from which mpirun is executed. This can be overridden with the MPI_ROOT environment variable. Set the MPI_ROOT environment variable prior to starting mpirun. See “Configuring your environment”.

mpirun syntax has six formats:

  • Single host execution

  • Appfile execution

  • prun execution

  • srun execution

  • LSF on XC systems

  • LSF on non-XC systems

Single host execution

  • To run on a single host, the -np option to mpirun can be used.

    For example:

    % $MPI_ROOT/bin/mpirun -np 4 ./a.out

    will run 4 ranks on the local host.

Appfile execution

  • For applications that consist of multiple programs or that run on multiple hosts, here is a list of the most common options. For a complete list, see the mpirun man page:

    mpirun [-help] [-version] [-djpv] [-ck] [-t spec] [-i spec] [-commd] [-tv] -f appfile [-- extra_args_for_appfile]

    where -- extra_args_for_appfile specifies extra arguments, given as a space-separated list, to be applied to the programs listed in the appfile. Use this option at the end of your command line to append extra arguments to each line of your appfile. Refer to the example in “Adding program arguments to your appfile” for details. These extra arguments also apply to spawned applications if specified on the mpirun command line.

    In this case, each program in the application is listed in a file called an appfile. Refer to “Appfiles” for more information.

    For example:

    % $MPI_ROOT/bin/mpirun -f my_appfile

    runs using an appfile named my_appfile, which might have contents such as:

    -h hostA -np 2 /path/to/a.out

    -h hostB -np 2 /path/to/a.out

    which specify that two ranks are to run on hostA and two on hostB.

prun execution

  • Use the -prun option for applications that run on the Quadrics Elan interconnect. When using the -prun option, mpirun sets environment variables and invokes prun utilities. Refer to “Runtime environment variables” for more information about prun environment variables.

    The -prun argument to mpirun specifies that the prun command is to be used for launching. All arguments following -prun are passed unmodified to the prun command.

    % $MPI_ROOT/bin/mpirun <mpirun options> -prun \
    <prun options>

    The -np option is not allowed with prun. Some features like mpirun -stdio processing are unavailable.

    % $MPI_ROOT/bin/mpirun -prun -n 2 ./a.out

    launches a.out on two processors.

    % $MPI_ROOT/bin/mpirun -prot -prun -n 6 -N 6 ./a.out

    turns on the print protocol option (-prot is an mpirun option, and therefore is listed before -prun) and runs on 6 machines, one CPU per node.

    For more details about using prun, refer to “Running applications on HP-UX and Linux”.

    HP-MPI also provides implied prun mode. The implied prun mode allows the user to omit the -prun argument from the mpirun command line with the use of the environment variable MPI_USEPRUN. For more information about the implied prun mode see Appendix C “mpirun using implied prun or srun”.

srun execution

  • Applications that run on XC clusters require the -srun option. Startup directly from srun is not supported. When using this option, mpirun sets environment variables and invokes srun utilities. Refer to “Runtime environment variables” for more information about srun environment variables.

    The -srun argument to mpirun specifies that the srun command is to be used for launching. All arguments following -srun are passed unmodified to the srun command.

    % $MPI_ROOT/bin/mpirun <mpirun options> -srun \
    <srun options>

    The -np option is not allowed with srun. Some features like mpirun -stdio processing are unavailable.

    % $MPI_ROOT/bin/mpirun -srun -n 2 ./a.out

    launches a.out on two processors.

    % $MPI_ROOT/bin/mpirun -prot -srun -n 6 -N 6 ./a.out

    turns on the print protocol option (-prot is an mpirun option, and therefore is listed before -srun) and runs on 6 machines, one CPU per node.

    For more details about using srun, refer to “Running applications on HP-UX and Linux”.

    HP-MPI also provides implied srun mode. The implied srun mode allows the user to omit the -srun argument from the mpirun command line with the use of the environment variable MPI_USESRUN. For more information about the implied srun mode see Appendix C “mpirun using implied prun or srun”.

LSF on XC systems

HP-MPI jobs can be submitted using LSF. LSF uses the SLURM srun launching mechanism. Because of this, HP-MPI jobs need to specify the -srun option whether LSF is used or srun is used.

% bsub -I -n2 $MPI_ROOT/bin/mpirun -srun ./a.out

For more details on using LSF on XC systems, refer to “Running applications on HP-UX and Linux”.

LSF on non-XC systems

On non-XC systems, use the following syntax to invoke the Parallel Application Manager (PAM) feature of LSF for applications where all processes execute the same program on the same host:

% bsub <lsf_options> pam -mpi mpirun \
<mpirun_options> program <args>

For more details on using LSF on non-XC systems, refer to “Running applications on HP-UX and Linux”.

Appfiles

An appfile is a text file that contains process counts and a list of programs. When you invoke mpirun with the name of the appfile, mpirun parses the appfile to get information for the run.

Creating an appfile

The format of entries in an appfile is line oriented. Lines that end with the backslash (\) character are continued on the next line, forming a single logical line. A logical line starting with the pound (#) character is treated as a comment. Each program, along with its arguments, is listed on a separate logical line.

The general form of an appfile entry is:

[-h remote_host] [-e var[=val] [...]] [-l user] [-sp paths] [-np #] program [args]

where

-h remote_host

Specifies the remote host where a remote executable file is stored. The default is to search the local host. remote_host is either a host name or an IP address.

-e var=val

Sets the environment variable var for the program and gives it the value val. The default is not to set environment variables. When you use -e with the -h option, the environment variable is set to val on the remote host.

-l user

Specifies the user name on the target host. The default is the current user name.

-sp paths

Sets the target shell PATH environment variable to paths. Search paths are separated by a colon. Both -sp path and -e PATH=path do the same thing. If both are specified, the -e PATH=path setting is used.

-np #

Specifies the number of processes to run. The default value for # is 1.

program

Specifies the name of the executable to run. mpirun searches for the executable in the paths defined in the PATH environment variable.

args

Specifies command line arguments to the program. Options following a program name in your appfile are treated as program arguments and are not processed by mpirun.
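The line-oriented rules above (backslash continuation, # comments, one program per logical line) can be illustrated with a small filter that reduces an appfile to its logical lines. This is a sketch for checking a file by eye, not the parser mpirun uses; the function name appfile_logical_lines is hypothetical, and skipping blank lines is an assumption.

```shell
#!/bin/sh
# appfile_logical_lines   (appfile on stdin)
# Hypothetical helper: joins lines that end in a backslash and drops
# logical lines that start with '#'. Blank lines are also skipped here,
# which is an assumption made for readability.
appfile_logical_lines() {
    awk '
        {
            line = pending $0
            if (sub(/\\$/, "", line)) {   # trailing backslash: keep accumulating
                pending = line
                next
            }
            pending = ""
            if (line !~ /^#/ && line !~ /^[ \t]*$/) print line
        }
        END { if (pending != "" && pending !~ /^#/) print pending }
    '
}

# Two entries; the second is continued across two physical lines.
appfile_logical_lines <<'EOF'
# hosts for the send_receive and compute_pi example
-h voyager -np 10 send_receive arg1 arg2
-h enterprise \
-np 8 compute_pi
EOF
```

The filter prints one line per program entry, which makes it easy to confirm that a continued entry is read the way you intended.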

Adding program arguments to your appfile

When you invoke mpirun using an appfile, arguments for your program are supplied on each line of your appfile (refer to “Creating an appfile”). HP-MPI also provides an option on your mpirun command line to provide additional program arguments to those in your appfile. This is useful if you wish to specify extra arguments for each program listed in your appfile, but do not wish to edit your appfile.

To use an appfile when you invoke mpirun, use one of the following as described in mpirun:

  • mpirun [mpirun_options] -f appfile \
    [-- extra_args_for_appfile]

  • bsub [lsf_options] pam -mpi mpirun [mpirun_options] -f appfile \
    [-- extra_args_for_appfile]

The -- extra_args_for_appfile option is placed at the end of your command line, after appfile, to add options to each line of your appfile.

CAUTION: Arguments placed after -- are treated as program arguments, and are not processed by mpirun. Use this option when you want to specify program arguments for each line of the appfile, but want to avoid editing the appfile.

For example, suppose your appfile contains

-h voyager -np 10 send_receive arg1 arg2
-h enterprise -np 8 compute_pi

If you invoke mpirun using the following command line:

mpirun -f appfile -- arg3 -arg4 arg5

  • The send_receive command line for machine voyager becomes:

    send_receive arg1 arg2 arg3 -arg4 arg5

  • The compute_pi command line for machine enterprise becomes:

    compute_pi arg3 -arg4 arg5

When you use the -- extra_args_for_appfile option, it must be specified at the end of the mpirun command line.
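The effect of -- extra_args_for_appfile can be previewed with a short filter that appends the extra arguments to every entry. This is only a model of the behavior described above, not how mpirun is implemented; the function name append_extra_args is hypothetical, and it assumes one entry per physical line with no backslash continuations.

```shell
#!/bin/sh
# append_extra_args EXTRA...   (appfile on stdin)
# Hypothetical helper: prints each appfile entry with EXTRA appended,
# leaving blank lines and '#' comment lines unchanged.
append_extra_args() {
    extra=$*
    while IFS= read -r line; do
        case $line in
            ''|'#'*) printf '%s\n' "$line" ;;
            *)       printf '%s %s\n' "$line" "$extra" ;;
        esac
    done
}

# Same appfile and extra arguments as the example above.
append_extra_args arg3 -arg4 arg5 <<'EOF'
-h voyager -np 10 send_receive arg1 arg2
-h enterprise -np 8 compute_pi
EOF
```

Each printed entry ends with arg3 -arg4 arg5, matching the send_receive and compute_pi command lines shown in the example.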

Setting remote environment variables

To set environment variables on remote hosts use the -e option in the appfile. For example, to set the variable MPI_FLAGS:

-h remote_host -e MPI_FLAGS=val [-np #] program [args]

For instructions on how to set environment variables on HP-UX and Linux, refer to “Setting environment variables on the command line for HP-UX and Linux”.

For instructions on how to set environment variables on Windows CCP, refer to “Runtime environment variables for Windows CCP”.

Assigning ranks and improving communication

The ranks of the processes in MPI_COMM_WORLD are assigned and sequentially ordered according to the order the programs appear in the appfile.

For example, if your appfile contains

-h voyager -np 10 send_receive
-h enterprise -np 8 compute_pi

HP-MPI assigns ranks 0 through 9 to the 10 processes running send_receive and ranks 10 through 17 to the 8 processes running compute_pi.

You can use this sequential ordering of process ranks to your advantage when you optimize for performance on multihost systems. You can split process groups according to communication patterns to reduce or remove interhost communication hot spots.

For example, if you have the following:

  • A multi-host run of four processes

  • Two processes per host on two hosts

  • Higher communication traffic between ranks 0 and 2, and between ranks 1 and 3

You could use an appfile that contains the following:

-h hosta -np 2 program1
-h hostb -np 2 program2

However, this places processes 0 and 1 on hosta and processes 2 and 3 on hostb, so the rank pairs identified as having heavy communication traffic (0 and 2, 1 and 3) must communicate between hosts.

A more optimal appfile for this example would be

-h hosta -np 1 program1
-h hostb -np 1 program2
-h hosta -np 1 program1
-h hostb -np 1 program2

This places ranks 0 and 2 on hosta and ranks 1 and 3 on hostb. This placement allows intrahost communication between ranks that are identified as communication hot spots. Intrahost communication yields better performance than interhost communication.
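Before launching a job, the placement that an appfile produces can be checked by applying the ordering rule by hand. The following sketch does that mechanically. It is not an HP-MPI tool: the function name appfile_ranks is hypothetical, it interprets only the -h and -np options, and it assumes one entry per line with no backslash continuations and no program arguments that look like -h or -np.

```shell
#!/bin/sh
# appfile_ranks   (appfile on stdin)
# Hypothetical helper: prints the MPI_COMM_WORLD rank that each process
# would receive, using the rule that ranks follow appfile order.
appfile_ranks() {
    awk '
        /^#/ || NF == 0 { next }             # skip comments and blank lines
        {
            host = "(local)"; np = 1         # defaults when -h or -np is absent
            for (i = 1; i < NF; i++) {
                if ($i == "-h")  host = $(i + 1)
                if ($i == "-np") np = $(i + 1)
            }
            for (n = 0; n < np; n++)
                printf "rank %d on %s\n", rank++, host
        }
    '
}

# The improved appfile from the example above.
appfile_ranks <<'EOF'
-h hosta -np 1 program1
-h hostb -np 1 program2
-h hosta -np 1 program1
-h hostb -np 1 program2
EOF
```

For this input the listing shows ranks 0 and 2 on hosta and ranks 1 and 3 on hostb, which is the placement the example aims for.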

Multipurpose daemon process

HP-MPI incorporates a multipurpose daemon process that provides start-up, communication, and termination services. The daemon operation is transparent. HP-MPI sets up one daemon per host (or appfile entry) for communication.

NOTE: Because HP-MPI sets up one daemon per host (or appfile entry) for communication, when you invoke your application with -np x, HP-MPI generates x+1 processes.

Generating multihost instrumentation profiles

When you enable instrumentation for multihost runs, and invoke mpirun either on a host where at least one MPI process is running, or on a host remote from all your MPI processes, HP-MPI writes the instrumentation output file (prefix.instr) to the working directory on the host that is running rank 0.

mpirun.all

We recommend using the mpirun launch utility. However, for HP-UX and PA-RISC systems, HP-MPI provides a self-contained launch utility, mpirun.all, that allows HP-MPI to be used without installing it on all hosts.

The restrictions for mpirun.all include

  • Applications must be linked statically

  • Start-up may be slower

  • TotalView® is unavailable to executables launched with mpirun.all

  • Files will be copied to a temporary directory on target hosts

  • The remote shell must accept stdin

mpirun.all is not available on HP-MPI for Linux or Windows.

mpiexec

The MPI-2 standard defines mpiexec as a simple method to start MPI applications. It supports fewer features than mpirun, but it is portable. mpiexec syntax has three formats:

  • mpiexec offers arguments similar to an MPI_Comm_spawn call, with arguments as shown in the following form:

    mpiexec [-n maxprocs] [-soft ranges] [-host host] [-arch arch] [-wdir dir] [-path dirs] [-file file] command-args

    For example:

    % $MPI_ROOT/bin/mpiexec -n 8 ./myprog.x 1 2 3

    creates an 8 rank MPI job on the local host consisting of 8 copies of the program myprog.x, each with the command line arguments 1, 2, and 3.

  • It also allows arguments like an MPI_Comm_spawn_multiple call, with a colon-separated list of arguments, where each component is like the form above.

    For example:

    % $MPI_ROOT/bin/mpiexec -n 4 ./myprog.x : -host host2 -n \
    4 /path/to/myprog.x

    creates an MPI job with 4 ranks on the local host and 4 on host2.

  • Finally, the third form allows the user to specify a file containing lines of data like the arguments in the first form.

    mpiexec [-configfile file]

    For example:

    % $MPI_ROOT/bin/mpiexec -configfile cfile

    gives the same results as in the second example, but using the -configfile option, assuming the file cfile contains:

    -n 4 ./myprog.x
    -host host2 -n 4 -wdir /some/path ./myprog.x

where mpiexec options are:

-n maxprocs

Create maxprocs MPI ranks on the specified host.

-soft range-list

Ignored in HP-MPI.

-host host

Specifies the host on which to start the ranks.

-arch arch

Ignored in HP-MPI.

-wdir dir

Working directory for the created ranks.

-path dirs

PATH environment variable for the created ranks.

-file file

Ignored in HP-MPI.

This last option is used separately from the options above.

-configfile file

Specify a file of lines containing the above options.

mpiexec does not support prun or srun startup.

mpiexec is not available on HP-MPI V1.0 for Windows.

mpijob

mpijob lists the HP-MPI jobs running on the system. mpijob can only be used for jobs started in appfile mode. Invoke mpijob on the same host as you initiated mpirun. mpijob syntax is shown below:

mpijob [-help] [-a] [-u] [-j id [id id ...]]

where

-help

Prints usage information for the utility.

-a

Lists jobs for all users.

-u

Sorts jobs by user name.

-j id

Provides process status for job id. You can list a number of job IDs in a space-separated list.

When you invoke mpijob, it reports the following information for each job:

JOB

HP-MPI job identifier.

USER

User name of the owner.

NPROCS

Number of processes.

PROGNAME

Program names used in the HP-MPI application.

By default, your jobs are listed by job ID in increasing order. However, you can specify the -a and -u options to change the default behavior.

An mpijob output using the -a and -u options is shown below listing jobs for all users and sorting them by user name.

JOB     USER      NPROCS   PROGNAME
22623   charlie   12       /home/watts
22573   keith     14       /home/richards
22617   mick      100      /home/jagger
22677   ron       4        /home/wood

When you specify the -j option, mpijob reports the following for each job:

RANK

Rank for each process in the job.

HOST

Host where the job is running.

PID

Process identifier for each process in the job.

LIVE

Indicates whether the process is running (an x is used) or has been terminated.

PROGNAME

Program names used in the HP-MPI application.

mpijob does not support prun or srun startup.

mpijob is not available on HP-MPI V1.0 for Windows.

mpiclean

mpiclean kills processes in HP-MPI applications started in appfile mode. Invoke mpiclean on the host on which you initiated mpirun.

The MPI library checks for abnormal termination of processes while your application is running. In some cases, application bugs can cause processes to deadlock and linger in the system. When this occurs, you can use mpijob to identify hung jobs and mpiclean to kill all processes in the hung application.

mpiclean syntax has two forms:

  1. mpiclean [-help] [-v] -j id [id id ...]

  2. mpiclean [-help] [-v] -m

where

-help

Prints usage information for the utility.

-v

Turns on verbose mode.

-m

Cleans up your shared-memory segments.

-j id

Kills the processes of job number id. You can specify multiple job IDs in a space-separated list. Obtain the job ID using the -j option when you invoke mpirun.

You can only kill jobs that are your own.

The second syntax is used when an application aborts during MPI_Init, and the termination of processes does not destroy the allocated shared-memory segments.
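
As a rough illustration of the first form, the sketch below assembles an mpiclean command line from a list of job IDs and prints it rather than running it. The function name mpiclean_cmd is hypothetical; the -v and -j options are the ones described above.

```shell
# mpiclean_cmd ID [ID ...]
# Prints (does not execute) an mpiclean command in the first syntax form.
# Hypothetical helper for illustration; it rejects anything that is not a
# plain decimal job ID.
mpiclean_cmd() {
    if [ "$#" -eq 0 ]; then
        echo "usage: mpiclean_cmd id [id ...]" >&2
        return 2
    fi
    for id in "$@"; do
        case $id in
            ''|*[!0-9]*)
                echo "not a job ID: $id" >&2
                return 2 ;;
        esac
    done
    printf '%s\n' "mpiclean -v -j $*"
}

mpiclean_cmd 22617 22623
```

Printing the command first makes it easy to review the job IDs before killing anything.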

mpiclean does not support prun or srun startup.

mpiclean is not available on HP-MPI V1.0 for Windows.

Interconnect support

HP-MPI supports a variety of high-speed interconnects. On HP-UX and Linux, HP-MPI will attempt to identify and use the fastest available high-speed interconnect by default. On Windows, the selection must be made explicitly by the user.

On HP-UX and Linux, the search order for the interconnect is determined by the environment variable MPI_IC_ORDER (a colon-separated list of interconnect names) and by command line options, which take higher precedence.

Table 3-3 Interconnect command line options

Command line option   Protocol specified                  Applies to OS
-ibv / -IBV           IBV: OpenFabrics InfiniBand         Linux
-vapi / -VAPI         VAPI: Mellanox Verbs API            Linux
-udapl / -UDAPL       uDAPL: InfiniBand and some others   Linux
-psm / -PSM           PSM: QLogic InfiniBand              Linux
-mx / -MX             MX: Myrinet                         Linux
-gm / -GM             GM: Myrinet                         Linux
-elan / -ELAN         Quadrics Elan3 or Elan4             Linux
-itapi / -ITAPI       ITAPI: InfiniBand                   HP-UX
-ibal / -IBAL         IBAL: Windows IB Access Layer       Windows
-TCP                  TCP/IP                              All

The interconnect names used in MPI_IC_ORDER are like the command line options above, but without the dash. On Linux, the default value of MPI_IC_ORDER is

ibv:vapi:udapl:psm:mx:gm:elan:tcp

If command line options from the above table are used, the effect is that the specified setting is implicitly prepended to the MPI_IC_ORDER list, thus taking higher precedence in the search.

Interconnects specified in the command line or in the MPI_IC_ORDER variable can be lower or upper case. Lower case means the interconnect will be used if available. Upper case instructs HP-MPI to abort if the specified interconnect is unavailable.
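
The search rules described above can be modeled in a few lines of shell. This is only a sketch of the documented behavior, not HP-MPI's implementation: the function name pick_interconnect and the space-separated "available" list are assumptions made for illustration, and availability is simply supplied by the caller rather than probed.

```shell
# pick_interconnect ORDER AVAILABLE
#   ORDER      colon-separated list, as in MPI_IC_ORDER (e.g. "ibv:elan:tcp")
#   AVAILABLE  space-separated lowercase names assumed to be usable
# Prints the first usable interconnect. A lowercase entry is skipped when
# unavailable; an uppercase entry that is unavailable aborts the search.
pick_interconnect() {
    order=$1
    available=$2
    for name in $(printf '%s' "$order" | tr ':' ' '); do
        lower=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
        case " $available " in
            *" $lower "*)
                echo "$lower"
                return 0 ;;
        esac
        if [ "$name" != "$lower" ]; then
            echo "abort: $lower was required but is unavailable" >&2
            return 1
        fi
    done
    # Nothing matched; this sketch treats that as an error.
    echo "abort: no interconnect available" >&2
    return 1
}

# Default Linux ordering on a machine where only Elan and TCP/IP are usable.
pick_interconnect "ibv:vapi:udapl:psm:mx:gm:elan:tcp" "elan tcp"
```

Passing -TCP on the command line corresponds to calling the function with TCP prepended to the list, for example "TCP:ibv:vapi:udapl:psm:mx:gm:elan:tcp", which selects tcp even though elan is available.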

The availability of an interconnect is determined based on whether the relevant libraries can be dlopened / shl_loaded, and on whether a recognized module is loaded in Linux. If either condition is not met, the interconnect is determined to be unavailable.

On Linux, the names and locations of the libraries to be opened, and the names of the recognized interconnect modules, are specified by a collection of environment variables contained in $MPI_ROOT/etc/hpmpi.conf.

The hpmpi.conf file can be used for any environment variables, but arguably its most important use is to consolidate the environment variables related to interconnect selection.

The default value of MPI_IC_ORDER is specified there, along with a collection of variables of the form

MPI_ICLIB_XXX__YYY
MPI_ICMOD_XXX__YYY

where XXX is one of the interconnects (IBV, VAPI, etc.) and YYY is an arbitrary suffix. The MPI_ICLIB_* variables specify names of libraries to be dlopened. The MPI_ICMOD_* variables specify regular expressions for names of modules to search for.

An example is the following two pairs of variables for PSM:

MPI_ICLIB_PSM__PSM_MAIN = libpsm_infinipath.so.1
MPI_ICMOD_PSM__PSM_MAIN = "^ib_ipath "

and

MPI_ICLIB_PSM__PSM_PATH = /usr/lib64/libpsm_infinipath.so.1
MPI_ICMOD_PSM__PSM_PATH = "^ib_ipath "

The suffixes PSM_MAIN and PSM_PATH are arbitrary, and represent two different attempts that will be made when determining if the PSM interconnect is available.

The list of suffixes is contained in the variable MPI_IC_SUFFIXES which is also set in the hpmpi.conf file.

So, when HP-MPI is determining the availability of the PSM interconnect, it will first look at

MPI_ICLIB_PSM__PSM_MAIN
MPI_ICMOD_PSM__PSM_MAIN

for the library to dlopen and module name to look for. Then, if that fails, it will continue on to the next pair

MPI_ICLIB_PSM__PSM_PATH
MPI_ICMOD_PSM__PSM_PATH

which in this case specifies a full path to the PSM library.
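
The suffix-by-suffix lookup can be sketched as follows. This is an illustration only: the helper name list_attempts is hypothetical, the suffixes are passed as arguments because the exact format of MPI_IC_SUFFIXES is not shown here, and ordinary shell variables stand in for the settings read from hpmpi.conf.

```shell
# Values copied from the PSM example above.
MPI_ICLIB_PSM__PSM_MAIN='libpsm_infinipath.so.1'
MPI_ICMOD_PSM__PSM_MAIN='^ib_ipath '
MPI_ICLIB_PSM__PSM_PATH='/usr/lib64/libpsm_infinipath.so.1'
MPI_ICMOD_PSM__PSM_PATH='^ib_ipath '

# list_attempts IC SUFFIX [SUFFIX ...]
# Prints, in order, the library and module pattern tried for each suffix.
# IC and SUFFIX must be plain identifiers because they are used with eval.
list_attempts() {
    ic=$1
    shift
    for suffix in "$@"; do
        eval "lib=\${MPI_ICLIB_${ic}__${suffix}-}"
        eval "mod=\${MPI_ICMOD_${ic}__${suffix}-}"
        if [ -n "$lib" ]; then
            printf '%s: lib=%s mod=[%s]\n' "$suffix" "$lib" "$mod"
        fi
    done
}

list_attempts PSM PSM_MAIN PSM_PATH
```

The square brackets around the module pattern make the trailing space in the regular expression visible in the output.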

The MPI_ICMOD_* variables allow relatively complex values to specify what module names will be considered as evidence that the specified interconnect is available. Consider the example

MPI_ICMOD_VAPI__VAPI_MAIN = \
    "^mod_vapi " || "^mod_vip " || "^ib_core "

This means any of those three names will be accepted as evidence that VAPI is available. Each of the three strings individually is a regular expression that will be grepped for in the output from /sbin/lsmod.
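
The "any of these patterns" check can be approximated with grep. The sketch below is an assumption-laden illustration, not HP-MPI code: the function name any_module_matches is hypothetical, and the module listing is made-up sample text in the general shape of /sbin/lsmod output.

```shell
# any_module_matches LISTING REGEX [REGEX ...]
# Succeeds if at least one regular expression matches a line of LISTING.
any_module_matches() {
    listing=$1
    shift
    for re in "$@"; do
        if printf '%s\n' "$listing" | grep -q -e "$re"; then
            return 0
        fi
    done
    return 1
}

# Hypothetical module listing, for illustration only.
lsmod_sample='Module                  Size  Used by
ib_mthca              157984  0
ib_core                56576  1 ib_mthca
e1000                 119381  0'

if any_module_matches "$lsmod_sample" '^mod_vapi ' '^mod_vip ' '^ib_core '; then
    echo "VAPI module evidence found"
fi
```

The trailing space inside each pattern matters: it anchors the match to the whole module name, so '^ib_core ' does not match a module named ib_core_extra.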

In many cases, if a system has a high-speed interconnect that is not found by HP-MPI due to changes in library names and locations or module names, the problem can be fixed by simple edits to the hpmpi.conf file. Contacting HP-MPI support for assistance is encouraged.

Protocol-specific options and information

This section briefly describes the available interconnects and illustrates some of the more frequently used interconnect options.

The environment variables and command line options mentioned below are described in more detail in “mpirun options” and “List of runtime environment variables”.

TCP/IP

TCP/IP is supported on many types of cards. Machines often have more than one IP address, and a user may wish to specify which interface is to be used to get the best performance.

HP-MPI does not inherently know which IP address corresponds to the fastest available interconnect card. By default IP addresses are selected based on the list returned by gethostbyname(). The mpirun option -netaddr can be used to gain more explicit control over which interface is used.

IBAL

IBAL is only supported on Windows. Lazy deregistration is not supported with IBAL.

IBV

HP-MPI claims support for OpenFabrics V1.0 and V1.1. OpenFabrics is not supported on Itanium2 platforms.

In order to use OpenFabrics on Linux, the memory size for locking must be specified. It is controlled by the /etc/security/limits.conf file for Red Hat and the /etc/sysctl.conf file for SuSE.

* soft memlock 524288
* hard memlock 524288

The example above sets the maximum locked-in-memory address space, in KB units. The recommendation is to set the value to half of the physical memory.
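
The "half of physical memory" recommendation is simple arithmetic. The sketch below, with the hypothetical helper name memlock_lines, takes the total physical memory in KB and prints the two limits.conf entries; it is an illustration and does not modify any file.

```shell
# memlock_lines TOTAL_KB
# Prints soft and hard memlock entries set to half of TOTAL_KB.
# On Linux, TOTAL_KB can typically be read from the MemTotal line of
# /proc/meminfo, for example:
#   awk '/^MemTotal:/ { print $2 }' /proc/meminfo
memlock_lines() {
    half_kb=$(( $1 / 2 ))
    printf '* soft memlock %s\n' "$half_kb"
    printf '* hard memlock %s\n' "$half_kb"
}

# A machine with 1 GB (1048576 KB) of physical memory.
memlock_lines 1048576
```

For 1048576 KB this reproduces the value 524288 used in the example entries.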

Machines can have multiple InfiniBand cards. By default each HP-MPI rank selects one card for its communication, and the ranks cycle through the available cards on the system, so the first rank uses the first card, the second rank uses the second card, etc.
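
One way to picture the default cycling is a round-robin assignment. The sketch below assumes zero-based card numbering and that the rank index counts ranks on a single host; both are assumptions made for illustration, and the helper name card_for_rank is hypothetical.

```shell
# card_for_rank RANK NCARDS
# Prints the zero-based card index a rank would pick under plain round-robin.
card_for_rank() {
    if [ "$2" -le 0 ]; then
        echo "NCARDS must be positive" >&2
        return 2
    fi
    echo $(( $1 % $2 ))
}

# Four ranks sharing two cards.
for rank in 0 1 2 3; do
    echo "rank $rank -> card $(card_for_rank "$rank" 2)"
done
```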

The environment variable MPI_IB_CARD_ORDER can be used to control which card the different ranks select. Or, for increased potential bandwidth and greater traffic balance between cards, each rank can be instructed to use multiple cards by using the variable MPI_IB_MULTIRAIL.

Lazy deregistration is a performance enhancement used by HP-MPI on several of the high speed interconnects on Linux. This option is turned on by default, and requires the application to be linked in such a way that HP-MPI is able to intercept calls to malloc, munmap, etc. Most applications are linked that way, but if one is not, HP-MPI's lazy deregistration can be turned off with the command line option -ndd.

Some applications decline to directly link against libmpi and instead link against a wrapper library which is in turn linked against libmpi. In this case it is still possible for HP-MPI's malloc etc. interception to be used by supplying the --auxiliary option to the linker when creating the wrapper library, by using a compiler flag such as -Wl,--auxiliary,libmpi.so.

Note that dynamic linking is required with all InfiniBand use on Linux.

HP-MPI does not use the Connection Manager (CM) library with OpenFabrics.

VAPI

The MPI_IB_CARD_ORDER card selection option and the -ndd option described above for IBV apply to VAPI.

uDAPL

The -ndd option described above for IBV applies to uDAPL.

GM

The -ndd option described above for IBV applies to GM.

Elan

HP-MPI supports the Elan3 and Elan4 protocols for Quadrics.

By default HP-MPI uses Elan collectives for broadcast and barrier. If messages are outstanding at the time the Elan collective is entered and the other side of the message enters a completion routine on the outstanding message before entering the collective call, it is possible for the application to hang due to lack of message progression while inside the Elan collective. This situation is uncommon in real applications, but if such hangs are observed, the use of Elan collectives can be disabled by setting the environment variable MPI_USE_LIBELAN=0.

ITAPI

On HP-UX, InfiniBand is available by using the ITAPI protocol, which requires MLOCK privileges. When setting up InfiniBand on an HP-UX system, all users (other than root) who wish to use InfiniBand need to have their group ID in the /etc/privgroup file, and the permissions for access must be enabled via:

% setprivgrp -f /etc/privgroup

The above may be done automatically at boot time, but should also be performed once manually after setup of the InfiniBand drivers to ensure access. For example:

% grep user /etc/passwd

user:UJqaKNCCsESLo,O.fQ:836:1007:User Name:/home/user:/bin/tcsh

% grep 1007 /etc/group

ibusers::1007:

% cat /etc/privgroup

ibusers MLOCK

#add entries to /etc/privgroup

% setprivgrp -f /etc/privgroup

A one-time setting can also be done using:

/usr/sbin/setprivgrp <group> MLOCK

The above setting will not survive a reboot.
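
A quick check that a group is listed with MLOCK can be scripted against text in the /etc/privgroup layout shown above (group name followed by privileges). This is a minimal sketch; the helper name group_has_mlock is hypothetical, and it only inspects text supplied on stdin rather than querying the running system.

```shell
# group_has_mlock GROUP
# Reads privgroup-style lines on stdin and succeeds if GROUP is listed
# with the MLOCK privilege.
group_has_mlock() {
    awk -v grp="$1" '
        $1 == grp { for (i = 2; i <= NF; i++) if ($i == "MLOCK") found = 1 }
        END { exit(found ? 0 : 1) }'
}

# Entry copied from the example above.
if printf 'ibusers MLOCK\n' | group_has_mlock ibusers; then
    echo "ibusers has MLOCK"
fi
```

On an HP-UX system the input would come from the /etc/privgroup file itself.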

Interconnect selection examples

The default MPI_IC_ORDER will generally result in the fastest available protocol being used. The following example uses the default ordering and also supplies a -netaddr setting, in case TCP/IP is the only interconnect available.

% echo $MPI_IC_ORDER
ibv:vapi:udapl:psm:mx:gm:elan:tcp

% export MPIRUN_SYSTEM_OPTIONS="-netaddr 192.168.1.0/24"

% export MPIRUN_OPTIONS="-prot"

% $MPI_ROOT/bin/mpirun -srun -n4 ./a.out

The command line for the above will appear to mpirun as $MPI_ROOT/bin/mpirun -netaddr 192.168.1.0/24 -prot -srun -n4 ./a.out and the interconnect decision will look for IBV, then VAPI, etc. down to TCP/IP. If TCP/IP is chosen, it will use the 192.168.1.* subnet.
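
The way the two environment variables are folded into the command line can be sketched as below. The ordering (MPIRUN_SYSTEM_OPTIONS, then MPIRUN_OPTIONS, then the typed arguments) follows the expansion described above; the helper name effective_cmdline is hypothetical, and the sketch only prints the result.

```shell
# effective_cmdline PROGRAM [ARG ...]
# Prints the command line as mpirun would see it once the option
# variables have been placed in front of the typed arguments.
effective_cmdline() {
    prog=$1
    shift
    # Unquoted on purpose: each variable may hold several space-separated options.
    set -- ${MPIRUN_SYSTEM_OPTIONS-} ${MPIRUN_OPTIONS-} "$@"
    printf '%s\n' "$prog $*"
}

MPIRUN_SYSTEM_OPTIONS="-netaddr 192.168.1.0/24"
MPIRUN_OPTIONS="-prot"
effective_cmdline mpirun -srun -n4 ./a.out
```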

If TCP/IP is desired on a machine where other protocols are available, the -TCP option can be used.

This example is like the previous, except TCP is searched for and found first. (TCP should always be available.) So TCP/IP would be used instead of IBV or Elan, etc.

% $MPI_ROOT/bin/mpirun -TCP -srun -n4 ./a.out

The following example output shows three runs on an Elan system: first using Elan as the protocol, then using TCP/IP over GigE, then using TCP/IP over the Quadrics card.

  • This runs on Elan

    [user@opte10 user]$ bsub -I -n3 -ext "SLURM[nodes=3]" $MPI_ROOT/bin/mpirun -prot -srun ./a.out 
    Job <59304> is submitted to default queue <normal>.
    <<Waiting for dispatch ...>>
    <<Starting on lsfhost.localdomain>>
    Host 0 -- ELAN node 0 -- ranks 0
    Host 1 -- ELAN node 1 -- ranks 1
    Host 2 -- ELAN node 2 -- ranks 2

     host | 0 1 2
    ======|================
        0 : SHM ELAN ELAN
        1 : ELAN SHM ELAN
        2 : ELAN ELAN SHM

    Hello world! I'm 0 of 3 on opte6
    Hello world! I'm 1 of 3 on opte7
    Hello world! I'm 2 of 3 on opte8
  • This runs on TCP/IP over the GigE network configured as 172.20.x.x on eth0

    [user@opte10 user]$ bsub -I -n3 -ext "SLURM[nodes=3]" $MPI_ROOT/bin/mpirun -prot -TCP -srun ./a.out 
    Job <59305> is submitted to default queue <normal>.
    <<Waiting for dispatch ...>>
    <<Starting on lsfhost.localdomain>>
    Host 0 -- ip 172.20.0.6 -- ranks 0
    Host 1 -- ip 172.20.0.7 -- ranks 1
    Host 2 -- ip 172.20.0.8 -- ranks 2

     host | 0 1 2
    ======|================
        0 : SHM TCP TCP
        1 : TCP SHM TCP
        2 : TCP TCP SHM

    Hello world! I'm 0 of 3 on opte6
    Hello world! I'm 1 of 3 on opte7
    Hello world! I'm 2 of 3 on opte8
  • This uses TCP/IP over the Elan subnet using the -TCP option in combination with the -netaddr option for the Elan interface 172.22.x.x

    [user@opte10 user]$ bsub -I -n3 -ext "SLURM[nodes=3]" $MPI_ROOT/bin/mpirun -prot -TCP -netaddr 172.22.0.10 -srun ./a.out 
    Job <59307> is submitted to default queue <normal>.
    <<Waiting for dispatch ...>>
    <<Starting on lsfhost.localdomain>>
    Host 0 -- ip 172.22.0.2 -- ranks 0
    Host 1 -- ip 172.22.0.3 -- ranks 1
    Host 2 -- ip 172.22.0.4 -- ranks 2

     host | 0 1 2
    ======|================
        0 : SHM TCP TCP
        1 : TCP SHM TCP
        2 : TCP TCP SHM

    Hello world! I'm 0 of 3 on opte2
    Hello world! I'm 1 of 3 on opte3
    Hello world! I'm 2 of 3 on opte4
  • Elan interface

    [user@opte10 user]$ /sbin/ifconfig eip0
    eip0 Link encap:Ethernet HWaddr 00:00:00:00:00:0F
              inet addr:172.22.0.10 Bcast:172.22.255.255 Mask:255.255.0.0
              UP BROADCAST RUNNING MULTICAST MTU:65264 Metric:1
              RX packets:38 errors:0 dropped:0 overruns:0 frame:0
              TX packets:6 errors:0 dropped:3 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:1596 (1.5 Kb) TX bytes:252 (252.0 b)
  • GigE interface

    [user@opte10 user]$ /sbin/ifconfig eth0
    eth0 Link encap:Ethernet HWaddr 00:00:1A:19:30:80
              inet addr:172.20.0.10 Bcast:172.20.255.255 Mask:255.0.0.0
              UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
              RX packets:133469120 errors:0 dropped:0 overruns:0 frame:0
              TX packets:135950325 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:24498382931 (23363.4 Mb) TX bytes:29823673137 (28442.0 Mb)
              Interrupt:31
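
When comparing runs like the ones above, it can help to extract which inter-host protocols appear in the -prot matrix. The sketch below is a hypothetical helper (protocols_used) that scans saved output for the matrix rows and lists every protocol other than SHM; it assumes the row layout shown in the examples.

```shell
# protocols_used
# Reads mpirun -prot output on stdin and prints, one per line and sorted,
# the protocols other than SHM that appear in the host matrix rows.
protocols_used() {
    awk '$1 ~ /^[0-9]+$/ && $2 == ":" {
             for (i = 3; i <= NF; i++) if ($i != "SHM") seen[$i] = 1
         }
         END { for (p in seen) print p }' | sort
}

# Matrix copied from the first run above.
prot_sample='Host 0 -- ELAN node 0 -- ranks 0
Host 1 -- ELAN node 1 -- ranks 1
Host 2 -- ELAN node 2 -- ranks 2

 host | 0 1 2
======|================
    0 : SHM ELAN ELAN
    1 : ELAN SHM ELAN
    2 : ELAN ELAN SHM'

printf '%s\n' "$prot_sample" | protocols_used
```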
© 1979-2007 Hewlett-Packard Development Company, L.P.