HP XC System Software: User's Guide > Chapter 5 Submitting Jobs

Submitting Multiple MPI Jobs Across the Same Set of Nodes


You can run multiple MPI jobs across the same set of nodes at the same time in two ways:

  • Using a script

  • Using a Makefile

The following sections describe both methods. In each, the submitted jobs are parallel HP-MPI jobs that run the ping_pong_ring program, which is delivered with the HP-MPI software.

Using a Script to Submit Multiple Jobs

You can write a script that contains multiple commands, each of which launches a job. In this example, the ping_pong_ring command is run first in the background and then again in the foreground:

$ cat script
#!/bin/sh
mpirun -srun -N2 -n4 ./ping_pong_ring 100000 &
mpirun -srun -N2 -n4 ./ping_pong_ring 100000
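
The script shown above does not wait for the background command, so it can exit as soon as the foreground command finishes. The following is a minimal sketch of a variant that waits for both copies and reports whether either failed. The run_pair function name is hypothetical and is not part of HP-MPI, LSF, or SLURM. The sketch assumes only standard sh job control.

```shell
#!/bin/sh
# run_pair CMD [ARGS...] (hypothetical helper): start the same command twice,
# once in the background and once in the foreground, then wait for the
# background copy. Returns 0 only if both copies exit with status 0.
run_pair() {
    "$@" &                 # first copy, in the background
    bg_pid=$!
    "$@"                   # second copy, in the foreground
    fg_status=$?
    wait "$bg_pid"         # collect the background copy's exit status
    bg_status=$?
    [ "$fg_status" -eq 0 ] && [ "$bg_status" -eq 0 ]
}

# Possible use inside the job script, with the same command line as above:
# run_pair mpirun -srun -N2 -n4 ./ping_pong_ring 100000
```

Because the function returns a nonzero status when either copy fails, the script's own exit status can reflect a failure in the background job as well as in the foreground job.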

The following command line executes the script, which submits the jobs:

$ bsub -o %J.out -n2 -ext "SLURM[nodes=2]" ./script
Job <111> is submitted to default queue <normal>.
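
The job ID in the confirmation message is the value that replaces %J in the output file name. As a sketch, the ID can be extracted from a message of the form shown above with sed. The job_id_from_bsub function name is hypothetical, and the sketch assumes that the confirmation message is what arrives on standard input.

```shell
#!/bin/sh
# job_id_from_bsub (hypothetical helper): read a bsub confirmation message
# on stdin and print the numeric job ID found in "Job <ID> is submitted ...".
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Possible use on a cluster (assumes bsub prints the message on stdout):
# jobid=$(bsub -o %J.out -n2 -ext "SLURM[nodes=2]" ./script | job_id_from_bsub)
# echo "output will be written to ${jobid}.out"
```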

The bjobs command provides information on the execution of the script:

$ bjobs
JOBID   USER    STAT  QUEUE    FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
111     lsfadmi PEND  normal   lsfhost.loc             ./script   date and time

Use the squeue -s command to display the job steps that the script launched:

$ squeue -s
STEPID   NAME          PARTITION USER      TIME NODELIST
13.0     hptclsf@111         lsf lsfadmin  0:07 n14
13.1     ping_pong_ring      lsf lsfadmin  0:07 n[14-15]
13.2     ping_pong_ring      lsf lsfadmin  0:07 n[14-15]
$ bjobs
No unfinished job found
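
To confirm that both copies of the program are running as separate job steps, the squeue -s listing can be filtered by step name. The following sketch assumes the column layout shown above (step ID in the first column, name in the second). The steps_named function name is hypothetical.

```shell
#!/bin/sh
# steps_named NAME (hypothetical helper): read `squeue -s` output on stdin
# and print the step IDs whose NAME column equals NAME, one per line.
# The first input line is treated as the column header and skipped.
steps_named() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Possible use on a cluster:
# squeue -s | steps_named ping_pong_ring
```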

Using a Makefile to Submit Multiple Jobs

You can submit multiple jobs across the same nodes by using a Makefile.

For information on Makefiles and the make utility, see The GNU Make Manual and make(1L).

The ping_pong_ring application is submitted twice in a Makefile named mymake; the first time as run1 and the second as run2:

$ cat mymake

  PPR_ARGS=10000
  NODES=2
  TASKS=4

  all: run1 run2

  run1:
         mpirun -srun -N ${NODES} -n ${TASKS} ./ping_pong_ring ${PPR_ARGS}

  run2:
         mpirun -srun -N ${NODES} -n ${TASKS} ./ping_pong_ring ${PPR_ARGS}

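The following command line submits make as the job. The -j2 option lets make run both targets at the same time, and the PPR_ARGS value given on the command line overrides the value set in the Makefile:

$ bsub -o %J.out -n2 -ext "SLURM[nodes=2]" make -j2 -f ./mymake \
PPR_ARGS=1000000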
Job <113> is submitted to default queue <normal>.

Use the squeue -s command to display the job steps that make launched:

$ squeue -s
STEPID   NAME          PARTITION USER      TIME NODELIST
15.0     hptclsf@113         lsf lsfadmin  0:04 n14
15.1     ping_pong_ring      lsf lsfadmin  0:04 n[14-15]
15.2     ping_pong_ring      lsf lsfadmin  0:04 n[14-15]

The following command displays the final ten lines of the output file produced by the job that ran mymake:

$ tail 113.out
1000000 bytes: 937.33 MB/sec
[2:n15] ping-pong 1000000 bytes ...
1000000 bytes: 1048.41 usec/msg
1000000 bytes: 953.82 MB/sec
[3:n15] ping-pong 1000000 bytes ...
1000000 bytes: 15308.02 usec/msg
1000000 bytes: 65.33 MB/sec
[3:n15] ping-pong 1000000 bytes ...
1000000 bytes: 15343.11 usec/msg
1000000 bytes: 65.18 MB/sec
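
Because two jobs share the same nodes, the reported bandwidth varies widely between ranks. As a sketch, the MB/sec figures can be summarized with awk. The summarize_bandwidth function name is hypothetical, and the sketch assumes the output format shown above, in which each bandwidth line ends with a value followed by MB/sec.

```shell
#!/bin/sh
# summarize_bandwidth (hypothetical helper): read ping_pong_ring output on
# stdin and print the number of MB/sec samples with their minimum and maximum.
summarize_bandwidth() {
    awk '$NF == "MB/sec" {
             v = $(NF - 1) + 0            # value just before the unit
             if (n == 0 || v < min) min = v
             if (n == 0 || v > max) max = v
             n++
         }
         END { if (n) printf "samples=%d min=%.2f max=%.2f\n", n, min, max }'
}

# Possible use on the output file from the job above:
# summarize_bandwidth < 113.out
```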

The following illustrates how an error in the Makefile is reported. This Makefile specifies a nonexistent program:

$ cat mymake

PPR_ARGS=10000
NODES=2
TASKS=4

all: run1 run2 run3

run1:
     mpirun -srun -N ${NODES} -n ${TASKS} ./ping_pong_ring ${PPR_ARGS}

run2:
     mpirun -srun -N ${NODES} -n ${TASKS} ./ping_pong_ring ${PPR_ARGS}

run3:
     mpirun -srun -N ${NODES} -n ${TASKS} ./ping_bogus ${PPR_ARGS}

The run3 target attempts to launch ping_bogus, a program that does not exist.

The following command line submits make as the job, with -j3 so that all three targets can run at the same time:

$ bsub -o %J.out -n2 -ext "SLURM[nodes=2]" make -j3 \
-f ./mymake PPR_ARGS=100000 
Job <117> is submitted to default queue <normal>.

The output file contains error messages related to the attempt to launch the nonexistent program.

$ cat 117.out

  .
  .
  .

mpirun -srun -N 2 -n 4 ./ping_pong_ring 100000
mpirun -srun -N 2 -n 4 ./ping_pong_ring 100000
mpirun -srun -N 2 -n 4 ./ping_bogus 100000
slurmstepd: [19.3]: error: execve(): ./ping_bogus: No such file or directory
slurmstepd: [19.3]: error: execve(): ./ping_bogus: No such file or directory
srun: error: n14: task0: Exited with exit code 2
srun: Terminating job
slurmstepd: [19.3]: error: execve(): ./ping_bogus: No such file or directory
slurmstepd: [19.3]: error: execve(): ./ping_bogus: No such file or directory
make: *** [run3] Error 2
make: *** Waiting for unfinished jobs....
[0:n14] ping-pong 100000 bytes ...
100000 bytes: 99.06 usec/msg
100000 bytes: 1009.51 MB/sec
[0:n14] ping-pong 100000 bytes ...
100000 bytes: 99.76 usec/msg
100000 bytes: 1002.43 MB/sec
[1:n14] ping-pong 100000 bytes ...
100000 bytes: 1516.83 usec/msg
100000 bytes: 65.93 MB/sec
[1:n14] ping-pong 100000 bytes ...
100000 bytes: 1519.73 usec/msg
100000 bytes: 65.80 MB/sec
[2:n15] ping-pong 100000 bytes ...
100000 bytes: 108.65 usec/msg
100000 bytes: 920.38 MB/sec
[2:n15] ping-pong 100000 bytes ...
100000 bytes: 99.44 usec/msg
100000 bytes: 1005.65 MB/sec
[3:n15] ping-pong 100000 bytes ...
100000 bytes: 1877.35 usec/msg
100000 bytes: 53.27 MB/sec
[3:n15] ping-pong 100000 bytes ...
100000 bytes: 1888.22 usec/msg
100000 bytes: 52.96 MB/sec
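
In a long output file, the make error lines are easy to miss among the program output. The following sketch lists the failed targets and their exit codes. The failed_make_targets function name is hypothetical, and the pattern assumes the error line format shown above (make: *** [target] Error N); other versions of make may format this line differently.

```shell
#!/bin/sh
# failed_make_targets (hypothetical helper): read a job output file on stdin
# and print "TARGET CODE" for each line of the form
# "make: *** [TARGET] Error CODE".
failed_make_targets() {
    sed -n 's/^make: \*\*\* \[\(.*\)] Error \([0-9][0-9]*\)$/\1 \2/p'
}

# Possible use on the output file from the job above:
# failed_make_targets < 117.out
```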

The sacct command, which displays SLURM accounting information, reflects the error:

  [lsfadmin@n16 ~]$ sacct -j 19
  Jobstep    Jobname            Partition    Ncpus Status     Error
  ---------- ------------------ ---------- ------- ---------- -----
  19         hptclsf@117        lsf              8 CANCELLED      2
  19.0       hptclsf@117        lsf              0 FAILED         2
  19.1       hptclsf@117        lsf              8 COMPLETED      0
  19.2       hptclsf@117        lsf              8 COMPLETED      0
  19.3       hptclsf@117        lsf              8 FAILED         2 
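
As a sketch, the failed job steps can be picked out of the sacct listing with awk. The failed_steps function name is hypothetical, and the sketch assumes the six-column layout shown above, with Status in the fifth column and Error in the sixth.

```shell
#!/bin/sh
# failed_steps (hypothetical helper): read sacct output on stdin and print
# the job step ID and error code of every step whose Status is FAILED.
failed_steps() {
    awk '$5 == "FAILED" { print $1, $6 }'
}

# Possible use on a cluster:
# sacct -j 19 | failed_steps
```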
© 2003 Hewlett-Packard Development Company, L.P.