This section describes what happens in the HP XC system
when a job is submitted to LSF-HPC. Figure 10-1 illustrates
this process. Use the numbered steps in the text, which are also depicted in the
illustration, as an aid to understanding the process.
Consider the HP XC system configuration
shown in Figure 10-1, in which lsfhost.localdomain is the virtual IP name assigned to the LSF execution
host, node n16 is the login node, and
nodes n[1-10] are compute nodes in the lsf partition. Each compute node contains two cores, providing
20 cores for use by LSF-HPC jobs.
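The 20-core figure follows directly from the node count and the cores per node. A minimal shell sketch of that arithmetic is shown here; the variable names are hypothetical and chosen only for illustration:

```shell
#!/bin/sh
# Hypothetical variable names; the values come from the example configuration.
compute_nodes=10     # nodes n[1-10] in the lsf partition
cores_per_node=2     # each compute node contains two cores
total_cores=$((compute_nodes * cores_per_node))
echo "cores available to LSF-HPC jobs: $total_cores"
```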
A user logs in to login
node n16.
The user executes the following
LSF bsub command on login node n16:
$ bsub -n4 -ext "SLURM[nodes=4]" -o output.out ./myscript
This bsub command launches a
request for four cores (from the -n4 option of
the bsub command) across four nodes (from the -ext "SLURM[nodes=4]" option); the job is launched on those cores.
The script, myscript, which is shown here, runs
the job:
#!/bin/sh
hostname
srun hostname
mpirun -srun ./hellompi
LSF-HPC schedules
the job and monitors the state of the resources (compute
nodes) in the SLURM lsf partition.
When the LSF-HPC scheduler determines that the required resources
are available, LSF-HPC allocates those resources in SLURM and
obtains a SLURM job identifier (jobID) that corresponds
to the allocation.
In this example, four cores
spread over four nodes (n1, n2, n3, and n4) are allocated for myscript, and SLURM job ID 53 is assigned to the
allocation.
LSF-HPC prepares the
user environment for the job on the LSF execution host node and dispatches
the job with the job_starter.sh script. This user
environment includes standard LSF environment variables and two SLURM-specific
environment variables: SLURM_JOBID and SLURM_NPROCS.
SLURM_JOBID is the SLURM job ID of the job. Note that this is not the same as
the LSF-HPC jobID. “Translating SLURM and LSF-HPC JOBIDs” describes the relationship between
the SLURM_JOBID and the LSF-HPC JOBID.
SLURM_NPROCS is the number of
processes allocated.
These environment variables are intended for use
by the user's job, either explicitly (user scripts may
use these variables as necessary) or implicitly (the srun commands in the user’s job use these variables to determine
their allocation of resources).
The value for SLURM_NPROCS is
4 and the SLURM_JOBID is 53 in this example.
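As an illustration of explicit use, a user script might read the two variables as in the following sketch. The function name describe_allocation and the fallback values are assumptions added so that the fragment also runs outside an LSF-HPC job; they are not part of LSF-HPC or SLURM.

```shell
#!/bin/sh
# Sketch: a user script reading the two SLURM variables that LSF-HPC sets.
# describe_allocation is a hypothetical helper; the fallbacks ("unset", 1)
# are assumptions so the fragment runs even when the variables are absent.
describe_allocation() {
    jobid="${SLURM_JOBID:-unset}"
    nprocs="${SLURM_NPROCS:-1}"
    echo "SLURM job ID: $jobid, processes allocated: $nprocs"
}

describe_allocation
```

In the example job, where SLURM_JOBID is 53 and SLURM_NPROCS is 4, the function would report those two values.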
The user job myscript begins execution on compute node n1.
The first line in myscript is the hostname command. It executes locally
and returns the name of the node, n1.
The second line in the myscript script is the srun hostname command. The srun command in myscript inherits SLURM_JOBID and SLURM_NPROCS from the environment and executes the hostname command on each compute node in the allocation.
The output of the hostname tasks (n1, n2, n3, and n4) is aggregated
back to the srun launch command (shown as dashed
lines in Figure 10-1), and is ultimately returned to the srun command in the job starter script, where it is collected
by LSF-HPC.
The last line in myscript is
the mpirun -srun ./hellompi command. The srun command inside the mpirun command
in myscript inherits the SLURM_JOBID and SLURM_NPROCS environment variables from the
environment and executes hellompi on each compute
node in the allocation.
The output of the hellompi tasks
is aggregated back to the srun launch command where
it is collected by LSF-HPC.
The command executes on the allocated compute nodes n1, n2, n3, and n4.
When the job finishes, LSF-HPC cancels the
SLURM allocation, which frees the compute nodes for use by another
job.