There are several ways to get information about a specific job after it has been submitted to LSF-HPC integrated with SLURM. This section briefly describes some of the commands available for that purpose. It is intended to introduce the commonly used commands and to note any differences in their operation in the HP XC environment; it is not a complete reference on this topic. See the LSF manpages for full information about the commands described in this section.

This section describes the following LSF commands: bjobs and bhist.

Getting Job Allocation Information
Before a job runs, LSF-HPC integrated with SLURM allocates SLURM compute nodes based on the job's resource requirements. After allocating nodes for a job, it attaches allocation information to the job. The bjobs -l command provides job allocation information for a running job. The bhist -l command provides job allocation information for a finished job. For details about using these commands, see the LSF manpages.

A job allocation information string resembles the following:

slurm_id=slurm_jobid;ncpus=slurm_nprocs;slurm_alloc=node_list

This allocation string has the following values:

- slurm_id
  The value of the SLURM_JOBID environment variable. This is the SLURM allocation ID, which associates the LSF-HPC job with the SLURM-allocated resources.
- ncpus
  The value of the SLURM_NPROCS environment variable. This is the actual number of allocated cores. Under node-level allocation scheduling, this number may be larger than the number the job requests.
- slurm_alloc
  A comma-separated list of allocated nodes.
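The three fields described above can be pulled apart with ordinary string handling. The following is a minimal sketch, not part of LSF-HPC: the function name parse_allocation is hypothetical, and the code assumes the string has exactly the key=value;key=value; layout shown in this section.

```python
def parse_allocation(alloc):
    """Split a string such as 'slurm_id=22;ncpus=8;slurm_alloc=n[5-8];' into a dict."""
    fields = {}
    for part in alloc.strip().split(";"):
        if not part:
            continue  # skip the empty piece left by a trailing ';'
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return {
        "slurm_id": int(fields["slurm_id"]),   # SLURM allocation ID
        "ncpus": int(fields["ncpus"]),         # number of allocated cores
        "slurm_alloc": fields["slurm_alloc"],  # node list, left unexpanded
    }


print(parse_allocation("slurm_id=22;ncpus=8;slurm_alloc=n[5-8];"))
# {'slurm_id': 22, 'ncpus': 8, 'slurm_alloc': 'n[5-8]'}
```

The node list is kept as a plain string here because its bracketed range notation needs separate handling.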
LSF-HPC integrated with SLURM sets the SLURM_JOBID and SLURM_NPROCS environment variables when it starts a job.

Example 10-1 illustrates how to use the bjobs -l command to obtain job allocation information about a running job:

Example 10-1 Job Allocation Information for a Running Job

$ bjobs -l 24
Job <24>, User <lsfadmin>, Project <default>,
Status <RUN>, Queue <normal>,
Interactive pseudo-terminal shell mode,
Extsched <SLURM[nodes=4]>, Command </bin/bash>
date and time stamp: Submitted from host <n2>, CWD <$HOME>,
4 Processors Requested, Requested Resources <type=any>;
date and time stamp: Started on 4 Hosts/Processors <4*lsfhost.localdomain>;
date and time stamp: slurm_id=22;ncpus=8;slurm_alloc=n[5-8];
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
EXTERNAL MESSAGES:
MSG_ID FROM POST_TIME MESSAGE ATTACHMENT
0 - - - -
1 lsfadmin date and time stamp SLURM[nodes=4] N

In particular, note the node and job allocation information provided in the above output:

date and time stamp: Started on 4 Hosts/Processors <4*lsfhost.localdomain>;
date and time stamp: slurm_id=22;ncpus=8;slurm_alloc=n[5-8];
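The output above requests 4 processors but reports ncpus=8, which matches the earlier note that the allocated core count may exceed the request under node-level allocation. As an illustrative sketch only, the comparison could be scripted as follows. The helper name requested_vs_allocated is hypothetical, and the code assumes the output has been captured as text and that neither phrase is broken across a line wrap.

```python
import re


def requested_vs_allocated(bjobs_long_output):
    """Return (requested, allocated) processor counts, or None if either is missing."""
    requested = re.search(r"(\d+)\s+Processors Requested", bjobs_long_output)
    allocated = re.search(r"\bncpus=(\d+);", bjobs_long_output)
    if requested is None or allocated is None:
        return None
    return int(requested.group(1)), int(allocated.group(1))


# Lines taken from the bjobs -l example in this section.
sample = """\
4 Processors Requested, Requested Resources <type=any>;
date and time stamp: Started on 4 Hosts/Processors <4*lsfhost.localdomain>;
date and time stamp: slurm_id=22;ncpus=8;slurm_alloc=n[5-8];
"""

print(requested_vs_allocated(sample))  # (4, 8)
```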
Example 10-2 illustrates how to use the bhist -l command to obtain job allocation information about a job that has run:

Example 10-2 Job Allocation Information for a Finished Job

$ bhist -l 24
Job <24>, User <lsfadmin>, Project <default>,
Interactive pseudo-terminal shell mode,
Extsched <SLURM[nodes=4]>, Command </bin/bash>
date and time stamp: Submitted from host <n2>, to Queue <normal>, CWD <$HOME>,
4 Processors Requested, Requested Resources <type=any>;
date and time stamp: Dispatched to 4 Hosts/Processors
<4*lsfhost.localdomain>;
date and time stamp: slurm_id=22;ncpus=8;slurm_alloc=n[5-8];
date and time stamp: Starting (Pid 4785);
date and time stamp: Done successfully.
The CPU time used is 0.1 seconds;
date and time stamp: Post job process done successfully;
Summary of time in seconds spent in various states by date and time stamp
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
11 0 220 0 0 0 231

In particular, note the node and job allocation information provided in the above output:

date and time stamp: Dispatched to 4 Hosts/Processors
<4*lsfhost.localdomain>;
date and time stamp: slurm_id=22;ncpus=8;slurm_alloc=n[5-8];
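The slurm_alloc value in these examples, n[5-8], is a compressed node list. A script that needs individual node names has to expand it. The sketch below is a simplified, hypothetical expander (expand_nodes is not an LSF or SLURM function). It assumes a comma-separated list whose items are either a plain name or a prefix followed by one bracketed group of numbers and ranges; other forms are not handled.

```python
import re


def expand_nodes(node_list):
    """Expand a list such as 'n[5-8]' or 'n1,n[5-8,12]' into individual node names."""
    names = []
    # Each item is a name optionally followed by one [...] group, so commas
    # inside the brackets do not split the item.
    for item in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", node_list):
        match = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", item)
        if match is None:
            names.append(item)  # plain node name with no range
            continue
        prefix, ranges = match.groups()
        for rng in ranges.split(","):
            low, _, high = rng.partition("-")
            high = high or low  # a single number is a range of one
            for number in range(int(low), int(high) + 1):
                # Keep any zero padding used by the lower bound.
                names.append(f"{prefix}{number:0{len(low)}d}")
    return names


print(expand_nodes("n[5-8]"))  # ['n5', 'n6', 'n7', 'n8']
```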
Examining the Status of a Job

Once a job is submitted, you can use the bjobs command to track its progress. The bjobs command reports the status of a job submitted to LSF-HPC. By default, bjobs lists only the user's jobs that have not finished or exited. The following examples show bjobs command usage and the output it produces on an HP XC system. For more information about the bjobs command and its output, see the LSF-HPC manpages.

Example 10-3 provides abbreviated output of the bjobs command.

Example 10-3 Using the bjobs Command (Short Output)

$ bjobs 24
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
24 msmith RUN normal n16 lsfhost.localdomain /bin/bash date and time

As shown in the previous output, the bjobs command returns the job ID, user name, job status, queue name, submitting host, executing host, job name, and submit time. In this example, the output shows that the job /bin/bash was submitted from node n16 and launched on the execution host (lsfhost.localdomain).

Example 10-4 provides extended output of the bjobs command.

Example 10-4 Using the bjobs Command (Long Output)

$ bjobs -l 24
Job <24>, User <msmith>, Project <default>, Status <RUN>,
Queue <normal>, Interactive pseudo-terminal shell
mode, Extsched <SLURM[nodes=4]>, Command </bin/bash>
date and time stamp: Submitted from host <n16>, CWD <$HOME>,
4 Processors Requested, Requested Resources <type=any>;
date and time stamp: Started on 4 Hosts/Processors
<4*lsfhost.localdomain>;
date and time stamp: slurm_id=22;ncpus=8;slurm_alloc=n[5-8];
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
EXTERNAL MESSAGES:
MSG_ID FROM POST_TIME MESSAGE ATTACHMENT
0 - - - -
1 lsfadmin date and time stamp SLURM[nodes=4] N

Viewing the Historical Information for a Job

The LSF bhist command is a good tool for tracking the lifetime of a job within LSF-HPC. The bhist command provides detailed information about running, pending, and suspended jobs, such as the amount of time a job spent in various states and in-depth information about its progress while it was under LSF-HPC control. See the LSF bhist manpage for more information about this command, its options, and its output.

A brief summary about a finished job can be obtained with the bhist command, as shown in Example 10-5. This command provides statistics about the amount of time that the job has spent in various states.

Example 10-5 Using the bhist Command (Short Output)

$ bhist 24
Summary of time in seconds spent in various states:
JOBID USER JOB_NAME PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
24 smith bin/bash 11 0 220 0 0 0 231

The information in the output provided by this example is explained in Table 10-1.

Table 10-1 Output Provided by the bhist Command

| Field | Description |
|---|---|
| JOBID | The job ID that LSF-HPC assigned to the job. |
| USER | The user who submitted the job. |
| JOB_NAME | The job name assigned by the user. |
| PEND | The total waiting time, excluding user suspended time, before the job is dispatched. |
| PSUSP | The total user suspended time of a pending job. |
| RUN | The total run time of the job. |
| USUSP | The total user suspended time after the job is dispatched. |
| SSUSP | The total system suspended time after the job is dispatched. |
| UNKWN | The total unknown time of the job. |
| TOTAL | The total time that the job has spent in all states. |
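In the bhist examples in this section, TOTAL equals the sum of the six state columns (11 + 0 + 220 + 0 + 0 + 0 = 231). The sketch below parses one data row of the short output and checks that relationship. It is illustrative only: parse_bhist_row and STATE_COLUMNS are hypothetical names, and the code assumes the row ends with the seven numeric columns in the order listed in the table.

```python
STATE_COLUMNS = ["PEND", "PSUSP", "RUN", "USUSP", "SSUSP", "UNKWN"]


def parse_bhist_row(row):
    """Turn one data row of short bhist output into a dict keyed by column name."""
    # The last seven columns are numeric, so split from the right; this keeps
    # a JOB_NAME that contains spaces in one piece.
    head, *numbers = row.rsplit(None, 7)
    job_id, user, job_name = head.split(None, 2)
    record = {"JOBID": int(job_id), "USER": user, "JOB_NAME": job_name}
    for name, value in zip(STATE_COLUMNS + ["TOTAL"], numbers):
        record[name] = int(value)
    return record


record = parse_bhist_row("24 smith bin/bash 11 0 220 0 0 0 231")
state_sum = sum(record[name] for name in STATE_COLUMNS)
print(state_sum, record["TOTAL"])  # 231 231
```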
For detailed information about a finished job, add the -l option to the bhist command, as shown in Example 10-6. The -l option requests the long format.

Example 10-6 Using the bhist Command (Long Output)

$ bhist -l 24
Job <24>, User <lsfadmin>, Project <default>,
Interactive pseudo-terminal shell mode,
Extsched <SLURM[nodes=4]>, Command </bin/bash>
date and time stamp: Submitted from host <n2>,
to Queue <normal>, CWD <$HOME>,
4 Processors Requested, Requested Resources <type=any>;
date and time stamp: Dispatched to 4 Hosts/Processors
<4*lsfhost.localdomain>;
date and time stamp: slurm_id=22;ncpus=8;slurm_alloc=n[5-8];
date and time stamp: Starting (Pid 4785);
Summary of time in seconds spent in various states by
date and time stamp
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
11 0 124 0 0 0 135