SLURM on the HP XC system can collect job accounting data, store it in a log file, and display it. The accounting data is available only at the conclusion of a job; you cannot obtain it in real time while the job is running. You must enable this SLURM job accounting capability to support LSF-HPC on HP XC. LSF-HPC is discussed in the next chapter.

This section briefly describes the sacct command, which you can use to display the stored job accounting data, and discusses how to deconfigure and configure job accounting.

Information on jobs that are invoked with the SLURM srun command is logged automatically into the job accounting file. Entries in the slurm.conf file enable job accounting and designate the name of the job accounting log file. The default (and recommended) job accounting log file is /hptc_cluster/slurm/job/jobacct.log.

SLURM job accounting attempts to gather all the statistics available on the systems on which it is run. The following statistics are valid for HP XC systems:

- Maximum number of minor page faults (page reclaims) for any process
- Maximum number of major page faults for any process
- Total number of processes used
- Total number of processors allocated to the job
- Job status or state (running, completed, failed, timed out, or node fail)
- First nonzero error code returned by jobstep
- Sum of system processor time and user processor time

Using the sacct Command
The sacct command enables you to analyze the system accounting data collected in the job accounting log file. As the superuser, you can examine the accounting data for all jobs and job steps recorded in the job accounting log. The --uid and --gid options enable you to filter the output to report only the jobs and jobsteps from a specific user ID or group ID, respectively.

The default invocation of the sacct command provides the jobstep, the job name, the name of the partition that the job was run on, the number of processes run, their status, and the return value for the processes. Following is an example:

# sacct
Jobstep    Jobname    Partition  Nprocs  Status     Error
---------- ---------- ---------- ------- ---------- -----
2          script23   partn1     2       COMPLETED  0
2.0                   partn1     2       COMPLETED  0
3          script51   partn1     2       COMPLETED  0
3.0                   partn1     2       COMPLETED  0

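When a log holds many jobs, it can help to filter this default output for job steps that did not complete. The following is a minimal sketch, not part of the HP XC tooling: the function name unfinished_steps is hypothetical, and the script assumes the six-column default layout shown in the example above.

```shell
# unfinished_steps: read default sacct output on standard input and print
# the job step, status, and error code of every step that is not COMPLETED.
# Assumption: Status and Error are the last two columns, as in the example.
# The Jobname column can be empty, so fields are counted from the right.
unfinished_steps() {
    awk '
        $1 == "Jobstep" || $1 ~ /^-+$/ { next }   # skip header and dashed rule
        NF < 4 { next }                            # skip blank or truncated lines
        $(NF-1) != "COMPLETED" { print $1, $(NF-1), $NF }
    '
}
```

For instance, `sacct | unfinished_steps` would print nothing for the sample output above, because every job step in it is COMPLETED.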
The sacct command also provides a variety of options that enable you to tailor the output according to your needs. These options include the following:

- --brief
  Displays only the jobstep, status, and error (return value) fields.
- --long
  Displays a lengthier list, including the jobstep, the processor time in user space, the processor time in system space, the number of processes used, the total number of processors allocated to the job, the elapsed time, status, and error fields.
- --jobs
  Displays only the information on a specified job or list of jobs.
- --state
  Displays only the information on jobs that are in a given state (running, completed, and so on).
- --total
  Displays only the cumulative statistics for each job.
- --file=file
  Examines the job accounting data from the specified file instead of the job accounting log.
- --fields
  Displays only the specified statistics.

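The options above can be combined on one command line. The wrappers below are a sketch only: the function names are hypothetical, the job ID, user ID, and file arguments are supplied by the caller, and the `--option=value` spelling for --jobs and --uid is assumed to match the `--file=file` form shown in the list. Check sacct(1) on your system before relying on it.

```shell
# Hypothetical convenience wrappers; only the sacct options themselves
# come from the option list above.

# Lengthier statistics for a single job, for example: job_detail 2
job_detail() {
    sacct --long --jobs="$1"
}

# Brief status report for the jobs of one user ID (run as the superuser).
user_summary() {
    sacct --brief --uid="$1"
}

# Cumulative per-job statistics read from a file other than the
# configured job accounting log, such as an archived copy.
job_totals_from() {
    sacct --total --file="$1"
}
```

Wrapping each combination in a function keeps the option spelling in one place, so it is easy to adjust if your sacct version differs.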
For more information, see sacct(1).

Disabling Job Accounting
Job accounting is turned on by default. Note that job accounting is required if you are using LSF-HPC. Follow this procedure to turn off job accounting:

1. Log in as the superuser on the SLURM server (see “Configuring SLURM Servers”).
2. Use the text editor of your choice to edit the /hptc_cluster/slurm/etc/slurm.conf file as follows:
   a. Locate the parameter JobAcctType.
   b. Set the value of JobAcctType to disable job accounting as follows:
      JobAcctType=jobacct/none
   c. Verify that this portion of the slurm.conf file resembles the following; the JobAcctType line is the one you changed:
      .
      .
      .
      #
      # o Define the job accounting mechanism
      #
      JobAcctType=jobacct/none
      #
      # o Define the location where job accounting logs are to
      # be written. For
      # - jobacct/none - this parameter is ignored
      # - jobacct/log - the fully-qualified file name
      # for the data file
      #
      JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log
      .
      .
      .
3. Save the file.
4. Restart the slurmctld and slurmd daemons:
   # cexec -a "service slurm restart"

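If you prefer to script the edit rather than use a text editor, the JobAcctType change can be sketched as a small shell function. This is an illustration, not an HP-supplied tool: the name set_jobacct_type and the .bak backup suffix are assumptions, and the function changes only an existing, uncommented JobAcctType line.

```shell
# set_jobacct_type FILE VALUE
# Rewrite the JobAcctType line in FILE (VALUE is, for example, jobacct/none
# or jobacct/log) and keep the previous contents in FILE.bak.
# Hypothetical helper; VALUE must not contain "|" or "&".
set_jobacct_type() {
    conf=$1
    value=$2
    # Stop if the parameter is absent rather than guess where to add it.
    if ! grep -q '^JobAcctType=' "$conf"; then
        echo "JobAcctType not found in $conf" >&2
        return 1
    fi
    cp -p "$conf" "$conf.bak" || return 1
    # "|" is the sed delimiter because the value contains "/".
    sed "s|^JobAcctType=.*|JobAcctType=$value|" "$conf.bak" > "$conf"
}
```

For example, `set_jobacct_type /hptc_cluster/slurm/etc/slurm.conf jobacct/none` performs the edit described in the procedure. The daemons must still be restarted afterward for the change to take effect.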
Configuring Job Accounting
Job accounting is turned on by default. If job accounting has been disabled, follow this procedure to turn job accounting back on and to designate the log file that collects the system data:

1. Log in as the superuser on the SLURM server (see “Configuring SLURM Servers”).
2. Choose a log file to collect the job accounting data. You can keep this log on a single node, or place it in the /hptc_cluster directory so that all nodes can access it. However, this log file must be accessible to the following:
   - Nodes that run the slurmctld daemon
   - Any node from which you execute the sacct command
   This example uses the file /hptc_cluster/slurm/job/jobacct.log, which is the default and recommended file. Ensure that the configured job accounting log directory exists.
3. Determine the value for the Job Account Polling Frequency parameter. You can set this parameter to a value indicating the polling interval (in seconds). The default value is 30. For jobs that run for a very short time (that is, shorter than the polling interval), the values of psize and vsize may be sampled before the program has expanded to its normal working size, so the majority of the jobs run should have longer run times than this value. Setting this value to 0 causes the sacct command always to report values of 0 for psize and vsize.
4. Use the text editor of your choice to edit the /hptc_cluster/slurm/etc/slurm.conf file as follows:
   a. Locate the parameter JobAcctType.
   b. Set the value of JobAcctType to enable job accounting as follows:
      JobAcctType=jobacct/log
   c. Locate the parameter JobAcctLoc.
   d. Set the value of the JobAcctLoc parameter to designate the log file you chose in step 2:
      JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log
   e. Set the value of the JobAcctParameters parameter to designate the polling frequency you chose in step 3; this example uses a frequency of 10 seconds:
      JobAcctParameters="Frequency=10"
      You can also set the following additional parameters:
      - MaxSendRetries
        The number of times an accounting message is retried. The default value is 3.
      - MaxSendRetryDelay
        The maximum number of seconds to pause before sending an accounting message. The actual delay is a random value between 1 and this value. The default value is 5 seconds.
      - StaggerSlotSize
        Generally, the increment of time a process pauses before sending its message. For n tasks, an equal number of staggered time slots are defined in increments of (StaggerSlotSize * 0.001) seconds. The first task sends its message immediately; the second task pauses one increment before sending its message; the third task pauses two increments before sending its message; and so on. The default value of this parameter is 1.
      If you change the values of any of these parameters, assign them as a comma-separated list on a single line, enclosed in quotation marks, as shown here:
      JobAcctParameters="Frequency=10,MaxSendRetries=5,StaggerSlotSize=2"
   f. Verify that this portion of the slurm.conf file resembles the following; the JobAcctType, JobAcctLoc, and JobAcctParameters lines are the ones you changed:
      .
      .
      .
      #
      # o Define the job accounting mechanism
      #
      JobAcctType=jobacct/log
      #
      # o Define the location where job accounting logs are to
      # be written. For
      # - jobacct/none - this parameter is ignored
      # - jobacct/log - the fully-qualified file name
      # for the data file
      #
      JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log
      JobAcctParameters="Frequency=10"
      .
      .
      .
5. Save the file.
6. Restart the slurmctld and slurmd daemons:
   # cexec -a "service slurm restart"
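
The StaggerSlotSize arithmetic described in the procedure can be worked through with a short shell function. This sketch only evaluates the stated formula (task i, counting from 0, pauses i * StaggerSlotSize * 0.001 seconds); the function name stagger_delays is hypothetical, and the description of the parameter is itself qualified with "generally", so treat the figures as approximate.

```shell
# stagger_delays NTASKS SLOTSIZE
# Print the pause, in seconds, before each task sends its accounting
# message, using the increment (SLOTSIZE * 0.001) seconds per time slot.
# The first task (task 0) sends immediately.
stagger_delays() {
    awk -v n="$1" -v slot="$2" 'BEGIN {
        for (i = 0; i < n; i++)
            printf "task %d: %.3f s\n", i, i * slot * 0.001
    }'
}
```

With StaggerSlotSize=2, as in the JobAcctParameters example, `stagger_delays 3 2` reports pauses of 0.000, 0.002, and 0.004 seconds for the first three tasks.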