HP XC System Software : Administration Guide > Chapter 12 Managing SLURM

Job Accounting

SLURM on the HP XC system can collect job accounting data, store it in a log file, and display it. The accounting data is available only at the conclusion of a job; you cannot obtain it in real time while the job is running. You must enable this SLURM job accounting capability to support LSF-HPC on HP XC. LSF-HPC is discussed in the next chapter.

This section briefly describes the sacct command, which you can use to display the stored job accounting data, and explains how to disable and configure job accounting.

Information on jobs that are invoked with the SLURM srun command is logged automatically into the job accounting file. Entries in the slurm.conf file enable job accounting and designate the name of the job accounting log file. The default (and recommended) job accounting log file is /hptc_cluster/slurm/job/jobacct.log.
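To see which accounting settings are currently in effect, you can search the slurm.conf file for these two entries. The following is a minimal sketch: the helper name show_jobacct_settings is hypothetical, and the path in the usage comment is the slurm.conf location that this section edits later.

```shell
# Print the active (uncommented) JobAcctType and JobAcctLoc entries
# from a slurm.conf file. show_jobacct_settings is an illustrative
# helper name, not part of SLURM or HP XC.
show_jobacct_settings() {
    conf_file="$1"
    grep -E '^[[:space:]]*(JobAcctType|JobAcctLoc)=' "$conf_file"
}

# Usage (run on the SLURM server):
#   show_jobacct_settings /hptc_cluster/slurm/etc/slurm.conf
```

Commented-out entries are deliberately skipped, so an empty result suggests that neither parameter is set explicitly.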

SLURM job accounting attempts to gather all the statistics available on the systems on which it is run. The following statistics are valid for HP XC systems:

  • User processor time

  • System processor time

  • Maximum number of minor page faults (page reclaims) for any process

  • Maximum number of major page faults for any process

  • Total number of processes used

  • Total number of processors allocated to the job

  • Job's elapsed time

  • Job status or state (running, completed, failed, timed out, or node fail)

  • First nonzero error code returned by jobstep

  • Sum of system processor time and user processor time

Note:

These statistics are gathered after each jobstep completes.

Using the sacct Command

The sacct command enables you to analyze the system accounting data collected in the job accounting log file.

As the superuser, you can examine the accounting data for all jobs and job steps recorded in the job accounting log. The --uid and --gid options enable you to filter the output to report only the jobs and jobsteps from a specific user ID or group ID, respectively.
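As a rough illustration of this filtering, the two wrappers below pass a user ID or group ID to sacct. The option names come from this section; the "=value" form, the wrapper names, and the numeric IDs in the usage comment are assumptions, so confirm the exact syntax in sacct(1).

```shell
# Show accounting records for a single user ID.
# sacct_for_uid and sacct_for_gid are illustrative wrapper names.
sacct_for_uid() {
    sacct --uid="$1"
}

# Show accounting records for a single group ID.
sacct_for_gid() {
    sacct --gid="$1"
}

# Usage (as the superuser; 1001 and 500 are placeholder IDs):
#   sacct_for_uid 1001
#   sacct_for_gid 500
```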

The default invocation of the sacct command provides the jobstep, the job name, the name of the partition that the job was run on, the number of processes run, their status, and the return value for the processes. Following is an example:

# sacct
Jobstep    Jobname    Partition   Nprocs Status     Error
---------- ---------- ---------- ------- ---------- -----
2          script23   partn1           2 COMPLETED      0
2.0                   partn1           2 COMPLETED      0
3          script51   partn1           2 COMPLETED      0
3.0                   partn1           2 COMPLETED      0
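Because this output is a plain whitespace-separated table, standard text tools can post-process it. The sketch below lists the jobsteps whose Error value is nonzero. It assumes the column layout shown above (two header lines, Jobstep first, Status second to last, Error last); failed_jobsteps is a hypothetical helper name.

```shell
# Read default sacct output on standard input and print the Jobstep
# and Status columns of every row whose Error value is nonzero.
# Counting fields from the end keeps rows with an empty Jobname aligned.
failed_jobsteps() {
    awk 'NR > 2 && NF >= 3 && $NF != 0 { print $1, $(NF - 1) }'
}

# Usage:
#   sacct | failed_jobsteps
```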

The sacct command also provides a variety of options that enable you to tailor the output according to your needs. These options include the following:

--brief

Displays only the jobstep, status, and error (return value) fields.

--long

Displays a lengthier list, including the jobstep, the processor time in user space, the processor time in system space, the number of processes used, the total number of processors allocated to the job, the elapsed time, status, and error fields.

--jobs

Displays only the information on a specified job or list of jobs.

--state

Displays only the information on jobs that are in a given state (running, completed, and so on).

--total

Displays only the cumulative statistics for each job.

--file=file

Examines the job accounting data from the specified file instead of the job accounting log.

--fields

Displays only the specified statistics.

For more information, see sacct(1).
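The options can be combined. The wrappers below are a sketch of a few plausible combinations. The option names are taken from the list above, while the argument formats (a comma-separated job list, an uppercase state name) and the wrapper names are assumptions to verify against sacct(1) on your system.

```shell
# Brief report for selected jobs, for example "2,3".
brief_for_jobs() {
    sacct --brief --jobs="$1"
}

# Long report restricted to one job state, for example COMPLETED.
long_for_state() {
    sacct --long --state="$1"
}

# Cumulative per-job statistics read from an alternate accounting file
# instead of the configured job accounting log.
totals_from_file() {
    sacct --total --file="$1"
}

# Usage:
#   brief_for_jobs 2,3
#   long_for_state COMPLETED
#   totals_from_file /hptc_cluster/slurm/job/jobacct.log
```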

Note:

The bacct command reports a slightly higher value for a job's runtime than the sacct command does, because LSF-HPC sums the resource usage values that it and SLURM each report.

Disabling Job Accounting

Job accounting is turned on by default. Note that job accounting is required if you are using LSF. Follow this procedure to turn off job accounting:

  1. Log in as the superuser on the SLURM server (see “Configuring SLURM Servers”).

  2. Use the text editor of your choice to edit the /hptc_cluster/slurm/etc/slurm.conf file as follows:

    1. Locate the parameter JobAcctType.

      JobAcctType=jobacct/log
    2. Set the value of JobAcctType to disable job accounting as follows:

      JobAcctType=jobacct/none
    3. Verify that this portion of the slurm.conf file resembles the following (the changed entry is JobAcctType):

      .
      .
      .
      #
      # o Define the job accounting mechanism
      #
      
      JobAcctType=jobacct/none
      
      #
      # o Define the location where job accounting logs are to
      #   be written. For
      #    - jobacct/none - this parameter is ignored
      #    - jobacct/log  - the fully-qualified file name 
      #                     for the data file
      #
      JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log
      .
      .
      .
    4. Save the file.

  3. Restart the slurmctld and slurmd daemons:

    # cexec -a "service slurm restart"
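The edit in step 2 can also be scripted. The following is a minimal sketch, not an HP-supplied tool: disable_jobacct is a hypothetical helper that rewrites an existing, uncommented JobAcctType line and keeps a backup copy with a .bak suffix (the suffix is an arbitrary choice).

```shell
# Set JobAcctType=jobacct/none in a slurm.conf file, keeping a backup
# copy next to it. Only an existing, uncommented JobAcctType line is
# rewritten; all other lines are copied through unchanged.
disable_jobacct() {
    conf_file="$1"
    cp -p "$conf_file" "$conf_file.bak" || return 1
    sed 's|^JobAcctType=.*|JobAcctType=jobacct/none|' \
        "$conf_file.bak" > "$conf_file"
}

# Usage (as the superuser on the SLURM server), followed by the
# restart from step 3:
#   disable_jobacct /hptc_cluster/slurm/etc/slurm.conf
#   cexec -a "service slurm restart"
```

Writing back through a redirect, rather than replacing the file, leaves the ownership and permissions of slurm.conf as they were.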

Configuring Job Accounting

Job accounting is turned on by default. If job accounting has been disabled, follow this procedure to turn job accounting back on and to designate the log file that collects the system data:

  1. Log in as the superuser on the SLURM server (see “Configuring SLURM Servers”).

  2. Choose a log file to collect the job accounting data.

    Note:

    You must specify an absolute path name for the log file; it must begin with the / character.

    You can place this log file on a single node or in the /hptc_cluster directory so that all nodes can access it. However, this log file must be accessible to the following:

    • Nodes that run the slurmctld daemon

    • LSF

    • Any node from which you execute the sacct command

    Note:

    Ensure that the log file is located on a file system with adequate storage to avoid file system full conditions.

    This example uses the file /hptc_cluster/slurm/job/jobacct.log, which is the default and recommended file.

    Ensure that the configured job accounting log directory exists.

  3. Determine the value for the Job Account Polling Frequency parameter.

    For jobs that run for a very short time (that is, shorter than the polling frequency), the values of psize and vsize may be sampled before the program has expanded to its normal working size.

    You can set this parameter to the polling interval, in seconds. Choose a value that is shorter than the run time of most of your jobs.

    The default value is 30.

    Setting this value to 0 causes the sacct command always to report values of 0 for psize and vsize.

  4. Use the text editor of your choice to edit the /hptc_cluster/slurm/etc/slurm.conf file as follows:

    1. Locate the parameter JobAcctType:

      JobAcctType=
    2. Set the value of JobAcctType to enable job accounting as follows:

      JobAcctType=jobacct/log
    3. Locate the parameter JobAcctLoc:

      JobAcctLoc=
    4. Set the value of the JobAcctLoc parameter to designate the log file you chose in step 2:

      JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log
      Note:

      Ensure that the leading hash character (#), which designates a comment line, is removed.

    5. Set the value of the JobAcctParameters parameter to designate the polling frequency you determined in step 3; this example uses a frequency of 10 seconds:

      JobAcctParameters="Frequency=10"

      You can also set the following additional parameters:

      MaxSendRetries

      The number of times the sending of an accounting message is retried. The default value is 3.

      MaxSendRetryDelay

      The maximum number of seconds to pause before sending an accounting message. The actual delay is a random value between 1 and this value. The default value is 5 seconds.

      StaggerSlotSize

      Generally, the increment of time a process pauses before sending its message. For n tasks, n staggered time slots are defined in increments of (StaggerSlotSize * 0.001) seconds. The first task sends its message immediately; the second task pauses one increment before sending its message; the third task pauses two increments before sending its message; and so on. The default value of this parameter is 1.

      If you change the values of any of these parameters, assign them in a single comma-separated list enclosed in quotation marks, as shown here:

      JobAcctParameters="Frequency=10,MaxSendRetries=5,StaggerSlotSize=2"
    6. Verify that this portion of the slurm.conf file resembles the following (the changed entries are JobAcctType, JobAcctLoc, and JobAcctParameters):

      .
      .
      .
      #
      # o Define the job accounting mechanism
      #
      JobAcctType=jobacct/log
      #
      # o Define the location where job accounting logs are to
      #   be written. For
      #    - jobacct/none - this parameter is ignored
      #    - jobacct/log  - the fully-qualified file name 
      #                     for the data file
      #
      JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log
      JobAcctParameters="Frequency=10"
      .
      .
      .
    7. Save the file.

  5. Restart the slurmctld and slurmd daemons:

    # cexec -a "service slurm restart"
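Steps 2 through 4 can be sketched as a small script. This is an illustration under stated assumptions, not an HP-supplied tool: enable_jobacct is a hypothetical helper, it rewrites existing entries only (it does not add missing ones), and it assumes the log path contains no "|" or "&" characters, which have special meaning in the sed expressions.

```shell
# Enable job accounting in a slurm.conf file: set JobAcctType,
# JobAcctLoc, and the polling frequency, removing a leading "#" from
# those entries if present. A backup copy is kept with a .bak suffix.
enable_jobacct() {
    conf_file="$1"   # for example /hptc_cluster/slurm/etc/slurm.conf
    log_file="$2"    # absolute path to the job accounting log
    frequency="$3"   # polling interval in seconds

    # The log file must be given as an absolute path.
    case "$log_file" in
        /*) ;;
        *) echo "log file must be an absolute path" >&2; return 1 ;;
    esac

    # Make sure the job accounting log directory exists.
    mkdir -p "$(dirname "$log_file")" || return 1

    cp -p "$conf_file" "$conf_file.bak" || return 1
    sed -e "s|^#*[[:space:]]*JobAcctType=.*|JobAcctType=jobacct/log|" \
        -e "s|^#*[[:space:]]*JobAcctLoc=.*|JobAcctLoc=$log_file|" \
        -e "s|^#*[[:space:]]*JobAcctParameters=.*|JobAcctParameters=\"Frequency=$frequency\"|" \
        "$conf_file.bak" > "$conf_file"
}

# Usage (as the superuser on the SLURM server), followed by the
# restart from step 5:
#   enable_jobacct /hptc_cluster/slurm/etc/slurm.conf \
#       /hptc_cluster/slurm/job/jobacct.log 10
#   cexec -a "service slurm restart"
```

Review the resulting file against the listing in step 4 before restarting the daemons, because the script does not check that all three entries were present.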
© 2003 Hewlett-Packard Development Company, L.P.