The Platform Load Sharing Facility for High Performance Computing (LSF-HPC) product is installed and configured as an embedded component of the HP XC system during installation. This product has been integrated with SLURM to provide a comprehensive high-performance workload management solution for the HP XC system.

This section describes the LSF-HPC product, its installation, and its operation on the HP XC system with SLURM, and explains the subtle differences between this product and the Platform Standard LSF product.

See “Troubleshooting” for information on LSF-HPC troubleshooting. See “Installing LSF-HPC for SLURM into an Existing Standard LSF Cluster” for information on extending the LSF-HPC cluster.

Integration of LSF-HPC with SLURM
LSF-HPC acts primarily as the workload scheduler and node allocator running on top of SLURM. SLURM provides a job execution and monitoring layer for LSF-HPC. LSF-HPC uses SLURM interfaces to perform the following:

- To query system topology information for scheduling purposes.
- To create allocations for user jobs.
- To dispatch and launch user jobs.
- To monitor user job status.
- To signal user jobs and cancel allocations.
- To gather user job accounting information.

The major difference between LSF-HPC and Standard LSF is that LSF-HPC daemons run on only one node in the HP XC system; that node is known as the LSF execution host. The LSF-HPC daemons rely on SLURM to provide information on the other computing resources (nodes) in the system. The LSF-HPC daemons consolidate this information into one entity, such that these daemons present the HP XC system as one virtual LSF host.

Example 13-1 shows how to use the controllsf command to determine which node is the LSF-HPC execution host.

Example 13-1 Determining the LSF-HPC Execution Host
# controllsf show current
LSF is currently running on n16, and assigned to n16
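
If you need the execution host name in a script, the output line shown in Example 13-1 can be parsed. The following Python sketch is an illustration only and is not part of the HP XC software; it assumes the output has exactly the wording shown in Example 13-1, and the function name current_lsf_host is hypothetical.

```python
import re

def current_lsf_host(output):
    """Return (running_on, assigned_to) parsed from `controllsf show current` text.

    Assumes the wording shown in Example 13-1; returns None if it does not match.
    """
    match = re.search(r"running on (\S+?),\s+and assigned to (\S+)", output)
    return match.groups() if match else None

sample = "LSF is currently running on n16, and assigned to n16"
print(current_lsf_host(sample))  # ('n16', 'n16')
```
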
All LSF-HPC administration must be done from the LSF-HPC execution host. You can run the lsadmin and badmin commands only on this host; they are not intended to be run on any other nodes in the HP XC system and may produce false results if they are.

When the LSF-HPC scheduler determines that it is time to dispatch a job, it requests an allocation of nodes from SLURM. After the successful allocation, LSF-HPC prepares the job environment with the necessary SLURM allocation variables, that is, SLURM_JOBID and SLURM_NPROCS. The SLURM_JOBID is a 32-bit integer that uniquely identifies a SLURM allocation in the system; the SLURM_JOBID can be reused.

The job dispatch depends on the type of job:

- For a batch job:
LSF-HPC submits the job to SLURM as a batch job and passively monitors it with the squeue command.
- For an interactive job:
LSF-HPC launches the user's job locally on the LSF-HPC execution host.

An LSF-HPC job starter script for LSF-HPC queues is provided and configured by default on the HP XC system to launch interactive jobs on the first allocated node. This ensures that interactive jobs behave just as they would if they were batch jobs. The job starter script is discussed in more detail in “Job Starter Scripts”.

The environment in which the job is launched contains SLURM and LSF-HPC environment variables that describe the job's allocation. SLURM srun commands in the user's job make use of the SLURM environment variables to distribute the tasks throughout the allocation.

The integration of SLURM and LSF-HPC has one drawback: the bsub command's -i option for providing input to the user job is not supported. A workaround is to provide any file input directly to the job. The SLURM srun command supports an --input option (also available in its short form as the -i option) that provides input to all tasks.

LSF-HPC dispatches all jobs locally. The default installation of LSF-HPC for SLURM on the HP XC system provides a job starter script that is configured for use by all LSF-HPC queues. This job starter script adjusts the LSB_HOSTS and LSB_MCPU_HOSTS environment variables to the correct resource values in the allocation. Then, the job starter script uses the srun command to launch the user task on the first node in the allocation.

If this job starter script is not configured for a queue, the user jobs begin execution locally on the LSF-HPC execution host. In this case, it is recommended that the user job use one or more srun commands to make use of the resources allocated to the job. Work done on the LSF-HPC execution host competes for core time with the LSF-HPC daemons, and could affect the overall performance of LSF-HPC on the HP XC system.

The bqueues -l command displays the full queue configuration, including whether or not a job starter script has been configured.
See the Platform LSF documentation or the bqueues(1) manpage for more information on the use of this command.

For example, consider an LSF-HPC configuration in which node n20 is the LSF-HPC execution host and nodes n[1-10] are in the SLURM lsf partition. The default normal queue contains the job starter script, but the unscripted queue does not have the job starter script configured.

Example 13-2 Comparison of Queues and the Configuration of the Job Starter Script
$ bqueues -l normal | grep JOB_STARTER
JOB_STARTER: /opt/hptc/lsf/bin/job_starter.sh
$ bqueues -l unscripted | grep JOB_STARTER
JOB_STARTER:
$ bsub -Is hostname
Job <66> is submitted to the default queue <normal>.
<<Waiting for dispatch...>>
<<Starting on lsfhost.localdomain>>
n10
$ bsub -Is -q unscripted hostname
Job <67> is submitted to queue <unscripted>.
<<Waiting for dispatch...>>
<<Starting on lsfhost.localdomain>>
n20
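
The check in Example 13-2 can be scripted against captured bqueues -l output. The following Python sketch is an illustration under stated assumptions: it assumes the JOB_STARTER line has the form shown in Example 13-2, and the function name job_starter_for is hypothetical rather than part of LSF-HPC.

```python
import re

def job_starter_for(bqueues_output):
    """Return the JOB_STARTER path found in `bqueues -l` text, or None if unset."""
    for line in bqueues_output.splitlines():
        match = re.match(r"\s*JOB_STARTER:\s*(.*)$", line)
        if match:
            value = match.group(1).strip()
            return value or None  # an empty value means no job starter
    return None

# Sample lines copied from Example 13-2
print(job_starter_for("JOB_STARTER: /opt/hptc/lsf/bin/job_starter.sh"))
print(job_starter_for("JOB_STARTER:"))  # None
```

A queue for which this function returns None starts interactive jobs on the LSF-HPC execution host, as the unscripted queue does in Example 13-2.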
This release of the HP XC System Software provides an LSF-HPC queue JOB_STARTER script, which is configured for all default queues during HP XC installation. This JOB_STARTER script performs three tasks:

- It creates an accurate LSB_HOSTS environment variable.
- It creates an accurate LSB_MCPU_HOSTS environment variable.
- It uses a SLURM srun command to launch a user's interactive job on the first allocated compute node.

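
To illustrate the first two tasks: in Standard LSF, LSB_HOSTS lists one host name per allocated slot, and LSB_MCPU_HOSTS lists pairs of host name and slot count. The following Python sketch builds both strings from a hypothetical allocation. It shows only the variable formats and is not the contents of job_starter.sh; the function name build_lsb_vars and the sample allocation are assumptions.

```python
def build_lsb_vars(allocation):
    """Build LSB_HOSTS and LSB_MCPU_HOSTS strings.

    allocation: list of (hostname, cpu_count) pairs in allocation order.
    """
    # LSB_HOSTS repeats each host name once per allocated processor
    lsb_hosts = " ".join(host for host, cpus in allocation for _ in range(cpus))
    # LSB_MCPU_HOSTS lists each host name followed by its processor count
    lsb_mcpu_hosts = " ".join(f"{host} {cpus}" for host, cpus in allocation)
    return lsb_hosts, lsb_mcpu_hosts

hosts, mcpu = build_lsb_vars([("n1", 2), ("n2", 2)])
print(hosts)  # n1 n1 n2 n2
print(mcpu)   # n1 2 n2 2
```
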
The LSB_HOSTS and LSB_MCPU_HOSTS environment variables, as initially established by LSF-HPC, do not accurately reflect the host names of the HP XC system nodes that SLURM allocated for the user's job. This JOB_STARTER script corrects these environment variables so that existing applications compatible with LSF can use them without further adjustment.

The SLURM srun command used by the JOB_STARTER script ensures that every interactive job submitted by a user begins on the first allocated node. Without the JOB_STARTER script, all interactive user jobs would start on the LSF-HPC execution host. This behavior is not consistent with batch job submissions or Standard LSF behavior in general, and creates the potential for a bottleneck in performance as both the LSF-HPC daemons and local user tasks compete for processor cycles.

The JOB_STARTER script has one drawback: all interactive I/O runs through the srun command in the JOB_STARTER script. This means full tty support is not available for interactive sessions, resulting in no prompting when a shell is launched. The workaround is to set your display to support launching an xterm instead of a shell.

The JOB_STARTER script is located at /opt/hptc/lsf/bin/job_starter.sh, and is preconfigured for all of the queues created during the default LSF-HPC installation on the HP XC system. HP recommends that you configure the JOB_STARTER script for all queues.

To disable the JOB_STARTER script, remove it or comment it out from the lsb.queues configuration file. For more information on the JOB_STARTER option and configuring queues, see Administering Platform LSF on the HP XC Documentation CD. For more information on configuring JOB_STARTER scripts and how they work, see the Standard LSF documentation.

The integration of LSF-HPC with SLURM includes the addition of a SLURM-based external scheduler. Users can submit SLURM parameters in the context of their jobs.
This enables users to make specific topology-based allocation requests. See the HP XC System Software User's Guide for more information.

An lsf partition is created in SLURM; this partition contains all the nodes that LSF-HPC manages. This partition must be configured such that only the superuser can make allocation requests (RootOnly=YES). This configuration prevents other users from directly accessing the resources that are being managed by LSF-HPC. The LSF-HPC daemons, running as the superuser, make allocation requests on behalf of the owner of the job to be dispatched. This is how LSF-HPC creates SLURM allocations for users' jobs to be run.

The lsf partition must be configured so that the nodes can be shared by default (Shared=FORCE). Thus, LSF-HPC can allocate serial jobs by different users on a per-processor basis (rather than on a per-node basis) by default, which makes the best use of the resources. This setting also enables LSF-HPC to support preemption by allowing a new job to run while an existing job is suspended on the same resource.

SLURM nodes can be in various states. Table 13-1 describes how LSF-HPC interprets each node state.

Table 13-1 LSF-HPC Interpretation of SLURM Node States

| Node | Description |
|---|---|
| Free | A node that is configured in the LSF-HPC partition and is not allocated to any job. The node is in the following state: |
| IDLE | The node is not allocated to any job and is available for use. |
| In Use | A node in any of the following states: |
| ALLOCATED | The node is allocated to a job. |
| COMPLETING | The node is allocated to a job that is in the process of completing. The node state is removed when all the job processes have ended and the SLURM epilog program (if any) has ended. |
| DRAINING | The node is currently running a job but will not be allocated to additional jobs. The node state changes to state DRAINED when the last job on it completes. |
| Unavailable | A node that is not available for use; its status is one of the following: |
| DOWNED | The node is not available for use. |
| DRAINED | The node is not available for use per system administrator request. |
| UNKNOWN | The SLURM controller has just started and the node state is not yet determined. |

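
The grouping in Table 13-1 can be expressed as a simple lookup. The following Python sketch is an illustration only; the dictionary and function names are hypothetical, and returning None for a state that the table does not list is an assumption rather than documented LSF-HPC behavior.

```python
# SLURM node state -> LSF-HPC interpretation, taken from Table 13-1
LSF_NODE_CATEGORY = {
    "IDLE": "Free",
    "ALLOCATED": "In Use",
    "COMPLETING": "In Use",
    "DRAINING": "In Use",
    "DOWNED": "Unavailable",
    "DRAINED": "Unavailable",
    "UNKNOWN": "Unavailable",
}

def classify(slurm_state):
    """Return the Table 13-1 category for a SLURM node state, or None if unlisted."""
    return LSF_NODE_CATEGORY.get(slurm_state.upper())

print(classify("idle"))      # Free
print(classify("DRAINING"))  # In Use
```
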
LSF-HPC failover is of critical concern because only one node in the HP XC system runs the LSF-HPC daemons. During installation, you select the primary LSF execution host from the nodes on the HP XC system that have the resource management role; although that node could also be a compute node, this is not recommended. Other nodes that also have the resource management role are designated as potential LSF execution host backups.

To address this concern, LSF-HPC is configured on HP XC with a virtual host name (vhost) and a virtual IP (vIP). The virtual IP and host name are used because they can be moved from one node to another while maintaining a consistent LSF interface. By default, the virtual IP is an internal IP on the HP XC administration network, and the virtual host name is lsfhost.localdomain. The LSF execution host is configured to host the vIP, then the LSF-HPC daemons are started on that node.

The Nagios infrastructure contains a module that monitors the LSF-HPC virtual IP. If it detects a problem with the virtual IP (for example, the inability to ping it), the monitoring code assumes the node is down and chooses a new LSF execution host from the backup candidate nodes on which to set up the virtual IP and restart LSF-HPC.

Installation of LSF-HPC on SLURM
When selected, LSF-HPC is automatically installed during cluster_config execution. This installation is optimized for operational scalability and efficiency within the HP XC system. Depending on how you manage your overall LSF cluster file system, this installation is sufficient for adding the HP XC system to an existing LSF cluster. For more information, see “Installing LSF-HPC for SLURM into an Existing Standard LSF Cluster”.

The LSF-HPC tar files to be installed are located in the /opt/hptc/lsf/files directory. Before the installation begins, you are prompted for the following information:

- Primary LSF administrator
This user account is necessary for establishing ownership of the LSF-HPC configuration file. If the lsfadmin user account does not exist, it will be created locally within HP XC. You can configure other LSF administrators after the installation has completed. For more information, see Administering Platform LSF on the HP XC Documentation CD.
- The name of the LSF cluster
This name must be unrelated to any network host name. This name must be unique unless the intent is to add the HP XC system to an existing LSF cluster. In such a case, the name must match the name of the existing LSF cluster. The default name is hptclsf.

After these values are obtained and verified, the LSF-HPC installation runs, installing the appropriate files under /opt/hptc/lsf/top/. On completion, the following post-installation procedures are performed:

LSF-HPC directories are relocated to take advantage of the HP XC file system hierarchy. The location of the LSF-HPC installation is /opt/hptc/lsf/top, which contains four directories:

- conf
The conf directory is moved to /hptc_cluster/lsf/conf; a soft link at /opt/hptc/lsf/top/conf points to the new location.
- work
The work directory is moved to /hptc_cluster/lsf/work; a soft link at /opt/hptc/lsf/top/work points to the new location.
- log
The log directory is moved to /var/lsf/log; a soft link at /opt/hptc/lsf/top/log points to the new location. This ensures that all LSF-HPC logging remains local to the node currently running LSF-HPC.
- 6.1
This directory remains in place and is imaged to each node of the HP XC system.

The SLURM resource is added to the configured LSF execution host.

HP OEM licensing is configured. HP OEM licensing is enabled in LSF-HPC by adding the following string to the LSF-HPC configuration file, /opt/hptc/lsf/top/conf/lsf.conf. This tells LSF-HPC where to look for the shared object to interface with HP OEM licensing.
XC_LIBLIC=/opt/hptc/lib/libsyslic.so

Access to LSF-HPC from every node in the cluster is configured. Configuring all nodes in the HP XC system as LSF-HPC floating client nodes makes LSF-HPC accessible from every node. Two files are edited to perform this configuration:

- LSF_SERVER_HOSTS="lsfhost.localdomain" is added to the lsf.conf configuration file.
- FLOAT_CLIENTS_ADDR_RANGE=172.20 is added on its own line in the Parameters section of the /opt/hptc/lsf/top/conf/lsf.cluster.clustername file.

The FLOAT_CLIENTS_ADDR_RANGE value (in this case 172.20) must be the management network IP address range that is configured for the HP XC system. This value should be equal to the value of nodeBase in the /opt/hptc/config/base_addr.ini file.

The HP XC system /etc/hosts file has an entry for lsfhost.localdomain, which allows the LSF-HPC installation to install itself with the name lsfhost.localdomain. The /opt/hptc/lsf/top/conf/hosts file maps lsfhost.localdomain and its virtual IP to the designated LSF execution host.

An initial LSF-HPC hosts file that maps the virtual host name (lsfhost.localdomain) to an actual node name is provided.

The default LSF-HPC environment is set for all users who log in to the HP XC system. Files named lsf.sh and lsf.csh are added to the /etc/profile.d/ directory; these files source the /opt/hptc/lsf/top/conf/profile.lsf and /opt/hptc/lsf/top/conf/cshrc.lsf files, respectively.

The JOB_ACCEPT_INTERVAL entry in the lsb.params file is set to 0 (zero) to allow more than one job to be dispatched to the LSF execution host per dispatch cycle. If this setting is nonzero, jobs are dispatched at a slower rate.

A soft link from /etc/init.d/lsf to /opt/hptc/sbin/controllsf is created.

A scratch area for LSF-HPC is created in /hptc_cluster/lsf/tmp/. It must be readable and writable by all users.

The controllsf set primary command is invoked with the highest-numbered node that has the resource management role. If this is not done, LSF-HPC starts on the head node even if the head node is not a resource management node.

LSF-HPC Startup and Shutdown
This section discusses starting up and shutting down LSF-HPC.

LSF-HPC is configured to start up automatically when the HP XC system starts up, through the use of the /etc/init.d/lsf script. If LSF-HPC stops running, you can start it with the controllsf command, which searches through a list of nodes with the lsf service until it finds a node on which to run LSF-HPC. Alternatively, you can use the controllsf command to start LSF-HPC on the current node.

At system shutdown, the /etc/init.d/lsf script ensures an orderly shutdown of LSF-HPC. You can also use the controllsf command to stop LSF-HPC regardless of where it is active in the HP XC system. See controllsf(8) for the syntax of these commands.

Controlling the LSF-HPC Service
You can use the service command to start or stop the LSF-HPC service on the HP XC system, or to obtain the system's current status:

- service lsf start
This command is primarily of interest for automated startup. If the current node is the primary LSF execution host, it sets the state to RUNNING, then starts LSF-HPC unless it is already running somewhere on the HP XC system. If the node is not the primary LSF execution host, it ignores the command.
- service lsf stop
This command stops the LSF-HPC environment if it is running on the current node. Invoking this command on the LSF execution host or on the head node shuts down the LSF-HPC environment regardless of where it is running on the HP XC system, and sets the state to SHUT DOWN to prevent any attempt to fail over the LSF-HPC service to another node.
- service lsf status
This command reports the current state (UP or DOWN) of LSF-HPC. This command has the same function as controllsf status.

Load Indexes and Resource Information
LSF-HPC gathers limited resource information and load indexes from the LSF execution host and from its integration with SLURM. Not all indexes are reported because SLURM does not provide the same information that LSF-HPC usually reports.

The LSF lshosts and lsload commands are two common commands for obtaining resource information from LSF-HPC. The lshosts command reports the following resource information:

- ncpus
The total number of available processors within the SLURM lsf partition. This value is the lesser of the total number of processors in all available nodes in the lsf partition and the number of licensed CPUs. If the total number of usable CPUs is 0, LIM sets the value of ncpus to 1 and closes the host.
- maxmem
The minimum value of configured SLURM memory over all the compute nodes, that is, the memory available on the node with the least memory. A job that runs on all nodes must account for this value.
- maxtmp
The minimum value of configured SLURM TmpDisk space over all the compute nodes, that is, the disk space available on the node with the least disk space. A job that runs on all nodes must account for this value.

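
The three values above can be reproduced with a short calculation. The following Python sketch illustrates the rules as described in this section and is not LIM's actual code; the function name lshosts_summary, the dictionary keys, and the licensed CPU count of 64 are assumptions chosen for the illustration.

```python
def lshosts_summary(nodes, licensed_cpus):
    """Summarize per-node SLURM data the way this section describes.

    nodes: list of dicts with 'cpus', 'memory_mb', and 'tmp_mb' for each
    available node in the lsf partition.
    """
    usable = min(sum(n["cpus"] for n in nodes), licensed_cpus)
    closed = usable == 0  # no usable CPUs: report 1 CPU and close the host
    return {
        "ncpus": 1 if closed else usable,
        "closed": closed,
        "maxmem": min((n["memory_mb"] for n in nodes), default=None),
        "maxtmp": min((n["tmp_mb"] for n in nodes), default=None),
    }

# Eleven two-processor nodes, each configured with 2048 MB of memory
nodes = [{"cpus": 2, "memory_mb": 2048, "tmp_mb": 1} for _ in range(11)]
print(lshosts_summary(nodes, licensed_cpus=64))
```

With these inputs the sketch reports 22 processors and 2048 MB of memory.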
The following is an example of the LSF lshosts command.
$ lshosts
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
lsfhost.loc SLINUX6 Opteron8 16.0 22 2048M - Yes (slurm)

The lshosts command reports a hyphen (-) for all the other load index and resource information. Initially, SLURM is not configured with any memory or temporary disk space, so LIM reports the default value of 1 MB for each index. For more information on the lshosts command, see lshosts(1).

The lsload command reports only the ls load index, that is, the number of current login users on the LSF execution host. The following is an example of the LSF lsload command.
$ lsload
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
lsfhost.localdo ok - - - - - 1 - - - -

For more information on the lsload command, see lsload(1).

In these examples, 22 processors on this HP XC system are available for use by LSF-HPC. You can verify this information, which is obtained by LSF-HPC, with the SLURM sinfo command:
$ sinfo --Node --long
NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK ...
xc5n[1-10,16] 11 lsf idle 2 2048 1 ...

The output of the sinfo command shows that 11 nodes are available, and that each node has 2 processors. The LSF lshosts command and the SLURM sinfo command both report the memory for each node as 2,048 MB. This memory value is configured for each node in /hptc_cluster/slurm/etc/slurm.conf; it is not obtained directly from the nodes. See the SLURM documentation for more information on configuring the slurm.conf file.

Launching Jobs with LSF-HPC
You may not submit LSF-HPC jobs as superuser (root). You may find it convenient to run jobs as the local lsfadmin user, for example, to test a new queue configuration.

The LSF-HPC daemons run on one node only: the LSF execution host. Therefore, they can dispatch jobs only on that node. The JOB_STARTER script, described in “Job Starter Scripts”, ensures that user jobs execute on their reserved nodes, and that these jobs do not contend for the LSF execution host.

Consider an HP XC system in which node n120 is the LSF execution host, and nodes n1 through n99 are compute nodes. The following series of examples shows jobs launched without the JOB_STARTER script, with varied results.

Example 13-3 illustrates the launching of a job in its most basic form.

Example 13-3 Basic Job Launch Without the JOB_STARTER Script Configured
$ bsub -I hostname
Job <20> is submitted to default queue <normal>.
<<Waiting for dispatch...>>
<<starting on lsfhost.localdomain>>
n120

Example 13-4 is a similar example, but 20 processors are reserved.

Example 13-4 Launching Another Job Without the JOB_STARTER Script Configured
$ bsub -I -n20 hostname
Job <21> is submitted to default queue <normal>.
<<Waiting for dispatch...>>
<<starting on lsfhost.localdomain>>
n120

In both of the previous examples, processors were reserved but not used. To ensure that a job is launched properly on the reserved nodes without the JOB_STARTER script configured, the user must preface each command with either the srun command or the mpirun -srun command, as shown in Example 13-5 and Example 13-6.

Example 13-5 Launching a Job Successfully Without the JOB_STARTER Script Using srun
$ bsub -I srun hostname
Job <22> is submitted to default queue <normal>.
<<Waiting for dispatch...>>
<<starting on lsfhost.localdomain>>
n99

Example 13-6 Launching a Job Successfully Without the JOB_STARTER Script Using mpirun
$ bsub -I -n4 mpirun -srun hostmpi
Job <23> is submitted to default queue <normal>.
<<Waiting for dispatch...>>
<<starting on lsfhost.localdomain>>
task 0 running on n1
task 1 running on n1
task 2 running on n2
task 3 running on n2

Example 13-7 illustrates launching the same job as in Example 13-3, but with the JOB_STARTER script configured.

Example 13-7 Basic Job Launch with the JOB_STARTER Script Configured
$ bsub -I hostname
Job <24> is submitted to default queue <normal>.
<<Waiting for dispatch...>>
<<starting on lsfhost.localdomain>>
n99

Monitoring and Controlling LSF-HPC Jobs
All the standard LSF commands for monitoring a job are supported. The bjobs command reports the status of a job. The following is an example of the bjobs command:
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
116 lsfadmi RUN normal lsfhost.loc 8*lsfhost.l * sleep 50 date time

You can use the -l (long) option to obtain detailed information about a job, as shown in this example:
$ bjobs -l 116
Job <116>, User <lsfadmin>, Project default, Status <RUN>, Queue <normal>, Co
mmand <srun sleep 50>
date time: Submitted from host <lsfhost.localdomain>, CWD <$HOME>, Ou
tput File <./>, 8 Processors Requested;
date time: Started on 8 Hosts/Processors <8*lsfhost.localdomain>, Exe
cution Home <hptc_cluster/hptc_cluster>, Execution CWD <hptc/hptc;
_cluster/lsf/home>;
date time: slurm_id=7;ncpus=8;slurm_alloc=n[1-4];
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -

Note the output that identifies the SLURM_JOBID and the SLURM allocation:
date time: slurm_id=7;ncpus=8;slurm_alloc=n[1-4];

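
The slurm_id line lends itself to scripted extraction. The following Python sketch parses the fields and expands the bracketed node list. It is an illustration, not an LSF or SLURM tool: the function names are hypothetical, and the node list expansion handles only the simple prefix[ranges] form shown in this section, without zero padding or multiple bracket groups.

```python
import re

def parse_slurm_fields(line):
    """Parse 'slurm_id=7;ncpus=8;slurm_alloc=n[1-4];' into a dict of strings."""
    start = line.find("slurm_id=")
    if start < 0:
        return {}
    fields = {}
    for item in line[start:].split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields

def expand_nodelist(expr):
    """Expand a simple node list such as 'n[1-4]' into individual node names."""
    match = re.fullmatch(r"(\w+?)\[([\d,\-]+)\]", expr)
    if not match:
        return [expr]  # a single node name, for example 'n16'
    prefix, ranges = match.groups()
    names = []
    for part in ranges.split(","):
        low, _, high = part.partition("-")
        for index in range(int(low), int(high or low) + 1):
            names.append(f"{prefix}{index}")
    return names

info = parse_slurm_fields("date time: slurm_id=7;ncpus=8;slurm_alloc=n[1-4];")
print(info["slurm_id"])                      # 7
print(expand_nodelist(info["slurm_alloc"]))  # ['n1', 'n2', 'n3', 'n4']
```
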
You can use the SLURM_JOBID with various SLURM commands. For example, use the squeue command to view information about jobs in the SLURM scheduling queue, and use the scontrol show command to display the state of the job.
$ squeue -j 7
JOBID PARTITION NAME USER ST TIME NODES NODELIST
7 lsf hptclsf@ lsfadmin R 0:14 4 n[1-4]
$ scontrol show job 7
JobId=7 UserId=lsfadmin(502) GroupId=lsfadmin(503)
Name=LSFclustername@LSF_JOBID JobState=RUNNING
Priority=4294901755 Partition=lsf BatchFlag=0
AllocNode:Sid=n16:27450 TimeLimit=UNLIMITED
StartTime=10/11-17:54:05 EndTime=NONE
NodeList=n[1-4] NodeListIndecies=0,3,-1
ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
ReqNodeList=(null) ReqNodeListIndecies=-1
ExcNodeList=(null) ExcNodeListIndecies=-1
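
The Name field in the scontrol output above has the form clustername@jobid, which makes it possible to map a SLURM allocation back to its LSF-HPC job. The following Python sketch shows one way to split that field; the function name split_slurm_job_name and the sample value hptclsf@116 are assumptions made for the illustration.

```python
def split_slurm_job_name(name):
    """Split 'cluster@jobid' into (cluster_name, lsf_job_id)."""
    cluster, separator, job_id = name.rpartition("@")
    if not separator or not cluster or not job_id.isdigit():
        raise ValueError(f"not in cluster@jobid form: {name!r}")
    return cluster, int(job_id)

print(split_slurm_job_name("hptclsf@116"))  # ('hptclsf', 116)
```
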
The Name= field in the output of the scontrol show command contains the name of the LSF cluster (the installation default is hptclsf) and the LSF-HPC job number, separated by the at character (@).

The bhist command reports the history of a job.

After you have gathered information about a job, you can use other useful LSF commands to control LSF-HPC jobs: bkill, bstop, and bresume.

- The bkill command kills a running job. This command uses the SLURM scancel command.
- The bstop command suspends the execution of a running job.
- The bresume command resumes the execution of a suspended job.

For more information, see bkill(1), bstop(1), and bresume(1).

Job Accounting
Standard LSF job accounting using the bacct command is available. The output of a job contains total CPU time and memory usage:
$ cat 231.out
.
.
.
Resource usage summary:
CPU time : 8252.65 sec.
Max Memory : 4 MB
Max Swap : 113 MB
.
.
.

The LSF bacct command provides accurate job accounting data on the following:

- Jobs submitted by all users
- Jobs accounted on all projects
- Jobs completed normally or exited
- Jobs executed on all hosts
- Jobs submitted to all queues
- Jobs accounted on all service classes

Consider using the -l option of the bacct command to display the accounting data in its long format. For more information, see bacct(1).

LSF-HPC Failover
This section discusses aspects of the LSF-HPC failover mechanism.

Overview of LSF-HPC Monitoring and Failover Support

The Nagios LSF-HPC failover module monitors the virtual IP associated with the primary LSF execution host. When LSF-HPC failover is enabled on the HP XC system, and if the primary LSF execution host fails, the Nagios LSF-HPC failover module detects that the node is unresponsive and initiates failover:

- The Nagios module attempts to contact the node hosting the IP to ensure that LSF-HPC for SLURM is shut down and that virtual IP hosting is disabled.
- A new primary LSF execution host is selected from the backup nodes.
- The LSF daemons start on the backup node.
- The Nagios module tries to re-establish the virtual IP on the new node.
- LSF-HPC is restarted on that host.

LSF-HPC monitoring and failover are implemented on the HP XC system as tools that prepare the environment for the LSF-HPC execution host daemons on a given node, start the daemons, then watch the node to ensure that it remains active.

After a standard installation, the HP XC system is initially configured so that:

- LSF-HPC is started on the head node.
- LSF-HPC failover is disabled.
- The Nagios application reports whether LSF-HPC is up, down, or "currently shut down," but takes no action in any case.

The only direct interaction between LSF-HPC and the LSF-HPC monitoring and failover tools occurs at LSF-HPC startup, when the daemons are started in the virtual environment, and at failover, when the existing daemons are shut down cleanly before the virtual environment is moved to a new host.

You have the option of enabling or disabling LSF-HPC failover at any time. For more information, see controllsf(8).

Interplay of LSF-HPC and SLURM

LSF-HPC and SLURM are managed independently; one is not critically affected if the other goes down. SLURM has no dependency on LSF-HPC. LSF-HPC needs SLURM to schedule jobs. If SLURM becomes unresponsive, LSF-HPC drops its processor count to 1 and closes the HP XC virtual host. When SLURM is available again, LSF-HPC adjusts its processor count accordingly and reopens the host.

Assigning the Resource Management Nodes

You assign nodes to both SLURM and LSF-HPC by assigning them the resource management role; this role includes both the lsf and slurm_controller services.

By default, the HP XC resource management system attempts to place the SLURM controller and the LSF execution host on the same node to constrain the use of system resources. If only one node has the resource management role, the LSF-HPC execution daemons and the SLURM control daemon both run on that node.

If two nodes are assigned the resource management role, by default, the first node becomes the primary resource management node, and the second node is the backup resource management node.

If more than two nodes are assigned the resource management role, the first becomes the primary resource management host and the second becomes the backup SLURM host and the first LSF-HPC failover candidate. Additional nodes with the resource management role can serve as LSF-HPC failover nodes if either or both of the first two nodes are down.

Resource management candidate nodes are ordered in ASCII sort order by node name, after the head node, which is taken first.
Example

In this example, nodes n3, n4, n15, and n16 are resource management nodes, and n16 is the head node. The selection list is ordered as shown, and the nodes have the corresponding assignments:

- Node n16 hosts the primary LSF-HPC and SLURM control daemons. (The head node is taken first.)
- Node n15 hosts the backup SLURM control daemon and serves as the first LSF-HPC failover candidate. (The remaining nodes are ASCII-sorted.)
- Node n3 becomes the second choice for LSF-HPC failover.
- Node n4 becomes the third choice for LSF-HPC failover.

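
The ordering in this example follows from plain string comparison, in which n15 sorts before n3 because the character 1 precedes 3. The following Python sketch reproduces the default selection list; it is an illustration of the ordering rule only, and the function name failover_order is hypothetical.

```python
def failover_order(resource_nodes, head_node):
    """Order resource management nodes: head node first, the rest ASCII-sorted."""
    others = sorted(node for node in resource_nodes if node != head_node)
    if head_node in resource_nodes:
        return [head_node] + others
    return others

print(failover_order(["n3", "n4", "n15", "n16"], head_node="n16"))
# ['n16', 'n15', 'n3', 'n4']
```

Because the sort is by character rather than by number, node names with different digit counts may not appear in numeric order.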
You can use the controllsf command to change these assignments.

- controllsf disable headnode preferred
Specifies that the head node should be ordered at the end of the list, rather than at the head.
- controllsf disable slurm affinity
Specifies that HP XC should attempt to place the SLURM and LSF-HPC daemons on separate nodes.
- controllsf set primary nodename
Specifies that LSF-HPC should start on some node other than the head node by default.

You can also change the selection of the primary and backup nodes for the SLURM control daemon by editing the SLURM configuration file, /hptc_cluster/slurm/etc/slurm.conf.

LSF-HPC Failover and Running Jobs

In the event of an LSF-HPC failover, LSF-HPC terminates each job that was previously running. These jobs finish with an exit code of 122. When the HP XC LSF execution host fails and the LSF-HPC daemons are restarted on another node, LSF-HPC cannot monitor the running jobs to determine whether each job is running appropriately or is hung indefinitely.

Ensure that each LSF-HPC queue configured in the lsb.queues file includes 122 as a requeue exit value (the REQUEUE_EXIT_VALUES parameter) so that these jobs are requeued and rerun.

LSF-HPC Monitoring
LSF-HPC is monitored and controlled by Nagios using the check_lsf plug-in. When LSF-HPC is down, the response of the check_lsf plug-in depends on whether LSF-HPC failover is enabled or disabled.

- When LSF-HPC failover is disabled
The check_lsf plug-in returns an immediate failure notification to Nagios.
- When LSF-HPC failover is enabled
The check_lsf plug-in decides if LSF-HPC is supposed to be running. If so, it acquires a list of resource management nodes and tries to restart LSF-HPC on each of those nodes, in turn, until one succeeds, or until the list is exhausted. If successful, the check_lsf plug-in returns an LSF OK - restarted message. If the restart procedure fails, the check_lsf plug-in returns a failure notification.

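
The decision flow described above can be sketched as follows. This Python fragment is a hedged illustration of the logic as this section describes it, not the check_lsf plug-in source. The function name, the try_restart callback, and the "failure" and "no action" return labels are hypothetical; only the LSF OK - restarted message comes from the text. What the plug-in reports when LSF-HPC is not supposed to be running is not stated here, so the "no action" branch is an assumption.

```python
def respond_to_lsf_down(failover_enabled, should_be_running, candidates, try_restart):
    """Model the check_lsf response when LSF-HPC is found to be down.

    candidates: resource management node names, in selection order.
    try_restart: callable taking a node name, returning True on success.
    """
    if not failover_enabled:
        return "failure"            # immediate failure notification
    if not should_be_running:
        return "no action"          # assumption: nothing to restart
    for node in candidates:
        if try_restart(node):       # stop at the first node that succeeds
            return "LSF OK - restarted"
    return "failure"                # candidate list exhausted

# Hypothetical run in which only n3 can host LSF-HPC
attempts = []
def fake_restart(node):
    attempts.append(node)
    return node == "n3"

print(respond_to_lsf_down(True, True, ["n16", "n15", "n3", "n4"], fake_restart))
print(attempts)  # ['n16', 'n15', 'n3']
```
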
LSF Execution Host Failure
Should the node hosting LSF-HPC become unresponsive, the Nagios check_lsf plug-in takes action. Table 13-2 lists the Nagios messages for LSF failover monitor status:
Table 13-2 Nagios Messages for LSF Failover Monitor Status
| Message | Meaning |
|---|---|
| LSF OK - up | The LSF-HPC environment appears to be up and operational on the HP XC system. |
| LSF OK - currently shut down | The LSF-HPC environment has not been started on the HP XC system. |
| LSF CRITICAL - down | LSF-HPC is not running, and LSF-HPC failover is disabled. |
| LSF warning - restarted | The LSF-HPC environment was not running, and should have been; it is being restarted. The message should change to LSF OK - up the next time Nagios is updated. |
| LSF CRITICAL - {message} | An abnormal problem occurred. The {message} text provides useful diagnostic information. |
Enhancing LSF-HPC |  |
You can set environment variables to influence the operation of LSF-HPC in the HP XC system. These environment variables affect the operation directly and set thresholds for LSF-HPC and SLURM interplay.
LSF-HPC Enhancement Settings
Table 13-3 describes the environment variables in the lsf.conf file that you can use to enhance LSF-HPC.
Table 13-3 Environment Variables for LSF-HPC Enhancement (lsf.conf File)
| Environment Variable | Description |
|---|---|
| LSB_RLA_PORT=port_number | This entry specifies the TCP port used for communication between the LSF-HPC allocation adapter (RLA) and the SLURM scheduler plug-in. The default port number is 6883. |
| LSB_RLA_TIMEOUT=seconds | This entry defines the communications timeout between RLA and its clients (for example, sbatchd and the SLURM scheduler plug-in). The default value is 10 seconds. |
| LSB_RLA_UPDATE=seconds | This entry specifies how often the LSF-HPC scheduler refreshes free node information from RLA. The default value is 600 seconds. |
| LSB_RLA_WORKDIR=directory | This entry specifies the directory in which the RLA status file is stored. It enables RLA to recover its original state when it restarts. When RLA first starts, it creates the directory defined by LSB_RLA_WORKDIR if it does not exist, then creates subdirectories for each host. Avoid using /tmp or any other directory that is automatically cleaned up by the system. Unless your installation has restrictions on the LSB_SHAREDIR directory, use the default for LSB_RLA_WORKDIR. The default value is LSB_SHAREDIR/cluster_name/rla_workdir. |
| LSB_SLURM_BESTFIT=Y | This setting selects one of two systemwide scheduling modes:
- first-fit
Under this mode, the scheduler allocates the first free nodes it finds. By default, the HPC SLURM integration works in first-fit mode.
- best-fit
Under this mode, the scheduler applies a set of criteria to choose nodes with minimal capacities that satisfy the job request. Set LSB_SLURM_BESTFIT=Y to switch the scheduler to best-fit mode. In a heterogeneous HP XC system, a best-fit allocation may be preferable for clusters running a mix of serial and parallel jobs. In this context, best fit means "the nodes that minimally satisfy the requirements." Nodes with the maximum number of processors are chosen first. For parallel and serial jobs, the nodes with minimal memory, minimal tmp space, and minimal weight are chosen.
| LSB_SLURM_NONLSF_USE=Y | By default, LSF-HPC periodically terminates unrecognized jobs in the SLURM lsf partition. This entry stops LSF-HPC from doing so. |
| LSF_ENABLE_EXTSCHEDULER=Y\|y | This setting enables external scheduling for LSF-HPC. The default value is Y, which is automatically set by hpcinstall. |
| LSF_HPC_EXTENSIONS="ext_name,..." | This setting enables Platform LSF-HPC extensions. This setting is undefined by default. The following extension names are supported:
- SHORT_EVENTFILE
This compresses long host name lists when event records are written to the lsb.events and lsb.acct files for large parallel jobs. The short host string has the format number_of_hosts*real_host_name. When SHORT_EVENTFILE is enabled, older daemons and commands (pre-LSF Version 6.1) cannot recognize the lsb.acct and lsb.events file format. For example, the original host list record is as follows:
6 "hostA" "hostA" "hostA" "hostA" "hostB" "hostC"
Redundant host names are removed and the host count is changed so that the short host list record is as follows:
3 "4*hostA" "hostB" "hostC"
When LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is set, and LSF reads the host list from the lsb.events or lsb.acct files, the compressed host list is expanded into a normal host list. This setting applies to the following events:
JOB_START: when a normal job is dispatched.
JOB_FORCE: when a job is forced with the brun command.
JOB_CHUNK: when a job is inserted into a job chunk.
JOB_FORWARD: when a job is forwarded to a MultiCluster leased host.
- SHORT_PIDLIST
This shortens the output from the bjobs command to eliminate many of the process IDs (PIDs) for a job. The bjobs command displays only the first ID and a count of the process group IDs (PGIDs) and process IDs for the job. Without the SHORT_PIDLIST setting, the bjobs -l command displays all the PGIDs and PIDs for the job. With SHORT_PIDLIST set, the bjobs -l command displays a count of the PGIDs and PIDs.
- RESERVE_BY_STARTTIME
LSF selects the reservation that gives the job the earliest predicted start time. By default, if multiple host groups are available for reservation, LSF chooses the largest possible reservation based on the number of slots. When backfill is configured, this default can prevent larger jobs from running because their start times are pushed further into the future.
- BRUN_WITH_TOPOLOGY
If a topology request can be satisfied for a brun job, brun preserves the topology request. LSF allocates the resource according to the request and tries to run the job with the requested topology. If allocation fails because the topology request cannot be satisfied, the job is requeued. If BRUN_WITH_TOPOLOGY is not specified, the scheduler ignores the job topology request when it creates an allocation.
| LSF_HPC_NCPU_COND=and\|or | This entry in the lsf.conf file defines how any two LSF_HPC_NCPU_* thresholds are combined. The default value is or. |
| LSF_HPC_NCPU_INCREMENT=increment | This entry in the lsf.conf file defines the upper limit on the number of processors that have changed since the last checking cycle. The default value is 0. |
| LSF_HPC_NCPU_INCR_CYCLES=icycles | This entry specifies the minimum number of consecutive cycles in which the number of processors changed does not exceed LSF_HPC_NCPU_INCREMENT. LSF checks total usable processors every 2 minutes. |
| LSF_HPC_NCPU_THRESHOLD=threshold | This entry specifies the percentage of total usable processors in the LSF partition. The default is 80. |
| LSF_NON_PRIVILEGED_PORTS=Y\|y | Some LSF-HPC communication can occur through privileged ports. This setting disables the use of privileged ports, which helps to ensure system security. By default, LSF daemons and clients running under the root account use privileged ports to communicate with each other. If LSF_NON_PRIVILEGED_PORTS is undefined, and LSF_AUTH is not defined in the lsf.conf file, LSF daemons check the privileged port of the request message to perform authentication. If LSF_NON_PRIVILEGED_PORTS=Y is defined, LSF clients (LSF commands and daemons) do not use privileged ports to communicate with daemons, and LSF daemons do not check the privileged ports of incoming requests to perform authentication. This setting is undefined by default. |
| LSF_SLURM_DISABLE_CLEANUP=y\|Y | This setting disables cleanup of jobs other than LSF jobs running in a SLURM LSF partition. By default, only LSF jobs are allowed to run within a SLURM LSF partition. LSF periodically cleans up any jobs submitted outside of LSF. This cleanup period is defined through LSB_RLA_UPDATE. For example, the following srun job is not submitted through LSF, so it is terminated:
% srun -n 4 -p lsf sleep 100000
srun: error: n13: task[0-1]: Terminated
srun: Terminating job
If LSF_SLURM_DISABLE_CLEANUP=Y is set, this job would be allowed to run. This setting is undefined by default. |
| LSF_SLURM_TMPDIR=path | This setting specifies the LSF tmp directory for HP XC systems. The default LSF_TMPDIR, /tmp, cannot be shared across nodes, so LSF_SLURM_TMPDIR must specify a path that is accessible on all nodes of the HP XC system. LSF_SLURM_TMPDIR affects only the HP XC configuration. It is ignored on other systems in a mixed cluster environment. The location of the LSF tmp directory is determined in the following order: first, LSF_SLURM_TMPDIR, if defined; second, the default shared directory /hptc_cluster/lsf/tmp. The default path is /hptc_cluster/lsf/tmp. |
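The SHORT_EVENTFILE host list format described in Table 13-3 can be illustrated with a small Python sketch. It reproduces the documented example only; it is not LSF code, the function names are hypothetical, and it assumes that only consecutive repetitions of a host name are collapsed.

```python
from itertools import groupby


def compress_hosts(hosts):
    """Collapse runs of identical host names into 'count*name' entries."""
    short = []
    for name, run in groupby(hosts):
        count = len(list(run))
        short.append(f"{count}*{name}" if count > 1 else name)
    return short


def expand_hosts(short):
    """Expand 'count*name' entries back into a normal host list."""
    hosts = []
    for entry in short:
        count, sep, name = entry.partition("*")
        if sep and count.isdigit():
            hosts.extend([name] * int(count))
        else:
            hosts.append(entry)
    return hosts


original = ["hostA", "hostA", "hostA", "hostA", "hostB", "hostC"]
short = compress_hosts(original)
print(len(short), short)  # 3 ['4*hostA', 'hostB', 'hostC']
```

Expanding the short list with expand_hosts returns the original six-entry list, which corresponds to what LSF is described as doing when it reads the event files.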
Table 13-4 describes the environment variables in the lsb.queues file that you can use to enhance LSF-HPC.
Table 13-4 Environment Variables for LSF-HPC Enhancement (lsb.queues File)
| Environment Variable | Description |
|---|---|
| DEFAULT_EXTSCHED=SLURM[options[;options]...] | This entry specifies SLURM allocation options for the queue. The -ext options of the bsub command are merged with the DEFAULT_EXTSCHED options, and -ext options override any conflicting queue-level options set by DEFAULT_EXTSCHED. For example, if DEFAULT_EXTSCHED=SLURM[nodes=2;tmp=100] and a job is submitted with -ext "SLURM[nodes=3;tmp=]", the nodes=3 specification in the -ext option overrides the nodes=2 setting in DEFAULT_EXTSCHED, and the empty tmp= assignment in the -ext option overrides the tmp=100 setting in DEFAULT_EXTSCHED. You can use the DEFAULT_EXTSCHED environment variable in combination with the MANDATORY_EXTSCHED environment variable in the same queue. For example:
-ext "SLURM[nodes=3;tmp=]"
DEFAULT_EXTSCHED=SLURM[nodes=2;tmp=100]
MANDATORY_EXTSCHED=SLURM[contiguous=yes;tmp=200]
LSF-HPC uses the resulting options for scheduling:
SLURM[nodes=3;contiguous=yes;tmp=200]
The nodes=3 specification in the -ext option overrides the nodes=2 setting in DEFAULT_EXTSCHED, and the tmp= specification in the -ext option overrides the tmp=100 setting in DEFAULT_EXTSCHED. MANDATORY_EXTSCHED adds the contiguous=yes setting, and overrides both the tmp= in the -ext option and the tmp=100 in DEFAULT_EXTSCHED with tmp=200. If allocation options are set in DEFAULT_EXTSCHED, and you do not want to specify values for these options, use the keyword with no value in the -ext option of the bsub command. For example, if DEFAULT_EXTSCHED=SLURM[nodes=2], and you do not want to specify any node option at all, use -ext "SLURM[nodes=]". For more information, see bsub(1). |
| MANDATORY_EXTSCHED=SLURM[options[;options]...] | This entry specifies mandatory SLURM allocation options for the queue. The -ext options of the bsub command are merged with the MANDATORY_EXTSCHED options, and MANDATORY_EXTSCHED options override any conflicting job-level options set by the -ext option. For example, if MANDATORY_EXTSCHED=SLURM[contiguous=yes;tmp=200] and a job is submitted with -ext "SLURM[nodes=3;tmp=100]", LSF uses the following resulting options for scheduling:
"SLURM[nodes=3;contiguous=yes;tmp=200]"
You can use the MANDATORY_EXTSCHED environment variable in combination with DEFAULT_EXTSCHED in the same queue. For example:
-ext "SLURM[nodes=3;tmp=]"
DEFAULT_EXTSCHED=SLURM[nodes=2;tmp=100]
MANDATORY_EXTSCHED=SLURM[contiguous=yes;tmp=200]
LSF-HPC uses the resulting options for scheduling:
SLURM[nodes=3;contiguous=yes;tmp=200]
The nodes=3 specification in the -ext option overrides the nodes=2 specification in DEFAULT_EXTSCHED, and tmp= in the -ext option overrides the tmp=100 setting in DEFAULT_EXTSCHED. MANDATORY_EXTSCHED adds contiguous=yes, and overrides both tmp= in the -ext option and tmp=100 in DEFAULT_EXTSCHED with tmp=200. If you want to prevent users from setting allocation options in the -ext option of the bsub command, use the keyword with no value. For example, if the job is submitted with -ext "SLURM[nodes=4]", use MANDATORY_EXTSCHED=SLURM[nodes=] to override this setting. For more information, see bsub(1). |
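The precedence rules in Table 13-4 amount to a three-level merge: queue defaults, then job-level -ext options, then mandatory queue options. The following Python sketch models that merge for the examples above. It is an illustration under stated assumptions, not how LSF-HPC parses these strings: the function names are hypothetical, options are assumed to be simple key=value pairs separated by semicolons, and a keyword with no value is assumed to clear the option.

```python
def parse_opts(spec):
    """Parse 'SLURM[k=v;k=v]' into a dict; a bare 'k=' maps to ''."""
    inner = spec[spec.index("[") + 1 : spec.rindex("]")]
    opts = {}
    for item in inner.split(";"):
        if item.strip():
            key, _, value = item.partition("=")
            opts[key.strip()] = value.strip()
    return opts


def merge_extsched(default="SLURM[]", job="SLURM[]", mandatory="SLURM[]"):
    """Merge options; later levels override earlier ones."""
    merged = parse_opts(default)
    merged.update(parse_opts(job))        # -ext overrides DEFAULT_EXTSCHED
    merged.update(parse_opts(mandatory))  # MANDATORY_EXTSCHED overrides both
    # Assumption: a keyword left with no value is dropped from the result.
    return {k: v for k, v in merged.items() if v != ""}


combined = merge_extsched(
    default="SLURM[nodes=2;tmp=100]",
    job="SLURM[nodes=3;tmp=]",
    mandatory="SLURM[contiguous=yes;tmp=200]",
)
print(combined)
```

The combined result contains nodes=3, contiguous=yes, and tmp=200, matching the documented outcome. Calling merge_extsched(job="SLURM[nodes=4]", mandatory="SLURM[nodes=]") yields an empty option set under the same assumptions.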
Thresholds in LSF-HPC-SLURM Interplay
When the HP XC system starts, some compute nodes may take a while to boot. If LSF-HPC starts to report the current number of processors before the system stabilizes, the smaller jobs that are already queued are scheduled first, even though it may be better to wait and run a larger job that requests more processors. Use the LSF_HPC_NCPU_* parameters described in Table 13-3 to control when LSF-HPC starts to report total usable processors on the HP XC system.
Configuring an External Virtual Host Name for LSF-HPC on HP XC Systems |  |
LSF-HPC on an HP XC system may need an external virtual host name that can be accessed from the external network. This access could be required if the HP XC system is added to an existing LSF cluster, or if the HP XC system is 'Multi-Clustered' with another LSF cluster. See the LSF documentation for more details on LSF Multi-Clusters. Perform the following steps to configure an external virtual host name:
1. If LSF-HPC failover is enabled, ensure that each node with the resource management role has an external network connection. Run the following command to confirm this:
# shownode roles --role resource_management external
resource_management: n[127-128]
external: n[125-128]
2. Obtain a virtual IP address and corresponding host name, and ensure that they are not already in use:
# ping -i 2 -c 2 xclsf
PING xclsf (10.10.123.1): 56 data bytes
--- xclsf ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
# ping -i 2 -c 2 10.10.123.1
PING 10.10.123.1 (10.10.123.1): 56 data bytes
--- 10.10.123.1 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
3. Ensure that no jobs are running. For more information, see “Draining Nodes”.
4. Shut down LSF:
5. Change the virtual host name to the new one and confirm:
# controllsf set virtual hostname xclsf
# controllsf show
LSF is currently shut down, and assigned to node .
Failover is disabled.
Head node is preferred.
The primary LSF host node is n128.
SLURM affinity is enabled.
The virtual hostname is "xclsf".
6. Edit the $LSF_ENVDIR/lsf.cluster.cluster_name file and change the LSF-HPC virtual host name to the new one in the HOSTS section.
7. Edit the $LSF_ENVDIR/hosts file to remove the old LSF-HPC virtual host name entry.
8. Restart LSF-HPC for SLURM:
|
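The check in the first step of the procedure above (every resource management node must also have the external role) can be automated by comparing the two node sets in the shownode output. The following Python sketch parses output in the form shown in that step; the function names are hypothetical, and the comma-separated range syntax it accepts (for example, n[3-4,15]) is an assumption beyond the single-range form that appears in the sample.

```python
import re


def expand_nodes(expr):
    """Expand a node set such as 'n[125-128]' into individual node names."""
    match = re.fullmatch(r"([A-Za-z_]+)\[([0-9,\-]+)\]", expr)
    if not match:
        return {expr}  # a single node name such as 'n16'
    prefix, body = match.groups()
    names = set()
    for part in body.split(","):
        if not part:
            continue
        low, _, high = part.partition("-")
        for index in range(int(low), int(high or low) + 1):
            names.add(f"{prefix}{index}")
    return names


def parse_roles(output):
    """Map each role name in 'role: nodeset' lines to a set of node names."""
    roles = {}
    for line in output.splitlines():
        role, sep, nodes = line.partition(":")
        if sep:
            roles[role.strip()] = expand_nodes(nodes.strip())
    return roles


sample = """resource_management: n[127-128]
external: n[125-128]"""
roles = parse_roles(sample)
missing = roles["resource_management"] - roles["external"]
print(sorted(missing))  # [] means every resource management node is external
```

A non-empty result lists the resource management nodes that lack an external connection and would need one before LSF-HPC failover can use the external virtual host name.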