HP XC System Software : Administration Guide > Chapter 15 Managing LSF

LSF-HPC with SLURM Monitoring

LSF-HPC with SLURM is monitored and controlled by Nagios using the check_lsf plug-in.

When LSF-HPC with SLURM is down, the response of the check_lsf plug-in depends on whether LSF-HPC with SLURM failover is enabled or disabled:

  • When LSF-HPC with SLURM failover is disabled

    The check_lsf plug-in returns an immediate failure notification to Nagios.

  • When LSF-HPC with SLURM failover is enabled

    The check_lsf plug-in determines whether LSF-HPC with SLURM is supposed to be running. If so, it acquires a list of resource management nodes and tries to restart LSF-HPC with SLURM on each of those nodes in turn, until one restart succeeds or the list is exhausted.

    If successful, the check_lsf plug-in returns an LSF OK - restarted message.

    If the restart procedure fails, the check_lsf plug-in returns a failure notification.
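The decision flow described above can be sketched in Python. This is an illustrative model only, not the actual check_lsf plug-in source: the function name, its parameters, the try_restart callback, and the wording of the final failure message are all hypothetical. The other message strings are the ones this section documents.

```python
from typing import Callable, Iterable


def check_lsf_sketch(
    is_running: bool,
    failover_enabled: bool,
    should_be_running: bool,
    candidate_nodes: Iterable[str],
    try_restart: Callable[[str], bool],
) -> str:
    """Return a status message following the flow described in this section.

    All parameters are hypothetical stand-ins for whatever the real
    plug-in uses to query LSF-HPC with SLURM state.
    """
    if is_running:
        return "LSF OK - up"

    # Failover disabled: report the failure immediately.
    if not failover_enabled:
        return "LSF CRITICAL - down"

    # Failover enabled, but LSF-HPC with SLURM was never meant to be running.
    if not should_be_running:
        return "LSF OK - currently shut down"

    # Try each resource management node in turn until one restart succeeds.
    for node in candidate_nodes:
        if try_restart(node):
            return "LSF OK - restarted"

    # List exhausted; the text after the dash is a made-up placeholder.
    return "LSF CRITICAL - restart failed on all resource management nodes"
```

Passing the restart action as a callback keeps the sketch testable without a live system; a real plug-in would invoke the site's own LSF start-up procedure at that point.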

LSF Execution Host Failure

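If the node hosting LSF-HPC with SLURM becomes unresponsive, the Nagios check_lsf plug-in responds as described above: it reports the failure immediately when failover is disabled, or attempts to restart LSF-HPC with SLURM on another resource management node when failover is enabled.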

Table 15-2 lists the Nagios messages for LSF failover monitor status:

Table 15-2 Nagios Messages for LSF-HPC with SLURM Failover Monitor Status

Message                         Meaning

LSF OK - up                     The LSF-HPC with SLURM environment appears to be up and operational on the HP XC system.

LSF OK - currently shut down    The LSF-HPC with SLURM environment has not been started on the HP XC system.

LSF CRITICAL - down             LSF-HPC with SLURM is not running, and LSF-HPC with SLURM failover is disabled.

LSF warning - restarted         The LSF-HPC with SLURM environment was not running, and should have been; it is being restarted. The message changes to LSF OK - up the next time Nagios is updated.

LSF CRITICAL - {message}        An abnormal problem occurred. The {message} text provides useful diagnostic information.
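
When these messages are processed by a script (for example, when filtering logs), it can help to split each one into a severity and a detail string. The following Python sketch does that. It assumes that the word after "LSF" corresponds to the standard Nagios plug-in states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); whether check_lsf actually exits with the matching code is an assumption, and the names classify and NAGIOS_STATES are hypothetical.

```python
import re
from typing import Tuple

# Standard Nagios plug-in return codes.
NAGIOS_STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# Matches "LSF <LEVEL> - <detail>", as in the messages of Table 15-2.
_PATTERN = re.compile(r"^LSF\s+(\w+)\s+-\s+(.*)$")


def classify(message: str) -> Tuple[int, str]:
    """Map a check_lsf-style message to (assumed Nagios state, detail text)."""
    text = message.strip()
    match = _PATTERN.match(text)
    if match is None:
        # Unrecognized format: treat as UNKNOWN and keep the whole text.
        return NAGIOS_STATES["UNKNOWN"], text
    level = match.group(1).upper()  # the table writes "warning" in lowercase
    detail = match.group(2)
    return NAGIOS_STATES.get(level, NAGIOS_STATES["UNKNOWN"]), detail


if __name__ == "__main__":
    for msg in ("LSF OK - up", "LSF warning - restarted", "LSF CRITICAL - down"):
        print(msg, "->", classify(msg))
```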

 

© 2003 Hewlett-Packard Development Company, L.P.