HP XC System Software : Administration Guide > Chapter 17 Troubleshooting

LSF-HPC Troubleshooting

Take the following steps if you have trouble submitting jobs or controlling LSF-HPC:

  • Ensure that the number of nodes in the lsf partition is less than or equal to the number of nodes reported in the XC.lic file. Sample entries follow:

    INCREMENT XC-CPUS Compaq auth.number exp. date nodes ... 
    INCREMENT XC-PROCESSORS Compaq auth.number exp. date nodes ... 

    The value for nodes in the XC-CPUS or XC-PROCESSORS entry specifies the number of licensed nodes for this system. If this value does not match the number of actual nodes, the LSF service may fail to start.

    Use the lshosts command to determine the number of processors reported by LSF.

  • Ensure that the date is synchronized throughout the HP XC system.

  • Verify that the /hptc_cluster directory (file system) is properly mounted on all nodes. SLURM relies on this file system.

  • Ensure that SLURM is configured, up, and running properly.

  • Examine the SLURM log files in the /var/slurm/log/ directory on the SLURM master node for any problems.

  • If the sinfo command reports that a node is down even though its daemons are running, compare the number of processors available on the node with the Procs setting in the slurm.conf file.

  • Ensure that the lsf partition is configured correctly.

  • Verify that the system licensing is operational. Use the lmstat -a command.

  • Ensure that munge is running on all compute nodes.

  • If you are experiencing LSF communication problems, check for potential firewall issues.

  • When LSF-HPC failover is disabled and the LSF execution host (which is not the head node) goes down, issue the controllsf command to restart LSF-HPC on the HP XC system:

    # controllsf start

  • When failover is enabled, you need to intervene only when the primary LSF execution host is not started on HP XC system startup (when the startsys command is run). Use the controllsf command to restart LSF-HPC.

    # controllsf start

  • When starting LSF-HPC after a partial system shutdown, LSF is started on the head node if:

    • LSF failover is enabled.

    • The head node has the "resource management" role and no other resource management node is up.

    • The head node has the "resource management" role and the enable headnode preferred subcommand is set.

    • LSF-HPC was not shut down cleanly, perhaps as a result of running startsys without running service lsf stop or controllsf stop on the head node.

    • The other resource management nodes are unavailable.

  • LSF-HPC failover may select the node that it just released.

    LSF-HPC failover attempts to ensure that a different node is used after it removes control from the present node. However, if all other options are exhausted, LSF-HPC failover tries the current node again before giving up.

    If you are trying to perform load balancing, log in to the primary LSF execution host node and execute the controllsf start here command from that node.
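
The license check in the first step above can be sketched as a pair of small shell helpers. This is a minimal sketch, not part of the HP XC software: the function names licensed_nodes and check_license_nodes are hypothetical, and the sketch assumes that the node count is the sixth whitespace-separated field of the INCREMENT entry, as the sample entries suggest (with the expiration date written as a single token). The location of the XC.lic file is not stated here, so the path is passed as an argument.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of HP XC or LSF-HPC.

# Print the licensed node count from an XC.lic-style file.
# Assumption: the count is field 6 of the first XC-CPUS or
# XC-PROCESSORS INCREMENT entry.
licensed_nodes() {
    awk '$1 == "INCREMENT" && ($2 == "XC-CPUS" || $2 == "XC-PROCESSORS") { print $6; exit }' "$1"
}

# Succeed when the lsf partition does not exceed the licensed node count.
# $1 = number of nodes in the lsf partition
# $2 = number of licensed nodes
check_license_nodes() {
    # Reject empty or non-numeric arguments before comparing.
    case "$1" in ''|*[!0-9]*) echo "invalid partition node count: '$1'" >&2; return 2 ;; esac
    case "$2" in ''|*[!0-9]*) echo "invalid licensed node count: '$2'" >&2; return 2 ;; esac

    if [ "$1" -le "$2" ]; then
        echo "ok: $1 partition nodes, $2 licensed"
    else
        echo "problem: $1 partition nodes exceed $2 licensed" >&2
        return 1
    fi
}

# Example call (the node count 16 and the path are placeholders):
#   check_license_nodes 16 "$(licensed_nodes /path/to/XC.lic)"
```

The partition node count is supplied by hand in this sketch, for example after reading it from the sinfo output for the lsf partition. A nonzero exit status indicates that the license entry deserves a closer look before restarting LSF-HPC.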

© 2003 Hewlett-Packard Development Company, L.P.