HP XC System Software : Installation Guide > Chapter 5 Verifying the System

Task 1: Run the Operation Verification Program (OVP)

The OVP verifies the major HP XC system components to provide confidence that the software has been installed correctly and that the system is operational.

The OVP performs various tests to verify the following:

  • The interconnect is functional.

  • Network connectivity has been established.

  • The administration network is operational.

  • A valid license key file is installed and the license manager servers are up.

  • All compute nodes are responding and are available to run applications.

  • SLURM control daemons are responding and partitioning is valid.

  • LSF-HPC with SLURM or standard LSF is up and running.

  • The MPI compiler is functional and an MPI job can be launched and executed (the xring test).

  • Serial and parallel applications can be submitted and executed through standard LSF or LSF-HPC with SLURM from all compute nodes.

Follow this procedure to test your HP XC system:

  1. Begin this procedure as the root user on the head node.

  2. Start the verification procedure with only the --verbose option to test the entire system:

    # ovp --verbose

    Command output is similar to the following. This example output was taken from a small system that has LSF-HPC with SLURM installed. Output looks different on a system with standard LSF.

    XC CLUSTER VERIFICATION PROCEDURE
    Tue Dec 20 09:41:02 2005
    
    Verify connectivity:
    
        Testing etc_hosts_integrity ...
    
    	There are 19 IP addresses to ping.
    	
    	A total of 19 addresses were pinged.
    	
    	Test completed successfully.  All IP addresses were reachable.
    
            +++ PASSED +++
    
    Verify client_nodes:
    
        Testing hptc_cluster_mount ...
    
    	Determining nodes that provide the hptc_cluster_fs_client service.
    	6 nodes provide the hptc_cluster_fs_client service.
    	
    	Now checking if /hptc_cluster is mounted on those nodes.
    	
            +++ PASSED +++
    
    Verify time_synchronism:
    
        Testing  ...
    
    	Comparing time on all nodes with time on head node.
    	
    	n11 time diff 0 ok.
    	n12 time diff 0 ok.
    	n13 time diff 0 ok.
    	n14 time diff 0 ok.
    	n15 time diff 0 ok.
    
            +++ PASSED +++
    
    Verify license:
    
        Testing file_integrity ...
    
    	Checking license file: /opt/hptc/etc/license/XC.lic
    
            +++ PASSED +++
    
        Testing server_status ...
    
    	Starting the command:
    	
    	    /opt/hptc/sbin/lmstat
    	
    	Here is the output from the command:
    	
    	  lmstat - Copyright (c) 1989-2004 by Macrovision Corporation. All rights reserved.
    	  Flexible License Manager status on Tue 12/20/2005 09:41
    	  
    	  License server status: 27000@n16
    	      License file(s) on n16: /opt/hptc/etc/license/XC.lic:
    	  
    	      n16: license server UP (MASTER) v9.2
    	  
    	  Vendor daemon status (on n16):
    	  
    	      Compaq: UP v9.2
    	  
    	
    	Checking output from command.
    
            +++ PASSED +++
    
    Verify SLURM:
    
        Testing daemon_responds ...
    
    	Starting the command:
    	
    	    /opt/hptc/bin/scontrol ping
    	
    	Here is the output from the command:
    	
    	  Slurmctld(primary/backup) at n16/(NULL) are UP/DOWN
    	
    	Checking output from scontrol.
    
            +++ PASSED +++
    
        Testing partition_state ...
    
    	Starting the command:
    	
    	    /opt/hptc/bin/sinfo --all
    	
    	Here is the output from the command:
    	
    	  PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
    	  lsf          up   infinite     6   idle n[11-16]
    	
    	Checking output from command.
    
            +++ PASSED +++
    
        Testing node_state ...
    
    	Starting the command:
    	
    	    /opt/hptc/bin/sinfo --all --noheader --Node
    	
    	Here is the output from the command:
    	
    	  n[11-16]     6       lsf idle  
    	
    	Checking for non-idle node states.
    	
            +++ PASSED +++
    
    Verify LSF:
    
        Testing identification ...
    
    	Starting the command:
    	
    	/opt/hptc/lsf/top/6.1/linux2.6-glibc2.3-ia32e-slurm/bin/lsid
    	
    	Here is the output from the command:
    	
    	  Platform LSF HPC 6.1 for SLURM, Aug  9 2005
    	  Copyright 1992-2005 Platform Computing Corporation
    	  
    	  My cluster name is hptclsf
    	  My master name is lsfhost.localdomain
    	
    	Checking output from command.
    
            +++ PASSED +++
    
        Testing hosts_static_resource_info ...
    
    	Running 'lshosts -w'.
    	Checking output from lshosts.
    	Running 'controllsf show' to determine virtual hostname.
    	Checking output from controllsf.
    	Virtual hostname is lsfhost.localdomain
    	Comparing ncpus from Lsf lshosts to Slurm cpu count.
    	
            +++ PASSED +++
    
        Testing hosts_status ...
    
    	Running 'bhosts -w'.
    	Checking output from bhosts.
    	Running 'controllsf show' to determine virtual hostname.
    	Checking output from controllsf.
    	Virtual hostname is lsfhost.localdomain
    	Comparing MAX job slots from Lsf bhosts to Slurm cpu count.
    
            +++ PASSED +++
    
    Verify interconnect:
    
        Testing infiniband/port_state ...
    
    	Collecting vstat data from all nodes...
    	
            +++ PASSED +++
    
    Verify nagios:
    
        Testing configuration ...
    
    	Running basic sanity check on the Nagios configuration file.
    	Starting the command:
    	    /opt/hptc/bin/nagios -v /opt/hptc/nagios/etc/nagios_local.cfg
    	
    	Here is the output from the command:
    	
    	Nagios 2.0b6
    	Copyright (c) 1999-2005 Ethan Galstad (http://www.nagios.org)
    	Last Modified: 11-30-2005
    	License: GPL
    	
    	Reading configuration data...
    	Running pre-flight check on configuration data...
    	
    	Checking services...
    		Checked 71 services.
    	Checking hosts...
    	Warning: Host 'necs1-1' has no services associated with it!
    		Checked 8 hosts.
    	Checking host groups...
    		Checked 10 host groups.
    	Checking service groups...
    		Checked 6 service groups.
    	Checking contacts...
    		Checked 1 contacts.
    	Checking contact groups...
    		Checked 1 contact groups.
    	Checking service escalations...
    		Checked 0 service escalations.
    	Checking service dependencies...
    		Checked 156 service dependencies.
    	Checking host escalations...
    		Checked 0 host escalations.
    	Checking host dependencies...
    		Checked 0 host dependencies.
    	Checking commands...
    		Checked 53 commands.
    	Checking time periods...
    		Checked 4 time periods.
    	Checking extended host info definitions...
    		Checked 0 extended host info definitions.
    	Checking extended service info definitions...
    		Checked 0 extended service info definitions.
    	Checking for circular paths between hosts...
    	Checking for circular host and service dependencies...
    	Checking global event handlers...
    	Checking obsessive compulsive processor commands...
    	Checking misc settings...
    	
    	Total Warnings: 1
    	Total Errors:   0
    	
    	Things look okay - No serious problems were detected during the pre-flight check
    
            +++ PASSED +++
    
    Verify cluster:
    
        Testing xring ...
    
    	Loading module mpi/hp/default.
    	Send 200 messages with a size of 1024 bytes to 6 hosts.
    	
    	Done: 200 messages were sent by proc 0
    	Total messages sent by all procs: 1200
    
            +++ PASSED +++
    
    This verification has completed with 0 failures.
    
    A total of 14 tests were run.
    
    Details of this verification have been recorded in:
    
    	/hptc_cluster/adm/logs/ovp/verify_n16_122005.1
    
  3. Examine the test result output to ensure that all tests passed.

    Test results are stored in a time-stamped log file located in the /hptc_cluster/adm/logs/ovp/ directory. The log file name includes the head node name and the current date in the format MMDDYY (2-digit month, 2-digit day, and 2-digit year). For example, the log file named in the previous command output is verify_n16_122005.1 because the test was run on December 20, 2005, on node n16, the head node.

    Test failures and warnings are clearly reported, and the log file contains some troubleshooting information. When the cause of an error is obvious, the test output is terse. For example, an LSF test might fail, and the log file message might say only that LSF has not been configured.
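
The log file naming described in step 3 can be illustrated with a minimal Python sketch that splits a log file name into the head node name and the run date. The pattern verify_<node>_<MMDDYY> followed by an optional suffix is inferred from the single example name in the output above, and the function and class names are hypothetical; neither is part of the HP XC software. The meaning of the trailing ".1" is not stated in this section, so the sketch only preserves it.

```python
import re
from datetime import date, datetime
from typing import NamedTuple


class OvpLogName(NamedTuple):
    """Parts of an OVP log file name (hypothetical helper type)."""
    node: str        # head node name, for example "n16"
    run_date: date   # date decoded from the MMDDYY field
    suffix: str      # anything after the date, kept as-is (meaning not documented here)


# Assumed layout, inferred from the example "verify_n16_122005.1".
_LOG_NAME = re.compile(r"^verify_(?P<node>.+)_(?P<mmddyy>\d{6})(?P<suffix>\..+)?$")


def parse_ovp_log_name(name: str) -> OvpLogName:
    """Split an OVP log file name into node, date, and suffix."""
    match = _LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized OVP log file name: {name!r}")
    # MMDDYY: 2-digit month, 2-digit day, 2-digit year.
    run_date = datetime.strptime(match.group("mmddyy"), "%m%d%y").date()
    return OvpLogName(match.group("node"), run_date, match.group("suffix") or "")


if __name__ == "__main__":
    print(parse_ovp_log_name("verify_n16_122005.1"))
```

Parsing the date with strptime rather than slicing the string means an impossible date such as month 13 raises an error instead of being accepted silently.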

For more information about verifying individual cluster components on demand, see ovp(8) and the HP XC System Software Administration Guide.

Interconnect diagnostic tests are documented in the installation and operation guide for your model of HP cluster platform and in the HP XC System Software Administration Guide.

When all OVP tests pass, proceed to “Task 2: Take a Snapshot of the Database”.
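
As a rough aid for deciding whether all tests passed, the following Python sketch reads saved OVP output and compares the number of "+++ PASSED +++" markers with the two summary lines shown in the example output. It relies only on wording that appears in that example; how a failed test is marked is not shown in this section, so the sketch does not look for a failure marker. The function names are hypothetical, and the tolerance for singular wording ("1 failure", "1 test was run") is an assumption.

```python
import re

# Marker and summary wording copied from the example output in this section.
PASS_MARKER = "+++ PASSED +++"
_FAILURES = re.compile(r"This verification has completed with (\d+) failures?\.")
_TOTAL = re.compile(r"A total of (\d+) tests? (?:were|was) run\.")


def summarize_ovp_output(text: str) -> dict:
    """Return the PASSED marker count and the totals from the summary lines."""
    failures = _FAILURES.search(text)
    total = _TOTAL.search(text)
    if failures is None or total is None:
        raise ValueError("OVP summary lines not found in the output")
    return {
        "passed_markers": text.count(PASS_MARKER),
        "failures": int(failures.group(1)),
        "total": int(total.group(1)),
    }


def all_tests_passed(text: str) -> bool:
    """True when no failures are reported and every test has a PASSED marker."""
    summary = summarize_ovp_output(text)
    return summary["failures"] == 0 and summary["passed_markers"] == summary["total"]


if __name__ == "__main__":
    import sys

    # Usage: python check_ovp.py <path to saved OVP output>
    with open(sys.argv[1], encoding="utf-8") as handle:
        print("all tests passed" if all_tests_passed(handle.read()) else "check the log")
```

For the example output in this section, the summary reports 0 failures and 14 tests, and the output contains 14 PASSED markers, so the check would succeed. A script like this supplements reading the log; it does not replace it.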

© 2003 Hewlett-Packard Development Company, L.P.