HP XC System Software: Installation Guide > Chapter 12 Troubleshooting

Troubleshooting the OVP

The following list provides suggestions for troubleshooting OVP test failures:

  • The OVP issues a test failure if CPU usage on one or more nodes is found to be over 10 percent. In that case, use the top command to determine what processes are running and consuming CPU resources.

  • CPU usage might spike while the system is in a metrics collection phase. In that case, run the OVP again to determine whether the spike is a one-time event or a persistent problem. If the problem persists, use the top command to determine what processes are running and consuming CPU resources.

  • The OVP issues a test failure if memory usage on the compute nodes is more than 25 percent. In that case, log in to the compute node and use the top command to determine what processes are running and consuming memory.

    XC CLUSTER VERIFICATION PROCEDURE
    Fri Jul 06 14:44:08 2007
     
    Verify perf_health:
     
        Testing memory_usage ...
     
            The headnode is excluded from the memory usage test.
            Number of nodes allocated for this test is 14
            Job <2049> is submitted to default queue <interactive>
            << Waiting for dispatch ...>>
            <<Starting on lsfhost.localdomain>>
            The following node has memory usage more than 25%:
     
              n3:   memory usage is 34.38%, 12.08% of which is cache
            --- FAILED ---
  • The OVP issues a test failure when a node (in this example, n16) has values more than three standard deviations from the mean. In this case, the failure could be a result of a loose cable on node n16.

        Testing network_unidirectional ...
     
            Number of nodes allocated for this test is 15
            Job <2063> is submitted to default queue <interactive>.
            <<Waiting for dispatch ...>>
            <<Starting on lsfhost.localdomain>>
            [0:n2:1] ping-pong 5021.40 usec/msg 796.59 MB/sec
            [1:n3:2] ping-pong 5021.11 usec/msg 796.64 MB/sec
            [2:n4:3] ping-pong 5023.52 usec/msg 796.25 MB/sec
            [3:n5:4] ping-pong 5026.87 usec/msg 795.72 MB/sec
            [4:n6:5] ping-pong 5022.36 usec/msg 796.44 MB/sec
            [5:n7:6] ping-pong 5022.75 usec/msg 796.38 MB/sec
            [6:n8:7] ping-pong 5029.28 usec/msg 795.34 MB/sec
            [7:n9:8] ping-pong 5022.21 usec/msg 796.46 MB/sec
            [8:n10:9] ping-pong 5023.11 usec/msg 796.32 MB/sec
            [9:n11:10] ping-pong 5026.93 usec/msg 795.71 MB/sec
            [10:n12:11] ping-pong 5023.75 usec/msg 796.22 MB/sec
            [11:n13:12] ping-pong 5021.82 usec/msg 796.52 MB/sec
            [12:n14:13] ping-pong 5023.74 usec/msg 796.22 MB/sec
            [13:n15:14] ping-pong 5030.09 usec/msg 795.22 MB/sec
            [14:n16:0] ping-pong 5068.12 usec/msg 789.25 MB/sec
     
            Interconnect test results summary (all values in mBytes/sec):
              min:    789.250000
              max:    796.640000
              median: 796.250000
              mean:   795.685333
              range:       7.390000
              variance:    3.365941
              std_dev:     1.834650
            The following node(s) have values more than
            3 standard deviations from the mean:
     
              node n16 has a value of 789.250000
     
            --- FAILED ---
  • The OVP issues a test failure if the CPU counts reported by LSF and SLURM do not match.

    From OVP...
    Testing hosts_static_resource_info ...
    
            Running 'lshosts -w'.
            Checking output from lshosts.
            Running 'controllsf show' to determine virtual hostname.
            Checking output from controllsf.
            Virtual hostname is lsfhost.localdomain
            Comparing ncpus from Lsf lshosts to Slurm cpu count.
            The Lsf and Slurm cpu count are NOT in sync.
            The lshosts 'ncpus' value of 1560 differs from the cpu
            total of 2040 calculated from the sinfo output.
    
            Suggest running 'lshosts -w' manually and compare the ncpus
            value with the output from sinfo
    
            --- FAILED ---
    
        Testing hosts_status ...
    
            Running 'bhosts -w'.
            Checking output from bhosts.
    
            Running 'controllsf show' to determine virtual hostname.
            Checking output from controllsf.
            Virtual hostname is lsfhost.localdomain
            Comparing MAX job slots from Lsf bhosts to Slurm cpu count.
            The Lsf MAX job slots and Slurm cpu count are NOT in sync.
            The bhosts 'MAX' value of 1560 differs from the cpu
            total of 2040 calculated from the sinfo output.
    
            Suggest running 'bhosts -w' manually and compare the MAX job
            slots value with the output from sinfo.
    
            --- FAILED ---

    Follow this procedure to resolve the discrepancy in available CPU resources:

    1. Restart the LIM daemon and update licensing information:

      # lsadmin limrestart
    2. Wait a few seconds and run the following command to confirm that the number of CPUs is correct:

      # lshosts -w
    3. When the output of the lshosts command is correct, update LSF with static resources (CPUs and memory) to match what SLURM is reporting:

      # badmin reconfig

      The value reported must match the total number of CPUs reported by SLURM.
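
The three-standard-deviation check behind the network_unidirectional failure shown earlier in this list can be reproduced by hand. The following Python sketch is illustrative only and is not the OVP implementation: the function name find_outliers and the bandwidth dictionary (copied from the sample output) are assumptions made for this example. It uses the sample (n-1) standard deviation, which reproduces the std_dev figure in that output.

```python
import statistics

# Bandwidth in MB/sec per node, copied from the sample
# network_unidirectional output earlier in this list.
bandwidth = {
    "n2": 796.59, "n3": 796.64, "n4": 796.25, "n5": 795.72,
    "n6": 796.44, "n7": 796.38, "n8": 795.34, "n9": 796.46,
    "n10": 796.32, "n11": 795.71, "n12": 796.22, "n13": 796.52,
    "n14": 796.22, "n15": 795.22, "n16": 789.25,
}


def find_outliers(values, n_sigma=3.0):
    """Return the nodes whose value lies more than n_sigma sample
    standard deviations from the mean (hypothetical helper)."""
    data = list(values.values())
    mean = statistics.mean(data)
    std_dev = statistics.stdev(data)  # sample (n-1) standard deviation
    return sorted(
        node for node, value in values.items()
        if abs(value - mean) > n_sigma * std_dev
    )


print(find_outliers(bandwidth))
```

With the sample data, only node n16 exceeds the threshold, which agrees with the OVP report.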

OVP network_bidirectional Test Might Report False Error on HP Server Blades

The OVP network_bidirectional test might report a false failure at enclosure boundaries. If these errors occur, rerun the OVP with the double verbose option (--verbose --verbose) and compare the actual results with the mean result. If the difference is less than 5%, you can safely ignore the errors.

The following is an example of a false failure. Nodes ibblc64 and ibblc65 are on enclosure boundaries.

       Exchange results summary (all values in mBytes/sec):
         min:    2077.790000
         max:    2143.940000
         median: 2107.490000
         mean:   2107.259747
         range:       66.150000
         variance:    76.098854
         std_dev:     8.723466
       The following node pairs have values more than
       3 standard deviations from the mean:

         nodes ibblc64 and ibblc65 have an Exchange value of 2077.790000
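
The 5% comparison can be done with a few lines of Python. This is a minimal sketch, not part of the OVP: the function name is hypothetical, and it assumes the difference is measured relative to the mean, which the text does not state explicitly.

```python
def is_within_tolerance(value, mean, tolerance_pct=5.0):
    """Return True when value differs from mean by less than
    tolerance_pct percent of the mean (assumed interpretation)."""
    diff_pct = abs(value - mean) / mean * 100.0
    return diff_pct < tolerance_pct


# Values taken from the example summary above.
pair_value = 2077.79
mean_value = 2107.259747
print(is_within_tolerance(pair_value, mean_value))
```

For the example above, the flagged pair differs from the mean by roughly 1.4%, so under this interpretation the failure can be ignored.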

OVP Reports Benign Nagios Warnings

The OVP might return the following Nagios warning messages. These messages are benign and you can ignore them.

Verify nagios:

    Testing configuration ...

        Running basic sanity check on the Nagios configuration file.
        Starting the command:
                /opt/hptc/bin/nagios -v /opt/hptc/nagios/etc/nagios_local.cfg

        Here is the output from the command:

        Warnings were reported.
        Nagios 2.3.1
        Copyright (c) 1999-2006 Ethan Galstad (http://www.nagios.org)
        Last Modified: 05-15-2006
        License: GPL

        Reading configuration data...

        Warning: Duplicate definition found for service 'nagiosmonitor' 
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 195)
        Warning: Duplicate definition found for service 'hostmonitor' 
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 206)
        Warning: Duplicate definition found for service 'syslogalertmonitor'
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 217)
        Warning: Duplicate definition found for service 'keysync' 
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 232)
        Warning: Duplicate definition found for service 'sensorCollectionMonitor' 
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 252)
        Warning: Duplicate definition found for service 'selmon' 
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 274)
        Running pre-flight check on configuration data...

        Checking services...
        Warning: Service 'Syslog Alert Monitor' on host 'nh'  has a notification interval 
less than its check interval!  Notifications are only re-sent after checks are made, so 
the effective notification interval will be that of the check interval.
        Warning: Service 'Syslog Alerts' on host 'nh'  has a notification interval less 
than its check interval!  Notifications are only re-sent after checks are made, so the 
effective notification interval will be that of the check interval.
        Warning: Service 'System Event Log Monitor' on host 'nh'  has a notification 
interval less than its check interval!  Notifications are only re-sent after checks 
are made, so the effective notification interval will be that of the check interval.
        Warning: Service 'Syslog Alert Monitor' on host 'n6'  has a notification 
interval less than its check interval!  Notifications are only re-sent after checks are 
made, so the effective notification interval will be that of the check interval.
        Warning: Service 'Syslog Alerts' on host 'n6'  has a notification interval 
less than its check interval!  Notifications are only re-sent after checks are made, 
so the effective notification interval will be that of the check interval.
        Warning: Service 'System Event Log Monitor' on host 'n6'  has a notification 
interval less than its check interval!  Notifications are only re-sent after checks are 
made, so the effective notification interval will be that of the check interval.
        Warning: Service 'Syslog Alerts' on host 'n7'  has a notification interval 
less than its check interval!  Notifications are only re-sent after checks are made, 
so the effective notification interval will be that of the check interval.
                Checked 49 services.
        Checking hosts...
        .
        .
        . 

OVP qsnet_database Test May Fail Due to Benign Errors Returned By the qsctrl Utility

The following issue is specific to systems using the QsNetII interconnect.

In this release, it is not possible for the qsnet2 utilities to detect that a node is missing. Therefore, the /usr/bin/qsctrl utility reports a warning message for all links in reset. For example:

# qsctrl
qsctrl: QR0N00:00:0:0 <--> Elan:0:0 state 3 should be 4
qsctrl: QR0N00:00:0:1 <--> Elan:0:1 state 3 should be 4
qsctrl: QR0N00:00:0:2 <--> Elan:0:2 state 3 should be 4
qsctrl: QR0N00:00:0:3 <--> Elan:0:3 state 3 should be 4
qsctrl: QR0N00:00:1:0 <--> Elan:0:4 state 3 should be 4
qsctrl: QR0N00:00:1:1 <--> Elan:0:5 state 3 should be 4
qsctrl: QR0N00:00:1:2 <--> Elan:0:6 state 3 should be 4
qsctrl: QR0N00:00:1:3 <--> Elan:0:7 state 3 should be 4
qsctrl: QR0N00:00:2:0 <--> Elan:0:8 state 3 should be 4
qsctrl: QR0N00:00:2:1 <--> Elan:0:9 state 3 should be 4
qsctrl: QR0N00:00:2:2 <--> Elan:0:10 state 3 should be 4
qsctrl: QR0N00:00:2:3 <--> Elan:0:11 state 3 should be 4
qsctrl: QR0N00:00:3:0 <--> Elan:0:12 state 3 should be 4
qsctrl: QR0N00:00:3:1 <--> Elan:0:13 state 3 should be 4
qsctrl: QR0N00:00:3:2 <--> Elan:0:14 state 3 should be 4
qsctrl: QR0N00:00:3:3 <--> Elan:0:15 state 3 should be 4
qsctrl: QR0N00:01:0:0 <--> Elan:0:16 state 3 should be 4
qsctrl: QR0N00:01:0:1 <--> Elan:0:17 state 3 should be 4
qsctrl: QR0N00:01:0:2 <--> Elan:0:18 state 3 should be 4
qsctrl: QR0N00:01:0:3 <--> Elan:0:19 state 3 should be 4
qsctrl: Warning: failed link state check on 1 modules

To work around this issue, configure out the unconnected links using the qsctrl -o command. For example:

# qsctrl -o QR0N00:00:0:0
# qsctrl -o QR0N00:00:0:1
# qsctrl -o QR0N00:00:0:2

NOTE: Configure out only those links whose destination field carries the Elan designator. For example:

qsctrl: QR0N00:02:3:3  <--> Elan:0:47 link state normal

Links that have the QR0Nxx designator in both the origin and destination fields must not be configured out; doing so causes the whole chip to go into reset. For example, do not configure out a link that looks like the following:

qsctrl: QR0N00:04:0:3  <--> QR0N00:03:7:4 link state reset 

If you attach nodes to any of these ports, you must configure the corresponding links back in before they can be used. For example:

# qsctrl -i QR0N00:00:0:0
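
When many links are in reset, the selection rule described above can be applied to saved qsctrl output before any qsctrl -o command is typed. The following Python sketch is an illustration under stated assumptions: the line format is inferred only from the examples on this page, the function name is hypothetical, and the script only prints candidate commands without running them. It keeps links whose destination starts with Elan and whose status contains "should be", and it skips any link whose destination is another QR0Nxx port.

```python
import re

# Assumed line layout, inferred from the examples on this page:
#   qsctrl: <origin> <--> <destination> <status text>
LINE_RE = re.compile(r"^qsctrl:\s+(\S+)\s+<-->\s+(\S+)\s+(.*)$")


def links_to_configure_out(qsctrl_output):
    """Return origin ports that report a state mismatch and whose
    destination is an Elan connection (hypothetical helper)."""
    links = []
    for line in qsctrl_output.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # summary lines such as "Warning: failed link state check"
        origin, destination, status = match.groups()
        # Never select switch-to-switch links (QR0Nxx on both ends).
        if destination.startswith("Elan:") and "should be" in status:
            links.append(origin)
    return links


sample = """\
qsctrl: QR0N00:00:0:0 <--> Elan:0:0 state 3 should be 4
qsctrl: QR0N00:00:0:1 <--> Elan:0:1 state 3 should be 4
qsctrl: QR0N00:02:3:3  <--> Elan:0:47 link state normal
qsctrl: QR0N00:04:0:3  <--> QR0N00:03:7:4 link state reset
qsctrl: Warning: failed link state check on 1 modules
"""

for link in links_to_configure_out(sample):
    print("qsctrl -o " + link)  # printed for review, not executed
```

Review the printed commands against the actual cabling before running them, because the sketch cannot tell an unconnected port from a connected node that is faulty.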
© 2003 Hewlett-Packard Development Company, L.P.