The following list provides suggestions for troubleshooting OVP test failures:

The OVP issues a test failure if CPU usage on one or more nodes is over 10 percent. CPU usage might spike while the system is in a metrics collection phase, so run the OVP again to see whether the spike is a one-time instance or the problem is persistent. If the problem is persistent, use the top command to determine what processes are running and consuming CPU resources.

The OVP issues a test failure if memory usage on the compute nodes is more than 25 percent. In that case, log in to the compute node and use the top command to determine what processes are running and consuming memory. The following example shows a memory usage failure:

XC CLUSTER VERIFICATION PROCEDURE
Fri Jul 06 14:44:08 2007
Verify perf_health:
Testing memory_usage ...
The headnode is excluded from the memory usage test.
Number of nodes allocated for this test is 14
Job <2049> is submitted to default queue <interactive>
<< Waiting for dispatch ...>>
<<Starting on lsfhost.localdomain>>
The following node has memory usage more than 25%:
n3: memory usage is 34.38%, 12.08% of which is cache
--- FAILED ---

The OVP issues a test failure when a node (in this example, n16) has values more than three standard deviations from the mean. In this case, the failure could be a result of a loose cable on node n16.

Testing network_unidirectional ...
Number of nodes allocated for this test is 15
Job <2063> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on lsfhost.localdomain>>
[0:n2:1] ping-pong 5021.40 usec/msg 796.59 MB/sec
[1:n3:2] ping-pong 5021.11 usec/msg 796.64 MB/sec
[2:n4:3] ping-pong 5023.52 usec/msg 796.25 MB/sec
[3:n5:4] ping-pong 5026.87 usec/msg 795.72 MB/sec
[4:n6:5] ping-pong 5022.36 usec/msg 796.44 MB/sec
[5:n7:6] ping-pong 5022.75 usec/msg 796.38 MB/sec
[6:n8:7] ping-pong 5029.28 usec/msg 795.34 MB/sec
[7:n9:8] ping-pong 5022.21 usec/msg 796.46 MB/sec
[8:n10:9] ping-pong 5023.11 usec/msg 796.32 MB/sec
[9:n11:10] ping-pong 5026.93 usec/msg 795.71 MB/sec
[10:n12:11] ping-pong 5023.75 usec/msg 796.22 MB/sec
[11:n13:12] ping-pong 5021.82 usec/msg 796.52 MB/sec
[12:n14:13] ping-pong 5023.74 usec/msg 796.22 MB/sec
[13:n15:14] ping-pong 5030.09 usec/msg 795.22 MB/sec
[14:n16:0] ping-pong 5068.12 usec/msg 789.25 MB/sec
Interconnect test results summary (all values in mBytes/sec):
min: 789.250000
max: 796.640000
median: 796.250000
mean: 795.685333
range: 7.390000
variance: 3.365941
std_dev: 1.834650
The following node(s) have values more than
3 standard deviations from the mean:
node n16 has a value of 789.250000
--- FAILED ---
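
The three-standard-deviation check can be reproduced by hand from the per-node figures. The following Python sketch is illustrative only and is not the OVP's own code. It assumes the sample standard deviation (n - 1 divisor), which is consistent with the variance and std_dev values in the summary above. The name find_outliers is hypothetical.

```python
import statistics

# Bandwidth figures (MB/sec) copied from the network_unidirectional output above.
bandwidth = {
    "n2": 796.59, "n3": 796.64, "n4": 796.25, "n5": 795.72, "n6": 796.44,
    "n7": 796.38, "n8": 795.34, "n9": 796.46, "n10": 796.32, "n11": 795.71,
    "n12": 796.22, "n13": 796.52, "n14": 796.22, "n15": 795.22, "n16": 789.25,
}


def find_outliers(values_by_node, n_sigma=3.0):
    """Return (mean, std_dev, nodes farther than n_sigma * std_dev from the mean)."""
    values = list(values_by_node.values())
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 divisor), not population.
    std_dev = statistics.stdev(values)
    outliers = [node for node, value in values_by_node.items()
                if abs(value - mean) > n_sigma * std_dev]
    return mean, std_dev, outliers


mean, std_dev, outliers = find_outliers(bandwidth)
print(f"mean: {mean:.6f}  std_dev: {std_dev:.6f}  outliers: {outliers}")
```

With these figures, only n16 falls outside three standard deviations, which agrees with the node named in the OVP summary.
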

The OVP issues a test failure if the CPU counts reported by LSF and SLURM do not match.

From OVP...
Testing hosts_static_resource_info ...
Running 'lshosts -w'.
Checking output from lshosts.
Running 'controllsf show' to determine virtual hostname.
Checking output from controllsf.
Virtual hostname is lsfhost.localdomain
Comparing ncpus from Lsf lshosts to Slurm cpu count.
The Lsf and Slurm cpu count are NOT in sync.
The lshosts 'ncpus' value of 1560 differs from the cpu
total of 2040 calculated from the sinfo output.
Suggest running 'lshosts -w' manually and compare the ncpus
value with the output from sinfo
--- FAILED ---
Testing hosts_status ...
Running 'bhosts -w'.
Checking output from bhosts.
Running 'controllsf show' to determine virtual hostname.
Checking output from controllsf.
Virtual hostname is lsfhost.localdomain
Comparing MAX job slots from Lsf bhosts to Slurm cpu count.
The Lsf MAX job slots and Slurm cpu count are NOT in sync.
The bhosts 'MAX' value of 1560 differs from the cpu
total of 2040 calculated from the sinfo output.
Suggest running 'bhosts -w' manually and compare the MAX job
slots value with the output from sinfo.
--- FAILED ---

Follow this procedure to resolve the discrepancy in available CPU resources:

First, restart the LIM daemon and update licensing information.

Next, wait a few seconds and run the lshosts command to confirm that the number of CPUs is correct.

When the output of the lshosts command is correct, update LSF with static resources (CPUs and memory) to match what SLURM is reporting. The value reported must match the total number of CPUs reported by SLURM.
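
When comparing the values manually, the SLURM total can be summed from per-node output. The following Python sketch is one possible approach and is not how the OVP itself calculates the total. It assumes that the installed sinfo supports node-oriented output with a custom format, for example sinfo -N -h -o "%N %c", which prints one node name and its CPU count per line. The function names and the sample text are hypothetical.

```python
def total_cpus_from_sinfo(sinfo_text):
    """Sum per-node CPU counts from text with one 'node cpus' pair per line."""
    cpus_by_node = {}
    for line in sinfo_text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        node, cpus = fields
        # A node that belongs to several partitions is listed more than once;
        # keying by node name counts it only once.
        cpus_by_node[node] = int(cpus)
    return sum(cpus_by_node.values())


def counts_in_sync(lsf_ncpus, sinfo_text):
    """Return True if the LSF ncpus value equals the summed SLURM CPU count."""
    return lsf_ncpus == total_cpus_from_sinfo(sinfo_text)


# Hypothetical sample text; n3 appears twice, as if it were in two partitions.
sample = """\
n1 4
n2 4
n3 4
n3 4
"""

print(total_cpus_from_sinfo(sample))
print(counts_in_sync(12, sample))
```

The ncpus argument would come from the lshosts output. Whether nodes in a down or drained state should be included in the total is not covered by this sketch.
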
OVP network_bidirectional Test Might Report False Error on HP Server Blades
The OVP network_bidirectional test might report a false failure at enclosure boundaries. If these errors occur, rerun the OVP with the double verbose option (--verbose --verbose) and compare the actual results with the mean result. If the difference is less than 5%, you can safely ignore the errors. The following is an example of a false failure. Nodes ibblc64 and ibblc65 are on enclosure boundaries.

Exchange results summary (all values in mBytes/sec):
min: 2077.790000
max: 2143.940000
median: 2107.490000
mean: 2107.259747
range: 66.150000
variance: 76.098854
std_dev: 8.723466
The following node pairs have values more than
3 standard deviations from the mean:
nodes ibblc64 and ibblc65 have an Exchange value of 2077.790000
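
The 5% guideline can be applied to the figures in this example with simple arithmetic. The following Python sketch assumes that "difference" means the gap between the flagged value and the mean, expressed as a percentage of the mean. The source does not spell out the formula, so treat this as one reasonable reading. The function name percent_from_mean is hypothetical.

```python
def percent_from_mean(value, mean):
    """Return the distance of value from mean as a percentage of the mean."""
    return abs(mean - value) / mean * 100.0


# Figures copied from the Exchange results summary above.
mean = 2107.259747
flagged = 2077.790000  # nodes ibblc64 and ibblc65

difference = percent_from_mean(flagged, mean)
ignorable = difference < 5.0  # 5% threshold from the guidance above
print(f"difference from mean: {difference:.2f}%  ignorable: {ignorable}")
```

For this example the difference is about 1.4%, well under the 5% threshold, so the reported failure can be treated as false.
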

OVP Reports Benign Nagios Warnings

The OVP might return the following Nagios warning messages. These messages are benign and you can ignore them.

Verify nagios:
Testing configuration ...
Running basic sanity check on the Nagios configuration file.
Starting the command:
/opt/hptc/bin/nagios -v /opt/hptc/nagios/etc/nagios_local.cfg
Here is the output from the command:
Warnings were reported.
Nagios 2.3.1
Copyright (c) 1999-2006 Ethan Galstad (http://www.nagios.org)
Last Modified: 05-15-2006
License: GPL
Reading configuration data...
Warning: Duplicate definition found for service 'nagiosmonitor'
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 195)
Warning: Duplicate definition found for service 'hostmonitor'
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 206)
Warning: Duplicate definition found for service 'syslogalertmonitor'
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 217)
Warning: Duplicate definition found for service 'keysync'
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 232)
Warning: Duplicate definition found for service 'sensorCollectionMonitor'
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 252)
Warning: Duplicate definition found for service 'selmon'
(config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 274)
Running pre-flight check on configuration data...
Checking services...
Warning: Service 'Syslog Alert Monitor' on host 'nh' has a notification interval
less than its check interval! Notifications are only re-sent after checks are made, so
the effective notification interval will be that of the check interval.
Warning: Service 'Syslog Alerts' on host 'nh' has a notification interval less
than its check interval! Notifications are only re-sent after checks are made, so the
effective notification interval will be that of the check interval.
Warning: Service 'System Event Log Monitor' on host 'nh' has a notification
interval less than its check interval! Notifications are only re-sent after checks
are made, so the effective notification interval will be that of the check interval.
Warning: Service 'Syslog Alert Monitor' on host 'n6' has a notification
interval less than its check interval! Notifications are only re-sent after checks are
made, so the effective notification interval will be that of the check interval.
Warning: Service 'Syslog Alerts' on host 'n6' has a notification interval
less than its check interval! Notifications are only re-sent after checks are made,
so the effective notification interval will be that of the check interval.
Warning: Service 'System Event Log Monitor' on host 'n6' has a notification
interval less than its check interval! Notifications are only re-sent after checks are
made, so the effective notification interval will be that of the check interval.
Warning: Service 'Syslog Alerts' on host 'n7' has a notification interval
less than its check interval! Notifications are only re-sent after checks are made,
so the effective notification interval will be that of the check interval.
Checked 49 services.
Checking hosts...
.
.
.

OVP qsnet_database Test May Fail Due to Benign Errors Returned By the qsctrl Utility

The following issue is specific to systems using the QsNetII interconnect. In this release, the qsnet2 utilities cannot detect that a node is missing. Therefore, the /usr/bin/qsctrl utility reports a warning message for all links in reset. For example:

# qsctrl
qsctrl: QR0N00:00:0:0 <--> Elan:0:0 state 3 should be 4
qsctrl: QR0N00:00:0:1 <--> Elan:0:1 state 3 should be 4
qsctrl: QR0N00:00:0:2 <--> Elan:0:2 state 3 should be 4
qsctrl: QR0N00:00:0:3 <--> Elan:0:3 state 3 should be 4
qsctrl: QR0N00:00:1:0 <--> Elan:0:4 state 3 should be 4
qsctrl: QR0N00:00:1:1 <--> Elan:0:5 state 3 should be 4
qsctrl: QR0N00:00:1:2 <--> Elan:0:6 state 3 should be 4
qsctrl: QR0N00:00:1:3 <--> Elan:0:7 state 3 should be 4
qsctrl: QR0N00:00:2:0 <--> Elan:0:8 state 3 should be 4
qsctrl: QR0N00:00:2:1 <--> Elan:0:9 state 3 should be 4
qsctrl: QR0N00:00:2:2 <--> Elan:0:10 state 3 should be 4
qsctrl: QR0N00:00:2:3 <--> Elan:0:11 state 3 should be 4
qsctrl: QR0N00:00:3:0 <--> Elan:0:12 state 3 should be 4
qsctrl: QR0N00:00:3:1 <--> Elan:0:13 state 3 should be 4
qsctrl: QR0N00:00:3:2 <--> Elan:0:14 state 3 should be 4
qsctrl: QR0N00:00:3:3 <--> Elan:0:15 state 3 should be 4
qsctrl: QR0N00:01:0:0 <--> Elan:0:16 state 3 should be 4
qsctrl: QR0N00:01:0:1 <--> Elan:0:17 state 3 should be 4
qsctrl: QR0N00:01:0:2 <--> Elan:0:18 state 3 should be 4
qsctrl: QR0N00:01:0:3 <--> Elan:0:19 state 3 should be 4
qsctrl: Warning: failed link state check on 1 modules

To work around this issue, configure out the unconnected links using the qsctrl -o command. For example:

# qsctrl -o QR0N00:00:0:0
# qsctrl -o QR0N00:00:0:1
# qsctrl -o QR0N00:00:0:2

If you attach nodes to any of these ports, you must configure them back in again before the link can be used. For example:

# qsctrl -i QR0N00:00:0:0
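
When many links are reported, the list of qsctrl -o commands can be generated from the warning text instead of typed by hand. The following Python sketch only builds the command strings and does not run them. It assumes the warning lines have exactly the layout shown in the qsctrl example above. The function name configure_out_commands is hypothetical. Review the generated list before using it, because a link that reports a state mismatch might be faulty rather than unconnected.

```python
import re

# Matches link warnings such as:
#   qsctrl: QR0N00:00:0:0 <--> Elan:0:0 state 3 should be 4
LINK_WARNING = re.compile(
    r"^qsctrl:\s+(\S+)\s+<-->\s+\S+\s+state\s+\d+\s+should be\s+\d+\s*$"
)


def configure_out_commands(qsctrl_output):
    """Return one 'qsctrl -o <link>' command string per link warning found."""
    commands = []
    for line in qsctrl_output.splitlines():
        match = LINK_WARNING.match(line.strip())
        if match:
            commands.append(f"qsctrl -o {match.group(1)}")
    return commands


# Lines copied from the qsctrl example above; the summary line is skipped.
sample_output = """\
qsctrl: QR0N00:00:0:0 <--> Elan:0:0 state 3 should be 4
qsctrl: QR0N00:00:0:1 <--> Elan:0:1 state 3 should be 4
qsctrl: Warning: failed link state check on 1 modules
"""

for command in configure_out_commands(sample_output):
    print(command)
```
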