HP XC System Software : Administration Guide > Chapter 19 Using Diagnostic Tools

Using the System Interconnect Diagnostic Tools


Various tools enable you to diagnose the system interconnect. Some tools are provided by the system interconnect manufacturer and are discussed in the Installation and Operation Guide (the hardware documentation) for your system. Be sure to consult the appropriate Web page for these system interconnect tools:

Other tools have been written specifically for use with the HP XC system.

To use the diagnostic tools, you must ensure that the system interconnect is properly configured. The IP addresses must be configured and the /etc/hosts file must be updated with the switch names, for example MR0N00 for Myrinet system interconnect and QR0N00 for Quadrics system interconnect. These topics are discussed in the HP XC System Software Installation Guide.

Note:

Link errors are common when a node boots or reboots. During boot, the system interconnect driver is initiated, putting the system interconnect into a full reset. This puts the link into reset and always causes an error on the switch connected to the system interconnect.

This section describes the following diagnostic tools:

HP XC Diagnostic Tools for the Myrinet System Interconnect

This section describes tools that were developed specifically for diagnosing the Myrinet system interconnect (from Myricom, Inc.) on the HP XC system. See your system's hardware installation and operation guide for information about standard diagnostic tools.

The gm_prodmode_mon Diagnostic Tool

This program monitors the GM2.1 switch, reads current environment parameters, and generates alerts if the values of the following parameters are outside the operating ranges recommended by the manufacturer:

Bad CRCs

The value should be zero (0).

Temperature

The temperature should be less than 104°F (40°C).

Voltage

The voltage should be within +/- 10 percent of nominal voltage.

Fan speed

The fan speed should be above the minimum.
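
The ranges listed above can be expressed as a small check. The following bash sketch is illustrative only and is not the gm_prodmode_mon implementation: the function name check_switch_env, its argument order, and the alert wording are hypothetical, and the nominal voltage and minimum fan speed must be supplied by the caller because this guide does not state them.

```shell
# Sketch only: argument order and alert wording are illustrative.
# Usage: check_switch_env BAD_CRCS TEMP_C VOLTS NOMINAL_VOLTS FAN_RPM MIN_FAN_RPM
check_switch_env() {
    awk -v crc="$1" -v temp="$2" -v volt="$3" -v nom="$4" \
        -v fan="$5" -v minfan="$6" '
    BEGIN {
        bad = 0
        if (crc != 0)   { print "ALERT: bad CRCs = " crc; bad = 1 }
        if (temp >= 40) { print "ALERT: temperature " temp " C is not below 40 C"; bad = 1 }
        dev = volt - nom
        if (dev < 0) dev = -dev
        if (dev > 0.10 * nom) { print "ALERT: voltage " volt " V deviates more than 10% from " nom " V"; bad = 1 }
        if (fan <= minfan)    { print "ALERT: fan speed " fan " is not above the minimum " minfan; bad = 1 }
        exit bad        # exit status 1 if any reading is out of range
    }'
}

# Made-up readings that are all within range, so no alert is printed.
check_switch_env 0 38 3.25 3.3 4200 3000 && echo "within range"
```

The function prints one line per out-of-range reading and returns a nonzero status if it printed anything, so it can be used directly in an if statement.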

The gm_prodmode_mon diagnostic tool searches /etc/hosts for entries whose name matches the regular expression "MR0[NT][0-9][0-9]".
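
To see which entries such a search would find, you can apply the same pattern yourself. The following bash sketch is an assumption about how the match could be done, not the tool's own code: the function name list_switch_hosts is hypothetical, and it anchors the pattern to the whole host name, which the guide does not specify. The prefix is a parameter so that the same function works for other switch name prefixes.

```shell
# Usage: list_switch_hosts PREFIX [HOSTS_FILE]
#   PREFIX is the leading part of the switch name, for example MR0.
list_switch_hosts() {
    local prefix="$1" hosts_file="${2:-/etc/hosts}"
    awk -v re="^${prefix}[NT][0-9][0-9]\$" '
        { sub(/#.*/, "") }                  # ignore comments
        { for (i = 2; i <= NF; i++)         # field 1 is the IP address
              if ($i ~ re) print $i }
    ' "$hosts_file"
}

# Intended use (reads the real /etc/hosts, so it is left commented out):
#   list_switch_hosts MR0
```

An empty result suggests that the switch names have not yet been added to /etc/hosts.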

This command uses the links -dump command to obtain the current values and parses the output. The gm_prodmode_mon diagnostic tool generates an alert if any errors are found. All alerts are logged in the /var/log/messages file.

The format of this command is:

gm_prodmode_mon [-help] [-verbose] [-d directory-name]

Output from gm_prodmode_mon is logged to /var/log/diag/myrinet/gm_prodmode_mon/links.log by default, but you can specify another directory with the -d option. Progress messages are also written to stdout while the diagnostic test runs.

This command is configured to run once each hour by a crontab file in the /etc/cron.hourly directory.

The gm_drain_test Diagnostic Tool

This diagnostic tool runs five tests for the Myricom® switches in an HP XC system. You must launch it from the head node and run it only during allocated preventive maintenance.

The five tests are as follows:

gm_prodmode_mon

Examines environmental operating parameters.

gm_allsize

Tests network links.

gm_debug

Tests PCI bandwidth.

gm_board_info

Tests host detection.

gm_stress

Exercises the network; it might detect workload problems.

If the mute files exist, they are also gathered.

Note:

Run this test from the head node and only at a time when it will not interfere with productivity.

The output from the gm_drain_test, as well as output from the individual tests, is gathered in a tar file, which is compressed with the gzip utility. You can use the -d option to specify the directory where this file will be located; if you do not specify a directory, it is placed in /var/log/diag/myrinet/gm_drain_test/ by default.

This command has the following format:

gm_drain_test [-help] [-d directory-name]

Using Diagnostic Tools for the Quadrics System Interconnect

This section describes tools that were developed specifically for diagnosing the Quadrics system interconnect (from Quadrics, Ltd.) on the HP XC system. It also covers the swmlogger daemon, provided by Quadrics, which can be configured to log errors to a database, and the tools that access the information in this database. See your system's hardware installation and operation guide for information about vendor-specific diagnostic tools.

The swmlogger Daemon

The swmlogger daemon is a Quadrics system interconnect tool that monitors the state of each switch. It updates the database for changes in the status of fans, power supplies, temperature, and link errors. It also handles alerts from the monitoring daemons on each node that generate messages.

The swmlogger daemon provides a logging service for the switch management daemons (swmserver) that run on embedded QNX modules in QsNetII® network switches. Messages about switch health are logged over the network either to a log file, for generic usage, or to a MySQL database, as is the case in HP XC systems. In addition to logging errors in the QsNet database, the swmlogger daemon also logs all errors to the /var/log/messages file. See the diagnostics section of the installation and operation guide for your model of HP cluster platform for additional information on the generic use of swmlogger.

The MySQL database for qsnet is created automatically during the configuration phase of the HP XC system installation and the swmlogger daemon is configured to run on the head node.

Use the qsdiagadm and the qsneterr utilities to report information from the qsnet database.

The qsdiagadm Utility

A useful function of the Quadrics qsdiagadm utility is to display a diagnostic history for the HP XC system, as shown in the following example:

# qsdiagadm
date time     qsdiagadm: diags database created (QMS64,rails=1)
date time     qsctrl: passed power control check (on)
date time     qsctrl: passed population check (ok)
date time     qsctrl: passed bus control check (ok)
date time     qsctrl: passed gateway check (bootp,0.0.0.0)

.
.
.

date time     qsctrl: passed bus control check (ok)
date time     qsctrl: passed gateway check (bootp,0.0.0.0)
date time     qsctrl: passed module heartbeat check
date time     qsctrl: passed firmware version check (43-4081899)
date time     qsctrl: passed tftp server check (172.20.0.16)
date time     qsctrl: passed upgrade file check (503-upgrade.tar)
date time     qsctrl: passed broadcast top check
.
.
.
The qsneterr Utility

The Quadrics qsneterr utility reports on the network errors logged by the swmlogger daemon in the diagnostics database. The default action is to print summaries sorted by time and by module.

The qselantestp Diagnostic Tool

This tool runs the qselantest diagnostic tool in parallel on all the nodes of the HP XC system. The qselantest diagnostic tool verifies the presence of the QM500 adapter in a node and verifies the clock speed, thread processor, SDRAM, PCI-X connectivity, the QsNetII link interface, the direct memory access (DMA) latency, and the link bandwidth between the QM500 adapter and the switch port to which it is connected.

Note:

Because this diagnostic tool uses the adapter and the link 100 percent of the time during testing, it has significant impact on machine performance.

The command format for this utility is as follows:

qselantestp [-help] [-d log-directory] [-r rail] [-clean] [-N nodes]

The options are defined as follows:

-d log-directory

Specifies the directory in which the logs are to be placed. The default directory is /var/log/diag/quadrics/qselantestp.

-r rail

Specifies the rail number. The default is 0.

-clean

Clears out the log file directory. Any existing log files are deleted before the new test is run, so that the directory contains only data from the current run.

HP recommends using this option.

-N nodes

Specifies that you want to run this test only on a subset of nodes; the nodes parameter is a comma-separated list of nodes. The default is to run this test on all nodes.

Enter the following command on the head node as superuser to begin this diagnostic:

# qselantestp -clean

The qselantestp tool also parses each node's output file and reports any errors it finds.
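
Because the -N option takes a comma-separated list, it can be convenient to build that list from separate node names. The following bash sketch shows one way to do so; join_nodes is a hypothetical helper name, not part of the HP XC software.

```shell
# Join node names given as separate arguments into the comma-separated
# form that the -N option expects.
join_nodes() {
    local IFS=,               # "$*" joins arguments with the first character of IFS
    printf '%s\n' "$*"
}

join_nodes n1 n2 n4           # prints n1,n2,n4

# Intended use on the head node, as superuser (not run here):
#   qselantestp -clean -N "$(join_nodes n1 n2 n4)"
```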

The qsnet2_level_test Diagnostic Tool

The qsnet2_level_test utility is a useful tool for diagnosing the Quadrics system interconnect. This utility uses the pdsh command to execute itself (that is, the qsnet2_level_test utility) on all nodes simultaneously.

Each node has a specific path through the network that the node is responsible for testing. This path is derived from the node's physical position in the Quadrics network. The qsnet2_level_test can be run through the various levels to test that path. Therefore, it is important to run level1 first, then level2, level3, and so on, up to the maximum level on your machine topology. If a node fails at level1, then the node will fail on all the other levels because it sends data through level1 to reach the higher levels. Ensure that level1 passes before testing level2.

Note:

This diagnostic tool uses the adapter and the link 100 percent of the time during the test and, as a result, has a great effect on machine performance.

You must run the qsnet2_level_test utility as superuser from the head node.

If all the nodes have not completed their tests after a timeout period elapses, the qsnet2_level_test terminates and the incomplete processes on any remaining nodes are killed.

The format of the qsnet2_level_test utility is shown here:

qsnet2_level_test levels [-d dir] [-parse|-noparse] [-v] [-t timeout] [-N nodes] [-r rail] [-clean]

On the command line, you specify the fabric levels that you want to test. The qsnet2_level_test utility automatically creates a new directory, named for the level, in the directory specified on the command line. If the directory already exists, the utility continues and uses the existing directory. Acceptable values for the levels parameter are level1, level2, level3, level, level, and ALL. Test the levels in ascending order.

The options are defined as follows:

-d dir

Specifies the directory to place the logs in. This directory must be visible to all nodes in the HP XC system.

-parse

Parses the existing logs instead of running tests. This can help to identify slow, erratic, or broken links.

-noparse

Runs the tests and records the log files in the chosen directory, but does not parse the log files to find and report errors automatically. This is particularly useful during drain-time testing, when you have only a 20- to 30-minute period in which to perform preventive maintenance testing. You can use the -parse option to verify the results at a later time.

-v

Specifies verbose output, which is required to identify which component or location is causing errors.

-t timeout

Specifies the timeout value (in seconds), that is, the length of time to wait for any test to finish. The default value is 300.

-N nodes

Enables you to run the qsnet2_level_test on only a subset of nodes. The argument nodes is a comma-separated list, for example: n1,n2,n4. The default operation is to run the qsnet2_level_test utility on all nodes.

-clean

Clears out all log files in the directory for the specified level before a test is run. Because old data is deleted first, the directory contains only data from the current run.

This utility uses qsnet2_dmatest to sanitize and test individual links in the network. The qsnet2_dmatest also reports bandwidth for small and large data transfers as it ramps up. It automatically examines and reports if any errors were detected by the system interconnect during the test. The log files for qsnet2_dmatest are written to the directory specified in the command line.

Level3 of the qsnet2_dmatest tests the links between the QM502 card and the QM501 rear switch cards through the midplane. If the QM502 cards are not installed, the test spawns across the nodes, then waits for the timeout period and reports all the nodes as "failed to complete".
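
The rule that each level must pass before the next one is tested can be scripted. The following bash sketch is one possible approach under stated assumptions: the function name run_levels_in_order and the LEVEL_TEST variable are hypothetical, the exit status of qsnet2_level_test is not relied on because this guide does not document it, and a failure is assumed to appear as an output line that starts with "ERROR:", as in the timeout output shown in the examples in this section.

```shell
# LEVEL_TEST can be pointed at a stub to try the loop without the real tool.
LEVEL_TEST="${LEVEL_TEST:-qsnet2_level_test}"

# Usage: run_levels_in_order LOGDIR LEVEL...
run_levels_in_order() {
    local logdir="$1"; shift
    local level output
    command -v "$LEVEL_TEST" >/dev/null 2>&1 || {
        echo "$LEVEL_TEST not found" >&2
        return 2
    }
    for level in "$@"; do
        echo "== $level =="
        # The exit status of the tool is deliberately ignored (see assumptions).
        output=$("$LEVEL_TEST" "$level" -d "$logdir" -clean -v 2>&1) || true
        printf '%s\n' "$output"
        # Assumption: failures appear as lines that start with "ERROR:".
        if printf '%s\n' "$output" | grep -q '^ERROR:'; then
            echo "Stopping after $level: resolve these errors before testing higher levels." >&2
            return 1
        fi
    done
    return 0
}

# Intended use on the head node, as superuser (not run here):
#   run_levels_in_order /hptc_cluster/adm/logs/diag/quadrics level1 level2 level3
```

Stopping at the first failing level avoids spending drain time on higher levels whose results would be dominated by the lower-level fault.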

Example 1

The following example tests level1. All the nodes write their log files to the directory named level1, which is a subdirectory of the global directory /hptc_cluster/adm/logs/diag/quadrics. The -r 0 option specifies that the test is run on rail 0; if there is only one Quadrics adapter in each node, then rail 0 is the only allowable option. The -clean option is specified so that the qsnet2_level_test utility deletes the contents of the level1 directory to ensure that data from previous tests is removed. The -v option ensures verbose output.

# qsnet2_level_test level1 -d \
/hptc_cluster/adm/logs/diag/quadrics -r 0 -clean -v
Example 2

The following example tests level3. Both nodes specified, n1 and n2, save their log files to the global directory /hptc_cluster/adm/logs/diag/quadrics in the directory named level3. Running the qsnet2_level_test diagnostic tool on only two nodes is useful because you can verify that a failing route has been repaired without affecting the use of the rest of the system. The timeout value is 180 seconds and is specified with the -t option; if a node does not complete within 180 seconds, the qsnet2_level_test diagnostic tool reports a broken link and terminates the corresponding processes. As in the preceding example, the test is run on rail 0 with the -clean option, to remove data from a previous test that could corrupt the test results.

# qsnet2_level_test level3 -d \
/hptc_cluster/adm/logs/diag/quadrics \
-N n1,n2 -r 0 -clean -v -t 180
Example 3

The following example displays the output of this diagnostic test, run on three nodes, when the timeout expires.

# qsnet2_level_test level3 -d \
/hptc_cluster/adm/logs/diag/quadrics/ \
-c -N n1,n2,n3 -t 180
Starting Level 3 Test
Testing on: n1,n2,n3
Warning: timelimit expired

Killed
Test ran on: n1,n2,n3
Parsing output
level3: n1 - (NodeId = 4)
ERROR: Test incomplete
level3: n2 - (NodeId = 3)
ERROR: Test incomplete
level3: n3 - (NodeId = 2)
ERROR: Test incomplete
Parsing complete 
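
When many nodes are tested, output like this can be condensed to one line per failing node. The following bash sketch assumes the layout shown above, in which a line naming the node is followed by its ERROR line; summarize_level_errors is a hypothetical helper name.

```shell
# Read qsnet2_level_test output on stdin and print "node: message"
# for every ERROR line, using the node named on the preceding header line.
summarize_level_errors() {
    awk '
        /^level[0-9]+: / { node = $2; next }
        /^ERROR:/        { sub(/^ERROR: */, ""); print node ": " $0 }
    '
}

# Sample taken from the output above.
summarize_level_errors <<'EOF'
level3: n1 - (NodeId = 4)
ERROR: Test incomplete
level3: n2 - (NodeId = 3)
ERROR: Test incomplete
EOF
# prints:
#   n1: Test incomplete
#   n2: Test incomplete
```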
Example 4

The following example parses the output files created from a previous run of this command. This example specifies the log file directory created after unzipping and extracting the qsnet2_drain_test log file, which is described in the next section.

# qsnet2_level_test level1 -d \
/var/log/diag/quadrics/qsnet2_drain_test/qsnet2_drain_test \
-parse

The qsnet2_drain_test Diagnostic Tool

This tool runs up to six tests for the Quadrics switches in an HP XC system:

  • Runs the qsctrl utility to verify that the system interconnects are running within the proper environmental parameters for operation.

  • Runs qsnet2_level_test at level 1.

  • Runs qsnet2_level_test at level 2.

  • Runs qsnet2_level_test at level 3.

  • Runs qsportmap on federated systems to test the link cable connectivity.

  • Runs qsnet2_level_test at level 4 on federated systems.

Note:

You must launch this command from the head node. Run this command only during allocated preventive maintenance time frames because this diagnostic tool uses the adapter and the link 100 percent of the time during the test and, as a result, has a great effect on machine performance.

The command format for the qsnet2_drain_test utility is shown here:

qsnet2_drain_test [-help] [-d logdirectory]

The -help option displays the command line options.

The -d option enables you to specify a log directory. The output from the qsnet2_drain_test utility and from individual tests is bundled in a tar file (compressed with the gzip utility) and placed in the specified log directory; the directory /var/log/diag/quadrics/qsnet2_drain_test is used by default if the -d option is not specified.

Note:

You must manually unzip the tar file, extract the files, and examine them for errors.
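
The manual steps can be combined into one helper. The following bash sketch is an illustration under assumptions: the function name scan_drain_archive is hypothetical, the archive file name is left to the caller because this guide does not state it, and the search string "ERROR" is based on the qsnet2_level_test output shown earlier, so other tests in the bundle may report problems differently.

```shell
# Usage: scan_drain_archive ARCHIVE DEST_DIR
# Returns 0 if no file mentions ERROR, 1 if any does, 2 if extraction fails.
scan_drain_archive() {
    local archive="$1" dest="$2"
    mkdir -p "$dest" || return 2
    # -x extract, -z decompress with gzip, -f archive file, -C target directory
    tar -xzf "$archive" -C "$dest" || return 2
    # -r recurse into the extracted tree, -n show line numbers
    if grep -rn 'ERROR' "$dest"; then
        return 1
    fi
    return 0
}

# Intended use (archive name is a placeholder, not run here):
#   scan_drain_archive /var/log/diag/quadrics/qsnet2_drain_test/ARCHIVE /tmp/drain_logs
```

A clean result from this search does not replace reading the logs; it only points you to the files that clearly need attention.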

Using Diagnostic Tools for the InfiniBand Interconnect

The ib_prodmode_mon diagnostic tool monitors the InfiniBand switches and generates an alert when it detects any of the following network errors:

  • Links running at 1X speeds instead of the normal 4X

  • Links reporting excessive Receive errors

  • Links reporting IB_TIMEOUT, meaning the node is down.

  • Links reporting a state other than PORT_ACTIVE, meaning the link is down.

The output that ib_prodmode_mon produces identifies the bad links so that you can take corrective action. It resembles the following:

date time node  ib_prodmode_mon: 
IR0N00 - Link xc9n1 GUID  0008f10403961325 (LID 3 PORT 1) 
<==> GUID 0008f10400410876 (LID 1 PORT 1  ) running at 1X

ibt1 - Link ibt1 SLOT 1 PORT 14 GUID 0008f104003f0726 (LID 2 PORT 23) 
<==> R1C5-IB14 PORT 7 GUID 0008f10400410a10 (LID 617 PORT 7 ) reporting 
4297 RcvErrs, which is above threshold of 2400

ibt1 - Link 0008f1040396cd64 PORT 1 GUID 0008f1040396cd65 (LID 950 PORT 
1) <==> R3C10-IB58 PORT 18 GUID 0008f104004108cc (LID 589 PORT 18 ) 
reporting Status ALERT IB_TIMEOUT 
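
Alerts in this form can be sorted by type before you act on them. The following bash sketch assumes the layout shown above, with each alert wrapped over several lines and alerts separated by blank lines; the function name classify_ib_alerts and the category labels are hypothetical. No sample of a link-state alert appears in this guide, so such alerts fall into the "other" category here.

```shell
# Read ib_prodmode_mon output on stdin and print "switch: category"
# for each blank-line-separated alert.
classify_ib_alerts() {
    awk 'BEGIN { RS = "" }                  # paragraph mode: one alert per record
    {
        gsub(/\n/, " ")                     # rejoin wrapped lines
        name = "?"
        # The switch name is the word immediately before " - Link".
        if (match($0, /[^ ]+ - Link/)) name = substr($0, RSTART, RLENGTH - 7)
        kind = "other"
        if ($0 ~ /running at 1X/)   kind = "link-at-1X"
        else if ($0 ~ /RcvErrs/)    kind = "receive-errors"
        else if ($0 ~ /IB_TIMEOUT/) kind = "node-down"
        print name ": " kind
    }'
}

# Intended use on a saved copy of the output (file name is a placeholder):
#   classify_ib_alerts < saved_output.txt
```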

The ib_prodmode_mon diagnostic tool searches /etc/hosts for entries whose name matches the regular expression "IR0[NT][0-9][0-9]".

This command uses the wget command to obtain the PortCounters.csv file from the switch and parses the output. The ib_prodmode_mon diagnostic tool generates an alert if it finds any errors. All alerts are logged in the /var/log/messages file and the ib_prodmode_mon.log file.
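
To review recent alerts from a tool, you can filter the log by tool name. The following bash sketch assumes that each alert line carries the tool name followed by a colon, as in the sample output earlier in this section; show_tool_alerts is a hypothetical helper name.

```shell
# Usage: show_tool_alerts TOOL [LOGFILE] [COUNT]
# Prints the last COUNT lines (default 20) of LOGFILE that mention "TOOL:".
show_tool_alerts() {
    local tool="$1" logfile="${2:-/var/log/messages}" count="${3:-20}"
    # -F treats the tool name as a fixed string rather than a regular expression
    grep -F -- "${tool}:" "$logfile" | tail -n "$count"
}

# Intended use, as superuser (reads the real log, so it is left commented out):
#   show_tool_alerts ib_prodmode_mon
```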

The format of this command is:

ib_prodmode_mon [--help] [--verbose] [-d directory-name]

Output from ib_prodmode_mon is logged to /var/log/diag/ib/ib_prodmode_mon/ib_prodmode_mon.log by default, but you can specify another directory with the -d option. Progress messages are also written to stdout while the diagnostic test runs.

This command is configured to run once each day by a crontab file in the /etc/cron.daily directory.

Using Diagnostic Tools for the Gigabit Ethernet System Interconnect

The Gigabit Ethernet system interconnect is diagnosed with the standard Ethernet diagnostic tools, including:

  • tcpdump

  • ifconfig

  • mii-tool

For more information about these diagnostic tools, see the Linux Administration Handbook and the corresponding manpages.
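
As a starting point, the three tools can be run in sequence against one interface. The following bash sketch is only a suggestion: the function name gige_quick_check, the default interface eth0, and the choice of options are assumptions, and the commands generally require superuser privileges. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Usage: gige_quick_check [INTERFACE]
gige_quick_check() {
    local iface="${1:-eth0}"
    local run=${DRY_RUN:+echo}          # with DRY_RUN set, print instead of run
    $run ifconfig "$iface"              # address, state, and error counters
    $run mii-tool -v "$iface"           # negotiated link speed and duplex
    $run tcpdump -i "$iface" -c 20 -n   # capture 20 packets without name lookups
}

# Preview the commands without touching the network:
DRY_RUN=1 gige_quick_check eth0
```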

© 2003 Hewlett-Packard Development Company, L.P.