Using the System Interconnect Diagnostic Tools
Various tools enable you to diagnose the system interconnect. Some tools are provided by the system interconnect manufacturer and are discussed in the Installation and Operation Guide (the hardware documentation) for your system. Be sure to consult the appropriate Web page for these system interconnect tools:

Other tools have been written specifically for use with the HP XC system. To use the diagnostic tools, you must ensure that the system interconnect is properly configured. The IP addresses must be configured and the /etc/hosts file must be updated with the switch names, for example MR0N00 for the Myrinet system interconnect and QR0N00 for the Quadrics system interconnect. These topics are discussed in the HP XC System Software Installation Guide.

This section describes diagnostic tools for the Myrinet, Quadrics, and InfiniBand system interconnects.

This section describes tools that were developed specifically for diagnosing the Myrinet system interconnect (from Myricom, Inc.) on the HP XC system. See your system's hardware installation and operation guide for information about standard diagnostic tools.

The gm_prodmode_mon program monitors the GM2.1 switch, reads current environment parameters, and generates alerts if the values of the following parameters are outside the operating ranges recommended by the manufacturer:

The gm_prodmode_mon diagnostic tool searches /etc/hosts for entries whose name matches the regular expression "MR0[NT][0-9][0-9]". It uses the links -dump command to obtain the current values and parses the output. The gm_prodmode_mon diagnostic tool generates an alert if any errors are found. All alerts are logged in the /var/log/messages file. The format of this command is:
The output from gm_prodmode_mon is logged to /var/log/diag/myrinet/gm_prodmode_mon/links.log by default, but you can specify another directory with the -d option. Progress messages for the diagnostic test are written to stdout. This command is configured to run once each hour by a crontab file in the /etc/cron.hourly directory.

The gm_drain_test diagnostic tool runs five tests for the Myricom® switches in an HP XC system. You must launch it from the head node and run it only during allocated preventive maintenance. The five tests are as follows:

If the mute files exist, they are also gathered.
The output from the gm_drain_test, as well as output from the individual tests, is gathered in a tar file, which is compressed with the gzip utility. You can use the -d option to specify the directory where this file will be located; if you do not specify a directory, it is placed in /var/log/diag/myrinet/gm_drain_test/ by default. This command has the following format:
This section describes tools that were developed specifically for diagnosing the Quadrics system interconnect (from Quadrics, Ltd.) on the HP XC system. In addition, these tools include the swmlogger daemon, provided by Quadrics, which can be configured to log errors to a database, along with tools to access the information in this database. See your system's hardware installation and operation guide for information about vendor-specific diagnostic tools.

The swmlogger daemon is a Quadrics system interconnect tool that monitors the state of each switch. It updates the database for changes in the status of fans, power supplies, temperature, and link errors. It also handles alerts from the monitoring daemons on each node that generate messages.

The swmlogger daemon provides a service to enable switch management daemons (swmserver) running on embedded QNX modules in QsNetII® network switches to log messages about switch health to the network, either to a log file for generic usage or through a MySQL database (as is the case in HP XC systems). In addition to logging errors in the QsNet database, the swmlogger daemon also logs all errors to the /var/log/messages file. See the diagnostics section of the installation and operation guide for your model of HP cluster platform for additional information on the generic use of swmlogger.

The MySQL database for qsnet is created automatically during the configuration phase of the HP XC system installation, and the swmlogger daemon is configured to run on the head node. Use the qsdiagadm and the qsneterr utilities to report information from the qsnet database.

The qsdiagadm Utility
A useful function of the Quadrics qsdiagadm utility is to display a diagnostic history for the HP XC system, as shown in the following example:
The qsneterr Utility
The Quadrics qsneterr utility reports on the network errors logged by the swmlogger daemon in the diagnostics database. The default action is to print summaries sorted by time and by module.

The qselantestp tool runs the qselantest diagnostic tool in parallel on all the nodes of the HP XC system. The qselantest diagnostic tool verifies the presence of the QM500 adapter in a node and verifies the clock speed, thread processor, SDRAM, PCI-X connectivity, the QsNetII link interface, the direct memory access (DMA) latency, and the link bandwidth between the QM500 adapter and the switch port to which it is connected.
The command format for this utility is as follows:
The options are defined as follows:
Enter the following command on the head node as superuser to begin this diagnostic:
The qselantestp tool also parses each node's output file and reports any errors it finds.

The qsnet2_level_test utility is a useful tool for diagnosing the Quadrics system interconnect. This utility uses the pdsh command to execute itself (that is, the qsnet2_level_test utility) on all nodes simultaneously. Each node has a specific path through the network that the node is responsible for testing. This path is derived from the node's physical position in the Quadrics network.

The qsnet2_level_test utility can be run through the various levels to test that path. Therefore, it is important to run level1 first, then level2, level3, and so on, up to the maximum level on your machine topology. If a node fails at level1, then the node will fail on all the other levels because it sends data through level1 to reach the higher levels. Ensure that level1 passes before testing level2.
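The ascending-order rule above can be sketched as a small driver loop. This is only an illustration: `run_levels_in_order` and `fake_runner` are hypothetical names, and the runner function stands in for whatever actually launches one level and decides whether all nodes passed. The text states the dependency only for level1; stopping at the first failure of any level is a cautious assumption made here.

```python
def run_levels_in_order(max_level, run_level):
    """Run level1..level<max_level> in ascending order; stop at the first failure.

    run_level is a caller-supplied (hypothetical) function that runs one
    level and returns True if every node passed.
    """
    results = {}
    for n in range(1, max_level + 1):
        name = f"level{n}"
        passed = run_level(name)
        results[name] = passed
        if not passed:
            # Assumption: a failure at a lower level makes results from
            # higher levels unreliable, so fix it before continuing.
            break
    return results


# Stand-in runner for illustration only: pretend level2 has a broken link.
def fake_runner(level):
    return level != "level2"


print(run_levels_in_order(3, fake_runner))
# → {'level1': True, 'level2': False}
```

Because level3 is never attempted in this run, its absence from the result signals that the lower-level failure must be repaired first.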
You must run the qsnet2_level_test utility as superuser from the head node. If any nodes have not completed their tests when the timeout period elapses, the qsnet2_level_test utility terminates and the incomplete processes on the remaining nodes are killed.

The format of the qsnet2_level_test utility is shown here:

qsnet2_level_test levels [-d dir] [-parse|-noparse] [-v] [-t timeout] [-N nodes] [-r rail] [-clean]

On the command line, you specify the number of fabric levels you want to test. The qsnet2_level_test utility automatically creates a new directory with the level name in the directory specified on the command line. If the directory already exists, the qsnet2_level_test utility continues and uses the existing directory. Acceptable values for the levels parameter are level1, level2, level3, level, level, and ALL. Test the levels in ascending order.

The options are defined as follows:
The qsnet2_level_test utility uses qsnet2_dmatest to sanitize and test individual links in the network. The qsnet2_dmatest also reports bandwidth for small and large data transfers as it ramps up. It automatically examines and reports whether the system interconnect detected any errors during the test. The log files for qsnet2_dmatest are written to the directory specified on the command line.

Level3 of the qsnet2_dmatest tests the links between the QM502 card and the QM501 rear switch cards through the midplane. If the QM502 cards are not installed, the test spawns across the nodes, then waits for the timeout period and reports all the nodes as "failed to complete".

Example 1
The following example tests level1. All the nodes write their log files to the directory named level1, which is a subdirectory of the global directory /hptc_cluster/adm/logs/diag/quadrics. The -r 0 option specifies that the test is run on rail 0; if there is only one Quadrics adapter in each node, then rail 0 is the only allowable option. The -clean option is specified so that the qsnet2_level_test utility deletes the contents of the level1 directory to ensure that data from previous tests is removed. The -v option ensures verbose output.
Example 2
The following example tests level3. Both nodes specified, n1 and n2, save their log files to the global directory /hptc_cluster/adm/logs/diag/quadrics in the directory named level3. Running the qsnet2_level_test diagnostic tool on only two nodes is useful because you can verify that a failing route has been repaired without affecting the use of the rest of the system. The timeout value is 180 seconds and is specified with the -t option; if a node does not complete within 180 seconds, the qsnet2_level_test diagnostic tool indicates it is a broken link. Also, the corresponding processes are terminated. As in the preceding example, the test is run on rail 0 with the -clean option, to remove data from a previous test that can corrupt the test results.
Example 3
The following example displays the output of this diagnostic test, run on three nodes, when the timeout expires.
Example 4
The following example parses the output files created from a previous run of this command. This example specifies the log file directory created after unzipping and extracting the qsnet2_drain_test log file, which is described in the next section.
The qsnet2_drain_test tool runs up to six tests for the Quadrics switches in an HP XC system:
The command format for the qsnet2_drain_test utility is shown here:

The -help option displays the command line options. The -d option enables you to specify a log directory. The output from the qsnet2_drain_test utility and from individual tests is bundled in a tar file (compressed with the gzip utility) and placed in the specified log directory; the directory /var/log/diag/quadrics/qsnet2_drain_test is used by default if the -d option is not specified.
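Before unpacking a gzip-compressed tar bundle like the one described above, it can help to list what it contains. The following is a minimal sketch using Python's standard tarfile module. The archive name and member names are hypothetical stand-ins built in a temporary directory; the real bundle's file names are not given here.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_bundle(archive):
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(tar.getnames())


# Build a tiny stand-in archive so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "example_drain_test.tar.gz"  # hypothetical name
    with tarfile.open(archive, "w:gz") as tar:
        for name, text in [("level1/n1.log", "ok\n"), ("summary.log", "done\n")]:
            data = text.encode()
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    members = list_bundle(archive)

print(members)
# → ['level1/n1.log', 'summary.log']
```

Pointing `list_bundle` at an archive in the log directory would show which per-test logs were gathered, without extracting anything.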
The ib_prodmode_mon diagnostic tool monitors the InfiniBand switches and generates alerts to notify you if it detects any of these network errors:
The output that ib_prodmode_mon produces identifies the bad links so that you can take corrective action. It resembles the following:
The ib_prodmode_mon diagnostic tool searches /etc/hosts for entries whose name matches the regular expression "IR0[NT][0-9][0-9]". This command uses the wget command to obtain the PortCounters.csv file from the switch and parses the output. The ib_prodmode_mon diagnostic tool generates an alert if it finds any errors. All alerts are logged in the /var/log/messages file and the ib_prodmode_mon.log file.

The format of this command is:

ib_prodmode_mon [--help] [--verbose] [-d directory-name]

The output from ib_prodmode_mon is logged to /var/log/diag/ib/ib_prodmode_mon/ib_prodmode_mon.log by default, but you can specify another directory with the -d option. Progress messages for the diagnostic test are written to stdout. This command is configured to run once each day by a crontab file in the /etc/cron.daily directory.
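To check which /etc/hosts entries a monitor of this kind would pick up, the name match can be reproduced with a short script. This is a sketch, not the tool's own code: `switch_names` is a hypothetical helper, the sample addresses are invented, and requiring the whole name to match the pattern is an assumption about how the tool applies the regular expression.

```python
import re

# Pattern quoted in the text for InfiniBand switch names. The Myrinet
# monitor uses the same shape with an "MR0" prefix.
IB_SWITCH = re.compile(r"IR0[NT][0-9][0-9]")


def switch_names(hosts_text, pattern=IB_SWITCH):
    """Return names in /etc/hosts-style text that fully match the pattern."""
    names = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        fields = line.split()
        # fields[0] is the IP address; the rest are host names and aliases.
        for name in fields[1:]:
            if pattern.fullmatch(name) and name not in names:
                names.append(name)
    return names


# Invented sample data for illustration only.
SAMPLE_HOSTS = """\
127.0.0.1    localhost
172.20.0.1   n1
172.20.66.1  IR0N00    # hypothetical address
172.20.66.2  IR0T01
172.20.66.3  MR0N00
"""

print(switch_names(SAMPLE_HOSTS))
# → ['IR0N00', 'IR0T01']
```

Passing `re.compile(r"MR0[NT][0-9][0-9]")` as the second argument selects the Myrinet switch names instead.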