STM: Support Tools for HP Computers

Frequently Asked Questions (FAQs)

Contents: General STM / Installing STM / Starting STM / Device map/Available tools / General tool / Specific tool / CSTM scripts


General STM questions:

How do I start STM?

There are three user interfaces (all in /usr/sbin):
xstm - The X Windows interface
mstm - The menu interface
cstm - The command line interface

How do I use STM?

  1. Select a device or devices

    xstm: click on device icon

    mstm: move cursor over device and hit the SPACE key.

  2. Run a tool

    Tools --> <tool> --> Run

  3. View logs after the tool completes

    Tools --> <tool> --> <log>

Which tool should I use?

Which log should I look at?

STM provides several different logs for different situations:

On which releases is SYSDIAG removed from the system?

When any of the following releases are installed, SYSDIAG will be removed from the system:

10.30
IPR9707
T600 Patch
Gamma (Workstation ACE)

SYSDIAG is removed because the functionality provided in STM in these releases is a superset of SYSDIAG, completing the transition to STM. All STM releases from this point on will eliminate SYSDIAG if it is found on the system.

Is STM available on HP-UX 9.x and MPE systems?

HP-UX 9.x systems have a program called STM. (This is NOT the same program as the STM released with HP-UX 10.x and 11.x.) This program offers a limited number of verifiers and exercisers, and can invoke the 'sysdiag' diagnostic system for some limited diagnostics. In general, however, HP-UX 9.x and MPE systems rely on the 'sysdiag' diagnostics for their support tools. A port of STM to MPE/iX is currently planned. There are no plans at the present time to port the new STM to 9.x.

How do I view my OS and memory logs? (What replaced sysdiag's logtool?)

A new STM utility called logtool is used to look at system device and memory logs. See the logtool tutorial for more information on this utility.

Why does the User Interface (UI) report an incorrect password or that diagmond won't start, or why does psconfig say that diagmond is not running?

You may have a "multi-homed" machine. To find out whether your machine is "multi-homed", issue the command:

nslookup `hostname`

If this command reports "Addresses: " and then lists more than one IP address associated with the host name, then your machine is multi-homed.
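The check can be scripted. The following is a minimal POSIX shell sketch, assuming nslookup output in the format described above (a "Name:" line followed by an "Address:" or "Addresses:" line, with extra addresses comma-separated or on continuation lines). The function name count_host_addresses is hypothetical, not part of STM; it skips the DNS server's own address, which nslookup prints before the "Name:" line.

```sh
# count_host_addresses (hypothetical helper): read nslookup output on stdin
# and print how many IPv4 addresses are listed for the looked-up host.
count_host_addresses() {
    awk '
        # Lines before "Name:" describe the DNS server, not the host.
        /^Name:/ { in_answer = 1; next }
        # Drop the "Address:" / "Addresses:" label so only addresses remain.
        in_answer && /^Address(es)?:/ { sub(/^Address(es)?:[ \t]*/, "") }
        # Count every dotted-quad token (addresses may be comma-separated
        # or continued on indented lines).
        in_answer {
            n = split($0, parts, /[ \t,]+/)
            for (i = 1; i <= n; i++)
                if (parts[i] ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) count++
        }
        END { print count + 0 }
    '
}

# Example: nslookup `hostname` | count_host_addresses
# A result greater than 1 means the host name is multi-homed.
```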

STM has known problems with this configuration, which also affect the Predictive software. The problems are related to the fact that the Domain Name System (DNS) server uses a round-robin method, which returns a different IP address for the hostname each time it is queried. STM expects a given hostname to resolve to the same IP address, and gets confused when this is not the case.

The STM development team is currently working on a fix for these problems.

Workaround to Run the Support Tool Manager:

Check whether the diagnostic daemon diagmond is running by executing the command:

ps -ef | grep diagmond

If diagmond is running, but the UI is reporting an invalid password, then you should be able to get it to run by giving it a particular IP address. Use the System -> Sel System to Test -> Select Current System... command. Place the IP address (or a hostname which is associated with just one IP address) in the "add" field (in mstm, also hit Return). Then hit OK in the name / password screen. This should cause the User Interface (UI) to connect to the local system's diagnostic daemon.

Workarounds to Run STM and psconfig:

The following workaround was reported:

  1. The customer had made entries for delta1 and delta2 in /etc/hosts, but /etc/nsswitch.conf was not using /etc/hosts: it sent hostname lookups to the DNS first, and the DNS returned an IP address via the round-robin method described above.
  2. The following entry was made in /etc/nsswitch.conf:

      # /etc/nsswitch.conf
      #
      #
      # Using files 1st for hostname lookup to workaround multi-homed problem
      # with predictive.
      #

      hosts:        files [NOTFOUND=continue] dns nis

    so that /etc/hosts is checked first, then the DNS.

This method worked after stopping and restarting diagmond. The psconfig command brought up the menu normally with no errors, and Predictive was able to run for the first time with no errors.

The round-robin feature of DNS is only disabled for the system on which the /etc/nsswitch.conf file has been modified. Furthermore, the round-robin feature is only disabled when looking up a host with an entry in the /etc/hosts file.
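To confirm that a system's /etc/nsswitch.conf has been changed as described, you could use a quick check like the following. It is a POSIX shell sketch, not an HP-supplied tool: the function name hosts_files_first is hypothetical, and it reads the first uncommented hosts: line that names either files or dns, printing "yes" if files comes first and "no" otherwise.

```sh
# hosts_files_first (hypothetical helper): print "yes" if the active hosts:
# line in nsswitch.conf lists "files" before "dns", otherwise "no".
hosts_files_first() {
    awk '
        /^[ \t]*hosts[ \t]*:/ {
            line = $0
            sub(/^[^:]*:/, "", line)            # drop the "hosts:" label
            n = split(line, src, /[ \t]+/)      # sources and [STATUS=action] tokens
            for (i = 1; i <= n; i++) {
                if (src[i] == "files") { print "yes"; found = 1; exit }
                if (src[i] == "dns")   { print "no";  found = 1; exit }
            }
        }
        END { if (!found) print "no" }          # no hosts line naming files/dns
    ' "${1:-/etc/nsswitch.conf}"
}

# Example: hosts_files_first            (checks /etc/nsswitch.conf)
```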

What are the daemons involved in STM?

Diagmond is the DIAGnostic MONitor Daemon. (The final "d" follows the UNIX convention that daemon names end with "d".)

Diagmond is the only diagnostic daemon which is started directly by the initialization scripts. It in turn launches the other diagnostic daemons: diaglogd, memlogd and psmond. Diaglogd (DIAGnostic LOGging Daemon) is the daemon that does the OS logging.

If you shut down diagmond, it shuts down those other diagnostic daemons. So... if diagmond fails to start for some reason, then diaglogd won't be running. Of course, if diagmond is running, it is still possible for diaglogd to fail for some reason (like the diag2 driver not being in the kernel.)
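Because diaglogd can fail even while diagmond is running, it can help to check all of the daemons at once. The following is a POSIX shell sketch that assumes ps -ef output on standard input, with the command in the eighth column onward; the function name diag_daemon_status is hypothetical. Lines containing "grep" are skipped so that the search command itself does not count as a match.

```sh
# diag_daemon_status (hypothetical helper): read `ps -ef` output on stdin and
# report whether each STM daemon appears as a running command.
diag_daemon_status() {
    ps_output=$(cat)
    for d in diagmond diaglogd memlogd psmond; do
        # A line matches when a field from the 8th onward (the command and its
        # arguments in ps -ef output) is the daemon name or ends in /name.
        if printf '%s\n' "$ps_output" | awk -v name="$d" '
            /grep/ { next }
            { for (i = 8; i <= NF; i++) if ($i ~ ("(^|/)" name "$")) found = 1 }
            END { exit !found }'
        then
            echo "$d running"
        else
            echo "$d not running"
        fi
    done
}

# Example: ps -ef | diag_daemon_status
```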

How can I get information about features and problems in the different releases of the diagnostics?

For the latest information, go to the diagnostics web site: http://docs.hp.com/hpux/diag/ . Under the heading "Online Diagnostics: Support Tools Manager (STM)", click on "Release Notes". At this site, you can also find Release Notes for EMS Hardware Monitors.

The diagnostic daemon diagmond seems to be using too much of the CPU (IPR 9909).

In IPR 9909, there is a problem whereby the diagnostic daemon "diagmond" uses a large percentage (between 10% and 15%) of the CPU whenever a User Interface (UI) for STM is not running.

To fix this problem, load one of the following patches (available from the HP ITResource Center, http://itresourcecenter.hp.com):

PHSS_20005: s700 10.20 STM panic, disk_em,diagmond,tlscsidev
PHSS_20006: s800 10.20 STM panic, disk_em,diagmond,tlscside
PHSS_20007: s700_800 11.00 STM panic, disk_em,diagmond,tlscsidev

Alternatively, there is a workaround: start the STM user interface (UI) (for example, cstm) and leave it running. As long as a UI is running, the CPU usage by diagmond will be normal.

When I try to get online help in xstm on HP-UX 11.20, nothing happens.

Symptoms: In xstm (the graphical interface for online diagnostics), no online help appears when you enter a command for online help. The diagnostics themselves are still functional.

Cause/Action: Online help for xstm requires that the Netscape browser be installed on the computer. The xstm interface invokes the Netscape browser to display .htm files located in the diagnostics sub-directories on disk. (The mstm and cstm interfaces do not use the Netscape browser.)

The easiest way to install the Netscape browser is from the AR (Application Release) Media supplied with the computer.

Alternative: Using any computer, you can view online help for xstm, mstm, and cstm on the Internet at: http://docs.hp.com/hpux/onlinedocs/diag/stm/sth_summ.htm

On blade servers, the support tools don't tell the chassis and slot number of the server blade under test.

A blade server is a chassis which contains several server blades and other blades. A server blade is a "computer on a card" that runs its own OS. Thus, the server blade is the computer system.

When STM tools and hardware event monitors refer to the computer system, they will give the HP-UX hostname of the server blade. For example, when hardware event monitors report events, they name the system on which the event occurs in a format such as:

hpdst313 sent Event Monitor notification information

where "hpdst313" is the HP-UX hostname of the blade reporting the event. As of May, 2002, the support tools do not directly name the slot and chassis in which the server blade is located.

To make it easier to find the location of a particular blade, given its hostname, you can:


Installing STM and patches

I was unable to load a recent patch; what is the problem? (PHSS_14401 - PHSS_14406)

Do not install patches PHSS_14401 through PHSS_14407 unless specifically instructed to do so by support personnel. These patches load an earlier version of STM. PHSS_14401 (10.01 S800) and PHSS_14402 (10.01 S700) load version A.09.00. All other PHSS_1440x patches load version A.10.00.

The BEST SOLUTION is not to load this patch, but rather to load the latest version of the diagnostics. These are available from the Support Plus Media (IPR 9909 or later), Diagnostic/IPR Media (IPR 9906 or earlier), or from HP's Software Depot at http://www.hp.com/go/softwaredepot. (Look under "Enhancement Releases" for the bundle titled "Support Tools for the HP 9000".)

If you do load the patch, and run into problems, here's the procedure to get rid of it:

  1. Run swremove, and unset the "Enforce script failures" option.
  2. Remove both the patch and the OnlineDiag bundle.
  3. Reinstall IPR 9810.

When I try to install an STM patch, why does the checkinstall script fail? (PHSS_17884 - PHSS_17888)

Patches PHSS_17884, PHSS_17885, PHSS_17886, PHSS_17887 and PHSS_17888 contain the entire diagnostic system, rather than just changes to the diagnostics already loaded on your system.

These patches are the same as patches PHSS_14401 - PHSS_14406, but with a script file in the patch depot which verifies that you are not backdating your diagnostics by loading these patches on top of a more recent version.

If, however, the diagnostics that are on your system are very old (STM version A.03.00 and earlier,) or if there is not a version of STM loaded on your system, then the patch cannot determine the revision of the existing diagnostics and concludes that it should not be loaded. The checkinstall script will fail, and the /var/adm/sw/swagent.log will contain a message something like:

ERROR:    The patch PHSS_1788x is trying to update a system which already
contains a OnlineDiag bundle version newer than is required for this patch
or there is no OnlineDiag bundle installed.  Do not override this install,
doing so may cause STM diagnotics to improperly operate.

If you experience this problem:

Should I swremove an older STM before installing a new one?

It is typically not necessary to do an swremove before installing the Support Tools on a system with an older version of Support Tools.

On HP-UX 10.20, if the old version of Support Tools is earlier than May 1997 (IPR 9705), then it is advisable (but not necessary) to do an swremove.

(If you do an swremove of the Support Tools on HP-UX 10.20, it will cause the system to reboot after the removal takes place.)

I'm running into problems when I load the OnlineDiag bundle from the June 1999 (IPR 9906) Diagnostic/IPR Media. For example, a system panic occurs after I install the bundle on a V-Class or on a system with a DLT tape library product, or Predictive Support doesn't work.


CAUTION: Do not load the OnlineDiag product from the June 1999 Diagnostic/IPR CD-ROM if your system includes a V-Class computer, a DLT tape library, or Predictive Support. For a problem description and fix, see the customer letter (PDF).


When I update a previous version of the Support Tools, the swagent.log file reports that some files are not being correctly installed.

There was an swinstall problem that occurred when certain releases of Support Tools were used to update a system with a previous release. This problem occurs only on HP-UX 10.20 on the following releases: June 99, Dec 99, Mar 00.

If a file was being delivered via /usr/newconfig and it had changed from the version on the system, then swinstall would not actually install the file. Instead, it would leave it in /usr/newconfig/*. It would also log a message like the following to /var/adm/sw/swagent.log:

NOTE:
A new version of "/var/stm/config/tools/utility/os_decode_xref"
has been placed on the system.  The new version is located at
"/usr/newconfig/var/stm/config/tools/utility/os_decode_xref".
The contents of the newly installed file differ from the
contents of "/var/stm/config/tools/utility/os_decode_xref",
and the previously delivered file is not available for
comparison.  Therefore "/var/stm/config/tools/utility/os_decode_xref"
is not being overwritten.  The System Administrator should
resolve this situation manually.

In this example, you could solve the problem by entering the following command:

cp /usr/newconfig/var/stm/config/tools/utility/os_decode_xref \
/var/stm/config/tools/utility/os_decode_xref

Note any similar lines in swagent.log about other files and perform a similar copy for them.
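If swagent.log reports several such files, a sketch like the following can list the copy commands for all of them. It assumes, as in the example above, that each new copy sits at the same path under /usr/newconfig, and that the quoted path appears on the same line as "A new version of". The function name newconfig_copy_commands is hypothetical; it only prints the cp commands so you can review them before running anything.

```sh
# newconfig_copy_commands (hypothetical helper): scan swagent.log for
# "A new version of ..." notes and print one cp command per file.
newconfig_copy_commands() {
    log=${1:-/var/adm/sw/swagent.log}
    # Keep only the quoted path that follows "A new version of".
    sed -n 's|.*A new version of "\(/[^"]*\)".*|\1|p' "$log" | sort -u |
    while IFS= read -r target; do
        # Assumption: the new copy is the same path under /usr/newconfig.
        printf 'cp /usr/newconfig%s %s\n' "$target" "$target"
    done
}

# Example: newconfig_copy_commands > /tmp/copy_cmds
#   (review /tmp/copy_cmds, then run it as root with: sh /tmp/copy_cmds)
```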

This problem is almost the same as the one described in another FAQ: "When I try to run logtool, I get an error message about a missing log decoding program."

How can I find out what patches are required or recommended for installing a particular release of the Support Tools (OnlineDiag SD Bundle)?

For the latest information, go to the diagnostics web site: http://docs.hp.com/hpux/diag/ . Under the heading "Diagnostics (Support Tools): General Information", click on "DIAGNOSTICS.readme files".

Where is the Diagnostic/IPR Media for the September 1999 release (IPR 9909)?

As of the September 1999 release, the name of the Diagnostic/IPR Media has been changed to Support Plus Media.

After I updated the diagnostics, messages were not found when I ran STM, and/or psmctd failed to start with sig 6.

You might have used the "match_target=true" swinstall option while updating the diagnostics. The default value of match_target is "false", and in general, you should leave the swinstall options set at their default values when installing diagnostics.

To see if the "match_target=true" swinstall option caused the problem, look in /var/adm/sw/swagent.log at your most recent installation of OnlineDiags. Within this session, you should see a line which begins (depending on your system type):

On 10.20 S700:
     * Installing fileset "Sup-Tool-Mgr-700.STM-CATALOGS,r=

On 10.20 S800:
     * Installing fileset "Sup-Tool-Mgr-800.STM-CATALOGS,r=

On 11.00:
     * Installing fileset "Sup-Tool-Mgr.STM-CATALOGS,r=

If this line is not present, then it is likely that an older installation of diagnostics was updated with the "match_target" option set to true. To fix the problem, re-install the diagnostics. This time, leave the swinstall options set to their default values, with the exception of the option to install filesets even if the same revision exists; this option should be set to true ("reinstall=true").
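The log check described above can be sketched as a single grep. The function name stm_catalogs_installed is hypothetical. Note that it searches the whole of swagent.log, not just the most recent OnlineDiags session, so a match left over from an older installation would also be reported; confirm by hand that the matching line belongs to your latest session.

```sh
# stm_catalogs_installed (hypothetical helper): look in swagent.log for the
# "Installing fileset ... STM-CATALOGS" line described above.
stm_catalogs_installed() {
    log=${1:-/var/adm/sw/swagent.log}
    # Matches Sup-Tool-Mgr-700, Sup-Tool-Mgr-800 and Sup-Tool-Mgr (11.00).
    if grep 'Installing fileset "Sup-Tool-Mgr[-0-9]*\.STM-CATALOGS,r=' "$log" >/dev/null
    then
        echo "STM-CATALOGS fileset found in $log"
    else
        echo "STM-CATALOGS fileset not found: re-install the diagnostics"
        return 1
    fi
}

# Example: stm_catalogs_installed
```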

Detailed explanation:
When updating the OnlineDiags, do not set the "match_target" swinstall option to "true". The "match_target" option causes only the filesets found on your system to be updated. ("Filesets" are used by swinstall to contain the actual files which make up the software; software bundles consist of multiple filesets.)

Over time, new filesets have been added to the OnlineDiag bundle to accommodate new functionality, and files previously delivered in existing filesets have been moved to new filesets. Using the "match_target" option means that only those filesets which were part of the diagnostics at the time of your last installation will be updated. This can cause the diagnostic system to fail in multiple ways.

Message catalogs were moved to a new fileset, to accommodate the downloading of message catalogs when the UI connects to a remote system. Therefore, one failure is that messages are missing or the message catalogs are found to be out of date.

Also, the Event Monitoring Service (EMS) product was added to the bundle in IPR 9902. If a system with a release earlier than IPR 9902 was updated with "match_target=true" set, then EMS will not be loaded. This will cause the psmctd daemon to abort with a signal 6, which can be seen in the system activity log.


Starting STM

Why won't the user interface connect to a machine?

This is usually because the machine address cannot be resolved or because the diagmond daemon is not running on the machine that is being connected to.

  1. Use /etc/ping to verify that the host address can be resolved.
  2. If ping works, check to see if diagmond is running on the host machine ( ps -ef | grep diagmond ).
  3. If diagmond is not running on the host machine, log into the host machine, run STM on it, and use the File-->Administration-->Local UUT Logs command to view the log files on the host machine and determine what the problem is.

It is also possible that the networking is not configured correctly. In this case, there will probably have been a long delay before the "Waiting to obtain host information..." dialog is displayed. To test the network configuration:

  1. Use 'hostname' to determine your system's host name.
  2. Use 'nslookup `hostname`' to determine your system's IP address.
  3. Use 'lanscan' to view the LANs on your system. Verify that the hardware state of at least one LAN is 'UP'.
  4. Use 'ifconfig <lan_name>' to get the IP address of the LAN(s) which are 'UP'.
  5. Make sure that at least one of these 'UP' LANs has the same IP address as your system IP address.

I just installed the Online Diagnostics and when I enter xstm, mstm, or cstm, it hangs with the message that diagmond may not be up and running.

If this is a first time install, it is possible that the installation/configuration did not complete properly.

  1. Look in /var/stm/data for an id_mod_xref and a prod_op_xref file. If they are not there, the configuration did not complete properly.
  2. Manually perform configuration and restart STM:
    1. swconfig OnlineDiag
    2. File->Administration->Local Startup
    3. Wait a couple minutes for diagmond to come up and map the system.
    4. Retry xstm, mstm, or cstm.
  3. If this doesn't work, installation probably did not complete properly. Check the SD-UX log files to determine the problem with installation and configuration.
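Step 1 can be scripted as follows. This is a minimal POSIX shell sketch; the function name check_stm_config is hypothetical, and it takes an optional directory argument (defaulting to /var/stm/data) so that it can also be tried on a copy of the directory.

```sh
# check_stm_config (hypothetical helper): report whether the two
# cross-reference files expected after configuration are present.
check_stm_config() {
    dir=${1:-/var/stm/data}
    rc=0
    for f in id_mod_xref prod_op_xref; do
        if [ -f "$dir/$f" ]; then
            echo "found:   $dir/$f"
        else
            echo "missing: $dir/$f"
            rc=1                      # configuration did not complete
        fi
    done
    return $rc
}

# Example: check_stm_config || echo "Configuration incomplete: run swconfig OnlineDiag"
```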

If this is an update, it is possible that the installation/configuration did not complete properly.

  1. Check to see if diagmond is up and running: ps -ef | grep diagmond

  2. Check /var/adm/syslog/syslog.log for a diagmond exit entry

  3. If syslog says that diagmond exited due to a user request:

    1. Enter xstm, mstm, or cstm and cancel the dialog indicating that diagmond may not be up (use cancel button on xstm and mstm, ^C on cstm).
    2. Manually re-start diagmond: File->Administration->Local Startup
    3. Exit xstm, mstm, or cstm.
    4. Wait a couple minutes for diagmond to come up and map the system.
    5. Retry xstm, mstm, or cstm.
  4. If the diagmond exit entry says anything other than it exited due to a user request:

    1. Enter xstm, mstm, or cstm and cancel the dialog indicating that diagmond may not be up (use the cancel button on xstm and mstm, ^C on cstm).

    2. View local System Activity Log:
      File->Administration->Local Log Viewing->System Activity Log

    3. If the System Activity Log says that diagmond failed to build the map, look in the Local Map Log for an explanation of the problem:

      File->Administration->Local Log Viewing->Local Map Log

    4. If the problem reported in a log indicates a missing prod_op_xref file or id_mod_xref file, the configuration did not complete properly. To fix this, manually perform configuration and restart STM:

      1. swconfig OnlineDiag
      2. File->Administration->Local Startup
      3. Wait a couple minutes for diagmond to come up and map the system.
      4. Retry xstm, mstm, or cstm.
    5. If the problem reported in a log does NOT indicate a missing prod_op_xref file or id_mod_xref file, follow the instructions in the Local Map Log.

  5. If the problem persists, even after checking everything as described above, the installation probably did not complete properly.

  6. Check the SD-UX log files to determine the problem with installation and configuration.

Why can't I restart the diagnostic daemon 'diagmond'?

If you use 'kill -9' to kill diagmond, you will not be able to restart it. This is because it was not able to release its socket port when it was killed and the port is therefore not available. If you need to kill the diagnostic daemon, use 'kill -2' or use the function File->Administration->STM Shutdown.
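To send kill -2 (SIGINT) to the right process without copying the PID by hand, you could extract it from ps -ef output as in this sketch. The function name pids_of is hypothetical; it assumes the standard ps -ef column layout (PID in the second column, command from the eighth column on) and skips lines containing "grep". The same approach works for diaglogd and memlogd.

```sh
# pids_of (hypothetical helper): read `ps -ef` output on stdin and print the
# PID (second column) of every process whose command is the given name.
pids_of() {
    awk -v name="$1" '
        /grep/ { next }      # ignore the grep used to search for the daemon
        {
            for (i = 8; i <= NF; i++)
                if ($i ~ ("(^|/)" name "$")) { print $2; break }
        }
    '
}

# Example (as root): stop diagmond with SIGINT rather than kill -9.
#   for pid in $(ps -ef | pids_of diagmond); do kill -2 "$pid"; done
```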

Why are colors missing, or why is there no writing in the xstm map icons?

X Windows displays can only show a limited number of colors at any one time. Some applications, particularly Netscape, use so many colors that there are not enough colors left over for xstm to display the system map properly. This may result in icons being drawn in black, with the writing on them also in black. This problem can be fixed by closing the applications which are using a large number of colors, and restarting xstm. Netscape can also be run with the "-install" option, which causes it to release its colors when the pointing device is not pointing within the Netscape window.

Why does STM core dump on start-up?

STM requires that the Motif and X Windows libraries be present on the system.  If they are not, error messages similar to the following are displayed:

/usr/lib/dld.sl: Can't find path for shared library: libXm.4
/usr/lib/dld.sl: No such file or directory
/usr/sbin/mstm[24]: 15845 Abort(coredump)

If you see a message like this (the particular version of the library, which is ".4" in this case, may vary), load the Motif libraries from the HP-UX OS media.

Why does STM report that the networking is not configured correctly?

STM uses sockets to communicate between processes.   When networking is incorrectly configured, STM may hang indefinitely.  To help users identify this problem, STM incorporates a check to verify that the networking was correctly configured. Unfortunately, this check falsely reports bad configurations in certain cases, including when an ATM network card is the primary LAN connection.

This problem was corrected in the IPR 9802 release. On prior releases, the following workaround can be used (alternatively, you can update your copy of STM to a more recent version).

  1. Edit file /sbin/init.d/diagnostic (You will need to be logged on as "root" to do so.)
  2. Just above the line "PATH=/sbin:/usr/sbin:/usr/bin" (just below the comments at the top of the file,) add the line:

    export DONT_CHECK_LAN=1

  3. Log on as "root".
  4. Start the diagnostic daemon with the command:

    /sbin/init.d/diagnostic start

  5. Before starting xstm, mstm or cstm, type:

    export DONT_CHECK_LAN=1

  6. After you make the change described above, the diagnostic daemon should start correctly when the system is rebooted.

In the future, you must enter "export DONT_CHECK_LAN=1" once in a session before starting xstm, mstm or cstm in that session.
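Step 2 can also be made with a script rather than an editor. The sketch below prints a copy of the init script with "export DONT_CHECK_LAN=1" inserted just above the first line that begins with "PATH=/sbin:/usr/sbin:/usr/bin", and leaves the text unchanged if the export line already precedes it. The function name add_dont_check_lan is hypothetical; it writes to standard output, so nothing is modified until you redirect the result.

```sh
# add_dont_check_lan (hypothetical helper): print the init script with
# "export DONT_CHECK_LAN=1" inserted above the first PATH=/sbin:... line.
add_dont_check_lan() {
    awk '
        /^export DONT_CHECK_LAN=1$/ { seen = 1 }     # already added earlier
        /^PATH=\/sbin:\/usr\/sbin:\/usr\/bin/ && !seen && !done {
            print "export DONT_CHECK_LAN=1"
            done = 1
        }
        { print }
    ' "${1:-/sbin/init.d/diagnostic}"
}

# Example (as root): keep a backup, then write the edited copy back.
#   cp /sbin/init.d/diagnostic /sbin/init.d/diagnostic.orig
#   add_dont_check_lan /sbin/init.d/diagnostic.orig > /sbin/init.d/diagnostic
```

Redirecting onto the existing /sbin/init.d/diagnostic rewrites its contents in place, so the file keeps its original permissions.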

Why does STM report that it is unable to connect to the host due to an unexpected error?

(This problem is fixed in IPR 9806.)

STM can experience problems as a result of extra files being left on the system by the software update utility swinstall and the frecover utility.

The directory /usr/sbin/stm/uut/bin/sys should contain only the three files: diagmond, diaglogd, and memlogd.  If there are other files in this directory, they should be removed (or moved out of this directory).  Files to be removed typically have names like EMYa01203 (the exact name will vary). Other files to be removed have names like diagmond.1284 (the string "diagmond," "diaglogd," or "memlogd", with a number appended).

  1. First stop the diagnostics daemon using the command:

    /sbin/init.d/diagnostic stop

  2. Remove the files which shouldn't be there. 

  3. If removing the extra files fails due to "text file busy," then you may need to:

    1. ps -ef | grep diaglogd    # to find the Process ID (PID) for diaglogd

    2. kill -2 <the PID reported above>

    3. Repeat this for memlogd. 

    4. If you still can't remove the extra files, repeat the above process for the other files in this directory.

  4. Restart the diagnostic daemon using the command:

    /sbin/init.d/diagnostic start
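To find the extra files for step 2, you could list everything in the directory that is not one of the three expected daemons. The function name list_extra_files is hypothetical; it only prints the entries (names beginning with a dot are not listed) and deliberately deletes nothing, so you can review the list before removing or moving the files.

```sh
# list_extra_files (hypothetical helper): print every entry in the STM
# daemon directory other than the three expected daemon binaries.
list_extra_files() {
    dir=${1:-/usr/sbin/stm/uut/bin/sys}
    for path in "$dir"/*; do
        [ -e "$path" ] || continue              # directory empty: glob unmatched
        case ${path##*/} in
            diagmond|diaglogd|memlogd) ;;       # expected files: keep
            *) echo "$path" ;;                  # e.g. EMYa01203, diagmond.1284
        esac
    done
}

# Example: list_extra_files
```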

Why does the diagnostic daemon "diagmond" fail to start after I install diagnostics on an NFS diskless cluster?

If you use sam(1M) or swcluster(1M) to install the IPR 9902 or IPR 9904 version of diagnostics on an NFS Diskless Cluster on HP-UX 10.20, some of the required files are not installed in the /var directory of the clients. (If this is the case, a message stating that /var/stm/data/id_mod_xref could not be accessed will be found in the local map log, visible using the command File -> Administration -> Local UUT Logs -> Map Log.)

To fix this problem, log onto the cluster server, and issue the following command for each client. (Replace the string <client_name> with the name of the client in the directory /export/private_roots/ .)

  cp -p -r /export/shared_roots/OS_700/var/stm/* \
/export/private_roots/<client_name>/var/stm/

  cp -p -r /export/shared_roots/OS_700/etc/opt/resmon/*  \
/export/private_roots/<client_name>/etc/opt/resmon/

Then, on each client, type the following to restart the diagnostic daemon:

  /sbin/init.d/diagnostic start

An alternate way to restart the diagnostic daemon is to reboot the client.
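If the cluster has many clients, the two copy commands above can be repeated for each client with a loop. This sketch assumes that every subdirectory of /export/private_roots is a client root and that the shared root is /export/shared_roots/OS_700, as in the commands above; both paths can be overridden by arguments. The function name copy_client_stm_files is hypothetical, and the mkdir -p step is an addition not in the original commands.

```sh
# copy_client_stm_files (hypothetical helper): repeat the two cp commands
# for every client root under the private-roots directory.
copy_client_stm_files() {
    shared=${1:-/export/shared_roots/OS_700}
    private=${2:-/export/private_roots}
    for client_root in "$private"/*; do
        [ -d "$client_root" ] || continue       # skip anything that is not a directory
        for sub in var/stm etc/opt/resmon; do
            mkdir -p "$client_root/$sub"        # assumption: create target if missing
            cp -p -r "$shared/$sub"/* "$client_root/$sub/" &&
                echo "copied /$sub to $client_root"
        done
    done
}

# Example (on the cluster server, as root): copy_client_stm_files
```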

Why do I get incorrect error messages that diag1 and diag2 patches are not installed when I try to install diagnostics on the clients of an NFS diskless cluster?

When using sam (1M) or swcluster (1M) to install OnlineDiag (IPR 9902 or IPR 9904) on the diskless systems, one of the install scripts incorrectly checks for the presence of the correct patches on the server, rather than the clients. To satisfy this check, the patches (or their S800 equivalent, if the server is a series 800) should be installed on the server.

In the IPR 9902 (Feb 99) release, why does the diagnostic daemon "diagmond" sometimes fail to start?

On IPR 9902 (STM version A.14.00,) there is a problem which can occasionally corrupt a temporary data file. To fix this problem, do the following:

  rm /var/stm/data/uut_status        # Remove the corrupt file

  /sbin/init.d/diagnostic start      # Restart the diagnostics daemon

The diagnostic system should now work normally. Note that it can take several minutes for the daemon to map the system, and the user interface will wait until this is complete. To see if the daemon is running, type:

 ps -ef | grep diagmond

After I shut down CSTM with the stmshutdown (ssd) command and then re-start it, CSTM appears to be hung

The reason CSTM appears to be hung is that the diagnostics daemons need to be re-started. To re-start daemons:

  1. Issue a Control-Y.
  2. Enter a stmstartup (ssu) command.

The diagmond daemon will create a system map, which can take several minutes depending on the I/O configuration.


Device map and available tools issues

Why is a device in the STM map labelled "unknown" or its icon blank?

The causes for an "unknown" device can be one of the following:

Why doesn't the Device-->Select Class command work?

This is usually because the "device type" and "qualifier type" that were selected don't both match a device in the system. For example, selecting a type of "Disk" and a qualifier of "SCSI" will not select any devices because SCSI disks use qualifiers of "Hard, Floppy, etc." Use the Device-->Current Device Status command to determine the valid type and qualifier that apply to a specific device.

I just installed a new version of STM. Why is some hardware marked as "Unknown", or why are new tools greyed out?

When diagnostics are installed, the newly supported drivers for new hardware and new tools for hardware are all part of configuration files which are installed with the diagnostics system.

However, if the configuration files were modified after the previous version of the diagnostic system was installed, these configuration files will not be overwritten with new versions but placed in the /usr/newconfig directory.

Check the /var/adm/sw/swagent.log file and see if it indicates that the prod_op_xref and id_mod_xref files were not installed but copies were placed in the /usr/newconfig/var/stm/data/prod_op_xref and /usr/newconfig/var/stm/data/id_mod_xref.

If the files were not installed but copies saved in /usr/newconfig:

  1. Copy the new versions of prod_op_xref and id_mod_xref to the /var/stm/data directory.
  2. Copy the new version of diaglogd.cfg to the /var/stm/config/sys directory.
  3. Bring up STM (xstm, mstm, cstm).
  4. Perform a re-map of the hardware:
  5. The map should no longer display the "Unknown" and the new tools should be available.

The xstm map is not usable on very large configurations because device icons overlap

Known problem (JAGad36990). Running xstm on very large configurations yields a display where most of the devices overlap each other so that you cannot select a single device to perform any tasks. This has been seen on systems with 400 or more disks.

The problem can occur on Dec 00 and previous diagnostic releases (HP-UX 10.20, 11.00, and 11i). A fix for this problem is planned for an upcoming release of the diagnostics.

The workaround for this problem is to display the map in tabular format. Select the Options-->Map command and click on the "Display xstm Device Map in Text Format" button. All operations can be carried out as with the graphical map.

As an alternative, use MSTM or CSTM. These interfaces display the entire map correctly.


General tool issues

Why does an Exerciser enter a "hung" state?

STM monitors the progress of all running tools and expects each tool to send a "heartbeat" every minute or so. If these heartbeat indications are not received within a two minute window, the tool state is changed to "hung". The causes for a hung tool can be one of the following:

When many tools are started simultaneously, why does STM take a long time to get back to accepting user commands?

This is due primarily to delays attributable to messaging traffic between tools and STM, which is particularly heavy when tools are first run. One or more tools may even enter a "hung" state during this initialization time, but this should clear up once all of the tools have gotten through this phase.

This problem is fixed with version A.06.00 (and above) of STM.

How do I select multiple devices in an expert tool?

This was a feature of some SYSDIAG tools, but with the exception of some disk arrays, this feature is not available in STM. Instead, you can run multiple expert tools simultaneously. To do this in MSTM:

  1. Start the first expert tool.
  2. Hit CTRL C.
  3. Hit the ESCAPE TO UI key.
  4. Start another expert tool on another device.

To re-connect to the original tool:

  1. Hit CTRL C.
  2. Hit the ESCAPE TO UI key.
  3. Select tools -> tool mgmt -> attach.
  4. Select the tool to attach to and hit OK.

When I connect to a remote UUT, why won't the interactive tools run or why does the interactive tool help fail?

There are three problems which may occur when connecting from particular revisions of the UI to particular revisions of the UUT. These problems may cause the tool (or its help) to fail outright with a message indicating that a message catalog with a particular revision (or a help volume) could not be found. If you encounter one of these problems, the work-around is to log in directly to the system on which you want to run STM, rather than using the Select System To Test dialog to select a remote system.


Specific Tool Issues

Why do some of the SCSI disk and tape tools log errors in their activity logs indicating that commands such as LOG SENSE and INQUIRY failed?

This is usually due to the fact that the drive being tested is an older drive, which did not implement these commands in the form they are being used. If this is the case, the cause/action text in the log message will suggest this as a possible cause of the problem.

Why do problems occur when running exercisers on more than two graphics devices?

Running exercisers on more than two graphics devices simultaneously may result in one or more of the exercisers terminating due to problems getting resources such as semaphores or shared memory. It is also possible that the graphics exercisers will execute correctly but other exercisers such as the CPU exerciser will fail if started while more than two graphics exercisers are running. This problem can be corrected by modifying the following kernel parameters and rebuilding the kernel:

- The semmni parameter should be increased by 64
- The semmns parameter should be increased by 128

In addition, the /etc/X11/X<server>screens file (where <server> is the number of the X server -- e.g. 0, 1, 2, or 3) should be modified to add the following:

ServerOptions
GraphicsSharedMemorySize 0xc00000

Add these lines just above the "Screen /dev/crt" line. If the file already contains a "ServerOptions" line, do not add a second one; instead, add only the "GraphicsSharedMemorySize" line to the existing entries under "ServerOptions".

After modifying these files, the X server must be restarted.
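The editing rule above can be scripted. The following is a minimal sketch, not an HP-supplied tool: the function name add_gsm_option is hypothetical, and it assumes that the ServerOptions, GraphicsSharedMemorySize, and Screen entries each begin their own line, as in the example above. It prints the edited file to standard output instead of changing it in place, and it leaves the contents unchanged if GraphicsSharedMemorySize is already set.

```shell
# add_gsm_option (hypothetical): print an X*screens file with the
# "GraphicsSharedMemorySize 0xc00000" option added according to the
# rules above. The original file is not modified.
add_gsm_option() {
    awk '
        { line[NR] = $0 }                               # buffer every line
        /^[ \t]*ServerOptions/            { has_opts = 1 }
        /^[ \t]*GraphicsSharedMemorySize/ { has_gsm = 1 }
        END {
            done = has_gsm          # option already present: change nothing
            for (i = 1; i <= NR; i++) {
                # No ServerOptions section: add one just above "Screen /dev/crt..."
                if (!done && !has_opts && line[i] ~ /^[ \t]*Screen[ \t]+\/dev\/crt/) {
                    print "ServerOptions"
                    print "GraphicsSharedMemorySize 0xc00000"
                    done = 1
                }
                print line[i]
                # Existing ServerOptions section: add the option under its header
                if (!done && has_opts && line[i] ~ /^[ \t]*ServerOptions/) {
                    print "GraphicsSharedMemorySize 0xc00000"
                    done = 1
                }
            }
        }
    ' "$1"
}
```

For example, run add_gsm_option /etc/X11/X0screens > /tmp/X0screens.new, compare the result with the original, copy it into place, and then restart the X server. If the file has neither a "Screen /dev/crt" line nor a ServerOptions section, the sketch makes no change.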

 

Why does the graphics exerciser exit with an INCOMPLETE status?

The graphics exerciser will exit with an INCOMPLETE status when the X server is configured with multiple graphics heads in a single logical screen.

The following message will be logged into the tool activity log:

"The graphics exerciser does not support testing of hardware devices that are part of a single logical screen.

Possible Causes/Recommended Action:

Modify the 'X%screens' file so that the device under test is not part of a single logical screen, restart the X server, and rerun the graphics exerciser."

To work around this problem, modify the /etc/X11/X<server>screens file (for example, X0screens) so that the device under test is not part of a single logical screen, then restart the X server and rerun the graphics exerciser.

 

How can I view an old log file with the new Logtool?

You can use the new Logtool to view old (pre-DART 34 and/or pre-10.30) log files once they have been converted. To manually convert system log files from the old sysdiag format to the new STM raw log format:

  1. Copy the log files (typically LOG0001 and so on, in /var/adm/diag) to the /var/adm/diag directory on a system running 10.30, DART 34, or later.

  2. Run the log converter on that system:

      /usr/sbin/conlog /var/adm/diag/ /var/stm/logs/os/

  3. Run STM and invoke Logtool.

 

I have just installed the new version of the diagnostics. The "Nike" monitor, which is supposed to generate events for errors on the High Availability SCSI Disk Arrays (codename "Nike"), does not report errors on those devices.

The monitor for the High Availability SCSI Disk Arrays is configured through configuration files that are installed with the diagnostics system.

However, if those configuration files were modified after the previous version of the diagnostic system was installed, they are not overwritten; instead, the new versions are placed under the /usr/newconfig directory.

Check the /var/adm/sw/swagent.log file to see whether it reports that diaglogd.cfg was not installed and that a copy was placed at /usr/newconfig/var/stm/config/sys/diaglogd.cfg.

If the file was not installed but a copy was saved in /usr/newconfig:

  1. Copy the new version of diaglogd.cfg to the /var/stm/config/sys directory.
  2. Bring up STM (xstm, mstm, cstm).
  3. Shut down the diaglogd daemon.
  4. Start the diaglogd daemon.

The "Nike" monitor should now launch when a High Availability SCSI Disk Array encounters an error and generates the event.

Why do I get error messages when I run support tools on a DDS Media Changer (Autoloader)?

Some support tools fail when run on the DDS Media Changer (Autoloader).

The problem occurs only on Models C1553A and C1557A. As of February 1999, there is no workaround for this problem, but a fix is planned for an upcoming release of STM.

The problem occurs when the device has a non-zero Logical Unit Number (LUN) in its hardware path. For example:

13  8/16/5.3.0           SCSI Tape (HPC1557A)
14  8/16/5.3.1           SCSI Media Changer (HPC1553A)

The Information tool fails with a WARNING, with the following text in the Tool Activity Log:

Mon Jan 25 09:42:18 1999: The following sense data was returned by the device:
                          Sense Key: 0x05 (ILLEGAL REQUEST)
                          Additional Sense Code/Qualifier:  0x25/00

                          Error Description:
                          The logical unit is not supported.

Mon Jan 25 09:42:18 1999: The standard SCSI LOG SENSE command was being
                          performed when the failure occurred.

Mon Jan 25 09:42:18 1999: Tool completed with exit_status
                          MOD_WARNING (1) indicating tool completed
                          with a warning.

The Firmware Update tool won't be able to find a valid firmware update file, even if one exists.

If the user runs the Expert tool and selects any of the commands, the following error message will appear in the Expert Tool window:

The sense key indicates an Illegal Request.
===============
(0x2500) An invalid LUN was specified.
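To check which devices fall into the affected category, you can scan a device listing like the one shown earlier in this answer. This is a hedged sketch: the function name find_nonzero_lun is hypothetical, and it assumes each line holds the index, hardware path, and description in that order, with the LUN as the last "."-separated field of the hardware path (as in 8/16/5.3.1). It only identifies affected devices; it is not a workaround.

```shell
# find_nonzero_lun (hypothetical): read device-listing lines of the form
#   <index>  <hardware path>  <description>
# on standard input and print those for the affected models (C1553A,
# C1557A) whose hardware path ends in a non-zero LUN.
find_nonzero_lun() {
    awk '
        /C155[37]A/ {
            n = split($2, part, "[.]")   # e.g. 8/16/5.3.1 -> "8/16/5", "3", "1"
            if (part[n] + 0 != 0)        # last field is the LUN
                print
        }
    '
}
```

Given the two example lines above, only the SCSI Media Changer at 8/16/5.3.1 is printed.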

Why don't I get any information when I run the memory information tool (IPR 9904 for HP-UX 10.20)?

On IPR 9904 for HP-UX 10.20, there was a problem with the STM memory information tool: the tool completes successfully, but its output contains no data. The problem exists only in IPR 9904 for HP-UX 10.20 and has been corrected in later releases.

On systems with IPR 9902 or IPR 9904 and large configurations, why is ioscan using excessive CPU time?

In release IPR 9902 and IPR 9904, in systems with large configurations you may find that ioscan is using a large amount of CPU time. The CPU time used is considerably worse on HP-UX 11.00 systems, although 10.20 systems may also experience an undesirable amount of CPU use. This problem has been corrected in IPR 9906.

For IPR 9902 and IPR 9904, patches are available that solve the problem:

Why does the IPR 9904 version of Logtool in STM not decode the 'disc30' driver data on my system?

The decoder (disk_em) in the IPR 9904 release decodes only 'sdisk' driver data. Because of the same defect, the Event Monitoring System is also unable to generate events for the 'disc30' driver. In both cases, the decoder may abort with SIGBUS.

The fix is provided in IPR 9906 release.

Within the dialog box for the 'Format Raw...' command, you may see text similar to the following:

The Format Raw operation is currently in progress.

Entries processed is 1 of 13 total entries; entries formatted is 1.
Either an appropriate log decoding program name could not be located
for one or more entries in the raw log file, or the decode routine
itself failed to execute.  The data portion of the entry(s) will be
formatted for display in hex.
See the completed formatted log file or the Test Activity Log for
specific information.

Entries processed is 13 of 13 total entries; entries formatted is 13.
The Format Raw operation completed successfully. The following raw log
file(s) were formatted into /var/stm/logs/os/test_it.fmt1:

         /tmp/test_it.raw

In the activity log for Logtool, you will see an entry similar to the following:

Thu Apr 29 11:34:20 1999: Decode routine (disk_em) starting on path
       (32.4.0).

Thu Apr 29 11:34:20 1999: The internal tables for managing the data from
                    configuration-file(s) are not yet initilized..

                    Possible Causes/Recommended Action:
                       Make sure that the monitor calls ev_monitor_init()
                       before calling one of the routines to access the
                       configuration values.

Thu Apr 29 11:34:20 1999: Tool is exiting due to receipt of an unexpected
                    signal (10).

                    SIGBUS (10) signal indicates a bus error.

                    Possible Causes/Recommended Action:
                       Internal Application error.  Tool attempted to
                       reference an invalid address.  Usually a NULL or
                       bad pointer.

Why isn't Logtool generating any logs?

The user reported that "when I select FILE ... Select raw ... there are no files in /var/stm/logs/os."

There are two possible reasons why Logtool has not generated any logs.

First and most likely scenario: If no errors have been detected, no log files will have been created.

This behavior differs from that of logtool in "Sherlock", the diagnostic system that preceded STM: the old logtool always created a log file, even when there was nothing to log.

If you would like to verify that Logtool is working correctly, you can purposely cause an error to generate a log. A good way to cause an event to be generated is to put a bad tape in a tape drive, then try to read from it.

Second scenario (unlikely): If "diagmond", the diagnostic daemon, or "diaglogd", the logging daemon, is not running, no log files will be created. To see whether the daemons are running, enter:

ps -ef | grep diagmond | grep -v grep    # "grep -v grep" hides the grep process itself
ps -ef | grep diaglogd | grep -v grep

If the diagmond daemon is not running, you can start it again with the command:

/sbin/init.d/diagnostic start
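Because a plain ps -ef | grep name also matches the grep command itself, it can make a stopped daemon look as if it were running. A minimal sketch that captures the process list first and then reports only the daemons that are absent (missing_daemons is a hypothetical helper name):

```shell
# missing_daemons (hypothetical): read "ps -ef" output on standard input
# and print the name of each diagnostic daemon that does not appear in it.
missing_daemons() {
    ps_out=$(cat)        # read the whole listing before searching it
    for d in diagmond diaglogd; do
        printf '%s\n' "$ps_out" | grep "$d" >/dev/null || echo "$d"
    done
}
```

Run it as ps -ef | missing_daemons. Since ps has finished writing before the search starts, the grep processes cannot appear in the listing. If diagmond is reported, start it with /sbin/init.d/diagnostic start as described above.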

When I try to run a support tool on a SCSI device, why does the tool report "INCOMPLETE" (Predictive Support, if installed, reports SCSISCAN 500 errors)?

Support tools for SCSI devices may terminate with a status of INCOMPLETE, if you loaded Diagnostics from the June 99 (IPR 9906) or September 99 (IPR 9909) releases onto an older system, such as T-Class or "Nova" (F,G,H,I Class; xx7 Family). The problem does not occur on newer systems such as D-Class, K-Class, N-Class or V-Class. The problem has been reported on both HP-UX 10.20 and 11.00.

If the system has Predictive Support, you may see "SCSISCAN 500" errors reported.

To fix the problem, issue the following commands (as root):

  cd /dev
  insf -e           # re-creates the diagnostic device files

The problem may occur when an Information, Expert, or Firmware Download tool is run on SCSI devices on systems with the SIO bus. An entry in the tool's Activity Log will report the error "/dev/diag/diag0 not found."

The problem is caused by an error in a SCSI library which removes /dev/diag/diag0 if a call to get access to the SIO passthru driver fails. The error will be fixed in a future release.
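You can check for this condition before running the tools by testing whether the device file still exists. A minimal sketch; check_diag0 is a hypothetical helper that only reports and changes nothing, and its default path is the /dev/diag/diag0 file named in the activity-log error:

```shell
# check_diag0 (hypothetical): report whether the diagnostic device file
# that the SCSI tools need is present.
check_diag0() {
    dev=${1:-/dev/diag/diag0}
    if [ -e "$dev" ]; then
        echo "present: $dev"
    else
        echo "missing: $dev (as root: cd /dev && insf -e)"
    fi
}
```

Run check_diag0 with no argument. If it reports the file as missing, re-create the device files as root with the cd /dev and insf -e commands shown above.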

When I try to run logtool, I get an error message about a missing log decoding program.

This problem could occur after updating the diagnostics, if the format of the previous version of the os_decode_xref file is different from the new version of the file.

This problem occurs only on HP-UX 10.20 on the following releases: June 99, Dec 99, Mar 00.

To verify this problem, look in the test activity log for logtool (for example, Tool | Utility | Activity Log | logtool). If this is the problem, you will see an entry like:

Wed Feb  2 18:01:37 2000:
A corresponding log decoding program name could not
be located for sdisk.

Wed Feb  2 18:01:37 2000:
Syntax error in the os_decode_xref file. Failed while
parsing (dc_flex) in the entry (disc2 dc_flex).
etc.

Look in the file /var/adm/sw/swagent.log. If this problem exists, you will see an entry like:

NOTE:
A new version of "/var/stm/config/tools/utility/os_decode_xref"
has been placed on the system.
The new version is located at
"/usr/newconfig/var/stm/config/tools/utility/os_decode_xref".
The contents of the newly installed file differ from the
contents of "/var/stm/config/tools/utility/os_decode_xref", and the previously
delivered file is not available for comparison.  Therefore
"/var/stm/config/tools/utility/os_decode_xref" is not being overwritten

The System Administrator should resolve this situation manually.

To fix the problem, follow the directions in the NOTE by executing the command:

cp /usr/newconfig/var/stm/config/tools/utility/os_decode_xref    \
/var/stm/config/tools/utility/os_decode_xref

Note any similar lines in swagent.log about other files and perform a similar copy for them.
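Finding every affected file in a long swagent.log by eye is tedious. The sketch below (newconfig_copies is a hypothetical name) assumes, as in the NOTE above, that each new copy is named as a double-quoted path beginning with /usr/newconfig. It prints the matching cp command for each one instead of running it, forming the destination by stripping the /usr/newconfig prefix.

```shell
# newconfig_copies (hypothetical): list the cp commands needed to install
# every file that the installation left under /usr/newconfig, as reported
# in swagent.log. The commands are printed for review, not executed.
newconfig_copies() {
    log=${1:-/var/adm/sw/swagent.log}
    # Capture the quoted /usr/newconfig path (\1) and the same path without
    # the /usr/newconfig prefix (\2), which is the file's normal location.
    sed -n 's|.*"\(/usr/newconfig\(/[^"]*\)\)".*|cp \1 \2|p' "$log" | sort -u
}
```

Run newconfig_copies, review the list, and save copies of any files you have customized before running the commands as root, because each copy replaces the existing file.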

This problem is almost the same as another FAQ: When I update a previous version of the Support Tools, the swagent.log file reports that some files are not being correctly installed.

When I run memory tools on a SuperDome system, the location of a cell is given in the format "CELL 1/5". What does this mean?

"CELL 1/5" means cabinet 1, cell 5. On Superdome systems with multiple cabinets, each cabinet can contain cells 0 through 7, so both the cabinet number and the cell ID must be given to identify a cell.
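The location string can be split mechanically at the slash. A tiny sketch, assuming the exact "CELL <cabinet>/<cell>" form shown above; parse_cell is a hypothetical name:

```shell
# parse_cell (hypothetical): split a location such as "CELL 1/5" into its
# cabinet number and cell number.
parse_cell() {
    loc=${1#"CELL "}     # "CELL 1/5" -> "1/5"
    cabinet=${loc%%/*}   # text before the slash -> "1"
    cell=${loc#*/}       # text after the slash  -> "5"
    echo "cabinet $cabinet, cell $cell"
}
```

For example, parse_cell "CELL 1/5" prints cabinet 1, cell 5.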

When I tried to run exercisers on all the CPUs on a system (Superdome, N-Class, or V-Class system), the exercisers failed.

We've received reports of this problem on fully-configured Superdome, N-Class, and V-Class systems. The problem occurs when the user selects multiple CPUs and multiple memory controllers, and tries to run exercisers on all of these.

The workaround is simple: instead of selecting multiple CPUs and memory controllers, JUST SELECT ONE CPU AND ONE MEMORY CONTROLLER. When you select the exercise tool, all CPUs and all memory controllers will be exercised.

Normally, if you select multiple CPUs and memory controllers and then select the exercise tool, exercisers will start on all these modules. Each exercise process then tries to start exercise processes on all the other CPU and memory controller modules. Since an exerciser only needs to be run on one CPU or memory controller to exercise all the CPUs or memory controllers, the extra exerciser processes are killed. Only one exerciser process is left running on a CPU and one on a memory controller.

This expected behavior does not occur on some fully configured Superdome, N-Class, and V-Class systems. In these cases, all exerciser processes are killed and none is left running.

I can't run customer-licensed online tools on high-end systems using the June 2001 SupportPlus media (HP-UX 11i only).

On the June 2001 release of diagnostics for HP-UX 11i, there is a problem with the customer-licensed diagnostic tools for:

The following types of tools may require a customer license and hence may be affected by this problem:

This problem is fixed in the September 2001 release of diagnostics. You can install the OnlineDiag bundle and then run xstm, mstm, or cstm to install the class license, or you can run the offline diagnostics from the SupportPlus media to install the class license. Once the class license is installed either by online or by offline tools, it will stay effective for all online and offline programs.

The problem only occurs with customer-licenses; CE licenses are NOT affected. The problem only occurs with the hp server rp8400 and the newer versions of Superdome; the problem does not occur with any other server.

If you try to run a customer-licensed online diagnostic tool on one of these systems, you will see an error message similar to one of the following:

   -- Error --
   The Install license command could not be successfully completed.
   An unexpected failure occurred in the Support Tool system.

OR

   The password requested to be installed is a valid
   machine level license yet license could not be
   installed on the system due to unexpected errors.

For customer-licensed offline tools, the error will look like this:

   
Entered password is not recognized, status = xxx, try again.

where xxx is a number.

Problem: System Info tool does not complete successfully for hp server rp8400 (S16K-A) and Superdome systems.

There is a problem with the System Info tool for Superdome and hp server rp8400 (900/800/S16K-A, "Keystone") systems. On the June 2001 diagnostics release for HP-UX 11i, the System Info tool does not complete execution on these systems and does not provide complete information about the system. Messages such as the following appear in the activity log:

Failed to execute the PDC_PAT_COMPLEX PDC
call by performing an ioctl call through
the diag2 pseudo driver.

The problem has been fixed on the Sept 2001 diagnostics release for HP-UX 11i.

Problem: Info tool causes HPMCs on K- and T-Class (HP-UX 11i only).

K-Class and T-Class computers running HP-UX 11i will experience HPMCs or Data Page Faults if you run the STM Info Tool to retrieve configuration information from HP-PB SE SCSI adapters (JAGad88317, JAGad97126).

The root cause is a problem in the HP-UX 11i kernel, which is exposed when the Info Tool is run as described above. To correct the problem and avoid potential HPMCs, load patch PHKL_25552 or its successor. This is a kernel patch and requires a system reboot.

STM only exposes this problem if the Info Tool is run on HP-PB SE SCSI adapters. STM does not expose the problem if the Info Tool is run against other I/O devices.
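To check whether PHKL_25552 is installed, you can search the product list reported by swlist. This sketch assumes that patches appear as products in swlist -l product output with the patch name in the first column; has_patch is a hypothetical helper. It does not recognize a successor patch, which has a different name, so a negative result is not conclusive.

```shell
# has_patch (hypothetical): succeed if the patch named in $1 appears as
# the first column of "swlist -l product" output read from standard input.
has_patch() {
    awk -v p="$1" '$1 == p { found = 1 } END { exit !found }'
}
```

For example: swlist -l product | has_patch PHKL_25552 && echo "PHKL_25552 installed"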

Problem: Page Deallocation Table (PDT) error causes Logtool to display "pending," rather than "deallocated," on reboot.

On both SuperDome (see JAGae02288) and legacy (see JAGad56511) systems, a problem has been encountered when trying to use Logtool to view the memory log. When an error occurs in the Page Deallocation Table (PDT), the status appears in the Logtool view of the memory log as "pending." When the "pending" status appears, it generally means that the page will be deallocated when the system is rebooted. However, if on reboot the "pending" status still appears for that particular page, then that means that the page is in the kernel, and cannot be deallocated (our code does not allow the pages in the kernel to be deallocated).



CSTM scripts

You can use the following sample CSTM scripts by copying them and pasting them at the HP-UX shell prompt (substitute your own e-mail address for user_name@xxx.xxx.xxx.com).

Run all Info tools and email results.

This script runs all information tools and sends the results to a specified email address:

    echo "selall;info;wait;infolog
    view
    done
    " | cstm | mail user_name@xxx.xxx.xxx.com

Email several STM logs.

This script sends the System Activity Log, Map Log, and User Activity Log to a specified email address:

    echo "sysact
    view
    done
    maplog
    view
    done
    uiact
    view
    done
    " | cstm | mail user_name@xxx.xxx.xxx.com



URL: http://docs.hp.com/hpux/onlinedocs/diag/stm/stm_faq.htm
Last updated: Wednesday July 06