Release Notes for STM on HP-UX 11i (11.11) (September 2002)
These release notes cover the September 2002 release of Support
Plus for HP-UX 11i (11.11) running on S800/S700 systems.
Overview
The Support Tools Manager (STM) provides a complete set of online
support tools for HP-UX systems, enabling you to verify and
troubleshoot PA-RISC system hardware, and to examine system logs.
STM offers several tool types, including information tools,
verifiers, exercisers, expert tools, firmware update tools,
diagnostics and utilities.
Installed with STM (as of IPR 9902) are the EMS Hardware Monitors,
an important tool for maintaining system availability. The EMS
hardware monitors allow you to monitor the operation of a wide
variety of hardware products and be alerted immediately if any
failure or other unusual event occurs. For more information, see
/usr/sbin/stm/Rel_NOTES.HWE.
Documentation
For the latest and most complete information on STM and EMS
Hardware Event Monitors, see the Web page "Diagnostics":
http://docs.hp.com/hpux/diag/
At this site, you will find Overviews, Tutorials, Quick Reference
Cards, Frequently Asked Questions (FAQs), and other material.
Changes
The online Support Tools Manager (STM) was enhanced and updated
for the current release.
Changes to User Interface and Platform
- JAGae35591
Fix to JAGae35591: memlogd (mt_memlogd) on Superdome sometimes has
problems sending PDT information to the EMS HW Memory Monitor. As a
result, the EMS HW Memory Monitor may not be able to perform PDT
trending analysis correctly. When memlogd cannot send the PDT
information to the EMS HW Memory Monitor, the following error
messages are logged in the memlogd activity log file:
Write system call failed with errno (32), when attempting to write data
to file descriptor (6).
EPIPE (32) errno returned from a write system call indicates an
attempt was made to write to a socket that is not open for reading
by any process.
Possible Causes/Recommended Action:
The process reading the socket removed the socket or exited
unexpectedly. Check the support tool system activity log and tool
activity logs for more information on the process that removed
the socket or exited.
Unable to open the file send_msg to read the memory monitor's socket
and PID numbers.
Possible Causes/Recommended Action:
Internal Application error.
Attempting to perform the PDT analysis when an unexpected error
occurred.
Possible Causes/Recommended Action:
Internal Application error.
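The EPIPE condition described in the first message above can be reproduced in isolation. The following is a minimal sketch, not memlogd code: it uses an ordinary pipe as a stand-in for the monitor's socket, and the function name write_with_no_reader is hypothetical.

```python
import errno
import os


def write_with_no_reader() -> int:
    """Write to a pipe whose read end is closed; return the errno raised."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # nothing has the pipe open for reading any more
    try:
        os.write(write_fd, b"PDT data")
    except OSError as exc:
        # CPython ignores SIGPIPE at startup, so the failed write surfaces
        # as an exception (BrokenPipeError) rather than ending the process.
        return exc.errno
    finally:
        os.close(write_fd)
    return 0


if __name__ == "__main__":
    code = write_with_no_reader()
    print("EPIPE raised:", code == errno.EPIPE)
```

A writer that sees this errno has lost its reader, which matches the recommended action of checking the activity logs for the process that exited or removed the socket.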
- JAGae29247
Fixes the following problem:
If the system supports On-Line Addition And Replacement
(OLA/R), the version of STM is A.31, and the OS is 11.11, then
the disk_em monitor will cause a file handle leak. Eventually,
all system file handles will be used by this process.
- JAGae30052
Fixes the following problem:
If the system supports On-Line Addition And Replacement
(OLA/R), the version of STM is A.31, and the OS is 11.11, then
the disk_em monitor may intermittently cause a segmentation
violation when starting. To determine whether your system is
OLA/R compliant, check whether the /dev/olar driver exists.
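The /dev/olar check mentioned above can be scripted. This is a minimal sketch that only tests whether the device file is present; the function name is_olar_capable is hypothetical, and the check says nothing about the STM version or OS release.

```python
import os

OLAR_DEVICE = "/dev/olar"  # device path named in the release note


def is_olar_capable(device_path: str = OLAR_DEVICE) -> bool:
    """Return True if the OLA/R driver device file is present."""
    return os.path.exists(device_path)


if __name__ == "__main__":
    print("OLA/R driver present:", is_olar_capable())
```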
- JAGae25418
Fixed the following condition:
Cclogd generated a chassis code chain with the hostname
information in ASCII data. The management processor (MP) on
several PDC_PAT_CHASSIS capable machines has a bug where it can't
decode chains properly. When the MP doesn't see the chassis code
encoded field EOM bit set, it thinks it's an orphan chassis code,
generates another chassis code to that effect, and reflects them
back to cclogd. Cclogd then writes the chassis codes to a file,
and when the EMS monitor, dm_chassis, sees them, it creates
erroneous events.
- JAGae19285
Fixes a problem reported on memlogd that affects the memory tools.
The problem occurs when the user replaces the memlog file, or when
the OS image is copied through Ignite. The fix removes loopholes in
the memory comparison, so the user no longer sees incorrect memory
errors.
- JAGae13168
Ensures that the path to the memlog file exists before the memlog
file is created or accessed. If the complete path does not exist,
the missing directories in the path are created with appropriate
permissions.
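The create-the-path-first behavior described above can be illustrated as follows. This is a sketch of the general technique, not the memlogd implementation: the helper name open_log_with_parents and the 0o755 directory mode are assumptions, and the release note does not state the actual memlog path or permissions.

```python
import os
import tempfile


def open_log_with_parents(log_path: str, dir_mode: int = 0o755):
    """Create any missing parent directories, then open log_path for append."""
    parent = os.path.dirname(log_path)
    if parent:
        # exist_ok=True makes this a no-op when the path is already complete.
        # dir_mode is an assumed value; it applies to the last directory
        # created and is further restricted by the process umask.
        os.makedirs(parent, mode=dir_mode, exist_ok=True)
    return open(log_path, "a")


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    demo_path = os.path.join(base, "logs", "os", "demo_memlog")  # hypothetical
    with open_log_with_parents(demo_path) as log_file:
        log_file.write("first entry\n")
    print(os.path.isfile(demo_path))
```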
- For the HWE0209 release of the OnlineDiag product, users will see
the following changes in the behavior of memlogd, and only on
Superdome systems with virtual partitions installed (this behavior
differs from the existing HWE releases):
- A memlogd must be running on each virtual partition on which
the OnlineDiag product is installed. Example: if a Superdome has
two virtual partitions set up and the OnlineDiag product is
installed on both, there is one memlogd per virtual partition,
for a total of two memlogds on the system.
- Each memlogd will monitor:
- memory allocated to the virtual partition it is running in
- memory allocated to the virtual partition monitor
- memory not allocated to any virtual partition
- memory allocated to another virtual partition whose memlogd
may not currently be running
Hence, users can potentially see the same memory errors
logged by the memlogds in different virtual partitions.
Example: a memory error, 0xadb2074, detected by memlogd A on
virtual partition A belongs to the memory allocated to the
virtual partition monitor, so memlogd A logs that error in
its memlog file; later on, the same memory error, 0xadb2074,
belonging to the memory allocated to the virtual partition
monitor occurs again and is detected by memlogd B on virtual
partition B, so memlogd B logs that error in its memlog file;
hence, a user who views the memlog file in partition A sees the
memory error 0xadb2074, and a user who views the memlog file in
partition B sees the same memory error 0xadb2074. As a result,
the EMS HW Memory Monitor on each virtual partition can alert
users about the same faulty memory component, because the same
memory errors are logged in each partition (depending on how
the memory monitor's configuration file,
default_dm_memory.clcfg, is set up on each virtual partition).
- Memlogd will handle a memory error belonging to the memory
allocated to another virtual partition if:
- that virtual partition is believed not to be running
or
- the memlogd in that virtual partition is believed not to be
running; that is, the memlogd in the other virtual partition has
not detected the error within one hour of the time the memlogd
of the current virtual partition detected it. The memlogd of the
current virtual partition then assumes that the memlogd of the
other virtual partition is not running and handles the error.
- On a virtual partitions system, when the user views the detailed
information of the memlog file via logtool, two new page statuses
appear in addition to the existing page statuses:
- Pending: page could not be entered into pdt
Note: This page status indicates that the page could not be
entered into the page deallocation table (PDT), either because
the PDT is full or because memlogd encountered an error when it
tried to enter the page into the PDT.
- Pending: page may still be active
Note: This page status applies only to pages that do not belong
to the virtual partition in which the memlogd is running. It
indicates that the page has been entered into the PDT, but
memlogd cannot determine whether the OS has deallocated the
page.
- On a virtual partitions system, when the user views the detailed
information of the memlog file via logtool, the page status of the
same memory page can differ between the views from different
virtual partitions. Example: a memory error, 0xadb2074, detected by
memlogd A on virtual partition A belongs to the memory
allocated to the virtual partition B; however, since memlogd B
on virtual partition B was not running, memlogd A handles the
memory error by logging it in its memlog file. The memory
error, 0xadb2074, occurs again within 24 hours and memlogd A on
virtual partition A handles the memory error again by entering
it into the PDT and updating the memory error in its memlog
file. Since the memory error has already been entered into the
PDT, the page status of that memory error in the memlog file
becomes "Pending: page can not be obtained" in memlogd A on
virtual partition A. When memlogd B on virtual partition B
finally starts up (e.g., user invokes memlogd B on virtual
partition B), memlogd B would read the entries in the PDT and
request the OS on virtual partition B to deallocate any entries
belonging to the memory allocated to the virtual partition B.
Since the memory error, 0xadb2074 (already in the PDT), belongs
to the memory allocated to the virtual partition B, memlogd
would request the OS on virtual partition B to deallocate the
page containing that memory error. If the OS on virtual
partition B is able to deallocate that page, the page status of
that memory error in the memlog file becomes "Deallocated: page
is no longer in use" in memlogd B on virtual partition B.
Hence, the page status of that memory error in the memlog file
could differ between the memlogds on different virtual
partitions.
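The rules above for deciding which memlogd handles a memory error can be summarized in a short sketch. This is an illustration under stated assumptions, not memlogd's implementation: the function name should_handle, its parameters, and the "monitor" label are hypothetical, and the release note does not say how memlogd learns whether another partition or its memlogd is running. The one-hour delay comes from the text; treating exactly one hour as elapsed is an assumption.

```python
from typing import Optional

TAKEOVER_DELAY_S = 60 * 60  # one hour, per the release note
MONITOR = "monitor"         # hypothetical label for the vPar monitor's memory


def should_handle(owner: Optional[str], local: str, owner_running: bool,
                  owner_logged: bool, first_seen_s: float,
                  now_s: float) -> bool:
    """Decide whether the memlogd on partition `local` handles an error.

    owner is a partition name, MONITOR, or None for unallocated memory.
    """
    # Memory of the local partition, the monitor, or no partition at all
    # is always handled locally.
    if owner is None or owner == MONITOR or owner == local:
        return True
    # The owning partition is believed not to be running.
    if not owner_running:
        return True
    # The owning partition is up: take over only if its memlogd has not
    # detected the error within the delay.
    waited = now_s - first_seen_s
    return (not owner_logged) and waited >= TAKEOVER_DELAY_S


if __name__ == "__main__":
    # Error in partition B's memory, seen by memlogd A 30 minutes ago.
    print(should_handle("B", "A", True, False, 0.0, 1800.0))
    # Same error after a full hour with no detection by memlogd B.
    print(should_handle("B", "A", True, False, 0.0, 3600.0))
```

Because every memlogd handles monitor-owned and unallocated memory, the same error can legitimately appear in more than one partition's memlog file, as the examples above describe.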
- Added new IDE CDs:
Product #   HP P/N        Product
A7853A      D4398-60083   CD-RW Drive
A5001A      D4389-60083   CD-ROM Drive
- JAGad44671
Displays "not set" for the serial number on Keystone/Matterhorn if
the serial number is not set.
- Modified the mapping of SANLINK arrays so that they are mapped as
arrays, not individual LUNs. This makes the map much cleaner by
collapsing all the LUNs into the target level, and fixes the problem
where the disk_em monitor was attempting to monitor the arrays,
believing them to be generic SCSI disks. The disk_em monitor will
no longer attempt to monitor these devices.
- JAGae20830
This fixes a problem of "garbage data" being returned by the
info/dlpi information tool, when PAgp port aggregation software is
installed on a system.
- JAGae21636
Fixed problem with STM hanging or aborting with a segmentation
violation, when a non-root user attempts to install an RCO license.
- Updated the EMS introductory message that is mailed to root on
installation.
- JAGad96190
In prior versions of STM, when a connection to a remote machine
failed due to an invalid user name or password, the status
displayed for that machine would be "No Response". This version of
STM no longer displays "No Response" for an invalid user name or
password, because displaying no status message is more appropriate
in that case.
General Changes to Tools
Changes to Specific Tools
- Information tools:
- JAGae13794
Enhancement Description: The HP-UX Common Criteria EAL4-CAPP
Certification requires user mode validation of Read-Only
virtual memory addresses. This version of the STM Memory
Information Tool verifies and reports the status of these
memory access protection mechanisms:
Customers/users will see the following changes in the STM
Memory Information Tool activity log file:
- JAGae26804
The patch fixes the mapping in the memory information tool, which
now correctly displays the memory extender labels on N-Class and
L-Class systems. The firmware team provided a new mapping for
translating the extender numbers to extender labels for the
L-Class systems. The code was changed to incorporate the new
mapping, which was successfully unit-tested on two L-Class
machines and one N-Class machine. This fix to JAGae26804 was
submitted to the HWE0206_vpar special release and the following
patches:
PATCHNAME: PHSS_27166 HWE0203 11.11
PATCHNAME: PHSS_27167 HWE0206 11.00
PATCHNAME: PHSS_27168 HWE0206 11.11
- Exerciser tools:
- JAGae33612
JAGs fixed in this release:
JAGae11271: "exerciser leaves temporary files around".
It should only leave the tomcatv files in /var/tmp if the
comparison failed, so that the admin can view the files, to
see why the failure occurred.
JAGae33612: "CPU Exerciser fails fpu test"
Exerciser was modified to properly handle CPU
configurations containing deconfigured or otherwise inactive
CPUs.
- Verifier tools:
- Diagnose tools:
- Firmware update tools:
- Expert tools:
- JAGae35882
Reconfigure failed to identify deconfigured processors for
reconfiguration. The fix for JAGae35882 allows the CPU Expert tool
to build a list of processors that are 'Marked for Deconfigure' as
well as processors that are 'Deconfigured'.
- JAGae33206
JAGs fixed in this release:
JAGae03440: How to know if the processor was deactivated
by lpmc_em?
JAGae11107: expert/cpu, very confusing error message on
T600.
JAGae17176: On kclass cpu/experttool exercise has internal
errors.
JAGae19756: increment the version of expert/cpu.
JAGae30475: cpu/infoinfo is missing a message catalog.
JAGae31662: catalog mismatch: expert/cpu.
JAGae33206: CPU Expert Tool fails when deconfigured
processor is present.
- A submittal was made to support the memory expert tool on
vPar, Superdome class systems, because the original code
changes to support the memory expert tool on vPar systems
(submitted to the first vPar release) did not work as expected
on the vPar, Superdome class systems.
- For the HWE0209 release, users will see the following changes in
the behavior of the memory expert tool on Superdome systems
(this behavior differs from the existing IPR releases):
- On a virtual partitions system, when the user runs the
memory expert tool -> memory test and the memory test fails,
the following additional error messages can be logged to the
memory expert activity log file (only on a virtual partitions
system):
- An attempt to obtain exclusive access to a memory
controller via the vpmonitor driver failed. This error
is unexpected. The memory expert tool can not continue
with the memory test. Possible Cause(s)/Recommended
Action(s): Internal application error.
- The memory subsystem is not recognized by this tool
to support virtual partitions. The memory expert tool
cannot continue with the memory test. Possible
Cause(s)/Recommended Action(s): Internal application
error.
- An attempt to read syndrome error information was
not successful. The memory expert tool cannot continue
with the memory test. Possible Cause(s)/Recommended
Action(s): A process (possibly memlogd) in another
virtual partition has taken the virtual partition lock
away from the memory expert tool. Rerun the tool
later.
- An attempt to release the virtual partition lock
failed. This error is unexpected. The memory expert
tool cannot continue with the memory test. Possible
Cause(s)/Recommended Action(s): Internal application
error.
- JAGae23523
The disk LED on/off command in the expert tool for APEX was not
working. The code was corrected, and the command now works as
expected.
- JAGae11271; JAGae17176; JAGae03440; JAGae11107
JAGs fixed in this release:
- JAGae11271: Exerciser leaves temporary files.
Added routine to delete temporary files, if a problem
arises prior to tomcatv comparison. If tomcatv comparison
fails, the temporary files are purposely left on disk for
review.
- JAGae17176: Exerciser fails on K-Class.
The Exerciser would return a SUCCESSFUL status, though
it would not perform the tests. Now, it correctly
performs and reports the exerciser tests.
- JAGae03440: Add text to show LPMC-deactivated CPUs.
If the CPU monitor takes a CPU offline due to LPMC
errors, the status displayed for the processor will be
"Inactive--LPMC". CPUs in this state cannot be activated
using the Expert Tool. An attempt to reactivate a CPU in
this state results in the following error message:
"The Processor specified (#) was deactivated by the LPMC monitor due to
Low Priority Machine Check (LPMC) errors. The Expert Tool cannot
activate processors deactivated by the LPMC monitor. Refer to the LPMC
event log to determine the appropriate corrective action."
- JAGae11107: Confusing error message on T600.
Changed the error message to the following:
"CPU # cannot be deconfigured: Each cell must have at least one CPU
configured. (Note: This machine type may not support CPU
deconfiguration.)"
- Logtool:
- Utilities:
Known Problems
CAUTION:
Info Tool Causes HPMCs on K- and T-Class (HP-UX 11i only)
K-Class and T-Class computers running HP-UX 11i will experience
HPMCs or Data Page Faults if you run the STM Info Tool to retrieve
configuration information from HP-PB SE SCSI adapters (JAGad88317,
JAGad97126).
The root cause is a problem in the HP-UX 11i kernel, which is
exposed when the Info Tool is run as described above. To correct the
problem and avoid potential HPMCs, load patch PHKL_25552 or its
successor. This is a kernel patch and requires a system reboot.
STM only exposes this problem if the Info Tool is run on HP-PB SE
SCSI adapters. STM does not expose the problem if the Info Tool is
run against other I/O devices.
Defect Reporting
Use CHART to report defects in STM. The project name is
diag.stm.tools.hpux for individual tools, and diag.stm.ui.hpux for
the user interface. If you don't have access to CHART, contact an HP
representative to enter a defect for you.
SD Product Structure
The product number for STM is B4708AA.
SD PRODUCT: Sup-Tool-Mgr
Description: On-line Diagnostic System (Series 800/700)
SD SUB-PRODUCT: Manuals
Description: Support Tools Manager Manual Pages
FILESET: STM-MAN
Description: S800/S700 STM Manual Pages
FILESET: STM-SHLIBS
Description: S800/S700 STM Shared Libraries
FILESET: STM-UI-RUN Corequisite Filesets: STM-SHLIBS
Description: S800/S700 STM User Interface
FILESET: STM-UUT-RUN Corequisite Filesets: STM-SHLIBS
Description: S800/700 STM Unit Under Test Runtime
URL: http://docs.hp.com/hpux/onlinedocs/diag/stm/str_0209_11i.htm
Last updated: Thurs July 11 10:55:56 PDT 2002