These release notes cover the March 2001 release of Support Plus for HP-UX 11i/11.00/10.20 running on S800/S700 systems.
- Overview
- Configuring Hardware Monitoring
- Documentation
- Changes
- Known Problems
- Monitors Provided
- Monitor Dependencies
- Defect Reporting
- SD Product Structure
Overview
NOTE: As of the September 1999 release, the name of the Diagnostic/IPR Media has been changed to Support Plus. In addition, the format has changed so that there is a separate CD-ROM for each version of the operating system (HP-UX 11i, 11.00, and 10.20).
Included on the Support Plus CD-ROM are the EMS Hardware Monitors - an important tool for maintaining system availability. The EMS hardware monitors allow you to monitor the operation of a wide variety of hardware products and be alerted immediately if any failure or other unusual event occurs. Hardware event monitoring is available to users running HP-UX 11i, 11.00, or 10.20 (IPR 9902 and later).
Hardware event monitoring provides a high level of protection against system hardware failure. By using hardware event monitoring, you can virtually eliminate undetected hardware failures that could interrupt system operation or cause data loss.
Configuring Hardware Monitoring
The EMS Hardware Monitors are installed at the same time as the Support Tools Manager. Once the monitoring software is installed, monitoring is automatically enabled.
By default, messages regarding major warning, serious, and critical events that occur on monitored hardware will be:
- Written to /var/adm/syslog/syslog.log
- Sent to EMAIL address root
All events will be stored in /var/opt/resmon/log/event.log.
To configure, enable, or disable hardware event monitoring, run the monitoring request manager, /etc/opt/resmon/lbin/monconfig.
The Peripheral Status Monitor (PSM) and the Kernel Resource Monitor (krmond) are configured differently: they use the EMS GUI. See: http://docs.hp.com/hpux/onlinedocs/diag/ems/ems_gui.htm
Documentation
For the latest and most complete information on EMS Hardware Monitors and the Support Tools Manager (STM), see the Web page "Diagnostics":
http://docs.hp.com/hpux/diag/
At this site, you will find Overviews, Tutorials, Quick Reference Cards, Frequently Asked Questions (FAQs), and much other material.
For complete information on installing and using EMS hardware monitors, as well as a list of supported hardware, refer to the "EMS Hardware Monitors User's Guide" available at the above site. An electronic copy of this book is also included on the Support Plus CD-ROM in the <mount_point>/DIAGNOSTICS directory.
Changes in the EMS Hardware Monitors for the March 2001 release include:
- Changes to Multiple Monitors
- Changes to Individual Monitors
- Changes to Platform and Interface
- Customer-Visible Interface Changes
Changes to Multiple Monitors
- Changed the default behaviour of multiple-view (Predictive-enabled) monitors. These monitors now keep event history in memory; they no longer create and maintain an event history file by default in the /var/stm/config/tools/monitor directory. This means that event history will no longer be saved across disable/enable or reboot, by default.
EXCEPTION: the dm_memory monitor will not have the new default behavior. Instead, it will create and maintain the event history file, as in the past. Its event history will be saved across disable/enable or reboot.
With the new default behavior, if you disable/enable monitoring, all event history will be lost. This means that if a monitor is keeping track of a set of errors for trending to determine if an error should be generated, the monitor will lose the history of those errors when it is disabled and re-enabled. The most common result of a monitor losing its event history is:
IF a particular event was generated and it had a suppression time
AND the user disabled and re-enabled the monitor during that suppression time
THEN the user might see that event generated again after the monitor is re-enabled, even though the "original" suppression time had not expired, because the monitor had lost the knowledge that it had previously generated that event.
The user can determine which history might be lost by looking at the monitor data sheets for threshold and suppression time information.
- Fixed a problem that occurred on Superdomes running HP-UX 11i. The previous release would log "unavailable" as the system serial number to the event log for any monitor event generated on a system running HP-UX 11i. Now, the system serial number will be logged instead of "unavailable".
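The IF/AND/THEN scenario described above for in-memory event history can be illustrated with a small simulation. This is only a sketch under stated assumptions: the Monitor class, its method names, and the use of plain minute counters are hypothetical and are not taken from the EMS monitor implementation. It shows why a suppressed event can reappear after disable/enable when history is kept in memory, and why a monitor with a persistent history file (such as dm_memory) does not show this effect.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Toy model of a monitor that suppresses repeated events (hypothetical)."""
    persist_history: bool  # True models a file-backed history, as dm_memory keeps
    memory_history: dict = field(default_factory=dict, init=False)
    file_history: dict = field(default_factory=dict, init=False)  # stands in for the history file

    def _history(self):
        return self.file_history if self.persist_history else self.memory_history

    def report(self, event_id, now, suppression_minutes):
        """Return True if the event is generated, False if it is suppressed."""
        last = self._history().get(event_id)
        if last is not None and now - last < suppression_minutes:
            return False
        self._history()[event_id] = now
        return True

    def disable_enable(self):
        """Disabling and re-enabling discards anything held only in memory."""
        self.memory_history.clear()


# Event number and times below are arbitrary illustration values.
in_memory = Monitor(persist_history=False)
first = in_memory.report(30, now=0, suppression_minutes=1440)
repeat = in_memory.report(30, now=60, suppression_minutes=1440)
in_memory.disable_enable()
after_restart = in_memory.report(30, now=120, suppression_minutes=1440)

persistent = Monitor(persist_history=True)
persistent.report(30, now=0, suppression_minutes=1440)
persistent.disable_enable()
persistent_after_restart = persistent.report(30, now=120, suppression_minutes=1440)
```

In this model the in-memory monitor generates the event again at minute 120, well inside the original suppression time, while the persistent monitor continues to suppress it.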
Changes to Individual Monitors
- SCSI Card Monitor (scsi123_em). Enhanced to be multiple-view (Predictive-enabled). Also enhanced logfiles to provide hw_path data and to meet format requirements of Predictive Support.
- System Status Monitor (sysstat_em). Enhanced sysstat_em to be multiple-view (Predictive-enabled).
- UPS Monitor (dm_ups).
- Added support for Event 103 (the test event).
- (JAGad37359) UPS Monitor (dm_ups). Fixed a problem whereby the dm_ups monitor would occasionally core dump. The problem occurred if the UPS monitor was the first monitor started and either the system info tool or the cpu info tool had not yet been run.
- Memory Monitor (dm_memory).
- (JAGad27656) The operation of the Memory Monitor was changed so that Suppression Time and Time Window values are now in minutes rather than seconds. These values appear in the configuration file /var/stm/config/tools/monitor/default_dm_memory.clcfg. In addition, several values for Suppression Time and Time Window were changed: 1400 to 1440, 2800 to 2880, and 4200 to 4320 (1440, 2880, and 4320 minutes are exactly 1, 2, and 3 days).
- (JAGad37319) Modified the Memory Monitor so that it will not have the new default behavior of multiple-view (Predictive-enabled) monitors. Instead, the monitor will create and maintain the event history file, as in the past. Its event history will be saved across disable/enable or reboot.
- Fixed a problem, whereby the monitor would attempt to access message 10 or 11 in message set 50, logging an error when it did not find either of these messages. Part of the error message would read: Message in ll_msg (set: 50 msg: 10) did not exist in catalog. This problem occurred when there was new hardware for the memory monitor to monitor.
- Disk Monitor (disk_em).
- If a device uses a removable medium, and sense_key/code/qual indicate that the device is not ready because no medium is present, disk_em will not generate an event.
- Added Events 11 - 19. Events 11, 12, 13, 14, and 15 may be generated by general SCSI errors. Events 16, 17, 18, and 19 may be generated by driver errors (CDB status).
- Enhanced Disk Monitor so that event history is reset after an FRU is replaced/removed. Also enhanced processing of driver errors.
- I2O drives are now filtered OUT so that they are not monitored by disk_em.
- (JAGad45321) Fixed a problem (corner case) whereby, if there were no hardware paths to filter, memory corruption occasionally occurred and the monitor might not run.
- (JAGad34276) Fixed the following problem. When FC60 array firmware version "HP08" or later is installed, an interaction between disk_em monitor and the new firmware causes the disk_em monitor to falsely report event #30. The updated UTM lun is perceived to be a scsi disk; however, because it is not a scsi disk, disk_em then also logs errors into the EMS event log.
- (JAGad30397) Fibre Channel Adapters (dm_FCMS_adapter). Corrected message text. The fcmsutil command syntax had been displayed incorrectly as fcmsutil <device file> reset clear. The correct syntax is fcmsutil <devicefile> reset.
- Fibre Channel Adapter Model A5158 Monitor (dm_TL_adapter). Enhanced this monitor to accommodate changes to the driver to support OLAR. (HP-UX 11i only).
- Remote Monitor (RemoteMonitor).
- Fixed a problem, whereby the Remote Monitor did not shut down when all the entries in RemoteMonitor.cfg were defined as DISABLE.
- Added support for DEV_IDs in the client configuration file, /var/stm/config/tools/monitor/default_RemoteMonitor.clcfg. (HP-UX 11.00 only).
- AutoRAID Disk Array (armmon).
- For 10.20 and 11.00: Corrected a problem in which armmon was producing an error event on systems without an HP AutoRAID 12/12h disk array attached.
- (JAGad33657) For 11i: Added support for the send_test_event command, Event 103.
- (JAGad27648) For 11i: Added support for client-specified event suppression values in a .clcfg file.
- Fixed a problem in which armmon could hang if too many events occurred simultaneously.
- (JAGad37359) High Availability Disk Array Monitor (ha_disk_array). Fixed a problem with the ha_disk_array monitor, whereby the monitor would not find all the hardware paths that it should monitor. The problem occurred only in the Dec 00 release.
- Disk Array FC60 Monitor (fc60mon).
- For 10.20 and 11.00: Corrected a problem in which fc60mon was producing an error event on systems without an HP FC60 disk array attached.
- For 10.20 and 11.00: Added support for new FC60 controller firmware.
- (JAGad33657) For 11i: Added support for the send_test_event command, Event 103.
- (JAGad27648) For 11i: Added support for client-specified event suppression values in a .clcfg file. A side effect of this change is that the fc60mon.log file will grow more quickly. (To get suppression to work correctly the value of REPEAT_FREQUENCY had to be set to one minute.)
- For 11i: Fixed a problem in which fc60mon could hang if too many events occurred simultaneously.
- (JAGad42934) For 11i: Fixed a problem whereby some events would cause fc60mon to hang. Setting DEBUG_LEVEL to 030309 in order to "see" all the events now also works correctly.
- For 11i: Added support for new FC60 firmware.
- For 11i: Enabled event #4, "Can't communicate with AM60Server", which was previously disabled.
- (JAGad36765) Fixed a problem with fc60mon and PSM. The format of the /var/stm/config/tools/monitor/fc60mon.psmcfg configuration file was such that the last line in the file was not read. As a result, the default psmcfg mapping of severity to DOWN state was used for this monitor and an error was logged in the /etc/opt/resmon/log/client.log file, as indicated below.
Severity threshold keyword #1 did not have matching operator keyword or visa versa in the entry with monitor resource name /storage/events/disk_arrays/FC60. This severity mapping will be skipped.
The file is supposed to configure the monitor resources to go into the DOWN state only if the severity is CRITICAL. Because of the skipped severity mapping, the monitor resources went into the DOWN state when the severity was SERIOUS or higher.
- Kernel Resource Monitor (krmond): updated to version A.11.00.03 to fix the install problem described below:
The Kernel Resource Monitor (krmond) was not correctly installed if diagnostics were installed using Ignite-UX when booted over the network and installing from a depot. However, using Ignite-UX to install the KRM product from an archive did work.
If you tried to install the EMS-KRMonitor product using Ignite-UX and saw errors, the KRM product would not run, but nothing else would be affected.
Affected Configurations: This problem only occurred on the Dec 2000 release of the diagnostics for HP-UX 11.00. It only occurred using Ignite-UX when booted over the network. The problem did NOT occur if the diagnostics were installed directly from a Support Plus CD-ROM or from an OnlineDiag depot downloaded from the HP Software Depot website.
Symptoms: Two errors probably appeared in the install log (swagent.log):
ERROR: Cannot install a dlkm driver.
and
ERROR: Cannot configure a dlkm driver.
Additionally, the Kernel Resource Monitor would not run.
The workaround is to exclude the EMS-KRMonitor package from Ignite-UX sources and to install it with swinstall if desired. For details, see the Dec 00 EMS HW Monitor Release Notes.
- (JAGad34371). SCSI Tape Monitor (dm_stape).
- Added "$Revision: $" string to all configuration files.
- (JAGad34371) Added an enhancement to clear event histories (to prevent event suppression) whenever the hardware changes.
- Modified monitor to ignore devices not supported.
- SCSI Tape Devices Monitor (dm_stape). (The following changes are only for HP-UX 11i; the changes were made to 10.20 and 11.00 in the December 2000 release.)
- Modified the monitor so that it reads the dm_stape.cfg file to determine which, if any, errors to insert while the program is running.
- Added Product IDs for C7369A Tape drive ("Ultrium LTO") and C7483A Tape drive ("Benchmark") to list of supported devices. The Product IDs are "Ultrium 1-SCSI" and "DLT1" respectively.
- (JAGad29068) Increased the suppression time for Event 201 from 1440 minutes (1 day) to 10080 minutes (1 week). Changed severity for Event 201 from CRITICAL to MAJOR_WARNING.
- (JAGad29066) Reduced the severity of several events to comply with requests from Predictive Support:
Event:   11i Initial         11i March 00
         Release Severity    Release Severity
======   =================   =================
20       SERIOUS             MAJOR_WARNING
22       SERIOUS             MAJOR_WARNING
23       SERIOUS             MAJOR_WARNING
30       CRITICAL            SERIOUS
31       CRITICAL            SERIOUS
33       CRITICAL            MAJOR_WARNING
38       SERIOUS             MAJOR_WARNING
40       CRITICAL            MINOR_WARNING
42       MAJOR_WARNING       MINOR_WARNING
43       SERIOUS             MAJOR_WARNING
44       SERIOUS             MINOR_WARNING
45       SERIOUS             MAJOR_WARNING
201      CRITICAL            MAJOR_WARNING
203      CRITICAL            MAJOR_WARNING
204      CRITICAL            MAJOR_WARNING
209      SERIOUS             MAJOR_WARNING
210      SERIOUS             MAJOR_WARNING
216      CRITICAL            SERIOUS
217      CRITICAL            MAJOR_WARNING
218      CRITICAL            MAJOR_WARNING
230      SERIOUS             MAJOR_WARNING
901      SERIOUS             MAJOR_WARNING
- Core Hardware Monitor (dm_core_hw).
- (JAGad45946) Fixed a problem with the Dec 00 release and the initial release of HP-UX 11i on N-Class, L-Class and A-Class systems, whereby erroneous events could be generated regarding hardware that is not present on these systems. These events may refer to I/O power supplies, I/O fans and backplane power boards. These erroneous events have been seen very infrequently.
- Added monitoring of corrected errors detected and logged by the Superdome crossbar controller (XBC) and cell controller (CC) chips. These errors are reported in Events 63 through 85, which have been added to the monitor.
- (JAGad39310) Fixed a problem, whereby dm_core_hw exited at each polling cycle, which is once every 15 minutes. It then restarted, logged a "restart" event message in the event log /var/opt/resmon/log/event.log, and sent the same message to all targets associated with this monitor.
- (The following change is only for HP-UX 11i; the change was made to 10.20 and 11.00 in the December 2000 release.) Events 33 and 34 (for overtemp) are no longer suppressed. Whenever the hardware reports a transition to one of the two overtemp states that can be detected, event 33 or 34 will be generated.
Detailed description: When the core hardware monitor detects an overtemp situation, it generates an event 33. If things get warmer still, it generates an event 34. (Typically event 34 is never actually generated because by default, the envd config file is set up to shut down the system when this condition occurs.)
If either of these conditions occurs, and then the temperature returns to normal, the system will wait for 15 minutes (programmable in the core hardware monitor config file) before generating an event 35, saying that the temperature has returned to normal. The idea was to "maintain a state of wariness," so to speak, until we had a way to know that the temperature was normal for long enough that we felt that the danger had subsided.
This is further complicated by the fact that, on N-Class and newer systems, we don't get a status which tells us that the temperature has returned to normal, and therefore never generate event 35.
If event 33 and/or 34 are generated, and then the temperature returns to normal, and then the system gets warm again, the hardware would tell us that we should generate another event 33. The way things were, we didn't generate an event 33 because it was within the suppression time. Since it seems likely that the hardware is reporting a new condition, and since time to fix an overtemp situation is critical, we decided to report all overtemp conditions.
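The overtemp behavior described above (events 33 and 34 generated on every transition into an overtemp state, and event 35 only after the temperature has stayed normal for the waiting period) can be sketched as follows. The state names, the (minute, state) reading format, and the function name are assumptions made for illustration; only the event numbers and the default 15-minute wait come from the description above, and the sketch does not represent the actual dm_core_hw code.

```python
NORMAL, WARM, HOT = "normal", "overtemp", "overtemp_high"  # hypothetical state names
EVENT_FOR_STATE = {WARM: 33, HOT: 34}
NORMAL_DELAY_MINUTES = 15  # default wait before event 35


def overtemp_events(readings):
    """Return a list of (minute, event) pairs for a sequence of (minute, state) readings.

    Every transition into an overtemp state generates its event, with no
    suppression. Event 35 is generated once the state has been normal for
    NORMAL_DELAY_MINUTES after an overtemp condition.
    """
    events = []
    previous = NORMAL
    normal_since = None  # minute at which the temperature last returned to normal
    for minute, state in readings:
        if state != previous:
            if state in EVENT_FOR_STATE:
                events.append((minute, EVENT_FOR_STATE[state]))
                normal_since = None
            else:
                normal_since = minute
        if (state == NORMAL and normal_since is not None
                and minute - normal_since >= NORMAL_DELAY_MINUTES):
            events.append((minute, 35))
            normal_since = None
        previous = state
    return events


# Warm, back to normal, warm again within minutes, then normal for 15 minutes.
sample = [(0, NORMAL), (5, WARM), (10, NORMAL), (12, WARM), (20, NORMAL), (35, NORMAL)]
result = overtemp_events(sample)
```

With these sample readings the second overtemp condition at minute 12 is reported even though it follows the first by only seven minutes. On systems that never report a return to normal (N-Class and newer, as noted above), the readings would contain no normal state after an overtemp and no event 35 would appear.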
Changes to Platform and Interface
- Corrected an error message logged into the System Activity Log for the psmctd process. Previously, when monitoring was disabled and STM was restarted, psmctd logged errors indicating that it was exiting due to an unexpected signal (16), rather than because monitoring was disabled (the real case). Previous error in system activity log:
Daemon process completed with exit_status UNEXPECTED_SIGNAL_EXIT(208) indicating process exited due to receipt of an unexpected signal (16). Possible Causes/Recommended Action: Process internal error.
Proper error in system activity log:
Daemon process completed with exit_status SYS_SHUTDOWN_EXIT(150) indicating the process has exited due to Hardware Event Monitoring being disabled Possible Causes/Recommended Action: This process is associated with the Hardware Event Monitoring system which is disabled. Use the /opt/resmon/lbin/monconfig program to enable hardwarwe monitoring and restart this process.
- Peripheral Status Monitor (PSM).
- Enhanced psmctd to work around EMS problem of not handling signal interrupts. This will remove the errors that occasionally occur in the client.log indicating rm_recv_reply failed due to an interrupted call.
- Enhanced psmctd to recreate the psm_data file if it is removed.
- Enhanced psmmon to handle missing psm_data file and psm_data file which is in the process of being updated by psmctd. This will remove excessive error logging in the api.log from psmmon indicating the psm_data file couldn't be accessed.
- (The following change is only for HP-UX 11i; the change was made to 10.20 and 11.00 in the December 2000 release.) Fixed a problem whereby the toggle_switch process would hang during startup or shutdown for monitors which did not behave properly. The problem is rare; only one case has been reported. It occurred when a monitor started properly and accepted configuration, but then, for some reason, stopped responding to EMS. Some monitors behave this way if one shuts down diagnostics without shutting down monitoring first. What users see depends on how they performed the "shutdown".
- If they did the "shutdown" from monconfig, then monconfig would hang at the "This might take a while..." display. If they exited monconfig (by pressing Ctrl-C), they would see toggle_switch running forever.
- If they did the "shutdown" by calling toggle_switch directly, they would hang forever.
- In both cases, depending on how far toggle_switch got before it hung, they would see some monitors still running as well.
Customer-Visible Interface Changes
This section reports changes to the customer-visible interface in this release. This information is provided for the benefit of customers who use scripts to drive hardware support tools or to examine the output of hardware support tools.
None reported.
Known Problems
CAUTION: UPS Monitor May Need a Patch
In some cases, the UPS monitor (dm_ups) will not function and will instead generate event 42 with the text:
Probable Cause / Recommended Action: The monitor was unable to locate the fifo pipe that should have been created by ups_mond. Therefore, information about the ups cannot be sent to the monitor. You need version (80.1.2.3) of ups_mond or greater. To update your system with the correct version of ups_mond, install one of the following patches:
HPUX 10.20/s800 : PHCO_23830
HPUX 11.00 : PHCO_23831
HPUX 11.11 : PHCO_23832
(The system may have an earlier version of the message for event 42. The earlier version does not contain the patch information.)
To fix the problem, load the indicated patch.
This problem will affect most systems with a UPS when the March 2001 diagnostics are installed. The only systems not affected will be those which are being updated from certain versions of the diagnostics (September 2000 through March 2001) and which do not have patch PHCO_19031 (HP-UX 10.20) or PHCO_19040 (HP-UX 11.00) installed.
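For administrators who script their patch checks, the patch list in the event 42 text above can be expressed as a simple lookup. This is a minimal sketch: the function name is hypothetical, and the assumption that release strings may carry a leading "B." (as in "B.11.11") is ours and should be checked against the actual systems. The patch numbers are those quoted in the event text.

```python
# Patch IDs as listed in the event 42 message text.
UPS_MOND_PATCHES = {
    "10.20": "PHCO_23830",  # HP-UX 10.20 (listed for s800)
    "11.00": "PHCO_23831",
    "11.11": "PHCO_23832",
}


def ups_mond_patch(release):
    """Return the ups_mond patch for a release string, or None if none is listed.

    Accepts either a bare version ("11.00") or one with a leading "B."
    prefix ("B.11.00"); the prefix handling is an assumption.
    """
    version = release.strip()
    if version.upper().startswith("B."):
        version = version[2:]
    return UPS_MOND_PATCHES.get(version)


patch_for_1100 = ups_mond_patch("B.11.00")
patch_for_1111 = ups_mond_patch("11.11")
patch_for_unknown = ups_mond_patch("B.11.23")
```

A release that is not in the table returns None rather than a guess, so a calling script can report that no patch is listed for it.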
CAUTION: Monitoring Changes for disc30, sdisk and disk array devices
As of IPR 9902 (Feb 99 release), there has been a change to the way that monitoring is done for disc30, sdisk and the HA Disk Array Models 10, 20, and 30FC.
Formerly, the "diaglogd exec" programs (pdisc30_exec, pharaymon_exec, and psdisk_exec) handled driver error entries for these devices.
As of IPR 9902, these programs have been deleted and their functionality is now provided by the EMS Hardware Monitors.
If you had customized the configuration files for the diaglogd exec programs (disk30_exec.cfg, sdisk_exec.cfg, and haraymon_exec.cfg) you may wish to re-configure the EMS Hardware Monitors to achieve the same results.
CAUTION: Compatibility Problem with EMS-Related Products (ServiceGuard, HA Monitors, etc.)
If you install the OnlineDiag bundle (Dec 99 or later) onto a computer running older revisions of EMS-related products, these products may experience compatibility problems. Affected products include MC/ServiceGuard, ServiceGuard OPS Edition and High Availability Monitors. The only critical problems occur with the following versions:
MC/ServiceGuard A.10.10, A.11.01, A.11.03
ServiceGuard OPS Edition A.11.02, A.11.03
Support Tools and the EMS hardware monitors are not affected. For complete information, see EMS Incompatibility Problem.
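The version list above lends itself to a quick pre-installation check. The sketch below is illustrative only: the function name and the way product names and versions are passed in are assumptions, and obtaining the installed versions from the system is left out. The product names and version strings are the ones listed above.

```python
# Versions listed above as having critical compatibility problems.
CRITICAL_VERSIONS = {
    "MC/ServiceGuard": {"A.10.10", "A.11.01", "A.11.03"},
    "ServiceGuard OPS Edition": {"A.11.02", "A.11.03"},
}


def has_critical_incompatibility(product, version):
    """Return True if this product/version pair is on the critical list."""
    return version in CRITICAL_VERSIONS.get(product, set())


listed = has_critical_incompatibility("MC/ServiceGuard", "A.11.01")
not_listed = has_critical_incompatibility("MC/ServiceGuard", "A.11.09")
other_product = has_critical_incompatibility("Sup-Tool-Mgr", "A.11.03")
```

A False result only means the pair is not on the critical list; other older revisions may still experience the less severe problems mentioned above.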
Monitors Provided
Monitors are provided to support the following:
- AutoRAID Disk Array (armmon)
- Core Hardware (dm_core_hw)
- Disk (disk_em)
- Disk Array FC60 (fc60mon)
- Fast Wide SCSI Disk Array (fw_disk_array)
- Fibre Channel Adapters (dm_FCMS_adapter)
- Fibre Channel Adapter Model A5158 (dm_TL_adapter)
- Fibre Channel Arbitrated Loop Hub (dm_fc_hub)
- Fibre Channel SCSI Multiplexer (dm_fc_scsi_mux)
- Fibre Channel Switch (dm_fc_sw)
- High Availability Disk Array (ha_disk_array)
- High Availability Storage System (dm_ses_enclosure)
- Kernel Resource (krmond)
- LPMC (lpmc_em)
- Memory (dm_memory)
- Remote (RemoteMonitor)
- SCSI Card (scsi123_em)
- SCSI Tape Devices (dm_stape)
- System Status (sysstat_em)
- UPS (dm_ups)
In addition, a Hardware status monitor is provided to monitor the current status of the products supported by the above list.
For detailed information concerning which products are supported by which monitors and additional dependencies, check the "Diagnostics" section of Hewlett-Packard's online documentation web site: http://docs.hp.com/hpux/diag/
Monitor Dependencies
Several of the monitors have special requirements, such as patches or certain versions of firmware. In particular:
- The Fibre Channel Arbitrated Loop Hub Monitor and the Fibre Channel Switch Monitor require special configuration which is described in their data sheets in the "EMS Hardware Monitors User's Guide" (chapter 6). A patch is also required.
- A patch is required if your system includes an HP SureStore E Disk Array FC60. This patch is required to run the EMS hardware monitor (fc60mon) or STM tools for this device.
For a list of the current required patches, see the DIAGNOSTIC.readme file for this release.
Current monitor requirements are described in the "Supported Products" page under "EMS Hardware Monitors" at http://docs.hp.com/hpux/diag . Requirements are also listed in chapter 2 of the manual "EMS Hardware Monitors User's Guide".
Defect Reporting
Use CHART to report defects in the EMS Hardware monitors. The project name is diag.hw_mon.hpux. If you don't have access to CHART, contact an HP representative to enter a defect for you.
SD Product Structure
The EMS hardware monitors are installed as part of the OnlineDiag bundle (product number B4708AA). In addition, they utilize the EMS framework, product number B7609BA.
Note: EMS Hardware Monitors are installed as part of the STM-UUT-RUN Fileset. However, the EMS Hardware Monitors are dependent on the EMS-Core and EMS-Config products and additional filesets in the Sup-Tool-Mgr Product.
For information on the STM product, refer to the STM release notes file /usr/sbin/stm/Rel_NOTES.STM.
SD Bundle: OnlineDiag
Description: On-line Diagnostic System (Series 800/700)
  SD PRODUCT: Sup-Tool-Mgr
  Description: Support Tools Manager for HP-UX Systems
    SD SUB-PRODUCT: Manuals
    Description: Support Tools Manager Manual Pages
      FILESET: RELEASE_NOTES
      Description: HPUX STM Release Notes
      FILESET: STM-MAN
      Description: HPUX STM Manual Pages
    SD SUB-PRODUCT: Runtime
    Description: STM Manual Runtime
      FILESET: STM-CATALOGS
      Description: HPUX STM Shared Libraries
      FILESET: STM-SHLIBS
      Description: HPUX STM Shared Libraries
      FILESET: STM-UI-RUN
      Description: HPUX STM User Interface
      FILESET: STM-UUT-RUN
      Description: HPUX STM Unit Under Test Runtime
  SD PRODUCT: EMS-Config
  Description: EMS Config
    FILESET: EMS-GUI
    Description: Event Monitoring Service Graphical User Interface
  SD PRODUCT: EMS-Core
  Description: EMS Core Product
    FILESET: EMS-CORE
    Description: Event Monitoring Service Core Files