These release notes cover the September 2002 release of Support Plus for HP-UX 11i/11.00/10.20 running on S800/S700 systems.
- Overview
- Configuring Hardware Monitoring
- Documentation
- Changes
- Known Problems
- Monitors Provided
- Monitor Dependencies
- Defect Reporting
- SD Product Structure
NOTE: As of the September 1999 release, the name of the Diagnostic/IPR Media has been changed to Support Plus. In addition, the format has changed so that there is a separate CD-ROM for each version of the operating system (HP-UX 11i, 11.00 and 10.20).
Included on the Support Plus CD-ROM are the EMS Hardware Monitors - an important tool for maintaining system availability. The EMS hardware monitors allow you to monitor the operation of a wide variety of hardware products and be alerted immediately if any failure or other unusual event occurs. Hardware event monitoring is available to users running HP-UX 11i, 11.00, or 10.20 (IPR 9902 and later).
Hardware event monitoring provides a high level of protection against system hardware failure. By using hardware event monitoring, you can virtually eliminate undetected hardware failures that could interrupt system operation or cause data loss.
Configuring Hardware Monitoring
The EMS Hardware Monitors are installed at the same time as the Support Tools Manager. Once the monitoring software is installed, monitoring is automatically enabled.
By default, messages regarding major warning, serious, and critical events that occur on monitored hardware will be:
- Written to /var/adm/syslog/syslog.log
- Sent to EMAIL address root
All events will be stored in /var/opt/resmon/log/event.log.
To configure, enable, or disable hardware event monitoring, run the monitoring request manager: /etc/opt/resmon/lbin/monconfig .
The Peripheral Status Monitor (PSM) and the Kernel Resource Monitor (krmond) are configured differently. They use the EMS GUI. See: http://docs.hp.com/hpux/onlinedocs/diag/ems/ems_gui.htm
For the latest and most complete information on EMS Hardware Monitors and the Support Tools Manager (STM), see the Web page "Diagnostics":
http://docs.hp.com/hpux/diag/
At this site, you will find Overviews, Tutorials, Quick Reference Cards, Frequently Asked Questions (FAQs), and much other material.
For complete information on installing and using EMS hardware monitors, as well as a list of supported hardware, refer to the "EMS Hardware Monitors User's Guide" available at the above site. An electronic copy of this book is also included on the Support Plus CD-ROM in the <mount_point>/DIAGNOSTICS directory.
Changes in the EMS Hardware Monitors for the September 2002 release include:
- Changes to Multiple Monitors
- Changes to Individual Monitors
- Changes to Platform and Interface
- Customer-Visible Interface Changes
Changes to Individual Monitors
Changes to each monitor are described below. (Monitors are listed in alphabetical order.)
- AutoRAID Disk Array (armmon).
- Chassis Code Monitor (dm_chassis).
- This is the HWE0209 release for the chassis code monitor. It includes the following changes:
- Mask out message ID from chassis code encoded field when pre-qualifying the event to avoid unnecessary chassis code events.
- Export the new chassis code database and import into UNIX. This means tlchassis.sl and tlchassis.msg have the same content as the chassis code databases on NT. In detail, it means
- Some new caribe chassis code events.
- Some event numbers are re-assigned. The idea is that the same event number should not be shared by the different platforms. In the HWE0206 release, some event numbers are shared by Superdome and Keystone, like event number 1398. In HWE0209, we created a new number, 1798 for Keystone.
- Some new chassis code events for Keystone, such as 0xB0500864 69FF404F (keyword is HLSB_POWER_FAULT).
- CMC Monitor (cmc_em).
- N/A
- Core Hardware Monitor (dm_core_hw)
- JAGae28624
Fixed condition where dm_core_hw monitor logged errors about an ioctl call failing.
- JAGae23180
Fixed a problem with the HP rp7410 Server that could cause invalid ECC errors to be reported when the system was configured with a single-cell partition. This problem caused the dm_core_hw monitor to generate EMS events 79, 80, 81, 82, and 83, although there was no hardware problem. The complete fix for this problem requires rp7410 firmware version 4.0, as well as this patch.
- Core Hardware for Itanium (ia64_corehw).
- N/A
- CPU Monitor (lpmc_em).
Note: As of the June 2002 release, the LPMC Monitor (lpmc_em) was renamed to "CPU Monitor". The binary name is still lpmc_em. The name was changed to reflect the monitor's enhancement to check floating-point functionality in the CPU.
- Disk Array FC60 Monitor (fc60mon).
- JAGad34138; JAGad37486; JAGad34030; JAGae03024; JAGae13410
Enhanced the fc60mon monitor tool to report MEL-related events. The following events have been added to support the enhancement for the above JAGs:
- JAGae03024 EMS does not report disk failure in FC60 disk array
- JAGad34138 EMS:notification from EMS,even if no access to the failure disks
- JAGad37486 fc60mon should report MEL-related events
For Event 35 and Event 36, the default_fc60mon.clcfg entry should have the suppression time:timewindow:threshold as NOT_USED:5 (seconds):2 (No. of times the event occurs within this time frame).
- Event 33 would be generated when GHS is enabled on the fc60 device.
- Event 34 would be generated when GHS is disabled on the fc60 device.
- Event 35 would be generated when a disk is inserted into any of the JBODs connected to the FC60 controller.
- Event 36 would be generated when a disk is removed from any of the JBODs connected to the FC60 controller.
- Event 37 would be generated when the log files created by AM60Srvr files are corrupted.
The above events would be repeated, whenever the system is rebooted, or the AM60Srvr is restarted. The reason is that the AM60Srvr logs the previously reported events again.
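The suppression time:timewindow:threshold setting described above for Event 35 and Event 36 can be illustrated with a small sketch. This is not the monitor's implementation, and the colon-separated string is only the notation used in these notes, not necessarily the syntax of default_fc60mon.clcfg. The names parse_entry and WindowCounter are hypothetical; the sketch only shows the idea of counting how many times an event occurs within a time window.

```python
from collections import deque
from typing import Deque, Optional, Tuple


def parse_entry(value: str) -> Tuple[Optional[int], int, int]:
    """Split 'suppression:timewindow:threshold' (notation from the notes)."""
    suppression, window, threshold = value.split(":")
    # NOT_USED means no suppression time is configured.
    return (None if suppression == "NOT_USED" else int(suppression),
            int(window), int(threshold))


class WindowCounter:
    """Hypothetical helper: count occurrences inside a sliding time window."""

    def __init__(self, window_seconds: int, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._times: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one occurrence; return True once the threshold is reached."""
        self._times.append(timestamp)
        # Drop occurrences that are older than the time window.
        while timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.threshold


_, window, threshold = parse_entry("NOT_USED:5:2")
counter = WindowCounter(window, threshold)
first = counter.record(0.0)    # one occurrence so far
second = counter.record(3.0)   # second occurrence within 5 seconds
```

With the values NOT_USED:5:2, the second occurrence within 5 seconds is the one that reaches the threshold. What the monitor does at that point is not stated in these notes.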
Enhanced the fc60mon monitor tool to report historical log of LUN configuration and UTM LUN status:
- JAGad34030 EMS should warn when UTM LUN status not good
- JAGae13410 AM60 needs historical log of LUN configuration to recover LUNs
The customer would see these new events generated if the conditions, above, occur.
- Event 31 would be generated when there was a change in LUN configuration every 24 hours.
- Event 32 would be generated if the UTM LUN was disabled.
- Fast Wide SCSI Disk Array (fw_disk_array)
- Disk Monitor (disk_em).
- JAGae31385; JAGae09750; JAGae23668; JAGae19531
The disk_em monitor functionality after this change will be like this:
- JAGae09750- disk_em will monitor fixed disks with HP supported firmware version, and also specific products, if they are fixed disks.
The disk_em monitor will determine the set of disks to monitor in the following way:
- The monitor will ignore any removable medium disk devices, like MO, CDROM, etc.
- The monitor monitors a specific set of product numbers, if they are FIXED disks.
The disk_em gets the following information related to device:
1. Does the disk have HP supported firmware version?
2. Does the disk belong to specific set of product number list?
3. Is the disk of type FIXED?
4. Is the device of type removable medium?
Based on the above information, we determine whether to monitor or not, as shown in the table below:
CRITERIA         MONITOR   NOT-MONITOR
1 AND 3          Yes       -----
2 AND 3          Yes       -----
1 AND 2 AND 3    Yes       -----
1 AND 4          -----     Yes
2 AND 4          -----     Yes
1 AND 2 AND 4    -----     Yes
Specific sets of product number lists are updated on the following web page: http://www.docs.hp.com/hpux/onlinedocs/diag/ems/emd_disk.htm
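The monitoring criteria above can be expressed as a short decision function. The following is a minimal sketch, not the disk_em source: DiskInfo and should_monitor are hypothetical names, and the result for a fixed disk that meets neither criterion 1 nor criterion 2 (a case the table does not list) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DiskInfo:
    """Hypothetical summary of what disk_em learns about a device."""
    hp_supported_firmware: bool  # criterion 1
    in_product_list: bool        # criterion 2
    is_fixed: bool               # criterion 3
    is_removable: bool           # criterion 4


def should_monitor(disk: DiskInfo) -> bool:
    """Return True when the criteria table says the disk is monitored."""
    # Removable-medium devices (MO, CD-ROM, ...) are never monitored.
    if disk.is_removable:
        return False
    # A fixed disk is monitored if it has HP supported firmware,
    # is in the specific product number list, or both.
    # Assumption: a fixed disk meeting neither criterion is not monitored.
    return disk.is_fixed and (disk.hp_supported_firmware or disk.in_product_list)
```

For example, should_monitor(DiskInfo(True, False, True, False)) corresponds to the "1 AND 3" row, and should_monitor(DiskInfo(True, True, False, True)) corresponds to the "1 AND 2 AND 4" row.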
- JAGae31385-Implemented a defect table in disk_em monitor, which will specify the maximum number of allowable defects based on the disk capacity. disk_em monitor checks the current defects of the drive against this table, to determine whether we should generate an event 4 message or not.
HDD Capacity      P+G List Threshold   G List Only Threshold
2 GB and <2 GB    1024                 NA
4 GB              2048                 NA
9 GB and Above    NA                   8190
Note: The 9GB drive with model number ST19171 is treated as belonging to the 4GB category; hence, the maximum allowable defects for this drive are 2048 (P+G), because this drive supports a P+G list of max 2900. This information is updated on the following web page: http://www.docs.hp.com/hpux/onlinedocs/diag/ems/emd_disk.htm
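The defect table above amounts to a lookup by drive capacity with one model-specific exception. The sketch below is illustrative only: defect_threshold and exceeds_threshold are hypothetical names, matching the model by substring is an assumption, and so is treating "more defects than the maximum" as the condition for event 4. Capacities the table does not list return None rather than a guessed value.

```python
from typing import Optional, Tuple

# Model singled out in the notes: a 9 GB drive handled under the 4 GB category.
ST19171_MODEL = "ST19171"


def defect_threshold(capacity_gb: float, model: str = "") -> Optional[Tuple[str, int]]:
    """Return (list_type, max_defects), or None if the table has no entry."""
    if ST19171_MODEL in model:
        return ("P+G", 2048)
    if capacity_gb <= 2:
        return ("P+G", 1024)
    if capacity_gb == 4:
        return ("P+G", 2048)
    if capacity_gb >= 9:
        return ("G", 8190)
    return None  # capacity not covered by the table


def exceeds_threshold(defects: int, capacity_gb: float, model: str = "") -> bool:
    """Assumed rule: flag the drive when defects exceed the table maximum."""
    limit = defect_threshold(capacity_gb, model)
    return limit is not None and defects > limit[1]
```

For example, defect_threshold(9) selects the G-list-only limit, while defect_threshold(9, "ST19171") falls back to the 4 GB P+G limit.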
- JAGae23668-saved_timeout variable has been initialized, so that monitor would get the return value indicating it should poll the hardware.
- Fibre Channel Adapters (dm_FCMS_adapter).
- N/A
- Fibre Channel Adapter Model A5158 Monitor (dm_TL_adapter).
- N/A
- Fibre Channel SCSI Multiplexer (dm_fc_scsi_mux).
- N/A
- Fibre Channel Switch (dm_fc_sw).
- N/A
- High Availability Disk Array Monitor (ha_disk_array) .
- High Availability Storage System (dm_ses_enclosure)
- JAGae23014
Fixed condition where multiple events were being generated at irregular intervals for different firmware versions on the controllers.
- Kernel Resource Monitor (krmond)
- N/A
- LPMC Monitor (lpmc_em).
As of the June 2002 release, the LPMC Monitor (lpmc_em) was renamed to "CPU Monitor". The binary name is still lpmc_em. The name was changed to reflect the monitor's enhancement to check floating-point functionality in the CPU. For more information, see CPU Monitor (lpmc_em).
- JAGae18418
JAGae18418: Minor spelling errors in Event messages for LPMC monitor.
- "Floting" was corrected to be "Floating"
- "erors" was corrected to be "errors"
- "procesor" was corrected to be "processor"
==========================
The following applies to systems with Virtual Partitions only:
==========================
The LPMC monitor, CPU Info, Expert, and Exerciser tools will operate on the CPUs in the current partition, because the partition does not see hardware on other partitions. The tools try to gather information about the CPUs in the whole system, and then filter out the information for the CPUs not in the partition. The process of discovering the CPUs in the partition will log an entry in either Activity Log or the api.log file, indicating that the tool could not gather information about the processor(s) that don't belong to the current partition. The text of the log entry is similar to:
Failed to convert HPA (0xfffffffffce7a000) to spu_number. It may be possible that the CPU is deconfigured or the CPU is not in the Current partition.
Even though this looks like an error, it in fact is NOT an error because it is referring to the CPU outside the partition.
The LPMC monitor in a non-vPar environment will deactivate a CPU and activate one of the iCOD CPUs, if any are available. Since, in a vPar environment, iCOD CPUs are NOT visible to the partition, the monitor will not be able to activate the CPU. The user is advised to use the iCOD command to find out if iCOD CPUs are available on the system, and if so, activate one of them.
To determine the number of iCOD CPUs that are instantly activatable for the local vPar, use the following command:
icod_stat i
The command will print out a single number. Alternatively, the user can use the command "icod_stat" and look for information for 'Unassigned processors that can be assigned'.
To activate an iCOD processor, use the command:
icod_modify -a 1 \
<[description]:user_name:manager_name:manager_email:manager_phone>
For more information on the iCOD commands, refer to iCOD documentation.
In the vPar environment, when the LPMC monitor marks the last CPU in the partition for deconfiguration, the user is strongly advised to reboot the system, rather than rebooting the partition only. When the partition is rebooted, the faulty CPU will still be active, but when the system is rebooted, after bringing all the partitions down, the faulty processor will not be visible to the system. The goal of the LPMC monitor is to remove the faulty component, and this goal can be achieved only by rebooting the system.
- Memory Monitor (dm_memory).
Customers/users will see the following documented changes in the behavior of memlogd for the HWE0209 release of the OnlineDiag product, and only on Superdome systems with virtual partitions installed.
- JAGae35591
Fix to JAGae35591: Memlogd (mt_memlogd) on SuperDome sometimes has problems sending PDT information to the EMS HW Memory Monitor; hence, the EMS HW Memory Monitor may not be able to correctly perform any PDT trending analysis. In this case, when Memlogd (mt_memlogd) on SuperDome cannot send the PDT information to the EMS HW Memory Monitor, the following error messages will be logged in the memlogd activity log file:
Write system call failed with errno (32), when attempting to write data to file descriptor (6). EPIPE (32) errno returned from a write system call indicates an attempt was made to write to a socket that is not open for reading by any process.
Possible Causes/Recommended Action: The process reading the socket removed the socket or exited unexpectedly. Check the support tool system activity log and tool activity logs for more information on the process that removed the socket or exited.
Unable to open the file send_msg to read the memory monitor's socket and PID numbers.
Possible Causes/Recommended Action: Internal Application error.
Attempting to perform the PDT analysis when an unexpected error occurred.
Possible Causes/Recommended Action: Internal Application error.
- Peripheral Status Monitor (PSM/psmmon).
- JAGae14110
Full fix for the problem described in JAGae14110. Monitoring requests made in the EMS GUI for the resources of the EMS HW Monitoring status monitor, psmmon, were removed during an update. This has been fixed so that those monitoring requests are retained.
- Remote Monitor (RemoteMonitor).
- SCSI Card Monitor (scsi123_em).
- N/A
- SCSI Cascade Monitor (scsi_cascade).
- SCSI Disk (scsi_disk).
- JAGad91084
A fix was made for the condition where SCSI array disk errors would be sent to EMS monitor disk_em by default, rather than to the SCSI disk monitor. The STM scsi library routine identifying the driver name was looking for the typo 'disk30', instead of 'disc30'.
- SCSI Tape Monitor (dm_stape).
- System Status Monitor (sysstat_em)
- N/A
- UPS Monitor (dm_ups).
Changes to Platform and Interface
- JAGad99498
Modified moncheck, toggle_switch, and startmon_client (used by monconfig to check monitoring requests, disable monitoring, and enable monitoring, respectively) and psmctd to use the new rm_service_up routine in the EMS library to check whether the EMS services are available before attempting to connect to EMS. The code will display an error message to the user, indicating the registrar service has not been started, if the 5-minute timeout expires:
"EMS Registrar inetd service not started. Start registrar and retry".
This message will be displayed in monconfig when the user selects the K)ill or C)heck command. It is not displayed for an E)nable command, as there is no communication between monconfig and the program that actually enables the monitors. However, if the service is not available and the user selects E)nable, the command will complete, but the monitors will NOT be enabled, and the state displayed in monconfig will indicate monitors are NOT enabled. No errors can be logged into the EMS error logs, as logging requires connection to the registrar, which isn't started.
For psmctd, an error will be logged into the System Activity log, indicating psmctd exited due to the initialization error, below:
Wed Feb 27 15:29:12 2002: Daemon process (psmctd) with process identifier (26984) exited.
Wed Feb 27 15:29:12 2002: Daemon process completed with exit_status SYS_INIT_FAILED_EXIT (100) indicating the process exited because it could not perform basic initialization. Possible Causes/Recommended Action: Process internal error. For a remap hardware process, check the map log for more information.
Wed Feb 27 15:29:12 2002: Daemon process (psmctd) will not be restarted as restart attempts have exceeded the maximum allowed (5). Start daemon process manually using User Interface.
Customer-Visible Interface Changes
CAUTION: UPS Monitor May Need a Patch
In some cases, the UPS monitor (dm_ups) will not function and will instead generate event 45 (formerly event 42) with the text:
Probable Cause / Recommended Action: The monitor was unable to locate the fifo pipe that should have been created by ups_mond. Therefore, information about the ups cannot be sent to the monitor. You need version (80.1.2.3) of ups_mond or greater. To update your system with the correct version of ups_mond, install one of the following patches:
HPUX 10.20/s800 : PHCO_24153 (supersedes PHCO_23830)
HPUX 11.00 : PHCO_24172 (supersedes PHCO_23831)
HPUX 11.11 : PHCO_23832
To fix the problem, load the indicated patch or load the HWE patch bundle which contains this patch. For HP-UX 11i, the ups_mond patch PHCO_23832 is also distributed on the Sept 01 OE.
This problem will affect most systems with a UPS when the September 2001 diagnostics are installed. The only systems not affected will be those which are being updated from certain versions of the diagnostics (September 2000 through March 2001) and which do not have patch PHCO_19031 (HP-UX 10.20) or PHCO_19040 (HP-UX 11.00) installed.
CAUTION: Monitoring Changes for disc30, sdisk and disk array devices
As of IPR 9902 (Feb 99 release), there has been a change to the way that monitoring is done for disc30, sdisk and the HA Disk Array Models 10, 20, and 30FC.
Formerly, the "diaglogd exec" programs (pdisc30_exec and psdisk_exec) handled driver error entries for these devices.
As of IPR 9902, these programs have been deleted and their functionality is now provided by the EMS Hardware Monitors.
If you had customized the configuration files for the diaglogd exec programs (disk30_exec.cfg and sdisk_exec.cfg) you may wish to re-configure the EMS Hardware Monitors to achieve the same results.
CAUTION: Compatibility Problem with EMS-Related Products (ServiceGuard, HA Monitors, etc.)
If you install the OnlineDiag bundle (Dec 99 or later) onto a computer running older revisions of EMS-related products, these products may experience compatibility problems. Affected products include MC/ServiceGuard, ServiceGuard OPS Edition and High Availability Monitors. The only critical problems occur with the following versions:
MC/ServiceGuard A.10.10, A.11.01, A.11.03
ServiceGuard OPS Edition A.11.02, A.11.03
Support Tools and the EMS hardware monitors are not affected. For complete information, see EMS Incompatibility Problem.
Monitors are provided to support the following:
- AutoRAID Disk Array (armmon)
- Chassis Code Monitor (dm_chassis)
- CMC Monitor (cmc_em).
- Core Hardware (dm_core_hw)
- Core Hardware for Itanium (ia64_corehw)
- CPU (lpmc_em)
- Disk (disk_em)
- Disk Array FC60 (fc60mon)
- Fast Wide SCSI Disk Array (fw_disk_array)
- Fibre Channel Adapters (dm_FCMS_adapter)
- Fibre Channel Adapter Model A5158 (dm_TL_adapter)
- Fibre Channel Arbitrated Loop Hub (dm_fc_hub)
- Fibre Channel SCSI Multiplexer (dm_fc_scsi_mux)
- Fibre Channel Switch (dm_fc_sw)
- High Availability Disk Array (ha_disk_array)
- High Availability Storage System (dm_ses_enclosure)
- Kernel Resource (krmond)
- LPMC (lpmc_em) renamed to "CPU Monitor" as of the June 02 release
- Memory (dm_memory)
- Remote (RemoteMonitor)
- SCSI Card (scsi123_em)
- SCSI Cascade (scsi_cascade)
- SCSI Disk (scsi_disk)
- SCSI Tape Devices (dm_stape)
- System Status (sysstat_em)
- UPS (dm_ups)
In addition, the Peripheral Status Monitor (PSM) is provided to monitor the current status of the products supported by the above list.
For detailed information concerning which products are supported by which monitors and additional dependencies, check the "Diagnostics" section of Hewlett-Packard's online documentation web site: http://docs.hp.com/hpux/diag/ .
Several of the monitors have special requirements, such as patches or certain versions of firmware. In particular:
- The Fibre Channel Arbitrated Loop Hub Monitor and the Fibre Channel Switch Monitor require special configuration which is described in their data sheets in the "EMS Hardware Monitors User's Guide" (chapter 6). A patch is also required.
- A patch is required if your system includes an HP SureStore E Disk Array FC60. This patch is required to run the EMS hardware monitor (fc60mon) or STM tools for this device.
For a list of the current required patches, see the DIAGNOSTIC.readme file for this release.
Current monitor requirements are described in the "Supported Products" page under "EMS Hardware Monitors" at http://docs.hp.com/hpux/diag . Requirements are also listed in chapter 2 of the manual "EMS Hardware Monitors User's Guide".
Use CHART to report defects in the EMS Hardware monitors. The project name is diag.hw_mon.hpux. If you don't have access to CHART, contact an HP representative to enter a defect for you.
The EMS hardware monitors are installed as part of the OnlineDiag bundle (product number B4708AA). In addition, they utilize the EMS framework, product number B7609BA.
Note: EMS Hardware Monitors are installed as part of the STM-UUT-RUN Fileset. However, the EMS Hardware Monitors are dependent on the EMS-Core and EMS-Config products and additional filesets in the Sup-Tool-Mgr Product.
For information on the STM product, refer to the STM release notes file /usr/sbin/stm/Rel_NOTES.STM.
SD Bundle: OnlineDiag
Description: On-line Diagnostic System (Series 800/700)

SD PRODUCT: Sup-Tool-Mgr
Description: Support Tools Manager for HP-UX Systems
  SD SUB-PRODUCT: Manuals
  Description: Support Tools Manager Manual Pages
    FILESET: RELEASE_NOTES
    Description: HPUX STM Release Notes
    FILESET: STM-MAN
    Description: HPUX STM Manual Pages
  SD SUB-PRODUCT: Runtime
  Description: STM Manual Runtime
    FILESET: STM-CATALOGS
    Description: HPUX STM Shared Libraries
    FILESET: STM-SHLIBS
    Description: HPUX STM Shared Libraries
    FILESET: STM-UI-RUN
    Description: HPUX STM User Interface
    FILESET: STM-UUT-RUN
    Description: HPUX STM Unit Under Test Runtime

SD PRODUCT: EMS-Config
Description: EMS Config
    FILESET: EMS-GUI
    Description: Event Monitoring Service Graphical User Interface

SD PRODUCT: EMS-Core
Description: EMS Core Product
    FILESET: EMS-CORE
    Description: Event Monitoring Service Core Files