EMS Hardware Monitors

Frequently Asked Questions


General Questions

How can I push my EMS Hardware Monitors configuration to multiple systems?

Do the configuration on one system via monconfig (this creates the appropriate /var/stm/config/tools/monitor/*.sapcfg files).

Do additional manual edits, if any, in the other configuration files
(NOTE: The default values in these files work; it would only be if you had specific configurations you wanted to change and push out that you would need this step)

/var/stm/config/tools/monitor/*.cfg, default_*.clcfg

/var/stm/config/tools/monitor/Global.cfg

/var/stm/data/tools/monitor/

For each system where the new configuration is desired:
Copy all /var/stm/config/tools/monitor/*.cfg, default_*.clcfg, and *.sapcfg files to the new system, except any file with "predictive" in its name. Then execute /etc/opt/resmon/lbin/startcfg_client to enable the new configuration.
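The file selection described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name list_push_files and the overridable CONFIG_DIR variable are hypothetical, and the sketch only prints the files that qualify. It does not copy anything.

```shell
#!/bin/sh
# Sketch: list the monitor configuration files that would be pushed to
# another system, skipping any file with "predictive" in its name.
# CONFIG_DIR is a hypothetical override so the sketch can be tried on a
# scratch directory; the default is the directory named in this FAQ.
CONFIG_DIR="${CONFIG_DIR:-/var/stm/config/tools/monitor}"

list_push_files() {
    for f in "$CONFIG_DIR"/*.cfg "$CONFIG_DIR"/default_*.clcfg "$CONFIG_DIR"/*.sapcfg; do
        [ -f "$f" ] || continue          # an unmatched pattern stays literal; skip it
        case "${f##*/}" in
            *predictive*) continue ;;    # never push files named "predictive"
        esac
        printf '%s\n' "$f"
    done
}
```

The printed list can then be handed to whatever copy tool your site uses (the choice of tool is not specified in this FAQ), followed by running /etc/opt/resmon/lbin/startcfg_client as described above.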

NOTE: If OPC (OpenView) configuration is desired (using "opcmsg"), the initial configuration must be done on a system where OPC is installed. Otherwise, the option of "opcmsg" will not be a destination in monconfig.

NOTE: If you want to keep a copy of the old configuration, either on the system where you do the configuration or the systems where you are going to do the push, you should make copies of the files before doing any changes.




Why should I install EMS hardware monitors?

The EMS hardware monitors allow you to monitor the operation of a wide variety of hardware products and be alerted immediately if any failure or other unusual event occurs.

The EMS hardware monitors are available at no additional cost on the Diagnostic/IPR Media CD-ROM. They are automatically installed when you install the Support Tool Manager (STM) diagnostics. Once you enable hardware monitors, they require little or no maintenance.



How do I know if EMS hardware monitors are functioning?

  1. Run the hardware monitoring request manager by typing: /etc/opt/resmon/lbin/monconfig. (You must be root.)
  2. The initial screen tells you whether event monitoring is enabled.
  3. You can enable event monitoring by entering E at the monconfig prompt.
  4. You can show a list of all monitoring requests that have been created (both active and inactive) by entering S at the monconfig prompt (S = Show current monitoring requests).
  5. You can show a list of currently active monitoring requests by entering C at the monconfig prompt (C = Check Detailed Monitoring Status).

Should I configure the EMS hardware monitors?

It is recommended that you NOT change the default configuration unless you fully understand the implications of doing so. The default configuration has been designed to meet the needs of most users. Consult the "EMS Hardware Monitor User's Guide", available from the Diagnostics HOME page.

If you are responsible for large computer installations, you may find it worthwhile to learn more about the configuration of hardware monitors.


What happens if I set the monitor polling interval to a number outside the minimum or maximum values?

Setting the polling interval outside the 1 - 1440 value range will result in polling at the default value of 60. The only exception is setting the value at zero (0).

Do not set the value to zero. A zero value causes different monitors to do different things, such as logging errors or terminating polling for that monitor.
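The rule above can be restated as a short shell sketch. The function name effective_polling_interval is hypothetical and is not part of EMS. It simply encodes the documented behavior: values from 1 to 1440 are used as given, other non-zero values fall back to 60, and zero has no single defined outcome.

```shell
#!/bin/sh
# Sketch: print the polling interval a monitor would effectively use,
# following the rule stated in this FAQ.
effective_polling_interval() {
    requested=$1
    if [ "$requested" -eq 0 ]; then
        echo "undefined"      # behavior varies by monitor; do not use 0
    elif [ "$requested" -lt 1 ] || [ "$requested" -gt 1440 ]; then
        echo 60               # out of range: the default applies
    else
        echo "$requested"     # within 1 - 1440: used as requested
    fi
}
```

For example, effective_polling_interval 2000 prints 60, while effective_polling_interval 30 prints 30.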

What are the default notification methods?

For all monitors:

I just added new devices to my systems. Will they be monitored by the EMS hardware monitors?

Yes. When you add another hardware resource (device), it will inherit the monitoring in effect for other resources (devices) of the same type. If you add a new class of supported hardware, the associated monitor will be launched when the system is restarted and monitoring will begin.

For hardware monitoring to recognize new devices, the new devices must be properly added, so that they are recognized by the kernel (ioscan must see the new devices). Devices must also be configured properly.



Where can I get more information?

The central point for information on EMS hardware monitors is the Web site on which this page is located: Diagnostics HOME. In particular, see the EMS Hardware Monitor User's Guide. The PDF file for this manual is also located in the /Documentation directory of the Diagnostic/IPR CD-ROM.


What is the difference between EMS hardware monitors and EMS high availability (HA) monitors?

The monitors described in this and related documents are hardware monitors. The hardware monitors: In contrast the EMS high availability (HA) monitors:

Is the Event Monitoring Service (EMS) Y2K-compliant?

Yes, the B7609BA/BJ Version A.03.00 Event Monitoring Services product is Y2K-compliant.

What is the relationship between EMS hardware monitors and Predictive Support?

EMS is not replacing Predictive Support -- EMS works with Predictive. You need to have EMS running for Predictive to work.

For example, the EMS hardware monitors include the Disk Monitor (disk_em). This monitor lets you know when your disk has events that might cause problems and downtime. These "event" messages are sent to text files, such as console, syslog, or the default /var/opt/resmon/log/event.log.

In the case of Predictive, the same message goes to a log at /var/opt/pred/emslog. Predictive uses the emsscan program to read this data. Based on the message and the severity, Predictive sends a message to HP to come out and service the disk.

Predictive Support is the "full service" story. The "economy class" solution is to load the EMS hardware monitors and manually read the messages in the event.log and make your own decisions about responding to the events.


How can I verify that the EMS hardware monitors are working?

You can verify the operation of some EMS Hardware Monitors by executing the command:
/opt/resmon/bin/send_test_event -v MONITOR_NAME
where MONITOR_NAME is the name of the executable file for the monitor (for example: dm_memory, disk_em, etc.)

For more information, see Verifying EMS Hardware Monitors.
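To check several monitors in one pass, the command above can be wrapped in a loop. The following is a minimal sketch, not an HP-supplied script. The function name send_test_events and the SEND_TEST_EVENT override variable are hypothetical, and the sketch assumes that send_test_event returns a non-zero exit status on failure, which this FAQ does not state.

```shell
#!/bin/sh
# Sketch: run send_test_event -v for each monitor name given as an argument.
# SEND_TEST_EVENT is a hypothetical override; the default is the path
# given in this FAQ.
SEND_TEST_EVENT="${SEND_TEST_EVENT:-/opt/resmon/bin/send_test_event}"

send_test_events() {
    failed=0
    for monitor in "$@"; do
        # ASSUMPTION: a non-zero exit status means the test event was not sent.
        if "$SEND_TEST_EVENT" -v "$monitor"; then
            echo "test event sent: $monitor"
        else
            echo "test event FAILED: $monitor" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

A call such as send_test_events dm_memory disk_em would exercise the two monitors named in the example above. Remember that only some monitors support this check.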

Where can I find the detailed information for a specific event number?

HP maintains a list of events reported by each monitor at:

http://www.docs.hp.com/hpux/onlinedocs/diag/ems/eme_summ.htm

How can I change the event numbers for a monitor?

The event numbering is defined by the monitor developer. You should not change these numbers in the configuration files.

How can I disable an EMS hardware monitor for a single instance?

A user asks: "I have a disk that is reporting an error. It will be several days until we can replace this disk. I would like to stop the reporting of the error message to the logs until we get the disk replaced. Is this possible?"

The startmon_client program now reads the file /var/stm/data/tools/monitor/disabled_instances. This file is read before the *.sapcfg files are read, so the monitor is not started for the specific instances listed in the disabled_instances file.

The disabled_instances file is a text file that lists fully qualified instances, one instance per line. Wildcards can be used in the instance names to specify more than one instance. For example, /storage/events/disks/default/* could be used to specify all the instances associated with the default disk resource names.

For those instances listed in the disabled_instances file, no monitoring requests will show up in the list displayed by the monconfig "C)heck monitoring" command.

NOTE: This does not mean that the monitor will stop polling the device. It just means that events will not be forwarded to the log files based on information in the *.sapcfg files.

To use the disabled_instances file, perform the following:
  1. Add/delete/modify instances in the disabled_instances file
  2. Run monconfig
  3. Select the "E)nable Monitoring" command
  4. Wait for monitoring to be re-enabled
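The first step above (editing the file) can be sketched as a helper that adds one instance per line and avoids duplicate entries. The function name disable_instance and the DISABLED_FILE override are hypothetical, and the sketch assumes a grep that supports the -F and -x options. The remaining steps (running monconfig and selecting "E)nable Monitoring") still have to be done by hand.

```shell
#!/bin/sh
# Sketch: append a fully qualified instance (or wildcard pattern) to the
# disabled_instances file, one instance per line, unless already listed.
# DISABLED_FILE is a hypothetical override; the default is the path
# given in this FAQ.
DISABLED_FILE="${DISABLED_FILE:-/var/stm/data/tools/monitor/disabled_instances}"

disable_instance() {
    instance=$1
    # -F: treat the instance as a fixed string, so "*" is not a regex;
    # -x: match whole lines only.
    if [ -f "$DISABLED_FILE" ] && grep -Fxq -- "$instance" "$DISABLED_FILE"; then
        echo "already listed: $instance"
    else
        printf '%s\n' "$instance" >> "$DISABLED_FILE"
        echo "added: $instance"
    fi
}
```

For example, disable_instance '/storage/events/disks/default/*' adds the wildcard entry used in the example above. Quote the argument so the shell does not expand the asterisk.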

Problems

Difficulty installing EMS hardware monitors.


One monitor is not working or is not working as expected.

Several of the monitors have special requirements, such as patches or certain versions of firmware. Check the current requirements for the monitor, as described in EMS Hardware Monitor Supported Products, and verify that you have satisfied them.

Requirements for specific monitors are also listed in chapter 2 of the manual "EMS Hardware Monitors User's Guide".



The hardware monitors log errors after monitoring is disabled.

When monitoring is disabled using the monconfig program, each monitor will log an error in the /var/opt/resmon/log/api.log file similar to the following:

-------------------------------Start Event----------------------------
User event occurred at Tue Sep  1 10:27:58 1998
Process ID: 10723 (/usr/sbin/stm/uut/bin/tools/.../disk_em)   Log Level: Error
Tool is exiting due to receipt of a SIGINT signal.
-------------------------------End Event------------------------------
This is not an error as the monitors are stopped with a SIGINT signal during the disable process.


"Status" monitor requests are lost after EMS Hardware monitors are updated.

When updating the EMS Hardware monitors, any monitor requests for the Hardware status monitor added through the EMS GUI or ServiceGuard will be lost. The resources associated with the Hardware status monitor have the names listed in the /etc/opt/resmon/monitor/classes.config file, with the "events" string replaced with a "status" string.

For the EMS GUI, these requests must be re-entered after the update.

For ServiceGuard, packages dependent on these resources must be stopped or the dependency on the resource must be removed prior to the update. Once the update is complete, the package can be restarted and/or the dependency re-created.



Fibre Channel SCSI MUX Monitor (dm_fc_scsi_mux) does not monitor the FC SCSI MUX.

The dm_fc_scsi_mux monitor cannot monitor a Fibre Channel (FC) SCSI MUX (HP A3308) which has no functional SCSI cards.

The dm_fc_scsi_mux monitor will not monitor a Fibre Channel SCSI MUX (HP A3308), if ioscan does not find paths to the SCSI cards in the FC SCSI MUX. The paths for dm_fc_scsi_mux must have the driver fcpmux and be in the CLAIMED state. When no such paths appear, the path to the FC SCSI MUX may still appear using the sctl driver, but the monitor cannot use it to monitor the device.

For example, if ioscan shows a path like this:

      ext_bus 12 8/8.8.0.0.2 fcpmux CLAIMED INTERFACE HP HPA3308 FCP-SCSI MUX Interface

the dm_fc_scsi_mux monitor can monitor the FC SCSI MUX. Note the fcpmux driver and the CLAIMED state.

If ioscan shows only a path like this:

      ext_bus 12 8/8.8.0.0.2 fcpmux NO_HW INTERFACE HP HPA3308 FCP-SCSI MUX Interface
      ctl 5 8/8.8.0.255.0.0.0 sctl CLAIMED DEVICE HP HPA3308

the dm_fc_scsi_mux monitor cannot monitor the FC SCSI MUX. Note the fcpmux driver and the NO_HW state.

Running ioscan -f will affect the ability of the monitor to find failed or newly working SCSI cards, and thus affects its ability to monitor the FC SCSI MUX.

If a path exists before the SCSI card fails, dm_fc_scsi_mux will continue to monitor the FC SCSI MUX. If an ioscan is performed, and all the SCSI cards have failed, the monitor will no longer be able to monitor the FC SCSI MUX as all the hardware paths are changed to a state of NO_HW.

If a path existed to a SCSI card at the time the system was booted, dm_fc_scsi_mux can still monitor the FC SCSI MUX after all of the SCSI cards fail. If an ioscan is performed, the monitor will no longer be able to monitor the FC SCSI MUX, as the hardware paths are changed to a state of NO_HW.

If no path existed to a SCSI card in an FC SCSI MUX at the time the system booted, dm_fc_scsi_mux cannot monitor the FC SCSI MUX after a SCSI card starts working unless an ioscan is performed, which adds a hardware path for the fcpmux driver in the CLAIMED state.
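The condition described in this answer (at least one path with the fcpmux driver in the CLAIMED state) can be checked mechanically. The sketch below is an illustration only. The function name has_claimed_fcpmux is hypothetical, and it assumes the column layout shown in the ioscan examples above, where the driver is the fourth field and the software state is the fifth.

```shell
#!/bin/sh
# Sketch: read ioscan -f style output on standard input and succeed
# (exit status 0) only if some path uses the fcpmux driver and is CLAIMED.
# ASSUMPTION: field 4 is the driver and field 5 is the S/W state, as in
# the example lines in this FAQ.
has_claimed_fcpmux() {
    awk '$4 == "fcpmux" && $5 == "CLAIMED" { found = 1 } END { exit !found }'
}
```

Because performing an ioscan can itself change path states, as explained above, it may be preferable to feed the function from ioscan -kf, which reports the kernel's current view rather than scanning the hardware, for example: ioscan -kf | has_claimed_fcpmux && echo "dm_fc_scsi_mux can monitor the MUX".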



FC-AL hub monitor not functioning.

Unlike the other EMS Hardware Monitors, the FC-AL hub monitor requires some initial configuration before it will function. Because an FC-AL hub is not part of the host's configuration, the host cannot detect any hubs during start-up. You must tell the hub monitor which hubs you want it to monitor. This is done by defining two settings (HUB_COUNT and HUB_X_IP_ADDRESS) in the hub monitor configuration file, /var/stm/config/tools/monitor/dm_fc_hub.cfg. Refer to the "EMS Hardware Monitors User's Guide" for more detailed information.


FC-AL hub monitor exits with a SIGABRT signal (6).

The FC-AL hub monitor requires two C++ runtime patches in order to run properly, as indicated in EMS Hardware Monitor Supported Products and in Chapter 2 of the "EMS Hardware Monitors User's Guide."

If these patches are not installed, the FC-AL hub monitor will exit with a SIGABRT signal (6). ALL the EMS Hardware Monitors are initiated each time monitoring is enabled and at each reboot, regardless of whether the hardware exists on the system. Without the patches, the FC-AL hub monitor will therefore abort and create a core file at each reboot and each time monitoring is enabled. This will cause errors in the monconfig program when it attempts to start the monitor.

The fix is to install the C++ patches.

Missing configuration file for Fibre-Channel Switch monitor.

Question: I'm trying to monitor a Fibre-Channel Switch using the EMS hardware monitors. The documentation states that I must edit the dm_fc_sw.cfg file. When I look in the /var/stm/config/tools/monitor directory, there is no such file.

Answer: For IPR 9904, the switch monitor config files (dm_fc_sw.cfg, dm_fc_sw.psmcfg, and dm_fc_sw.sapcfg) were not installed in the proper directory (/var/stm/config/tools/monitor). (The problem was fixed in IPR 9906.)

To fix the problem:

  1. Copy the file dm_fc_sw.cfg from /usr/newconfig/var/stm/config/tools/monitor to the directory /var/stm/config/tools/monitor.
  2. Also, copy the files dm_fc_sw.psmcfg and dm_fc_sw.sapcfg if missing from the directory /var/stm/config/tools/monitor.
  3. Edit the files as described in the documentation.
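The copy steps above can be sketched as a function that restores only the files that are missing, so that a file you have already edited is never overwritten. The function name restore_switch_config and the two override variables are hypothetical; the default directories are the ones named in the steps.

```shell
#!/bin/sh
# Sketch: copy the switch monitor configuration files from the newconfig
# area into the monitor configuration directory, but only if missing.
# NEWCONFIG_DIR and CONFIG_DIR are hypothetical overrides for trying the
# sketch on scratch directories.
NEWCONFIG_DIR="${NEWCONFIG_DIR:-/usr/newconfig/var/stm/config/tools/monitor}"
CONFIG_DIR="${CONFIG_DIR:-/var/stm/config/tools/monitor}"

restore_switch_config() {
    for name in dm_fc_sw.cfg dm_fc_sw.psmcfg dm_fc_sw.sapcfg; do
        if [ -f "$CONFIG_DIR/$name" ]; then
            echo "present, left alone: $name"
        elif cp -p "$NEWCONFIG_DIR/$name" "$CONFIG_DIR/$name"; then
            echo "copied: $name"
        else
            echo "could not copy: $name" >&2
        fi
    done
}
```

After running it, edit the files as described in the documentation.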

Compatibility Problem with ServiceGuard and LockManager

From the February 1999 release (IPR 9902) onwards, the Support Tools (diagnostics) include EMS hardware monitors and EMS version A.03.00 on both HP-UX 10.20 and HP-UX 11.00.

This version of EMS is incompatible with ServiceGuard A.10.10, which includes version A.01.00 of EMS. It is also incompatible with ServiceGuard and LockManager versions A.11.01, A.11.02 and A.11.03, which include version A.02.00 of EMS.

If you run these releases of ServiceGuard or LockManager, you must upgrade them before installing the Support Tools on the February 1999 (IPR 9902) or newer releases.

On HP-UX 10.20 you should upgrade ServiceGuard to A.10.11 and on HP-UX 11.00 you should upgrade ServiceGuard or LockManager to release A.11.04 or newer.

If you do not upgrade, EMS will silently be upgraded to version A.03.00 when you install the diagnostics; ServiceGuard and LockManager will fail to work if you have any monitored resources. In this case, if you execute swverify or other SD-UX commands, you will see error messages like:

     The corequisite
     "EMS-Core.EMS-CORE,r=A.01.00,a=HP-UX_B.10.20_800,v=HP" for
     fileset "Cluster-Monitor.CM-CORE,l=/,r=A.10.10" cannot be
     successfully resolved.
If you have already loaded the diagnostics and therefore upgraded to EMS A.03.00 and are still running an incompatible release of ServiceGuard or LockManager, you should now upgrade to get your system into a supported and working state.

There is no functional difference between ServiceGuard A.10.10 and ServiceGuard A.10.11, other than support for the new version of EMS and bug fixes. Functional differences for the 11.00 releases of ServiceGuard and LockManager can be found in the release notes.

Older versions of ServiceGuard and LockManager, for example A.10.06 and A.10.07.01, do not provide any support for EMS, and so are not affected by this issue.

Problem with FC60 Monitor in Sept 1999 Release

On the September 1999 release (IPR 9909), the FC60 hardware monitor (fc60mon) does not consistently report problems with the FC60 array. This was also the version contained on the CD shipped with the array. This problem was fixed on the December 1999 release (IPR 9912).

To check the version of fc60mon you are running, execute the following command:

# what /usr/sbin/stm/uut/bin/tools/monitor/fc60mon
You are running a bad version if the command returns a version number of "A.01.03", for example:
fc60.mon: A.01.03 Tue Jul 20 19:00:41 MDT 1999
To fix the problem, update the support tools with the December 1999 (IPR 9912) release or later. This will cause version A.01.04 of fc60mon to be loaded.
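The version check above can be scripted by piping the output of the what command into a small filter. This is a sketch only. The function name fc60mon_is_affected is hypothetical, and it assumes the version string appears as a separate whitespace-delimited word, as in the sample output above.

```shell
#!/bin/sh
# Sketch: read the output of the what command on standard input and
# succeed (exit status 0) if it reports the affected version A.01.03.
fc60mon_is_affected() {
    awk '{ for (i = 1; i <= NF; i++) if ($i == "A.01.03") bad = 1 } END { exit !bad }'
}
```

Pipe the output of the what command shown above into fc60mon_is_affected; an exit status of 0 means the affected version is installed and the support tools should be updated.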

Devices not supported in SCSI Tape Monitor

According to the online man page for the SCSI tape monitor (dm_stape), support was added in the June 99 release (IPR 9906) for the following devices:
DLT7000   8-slot Library (A5501A)
DLT8000  20-slot Library (A5583A, A5584A)
DLT8000  40-slot Library (A5585A, A5586A)
DLT8000  60-slot Library (A5587A, A5588A)
DLT8000 700-slot Library (A5597A)
This information was incorrect. In the June 99, Sept 99, and Dec 99 releases, these devices were not supported by the monitor. Support for these devices was added to the SCSI tape monitor with the March 00 release (IPR 0003).

"Monitor restart" messages are sometimes generated for devices on the system. Is there something wrong?

No, the messages do not represent any problem with the hardware monitors if they occur during system boot and initial monitor startup. They are just a side effect of the fact that the hardware monitors are built on top of the already existing EMS platform.

If "Monitor restart" messages occur AFTER system boot and initial monitor startup, then there may be a problem: these messages would occur if a monitor ran for a while and then died, and monitors should not die. If this happens, you can investigate the logged messages in /etc/opt/resmon/log/api.log.

A detailed technical explanation follows.

HP first created the Event Monitoring Service (EMS) framework to support the high availability (HA) monitors. These monitor disk resources, cluster resources, network resources and system resources. They are designed for a high availability environment, and are available at additional cost.

Later, the hardware monitors were developed. These monitor hardware resources such as I/O devices, interface cards, and memory. Like the HA monitors, the hardware monitors use the EMS framework.

The high availability (HA) monitors use "p_client" to restart monitors when they exit unexpectedly. The hardware monitors use a different method, the "startmon_client" program, to restart monitors.

"Monitor restart" messages are output by p_client when it restarts a monitor. The p_client program is launched by init, and is also designed to restart the high availability (HA) monitors when the system is rebooted. Monitor restart messages are sent to each "target" (the notification methods that have been configured for the system).

When hardware monitoring was added to EMS, a new method of starting monitors was developed. This new method (the program startmon_client) discovers the devices which have been added to or removed from the system, and then modifies the monitoring requests appropriately. The program is run by the diagnostic daemon "diagmond" after it has finished mapping the system at system boot, and when hardware changes are discovered.

Even though p_client is not needed to start the hardware monitors, it still attempts to restart all monitors for which monitoring requests exist. If p_client is able to successfully restart a monitor and register the existing monitoring requests, it generates a restart message. If, on the other hand, a monitor replies that it is not ready, then p_client will not generate a message but will instead try again every couple of minutes to restart the monitor. Some monitors will respond that they are not ready until diagmond completes mapping the system, or until they otherwise obtain information on the resources which they can monitor. After a monitor becomes ready, if startmon_client restarts the monitor and registers the requests before the p_client, then p_client will not generate a restart message because on its next cycle, it will not find any unregistered monitoring requests for the monitor. If, on the other hand, p_client "slips in" between the monitor becoming ready and startmon_client getting to it, the restart message will be generated.

Therefore: a restart message may or may not get generated on reboot, depending on whether a monitor returns a "not ready" status, how long it takes to map the system, and whether or not (based on timing windows) it is p_client or startmon_client which actually starts the monitor.

This unpredictability is a bit disconcerting, but it does not affect the operation of the monitors.

The only time when the "Monitor restart" messages indicate a problem is if they occur AFTER system boot and initial monitor startup. These messages would occur if a monitor ran for a while and then died. In this case, p_client restarts the monitor that has died and issues the "Monitor restart" message.

Unhelpful event message from the SCSI Tape Devices Monitor (dm_stape): event # 599 (Unrecognized TapeAlert event)

The SCSI Tape Devices Monitor reports all Tape Alert flags from tape drives as Event # 599 ("Unrecognized TapeAlert" event).

For example, a "Media Life" warning flag is now reported as Event 599 ("Unrecognized TapeAlert" event). For this warning flag, the message should instead indicate that "The tape drive is trying to indicate that the media loaded in it is wearing out and needs to be replaced. The recommended action is to back up the data onto another tape and then discard this one."

This problem will be fixed in an upcoming release, so that Tape Alert flags will be properly decoded and reported.

This problem does NOT occur with Tape Alert flags from Tape libraries, autochangers, and autoloaders; flags for these devices are decoded without any problem. The problem only occurs with Tape drives.

The problem occurs in the December 1999 and March 2000 releases (IPR 9912 and 0003), and probably occurs in previous releases as well.

I had an error on my SCSI bus that was not captured by the EMS hardware monitors. Shouldn't the scsi123_em monitor catch these errors?

If your system uses scsi1, scsi2, or scsi3 drivers, then the scsi123_em monitor should catch the SCSI bus errors.

However, if your system only uses c700 or c720 drivers (WSIO I/O subsystem drivers), then no monitor can catch these errors. These drivers do not log errors through the diagnostic logging mechanism.

In fact, none of the WSIO drivers log errors. So, since the scsi123 monitor can only generate events for errors that are logged by the driver, it can do nothing for devices that use c700 or c720 drivers.

To see what drivers are used by your system, you can execute the command:

ioscan -kf
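The driver check described in this answer can be sketched as a filter over the ioscan output. The function name classify_scsi_drivers is hypothetical, and the sketch assumes the driver name is the fourth field of each ioscan line. It only reports on the five drivers named above and ignores all others.

```shell
#!/bin/sh
# Sketch: read ioscan -kf output on standard input and print, for each of
# the SCSI drivers named in this FAQ that appears, whether its errors can
# reach the scsi123_em monitor.
# ASSUMPTION: field 4 of each ioscan line is the driver name.
classify_scsi_drivers() {
    awk '
        $4 == "scsi1" || $4 == "scsi2" || $4 == "scsi3" { seen[$4] = "errors logged, covered by scsi123_em" }
        $4 == "c700"  || $4 == "c720"                   { seen[$4] = "errors not logged, no monitor coverage" }
        END { for (d in seen) print d ": " seen[d] }
    ' | sort
}
```

Usage would be: ioscan -kf | classify_scsi_drivers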

The SCSI Tape Monitor (dm_stape) logged an event that I cannot find in the Event Description web page.

There is a problem with dm_stape in IPR 0003 (March 2000) and earlier versions of the diagnostics. The problem was fixed in the June 2000 release (IPR 0006).

The problem affects 37 events as follows:

Events 701-716 all report as event 700
Events 801-805 all report as event 800
Events 901-903 all report as event 900
Events 1001-1011 all report as event 1000
Event 1201 reports as event 1200
Event 1301 reports as event 1300
The recommended solution is for systems running the IPR 0003 (March 2000) or earlier versions of diagnostics to update to the IPR 0009 (September 2000) or later version of diagnostics.

Alternately, the customer can disable the dm_stape monitor altogether.
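When reading logs from an affected system, it can help to see the mapping above written out. The sketch below takes a real event number and prints the number that an affected dm_stape would report for it. The function name reported_stape_event is hypothetical, and the ranges are copied directly from the list in this answer.

```shell
#!/bin/sh
# Sketch: print the event number that dm_stape in IPR 0003 (March 2000)
# and earlier reports for a given real event number. Events outside the
# affected ranges are printed unchanged.
reported_stape_event() {
    n=$1
    if   [ "$n" -ge 701 ]  && [ "$n" -le 716 ];  then echo 700
    elif [ "$n" -ge 801 ]  && [ "$n" -le 805 ];  then echo 800
    elif [ "$n" -ge 901 ]  && [ "$n" -le 903 ];  then echo 900
    elif [ "$n" -ge 1001 ] && [ "$n" -le 1011 ]; then echo 1000
    elif [ "$n" -eq 1201 ]; then echo 1200
    elif [ "$n" -eq 1301 ]; then echo 1300
    else echo "$n"
    fi
}
```

Read in the other direction, a logged event 700 on an affected system may actually be any of events 700 through 716, so consult all of those entries in the event descriptions.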

The UPS Monitor (dm_ups) is not running after computer boot-up

Known problem (JAGad37359). The UPS hardware monitor (dm_ups) may core dump with a SIGABRT signal during HP-UX boot-up. A core file may be left in /etc/opt/resmon. Afterwards dm_ups is not running and EMS does not provide any hardware monitoring for UPSs on the system.

The problem can occur on the Dec 00 and previous diagnostic releases (HP-UX 10.20, 11.00, and 11i). A fix for this problem is planned for an upcoming release of the diagnostics.

The workaround for this problem is to verify whether dm_ups is running with the command:

ps -ef | grep dm_ups

If dm_ups is found, then it is running and there is no problem. Otherwise, start dm_ups manually.

(NOTE: dm_ups should be running only if the computer system contains a UPS.)
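The check above can be scripted so that the grep process itself is never mistaken for dm_ups. This is a minimal sketch; the function name dm_ups_running is hypothetical. It relies on the common bracket trick: the pattern [d]m_ups matches the text "dm_ups" but not its own command line.

```shell
#!/bin/sh
# Sketch: read ps -ef output on standard input and succeed (exit status 0)
# if a line mentioning dm_ups is present.
# The pattern [d]m_ups matches "dm_ups" but not the literal text
# "[d]m_ups" that appears in the process list for grep itself.
dm_ups_running() {
    grep -q '[d]m_ups'
}
```

For example: ps -ef | dm_ups_running || echo "dm_ups is not running". Like the plain grep above, this matches any process line containing the string dm_ups, so treat the result as a quick indication only.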

To start dm_ups manually:

  1. Log on as root and run monconfig:
    /etc/opt/resmon/lbin/monconfig
    
  2. From the prompt enter "e" to enable monitoring. You will see the messages:
    This may take a while...
    
    Waiting for changes in monitoring requests or in hardware configuration
    to take effect...
    
  3. After you see the message "EVENT MONITORING IS CURRENTLY ENABLED", you can leave monconfig with the "q" command.

The disk monitor (disk_em) does not seem to be monitoring Hitachi XP256/XP512 drives and EMC Symmetrix drives

The disk monitor (disk_em) does not support Hitachi XP256/XP512 drives and EMC Symmetrix drives. These drives have their own monitoring (not EMS).

The UPS Monitor generates Event 42 ("The fifo pipe connection to ups_mond does not exist ") and does not run.

In some cases, the UPS monitor (dm_ups) will not function and will instead generate event 42. As of the June 2001 release, the cause/action statement tells what to do:
Probable Cause / Recommended Action:

The monitor was unable to locate the fifo pipe that should have been
created by ups_mond.  Therefore, information about the ups cannot be sent
to the monitor.

You need version (80.1.2.3) of ups_mond or greater.  To update your
system with the correct version of ups_mond, install one of the
following patches:
HPUX 10.20/s800 : PHCO_23830
HPUX 11.00      : PHCO_23831
HPUX 11.11      : PHCO_23832
To fix the problem, load the indicated patch.

This problem will affect most systems with a UPS when the June 2001 diagnostics are installed. The only systems not affected will be those which are being updated from certain versions of the diagnostics (September 2000 through March 2001) and which do not have patch PHCO_19031 (HP-UX 10.20) or PHCO_19040 (HP-UX 11.00) installed.

The problem also occurs on releases before June 2001 if certain patches are installed on the system after the diagnostics have been installed. The patches that cause the problem are PHCO_19031 (HP-UX 10.20) and PHCO_19040 (HP-UX 11.00). These problem patches overwrite the version of the UPS daemon, ups_mond, that was installed as part of the diagnostic product and is required for the UPS monitor to function. These problem patches have been contained in the General Release patch bundles, but not in the Hardware Critical patch bundles. Only a few instances of this problem have been reported.

On blade servers, the monitors don't tell the chassis and slot number of the server blade under test.

A blade server is a chassis which contains several server blades and other blades. A server blade is a "computer on a card" that runs its own OS. Thus, the server blade is the computer system.

When STM tools and hardware event monitors refer to the computer system, they will give the HP-UX hostname of the server blade. For example, when hardware event monitors report events, they name the system on which the event occurs in a format such as:

hpdst313 sent Event Monitor notification information
where "hpdst313" is the HP-UX hostname of the blade reporting the event. As of May, 2002, the support tools do not directly name the slot and chassis in which the server blade is located.

To make it easier to find the location of a particular blade, given its hostname, you can:




URL: http://docs.hp.com/hpux/onlinedocs/diag/ems/ems_faq.htm
Last updated: Tue Apr 30 11:15:18 PDT 2002