Configuring OPS Clusters with ServiceGuard OPS Edition > Chapter 8 Troubleshooting Your Cluster

Solving Problems

Problems with ServiceGuard fall into several common categories:

  • ServiceGuard Command Hangs

  • Cluster Re-formations

  • System Administration Errors

  • Package Movement Errors

  • Node and Network Failures

  • Quorum Server Problems

System administration errors and package movement errors result from incorrect configuration of ServiceGuard. Node and network failures are "normal" failures to which ServiceGuard clusters are designed to react in order to ensure the availability of your applications.

ServiceGuard Command Hangs

Many ServiceGuard commands, including cmviewcl, depend on name resolution services to look up the addresses of cluster nodes. When name services are not available (for example, if a name server is down), ServiceGuard commands may hang or return a network-related error message. If this happens, use the nslookup command on each cluster node to check whether name resolution is correct. Example:

# nslookup ftsys9

Name Server:  server1.cup.hp.com
Address: 15.13.168.63
Name: ftsys9.cup.hp.com
Address: 15.13.172.229

If the output of this command does not include the correct IP address of the node, then check your name resolution services further.
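The check above can be scripted so that it is easy to repeat on every node. The following is a minimal sketch, not part of the ServiceGuard tooling: the function names resolved_address and check_node are hypothetical, and the parsing assumes nslookup output laid out as in the example, with the node's address on the Address line that follows the Name line.

```shell
# resolved_address: read nslookup output on stdin and print the address
# that follows the "Name:" line (the answer for the node, not the
# address of the name server itself).
resolved_address() {
    awk '/^Name:/ { seen = 1; next }
         seen && /^Address:/ { print $2; exit }'
}

# check_node NODE EXPECTED_IP: hypothetical helper that compares the
# resolved address with the address you expect for the node.
check_node() {
    actual=$(nslookup "$1" 2>/dev/null | resolved_address)
    if [ "$actual" = "$2" ]; then
        echo "$1: OK ($actual)"
    else
        echo "$1: MISMATCH (expected $2, got ${actual:-nothing})"
        return 1
    fi
}
```

With the values from the example, you would run check_node ftsys9 15.13.172.229 on each cluster node.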

Cluster Re-formations

Cluster re-formations may occur from time to time due to current cluster conditions. Some of the causes are as follows:

  • local switch on an Ethernet LAN if the switch takes longer than the cluster NODE_TIMEOUT value. To prevent this problem, you can increase the cluster NODE_TIMEOUT value, or you can use a different LAN type.

  • excessive network traffic on heartbeat LANs. To prevent this, you can use dedicated heartbeat LANs, or LANs with less traffic on them.

  • an overloaded system, with too much total I/O and network traffic.

  • an improperly configured network, for example, one with a very large routing table.

In these cases, applications continue running, though they might experience a small performance impact during cluster re-formation.

System Administration Errors

There are a number of errors you can make when configuring ServiceGuard OPS Edition that will not show up when you start the cluster. Your cluster can be running, and everything appears to be fine, until there is a hardware or software failure and control of your packages is not transferred to another node as you would have expected.

These situations are caused specifically by errors in the cluster configuration file and package configuration scripts. Examples of these errors include:

  • Volume groups not defined on adoptive node.

  • Mount point does not exist on adoptive node.

  • Network errors on adoptive node (configuration errors).

  • User information not correct on adoptive node.

You can use the following commands to check the status of your disks:

  • bdf - to see if your package's volume group is mounted.

  • vgdisplay -v - to see if all volumes are present.

  • lvdisplay -v - to see if the mirrors are synchronized.

  • strings /etc/lvmtab - to ensure that the configuration is correct.

  • ioscan -fnC disk - to see physical disks.

  • diskinfo -v /dev/rdsk/cxtydz - to display information about a disk.

  • lssf /dev/dsk/*d0 - to check LV and paths.

  • vxdg list - to list disk groups.

  • vxprint - to show disk group details.
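To capture the output of several of these commands in one pass, for example before and after a failover test, a small wrapper can help. This is only a sketch: run_check and disk_status_report are hypothetical names, and the wrapper skips any command that is not installed on the node (for example vxdg and vxprint where no disk groups are in use).

```shell
# run_check CMD [ARGS...]: print a header, then run the command if it is
# installed; otherwise note that it was skipped. Hypothetical helper.
run_check() {
    echo "=== $* ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "skipped: $1 not found"
    fi
}

# disk_status_report: run a subset of the status commands listed above.
# Commands that need a specific device or logical volume argument
# (lvdisplay, diskinfo, lssf) are left out because the path depends on
# your configuration.
disk_status_report() {
    run_check bdf
    run_check vgdisplay -v
    run_check strings /etc/lvmtab
    run_check ioscan -fnC disk
    run_check vxdg list
    run_check vxprint
}
```

Redirecting the report to a file on each node, for example disk_status_report > /tmp/disk_status.txt, makes it easier to compare the primary and adoptive nodes side by side.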

Package Control Script Hangs or Failures

When a RUN_SCRIPT_TIMEOUT or HALT_SCRIPT_TIMEOUT value is set and the control script hangs long enough to exceed that timeout, MC/ServiceGuard kills the script and marks the package "Halted." Similarly, when a package control script fails, MC/ServiceGuard kills the script and marks the package "Halted." In both cases, the following also take place:

  • Control of the package will not be transferred.

  • The run or halt instructions may not run to completion.

  • Global switching will be disabled.

  • The current node will be disabled from running the package.

Following such a failure, since the control script is terminated, some of the package's resources may be left activated. Specifically:

  • Volume groups may be left active.

  • File systems may still be mounted.

  • IP addresses may still be installed.

  • Services may still be running.

In this kind of situation, MC/ServiceGuard will not restart the package without manual intervention. You must clean up manually before restarting the package. Use the following steps as guidelines:

  1. Perform application-specific cleanup. Any application-specific actions the control script might have taken should be undone so that the package can start successfully on an alternate node. This might include shutting down application processes, removing lock files, and removing temporary files.

  2. Ensure that package IP addresses are removed from the system. This step is accomplished via the cmmodnet(1m) command. First determine which package IP addresses are installed by inspecting the output resulting from running the command netstat -in. If any of the IP addresses specified in the package control script appear in the netstat output under the "Address" column, use cmmodnet to remove them:

    # cmmodnet -r -i <ip-address> <subnet>

    where <ip-address> is the address in the "Address" column and <subnet> is the corresponding entry in the "Network" column.

  3. Ensure that package volume groups are deactivated. First unmount any package logical volumes that are being used for file systems. To find them, inspect the output of the command bdf -l. If any package logical volumes, as specified by the LV[] array variables in the package control script, appear under the "Filesystem" column, use fuser to kill any processes still using them, then use umount to unmount them:

    # fuser -ku <logical-volume>
    # umount <logical-volume>

    Next, deactivate the package volume groups. These are specified by the VG[] array entries in the package control script.

    # vgchange -a n <volume-group>

  4. Finally, re-enable the package for switching.

    # cmmodpkg -e <package-name>

    If, after cleaning up the node on which the timeout occurred, you want that node to remain an alternate for running the package, remember to re-enable the package to run on the node:

    # cmmodpkg -e -n <node-name> <package-name>
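Steps 2 and 3 both come down to comparing command output with the values in the package control script. The sketch below shows one way to do that comparison; it is illustrative rather than a supported tool. The function names are hypothetical, the functions only print the cleanup commands for review instead of running them, and the netstat parsing relies solely on the "Network" and "Address" column names mentioned in step 2.

```shell
# stale_ip_commands IP...: read `netstat -in` output on stdin and print a
# cmmodnet command for each listed package IP that is still installed.
# Column positions are taken from the header line.
stale_ip_commands() {
    awk -v ips="$*" '
        BEGIN { n = split(ips, list, " "); for (i = 1; i <= n; i++) want[list[i]] = 1 }
        net == 0 {
            # Still looking for the header: record the column numbers.
            for (i = 1; i <= NF; i++) {
                if ($i == "Network") net = i
                if ($i == "Address") addr = i
            }
            next
        }
        addr && ($addr in want) { print "cmmodnet -r -i " $addr " " $net }
    '
}

# mounted_lv_commands LV...: read `bdf -l` output on stdin and print the
# fuser and umount commands for each listed logical volume that appears
# in the first ("Filesystem") column.
mounted_lv_commands() {
    awk -v lvs="$*" '
        BEGIN { n = split(lvs, list, " "); for (i = 1; i <= n; i++) want[list[i]] = 1 }
        ($1 in want) { print "fuser -ku " $1; print "umount " $1 }
    '
}
```

For example, netstat -in | stale_ip_commands followed by the package's IP addresses, and bdf -l | mounted_lv_commands followed by the entries from the LV[] array. After reviewing and running the printed commands, continue with vgchange and cmmodpkg as described in steps 3 and 4.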

The default ServiceGuard control scripts are designed to take the straightforward steps needed to get an application running or stopped. If the package administrator specifies a time limit within which these steps need to occur and that limit is subsequently exceeded for any reason, ServiceGuard takes the conservative approach that the control script logic must either be hung or defective in some way. At that point the control script cannot be trusted to perform cleanup actions correctly, thus the script is terminated and the package administrator is given the opportunity to assess what cleanup steps must be taken.

If you want the package to switch automatically in the event of a control script timeout, set the NODE_FAIL_FAST_ENABLED parameter to YES. (If you are using SAM, set Package Failfast to Enabled.) ServiceGuard will then cause a TOC (Transfer of Control) on the node where the control script timed out, which effectively cleans up any side effects of the package's run or halt attempt. The package is then automatically restarted on any available alternate node for which it is configured.
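Before relying on this behavior, it can be worth confirming what a package's configuration file actually contains. The sketch below assumes the file holds one "PARAMETER value" pair per line with # comments; pkg_param and failfast_enabled are hypothetical helper names, and the file path is whatever you use for your package.

```shell
# pkg_param FILE NAME: print the value of parameter NAME from a package
# configuration file. Commented-out lines do not match because their
# first field is not the bare parameter name.
pkg_param() {
    awk -v name="$2" '$1 == name { print $2; exit }' "$1"
}

# failfast_enabled FILE: succeed only if NODE_FAIL_FAST_ENABLED is YES.
failfast_enabled() {
    [ "$(pkg_param "$1" NODE_FAIL_FAST_ENABLED)" = "YES" ]
}
```

The same helper can report RUN_SCRIPT_TIMEOUT and HALT_SCRIPT_TIMEOUT, so you can see at a glance whether a timeout is set and whether the package will fail over or wait for manual cleanup when it is exceeded.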

Package Movement Errors

These errors are similar to the system administration errors except they are caused specifically by errors in the package control script. The best way to prevent these errors is to test your package control script before putting your high availability application on line.

Adding a "set -x" statement in the second line of your control script will give you details on where your script may be failing.
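To see what this tracing looks like without touching a real package, the sketch below writes a two-step stand-in for a control script, with set -x on its second line, and prints only the trace output. The function name trace_demo and the contents of the stand-in script are made up for illustration.

```shell
# trace_demo: create a throwaway script with "set -x" on line 2, run it,
# and print only the trace that the shell writes to standard error.
trace_demo() {
    demo=$(mktemp) || return 1
    cat > "$demo" <<'EOF'
#!/bin/sh
set -x
step="activate volume group"
echo "$step"
EOF
    # Send the trace (stderr) to stdout and discard the script's own output.
    sh "$demo" 2>&1 >/dev/null
    rm -f "$demo"
}
```

Each command the script executes is echoed with a leading + before it runs, so the last traced line before a failure shows where the script stopped.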

Node and Network Failures

Node and network failures cause ServiceGuard to transfer control of a package to another node. This is the normal action of ServiceGuard, but you have to be able to recognize when a transfer has taken place and decide whether to leave the cluster in its current condition or to restore it to its original condition. Possible node failures can be caused by the following conditions:

  • HPMC. This is a High Priority Machine Check, a system panic caused by a hardware error.

  • TOC

  • Panics

  • Hangs

  • Power failures

In the event of a TOC, a system dump is performed on the failed node and numerous messages are also displayed on the console. You can use the following commands to check the status of your network and subnets:

  • netstat -in - to display LAN status and check to see if the package IP is stacked on the LAN card.

  • lanscan - to see if the LAN is on the primary interface or has switched to the standby interface.

  • arp -a - to check the arp tables.

  • lanadmin - to display, test, and reset the LAN cards.

Since your cluster is unique, there are no cookbook solutions to possible problems. But if you apply these checks and commands and work your way through the log files, you will be successful in identifying and solving problems.

Troubleshooting Quorum Server

Authorization File Problems

The following kind of message in a ServiceGuard node's syslog file may indicate an authorization problem:

Access denied to quorum server 192.6.7.4

The reason may be that you have not updated the authorization file. Verify that the node is included in the file, and try using /usr/lbin/qs -update to re-read the authorization file.
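A quick way to verify the first point is to search the authorization file for the node name. The sketch below assumes the file lists one node name per line; that layout, the function name node_in_authfile, and the file path you pass in are assumptions, so check them against your quorum server installation.

```shell
# node_in_authfile FILE NODE: succeed if NODE appears on a line of its own
# in FILE. The match is literal and covers the whole line, so a short
# name does not match a fully qualified name or the reverse.
node_in_authfile() {
    grep -Fx -- "$2" "$1" >/dev/null 2>&1
}
```

Because the match is exact, test the name in the same form (short or fully qualified) that the cluster node uses. If the node is missing, add it and then run /usr/lbin/qs -update as described above.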

Timeout Problems

The following kinds of message in a ServiceGuard node's syslog file may indicate timeout problems:

Unable to set client version at quorum server 192.6.7.2:reply timed out
Probe of quorum server 192.6.7.2 timed out

These messages may indicate an intermittent network problem, or the default quorum server timeout may not be sufficient. You can set QS_TIMEOUT_EXTENSION to increase the timeout, or you can increase the heartbeat or node timeout value.

The following kind of message in a ServiceGuard node's syslog file indicates that the node did not receive a reply to its lock request in time. This could be because of a delay in communication between the node and the quorum server, or between the quorum server and other nodes in the cluster:

Attempt to get lock /sg/cluser1 unsuccessful. Reason: request_timedout
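When a syslog file is large, it can help to pull out just the quorum server messages quoted in this section and label them by likely cause. The following sketch does that; the function name and the three labels are made up for illustration, and the patterns cover only the message texts shown above.

```shell
# qs_syslog_problems: read syslog lines on stdin and label the quorum
# server messages quoted in this section. All other lines are ignored.
qs_syslog_problems() {
    awk '
        /Access denied to quorum server/      { print "authorization: " $0; next }
        /quorum server .*timed out/           { print "timeout: " $0; next }
        /Attempt to get lock .* unsuccessful/ { print "lock-request: " $0; next }
    '
}
```

Feed it the node's syslog file on standard input, for example qs_syslog_problems < /var/adm/syslog/syslog.log, assuming that is where syslog is written on your system.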


Messages

The coordinator node in ServiceGuard sometimes sends a request to the quorum server to set the lock state. (This is different from a request to obtain the lock in tie-breaking.) If the quorum server's connection to one of the cluster nodes has not completed, the request to set may fail with a two-line message like the following in the quorum server's log file:

Oct 08 16:10:05:0: There is no connection to the applicant
2 for lock /sg/lockTest1
Oct 08 16:10:05:0:Request for lock /sg/lockTest1 from
applicant 1 failed: not connected to all applicants.

This condition can be ignored. The request will be retried a few seconds later and will succeed. The following message is logged:

Oct 08 16:10:06:0: Request for lock /sg/lockTest1
succeeded. New lock owners: 1,2.
© Hewlett-Packard Development Company, L.P.