Solving Problems
Problems with ServiceGuard may be of several types. The following is a list of common categories of problem:
The first two categories of problems result from incorrect configuration of ServiceGuard. The last category contains "normal" failures, to which ServiceGuard clusters are designed to react in order to preserve the availability of your applications.

Many ServiceGuard commands, including cmviewcl, depend on name resolution services to look up the addresses of cluster nodes. When name services are not available (for example, if a name server is down), ServiceGuard commands may hang, or may return a network-related error message. If this happens, use the nslookup command on each cluster node to see whether name resolution is correct. Example:

# nslookup ftsys9
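To repeat the same lookup for several nodes, the check can be wrapped in a small shell function. This is a minimal sketch, assuming a POSIX shell: the function name check_nodes and the LOOKUP variable are illustrative and not part of ServiceGuard. Because the exit status of nslookup differs between implementations, the sketch prints the full output for each node so that the address can be compared by eye, and treats a non-zero exit status only as a hint.

```shell
#!/bin/sh
# Illustrative sketch only: check_nodes and LOOKUP are hypothetical names,
# not ServiceGuard commands or parameters.
# LOOKUP is the resolver command to run; it defaults to nslookup.
LOOKUP=${LOOKUP:-nslookup}

check_nodes() {
    failed=0
    for node in "$@"; do
        echo "=== $node ==="
        # Show the lookup output so the reported address can be inspected.
        if ! $LOOKUP "$node"; then
            echo "lookup command reported a failure for $node"
            failed=$((failed + 1))
        fi
    done
    # Succeed only if no lookup reported a failure.
    [ "$failed" -eq 0 ]
}
```

After defining the function on a cluster node, run it with your own node names, for example `check_nodes ftsys9`, and compare each reported address with the expected one.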
If the nslookup output does not include the correct IP address of the node, then check your name resolution services further.

Cluster re-formations may occur from time to time due to current cluster conditions. Some of the causes are as follows:
In these cases, applications continue running, though they might experience a small performance impact during cluster re-formation.

There are a number of errors you can make when configuring ServiceGuard OPS Edition that will not show up when you start the cluster. Your cluster can be running, and everything can appear to be fine, until there is a hardware or software failure and control of your packages is not transferred to another node as you would have expected. These situations are caused specifically by errors in the cluster configuration file and package configuration scripts. Examples of these errors include:
You can use the following commands to check the status of your disks:
When a RUN_SCRIPT_TIMEOUT or HALT_SCRIPT_TIMEOUT value is set and the control script hangs, causing the timeout to be exceeded, MC/ServiceGuard kills the script and marks the package "Halted." Similarly, when a package control script fails, MC/ServiceGuard kills the script and marks the package "Halted." In both cases, the following also take place:
Following such a failure, since the control script is terminated, some of the package's resources may be left activated. Specifically:
In this kind of situation, MC/ServiceGuard will not restart the package without manual intervention. You must clean up manually before restarting the package. Use the following steps as guidelines:
The default ServiceGuard control scripts are designed to take the straightforward steps needed to get an application running or stopped. If the package administrator specifies a time limit within which these steps need to occur and that limit is subsequently exceeded for any reason, ServiceGuard takes the conservative approach of assuming that the control script logic is either hung or defective in some way. At that point the control script cannot be trusted to perform cleanup actions correctly, so the script is terminated and the package administrator is given the opportunity to assess what cleanup steps must be taken.

If you want the package to switch automatically in the event of a control script timeout, set the NODE_FAIL_FAST_ENABLED parameter to YES. (If you are using SAM, set Package Failfast to Enabled.) In this case, ServiceGuard will cause a TOC on the node where the control script timed out. This effectively cleans up any side effects of the package's run or halt attempt, and the package will be automatically restarted on any available alternate node for which it is configured.

These errors are similar to the system administration errors except that they are caused specifically by errors in the package control script. The best way to prevent these errors is to test your package control script before putting your high availability application on line. Adding a "set -x" statement as the second line of your control script will give you details on where your script may be failing.

Node and network failures cause ServiceGuard to transfer control of a package to another node. This is the normal action of ServiceGuard, but you have to be able to recognize when a transfer has taken place and decide whether to leave the cluster in its current condition or to restore it to its original condition. Possible node failures can be caused by the following conditions:
In the event of a TOC, a system dump is performed on the failed node and numerous messages are also displayed on the console.

You can use the following commands to check the status of your network and subnets:
Since your cluster is unique, there are no cookbook solutions to possible problems. But if you apply these checks and commands and work your way through the log files, you will be successful in identifying and solving problems.

The following kind of message in a ServiceGuard node's syslog file may indicate an authorization problem:

Access denied to quorum server 192.6.7.4

The reason may be that you have not updated the authorization file. Verify that the node is included in the file, and try using /usr/lbin/qs -update to re-read the authorization file.

The following kinds of message in a ServiceGuard node's syslog file may indicate timeout problems:

Unable to set client version at quorum server 192.6.7.2:reply timed out

These messages could indicate an intermittent network problem, or that the default quorum server timeout is not sufficient. You can set the QS_TIMEOUT_EXTENSION to increase the timeout, or you can increase the heartbeat or node timeout value.

The following kind of message in a ServiceGuard node's syslog file indicates that the node did not receive a reply to its lock request in time. This could be because of a delay in communication between the node and the quorum server, or between the quorum server and other nodes in the cluster:
The coordinator node in ServiceGuard sometimes sends a request to the quorum server to set the lock state. (This is different from a request to obtain the lock in tie-breaking.) If the quorum server's connection to one of the cluster nodes has not completed, the request to set may fail with a two-line message like the following in the quorum server's log file:
This condition can be ignored. The request will be retried a few seconds later and will succeed. The following message is logged: