Configuring OPS Clusters with ServiceGuard OPS Edition > Chapter 8 Troubleshooting Your Cluster

Troubleshooting Approaches


The following sections offer a few suggestions for troubleshooting by reviewing the state of the running system and by examining cluster status data, log files, and configuration files.

Reviewing Package IP Addresses

The netstat -in command can be used to examine the LAN configuration. The command, if executed on node 1 after the halting of node 2, shows that the package IP addresses are assigned to lan0 on node 1 along with the heartbeat IP address.

Name    Mtu   Network    Address       Ipkts   Ierrs  Opkts   Oerrs  Coll
ni0*    0     none       none          0       0      0       0      0
ni1*    0     none       none          0       0      0       0      0
lo0     4608  127        127.0.0.1     10114   0      10114   0      0
lan0    1500  15.13.168  15.13.171.14  959269  2      305189  47     30538
lan0    1500  15.13.168  15.13.171.23  959269  2      305189  47     30538
lan0    1500  15.13.168  15.13.171.20  959269  2      305189  47     30538
lan1*   1500  none       none          418623  27     41716   3      5149
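To pick out the addresses assigned to a single interface from a listing like the one above, a small filter can help. The following is a minimal sketch, not part of ServiceGuard: the function name ips_on_interface is hypothetical, and it assumes the column layout shown in the sample (interface name in column 1, address in column 4).

```shell
# List the IP addresses that a netstat -in listing reports for one interface.
# Reads the listing on standard input; the interface name is the only argument.
ips_on_interface() {
    # Column 1 is the interface name, column 4 the address.
    # Rows whose address is "none" (unconfigured cards) are skipped.
    awk -v ifname="$1" '$1 == ifname && $4 != "none" { print $4 }'
}
```

A typical use would be netstat -in | ips_on_interface lan0, which for the sample listing would print the three lan0 addresses, one per line.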

Reviewing the System Log File

All the components of ServiceGuard produce messages at different times indicating the completion of a step or an error or warning condition. Messages generated by SAM are displayed to the user in a message box; messages from HP-UX commands are normally displayed on the standard output; some information may also be written to different log files, depending on which software component is generating the message. Messages from the cluster manager are found in the system log file, /var/adm/syslog/syslog.log.

Messages Written to the System Log File

Messages from the Cluster Manager and Package Manager are written to the system log file. Each message is accompanied by a timestamp showing the date and time the message was written out and the name of the process that generated the message. The default location of the log file is /var/adm/syslog/syslog.log.

You can distinguish messages from the following daemon processes:

  • cmcld - CM daemon

  • cmclconfd - CM cluster configuration daemon

  • cmlvmd - SLVM daemon

  • cmgmsd - CM group membership daemon

You can examine the syslog.log file periodically for messages relating to the configuration. In SAM, use the following steps:

  1. Run SAM, and choose the High Availability options.

  2. Choose Cluster Administration, then select View Syslog File from the Cluster Administration Actions menu.

You can also browse the syslog file directly:

# more /var/adm/syslog/syslog.log
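When the syslog file is large, it may be easier to look only at entries written by the daemons listed earlier. The sketch below is an illustration rather than a ServiceGuard tool: the function name sg_daemon_msgs is hypothetical, and it assumes each entry carries the process name followed by the process ID in square brackets, as in the samples in this section.

```shell
# Print syslog entries written by the ServiceGuard daemons
# cmcld, cmclconfd, cmlvmd, and cmgmsd.
# The only argument is the path of the log file to read.
sg_daemon_msgs() {
    # Match the "name[pid]:" form so that a daemon name that merely appears
    # inside another message's text is not counted. The optional space
    # allows for entries written as "name [pid]:".
    grep -E '(cmcld|cmclconfd|cmlvmd|cmgmsd) ?\[[0-9]+\]:' "$1"
}
```

For example, sg_daemon_msgs /var/adm/syslog/syslog.log | more pages through only those entries.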

The cluster manager employs several types of messages to convey information about the running system. Each message is accompanied by a prefix that identifies the message type. The categories are as follows:

LOG_INTERNAL

This type of message is used to log ongoing processes occurring within the ServiceGuard software or one of its related commands.

LOG_EXTERNAL

This type of message indicates that there has been a change in the condition of some piece of hardware or software outside ServiceGuard itself. Examples: a LAN card fails, or a node comes back into the cluster.

LOG_PERIODIC

This type of message is a special case of the LOG_INTERNAL category. Periodic messages report those events or actions which occur all the time, whether or not a problem or change is detected in the cluster.

LOG_ERROR

This type of message is used to report incorrect ServiceGuard behavior, which may be related to the inability to obtain system resources or other problems within ServiceGuard.

LOG_DEATH

This type of message accompanies the death of a daemon process.

Messages are intended to be self-explanatory, but occasionally it is necessary to study several messages together in context to determine the appropriate corrective action. In some cases, no action is required because the message is purely informative, as when a message reports successful completion of a task. In other cases, the only action may be to gather information from the running system for use in diagnosis of the problem by HP field personnel.

Sample System Log Entries

The following entries from the file /var/adm/syslog/syslog.log show a package that failed to run due to a problem in the pkg5_run script. You would look at the pkg5_run.log for details.

Dec 14 14:33:48 star04 cmcld[2048]: Starting cluster management protocols.
Dec 14 14:33:48 star04 cmcld[2048]: Attempting to form a new cluster
Dec 14 14:33:53 star04 cmcld[2048]: 3 nodes have formed a new cluster
Dec 14 14:33:53 star04 cmcld[2048]: The new active cluster membership is:
star04(id=1) , star05(id=2), star06(id=3)
Dec 14 17:33:53 star04 cmlvmd[2049]: Clvmd initialized successfully.
Dec 14 14:34:44 star04 CM-CMD[2054]: cmrunpkg -v pkg5
Dec 14 14:34:44 star04 cmcld[2048]: Request from node star04 to start
package pkg5 on node star04.
Dec 14 14:34:44 star04 cmcld[2048]: Executing '/etc/cmcluster/pkg5/pkg5_run
start' for package pkg5.
Dec 14 14:34:45 star04 LVM[2066]: vgchange -a n /dev/vg02
Dec 14 14:34:45 star04 cmcld[2048]: Package pkg5 run script exited with
NO_RESTART.
Dec 14 14:34:45 star04 cmcld[2048]: Examine the file
/etc/cmcluster/pkg5/pkg5_run.log for more details.
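To find every package with a reported run script exit in a long log, the relevant message can be searched for directly. This is a minimal sketch under the assumption that the message text matches the sample above; the function name failed_run_scripts is hypothetical.

```shell
# Print the name of each package for which the log contains a
# "Package <name> run script exited with ..." message.
# The only argument is the path of the log file to read.
failed_run_scripts() {
    # Capture the word between "Package " and " run script exited with".
    sed -n 's/.*Package \([^ ]*\) run script exited with.*/\1/p' "$1"
}
```

Each name printed points to a package whose run log is worth examining next.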

The following is an example of a successful package starting:

Dec 14 14:39:27 star04 CM-CMD[2096]: cmruncl
Dec 14 14:39:27 star04 cmcld[2098]: Starting cluster management protocols.
Dec 14 14:39:27 star04 cmcld[2098]: Attempting to form a new cluster
Dec 14 14:39:27 star04 cmclconfd[2097]: Command execution message
Dec 14 14:39:33 star04 cmcld[2098]: 3 nodes have formed a new cluster
Dec 14 14:39:33 star04 cmcld[2098]: The new active cluster membership is:
star04(id=1), star05(id=2), star06(id=3)
Dec 14 17:39:33 star04 cmlvmd[2099]: Clvmd initialized successfully.
Dec 14 14:39:34 star04 cmcld[2098]: Executing '/etc/cmcluster/pkg4/pkg4_run
start' for package pkg4.
Dec 14 14:39:34 star04 LVM[2107]: vgchange /dev/vg01
Dec 14 14:39:35 star04 CM-pkg4[2124]: cmmodnet -a -i 15.13.168.0 15.13.168.4
Dec 14 14:39:36 star04 CM-pkg4[2127]: cmrunserv Service4 /vg01/MyPing 127.0.0.1
>>/dev/null
Dec 14 14:39:36 star04 cmcld[2098]: Started package pkg4 on node star04.

If Using OPS 8.1.x or Later

Runtime errors appear in the syslog file. If a message contains the keywords cmgmsd and ERROR, a hardware or software defect has occurred. Send a copy of the syslog file when requested by HP support. The following is an example:

Apr 5 18:26:33 node_name cmgmsd [1952]: ERROR: Failed to create primary obj (4)
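A quick way to check for such entries is to search for lines that contain both keywords. The following sketch is only an illustration; the function name gms_errors is hypothetical.

```shell
# Print log entries that contain both the keyword cmgmsd and the keyword ERROR.
# The only argument is the path of the log file to read.
gms_errors() {
    # With no action given, awk prints every line on which both patterns match.
    awk '/cmgmsd/ && /ERROR/' "$1"
}
```

If gms_errors /var/adm/syslog/syslog.log prints nothing, the file contains no entry with both keywords.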

Reviewing Object Manager Log Files

The ServiceGuard Object Manager daemon cmomd logs messages to the file /var/opt/cmom/cmomd.log. You can review these messages using the cmreadlog command, as follows:

# cmreadlog /var/opt/cmom/cmomd.log

Messages from cmomd include information about the processes that request data from the Object Manager, including type of data, timestamp, etc. An example of a client that requests data from Object Manager is ServiceGuard Manager.

Reviewing ServiceGuard Manager Log Files

ServiceGuard Manager maintains a log file of user activity. This file is stored in the HP-UX directory /var/opt/sgmgr or the Windows directory X:\Program Files\Hewlett-Packard\ServiceGuard Manager\log (where X refers to the drive on which you have installed ServiceGuard Manager). You can review these messages using the cmreadlog command, as in the following HP-UX example:

# cmreadlog /var/opt/sgmgr/929917sgmgr.log

Messages from ServiceGuard Manager include information about the login date and time, Object Manager server system, timestamp, etc.

Reviewing Configuration Files

Review the following configuration files:

  • Cluster configuration file /etc/cmcluster/cmclconf.asc.

  • Package configuration files. For each package, the file is called /etc/cmcluster/package_name/package_nameconf.asc.

Reviewing the Package Control Script

Ensure that the package control script is found on all nodes where the package can run and that the file is identical on all nodes. Ensure that the script is executable on all nodes. Ensure that the name of the control script appears in the package configuration file, and ensure that all services named in the package configuration file also appear in the package control script.
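One way to compare the control script across nodes is to print a short fingerprint of the file on each node and check that the output is the same everywhere. The sketch below assumes the standard cksum command is available; the function name script_fingerprint is hypothetical, and how the command is run on each node is left to the administrator.

```shell
# Print whether a file is executable, followed by its checksum and size.
# Run on every node that can run the package and compare the output lines.
script_fingerprint() {
    if [ -x "$1" ]; then
        state=executable
    else
        state=not-executable
    fi
    # Reading from standard input keeps the file name out of cksum's output,
    # so lines from different nodes can be compared directly.
    echo "$state $(cksum < "$1")"
}
```

Identical lines on all nodes indicate that the script has the same content and is executable on each of them. This does not check that the script is named in the package configuration file or that the services match.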

Information about the starting and halting of each package is found in the package's control script log. This log provides the history of the operation of the package control script. It is found at /etc/cmcluster/package_name/control.sh.log. This log documents all package run and halt activities. If you have written a separate run and halt script for a package, each script will have its own log.

Using cmquerycl and cmcheckconf

In addition, cmquerycl and cmcheckconf can be used to troubleshoot your cluster just as they were used to verify its configuration. The following example shows the commands used to verify the existing cluster configuration on node 1 and node 2:

# cmquerycl -v -C /etc/cmcluster/verify.asc -n node1 -n node2

# cmcheckconf -v -C /etc/cmcluster/verify.asc

The cmcheckconf command checks the following:

  • The network addresses and connections.

  • The cluster lock connectivity.

  • The validity of configuration parameters of the cluster and packages for:

    • The uniqueness of names.

    • The existence and permission of scripts.

The cmcheckconf command does not check the following:

  • The correct setup of the power circuits.

  • The correctness of the package control script.

Using cmscancl

The command cmscancl displays information about all the nodes in a cluster in a structured report that allows you to compare such items as IP addresses or subnets, physical volume names for disks, and other node-specific items for all nodes in the cluster. cmscancl actually runs several different HP-UX commands on all nodes and gathers the output into a report on the node where you run the command.

The following are the types of configuration data that cmscancl displays for each node:

Table 8-1 Data Displayed by the cmscancl Command

Description                             Source of Data
LAN device configuration and status     lanscan command
network status and interfaces           netstat command
file systems                            mount command
LVM configuration                       /etc/lvmtab file
LVM physical volume group data          /etc/lvmpvg file
link level connectivity for all links   linkloop command
binary configuration file               cmviewconf command

Using cmviewconf

cmviewconf allows you to examine the binary cluster configuration file, even when the cluster is not running. The command displays the content of this file on the node where you run the command.

Reviewing the LAN Configuration

The following networking commands can be used to diagnose problems:

  • netstat -in can be used to examine the LAN configuration. This command lists all IP addresses assigned to each LAN interface card.

  • lanscan can also be used to examine the LAN configuration. This command lists the MAC addresses and status of all LAN interface cards on the node.

  • arp -a can be used to check the arp tables.

  • landiag can be used to display, diagnose, and reset LAN card information.

  • linkloop verifies the communication between LAN cards at MAC address levels. For example, if you enter

    # linkloop -i4 0x08000993AB72

    you should see displayed the message

    Link Connectivity to LAN station: 0x08000993AB72 -- OK

  • /usr/contrib/bin/cmgetconfig -f can be used to verify that Primary and Standby LANs are on the same bridged net.

  • cmviewcl -v shows the status of primary and standby LANs.

Use these commands on all nodes.

Reviewing the Status of Shared Volume Groups

To display the current configuration of a shared volume group, use the vgdisplay -v command. An example is as follows:

# vgdisplay -v /dev/vg_ops 

The output includes a list of all volume groups, together with the logical volumes configured in them and all the physical volumes associated with them. Physical volume groups are also included.

Problems with VxVM Disk Groups

This section describes some approaches to solving problems that may occur with VxVM disk groups in a cluster environment. For most problems, it is helpful to use the vxdg list command to display the disk groups currently imported on a specific node. Also, you should consult the package control script log files for messages associated with importing and deporting disk groups on particular nodes.

Force Import and Deport After Node Failure

After certain failures, packages configured with VxVM disk groups will fail to start, and the following error will be seen in the package log file:

vxdg: Error dg_01 may still be imported on ftsys9
ERROR: Function check_dg failed

This can happen if a package is running on a node which then fails before the package control script can deport the disk group. In these cases, the host name of the node that had failed is still written on the disk group header.
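The disk group and the host name recorded for it can be read straight from the error text in the package log. The following is a minimal sketch that assumes the message wording shown above; the function name stale_import_host is hypothetical.

```shell
# Print "<disk group> <host>" for each distinct
# "Error <disk group> may still be imported on <host>" message in a log.
# The only argument is the path of the package log file to read.
stale_import_host() {
    # The same error can be logged more than once, so duplicates are removed.
    sed -n 's/.*Error \([^ ]*\) may still be imported on \([^ ]*\).*/\1 \2/p' "$1" |
        sort -u
}
```

The host name printed is the node whose state must be positively determined before any force import is considered.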

When the package starts up on another node in the cluster, a series of messages is printed as in the following example (the hostname of the failed system is ftsys9, and the disk group is dg_01):

check_dg: Error dg_01 may still be imported on ftsys9

To correct this situation, logon to ftsys9 and
execute the following command:
vxdg deport dg_01

Once dg_01 has been deported from ftsys9,
this package may be restarted via either cmmodpkg(1M)
or cmrunpkg(1M).

In the event that ftsys9 is either powered off
or unable to boot, then dg_01 must be force
imported.

******************* WARNING**************************

The use of force import can lead to data corruption if
ftsys9 is still running and has dg_01
imported. It is imperative to positively determine that
ftsys9 is not running prior to performing the force
import. See -C option on vxdg(1M).

*******************************************************

To force import dg_01, execute the following
commands on the local system:
vxdg -tfC import $vg
vxdg deport $vg

Follow the instructions in the message to use the force import option (-C) to allow the current node to import the disk group. Then deport the disk group, after which it can be used again by the package. Example:

# vxdg -tfC import dg_01

# vxdg deport dg_01

The force import will clear the host name currently written on the disks in the disk group, after which you can deport the disk group without error so it can then be imported by a package running on a different node.

CAUTION: This force import procedure should only be used when you are certain the disk is not currently being accessed by another node. If you force import a disk that is already being accessed on another node, data corruption can result.

Core Dump Locations

Core dumps for the cmcld and cmlvmd daemons are produced in the /var/adm/cmcluster and /etc/lvmconf directories, respectively.
