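The following sections offer a few suggestions for troubleshooting by reviewing the state of the running system and by examining cluster status data, log files, and configuration files. Topics include:

- Reviewing Cluster and Package States
- Reviewing Package IP Addresses
- Reviewing the System Log File
- Reviewing Configuration Files
- Reviewing the Package Control Script
- Using cmquerycl and cmcheckconf
- Using cmscancl
- Using cmviewconf
- Reviewing the LAN Configuration
- Reviewing the Status of Shared Volume Groups
- Using DLM Diagnostic Tools (OPS 7.3.x)

Reviewing Cluster and Package States
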
A cluster or its component nodes may be in several different states at different points in time. Status information for clusters, packages, and other cluster elements is shown in the output of the cmviewcl command and in some displays in SAM. This section explains the meaning of many of the common conditions the cluster or package may be in.

Information about cluster status is stored in the status database, which is maintained on each individual node in the cluster. You can display information contained in this database by issuing the cmviewcl command:

# cmviewcl -v

When issued with the -v option, the command displays information about the whole cluster. See the man page for a detailed description of other cmviewcl options.

The status of a cluster may be one of the following:

- Up. At least one node has a running cluster daemon, and reconfiguration is not taking place.
- Down. No cluster daemons are running on any cluster node.
- Starting. The cluster is in the process of determining its active membership. At least one cluster daemon is running.

The status of a node is either up (active as a member of the cluster) or down (inactive in the cluster), depending on whether its cluster daemon is running or not. Note that a node might be down from the cluster perspective, but still up and running HP-UX.

A node may also be in one of the following states:

- Failed. A node never sees itself in this state. Other active members of the cluster will see a node in this state if that node was in an active cluster, but is no longer, and is not halted.
- Cluster Reforming. A node in this state is running the protocols which ensure that all nodes agree to the new membership of an active cluster. If agreement is reached, the status database is updated to reflect the new cluster membership.
- Running. A node in this state has completed all required activity for the last re-formation and is operating normally.
- Halted. A node never sees itself in this state. Other nodes will see it in this state after the node has gracefully left the active cluster, for instance with a cmhaltnode command.
- Unknown. A node never sees itself in this state. Other nodes assign a node this state if it has never been an active cluster member.

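As a rough illustration of reading node status and state from cmviewcl output, the following sketch lists every node with its status and state. The function name node_states is hypothetical, and the parsing assumes the layout used in this chapter's sample output, where each node line directly follows a header line that begins with NODE. Output from other releases may be laid out differently.

```shell
# node_states: hypothetical helper, not part of MC/LockManager.
# Reads cmviewcl output on standard input and prints "name status state"
# for every node. Assumes each node line directly follows a header line
# whose first two fields are NODE and STATUS.
node_states() {
    awk '
        $1 == "NODE" && $2 == "STATUS" { header_seen = 1; next }
        header_seen { print $1, $2, $3; header_seen = 0 }
    '
}

# Typical use on a cluster node:
#   cmviewcl | node_states
```

Matching the first field exactly against NODE keeps the NODE_TYPE header of the Node_Switching_Parameters section from being mistaken for a node header.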
The status of a package can be one of the following:

- Up. The package control script is active.
- Down. The package control script is not active.

The state of the package can be one of the following:

- Starting. The start instructions in the control script are being run.
- Running. Services are active and being monitored.
- Halting. The halt instructions in the control script are being run.

Packages also have the following switching attributes:

- Package Switching. Enabled means that the package can switch to another node in the event of failure.
- Switching Enabled for a Node. Enabled means that the package can switch to the referenced node. Disabled means that the package cannot switch to the specified node until the node is enabled for the package using the cmmodpkg command.

Every package is marked Enabled or Disabled for each node that is either a primary or adoptive node for the package.

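The package switching attribute can be checked mechanically. The sketch below prints each package whose PKG_SWITCH column reads disabled, together with the node column. The helper name switching_disabled is hypothetical, and the parsing assumes the five-column package lines (PACKAGE, STATUS, STATE, PKG_SWITCH, NODE) used in this chapter's sample output.

```shell
# switching_disabled: hypothetical helper, not part of MC/LockManager.
# Reads cmviewcl output on standard input and prints "package node" for
# every package whose PKG_SWITCH column is "disabled".
switching_disabled() {
    awk '
        # Header of a package table: package lines follow it.
        $1 == "PACKAGE" && $4 == "PKG_SWITCH" { in_pkgs = 1; next }
        # A package line has five fields; the fourth is the switching flag.
        in_pkgs && NF == 5 && ($4 == "enabled" || $4 == "disabled") {
            if ($4 == "disabled") print $1, $5
            next
        }
        # Any other line ends the package table.
        { in_pkgs = 0 }
    '
}

# Typical use on a cluster node:
#   cmviewcl | switching_disabled
```

A package listed by this helper will not fail over automatically until switching is enabled again with cmmodpkg.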
The DLM state of the cluster is one of the following:

- Starting. The start instructions in the DLM runhalt script are being run.
- Running. Services are active and being monitored.
- Halting. The halt instructions in the DLM runhalt script are being run.

The GMS 8.0.5 state of the cluster is one of the following:

- Starting. The cluster is starting and GMS 8.0.5 services have been initiated.
- Running. Services are active and being monitored.
- Halted. The cluster is halting and GMS 8.0.5 services have been stopped.

Status (OPS 8.1.5 64 bit or OPS 8.1.6)

The state of the cluster for OPS 8.1.5 64 bit or OPS 8.1.6 is one of the following:

- Up. Services are active and being monitored. The membership appears in the output of cmviewcl -l group.
- Down. The cluster is halted and OPS 8.1.5 64 bit or OPS 8.1.6 services have been stopped. The membership does not appear in the output of cmviewcl -l group.

# cmviewcl -l group
GROUP   MEMBER  PID   MEMBER_NODE
groupA  1       4003  chinook
        2       5405  comanche
        3       7865  comanche
groupB  5       3579  chinook
groupC  4       9517  chinook

where the cmviewcl output values are:

- GROUP
  the name of a configured group
- MEMBER
  the ID number of a member of a group
- PID
  the process ID of the group member
- MEMBER_NODE
  the node on which the group member is running

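Because the GROUP column is left blank on continuation rows, the output is awkward to filter by node or group directly. The following sketch fills in the group name on every row. The helper name group_members is hypothetical, and it assumes the four-column layout shown above, with continuation rows carrying only MEMBER, PID, and MEMBER_NODE.

```shell
# group_members: hypothetical helper, not part of MC/LockManager.
# Reads "cmviewcl -l group" output on standard input and prints one
# "group member pid node" line per member, repeating the group name on
# continuation rows that omit it.
group_members() {
    awk '
        $1 == "GROUP" { next }                              # skip the header row
        NF == 4 { group = $1; print $1, $2, $3, $4; next }  # row that names its group
        NF == 3 { print group, $1, $2, $3 }                 # continuation row
    '
}

# Typical use, listing only the members running on node comanche:
#   cmviewcl -l group | group_members | awk '$4 == "comanche"'
```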
Services have only status, as follows:

- Up. The service is being monitored.
- Down. The service is not running. It may have halted or failed.

The network interfaces have only status, as follows:

- Unknown. We cannot determine whether the interface is up or down. This can happen when the cluster is down. A standby interface has this status.

The serial line has only status, as follows:

- Up. Heartbeats are received over the serial line.
- Down. Heartbeat has not been received over the serial line within 2 times the NODE_TIMEOUT value.
- Unknown. We cannot determine whether the serial line is up or down. This can happen when the remote node is down.

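As a worked example of the "2 times the NODE_TIMEOUT value" rule, the sketch below computes how long heartbeats must be missing before the serial line is reported down. It assumes that NODE_TIMEOUT appears in the cluster ASCII configuration file as a keyword followed by a value in microseconds. Both that unit and the helper name serial_down_after are assumptions, so check your own configuration file before relying on the result.

```shell
# serial_down_after: hypothetical helper, not part of MC/LockManager.
# Reads a cluster ASCII configuration file on standard input and prints
# twice the NODE_TIMEOUT value, converted to seconds.
# Assumption: NODE_TIMEOUT is expressed in microseconds.
serial_down_after() {
    awk '$1 == "NODE_TIMEOUT" { printf "%.1f seconds\n", 2 * $2 / 1000000 }'
}

# Typical use:
#   serial_down_after < /etc/cmcluster/cmclconf.asc
```

Under that assumption, a NODE_TIMEOUT of 2000000 gives a threshold of 4.0 seconds.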
Examples of Cluster and Package States

The following sample output from the cmviewcl -v command shows status for the cluster in the sample configuration.

Normal Running Status--OPS 8.0.5, OPS 8.1.5 64 bit and OPS 8.1.6

Everything is running normally; both nodes in a two-node cluster are running, and each OPS instance package is running as well. The only packages running are OPS instance packages.

Status After Halting a Package--OPS 7.3.x

After halting pkg2 with the cmhaltpkg command, the output of cmviewcl -v is as follows:

CLUSTER STATUS
example up
NODE STATUS STATE DLM_STATE
ftsys9 up running running
Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 56/36.1 lan0
STANDBY up 60/6 lan1
PACKAGE STATUS STATE PKG_SWITCH NODE
pkg1 up running enabled ftsys9
Policy_Parameters:
POLICY_NAME CONFIGURED_VALUE
Start min_package_node
Failback manual
Script_Parameters:
ITEM STATUS MAX_RESTARTS RESTARTS NAME
Service up 0 0 service1
Subnet up 0 0 15.13.168.0
Resource up /example/float
Node_Switching_Parameters:
NODE_TYPE STATUS SWITCHING NAME
Primary up enabled ftsys9 (current)
Alternate up enabled ftsys10
NODE STATUS STATE GMS_STATE
ftsys10 up running running
Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 28.1 lan0
STANDBY up 32.1 lan1
UNOWNED_PACKAGES
PACKAGE STATUS STATE PKG_SWITCH NODE
pkg2 down unowned disabled unowned
Policy_Parameters:
POLICY_NAME CONFIGURED_VALUE
Failover min_package_node
Failback manual
Script_Parameters:
ITEM STATUS MAX_RESTARTS RESTARTS NAME
Service up 0 0 service1
Subnet up 0 0 15.13.168.0
Resource down /example/float
Node_Switching_Parameters:
NODE_TYPE STATUS SWITCHING NAME
Primary up enabled ftsys10
Alternate up enabled ftsys9

The package pkg2 now has the status "down", and it is shown in the unowned state, with package switching disabled. Resource "/example/float", which is configured as a dependency of pkg2, is down. Note that switching is enabled for both nodes, however. This means that once global switching is re-enabled for the package, it will attempt to start up on the primary node.

Status After Moving the Package to Another Node

After issuing the following command:

# cmrunpkg -n ftsys9 pkg2

the output of the cmviewcl -v command is as follows:

CLUSTER STATUS
example up
NODE STATUS STATE GMS_STATE
ftsys9 up running running
Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 56/36.1 lan0
STANDBY up 60/6 lan1
PACKAGE STATUS STATE PKG_SWITCH NODE
pkg1 up running enabled ftsys9
Policy_Parameters:
POLICY_NAME CONFIGURED_VALUE
Failover min_package_node
Failback manual
Script_Parameters:
ITEM STATUS MAX_RESTARTS RESTARTS NAME
Service up 0 0 service1
Subnet up 0 0 15.13.168.0
Resource up /example/float
Node_Switching_Parameters:
NODE_TYPE STATUS SWITCHING NAME
Primary up enabled ftsys9 (current)
Alternate up enabled ftsys10
PACKAGE STATUS STATE PKG_SWITCH NODE
pkg2 up running disabled ftsys9
Policy_Parameters:
POLICY_NAME CONFIGURED_VALUE
Failover min_package_node
Failback manual
Script_Parameters:
ITEM STATUS NAME MAX_RESTARTS RESTARTS
Service up service2.1 0 0
Subnet up 15.13.168.0 0 0
Node_Switching_Parameters:
NODE_TYPE STATUS SWITCHING NAME
Primary up enabled ftsys10
Alternate up enabled ftsys9 (current)
NODE STATUS STATE GMS_STATE
ftsys10 up running running
Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 28.1 lan0
STANDBY up 32.1 lan1

Now pkg2 is running on node ftsys9. Note that it is still disabled from switching.

Status After Package Switching is Enabled

The following command changes package status back to Package Switching Enabled:

# cmmodpkg -e pkg2

The result is now as follows:

CLUSTER STATUS
example up
NODE STATUS STATE GMS_STATE
ftsys9 up running running
PACKAGE STATUS STATE PKG_SWITCH NODE
pkg1 up running enabled ftsys9
pkg2 up running enabled ftsys9
NODE STATUS STATE
ftsys10 up running

Both packages are now running on ftsys9 and pkg2 is enabled for switching. Node ftsys10 is running the cluster daemon, but no packages are running on it.

Status After Halting a Node

After halting ftsys10 with the following command:

# cmhaltnode ftsys10

the output of cmviewcl is as follows on ftsys9:

CLUSTER STATUS
example up
NODE STATUS STATE GMS_STATE
ftsys9 up running running
PACKAGE STATUS STATE PKG_SWITCH NODE
pkg1 up running enabled ftsys9
pkg2 up running enabled ftsys9
NODE STATUS STATE
ftsys10 down halted

This output is seen on both ftsys9 and ftsys10.

If you are using a serial (RS232) line as a heartbeat connection, you will see a list of configured RS232 device files in the output of the cmviewcl -v command. The following shows normal running status:

CLUSTER STATUS
example up
NODE STATUS STATE GMS_STATE
ftsys9 up running running
Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 56/36.1 lan0
Serial_Heartbeat:
DEVICE_FILE_NAME STATUS CONNECTED_TO:
/dev/tty0p0 up ftsys10 /dev/tty0p0
NODE STATUS STATE GMS_STATE
ftsys10 up running running
Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 28.1 lan0
Serial_Heartbeat:
DEVICE_FILE_NAME STATUS CONNECTED_TO:
/dev/tty0p0 up ftsys9 /dev/tty0p0

The following shows status when the serial line is not working:

CLUSTER STATUS
example up
NODE STATUS STATE GMS_STATE
ftsys9 up running running
Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 56/36.1 lan0
Serial_Heartbeat:
DEVICE_FILE_NAME STATUS CONNECTED_TO:
/dev/tty0p0 down ftsys10 /dev/tty0p0
NODE STATUS STATE GMS_STATE
ftsys10 up running running
Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 28.1 lan0
Serial_Heartbeat:
DEVICE_FILE_NAME STATUS CONNECTED_TO:
/dev/tty0p0 down ftsys9 /dev/tty0p0

Reviewing Package IP Addresses

The netstat -in command can be used to examine the LAN configuration. When executed on node 1 after node 2 has been halted, the command shows that the package IP addresses are assigned to lan0 on node 1 along with the heartbeat IP address.

Name   Mtu   Network    Address       Ipkts   Ierrs  Opkts   Oerrs  Coll
ni0*   0     none       none          0       0      0       0      0
ni1*   0     none       none          0       0      0       0      0
lo0    4608  127        127.0.0.1     10114   0      10114   0      0
lan0   1500  15.13.168  15.13.171.14  959269  2      305189  47     30538
lan0   1500  15.13.168  15.13.171.23  959269  2      305189  47     30538
lan0   1500  15.13.168  15.13.171.20  959269  2      305189  47     30538
lan1*  1500  none       none          418623  27     41716   3      5149

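To see at a glance which addresses a given interface is carrying, the Address column can be extracted per interface. This is a minimal sketch that assumes the column order shown in the netstat -in listing above, with the interface name in the first column and the address in the fourth. The helper name addresses_on is hypothetical.

```shell
# addresses_on: hypothetical helper.
# $1 is an interface name such as lan0. Reads "netstat -in" output on
# standard input and prints every address listed for that interface.
addresses_on() {
    awk -v ifname="$1" '$1 == ifname { print $4 }'
}

# Typical use:
#   netstat -in | addresses_on lan0
```

For the listing above, this prints the three addresses assigned to lan0, which are the package IP addresses and the heartbeat IP address.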
Reviewing the System Log File

All the components of MC/LockManager produce messages at different times indicating the completion of a step or an error or warning condition. Messages generated by SAM are displayed to the user in a message box; messages from HP-UX commands are normally displayed on the standard output; some information may also be written to different log files, depending on which software component is generating the message.

Messages from the cluster manager are found in the system log file, /var/adm/syslog/syslog.log. Messages from the distributed lock manager are placed in files in a subdirectory of the home directory of the dlm user, as well as being sent to /var/adm/syslog/syslog.log.

Messages Written to the System Log File

Messages from the Cluster Manager and Package Manager are written to the system log file. Each message is accompanied by a timestamp showing the date and time the message was written out and the name of the process that generated the message. The default location of the log file is /var/adm/syslog/syslog.log.

You can distinguish messages from the following daemon processes:

- cmclconfd - CM cluster configuration daemon
- dlm (OPS 7.3.x) - DLM daemons and clients
- cmgmsd - CM group membership daemon

You can examine the syslog.log file periodically for messages relating to the configuration. In SAM, use the following steps:

- Run SAM, and choose the High Availability options.
- Choose Cluster Administration, then select "View Syslog File" from the Cluster Administration Actions menu.

You can also browse the syslog file directly:

# more /var/adm/syslog/syslog.log

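To narrow the log to a single daemon, the process name that accompanies each message can be used as a filter. The sketch below assumes the usual syslog layout, in which the fifth whitespace-separated field is the process name, optionally followed by a process ID in brackets and a colon. The helper name messages_from is hypothetical.

```shell
# messages_from: hypothetical helper.
# $1 is a process name as it appears in the log, for example cmgmsd.
# Reads syslog lines on standard input and prints those whose process
# field matches the name exactly.
messages_from() {
    awk -v daemon="$1" '
        {
            tag = $5
            sub(/\[.*$/, "", tag)   # drop a trailing "[pid]:" if present
            sub(/:$/, "", tag)      # drop a trailing ":" if present
        }
        tag == daemon
    '
}

# Typical use:
#   messages_from cmgmsd < /var/adm/syslog/syslog.log
```

Comparing the whole process name, rather than searching for a substring, keeps messages from similarly named processes apart.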
The cluster manager employs several types of messages to convey information about the running system. Each message is accompanied by a prefix that identifies the message type. The categories are as follows:

- LOG_INTERNAL
  This type of message is used to log ongoing processes occurring within the MC/LockManager software or one of its related commands.
- LOG_EXTERNAL
  This type of message indicates that there has been a change in the condition of some piece of hardware or software outside MC/LockManager itself. Examples: a LAN card fails, or a node comes back into the cluster.
- LOG_PERIODIC
  This type of message is a special case of the LOG_INTERNAL category. Periodic messages report those events or actions which occur all the time, whether or not a problem or change is detected in the cluster.
- LOG_ERROR
  This type of message is used to report incorrect MC/LockManager behavior, which may be related to the inability to obtain system resources or other problems within MC/LockManager.
- LOG_DEATH
  This type of message accompanies the death of a daemon process.

Messages are intended to be self-explanatory, but occasionally it is necessary to study several messages together in context to determine the appropriate corrective action. In some cases, no action is required because the message is purely informative, as when a message reports successful completion of a task. In other cases, the only action may be to gather information from the running system for use in diagnosis of the problem by HP field personnel.

Sample System Log Entries

The following entries from the file /var/adm/syslog/syslog.log show a package that failed to run due to a problem in the pkg5_run script. Examine the pkg5_run.log file for details.

Dec 14 14:33:48 star04 cmcld[2048]: Starting cluster management protocols.
Dec 14 14:33:48 star04 cmcld[2048]: Attempting to form a new cluster
Dec 14 14:33:53 star04 cmcld[2048]: 3 nodes have formed a new cluster
Dec 14 14:33:53 star04 cmcld[2048]: The new active cluster membership is:
star04(id=1) , star05(id=2), star06(id=3)
Dec 14 17:33:53 star04 cmlvmd[2049]: Clvmd initialized successfully.
Dec 14 14:34:44 star04 CM-CMD[2054]: cmrunpkg -v pkg5
Dec 14 14:34:44 star04 cmcld[2048]: Request from node star04 to start
package pkg5 on node star04.
Dec 14 14:34:44 star04 cmcld[2048]: Executing '/etc/cmcluster/pkg5/pkg5_run
start' for package pkg5.
Dec 14 14:34:45 star04 LVM[2066]: vgchange -a n /dev/vg02
Dec 14 14:34:45 star04 cmcld[2048]: Package pkg5 run script exited with
NO_RESTART.
Dec 14 14:34:45 star04 cmcld[2048]: Examine the file
/etc/cmcluster/pkg5/pkg5_run.log for more details.
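
A failure like this one can be picked out of a long log by searching for the "run script exited with" message. The following sketch prints the name of each package mentioned in such a message. The helper name failed_run_scripts is hypothetical, and the message wording is taken from the sample entries above, so it is an assumption that other releases use the same text.

```shell
# failed_run_scripts: hypothetical helper.
# Reads syslog lines on standard input and prints the package name from
# each "Package <name> run script exited with ..." message.
failed_run_scripts() {
    awk '/run script exited with/ {
        for (i = 1; i < NF; i++)
            if ($i == "Package") { print $(i + 1); break }
    }'
}

# Typical use:
#   failed_run_scripts < /var/adm/syslog/syslog.log
```

Each package named in the result has a run script log worth examining, as the cmcld message itself suggests.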

The following is an example of a successful package start:

Dec 14 14:39:27 star04 CM-CMD[2096]: cmruncl
Dec 14 14:39:27 star04 cmcld[2098]: Starting cluster management protocols.
Dec 14 14:39:27 star04 cmcld[2098]: Attempting to form a new cluster
Dec 14 14:39:27 star04 cmclconfd[2097]: Command execution message
Dec 14 14:39:33 star04 cmcld[2098]: 3 nodes have formed a new cluster
Dec 14 14:39:33 star04 cmcld[2098]: The new active cluster membership is:
star04(id=1), star05(id=2), star06(id=3)
Dec 14 17:39:33 star04 cmlvmd[2099]: Clvmd initialized successfully.
Dec 14 14:39:34 star04 cmcld[2098]: Executing '/etc/cmcluster/pkg4/pkg4_run
start' for package pkg4.
Dec 14 14:39:34 star04 LVM[2107]: vgchange /dev/vg01
Dec 14 14:39:35 star04 CM-pkg4[2124]: cmmodnet -a -i 15.13.168.0 15.13.168.4
Dec 14 14:39:36 star04 CM-pkg4[2127]: cmrunserv Service4 /vg01/MyPing 127.0.0.1
>>/dev/null
Dec 14 14:39:36 star04 cmcld[2098]: Started package pkg4 on node star04.

The following DLM daemons produce messages:

- cmlkmgrd - DLM configuration daemon
- cmdlmmon - DLM monitor daemon

These daemon processes direct their messages to the logs directory inside the dlm home directory. Two log files contain messages produced by the DLM daemons (and client processes attached to the DLM):

- dlmstart.log. This file contains messages from the DLM daemons produced during startup.
- dlm.log. This file contains messages from the DLM daemons produced during normal operation, reconfiguration and shutdown.

Important DLM messages are also directed to /var/adm/syslog/syslog.log. See Appendix C, "Designing Highly Available Cluster Applications", for a listing of all DLM error messages, together with a probable cause for each error condition and the action you should take to eliminate the problem.

GMS errors at configuration time are reported on the standard output, and runtime GMS errors appear in the syslog file. A complete list of GMS configuration and runtime errors appears in Appendix D, "MC/LockManager Error Messages".

Some GMS internal errors are written to a trace file. For OPS 8.0.5, the trace file for the GMS daemon is /tmp/.ogms/daem_xxxx.trc, where xxxx is the GMS daemon's process ID. This file, which contains messages from OPS as well as from MC/LockManager, is of use only to Oracle or HP support personnel. Each MC/LockManager internal message consists of three parts:

- timestamp - date and time the error occurred
- function - name of the node manager API (nmapi) function issuing the message
- message - text of the message itself

If the message contains the keyword ERROR, there has been an error in the function issuing the message. It is usually caused by a hardware failure or a software defect. OPS users need to pay attention only to this kind of message. The following is an example:

Thu May 14 18:42:31 1998 skgxndinfo: Unable to retrieve bootstrap info from status database at this moment.

Send a copy of the trace file when requested by HP support.

If Using OPS 8.1.5 64 bit or OPS 8.1.6

Runtime errors appear in the syslog file. If the message contains the keywords cmgmsd and ERROR, a hardware or software defect has occurred. Send a copy of the syslog file when requested by HP support. The following is an example:

Apr 5 18:26:33 node_name cmgmsd [1952]: ERROR: Failed to create primary obj (4)

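The two-keyword rule for OPS 8.1.5 64 bit or OPS 8.1.6 translates directly into a pair of grep filters. This is a minimal sketch; the helper name gms_errors is hypothetical, and it simply keeps every line that contains both cmgmsd and ERROR.

```shell
# gms_errors: hypothetical helper.
# Reads syslog lines on standard input and prints those that contain
# both keywords, cmgmsd and ERROR.
gms_errors() {
    grep 'cmgmsd' | grep 'ERROR'
}

# Typical use:
#   gms_errors < /var/adm/syslog/syslog.log
```

Chaining two filters avoids depending on the exact spacing between the daemon name, the process ID, and the keyword.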
Reviewing Configuration Files

Review the following configuration files:

- Cluster configuration file /etc/cmcluster/cmclconf.asc.
- Package configuration files. For each package, the file is called /etc/cmcluster/package_name/package_nameconf.asc.

Reviewing the Package Control Script

Ensure that the package control script is found on all nodes where the package can run and that the file is identical on all nodes. Ensure that the script is executable on all nodes. Ensure that the name of the control script appears in the package configuration file, and ensure that all services named in the package configuration file also appear in the package control script.

Information about the starting and halting of each package is found in the package's control script log. This log provides the history of the operation of the package control script. It is found at /etc/cmcluster/package_name/control.sh.log. This log documents all package run and halt activities. If you have written a separate run and halt script for a package, each script will have its own log.

Using cmquerycl and cmcheckconf

In addition, cmquerycl and cmcheckconf can be used to troubleshoot your cluster just as they were used to verify its configuration. The following example shows the commands used to verify the existing cluster configuration on node 1 and node 2:

# cmquerycl -v -C /etc/cmcluster/verify.asc -n node1 -n node2
# cmcheckconf -v -C /etc/cmcluster/verify.asc

The cmcheckconf command checks the following:

- The network addresses and connections.
- The cluster lock connectivity.
- The validity of configuration parameters of the cluster and packages for:
  - The existence and permission of scripts.

The cmcheckconf command does not check the following:

- The correct setup of the power circuits.
- The correctness of the package control script.

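For clusters with more than two nodes, the same pair of commands needs one -n option per node. The sketch below only prints the two command lines for a given node list so that they can be reviewed before being run. The helper name print_verify_commands is hypothetical, and it uses only the options that appear in the example above.

```shell
# print_verify_commands: hypothetical helper.
# $1 is the ASCII file to write and check; the remaining arguments are
# node names. Prints the cmquerycl and cmcheckconf command lines without
# running them.
print_verify_commands() {
    config_file=$1
    shift
    node_args=""
    for node in "$@"; do
        node_args="$node_args -n $node"
    done
    echo "cmquerycl -v -C $config_file$node_args"
    echo "cmcheckconf -v -C $config_file"
}

# Typical use:
#   print_verify_commands /etc/cmcluster/verify.asc node1 node2
```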
Using cmscancl

The command cmscancl displays information about all the nodes in a cluster in a structured report that allows you to compare such items as IP addresses or subnets, physical volume names for disks, and other node-specific items for all nodes in the cluster. cmscancl actually runs several different HP-UX commands on all nodes and gathers the output into a report on the node where you run the command.

The following are the types of configuration data that cmscancl displays for each node:

Table 8-1 Data Displayed by the cmscancl Command

| Description | Source of Data |
|---|---|
| LAN device configuration and status | lanscan command |
| network status and interfaces | netstat command |
| file systems | mount command |
| LVM configuration | /etc/lvmtab file |
| LVM physical volume group data | /etc/lvmpvg file |
| link level connectivity for all links | linkloop command |
| binary configuration file | cmviewconf command |

Using cmviewconf

cmviewconf allows you to examine the binary cluster configuration file, even when the cluster is not running. The command displays the content of this file on the node where you run the command.

Reviewing the LAN Configuration

The following networking commands can be used to diagnose problems:

- netstat -in can be used to examine the LAN configuration. This command lists all IP addresses assigned to each LAN interface card.
- lanscan can also be used to examine the LAN configuration. This command lists the MAC addresses and status of all LAN interface cards on the node.
- arp -a can be used to check the arp tables.
- landiag is useful to display, diagnose, and reset LAN card information.
- linkloop verifies the communication between LAN cards at MAC address levels. For example, if you enter linkloop -i4 0x08000993AB72, you should see displayed the message "Link Connectivity to LAN station: 0x08000993AB72 -- OK".
- /usr/contrib/bin/cmgetconfig -f can be used to verify that Primary and Standby LANs are on the same bridged net.
- cmviewcl -v shows the status of primary and standby LANs.

Use these commands on all nodes.

Reviewing the Status of Shared Volume Groups

To display the current configuration of a shared volume group, use the vgdisplay -v command. An example is as follows:

# vgdisplay -v /dev/vg_ops

The output includes a list of all volume groups, together with the logical volumes configured in them and all the physical volumes associated with them. Physical volume groups are also included.

Using DLM Diagnostic Tools (OPS 7.3.x)

MC/LockManager software includes a group of diagnostic tools that may be helpful in troubleshooting. Use these tools in cooperation with your HP representative or technical consultant. Refer also to the man page for each command.

- dlmdump is a tool that dumps DLM-related memory structures. dlmdump is used for debugging purposes. It allows the user to obtain a snapshot of an object given its handle. Since dlmdump only provides a "snapshot" of the current state, what is displayed might not be completely consistent with the actual data.
- dlmstat is a tool that tracks DLM-related statistical information. dlmstat is used to acquire statistics for the process, resource, instance, and cluster objects from the DLM database. For example:

# dlmstat -i -q -t +1 -n 10

Core dumps for the cmcld and cmlvmd daemons are produced in the /var/adm/cmcluster and /etc/lvmconf directories, respectively. The DLM daemons (OPS 7.3.x) deposit dumps in the cores subdirectory of the dlm home directory.