The following sections offer a few suggestions for troubleshooting
by reviewing the state of the running system and by examining cluster
status data, log files, and configuration files. Topics include:

- Reviewing Cluster and Package States
- Reviewing Package IP Addresses
- Reviewing the System Log File
- Reviewing Configuration Files
- Reviewing the Package Control Script
- Using cmquerycl and cmcheckconf
- Reviewing the LAN Configuration
- Reviewing the Status of Shared Volume Groups
- Using DLM Diagnostic Tools

Reviewing Cluster and Package States

A cluster or its component nodes may be in several different
states at different points in time. Status information for clusters,
packages and other cluster elements is shown in the output of the
cmviewcl command and in some displays in SAM. This section explains
the meaning of many of the common conditions the cluster or package
may be in.

Information about cluster status is stored in the status database,
which is maintained on each individual node in the cluster. You
can display information contained in this database by issuing the
cmviewcl command. When issued with the -v option, the command
displays information about the whole cluster. See the man
page for a detailed description of other cmviewcl options.

The status of a cluster may be one of the following:

- Up. At least one node has a running cluster daemon, and
  reconfiguration is not taking place.
- Down. No cluster daemons are running on any cluster node.
- Starting. The cluster is in the process of determining its active
  membership. At least one cluster daemon is running.

The status of a node is either up (active as a member of the cluster)
or down (inactive in the cluster), depending on whether its cluster
daemon is running or not. Note that a node might be down from the
cluster perspective, but still up and running HP-UX.

A node may also be in one of the following states:

- Failed. A node never sees itself in this state. Other active members
  of the cluster will see a node in this state if that node was in an
  active cluster, but is no longer, and is not halted.
- Cluster Reforming. A node in this state is running the protocols
  which ensure that all nodes agree to the new membership of an active
  cluster. If agreement is reached, the status database is updated to
  reflect the new cluster membership.
- Running. A node in this state has completed all required activity
  for the last re-formation and is operating normally.
- Halted. A node never sees itself in this state. Other nodes will see
  it in this state after the node has gracefully left the active
  cluster, for instance with a cmhaltnode command.
- Unknown. A node never sees itself in this state. Other nodes assign
  a node this state if it has never been an active cluster member.

The status of a package can be one of the following:

- Up. The package control script is active.
- Down. The package control script is not active.

The state of the package can be one of the following:

- Starting. The start instructions in the control script are being run.
- Running. Services are active and being monitored.
- Halting. The halt instructions in the control script are being run.

Packages also have the following switching attributes:

- Package Switching. Enabled means that the package can switch to
  another node in the event of failure.
- Switching Enabled for a Node. Enabled means that the package can
  switch to the referenced node. Disabled means that the package
  cannot switch to the specified node until the node is enabled for
  the package using the cmmodpkg command. Every package is marked
  Enabled or Disabled for each node that is either a primary or
  adoptive node for the package.

The GMS state of the cluster is one of the following:

- Starting. The cluster is starting and GMS services have been
  initiated.
- Running. Services are active and being monitored.
- Halted. The cluster is halting and GMS services have been stopped.

The DLM state of the cluster is one of the following:

- Starting. The start instructions in the DLM runhalt script are
  being run.
- Running. Services are active and being monitored.
- Halting. The halt instructions in the DLM runhalt script are being
  run.

Services have only status, as follows:

- Up. The service is being monitored.
- Down. The service is not running. It may have halted or failed.

The network interfaces have only status, as follows:

- Unknown. We cannot determine whether the interface is up or down.
  This can happen when the cluster is down. A standby interface has
  this status.

The serial line has only status, as follows:

- Up. Heartbeats are received over the serial line.
- Down. Heartbeat has not been received over the serial line within
  2 times the NODE_TIMEOUT value.
- Unknown. We cannot determine whether the serial line is up or down.
  This can happen when the remote node is down.
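
The serial line rule above can be restated as a small calculation. The following Python sketch is only an illustration of the rule as worded here, not the cluster daemon's actual logic. The function name serial_line_status and its parameters are hypothetical, and treating a down remote node as "unknown" in every case is an assumption (the text says only that this can happen).

```python
def serial_line_status(elapsed, node_timeout, remote_node_up=True):
    """Classify the serial heartbeat line from the time since the last heartbeat.

    elapsed and node_timeout must be expressed in the same time unit.
    """
    if not remote_node_up:
        # Assumption: a down remote node always yields "unknown".
        return "unknown"
    if elapsed > 2 * node_timeout:
        # No heartbeat within 2 times the NODE_TIMEOUT value.
        return "down"
    return "up"


if __name__ == "__main__":
    # With a timeout of 2 units, 5 units without a heartbeat exceeds 2 * 2.
    print(serial_line_status(elapsed=5, node_timeout=2))
```

The factor of two means a single late heartbeat does not mark the line down; only a gap longer than twice the configured timeout does.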

Examples of Cluster and Package States

The following sample output from the cmviewcl -v command shows status
for the cluster in the sample configuration.

Normal Running Status--OPS 8.0.5

Everything is running normally; both nodes in a two-node cluster
are running, and each OPS instance package is running as well. The
only packages running are OPS instance packages.

    CLUSTER      STATUS
    example      up

      NODE         STATUS       STATE        GMS_STATE
      ftsys9       up           running      running

        Network_Parameters:
        INTERFACE    STATUS       PATH         NAME
        PRIMARY      up           56/36.1      lan0
        STANDBY      up           60/6         lan1

        PACKAGE      STATUS       STATE        PKG_SWITCH   NODE
        ops_pkg1     up           running      disabled     ftsys9

          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Start           configured_node
          Failback        manual

          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      ftsys9       (current)

      NODE         STATUS       STATE        GMS_STATE
      ftsys10      up           running      running

        Network_Parameters:
        INTERFACE    STATUS       PATH         NAME
        PRIMARY      up           28.1         lan0
        STANDBY      up           32.1         lan1

        PACKAGE      STATUS       STATE        PKG_SWITCH   NODE
        ops_pkg2     up           running      disabled     ftsys10

          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Start           configured_node
          Failback        manual

          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      ftsys10      (current)
          Alternate    up           enabled      ftsys9

Status After Halting a Package--OPS 7.3.x

After halting pkg2 with the cmhaltpkg command, the output of
cmviewcl -v is as follows:

    CLUSTER      STATUS
    example      up

      NODE         STATUS       STATE        DLM_STATE
      ftsys9       up           running      running

        Network_Parameters:
        INTERFACE    STATUS       PATH         NAME
        PRIMARY      up           56/36.1      lan0
        STANDBY      up           60/6         lan1

        PACKAGE      STATUS       STATE        PKG_SWITCH   NODE
        pkg1         up           running      enabled      ftsys9

          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Start           min_package_node
          Failback        manual

          Script_Parameters:
          ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
          Service    up       0             0          service1
          Subnet     up       0             0          15.13.168.0
          Resource   up                                /example/float

          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      ftsys9       (current)
          Alternate    up           enabled      ftsys10

      NODE         STATUS       STATE        GMS_STATE
      ftsys10      up           running      running

        Network_Parameters:
        INTERFACE    STATUS       PATH         NAME
        PRIMARY      up           28.1         lan0
        STANDBY      up           32.1         lan1

    UNOWNED_PACKAGES

        PACKAGE      STATUS       STATE        PKG_SWITCH   NODE
        pkg2         down         unowned      disabled     unowned

          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Failover        min_package_node
          Failback        manual

          Script_Parameters:
          ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
          Service    up       0             0          service1
          Subnet     up       0             0          15.13.168.0
          Resource   down                              /example/float

          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      ftsys10
          Alternate    up           enabled      ftsys9

Pkg2 now has the status "down", and it is shown in the unowned state,
with package switching disabled. Resource "/example/float", which is
configured as a dependency of pkg2, is down. Note that switching is
enabled for both nodes, however. This means that once global switching
is re-enabled for the package, it will attempt to start up on the
primary node.

Status After Moving the Package to Another Node

After issuing the following command:

    # cmrunpkg -n ftsys9 pkg2

the output of the cmviewcl -v command is as follows:

    CLUSTER      STATUS
    example      up

      NODE         STATUS       STATE        GMS_STATE
      ftsys9       up           running      running

        Network_Parameters:
        INTERFACE    STATUS       PATH         NAME
        PRIMARY      up           56/36.1      lan0
        STANDBY      up           60/6         lan1

        PACKAGE      STATUS       STATE        PKG_SWITCH   NODE
        pkg1         up           running      enabled      ftsys9

          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Failover        min_package_node
          Failback        manual

          Script_Parameters:
          ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
          Service    up       0             0          service1
          Subnet     up       0             0          15.13.168.0
          Resource   up                                /example/float

          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      ftsys9       (current)
          Alternate    up           enabled      ftsys10

        PACKAGE      STATUS       STATE        PKG_SWITCH   NODE
        pkg2         up           running      disabled     ftsys9

          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Failover        min_package_node
          Failback        manual

          Script_Parameters:
          ITEM       STATUS   NAME          MAX_RESTARTS  RESTARTS
          Service    up       service2.1    0             0
          Subnet     up       15.13.168.0   0             0

          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      ftsys10
          Alternate    up           enabled      ftsys9       (current)

      NODE         STATUS       STATE        GMS_STATE
      ftsys10      up           running      running

        Network_Parameters:
        INTERFACE    STATUS       PATH         NAME
        PRIMARY      up           28.1         lan0
        STANDBY      up           32.1         lan1

Now pkg2 is running on node ftsys9. Note that it is still
disabled from switching.

Status After Package Switching is Enabled

Re-enabling package switching for pkg2 with the cmmodpkg command
changes the package status back to Package Switching Enabled. The
result is now as follows:

    CLUSTER      STATUS
    example      up

      NODE         STATUS       STATE        GMS_STATE
      ftsys9       up           running      running

        PACKAGE      STATUS       STATE        PKG_SWITCH   NODE
        pkg1         up           running      enabled      ftsys9
        pkg2         up           running      enabled      ftsys9

      NODE         STATUS       STATE
      ftsys10      up           running

Both packages are now running on ftsys9 and pkg2 is enabled
for switching. Ftsys10 is running the daemon and no packages are
running on ftsys10.

Status After Halting a Node

After halting ftsys10 with the cmhaltnode command, the output of
cmviewcl is as follows on ftsys9:

    CLUSTER      STATUS
    example      up

      NODE         STATUS       STATE        GMS_STATE
      ftsys9       up           running      running

        PACKAGE      STATUS       STATE        PKG_SWITCH   NODE
        pkg1         up           running      enabled      ftsys9
        pkg2         up           running      enabled      ftsys9

      NODE         STATUS       STATE
      ftsys10      down         halted

This output is seen on both ftsys9 and ftsys10.

If you are using a serial (RS232) line as a heartbeat connection,
you will see a list of configured RS232 device files in the output
of the cmviewcl -v command. The following shows normal running status:

    CLUSTER      STATUS
    example      up

      NODE         STATUS       STATE        GMS_STATE
      ftsys9       up           running      running

        Network_Parameters:
        INTERFACE    STATUS       PATH         NAME
        PRIMARY      up           56/36.1      lan0

        Serial_Heartbeat:
        DEVICE_FILE_NAME     STATUS       CONNECTED_TO:
        /dev/tty0p0          up           ftsys10 /dev/tty0p0

      NODE         STATUS       STATE        GMS_STATE
      ftsys10      up           running      running

        Network_Parameters:
        INTERFACE    STATUS       PATH         NAME
        PRIMARY      up           28.1         lan0

        Serial_Heartbeat:
        DEVICE_FILE_NAME     STATUS       CONNECTED_TO:
        /dev/tty0p0          up           ftsys9 /dev/tty0p0

The following shows status when the serial line is not working:

    CLUSTER      STATUS
    example      up

      NODE         STATUS       STATE        GMS_STATE
      ftsys9       up           running      running

        Network_Parameters:
        INTERFACE    STATUS       PATH         NAME
        PRIMARY      up           56/36.1      lan0

        Serial_Heartbeat:
        DEVICE_FILE_NAME     STATUS       CONNECTED_TO:
        /dev/tty0p0          down         ftsys10 /dev/tty0p0

      NODE         STATUS       STATE        GMS_STATE
      ftsys10      up           running      running

        Network_Parameters:
        INTERFACE    STATUS       PATH         NAME
        PRIMARY      up           28.1         lan0

        Serial_Heartbeat:
        DEVICE_FILE_NAME     STATUS       CONNECTED_TO:
        /dev/tty0p0          down         ftsys9 /dev/tty0p0
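
When cmviewcl output has been captured to a file, the package rows can be picked out mechanically. The following Python sketch is one possible approach and is not part of MC/LockManager. It assumes the column layout shown in the samples in this section (a PACKAGE STATUS STATE PKG_SWITCH NODE header followed by one row per package). The function names parse_packages and packages_needing_attention are hypothetical, and the embedded sample is an abbreviated illustration modeled on the output above.

```python
PACKAGE_HEADER = ["PACKAGE", "STATUS", "STATE", "PKG_SWITCH", "NODE"]

# Abbreviated, illustrative sample modeled on the cmviewcl output above.
SAMPLE = """\
CLUSTER      STATUS
example      up

  NODE         STATUS       STATE        GMS_STATE
  ftsys9       up           running      running

    PACKAGE      STATUS       STATE        PKG_SWITCH   NODE
    pkg1         up           running      enabled      ftsys9
    pkg2         up           running      disabled     ftsys9
"""


def parse_packages(cmviewcl_text):
    """Return one dict per package row found under a PACKAGE header."""
    keys = [name.lower() for name in PACKAGE_HEADER]
    packages = []
    in_package_table = False
    for line in cmviewcl_text.splitlines():
        fields = line.split()
        if fields == PACKAGE_HEADER:
            in_package_table = True
        elif in_package_table and len(fields) == len(PACKAGE_HEADER):
            packages.append(dict(zip(keys, fields)))
        else:
            # A blank line or any other table ends the package rows.
            in_package_table = False
    return packages


def packages_needing_attention(packages):
    """Names of packages that are not up or have package switching disabled."""
    return [p["package"] for p in packages
            if p["status"] != "up" or p["pkg_switch"] != "enabled"]


if __name__ == "__main__":
    print(packages_needing_attention(parse_packages(SAMPLE)))  # prints ['pkg2']
```

Matching on the header row rather than on fixed column offsets keeps the sketch tolerant of the different indentation levels used in brief and verbose output.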

Reviewing Package IP Addresses

The netstat -in command can be used to examine the LAN configuration.
The command, if executed on node 1 after the halting of node 2, shows
that the package IP addresses are assigned to lan0 on node 1 along
with the heartbeat IP address.

    Name    Mtu   Network     Address        Ipkts   Ierrs  Opkts   Oerrs  Coll
    ni0*    0     none        none           0       0      0       0      0
    ni1*    0     none        none           0       0      0       0      0
    lo0     4608  127         127.0.0.1      10114   0      10114   0      0
    lan0    1500  15.13.168   15.13.171.14   959269  2      305189  47     30538
    lan0    1500  15.13.168   15.13.171.23   959269  2      305189  47     30538
    lan0    1500  15.13.168   15.13.171.20   959269  2      305189  47     30538
    lan1*   1500  none        none           418623  27     41716   3      5149
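
To see at a glance which addresses are currently assigned to each interface, the netstat -in listing can be grouped by interface name. The following Python sketch assumes the column order shown above (Name, Mtu, Network, Address, ...). The function name addresses_by_interface is hypothetical, and the sketch does not try to tell the heartbeat address apart from package addresses.

```python
from collections import defaultdict

# Rows copied from the netstat -in sample above.
SAMPLE = """\
Name    Mtu   Network     Address        Ipkts   Ierrs  Opkts   Oerrs  Coll
ni0*    0     none        none           0       0      0       0      0
lo0     4608  127         127.0.0.1      10114   0      10114   0      0
lan0    1500  15.13.168   15.13.171.14   959269  2      305189  47     30538
lan0    1500  15.13.168   15.13.171.23   959269  2      305189  47     30538
lan0    1500  15.13.168   15.13.171.20   959269  2      305189  47     30538
lan1*   1500  none        none           418623  27     41716   3      5149
"""


def addresses_by_interface(netstat_text):
    """Map each interface name to the list of addresses shown for it."""
    result = defaultdict(list)
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and rows that are too short.
        if len(fields) < 4 or fields[0] == "Name":
            continue
        name = fields[0].rstrip("*")  # drop the trailing * marker, if any
        address = fields[3]
        if address != "none":
            result[name].append(address)
    return dict(result)


if __name__ == "__main__":
    for name, addresses in addresses_by_interface(SAMPLE).items():
        print(name, addresses)
```

An interface that lists more than one address, such as lan0 in the sample, is carrying additional addresses beyond its own, which is what you expect to see on a node that has adopted packages.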

Reviewing the System Log File

All the components of MC/LockManager produce messages at
different times indicating the completion of a step or an error
or warning condition. Messages generated by SAM are displayed to
the user in a message box; messages from HP-UX commands are normally
displayed on the standard output; some information may also be written
to different log files, depending on which software component is
generating the message. Messages from the cluster manager are found
in the system log file, /var/adm/syslog/syslog.log.
Messages from the distributed lock manager are placed in files in
a subdirectory of the home directory of the dlm
user, as well as being sent to /var/adm/syslog/syslog.log.

Messages Written to the System Log File

Messages from the Cluster Manager and Package Manager are
written to the system log file. Each message is accompanied by a
timestamp showing the date and time the message was written out
and the name of the process that generated the message. The default
location of the log file is /var/adm/syslog/syslog.log.

You can distinguish messages from the following daemon processes:

- cmclconfd - CM cluster configuration daemon
- dlm (OPS 7.3.x) - DLM daemons and clients

You can examine the syslog.log file periodically for messages
relating to the configuration. In SAM, use the following steps:

1. Run SAM, and choose the High Availability options.
2. Choose Cluster Administration, then select "View Syslog File"
   from the Cluster Administration Actions menu.

You can also browse the syslog file directly:

    # more /var/adm/syslog/syslog.log

The cluster manager employs several types of messages to convey
information about the running system. Each message is accompanied
by a prefix that identifies the message type. The categories are
as follows:

- LOG_INTERNAL
  This type of message is used to log ongoing processes
  occurring within the MC/LockManager software or one of its related
  commands.
- LOG_EXTERNAL
  This type of message indicates that there has been
  a change in the condition of some piece of hardware or software
  outside MC/LockManager itself. Examples: a LAN card fails, or a
  node comes back into the cluster.
- LOG_PERIODIC
  This type of message is a special case of the LOG_INTERNAL
  category. Periodic messages report those events or actions which
  occur all the time, whether or not a problem or change is detected
  in the cluster.
- LOG_ERROR
  This type of message is used to report incorrect
  MC/LockManager behavior, which may be related to the inability to
  obtain system resources or other problems within MC/LockManager.
- LOG_DEATH
  This type of message accompanies the death of a
  daemon process.

Messages are intended to be self-explanatory, but occasionally
it is necessary to study several messages together in context to
determine the appropriate corrective action. In some cases, no action
is required because the message is purely informative, as when a
message reports successful completion of a task. In other cases,
the only action may be to gather information from the running system
for use in diagnosis of the problem by HP field personnel.

Sample System Log Entries

The following entries from the file /var/adm/syslog/syslog.log
show a package that failed to run due to a problem in the pkg5_run
script. You would look at the pkg5_run.log for details.

    Dec 14 14:33:48 star04 cmcld[2048]: Starting cluster management protocols.
    Dec 14 14:33:48 star04 cmcld[2048]: Attempting to form a new cluster
    Dec 14 14:33:53 star04 cmcld[2048]: 3 nodes have formed a new cluster
    Dec 14 14:33:53 star04 cmcld[2048]: The new active cluster membership is: star04(id=1) , star05(id=2), star06(id=3)
    Dec 14 17:33:53 star04 cmlvmd[2049]: Clvmd initialized successfully.
    Dec 14 14:34:44 star04 CM-CMD[2054]: cmrunpkg -v pkg5
    Dec 14 14:34:44 star04 cmcld[2048]: Request from node star04 to start package pkg5 on node star04.
    Dec 14 14:34:44 star04 cmcld[2048]: Executing '/etc/cmcluster/pkg5/pkg5_run start' for package pkg5.
    Dec 14 14:34:45 star04 LVM[2066]: vgchange -a n /dev/vg02
    Dec 14 14:34:45 star04 cmcld[2048]: Package pkg5 run script exited with NO_RESTART.
    Dec 14 14:34:45 star04 cmcld[2048]: Examine the file /etc/cmcluster/pkg5/pkg5_run.log for more details.

The following is an example of a successful package starting:

    Dec 14 14:39:27 star04 CM-CMD[2096]: cmruncl
    Dec 14 14:39:27 star04 cmcld[2098]: Starting cluster management protocols.
    Dec 14 14:39:27 star04 cmcld[2098]: Attempting to form a new cluster
    Dec 14 14:39:27 star04 cmclconfd[2097]: Command execution message
    Dec 14 14:39:33 star04 cmcld[2098]: 3 nodes have formed a new cluster
    Dec 14 14:39:33 star04 cmcld[2098]: The new active cluster membership is: star04(id=1), star05(id=2), star06(id=3)
    Dec 14 17:39:33 star04 cmlvmd[2099]: Clvmd initialized successfully.
    Dec 14 14:39:34 star04 cmcld[2098]: Executing '/etc/cmcluster/pkg4/pkg4_run start' for package pkg4.
    Dec 14 14:39:34 star04 LVM[2107]: vgchange /dev/vg01
    Dec 14 14:39:35 star04 CM-pkg4[2124]: cmmodnet -a -i 15.13.168.0 15.13.168.4
    Dec 14 14:39:36 star04 CM-pkg4[2127]: cmrunserv Service4 /vg01/MyPing 127.0.0.1 >>/dev/null
    Dec 14 14:39:36 star04 cmcld[2098]: Started package pkg4 on node star04.
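
Because each entry carries a timestamp, a host name, and the name of the process that wrote it, entries from one daemon can be isolated with a short script. The following Python sketch assumes the line layout seen in the samples above ("Mon DD HH:MM:SS host process[pid]: message"). The names SYSLOG_LINE, parse_syslog_line, and messages_from are hypothetical, and lines that do not match the assumed layout are simply skipped.

```python
import re

# Assumed layout: "Dec 14 14:34:45 star04 cmcld[2048]: message text"
SYSLOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<process>[^\[\s:]+)(?:\[(?P<pid>\d+)\])?:\s*"
    r"(?P<message>.*)$"
)


def parse_syslog_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = SYSLOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def messages_from(lines, process):
    """Return the parsed entries written by the named process, in order."""
    entries = []
    for line in lines:
        entry = parse_syslog_line(line)
        if entry is not None and entry["process"] == process:
            entries.append(entry)
    return entries


if __name__ == "__main__":
    sample = [
        "Dec 14 14:34:44 star04 CM-CMD[2054]: cmrunpkg -v pkg5",
        "Dec 14 14:34:45 star04 LVM[2066]: vgchange -a n /dev/vg02",
        "Dec 14 14:34:45 star04 cmcld[2048]: Package pkg5 run script exited with NO_RESTART.",
    ]
    for entry in messages_from(sample, "cmcld"):
        print(entry["timestamp"], entry["message"])
```

In practice the lines would come from reading /var/adm/syslog/syslog.log; filtering on cmcld first and then widening to the other processes is one way to study several messages together in context.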

The following DLM daemons produce messages:

- cmlkmgrd - DLM configuration daemon
- cmdlmmon - DLM monitor daemon

These daemon processes direct their messages to the logs
directory inside the dlm home directory.
There are two log files that contain messages produced by the DLM
daemons (and client processes attached to the DLM):

- dlmstart.log. This file contains messages from the DLM daemons
  produced during startup.
- dlm.log. This file contains messages from the DLM daemons produced
  during normal operation, reconfiguration and shutdown.

Important DLM messages are also directed to /var/adm/syslog/syslog.log.
See Appendix C for a listing of all DLM error messages, together
with a probable cause for the error condition, and the action you
should take to eliminate the problem.

GMS errors at configuration time are reported on the standard
output, and runtime GMS errors appear in the syslog file. A complete
list of GMS configuration and runtime errors appears in Appendix C.

GMS Internal Errors (OPS 8.0.5)

Some GMS internal errors are written to a trace file. For
OPS 8.0.5, the trace file for the GMS daemon is /tmp/.ogms/daem_xxxx.trc
(xxxx is the GMS daemon's process ID). This file, which
contains messages from OPS as well as from MC/LockManager, is of
use only to Oracle or HP support personnel. Each MC/LockManager
internal message consists of three parts:

- timestamp - date and time the error occurred
- function - name of the node manager API (nmapi) function issuing
  the message
- message - text of the message itself

If the message contains the keyword ERROR,
there has been an error in the function issuing the message. It
is usually caused by a hardware failure or a software defect. OPS
users need to pay attention only to this kind of message. The following
is an example:

    Thu May 14 18:42:31 1998 skgxndinfo: Unable to retrieve bootstrap info from status database at this moment.
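
The three-part layout described above can be split apart mechanically. The following Python sketch assumes that the timestamp always has the form seen in the example ("Thu May 14 18:42:31 1998") and that the function name is the last word before the first colon followed by a space. The function names parse_trace_message and is_error are hypothetical, and the trace file remains intended for support personnel.

```python
from datetime import datetime


def parse_trace_message(line):
    """Split a trace line into (timestamp, function, message)."""
    head, separator, message = line.partition(": ")
    if not separator:
        raise ValueError("no 'function: message' separator found")
    # The function name is the last word of the part before the separator.
    stamp, _, function = head.rpartition(" ")
    timestamp = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return timestamp, function, message.strip()


def is_error(message):
    """True if the message text contains the keyword ERROR."""
    return "ERROR" in message


if __name__ == "__main__":
    line = ("Thu May 14 18:42:31 1998 skgxndinfo: Unable to retrieve "
            "bootstrap info from status database at this moment.")
    timestamp, function, message = parse_trace_message(line)
    print(timestamp.isoformat(), function)
    print(message)
```

Note that strptime interprets the day and month names according to the current locale, so the sketch expects an English or C locale.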

Send a copy of the trace file when requested by HP support.

Reviewing Configuration Files

Review the following configuration files:

- Cluster configuration file /etc/cmcluster/cmclconf.asc.
- Package configuration files. For each package, the file is called
  /etc/cmcluster/package_name/package_nameconf.asc.

Reviewing the Package Control Script

Ensure that the package control script is found on all nodes
where the package can run and that the file is identical on all
nodes. Ensure that the script is executable on all nodes. Ensure
that the name of the control script appears in the package configuration
file, and ensure that all services named in the package configuration
file also appear in the package control script.

Information about the starting and halting of each package
is found in the package's control script log. This log provides
the history of the operation of the package control script. It is
found at /etc/cmcluster/package_name/control.sh.log.
This log documents all package run and halt activities. If you have
written separate run and halt scripts for a package, each
script will have its own log.

Using cmquerycl and cmcheckconf

In addition, cmquerycl and cmcheckconf
can be used to troubleshoot your cluster just as they were used
to verify its configuration. The following example shows the commands
used to verify the existing cluster configuration on node 1
and node 2:

    # cmquerycl -v -C /etc/cmcluster/verify.asc -n node1 -n node2
    # cmcheckconf -v -C /etc/cmcluster/verify.asc

The cmcheckconf command checks the following:

- The network addresses and connections.
- The cluster lock connectivity.
- The validity of configuration parameters of the cluster and
  packages for:
  - The existence and permission of scripts.

The cmcheckconf command does not check the following:

- The correct setup of the power circuits.
- The correctness of the package control script.

Using cmscancl

The command cmscancl
displays information about all the nodes in a cluster in a structured
report that allows you to compare such items as IP addresses or
subnets, physical volume names for disks, and other node-specific
items for all nodes in the cluster. cmscancl
actually runs several different HP-UX commands on all nodes and
gathers the output into a report on the node where you run the command.

The following are the types of configuration data that cmscancl
displays for each node:

Table 8-1 Data Displayed by the cmscancl Command

| Description | Source of Data |
|---|---|
| LAN device configuration and status | lanscan command |
| network status and interfaces | netstat command |
| file systems | mount command |
| LVM configuration | /etc/lvmtab file |
| LVM physical volume group data | /etc/lvmpvg file |
| link level connectivity for all links | linkloop command |
| binary configuration file | cmviewconf command |

Using cmviewconf

cmviewconf
allows you to examine the binary cluster configuration file, even
when the cluster is not running. The command displays the content
of this file on the node where you run the command.

Reviewing the LAN Configuration

The following networking commands can be used to diagnose
problems:

- netstat -in can be used to examine the LAN configuration. This
  command lists all IP addresses assigned to each LAN interface card.
- lanscan can also be used to examine the LAN configuration. This
  command lists the MAC addresses and status of all LAN interface
  cards on the node.
- arp -a can be used to check the arp tables.
- landiag is useful to display, diagnose, and reset LAN card
  information.
- linkloop verifies the communication between LAN cards at MAC
  address levels. For example, if you enter
  linkloop -i4 0x08000993AB72, you should see displayed the message
  "Link Connectivity to LAN station: 0x08000993AB72 -- OK"
- /usr/contrib/bin/cmgetconfig -f can be used to verify that Primary
  and Standby LANs are on the same bridged net.
- cmviewcl -v shows the status of primary and standby LANs.

Use these commands on all nodes.

Reviewing the Status of Shared Volume Groups

To display the current configuration of a shared volume group,
use the vgdisplay -v
command. An example is as follows:

    # vgdisplay -v /dev/vg_ops

The output includes a list of all volume groups, together
with the logical volumes configured in them and all the physical
volumes associated with them. Physical volume groups are also included.

Using DLM Diagnostic Tools (OPS 7.3.x)

MC/LockManager software includes a group of diagnostic tools
that may be helpful in troubleshooting. Use these tools in cooperation
with your HP representative or technical consultant. Refer also
to the man page for each command.

dlmdump

dlmdump is a tool that
dumps DLM-related memory structures. dlmdump
is used for debugging purposes. It allows the user to obtain a snapshot
of an object given its handle. Since dlmdump
only provides a "snapshot" of the current state, what is displayed
might not be completely consistent with the actual data.

dlmstat

dlmstat is a tool that
tracks DLM-related statistical information. dlmstat
is used to acquire statistics for the process, resource, instance,
and cluster objects from the DLM database. For example:

    dlmstat -i -q -t +1 -n 10

Core Dump Locations

Core dumps for the cmcld and cmlvmd
daemons are produced in the /var/adm/cmcluster
and /etc/lvmconf directories, respectively.
The DLM daemons (OPS 7.3.x) deposit dumps in the cores
subdirectory of the dlm home directory.