ServiceGuard Frequently Asked Questions


Functionality

Heartbeat

What happens when a heartbeat is missed?

A missed heartbeat is detected when a heartbeat interval elapses and no heartbeat has been received. When no heartbeat is received during a NODE_TIMEOUT interval, the current cluster coordinator initiates a cluster reformation. If the heartbeat interval is one second and the node timeout interval is two seconds, it takes two consecutive missed heartbeats to cause the node to be timed out and a cluster reformation to start. Cluster reformation involves informing all nodes of the reformation (including the node that missed the heartbeat), voting for a new cluster coordinator, and reforming the cluster (the new cluster is based on the nodes that responded during the reformation). If the node that timed out is able to respond during the reformation, the reformation ends with the same number of nodes in the cluster and your packages are not affected.
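The relationship between the two intervals can be sketched as a small shell helper. This is only an illustration, not ServiceGuard's own logic: the function name is hypothetical, and it assumes both values are given in the same unit (microseconds are used here).

```shell
#!/bin/sh
# Hypothetical helper: how many consecutive heartbeats fall due (and are
# missed) within one NODE_TIMEOUT window.
# Arguments: node timeout, heartbeat interval (same unit; microseconds assumed).
missed_heartbeats() {
    node_timeout=$1
    heartbeat_interval=$2
    # Integer division: only heartbeats that were fully due count as missed.
    echo $(( node_timeout / heartbeat_interval ))
}

# The example from the text: 1-second heartbeat, 2-second node timeout.
missed_heartbeats 2000000 1000000
```

With the values from the text, the helper prints 2, matching the two consecutive missed heartbeats described above.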

How long does it take to perform a cluster reformation?

Cluster reformations can be extremely fast if all nodes join in the message sending for the reformation (for example, if the cause of the reformation was just a missed heartbeat on a busy network). However, if a node actually has failed, it can take up to a minute to time that node out and form the new cluster. The approximate amount of time a cluster reformation takes can be seen with "cmquerycl": the output gives the cluster reformation time next to the disks that can be used as the cluster lock.

Are packages halted during a cluster reformation?

Packages on healthy nodes are not halted during a cluster reformation. The applications are not affected. Only the packages on nodes which have exited the cluster (because they have failed) are moved to a healthy node.

Does cluster reformation occur during a LAN card failure, since the failure is not detected for 2 seconds and potentially two heartbeats could be missed?

Yes. A cluster will be reformed with the same number of nodes if a LAN card failover (known as a local failover) occurs. The reformation would be very fast (a maximum of 2 seconds after the LAN card failover completes, for a total of around 10 seconds).

 

D-Series Support Issues

Why is there only 2 node cluster support on ServiceGuard 10.03?

The configuration guide states that only two D-Class systems are supported in a cluster. However, this does not mean that only a 2 node cluster can be configured with D-Class systems. A 4 node cluster can be configured but only 2 D-Class systems can be configured into the cluster. The reason for this is that there were not enough testing resources to ensure the viability of a greater than 2 D-Class system cluster at release time.

ServiceGuard 10.04 supports 4 D-Class systems in a cluster (and you may have up to 8 nodes in a cluster).

Are Services and Subnets required parts of packages?

Can you comment out the SERVICE_CMD[], SERVICE_RESTARTS, etc. from the package control script?

Yes! This configuration is supported. You are not required to have a service or a subnet.

Does it make more sense to have a daemon spawned out of inittab or spawned by ServiceGuard as a service?

It makes more sense to move the daemon process from inittab and make it a service. ServiceGuard will (just like init) monitor the Process ID. If it fails, it can be restarted on the same node without moving the package. This is a new feature in ServiceGuard 10.03. This is the type of application for which a service was designed.

If you don't do this, how will the inittab be handled after a failover? Do you have the process in the inittab for both the primary machine and the failover machine?

Either way, you are in a supported configuration. What the customer puts in the ServiceGuard package script is up to the customer. You don't need services or subnets. You could just have a simple exit 0 in the package and you would be supported.

 

LVM configuration changes during exclusive activation

Suppose I want to extend the mirroring (2-way to 3-way) or I want to create a new logical volume. How will the other node(s) know about the changes? Do I have to bring down the cluster? Do I need to import the VG again?

You only need to vgimport again if you add or remove an LV, or add or remove a PV. The reason for this is that the device files on the other nodes need to be updated. If you only change the traits of the LV (mirroring, bad block relocation, etc.), you don't need to vgimport.

In either case, you don't need to bring the cluster down.

For example:

VG01 is activated (exclusive) on node1. Node2 has the VG imported in case the package that contains VG01 fails. On node1, we add a new LV to the VG by running lvcreate -L 100 /dev/vg01. We do this while the cluster is up and while the VG is activated exclusively on node1. This creates the new LV device files /dev/vg01/lvol5 and /dev/vg01/rlvol5 on node1 only.

Node2 has no idea that lvol5 was created. You need to go to node2 and vgexport /dev/vg01, and then vgimport /dev/vg01. You can do this while the VG is still exclusively activated on node1. It is not necessary to activate the volume group on node2.
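The rule and the node2 steps above can be sketched as a dry-run helper. This is a minimal sketch: the function names and change-type labels are hypothetical, and the script only prints the two commands named in the text. On a real system vgimport normally needs further arguments (such as the physical volume paths), which are not spelled out here, so check vgimport(1M) before running anything.

```shell
#!/bin/sh
# Hypothetical helper: does this kind of LVM change require a
# vgexport/vgimport on the other node(s)?
needs_reimport() {
    case $1 in
        add_lv|remove_lv|add_pv|remove_pv) return 0 ;;  # device files change
        *) return 1 ;;                                  # LV traits only
    esac
}

# Dry run: print the steps to carry out on the standby node.
resync_steps() {
    vg=$1
    echo "vgexport $vg"
    echo "vgimport $vg"
}

# The example from the text: a new LV was created in /dev/vg01 on node1.
if needs_reimport add_lv; then
    resync_steps /dev/vg01
fi
```

Printing the commands instead of running them keeps the sketch safe to try on any machine; the cluster stays up in either case.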

 

JFS versus RAW for Database Access

Why would someone use JFS rather than raw disk access for a database application?

The real issue here is from a performance standpoint, do you want raw or filesystem access? JFS adds another layer to the I/O path. JFS is going to be slower than raw I/O. Yes, JFS has a log, but that is a log for the file system's structure. The database would still need to be recovered after the fsck of JFS completed.

With LVM, you can use raw I/O. An LV does not need to have a file system on it. So, you can do all the usual striping and mirroring with LVM and still have raw I/O (the database opens /dev/vgXX/rlvolX).

However, there are many reasons customers use file systems. Namely, backup is a lot easier with file systems. Also, JFS's online backup is great. Finally, many people like seeing their data (via bdf, etc).


Hardware


Model 10/20 (Nike) support

The S800 Configuration Guide covers how to set up a Model 10/20 in an MC/ServiceGuard two node configuration.

The following diagram shows how to connect a Nike array to three nodes using the "V" cable:

       -----                    ------                   -------
       |   |                    |    |                   |     |
       |n1 |                    | n2 |                   | n3  |
       |   |                    |    |                   |     |
       |F F|                    |F  F|                   |F   F|
       |W W|                    |W  W|                   |W   W|
       -----                    ------     V Cables      -------
        \  \                     ^  ^
         \  \                   / \/\                    /   /
          \  \--------+--------/--/\ \----------------- /---/
           \          |          /  \                      /
            \------+--|---------/    \--------------------/
                 -------
                 | S  S|
                 | P  P|
                 |     |
                 |     |
                 -------

Two Nike arrays connected to three nodes would look like:

       -----                    ------                   -------
       |   |                    |    |                   |     |
       |n1 |                    | n2 |                   | n3  |
       |   |                    |    |                   |     |
       |F F|                    |F  F|                   |F   F|
       |W W|                    |W  W|                   |W   W|
       -----                    ------     V Cables      -------
        \  \                     ^  ^
         \  \                   / \/\                    /   /
FWBUS #1  \  \--------+--------/--/\ \---------+------- /---/
           \          |          /  \          |           /
FWBUS #2    \------+--|---------/    \------+--|----------/
                 -------                  ---------
                 | S  S|                  | S  S  |
                 | P  P|                  | P  P  |
                 | A  B|                  | A  B  |
                 |     |                  |       |
                 -------                  ---------
                 Nike#1                    Nike#2

On FWBUS #1:

        N1 FWD      = 1 1/2
      + N2 FWD      = 1 1/2
      + N3 FWD      = 1 1/2
      + Nike#1 SPB  = 2
      + Nike#2 SPB  = 2
      =====================
                      8 1/2   performance load factor

which is within the limit of 11 1/2.

Given the above example, it is legal to put three Nike SPs on one bus. As you get closer to the 11 1/2 limit, you become more likely to see performance issues. One other point: you can put Nike#2 on either side of node 2. It makes no difference to FW-SCSI.
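The load factor check can be sketched as a small shell helper. This is only an illustration: the function names are hypothetical, and it assumes, as in the example, a factor of 1 1/2 per FWD host adapter and 2 per Nike SP on the bus. Values are kept in halves so that plain integer arithmetic is enough.

```shell
#!/bin/sh
# Load factors are counted in halves:
#   1 1/2 per FWD host adapter = 3 halves
#   2     per Nike SP          = 4 halves
LIMIT_HALVES=23   # the 11 1/2 limit, in halves

# Hypothetical helper: total load on one FW-SCSI bus, in halves.
# Arguments: number of host adapters, number of Nike SPs on that bus.
bus_load_halves() {
    adapters=$1
    sps=$2
    echo $(( adapters * 3 + sps * 4 ))
}

# Succeeds when the bus load does not exceed the limit.
bus_within_limit() {
    [ "$(bus_load_halves "$1" "$2")" -le "$LIMIT_HALVES" ]
}

# Three host adapters and three Nike SPs on one bus.
if bus_within_limit 3 3; then
    echo "within limit"
else
    echo "over limit"
fi
```

Dividing the result of bus_load_halves by two gives the load factor in the units used in the text.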

Choosing between Single and Dual Attached FDDI

From an HA point of view, is dual attached or single attached FDDI a better choice? If I have dual attached FDDI, do I only need one card per system?

Both dual attached and single attached FDDI cards are valid HA solutions. With a dual attached FDDI card, the card itself provides local failover to a second ring in the event of ring, cable or concentrator failure. Therefore, the only failure point is the card itself. ServiceGuard will monitor the card and fail a package to the backup machine in the event the card (and therefore the subnet) fails. Therefore, a single dual attached FDDI is a valid configuration with ServiceGuard. Again, this failover would result in downtime to the application, whereas local failover is transparent to the application.

Single attached FDDI provides only one ring; therefore, with single attached FDDI, there must be two cards on each system. But with two LAN cards per system, local failover is transparent to the application and causes no downtime.

In addition, some customers have chosen two dual attached FDDIs per system, getting both ring and card redundancy.

The tradeoff is the cost of the slot versus the downtime if the FDDI card fails.

We recommend two single attached FDDI cards as the first choice, and a single dual attached FDDI card as the second alternative.

The configuration should be as follows:

 

  1. Use the ethernet port you have on the system by default for dedicated heartbeat traffic for ServiceGuard.

     

  2. Use single attached FDDI for the database traffic. Also send heartbeat over this network.

For example:


       -----    ethernet        ------
       |  E| -----------------  |E   |
       |n1 |    heartbeat       | n2 |
       |   |                    |    |
       |F F|                    |F  F|
       |D D|                    |D  D|
       -----                    -----
        | |   database/heartb    |  |
        |  \                    /   |
        |   ------- C ---------     |
        |                           |
        \     standby db/heartb    /
         -----------C -------------
In this example, you don't need to buy any additional Ethernet cards. In fact, your Ethernet LAN is just a single cable.

Again, you could do the same thing using a dual attached FDDI and only need the one card on each system.


Troubleshooting


cmquerycl problems

cmquerycl could have problems if the following symptoms are seen:

cmquerycl only shows some of the nodes that have ServiceGuard installed. If this is the case, go to the systems that are not shown with cmquerycl and ensure that the /etc/inetd.conf has the entries for /usr/lbin/cmclconfd. You can look at a system that is working properly to determine what the inetd.conf should have in it. After fixing the inetd.conf file, issue the "inetd -c" command to have inetd reread the configuration.
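The inetd.conf check described above can be sketched as follows. This is a minimal sketch: the function name is hypothetical, and it only tests whether an uncommented line mentions cmclconfd, not whether the entry is complete. Compare against a working system for the exact entries.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given inetd configuration file has
# at least one non-comment line that mentions cmclconfd.
has_cmclconfd() {
    conf=${1:-/etc/inetd.conf}
    grep -v '^[[:space:]]*#' "$conf" 2>/dev/null | grep -q 'cmclconfd'
}

if has_cmclconfd /etc/inetd.conf; then
    echo "cmclconfd entry present"
else
    echo "cmclconfd entry missing: fix inetd.conf, then run: inetd -c"
fi
```

Run it on each node that cmquerycl does not show.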

 

cmquerycl gets back "The physical volume with name VGNAME on node NODENAME1 cannot be found on node NODENAME2". This is generally caused by a mismatch of the physical volumes (disks) in the volume group. ServiceGuard looks at every disk in the system, regardless of its future use in a package or as a cluster lock disk. Every disk is checked to see what volume group it is in, whether that volume group is connected to more than one node, etc. The LVM configuration on all the nodes in the cluster must be consistent. ServiceGuard uses the PV_ID (Physical Volume Identifier) to determine unique disks.

Basically, SG wants all of its LVM information to be correct. Therefore, it scans all the VGs on all systems, even if some of the VGs are connected to only one of the systems. This is important if, for example, you really wanted the VG connected to both systems, but didn't get it right. You want to know what the configuration looks like, even for non-shared VGs.

 

Someone dd'ed the root disk of one system to another disk, and a second system is now booting off that disk, i.e., the PV_ID (Physical Volume Identifier) is the same on both disks. It is illegal to copy disks around like that, and SG definitely gets confused. The only solution is to re-install one of the systems. This will cause the VG to be recreated, and the disk to have a new PV_ID.

 

The /etc/lvmtab file also holds the PV_ID (it is among the unreadable characters you see when you run strings /etc/lvmtab). Try moving /etc/lvmtab to /etc/lvmtab.back (mv /etc/lvmtab /etc/lvmtab.back), and then running vgscan or vgimport as necessary on the two systems.

 


What happens when ...


Which node wins when all networks fail?

With MC/ServiceGuard or MC/LockManager, if the networking between the two nodes fails, only one node can continue to be in the cluster. In this case, the cluster lock disk is used as a tie breaker. There is an equal chance (50/50) that a given node will win the cluster lock and remain in the cluster. The other node will TOC. This is true even if it was the network adapter cards on node1 which failed. In this case, node1 would be shut out from the outside world, but node2 would also be shut out from node1. To node2, this appears to be a networking failure, and the cluster lock is used to break the tie. Again, there is a 50/50 chance that the cluster lock will be won by node1 (the system with the failed LAN card).

With the A.10.04 release of ServiceGuard, a new feature allows an RS-232 serial cable connected between two nodes to be used as a backup for heartbeat communication and to improve the detection of LAN failures. With the serial link in place, the node that had the LAN card failure does not immediately go for the cluster lock, which allows the healthy node to get the cluster lock first.