Cluster reformations can be extremely fast if all nodes take part in the message exchange for the reformation (for example, if the cause of the reformation was just a missed heartbeat on a busy network). However, if a node actually has failed, it can take up to a minute to time that node out and form the new cluster. The approximate time a cluster reformation takes can be seen with "cmquerycl", which reports the cluster reformation time next to each disk that can be used as the cluster lock.
The configuration guide states that only two D-Class systems are supported in a cluster. However, this does not mean that only a 2-node cluster can be configured with D-Class systems. A 4-node cluster can be configured, but only 2 of its nodes may be D-Class systems. The reason is that, at release time, there were not enough testing resources to verify a cluster with more than 2 D-Class systems.
ServiceGuard 10.4 supports 4 D-Class systems in the cluster (but you may have up to 8 nodes in a cluster).
Yes! This configuration is supported. You are not required to have a service or a subnet.
It makes more sense to move the daemon process out of inittab and make it a service. ServiceGuard will (just like init) monitor the process ID. If the process fails, it can be restarted on the same node without moving the package; this is a new feature in ServiceGuard 10.03. This is the type of application for which services were designed.
Either way, you are in a supported configuration. What the customer puts in the ServiceGuard package script is up to the customer. You don't need services or subnets. You could just have a simple exit 0 in the package and you would be supported.
You only need to vgimport again if you add or remove an LV, or add or remove a PV. The reason for this is that the device files on the other nodes need to be updated. If you only change the traits of the LV (mirroring, bad block relocation, etc.), you don't need to vgimport.
In either case, you don't need to bring the cluster down.
For example:
VG01 is activated (exclusive) on node1. Node2 has the VG imported in case the package that contains VG01 fails. On node1, we add a new LV to the VG with lvcreate -L 100 /dev/vg01. We do this while the cluster is up and while the VG is activated exclusively on node1. This creates the new LV device files /dev/vg01/lvol5 and /dev/vg01/rlvol5 on node1 only.
Node2 has no idea that lvol5 was created. You need to go to node2 and run vgexport /dev/vg01, then vgimport /dev/vg01. You can do this while the VG is still exclusively activated on node1; it is not necessary to activate the volume group on node2.
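One way to check whether node2 is out of date is to compare the device file listings of the volume group directory on both nodes. The sketch below is only an illustration, not a ServiceGuard or LVM tool: the function name missing_on_standby and the listing files are hypothetical, and each listing is assumed to be the saved output of ls /dev/vg01 from one node. It prints the entries that exist on node1 but not on node2; any output means node2 needs the vgexport and vgimport described above.

```shell
#!/bin/sh
# Hypothetical helper: print entries from the active node's listing ($1)
# that are absent from the standby node's listing ($2).
# Each file is assumed to hold the output of "ls /dev/vg01" on one node.
missing_on_standby() {
    # -x: whole-line match, -F: fixed strings, -v: invert, -f: patterns from file
    grep -vxF -f "$2" "$1"
}

# Demo with made-up listings that mirror the example (lvol5 added on node1).
tmp=$(mktemp -d)
printf '%s\n' group lvol1 rlvol1 lvol5 rlvol5 > "$tmp/node1.list"
printf '%s\n' group lvol1 rlvol1 > "$tmp/node2.list"

stale=$(missing_on_standby "$tmp/node1.list" "$tmp/node2.list")
if [ -n "$stale" ]; then
    # Unquoted on purpose so the names are joined onto one line.
    echo "node2 is missing:" $stale
    echo "run on node2: vgexport /dev/vg01, then vgimport /dev/vg01"
fi
rm -rf "$tmp"
```

The script only reports the difference and prints a reminder; it does not run vgexport or vgimport itself, so it is safe to try on any system.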
The real issue here is performance: do you want raw or file system access? JFS adds another layer to the I/O path, so it is going to be slower than raw I/O. Yes, JFS has a log, but that log covers only the file system's structure. The database would still need to be recovered after the fsck of JFS completed.
With LVM, you can use raw I/O. An LV does not need to have a file system on it, so you can do all the usual striping and mirroring with LVM and still have raw I/O (the database opens /dev/vgXX/rlvolX).
However, there are many reasons customers use file systems. Backup is a lot easier with file systems, and JFS's online backup is great. Finally, many people like seeing their data (via bdf, etc.).
The following diagram shows how to connect a Nike array to three nodes using the "V" cable:
-----                    ------                    -------
|   |                    |    |                    |     |
|n1 |                    | n2 |                    | n3  |
|   |                    |    |                    |     |
|F F|                    |F  F|                    |F   F|
|W W|                    |W  W|                    |W   W|
-----                    ------    V Cables        -------
 \ \                      ^  ^                      /   /
  \ \                    / \/ \                    /   /
   \ \---------+--------/--/\  \------------------/---/
    \          |       /     \                   /
     \------+--|------/       \-----------------/
          --------
          | S  S |
          | P  P |
          |      |
          |      |
          --------
Two Nike arrays connected to three nodes would look like:
          -----                    ------                    -------
          |   |                    |    |                    |     |
          |n1 |                    | n2 |                    | n3  |
          |   |                    |    |                    |     |
          |F F|                    |F  F|                    |F   F|
          |W W|                    |W  W|                    |W   W|
          -----                    ------    V Cables        -------
           \ \                      ^  ^                      /   /
            \ \                    / \/ \                    /   /
FWBUS #1     \ \---------+--------/--/\  \--------+---------/---/
              \          |       /     \          |        /
FWBUS #2       \------+--|------/       \------+--|-------/
                    --------                 --------
                    | S  S |                 | S  S |
                    | P  P |                 | P  P |
                    | A  B |                 | A  B |
                    |      |                 |      |
                    --------                 --------
                     Nike#1                   Nike#2
On FWBUS #1:    N1 FWD      = 1 1/2
              + N2 FWD      = 1 1/2
              + N3 FWD      = 1 1/2
              + Nike#1 SPB  = 2
              + Nike#2 SPB  = 2
              =====================
                              8 1/2 performance load factor, which is
                              within the limit of 11 1/2.
By the same arithmetic, it is legal to put three Nike SPs on one bus: three node adapters at 1 1/2 each plus three SPs at 2 each come to 10 1/2. As you get closer to the 11 1/2 limit, you get closer and closer to performance issues. One other point: you can put Nike#2 on either side of node #2. It makes no difference to FW-SCSI.
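The bus load check is simple enough to script. The following is a minimal sketch that uses only the factors quoted in the text (1 1/2 per node adapter, 2 per Nike SP, limit 11 1/2); the function and variable names are made up for illustration. Values are held in half-units so that plain integer shell arithmetic is enough.

```shell
#!/bin/sh
# Load factors in half-units (assumption made for integer arithmetic only):
# a node adapter (FWD) counts 1 1/2 -> 3 halves, a Nike SP counts 2 -> 4 halves.
FWD=3
SP=4
LIMIT=23    # 11 1/2 expressed in halves

# Sum any number of half-unit load factors.
bus_load() {
    total=0
    for f in "$@"; do
        total=$((total + f))
    done
    echo "$total"
}

# Render a half-unit count the way the text writes it (whole part, then 1/2).
show() {
    if [ $(( $1 % 2 )) -eq 0 ]; then
        echo "$(( $1 / 2 ))"
    else
        echo "$(( $1 / 2 )) 1/2"
    fi
}

# Three nodes with two SPs on the bus, then the same bus with a third SP.
two_sps=$(bus_load $FWD $FWD $FWD $SP $SP)
three_sps=$(bus_load $FWD $FWD $FWD $SP $SP $SP)

for load in $two_sps $three_sps; do
    if [ "$load" -le "$LIMIT" ]; then
        echo "$(show "$load") is within the limit of $(show "$LIMIT")"
    else
        echo "$(show "$load") exceeds the limit of $(show "$LIMIT")"
    fi
done
```

To test another layout, pass one factor per device on the bus to bus_load, for example a fourth node adapter or a fourth SP.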
Both dual attached and single attached FDDI cards are valid HA solutions. With a dual attached FDDI card, the card itself provides local failover to a second ring in the event of a ring, cable, or concentrator failure, so the only remaining failure point is the card itself. ServiceGuard will monitor the card and fail a package over to the backup machine if the card (and therefore the subnet) fails. A single dual attached FDDI card is therefore a valid configuration with ServiceGuard. Note, however, that failing the package over to the backup machine causes downtime for the application, whereas local failover is transparent to the application.
Single attached FDDI provides only one ring, so with single attached FDDI there must be two cards on each system. With two LAN cards per system, local failover is transparent to the application and causes no downtime.
In addition, some customers have chosen two dual attached FDDIs per system, getting both ring and card redundancy.
The tradeoff is the cost of the slot versus the downtime if the FDDI card fails.
We recommend two single attached FDDI cards as the first choice, and a dual attached FDDI card as the second alternative.
The configuration should be as follows:
For example:
-----        ethernet         ------
|  E| ----------------------- |E   |
|n1 |        heartbeat        | n2 |
|   |                         |    |
|F F|                         |F  F|
|D D|                         |D  D|
-----                         ------
 | |      database/heartb      |  |
 |  \                         /   |
 |   ----------- C -----------    |
 |                                |
  \      standby db/heartb       /
   ------------- C --------------
In this example, you don't need to buy any additional ethernet
cards. As a matter of fact, your ethernet lan is just a single
cable.
Again, you could do the same thing using a dual attached FDDI and only need the one card on each system.
The following symptoms indicate that cmquerycl is having problems:
cmquerycl shows only some of the nodes that have ServiceGuard installed. If this is the case, go to the systems that cmquerycl does not show and ensure that /etc/inetd.conf has the entries for /usr/lbin/cmclconfd. You can look at a system that is working properly to determine what inetd.conf should contain. After fixing the inetd.conf file, issue the "inetd -c" command to have inetd reread the configuration.
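The inetd.conf check for this symptom can be scripted on each node that cmquerycl does not show. This is a rough sketch only: the helper name has_cmclconfd is hypothetical, and it tests nothing more than whether a non-comment line mentions /usr/lbin/cmclconfd. It does not validate the entries themselves, so still compare them against a working node.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given inetd configuration file has at
# least one active (non-comment) line that references /usr/lbin/cmclconfd.
has_cmclconfd() {
    [ -r "$1" ] || return 2
    grep -v '^[[:space:]]*#' "$1" | grep -q '/usr/lbin/cmclconfd'
}

# Check the file named on the command line, or /etc/inetd.conf by default.
conf=${1:-/etc/inetd.conf}
if has_cmclconfd "$conf"; then
    echo "$conf: cmclconfd entry found"
else
    echo "$conf: no active cmclconfd entry (or file unreadable)"
    echo "copy the entries from a working node, then run: inetd -c"
fi
```

The script never edits inetd.conf and never runs inetd -c itself; it only tells you whether the manual fix described above is needed.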
cmquerycl gets back "The physical volume with name VGNAME on node NODENAME1 cannot be found on node NODENAME2". This is generally caused by a mismatch of the physical volumes (disks) in the volume group. ServiceGuard looks at every disk in the system, regardless of its future use in a package or as a cluster lock disk. Every disk is checked to see what volume group it is in, whether that volume group is connected to more than one node, etc. The LVM configuration on all the nodes in the cluster must be consistent. ServiceGuard uses the PV_ID (Physical Volume Identifier) to determine unique disks.
Basically, SG wants all of its LVM information to be correct. Therefore, we go out and scan all the VGs on all systems, even if some of the VGs are connected to only one of the systems. This is important if, for example, you really wanted the VG connected to both systems, but didn't get it right. You want to know what the configuration looks like, even for non-shared VGs.
One known cause: someone dd'ed the root disk of one system to another disk, and a second system is now booting off that copy, so the PV_ID (Physical Volume Identifier) is the same on both disks. It is illegal to copy disks around like that, and SG definitely gets confused. The only solution is to re-install one of the systems. This will cause the VG to be recreated, and the disk to have a new PV_ID.
The /etc/lvmtab file also holds the PV_ID (it is among the funny characters you see when you run strings /etc/lvmtab). Try moving /etc/lvmtab to /etc/lvmtab.back, and then running vgscan or vgimport as necessary on the two systems.
With MC/ServiceGuard or MC/LockManager, if the networking between
the two nodes fails, only one node can continue to be in the cluster.
In this case, the cluster lock disk is used as a tie breaker. There
is an equal chance (50/50) that a given node will win the cluster lock
and remain in the cluster. The other node will TOC (Transfer of
Control, that is, it is forcibly reset).
This is true even if it was the network adapter cards on node1 which
failed. In this case, node1 would be shut out from the outside world,
but node2 would also be shut out from node1. To node2, this appears
to be a networking failure, and the cluster lock is used to break the
tie. Again, there is a 50/50 chance that the cluster lock will be
won by node1 (the system with the failed LAN card).
With the A.10.04 release of ServiceGuard, a new feature was added that
allows an RS-232 serial cable connected between two nodes to be used
as a backup for heartbeat communication and to improve the detection of
LAN failures, so that the node which had the LAN card failure would not
immediately go for the cluster lock, thereby allowing the healthy node
to get the cluster lock first.