Managing Serviceguard Version A.11.16, Eleventh Edition, Second Printing > Chapter 4 Planning and Documenting an HA Cluster

Cluster Configuration Planning

A cluster should be designed to provide the quickest possible recovery from failures. The actual time required to recover from a failure depends on several factors:

  • The length of the cluster heartbeat interval and node timeout. Each should be set as short as practical, but not shorter than 1,000,000 microseconds (one second) and 2,000,000 microseconds (two seconds), respectively. The recommended value for heartbeat interval is 1,000,000 microseconds (one second), and the recommended value for node timeout is within the 5 to 8 second range (5,000,000 to 8,000,000 microseconds).

  • The design of the run and halt instructions in the package control script. They should be written for fast execution.

  • The availability of raw disk access. Applications that use raw disk access should be designed with crash recovery services.

  • The application and database recovery time. They should be designed for the shortest recovery time.

In addition, you must provide consistency across the cluster so that:

  • User names are the same on all nodes.

  • UIDs are the same on all nodes.

  • GIDs are the same on all nodes.

  • Applications in the system area are the same on all nodes.

  • System time is consistent across the cluster.

  • Files that could be used by more than one node, such as /usr files, must be the same on all nodes.

The Serviceguard Extension for Faster Failover is a purchased product that can optimize failover time for certain two-node clusters. The clusters must be configured to meet certain requirements. When installed, the product is enabled by a parameter in the cluster configuration file. Release Notes for the product are posted at http://docs.hp.com -> high availability.

Heartbeat Subnet and Re-formation Time

The speed of cluster re-formation is partially dependent on the type of heartbeat network that is used. Ethernet results in a slower failover time than the other types. If two or more heartbeat subnets are used, the one with the fastest failover time is used.

Cluster Lock Information

The purpose of the cluster lock is to ensure that only one new cluster is formed in the event that exactly half of the previously clustered nodes try to form a new cluster. It is critical that only one new cluster is formed and that it alone has access to the disks specified in its packages. You can specify either a lock disk or a quorum server as the cluster lock.

A one-node cluster does not require a lock. Two-node clusters require a cluster lock, and a lock is recommended for larger clusters as well. Clusters with more than 4 nodes can use only a quorum server as the cluster lock.
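The lock rules in this section can be summarized as a lookup by node count. The Python sketch below is illustrative only: the function name and the returned strings are hypothetical and are not part of Serviceguard.

```python
def cluster_lock_requirement(node_count):
    """Summarize the cluster lock rule for a cluster of node_count nodes."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count == 1:
        return "no lock required"
    if node_count == 2:
        return "lock required: lock disk or quorum server"
    if node_count <= 4:
        return "lock recommended: lock disk or quorum server"
    # More than 4 nodes: a lock disk is not allowed.
    return "lock recommended: quorum server only"


for n in (1, 2, 4, 8):
    print(n, cluster_lock_requirement(n))
```

If a cluster is expected to grow past 4 nodes, checking the planned final node count rather than the initial one shows early on that a quorum server is the only option.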

Cluster Lock Disk and Re-formation Time

If you are using a lock disk, the acquisition of the cluster lock disk takes different amounts of time depending on the disk I/O interface that is used. After all the disk hardware is configured, but before configuring the cluster, you can use the cmquerycl command specifying all the nodes in the cluster to display a list of available disks and the re-formation time associated with each. Example:

cmquerycl -v -n ftsys9 -n ftsys10 
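For clusters with more nodes, the same command line can be generated from a node list. The following Python sketch only assembles the arguments shown in the example above (-v and one -n per node); the helper name is hypothetical, and the sketch does not run the command or make any claim about its output.

```python
import shlex


def cmquerycl_command(nodes):
    """Build the cmquerycl argument list for the given node names."""
    if not nodes:
        raise ValueError("specify at least one node")
    argv = ["cmquerycl", "-v"]
    for node in nodes:
        argv += ["-n", node]  # one -n option per node, as in the example
    return argv


print(shlex.join(cmquerycl_command(["ftsys9", "ftsys10"])))
# prints: cmquerycl -v -n ftsys9 -n ftsys10
```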

Alternatively, you can use SAM to display a list of cluster lock physical volumes, including the re-formation time.

By default, Serviceguard selects the disk with the fastest re-formation time. But you may need to choose a different disk because of power considerations. Remember that the cluster lock disk should be separately powered, if possible.

Cluster Lock Disks and Planning for Expansion

You can add additional cluster nodes after the cluster is up and running, but doing so without bringing down the cluster requires you to follow some rules. Recall that a cluster with more than 4 nodes may not have a lock disk. Thus, if you plan to add enough nodes to bring the total to more than 4, you should use a quorum server.

Cluster Configuration Parameters

For the operation of the cluster manager, you need to define a set of cluster parameters. These are stored in the binary cluster configuration file, which is located on all nodes in the cluster. These parameters can be entered by editing the cluster configuration template file created by issuing the cmquerycl command, as described in the chapter “Building an HA Cluster Configuration.” The parameter names given below are the names that appear in the cluster ASCII configuration file.

The following parameters must be identified:

CLUSTER_NAME

The name of the cluster as it will appear in the output of cmviewcl and other commands, and as it appears in the cluster configuration file.

The cluster name must not contain any of the following characters: space, slash (/), backslash (\), and asterisk (*). All other characters are legal. The cluster name can contain up to 39 characters.
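The naming rules above are easy to check before editing the configuration file. The following Python sketch is a hypothetical pre-check, not a Serviceguard tool; it applies only the two rules stated here (forbidden characters and the 39-character limit).

```python
FORBIDDEN_CHARS = set(" /\\*")   # space, slash, backslash, asterisk
MAX_CLUSTER_NAME_LEN = 39


def check_cluster_name(name):
    """Return a list of problems found in a proposed cluster name."""
    problems = []
    if len(name) > MAX_CLUSTER_NAME_LEN:
        problems.append("longer than %d characters" % MAX_CLUSTER_NAME_LEN)
    bad = sorted(set(name) & FORBIDDEN_CHARS)
    if bad:
        problems.append("contains forbidden characters: %r" % bad)
    return problems


print(check_cluster_name("ourcluster"))    # no problems reported
print(check_cluster_name("our cluster/1"))  # reports the space and the slash
```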

QS_HOST

The name or IP address of a host system outside the current cluster that is providing quorum server functionality. This parameter is only used when you employ a quorum server for tie-breaking services in the cluster.

QS_POLLING_INTERVAL

The time (in microseconds) between attempts to contact the quorum server to make sure it is running. Default is 300,000,000 microseconds (5 minutes).

QS_TIMEOUT_EXTENSION

The quorum server timeout is the time during which the quorum server is not communicating with the cluster. After this time, the cluster will mark the quorum server DOWN. This time is calculated based on Serviceguard parameters, but you can increase it by adding an additional number of microseconds as an extension.

The QS_TIMEOUT_EXTENSION is an optional parameter.

FIRST_CLUSTER_LOCK_VG, SECOND_CLUSTER_LOCK_VG

The volume group containing the physical disk volume on which a cluster lock is written. Identifying a cluster lock volume group is essential in a two-node cluster. If you are creating two cluster locks, enter the volume group name or names for both locks. This parameter is only used when you employ a lock disk for tie-breaking services in the cluster.

Use FIRST_CLUSTER_LOCK_VG for the first lock volume group. If there is a second lock volume group, the parameter SECOND_CLUSTER_LOCK_VG is included in the file on a separate line.

NOTE: Lock volume groups must also be defined in VOLUME_GROUP parameters in the cluster ASCII configuration file.

NODE_NAME

The hostname of each system that will be a node in the cluster. The node name can be up to 31 bytes long. The node name must not contain the full domain name. For example, enter ftsys9, not ftsys9.cup.hp.com.
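A similar pre-check can be sketched for node names. The Python example below is hypothetical; it assumes that a period in the name indicates a domain-qualified hostname and that the 31-byte limit is measured on the UTF-8 encoding of the name.

```python
MAX_NODE_NAME_BYTES = 31


def check_node_name(name):
    """Return a list of problems found in a proposed NODE_NAME value."""
    problems = []
    # Assumption: the byte length is taken from the UTF-8 encoding.
    if len(name.encode("utf-8")) > MAX_NODE_NAME_BYTES:
        problems.append("longer than %d bytes" % MAX_NODE_NAME_BYTES)
    # Assumption: a period means the full domain name was given.
    if "." in name:
        problems.append("use the short hostname, not the full domain name")
    return problems


print(check_node_name("ftsys9"))             # no problems reported
print(check_node_name("ftsys9.cup.hp.com"))  # reports the domain name
```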

NETWORK_INTERFACE

The name of each LAN that will be used for heartbeats or for user data. An example is lan0.

HEARTBEAT_IP

IP notation indicating the subnet that will carry the cluster heartbeat. Note that heartbeat IP addresses must be on the same subnet on each node. A heartbeat IP address can only be an IPv4 address.

If you will be using VERITAS CVM disk groups for storage, you can only use a single heartbeat subnet. In this case, the heartbeat should be configured with standby LANs or as a group of aggregated ports.

NOTE: The use of a private heartbeat network is not advisable if you plan to use Remote Procedure Call (RPC) protocols and services. RPC assumes that each network adapter device or I/O card is connected to a route-able network. An isolated or private heartbeat LAN is not route-able, and could cause an RPC request-reply, directed to that LAN, to risk time-out without being serviced.

NFS, NIS and NIS+, and CDE are examples of RPC based applications that are frequently used on HP-UX. Other third party and home-grown applications may also use RPC services directly through the RPC API libraries. If necessary, consult with the application vendor to confirm its usage of RPC.

STATIONARY_IP

The IP address of each monitored subnet that does not carry the cluster heartbeat. You can identify any number of subnets to be monitored. If you want to separate application data from heartbeat messages, define a monitored non-heartbeat subnet here.

A stationary IP address can be either an IPv4 or an IPv6 address. For more details of IPv6 address format, see the “IPv6 Address Types”

FIRST_CLUSTER_LOCK_PV, SECOND_CLUSTER_LOCK_PV

The name of the physical volume within the Lock Volume Group that will have the cluster lock written on it. This parameter is FIRST_CLUSTER_LOCK_PV for the first physical lock volume and SECOND_CLUSTER_LOCK_PV for the second physical lock volume. If there is a second physical lock volume, the parameter SECOND_CLUSTER_LOCK_PV is included in the file on a separate line. These parameters are only used when you employ a lock disk for tie-breaking services in the cluster.

Enter the physical volume name as it appears on both nodes in the cluster (the same physical volume may have a different name on each node). If you are creating two cluster locks, enter the physical volume names for both locks. The physical volume group identifier can contain up to 39 characters.

SERIAL_DEVICE_FILE

The name of the device file that corresponds to the serial (RS232) port that you have chosen on each node. Specify this parameter when you are using RS232 as a heartbeat line.

In the ASCII cluster configuration file, this parameter is SERIAL_DEVICE_FILE. The device file name can contain up to 39 characters.

HEARTBEAT_INTERVAL

The normal interval between the transmission of heartbeat messages from one node to the other in the cluster.

In the ASCII cluster configuration file, this parameter is HEARTBEAT_INTERVAL, and its value is entered in microseconds. In Serviceguard Manager, enter a number of seconds.

Default value is 1,000,000 microseconds; setting the parameter to a value less than the default is not recommended.

The default should be used where possible. The maximum value recommended is 15 seconds, and the maximum value supported is 30 seconds. This value should be no more than half the value of Node Timeout (below).

NODE_TIMEOUT

The time after which a node may decide that the other node has become unavailable and initiate cluster reformation. This parameter is entered in microseconds.

Default value is 2,000,000 microseconds in the ASCII file. Minimum is 2 * (Heartbeat Interval). The maximum recommended value for this parameter is 30,000,000 in the ASCII file, or 30 seconds in Serviceguard Manager. The default setting yields the fastest cluster reformations. However, the use of the default value increases the potential for spurious reformations due to momentary system hangs or network load spikes. For a significant portion of installations, a setting of 5,000,000 to 8,000,000 (5 to 8 seconds) is more appropriate.

The maximum value recommended is 30 seconds and the maximum value supported is 60 seconds.
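The limits given for HEARTBEAT_INTERVAL and NODE_TIMEOUT can be combined into one consistency check. The Python sketch below is a hypothetical planning aid, not part of Serviceguard: it treats the supported maximums and the 2 * (Heartbeat Interval) minimum as errors, and the recommended limits as warnings.

```python
US_PER_SECOND = 1_000_000


def check_timing(heartbeat_us, node_timeout_us):
    """Return (errors, warnings) for the two values, given in microseconds."""
    errors, warnings = [], []

    if heartbeat_us > 30 * US_PER_SECOND:
        errors.append("HEARTBEAT_INTERVAL exceeds the supported maximum (30 s)")
    elif heartbeat_us > 15 * US_PER_SECOND:
        warnings.append("HEARTBEAT_INTERVAL exceeds the recommended maximum (15 s)")
    if heartbeat_us < 1 * US_PER_SECOND:
        warnings.append("HEARTBEAT_INTERVAL below the 1 s default is not recommended")

    if node_timeout_us < 2 * heartbeat_us:
        errors.append("NODE_TIMEOUT is less than 2 * HEARTBEAT_INTERVAL")
    if node_timeout_us > 60 * US_PER_SECOND:
        errors.append("NODE_TIMEOUT exceeds the supported maximum (60 s)")
    elif node_timeout_us > 30 * US_PER_SECOND:
        warnings.append("NODE_TIMEOUT exceeds the recommended maximum (30 s)")

    return errors, warnings


print(check_timing(1_000_000, 2_000_000))  # the defaults: no errors, no warnings
print(check_timing(1_000_000, 1_500_000))  # node timeout too short: one error
```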

AUTO_START_TIMEOUT

The amount of time a node waits before it stops trying to join a cluster during automatic cluster startup. In the ASCII cluster configuration file, this parameter is AUTO_START_TIMEOUT. All nodes wait this amount of time for other nodes to begin startup before the cluster completes the operation. The time should be selected based on the slowest boot time in the cluster. Enter a value equal to the boot time of the slowest booting node minus the boot time of the fastest booting node plus 600 seconds (ten minutes).

Default is 600,000,000 microseconds in the ASCII file (600 seconds in Serviceguard Manager).
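The formula above can be worked through in a few lines. In the Python sketch below, the boot times are made-up example measurements and the function name is hypothetical; the result is converted to microseconds for the ASCII file.

```python
US_PER_SECOND = 1_000_000
BASE_WAIT_SECONDS = 600  # the ten minutes added by the formula above


def auto_start_timeout_us(boot_times_seconds):
    """Slowest boot time minus fastest boot time plus 600 s, in microseconds."""
    slowest = max(boot_times_seconds)
    fastest = min(boot_times_seconds)
    return (slowest - fastest + BASE_WAIT_SECONDS) * US_PER_SECOND


# Hypothetical measurements: one node boots in 240 s, the other in 420 s.
print(auto_start_timeout_us([240, 420]))  # prints 780000000
```

When all nodes boot in the same time, the formula reduces to the 600,000,000 microsecond default.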

NETWORK_POLLING_INTERVAL

The frequency at which the networks configured for Serviceguard are checked. In the ASCII cluster configuration file, this parameter is NETWORK_POLLING_INTERVAL.

Default is 2,000,000 microseconds in the ASCII file (2 seconds in Serviceguard Manager). Thus every 2 seconds, the network manager polls each network interface to make sure it can still send and receive information. Changing this value can affect how quickly a network failure is detected. The minimum value is 1,000,000 (1 second). The maximum value recommended is 15 seconds, and the maximum value supported is 30 seconds.

MAX_CONFIGURED_PACKAGES

This parameter sets the maximum number of packages that can be configured in the cluster. In the ASCII cluster configuration file, this parameter is known as MAX_CONFIGURED_PACKAGES.

Default is 0, which means that you must set this parameter if you want to use packages. The minimum value is 0, and the maximum value is 150. Set this parameter to a value that is high enough to accommodate a reasonable number of future package additions without the need to bring down the cluster to reset the parameter. However, be sure not to set the parameter so high that memory is wasted. The use of packages requires 6 MB plus about 100 KB of lockable memory on all cluster nodes. Be sure to add one for the VxVM-CVM-pkg if you are using CVM disk storage.
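One way to arrive at a value is to add the packages planned today, an allowance for growth, and one more if CVM is in use, then confirm that the total stays within the maximum of 150. The Python sketch below is a hypothetical planning aid; the function and argument names are not Serviceguard terms.

```python
MAX_SUPPORTED_PACKAGES = 150


def plan_max_configured_packages(planned, growth_allowance, uses_cvm):
    """Return a candidate MAX_CONFIGURED_PACKAGES value."""
    total = planned + growth_allowance
    if uses_cvm:
        total += 1  # one extra for the VxVM-CVM-pkg
    if total > MAX_SUPPORTED_PACKAGES:
        raise ValueError("total exceeds the maximum of %d" % MAX_SUPPORTED_PACKAGES)
    return total


# Hypothetical plan: 8 packages now, room for 3 more, CVM storage in use.
print(plan_max_configured_packages(8, 3, uses_cvm=True))  # prints 12
```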

NOTE: Remember to tune HP-UX kernel parameters on each node to ensure that they are set high enough for the largest number of packages that would ever run concurrently on that node.

VOLUME_GROUP

The name of an LVM volume group whose disks are attached to at least two nodes in the cluster. Such disks are considered cluster aware. In the ASCII cluster configuration file, this parameter is VOLUME_GROUP. The volume group name can have up to 39 characters.

Access Control Policies

Specify three things for each policy: USER_NAME, USER_HOST, and USER_ROLE. For Serviceguard Manager, USER_HOST must be the name of the Session node. Policies set in the configuration file of a cluster and its packages must not be conflicting or redundant. For more information, see “Editing Security Files”.

FAILOVER_OPTIMIZATION

You will only see this parameter if you have installed Serviceguard Extension for Faster Failover, a separately purchased product. You enable the product by setting this parameter to TWO_NODE. Default is disabled, set to NONE. For more information about the product and its cluster configuration requirements, go to http://www.docs.hp.com/hpux/ha and click Serviceguard Extension for Faster Failover.

NETWORK_FAILURE_DETECTION

When there is a primary and a standby network card, Serviceguard needs to determine when a card has failed, so it knows whether to fail traffic over to the other card. To detect failures, Serviceguard’s Network Manager monitors both inbound and outbound traffic. The Manager will mark the card DOWN and begin to attempt a failover when network traffic is not noticed for a time. (Serviceguard calculates the time depending on the type of LAN card.)

The configuration file specifies one of two ways to decide when the network interface card has failed:

  • INOUT - The default method will count inbound and outbound failures separately, and declare a card down only when both have reached a critical level.

  • INONLY_OR_INOUT - This option combines the inbound and outbound failure counts, and will declare a card down when the total failures reach a critical amount, regardless of their source. With this method, Serviceguard tries to validate inbound failure reports by doing additional remote polling.

The default is INOUT.

The suitability of an option depends mainly on your network configuration. To see whether the new INONLY_OR_INOUT option is the best choice for your network configuration, see “Inbound Failure Detection Enhancement” at http://docs.hp.com/hpux/ha -> Serviceguard White Papers.
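The difference between the two options, as described in this section, can be illustrated with a toy model. The Python sketch below is not Serviceguard's implementation: the critical level of 3 is an invented threshold, the function name is hypothetical, and the additional remote polling that Serviceguard performs for INONLY_OR_INOUT is left out.

```python
CRITICAL_LEVEL = 3  # hypothetical threshold; the real value is not given here


def card_is_down(mode, inbound_failures, outbound_failures,
                 critical=CRITICAL_LEVEL):
    """Apply the decision rule described for each detection option."""
    if mode == "INOUT":
        # Counts are judged separately; both must reach the critical level.
        return inbound_failures >= critical and outbound_failures >= critical
    if mode == "INONLY_OR_INOUT":
        # Counts are combined; the source of the failures does not matter.
        return inbound_failures + outbound_failures >= critical
    raise ValueError("unknown NETWORK_FAILURE_DETECTION value: %r" % mode)


# Inbound traffic has failed, outbound traffic still looks healthy.
print(card_is_down("INOUT", 3, 0))            # prints False
print(card_is_down("INONLY_OR_INOUT", 3, 0))  # prints True
```

In this model, a card that has lost only inbound traffic is declared down under INONLY_OR_INOUT but not under INOUT.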

Cluster Configuration Worksheet

The following worksheet will help you to organize and record your cluster configuration.

 Name and Nodes:
===============================================================================
Cluster Name: ___ourcluster_______________

Node Names: ____node1_________________ ____node2_________________

Maximum Configured Packages: ______12________
===============================================================================
     Quorum Server Data:
===============================================================================
    Quorum Server Host Name or IP Address: __lp_qs __________________

    Quorum Server Polling Interval: _300000000_ microseconds

    Quorum Server Timeout Extension: _______________ microseconds
===========================================================================
Subnets:
===============================================================================
Heartbeat Subnet: ___15.13.168.0______

Monitored Non-heartbeat Subnet: _____15.12.172.0___

Monitored Non-heartbeat Subnet: ___________________
===========================================================================
Cluster Lock Volume Groups and Volumes:
===============================================================================
First Lock Volume Group:    | Physical Volume:
                            |
    ________________        | Name on Node 1: ___________________
                            |
                            | Name on Node 2: ___________________
                            |
                            | Disk Unit No: ________
                            |
                            | Power Supply No: ________
===========================================================================
Timing Parameters:
===============================================================================
Heartbeat Interval: _1 sec_
===============================================================================
Node Timeout: _2 sec_
===============================================================================
Network Polling Interval: _2 sec_
    Network Monitor: _INOUT_
===============================================================================
    Autostart Delay: _10 min___
===============================================================================
Cluster Aware LVM Volume Groups __________________________________________
===============================================================================
   Access Policies
     User: __ ANY_USER
     Host: __ ftsys9__
     Role: __ full_admin__

     User: __ sara itgrp lee __
     Host: __ ftsys10__
     Role: __ package_admin__
===============================================================================
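Once the worksheet is filled in, its cluster-wide values can be carried over to the parameters described in this section. The Python sketch below uses the sample values from the worksheet and converts the timing entries to microseconds. It assumes that each entry is written as a parameter name followed by its value on one line; the authoritative layout is that of the template file created by cmquerycl. Per-node entries (NODE_NAME, NETWORK_INTERFACE, HEARTBEAT_IP, and the lock volume parameters) are omitted.

```python
US_PER_SECOND = 1_000_000

# Sample values taken from the worksheet above.
worksheet = {
    "CLUSTER_NAME": "ourcluster",
    "QS_HOST": "lp_qs",
    "QS_POLLING_INTERVAL": 300_000_000,
    "HEARTBEAT_INTERVAL": 1 * US_PER_SECOND,        # 1 sec
    "NODE_TIMEOUT": 2 * US_PER_SECOND,              # 2 sec
    "AUTO_START_TIMEOUT": 10 * 60 * US_PER_SECOND,  # 10 min
    "NETWORK_POLLING_INTERVAL": 2 * US_PER_SECOND,  # 2 sec
    "NETWORK_FAILURE_DETECTION": "INOUT",
    "MAX_CONFIGURED_PACKAGES": 12,
}


def render(params):
    """Render parameters one per line (assumed 'NAME value' layout)."""
    return "\n".join("%s %s" % (name, value) for name, value in params.items())


print(render(worksheet))
```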
© Hewlett-Packard Development Company, L.P.