Designing Disaster Tolerant High Availability Clusters, Chapter 4: Designing a Continental Cluster

Understanding Continental Cluster Concepts
The Continentalclusters product provides the ability to monitor a high availability cluster and fail over mission critical applications to another cluster if the monitored cluster becomes unavailable. In the following example, the Los Angeles cluster runs the mission critical application and replicates data to the New York cluster, which has another copy of the mission critical application ready to run in case of failover. In addition, Continentalclusters supports mutual recovery, which allows mission critical applications to run on each cluster, with each cluster configured to recover the mission critical applications of the other.

Because clusters may be separated by wide geographical distances, and because they function independently, the operation of clusters in a Continentalclusters configuration is somewhat different from that of typical Serviceguard clusters.

A typical Continentalclusters recovery pair environment is shown in Figure 4-1 “Sample Continentalclusters Configuration”. Two packages are running on the cluster in Los Angeles, and their data is replicated to the cluster in New York. Physical data replication is carried out using ESCON (Enterprise Storage Connect) links between the disk array hardware in New York and Los Angeles via an ESCON/WAN converter at each end. The New York cluster is running a monitor that checks the status of the Los Angeles cluster.

In this example, the Los Angeles cluster runs just like any Serviceguard cluster, with applications configured in packages that may fail over from node to node as necessary. The New York cluster is configured with a recovery version of the packages that are running on the Los Angeles cluster. These packages do not run under normal circumstances, but are set to start up when they are needed. In addition, either cluster may run other packages that are not involved in Continentalclusters operation.

Bi-directional failover is supported in what is called a mutual recovery configuration.
This lets you define recovery groups for primary packages running in both component clusters of a recovery pair in the Continentalclusters configuration. Figure 4-2 “Sample Mutual Recovery Configuration” shows a mutual recovery configuration.

In the above figure, the salespkg is running on the New York cluster and can be recovered by the Los Angeles cluster. Similarly, the custpkg running on the Los Angeles cluster can be recovered by the New York cluster. As stated previously, physical data replication is carried out using ESCON (Enterprise Storage Connect) links between the disk array hardware in New York and Los Angeles via an ESCON/WAN converter at each end. Each cluster is running a monitor that checks the status of the alternate cluster.

As shown in the above example, each cluster runs just like any Serviceguard cluster, with applications configured in packages that may fail over from node to node as necessary. Each cluster is configured with a recovery version of the packages that are running on the alternate cluster. These packages do not run under normal circumstances, but are set to start up when they are needed. In addition, either cluster may run other packages that are not involved in Continentalclusters operation.

If a given cluster in a recovery pair of a continental cluster becomes unavailable, Continentalclusters allows an administrator to issue a single command (cmrecovercl, described later) to transfer mission critical applications from that cluster to another cluster, making sure that the packages do not run on both clusters at the same time. Transfer is not automatic, although it is automated through a recovery command, which a root user must issue. The result after issuing the recovery command is shown in Figure 4-3 “Continental Cluster After Recovery”.
The movement of an application from one cluster to another cluster does not replace local failover activity; packages are normally configured to fail over from node to node as they would on any high availability cluster. Cluster recovery—failover of packages to a different cluster—occurs only after the following:
A monitor package running on one cluster tracks the health of another cluster in the recovery pair and sends notification to system administrators if the state of the monitored cluster changes. (If a cluster contains any packages to be recovered, it must be monitored.) The monitor software polls the monitored cluster at a specific MONITOR_INTERVAL defined in an ASCII configuration file, which also indicates when and where to send messages if there is a state change.

The physical separation between clusters will require communication by way of a Wide Area Network (WAN). Since the polling takes place across the WAN, interruptions of WAN service cannot always be differentiated from cluster failure states. This means that if the WAN is unreliable, the monitoring facility will often detect and report an unreachable state for the monitored cluster that is actually an interruption of WAN service.

Because the monitoring is indeterminate in some instances, information from independent sources must be gathered to determine the need for proceeding with the recovery process. For these reasons, cluster recovery is not automatic, but must be initiated by a root user. Once initiated, however, the cluster recovery is automated to reduce the chance of human error that might occur if manual steps were needed.

In Continentalclusters, a system of cluster events and notifications is provided so that events can be easily tracked, and so that users will know when to seek additional information before initiating recovery. A cluster event is a change of state in a monitored cluster. The four cluster states reported by the monitor are Unreachable, Down, Up, and Error.

Table 4-1 “Monitored States and Possible Causes” summarizes possible causes for the cluster events with regard to both the monitored cluster and the WAN. It is clear that in many cases, the causes of cluster events are indeterminate without additional information that is not available to the software.
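The definition of a cluster event as a change of state between consecutive polls can be illustrated with a small Python sketch. This is only a model for reasoning about the concept, not the monitor's actual implementation; the function name and the sample poll sequence are assumptions made for illustration.

```python
from enum import Enum
from typing import Iterable, Iterator, Optional, Tuple


class ClusterState(Enum):
    """The four states the monitor reports, as named in the text."""
    UNREACHABLE = "Unreachable"
    DOWN = "Down"
    UP = "Up"
    ERROR = "Error"


def cluster_events(
    samples: Iterable[ClusterState],
) -> Iterator[Tuple[ClusterState, ClusterState]]:
    """Yield (old_state, new_state) whenever two consecutive polls differ."""
    previous: Optional[ClusterState] = None
    for current in samples:
        if previous is not None and current is not previous:
            yield (previous, current)
        previous = current


# Hypothetical poll results, one per MONITOR_INTERVAL.
polls = [
    ClusterState.UP,
    ClusterState.UP,
    ClusterState.UNREACHABLE,
    ClusterState.UNREACHABLE,
    ClusterState.UP,
]
events = list(cluster_events(polls))
for old, new in events:
    print(f"{old.value} -> {new.value}")
```

Note that the sketch cannot tell whether an Up -> Unreachable event reflects a cluster failure or a WAN interruption, which is exactly the ambiguity the text describes.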
Table 4-1 Monitored States and Possible Causes
Because some cluster events (e.g., Up -> Unreachable) can be caused by changes in either a cluster state or a WAN state, additional independent information is required to achieve the primary objective of determining whether you need to recover a cluster’s applications. Sources of independent information include:
When worrisome cluster events persist, you obtain as much information as possible, including authorization to recover, if your business practices require this, and then issue the recovery command.

A central part of the operation of Continentalclusters is the transmission of notifications following the detection of a cluster event. Notifications occur at specifically coded times, and at two different levels:
Notifications are typically sent as:
In addition, notifications are sent to an event log on the system where monitoring is taking place.
Alerts are intended as informational. Some typical uses of alerts include:
The expected process in dealing with alerts is to continue watching for additional notifications and to contact individuals at the site of the monitored cluster to see whether problems exist.

Alarms are intended to indicate that a cluster failure might have taken place. The most common example of an alarm is the following:
The expected process in dealing with cluster events that persist at the alarm level is to obtain as much information as possible, including authorization to recover, if your business practices require this, and then to issue the recovery command.

For events that might indicate cluster failure, you can show the escalation of your concern over cluster health by defining alerts followed by one or more alarms. A typical sequence is to issue a cluster alert at 5 minutes and 10 minutes followed by a cluster alarm at 15 minutes. This could be accomplished by entering two CLUSTER_ALERT lines in the configuration file, and one CLUSTER_ALARM line. A detailed example is provided in the comments in the ASCII configuration file template, shown in “Editing Section 3—Monitoring Definitions”.

For those events that indicate that the cluster is back online or that communication with the monitor has been restored, use cluster alerts to show the de-escalation of concern. In this case, use a CLUSTER_ALERT line in the configuration file with a time of zero (0), so that notifications are sent as soon as the return to service is detected.

When a CLUSTER_ALARM is issued, there may be a need for recovery, and the recovery command, cmrecovercl, is enabled for use by the root user. Cluster recovery is carried out at the site of the recovery cluster by using the cmrecovercl command, as follows:

# cmrecovercl

This command will fail if a cluster alarm has not been issued. The command has the effect of halting any data replication activity from the failed cluster to the local cluster, and starting up on the local cluster all the recovery packages that are pre-configured in recovery groups, which are the units of recovery in a continental cluster. If the option “-g RecoveryGroup” is specified with the command, the recovery process (halting data replication activity and starting the recovery package) is carried out only for the specified recovery group.
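The escalation sequence described above (alerts at 5 and 10 minutes, an alarm at 15 minutes, with cmrecovercl enabled only after an alarm) can be modeled in a few lines of Python. This is a rough sketch for reasoning about a schedule, not the configuration file syntax or the monitor's logic; the SCHEDULE structure and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical schedule mirroring the example in the text:
# alerts at 5 and 10 minutes, followed by an alarm at 15 minutes.
SCHEDULE: List[Tuple[int, str]] = [(5, "ALERT"), (10, "ALERT"), (15, "ALARM")]


def notifications_due(
    minutes_in_state: int,
    schedule: Sequence[Tuple[int, str]] = SCHEDULE,
) -> List[Tuple[int, str]]:
    """Return the schedule entries whose time threshold has been reached."""
    return [(t, level) for t, level in sorted(schedule) if t <= minutes_in_state]


def recovery_enabled(
    minutes_in_state: int,
    schedule: Sequence[Tuple[int, str]] = SCHEDULE,
) -> bool:
    """Model the rule that recovery is enabled only once an alarm has been issued."""
    due = notifications_due(minutes_in_state, schedule)
    return any(level == "ALARM" for _, level in due)


for minutes in (4, 12, 15):
    print(minutes, notifications_due(minutes), recovery_enabled(minutes))
```

A return-to-service event would be modeled with a single alert at time zero, for example `notifications_due(0, [(0, "ALERT")])`, so that the notification is due as soon as the state change is detected.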
After the cmrecovercl command is issued, there is a delay of at least 90 seconds per recovery group as the command makes sure that the package is not active on another cluster.

Cluster recovery is done as a last resort, after all other approaches to restore the unavailable cluster have been exhausted. It is important to remember that cluster recovery sets in motion a process that cannot be easily reversed. Unlike the failover of a package from one node to another, failing a package from one cluster to another normally involves a significant quantity of data that is being accessed from a new set of disks. Returning control to the original cluster will involve resynchronizing this data and resetting the roles of the clusters in a process that is easier for some data replication techniques than others.
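For sites that keep recovery steps in a runbook script, the two invocation forms described in this section (cmrecovercl alone, or with -g RecoveryGroup) can be assembled as argument lists. The following Python sketch only builds the command and deliberately does not execute it, since the decision to recover rests with the root user. The function name and the recovery group name "salesgrp" are hypothetical.

```python
from typing import List, Optional


def build_recovery_command(recovery_group: Optional[str] = None) -> List[str]:
    """Build the argument list for cmrecovercl.

    With no argument, the command recovers all configured recovery groups.
    With a group name, the -g option limits recovery to that group.
    """
    argv = ["cmrecovercl"]
    if recovery_group is not None:
        if not recovery_group.strip():
            raise ValueError("recovery group name must not be empty")
        argv += ["-g", recovery_group]
    return argv


# "salesgrp" is a made-up recovery group name used only for illustration.
print(" ".join(build_recovery_command()))
print(" ".join(build_recovery_command("salesgrp")))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if the list is later handed to a process launcher.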
Packages have somewhat different behavior in a continental cluster than in a normal Serviceguard environment. There are specific differences in
Normally, an application (package) can run on only one node at a time in a cluster. However, in a continental cluster, there are two clusters in which an application (the primary package or the recovery package) could operate on the same data. The primary package and the recovery package must not both be allowed to run at the same time. To prevent this, it is very important to ensure that packages are not allowed to start automatically and are not started up at inappropriate times.

To keep packages from starting up automatically when a cluster starts, you must set the AUTO_RUN parameter (PKG_SWITCHING_ENABLED prior to Serviceguard 11.12) to NO for all primary and recovery packages. Then use the cmmodpkg command with the -e <packagename> option to start up only the primary packages and enable switching. The cmrecovercl command, when run, will start up the recovery packages and enable switching during the cluster recovery operation.
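A quick audit of package configuration text can confirm that AUTO_RUN is set to NO before a cluster is started. The Python sketch below assumes the configuration uses one "KEYWORD value" pair per line with "#" comments; that layout, the function names, and the sample text are assumptions made for illustration, so adjust the parsing to match the actual files.

```python
from typing import Optional


def auto_run_value(config_text: str) -> Optional[str]:
    """Return the AUTO_RUN value found in the text, or None if it is absent."""
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper() == "AUTO_RUN":
            return parts[1].upper()
    return None


def safe_for_continental_cluster(config_text: str) -> bool:
    """True only when AUTO_RUN is explicitly set to NO."""
    return auto_run_value(config_text) == "NO"


# Hypothetical configuration fragment for illustration only.
sample = """
# primary package
PACKAGE_NAME  salespkg
AUTO_RUN      NO    # required for continental cluster packages
"""
print(safe_for_continental_cluster(sample))
```

Treating a missing AUTO_RUN line as unsafe is a deliberate choice in this sketch: it forces the setting to be stated explicitly rather than relying on a default.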
To prevent packages from being started at the wrong time and in the wrong place, you can use the following strategies:
Another important difference between the packages configured in a continental cluster and the packages configured in a standard Serviceguard cluster is that the same or different subnets can be used for primary cluster and recovery cluster configurations. In addition, the same or different relocatable IP addresses can be used for the primary package and its corresponding recovery package. The client application must be designed properly to connect to the appropriate IP address following a recovery operation.

Continentalclusters packages are manipulated manually by the user via Serviceguard commands and automatically by cmcld, in the same way as any other packages. In a continental cluster, the recovery package is not allowed to run at the same time as the primary, data sender, or data receiver packages. To enforce this, several Serviceguard commands behave in a slightly different manner when used in a continental cluster.

Table 4-2 “Serviceguard and Continentalclusters Commands” describes the Serviceguard commands whose behavior is different in a continental cluster environment. Specifically, when one of the following commands attempts to start or enable switching of a package, it first checks the status of the other packages in the recovery group. Based on the status, the operation is either allowed or disallowed. The check assumes that both clusters are in a stable state and that network communication between them is functioning properly. If network communication between the clusters cannot be established, or the cluster or package status cannot be determined, the user must check manually to ensure that the operation to be performed on the target package will not conflict with other packages configured in the same recovery group.

Table 4-2 Serviceguard and Continentalclusters Commands
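The pre-start check described before Table 4-2 can be sketched in Python as a decision with three outcomes: allow, disallow, or fall back to a manual check when status cannot be determined. This is a simplified model of the rule that the recovery package must not run alongside the primary, data sender, or data receiver packages. It is not the logic of the actual commands, and all names in it are hypothetical.

```python
from enum import Enum
from typing import Mapping


class PkgStatus(Enum):
    RUNNING = "running"
    HALTED = "halted"
    UNKNOWN = "unknown"  # status could not be determined, e.g. no network path


class Decision(Enum):
    ALLOW = "allow"
    DISALLOW = "disallow"
    MANUAL_CHECK = "manual check required"


# Roles that must not run at the same time as the recovery package.
CONFLICTS_WITH_RECOVERY = frozenset({"primary", "data_sender", "data_receiver"})


def may_start(target_role: str, status_by_role: Mapping[str, PkgStatus]) -> Decision:
    """Decide whether a package with the given role may be started."""
    if target_role == "recovery":
        conflicting = CONFLICTS_WITH_RECOVERY
    elif target_role in CONFLICTS_WITH_RECOVERY:
        conflicting = frozenset({"recovery"})
    else:
        raise ValueError(f"unknown role: {target_role}")

    # Roles absent from the mapping are treated as not configured in the group.
    statuses = [status_by_role[r] for r in conflicting if r in status_by_role]
    if any(s is PkgStatus.RUNNING for s in statuses):
        return Decision.DISALLOW
    if any(s is PkgStatus.UNKNOWN for s in statuses):
        return Decision.MANUAL_CHECK
    return Decision.ALLOW


print(may_start("recovery", {"primary": PkgStatus.RUNNING}).value)
print(may_start("recovery", {"primary": PkgStatus.UNKNOWN}).value)
```

Checking for a running conflict before an unknown one means a single confirmed conflict is enough to refuse the start, even if other statuses in the group are unavailable.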