Designing Disaster Tolerant HA Clusters Using Metrocluster and Continentalclusters, Chapter 2: Designing a Continental Cluster
Understanding Continental Cluster Concepts
The Continentalclusters product provides the ability to monitor a high availability cluster and fail over mission critical applications to another cluster if the monitored cluster becomes unavailable. In the following example, the Los Angeles cluster runs the mission critical application and replicates data to the New York cluster, which has another copy of the mission critical application ready to run in case of failover. In addition, Continentalclusters supports mutual recovery, which allows different critical applications to run on each cluster, with each cluster configured to recover the mission critical applications of the other.

Because clusters may be separated by wide geographical distances, and because they function independently, the operation of clusters in a Continentalclusters configuration is somewhat different from that of typical Serviceguard clusters.

A typical Continentalclusters recovery pair environment is shown in Figure 2-1 “Sample Continentalclusters Configuration”. Two packages are running on the cluster in Los Angeles, and their data is replicated to the cluster in New York. Physical data replication is carried out using ESCON (Enterprise Storage Connect) links between the disk array hardware in New York and Los Angeles via an ESCON/WAN converter at each end. The New York cluster is running a monitor that checks the status of the Los Angeles cluster.

In this example, the Los Angeles cluster runs just like any Serviceguard cluster, with applications configured in packages that may fail over from node to node as necessary. The New York cluster is configured with a recovery version of the packages that are running on the Los Angeles cluster. These packages do not run under normal circumstances, but are set to start up when they are needed. In addition, either cluster may run other packages that are not involved in Continentalclusters operation.
Bi-directional failover is supported in what is called a “mutual recovery configuration.” This allows recovery groups to be defined for primary packages running in both component clusters of a recovery pair in the Continentalclusters configuration. Figure 2-2 “Sample Mutual Recovery Configuration” shows a mutual recovery configuration.

In the above figure, the salespkg is running on the New York cluster and can be recovered by the Los Angeles cluster. Similarly, the custpkg running on the Los Angeles cluster can be recovered by the New York cluster. As stated previously, physical data replication is carried out using ESCON (Enterprise Storage Connect) links between the disk array hardware in New York and Los Angeles via an ESCON/WAN converter at each end. Each cluster is running a monitor that checks the status of the alternate cluster.

As depicted in the above example, each cluster runs just like any Serviceguard cluster, with applications configured in packages that may fail over from node to node as necessary. Each cluster is configured with a recovery version of the packages that are running on the alternate cluster. These packages do not run under normal circumstances, but are set to start up when they are needed. In addition, either cluster may run other packages that are not involved in Continentalclusters operation.

If a given cluster in a recovery pair of a continental cluster becomes unavailable, Continentalclusters allows an administrator to issue a single command, cmrecovercl (described later), to transfer mission critical applications from that cluster to another cluster, making sure that the packages do not run on both clusters at the same time. The transfer is not triggered automatically: a root user must issue the recovery command, which then automates the remaining steps. The result after issuing the recovery command is shown in Figure 2-3 “Continental Cluster After Recovery”.
The movement of an application from one cluster to another cluster does not replace local failover activity; packages are normally configured to fail over from node to node as they would on any high availability cluster. Cluster recovery (the failover of packages to a different cluster) occurs only after the following events:
A monitor package running on one cluster tracks the health of another cluster in the recovery pair and sends notification to configured destinations if the state of the monitored cluster changes. (If a cluster contains any packages to be recovered it must be monitored.) The monitor software polls the monitored cluster at a specific MONITOR_INTERVAL defined in an ASCII configuration file, which also indicates when and where to send messages if there is a state change.

The physical separation between clusters will require communication by way of a Local or Wide Area Network (LAN or WAN). Since the polling takes place across the network, interruptions of network service cannot always be differentiated from cluster failure states. This means that if the network is unreliable, the monitoring facility will often detect and report an unreachable state for the monitored cluster that is actually an interruption of the network service. Because the monitoring is indeterminate in some instances, information from independent sources must be gathered to determine the need for proceeding with the recovery process. For these reasons, cluster recovery is not automatic, but must be initiated by a root user. Once initiated, however, the cluster recovery is automated to reduce the chance of human error that might occur if manual steps were needed.

In Continentalclusters, a system of cluster events and notifications is provided so that events can be easily tracked, and users will know when to seek additional information before initiating recovery.

A cluster event is a change of state in a monitored cluster. The four cluster states reported by the monitor are Unreachable, Down, Up, and Error. Table 2-1 “Monitored States and Possible Causes” summarizes possible causes for the cluster events with regard to both the monitored cluster and the network. However, in many cases the causes of cluster events are indeterminate without additional information that is not available to the software.
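The definition of a cluster event as a change of state between polls can be illustrated with a short Python sketch. This is only a model of the idea, not the monitor's actual implementation: the function name `cluster_events` and the sample poll sequence are hypothetical, and only the four state names come from the text.

```python
from typing import Iterable, Iterator, Optional, Tuple

# The four states the monitor reports, as named in the text.
STATES = {"Unreachable", "Down", "Up", "Error"}


def cluster_events(polled_states: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (old_state, new_state) whenever two consecutive polls differ.

    Hypothetical helper: each item in polled_states stands for the result
    of one poll taken every MONITOR_INTERVAL.
    """
    previous: Optional[str] = None
    for state in polled_states:
        if state not in STATES:
            raise ValueError(f"unknown cluster state: {state!r}")
        if previous is not None and state != previous:
            yield (previous, state)
        previous = state


# Hypothetical sequence of poll results.
polls = ["Up", "Up", "Unreachable", "Unreachable", "Up"]
events = list(cluster_events(polls))
for old, new in events:
    print(f"{old} -> {new}")
```

Note that the sketch reports an Up -> Unreachable event without knowing whether the cluster or the network failed, which is exactly the ambiguity the text describes.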
Table 2-1 Monitored States and Possible Causes
Because some cluster events (for example, Up -> Unreachable) can be caused by changes in either a cluster state or a network state, additional independent information is required to achieve the primary objective of determining whether you need to recover a cluster’s applications. Sources of independent information include:
When problematic cluster events persist, obtain as much information as possible, including authorization to recover, if your business practices require this, and then issue the Continentalclusters recovery command, cmrecovercl.

A central part of the operation of Continentalclusters is the transmission of notifications following the detection of a cluster event. Notifications occur at specifically coded times, and at two different levels:
Notifications are typically sent as:
In addition, notifications are sent to the eventlog file located in the /var/opt/resmon/log/cc directory on the system where monitoring is taking place.
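The idea of notifications that fire at coded times and at two levels can be modeled with a minimal Python sketch. Everything here is an assumption for illustration: the `NotificationRule` structure, the `due_notifications` function, and the 300/600/900 second thresholds are hypothetical and do not reproduce the syntax of the Continentalclusters configuration file.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class NotificationRule:
    level: str          # "ALERT" or "ALARM" (the two notification levels)
    state: str          # monitored cluster state the rule applies to
    after_seconds: int  # time spent in that state before notifying


def due_notifications(rules: List[NotificationRule], state: str,
                      seconds_in_state: int,
                      already_sent: Set[NotificationRule]) -> List[NotificationRule]:
    """Return rules for `state` whose time has been reached and not yet sent."""
    return [r for r in rules
            if r.state == state
            and r.after_seconds <= seconds_in_state
            and r not in already_sent]


# Hypothetical thresholds, chosen only for this example.
rules = [
    NotificationRule("ALERT", "Unreachable", 300),
    NotificationRule("ALERT", "Unreachable", 600),
    NotificationRule("ALARM", "Unreachable", 900),
    NotificationRule("ALERT", "Up", 0),  # notify as soon as service returns
]

sent: Set[NotificationRule] = set()
sent_levels: List[str] = []
for elapsed in (0, 300, 600, 900):  # cluster stays Unreachable
    for rule in due_notifications(rules, "Unreachable", elapsed, sent):
        sent.add(rule)
        sent_levels.append(rule.level)
print(sent_levels)
```

In this model a rule with a time of zero fires on the first check after the state change, and later rules fire only if the state persists.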
Alerts are intended as informational. Some typical uses of alerts include:
The expected process in dealing with alerts is to continue watching for additional notifications and to contact individuals at the site of the monitored cluster to see whether problems exist.

Alarms are intended to indicate that a cluster failure might have taken place. The most common example of an alarm is the following:
The expected process in dealing with cluster events that persist at the alarm level is to obtain as much information as possible, including authorization to recover, if your business practices require this. At that point, issue the Continentalclusters recovery command, cmrecovercl.

For events that indicate potential cluster failure, show the escalating concern about cluster health by defining alerts followed by one or more alarms. The following is a typical sequence:
This could be accomplished by entering two CLUSTER_ALERT lines in the configuration file, and one CLUSTER_ALARM line. A detailed example is provided in the comments in the ASCII configuration file template, shown in “Editing Section 3—Monitoring Definitions”.

For those events that indicate that the cluster is back online or that communication with the monitor has been restored, use cluster alerts to show the de-escalation of concern. In this case, use a CLUSTER_ALERT line in the configuration file with a time of zero (0), so that notifications are sent as soon as the return to service is detected.

Placing a recovery group in maintenance mode exempts it from recovery: while the group is in maintenance mode, its recovery package cannot be started in the recovery cluster. By default, no recovery group in the Continentalclusters configuration is in maintenance mode. To move a recovery group in Continentalclusters into maintenance mode, you must disable it. To move a recovery group out of maintenance mode, you must enable it. You can complete rehearsal operations on a recovery group only when the recovery group is in maintenance mode. For more information on rehearsal operations, see “Performing a Rehearsal Operation in your Environment”.

Use the cmrecovercl -d -g command to move a recovery group into maintenance mode. To move the recovery group out of maintenance mode, use the cmrecovercl -e -g command.

Maintenance mode for recovery groups is an optional feature. You must explicitly configure Continentalclusters to use this feature. Consider the following guidelines when you move a recovery group into maintenance mode:
When a recovery group is in the maintenance mode, Continentalclusters prevents startup of the recovery package for that recovery group with the cmrecovercl, cmrunpkg, or cmmodpkg commands.

When a recovery group is in the maintenance mode, there is no impact on the availability of the primary packages. The primary package continues to be up and is highly available within the primary cluster (that is, local failover is still allowed). Clients can continue to connect to the primary package and access its production data on the primary cluster.

There is no dependency on data replication to move a recovery group into maintenance mode. Array based replication can be suspended or can be in progress. Similarly, logical replication can either be suspended (receiver package is down) or can be resumed (receiver package is up).

Table 2-2 “Impact of Maintenance Mode” describes the impact on recovery when a recovery group is in the maintenance mode.

Table 2-2 Impact of Maintenance Mode
Run the following command to disable a recovery group and move it into the maintenance mode:

    cmrecovercl -d [-f] -g <recovery group>

Where <recovery group> is the name of the recovery group to be disabled.

Run this command only on recovery cluster nodes. This command succeeds only when Continentalclusters is configured for maintenance mode. The command checks for the following conditions to successfully disable a recovery group:
Run the following command to enable a recovery group and move it out of the maintenance mode:

    cmrecovercl -e -g <recovery group>

Where <recovery group> is the name of the recovery group to be enabled and moved out of the maintenance mode.

You can run this command only on recovery cluster nodes. The command succeeds only when Continentalclusters is configured for maintenance mode. The following conditions must be met for the recovery group to be enabled and moved out of the maintenance mode:
When a CLUSTER_ALARM is issued, there may be a need for a cluster recovery using the recovery command, cmrecovercl, which is enabled for use by the root user. Cluster recovery is carried out at the site of the recovery cluster by using the cmrecovercl command. The cmrecovercl command will only recover recovery groups that are enabled for recovery and are not in the maintenance mode.

    # cmrecovercl

Issuing this command will halt any configured data replication activity from the failed cluster to the recovery cluster, and will start all configured recovery packages on the recovery cluster that are pre-configured in recovery groups. A recovery group is the basic unit of recovery used in a continental cluster configuration. This command will fail if a cluster alarm has not been issued.

If the option “-g RecoveryGroup” is specified with the recovery command, then data replication is halted and the recovery package is started only for the specified recovery group.

After the cmrecovercl command is issued, there is a delay of at least 90 seconds (per recovery group) while the command ensures that the package is not active on another cluster.

Cluster recovery is done as a last resort, after all other approaches to restore the unavailable cluster have been exhausted. It is important to remember that cluster recovery sets in motion a process that cannot be easily reversed. Unlike the failover of a package from one node to another, failing a package over from one cluster to another normally involves a significant quantity of data that is being accessed from a new set of disks. Returning control to the original cluster will involve resynchronizing this data and resetting the roles of the clusters in a process that is easier for some data replication techniques than others.
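The selection rules described above (no recovery without a cluster alarm, groups in maintenance mode are skipped, and “-g” narrows recovery to one group) can be summarized in a small Python model. This is a sketch of the documented behavior only, not how cmrecovercl is implemented; the `RecoveryGroup` class, the `groups_to_recover` function, and the group names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecoveryGroup:
    name: str
    in_maintenance_mode: bool = False  # True after the group is disabled


def groups_to_recover(groups: List[RecoveryGroup], alarm_issued: bool,
                      only_group: Optional[str] = None) -> List[str]:
    """Return names of the groups a recovery would act on.

    only_group models the "-g RecoveryGroup" option; None means all groups.
    """
    if not alarm_issued:
        # The text states the command fails if no cluster alarm was issued.
        raise RuntimeError("no cluster alarm has been issued; recovery refused")
    candidates = [g for g in groups
                  if only_group is None or g.name == only_group]
    # Groups in maintenance mode are exempt from recovery.
    return [g.name for g in candidates if not g.in_maintenance_mode]


# Hypothetical recovery groups for illustration.
groups = [
    RecoveryGroup("salesgroup"),
    RecoveryGroup("custgroup", in_maintenance_mode=True),
]
print(groups_to_recover(groups, alarm_issued=True))
print(groups_to_recover(groups, alarm_issued=True, only_group="custgroup"))
```

The model leaves out the per-group delay of at least 90 seconds and the halting of data replication, which the real command performs for each group it recovers.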
During a recovery in Continentalclusters, a configuration inconsistency at the recovery cluster can result in an unsuccessful recovery attempt. Rehearsing the recovery procedure gives you a way to proactively identify and fix these configuration inconsistencies so that there are no issues during an actual recovery. Continentalclusters provides the environment and a set of required tools to complete a Disaster Recovery (DR) Rehearsal.

Continentalclusters allows recovery groups to be configured with a special rehearsal package for DR rehearsal. You must configure the rehearsal package in the recovery cluster and specify it as part of the recovery group definition. The rehearsal package is identical to the recovery package and can be effectively used in place of the recovery package to verify the environment. This rehearsal package bundles the same application and uses the same storage devices as the recovery package. During a DR rehearsal process, Continentalclusters will start the rehearsal package and validate the recovery cluster environment. However, to stop clients from using the recovery package application instance, it is necessary that the client access network IP address be different for the rehearsal package. For more information on configuring a rehearsal package, see “Configuring Recovery Groups with Rehearsal Packages”.

Also, you must configure Continentalclusters to enable the maintenance mode feature for recovery groups. For more information on the maintenance mode, see “Maintenance Mode for Recovery Groups”.

Disaster Recovery Rehearsal for a recovery group is done in the following phases:
For information on running a rehearsal process in your environment, see Appendix G “Data Replication Rehearsal in a Sample Environment”.

Packages have somewhat different behavior in a continental cluster than in a normal Serviceguard environment. There are specific differences in
From Serviceguard A.11.17 and above, you can configure the following package types in a recovery group:
In the case of a multi-node package, a recovery process recovers all instances of the package in a recovery cluster.
Normally, an application (package) can run on only one node at a time in a cluster. However, in a continental cluster, there are two clusters in which an application (the primary package or the recovery package) could operate on the same data. The primary package and the recovery package must never run at the same time. To prevent this, it is important to ensure that packages are not allowed to start automatically and are not started at inappropriate times.

To keep packages from starting up automatically when a cluster starts, set the AUTO_RUN parameter (PKG_SWITCHING_ENABLED prior to Serviceguard A.11.12) to NO for all primary and recovery packages. Then use the cmmodpkg command with the -e <packagename> option to start up only the primary packages and enable switching. The cmrecovercl command, when run, will start up the recovery packages and enable switching during the cluster recovery operation.
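A quick way to audit the AUTO_RUN setting is to scan each package configuration file before applying it. The Python sketch below is a hypothetical helper, not a Serviceguard tool: it assumes the configuration text uses `KEYWORD value` lines with `#` comments, and the sample file contents and package names are invented for illustration.

```python
from typing import Dict, List, Optional


def auto_run_value(config_text: str) -> Optional[str]:
    """Return the AUTO_RUN value (upper-cased) found in the text, or None.

    Assumes "KEYWORD value" lines and "#" comments; adjust for your files.
    """
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper() == "AUTO_RUN":
            return parts[1].upper()
    return None


def packages_not_disabled(configs: Dict[str, str]) -> List[str]:
    """Return names of packages whose AUTO_RUN is missing or not NO."""
    return sorted(name for name, text in configs.items()
                  if auto_run_value(text) != "NO")


# Invented sample configuration text for two packages.
configs = {
    "salespkg": "PACKAGE_NAME salespkg\nAUTO_RUN NO   # start manually\n",
    "custpkg": "PACKAGE_NAME custpkg\nAUTO_RUN YES\n",
}
print(packages_not_disabled(configs))
```

Any package the helper lists should be reviewed before the cluster is started, since an automatically started recovery package could run alongside its primary.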
To prevent packages from being started at the wrong time and in the wrong place, use the following strategies:
Another important difference between the packages configured in a continental cluster and the packages configured in a standard Serviceguard cluster is that the same or different subnets can be used for primary cluster and recovery cluster configurations. In addition, the same or different relocatable IP addresses can be used for the primary package and its corresponding recovery package. The client application must be designed properly to connect to the appropriate IP address following a recovery operation. For recovery groups with a rehearsal package configured, ensure that the rehearsal package IP address is different from the recovery package IP address.

Continentalclusters packages are manipulated manually by the user via Serviceguard commands and by cmcld automatically in the same way as any other packages. In a continental cluster, the recovery package is not allowed to run at the same time as the primary, data sender, or data receiver packages. To enforce this, several Serviceguard commands behave in a slightly different manner when used in a continental cluster. Table 2-3 “Serviceguard and Continentalclusters Commands” describes the Serviceguard commands whose behavior is different in a continental cluster environment.

Specifically, when one of the commands listed in Table 2-3 “Serviceguard and Continentalclusters Commands” attempts to start or enable switching of a package, it first checks the status of the other packages in the recovery group. Based on the status, the operation is either allowed or disallowed. The check relies on both clusters being in a stable state and on network communication between them functioning properly. If network communication between the clusters cannot be established, or the cluster or package status cannot be determined, you must check manually that the operation to be performed on the target package will not conflict with other packages configured in the same recovery group.
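The status check described above can be pictured as a three-way decision: allow, disallow, or fall back to a manual check. The Python sketch below is a simplified model under stated assumptions; it does not reproduce the per-command rules of Table 2-3, and the `Status` enum, the role names, and the `start_decision` function are hypothetical.

```python
from enum import Enum
from typing import Dict


class Status(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"  # status could not be determined (e.g. network down)


# The recovery package must not run alongside these packages, and vice versa.
CONFLICTS = {
    "recovery": {"primary", "data_sender", "data_receiver"},
    "primary": {"recovery"},
    "data_sender": {"recovery"},
    "data_receiver": {"recovery"},
}


def start_decision(target_role: str, others: Dict[str, Status]) -> str:
    """Decide whether starting the target package looks safe.

    others maps the role of each other package in the recovery group
    to its last known status.
    """
    relevant = [status for role, status in others.items()
                if role in CONFLICTS[target_role]]
    if Status.UP in relevant:
        return "disallow"
    if Status.UNKNOWN in relevant:
        return "manual-check"
    return "allow"


print(start_decision("recovery", {"primary": Status.UP}))
print(start_decision("recovery", {"primary": Status.UNKNOWN}))
print(start_decision("primary", {"recovery": Status.DOWN}))
```

The "manual-check" outcome corresponds to the case in the text where the status of the other cluster or package cannot be determined and the administrator must verify it independently.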
Table 2-3 Serviceguard and Continentalclusters Commands