Designing Disaster Tolerant High Availability Clusters > Chapter 5: Building a Continental Cluster
Switching to the Recovery Packages in Case of Disaster
Once the clusters are configured and tested, packages will be able to fail over to an alternate node in another data center and still have access to the data they need to function. The primary steps for failing over a package are:
It is crucial that you have a well-defined recovery process, and that all staff at both sites are trained in how to use it.
Once the monitor is started, as described in “Starting the ContinentalClusters Monitor Package”, it sends notifications as configured. You may get one of the following types of notification, as configured in cmclconf.ascii:
Notifications are issued at the timing intervals specified for each cluster event. However, an alert or alarm may sometimes appear to take longer than configured. If several changes of cluster state (for example, Down to Error to Unreachable to Down) take place in less time than the configured interval for an alert or alarm, the timer is reset to 0 after each change of state. The time to the alert or alarm is therefore the configured interval plus the time used by all the earlier state changes.
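The timer behavior described above can be worked through with a small calculation. The following Python sketch is a simplified model, not ContinentalClusters code: the function name `time_to_notification` and its inputs are hypothetical, it treats a single interval for all states, and it assumes the notification fires as soon as one full interval passes without a state change.

```python
def time_to_notification(change_times, interval):
    """Return seconds from the first state change until the notification fires.

    change_times: times (in seconds) at which the cluster state changed.
    interval: configured alert or alarm interval in seconds (assumed to be
              the same for every state in this simplified model).
    """
    if not change_times:
        raise ValueError("at least one state change is required")
    times = sorted(change_times)
    start = times[0]
    # Each state change resets the timer to 0. The notification fires at the
    # first point where a full interval elapses before the next change.
    for current, following in zip(times, times[1:]):
        if following - current >= interval:
            return current + interval - start
    # No later change interrupts the timer started at the last change.
    return times[-1] + interval - start


# Down at 0 s, Error at 60 s, Unreachable at 100 s, Down again at 150 s,
# with a 120-second interval: the timer restarts three times.
print(time_to_notification([0, 60, 100, 150], 120))  # prints 270
```

In this example the notification arrives 270 seconds after the first state change: the 120-second interval plus the 150 seconds consumed by the earlier state changes.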
It is important to follow the established protocol for coordinating with the remote site to determine whether moving the package is required. This includes initiating person-to-person communication between sites. For example, a WAN failure may have caused the cluster alarm. Some network failures, such as those that prevent clients from using the application, may require recovery. Others, such as those that only prevent the two clusters from communicating, may not. Following an established protocol for communicating with the remote site lets you determine which case applies. See Figure 5-9 “Recovery Checklist” for an example of a recovery checklist.
Once you have received an appropriate notification, have coordinated between the sites (see “Documenting the Recovery Procedure” for a sample worksheet), and have determined that moving the package is necessary, use the cmrecovercl command to start the failover process:
# cmrecovercl
If you have not received a notification defined in a CLUSTER_ALARM statement in the configuration file, but you have received a CLUSTER_ALERT and the remote site has confirmed the need to fail over, you may override the disabled cmrecovercl command by using the -f forcing option:
# cmrecovercl -f
Use this option only after positive confirmation from the remote site.
If the monitored cluster comes back up following an alert or alarm, but you are certain that the primary packages cannot start (for example, because of damage to the disks on the primary site), you need to use a special procedure to initiate recovery:
After the cmrecovercl command is issued, ContinentalClusters displays a warning message like the following and prompts for a verification that recovery should proceed (the names "LAcluster" and "NYcluster" are examples):
Reply Y to proceed only if you are sure that recovery should take place. After replying Y to the prompt, you should see a group of messages like the following as the processing of each recovery group occurs (the message about the data receiver package only appears if you are using logical data replication with data sender and receiver packages):
Use the cmviewcl command on the local cluster to confirm that the recovery packages are running correctly.
Following recovery, you can halt the package that was monitoring the remote cluster if you wish. If you do not, you will continue to receive notifications whenever the remote cluster's state changes.
The following table shows the status of ContinentalClusters packages after recovery has taken place and applications are running on the local cluster.
Table 5-6 Status of ContinentalClusters Packages After Recovery
The cmrecovercl command uses the configuration file to loop through each defined recovery group. For each group, the command communicates with the monitor package (ccmonpkg) and verifies that the remote cluster is unreachable or down. It then halts the data replication package, if there is one, and enables the recovery package on the Recovery Cluster. The recovery package can then start up on the local cluster on the appropriate node, as determined by the FAILOVER_POLICY configured for the package. Processing continues with the next recovery group even if there are problems with one recovery group. After processing a recovery group, if the command discovers that the remote cluster is back up, it exits, since the alarm or alert state no longer exists. This process keeps the primary package on the remote cluster and the recovery package on the local cluster from running at the same time, which would result in data corruption.
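The control flow described above can be sketched in Python. This illustrates only the order of the steps and is not the actual cmrecovercl implementation: the `RecoveryGroup` class, the `recover` function, and the three callables passed to it are all hypothetical stand-ins, and the sketch assumes that "back up" refers to the monitored remote cluster.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RecoveryGroup:
    """Hypothetical stand-in for one recovery group in the configuration file."""
    name: str
    recovery_package: str
    data_receiver_package: Optional[str] = None  # only with logical replication


def recover(
    groups: List[RecoveryGroup],
    remote_is_down: Callable[[], bool],      # stand-in for asking the monitor
    halt_package: Callable[[str], None],     # stand-in for halting a package
    enable_package: Callable[[str], None],   # stand-in for enabling a package
) -> List[str]:
    """Process the groups in order; return the names of the recovered groups."""
    recovered = []
    for group in groups:
        # Stop as soon as the monitored cluster is seen to be back up,
        # because the alarm or alert state no longer exists.
        if not remote_is_down():
            break
        try:
            if group.data_receiver_package is not None:
                halt_package(group.data_receiver_package)
            enable_package(group.recovery_package)
            recovered.append(group.name)
        except RuntimeError as err:
            # A problem with one group does not stop the remaining groups.
            print(f"{group.name}: {err}; continuing with the next group")
    return recovered


if __name__ == "__main__":
    demo_groups = [
        RecoveryGroup("sales", "sales_bak", "sales_rcv"),
        RecoveryGroup("billing", "billing_bak"),
    ]
    actions: List[str] = []
    result = recover(
        demo_groups,
        remote_is_down=lambda: True,
        halt_package=lambda pkg: actions.append(f"halt {pkg}"),
        enable_package=lambda pkg: actions.append(f"enable {pkg}"),
    )
    print(result)
    print(actions)
```

Checking the remote cluster's state at the top of every iteration covers both the per-group verification and the early exit after a group has been processed.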