Designing Disaster Tolerant HA Clusters Using Metrocluster and Continentalclusters: > Chapter 2 Designing a Continental Cluster

Switching to the Recovery Packages in Case of Disaster


Once the clusters are configured and tested, packages will be able to fail over to an alternate node in another data center and still have access to the data they need to function. The primary steps for failing over a package are:

  1. Receive notification that a monitored cluster is unavailable.

  2. Verify that it is necessary and safe to start the recovery packages.

  3. Use the recovery command to stop data replication and start recovery packages.

  4. View the status of the continental cluster.

    # cmviewconcl

It is important to have a well-defined recovery process and to make sure that all staff at both sites are trained to use it.

Receiving Notification

Once the monitor is started, as described in “Starting the Continentalclusters Monitor Package”, it sends notifications as configured. The following types of notifications are defined in cmclconf.ascii:

  • CLUSTER_ALERT is a change in the status of a cluster. Recovery via the cmrecovercl command is not enabled by default. This should be treated as information that the cluster either may be developing a problem or may be recovering from a problem.

  • CLUSTER_ALARM is a change in the status of a cluster that indicates that the cluster has been unavailable for an unacceptable amount of time. Recovery via the cmrecovercl command is enabled.

Notifications are issued at the timing intervals specified for each cluster event. However, an alert or alarm may sometimes appear to take longer than configured. If several changes of cluster state (for example, Down to Error to Unreachable to Down) take place in less time than the configured interval for an alert or alarm, the timer is reset to 0 after each change of state; the time to the alert or alarm is therefore the configured interval plus the time used by all the earlier state changes.
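To make the timer arithmetic concrete, the following sketch adds the time consumed by earlier state changes to the configured interval. The function name and the sample figures are hypothetical and are not part of Continentalclusters; the sketch also assumes each gap between state changes is shorter than the configured interval, since otherwise the notification would already have been issued.

```shell
#!/bin/sh
# Hypothetical helper (not a Continentalclusters command): estimate how many
# seconds after the FIRST state change a notification is issued.
# Usage: time_to_notification <configured_interval_s> [<gap_s> ...]
#   each <gap_s> is the time between one state change and the next
time_to_notification() {
    interval=$1
    shift
    elapsed=0
    # Every later state change resets the timer to 0, so the time that
    # passed before that change is added on top of the configured interval.
    for gap in "$@"; do
        elapsed=$((elapsed + gap))
    done
    echo $((elapsed + interval))
}

# Example: 300 s interval; Down -> Error after 60 s, Error -> Unreachable
# after a further 120 s, Unreachable -> Down after a further 90 s.
time_to_notification 300 60 120 90   # prints 570
```

With no intermediate state changes the result is simply the configured interval.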

NOTE: The cmrecovercl command is fully enabled only after a CLUSTER_ALARM is issued; however, the command may be used with the -f option when a CLUSTER_ALERT has been issued.

Verifying that Recovery is Needed

It is important to follow the established protocol for coordinating with the remote site to determine whether moving the package is required. This includes initiating person-to-person communication between sites. For example, a WAN failure alone may have caused the cluster alarm.

Some network failures, such as those that prevent clients from using the application, may require recovery. Other network failures, such as those that only prevent the two clusters from communicating, may not require recovery. Following an established protocol for communicating with the remote site would verify this. See Figure 2-10 “Recovery Checklist” for an example of a recovery checklist.

Using the Recovery Command to Switch All Packages

If you use a data replication technology other than Metrocluster Continuous Access XP, Metrocluster Continuous Access EVA, or Metrocluster with EMC SRDF, perform the following steps before executing the Continentalclusters recovery command, cmrecovercl.

Once you have received notification, coordinated with the other site in the recovery pair (for a sample worksheet, see “Documenting the Recovery Procedure”), and determined that moving the package is necessary:

  • Make sure the data used by the application is in a usable state. A usable state means the data is consistent and recoverable, even though it may not be current.

  • Make sure the secondary devices are in read-write mode. If you are using database or software data replication, make sure the data copy at the recovery site is in read-write mode as well.

  • If LVM and physical data replication are used, the ID of the primary cluster is also replicated and written on the secondary devices in the recovery site. The ID of the primary cluster must be cleared and the ID of the recovery cluster must be written on the secondary devices before they can be used.

    If LVM exclusive-mode is used, issue the following commands from a node in the recovery cluster on all the volume groups that are used by the recovery packages:

    # vgchange -c n <volume group name>

    # vgchange -c y <volume group name>

    If LVM shared-mode (SLVM) is used, from a node in the recovery cluster, issue the following commands:

    # vgchange -c n -S n <volume group name>

    # vgchange -c y -S y <volume group name>

  • If VxVM and physical data replication are used, the host name of a node in the primary cluster is the host name of the last owner of the disk group. It is also replicated and written on the secondary devices in the recovery site. The host name of the last owner of the disk group must be cleared out before the secondary devices can be used.

    If VxVM is used, issue the following command from a node in the recovery cluster on all the disk groups that are used by the recovery packages:

    # vxdg deport <disk group name>
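The preparation commands above must be repeated for every volume group or disk group used by the recovery packages. A minimal dry-run sketch of that repetition follows; it only prints the commands shown in this section and executes nothing. The function name, the mode keywords, and the sample volume group names are hypothetical.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not execute) the preparation
# commands for each volume group or disk group named on the command line.
# Usage: print_prep_commands <lvm-exclusive|lvm-shared|vxvm> <name>...
print_prep_commands() {
    mode=$1
    shift
    for name in "$@"; do
        case "$mode" in
            lvm-exclusive)
                # Clear the primary cluster ID, then write the recovery cluster ID.
                echo "vgchange -c n $name"
                echo "vgchange -c y $name"
                ;;
            lvm-shared)
                # Same two steps for shared-mode (SLVM) volume groups.
                echo "vgchange -c n -S n $name"
                echo "vgchange -c y -S y $name"
                ;;
            vxvm)
                echo "vxdg deport $name"
                ;;
            *)
                echo "unknown mode: $mode" >&2
                return 1
                ;;
        esac
    done
}

# Example with two made-up volume group names.
print_prep_commands lvm-exclusive /dev/vg01 /dev/vg02
```

Reviewing the printed list before running anything on a node in the recovery cluster keeps the operator in control of each step.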

To Start the Failover Process

Use the following command to start the failover process:

# cmrecovercl

If no notification defined in a CLUSTER_ALARM statement in the configuration file has been received, but a CLUSTER_ALERT has been received and the remote site has confirmed the need to fail over, override the disabled cmrecovercl command by using the -f (force) option:

# cmrecovercl -f

This should only be used after positive confirmation from the remote site.
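The choice between the plain and forced forms of the command can be summarized as a small decision sketch. It only prints the form that applies and never runs cmrecovercl; the function name and its yes/no argument are hypothetical conventions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print which form of cmrecovercl applies.
# Usage: choose_recovery_command <CLUSTER_ALARM|CLUSTER_ALERT> <yes|no>
#   second argument: has the remote site confirmed that failover is needed?
choose_recovery_command() {
    notification=${1:-}
    confirmed=${2:-no}

    # Recovery is never started without confirmation from the remote site.
    if [ "$confirmed" != "yes" ]; then
        echo "do not recover: no confirmation from the remote site" >&2
        return 1
    fi

    case "$notification" in
        CLUSTER_ALARM) echo "cmrecovercl" ;;      # command is fully enabled
        CLUSTER_ALERT) echo "cmrecovercl -f" ;;   # usable only with -f
        *)
            echo "do not recover: no alert or alarm received" >&2
            return 1
            ;;
    esac
}

choose_recovery_command CLUSTER_ALERT yes   # prints: cmrecovercl -f
```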

In a multiple recovery pair configuration, where more than one primary cluster shares the same recovery cluster, running cmrecovercl without any option attempts to recover packages for all of the recovery groups of the configured primary clusters. In this configuration, recovery can also be done on a per-cluster basis by using the -c option:

# cmrecovercl -c <PrimaryClusterName>

If the monitored cluster comes back up following an alert or alarm, but it is certain that the primary packages cannot start (say, because of damage to the disks on the primary site), then use a special procedure to initiate recovery:

  1. Use the cmhaltcl command to halt the primary cluster.

  2. Wait for the monitor to send an alert.

  3. Use cmrecovercl -f to perform recovery.

After the cmrecovercl command is issued, Continentalclusters displays a warning message such as the following, and prompts for verification that recovery should proceed (the names “LAcluster” and “NYcluster” are examples):

WARNING: This command will take over for the primary cluster “LAcluster”
by starting the recovery package on the recovery cluster “NYcluster”.
You must follow your site disaster recovery procedure
to ensure that the primary packages on “LAcluster” are not running
and that recovery on “NYcluster” is necessary. Continuing
with this command while the applications are running on the
primary cluster may result in data corruption.

Are you sure that the primary packages are not running and will
not come back, and are you certain that you want to start the
recovery packages? [Y/N]

Reply “Y” to proceed only if you are certain that recovery should take place. After you reply “Y”, messages like the following appear as each recovery group is processed (the messages about the data receiver package appear only when logical data replication with data sender and receiver packages is used):

Processing the recovery group nfsgroup on recovery cluster eastcoast
Disabling switching for data receiver package nfsreceiverpkg on recovery
cluster eastcoast
Halting data receiver package nfsreceiverpkg on recovery cluster eastcoast
Starting recovery package nfsbackuppkg on recovery cluster eastcoast
enabling package nfsbackuppkg in cluster eastcoast

-----------------
exit status = 0
-----------------

The cmrecovercl command starts up all the recovery packages that are configured in the recovery groups. Instead of starting all the recovery packages at once, you can recover an individual recovery group by using the following command:

# cmrecovercl -g Recovery_Group_Name

Running cmrecovercl with the -g option starts up only the recovery package configured in the specified recovery group.
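The three recovery scopes described in this section (all recovery groups, one primary cluster, one recovery group) can be laid side by side in a short dry-run sketch. It prints the invocation rather than running it; the function name and the scope keywords are hypothetical. Whether the options can be combined is not stated here, so the sketch deliberately emits one option at a time.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the cmrecovercl invocation for a scope.
# Usage: recovery_scope_command all
#        recovery_scope_command cluster <PrimaryClusterName>
#        recovery_scope_command group <Recovery_Group_Name>
recovery_scope_command() {
    scope=${1:-}
    name=${2:-}

    case "$scope" in
        all)
            echo "cmrecovercl"
            ;;
        cluster|group)
            if [ -z "$name" ]; then
                echo "a $scope name is required" >&2
                return 1
            fi
            if [ "$scope" = "cluster" ]; then
                echo "cmrecovercl -c $name"
            else
                echo "cmrecovercl -g $name"
            fi
            ;;
        *)
            echo "usage: recovery_scope_command all|cluster <name>|group <name>" >&2
            return 1
            ;;
    esac
}

recovery_scope_command group nfsgroup   # prints: cmrecovercl -g nfsgroup
```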

NOTE: After the cmrecovercl command is issued, there is a delay of at least 90 seconds per recovery group as the command makes sure that the package is not active on another cluster.
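As a rough planning aid, the minimum time the command spends on these checks can be estimated from the number of recovery groups. The function below is a hypothetical sketch that assumes exactly 90 seconds per group; since the delay is stated as "at least" 90 seconds, the result is a lower bound only.

```shell
#!/bin/sh
# Hypothetical helper: lower bound, in seconds, on the delay introduced by
# the per-group check (assumes the minimum of 90 s for every group).
# Usage: min_recovery_delay <number_of_recovery_groups>
min_recovery_delay() {
    groups=$1
    echo $((groups * 90))
}

min_recovery_delay 4   # prints 360 (at least six minutes for four groups)
```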

Use the cmviewcl command on the local cluster to confirm that the recovery packages are running correctly. Following recovery, you may halt the package that was monitoring the remote cluster. If you do not, you will continue to receive notifications whenever the remote cluster’s state changes. The following table shows the status of Continentalclusters packages after recovery has taken place and applications are running on the local cluster.

Table 2-6 Status of Continentalclusters Packages After Recovery

                                   Primary Cluster                          Recovery Cluster
Data Replication Method            Primary  Data Sender  Optional Monitor   Recovery  Data Receiver  Required Monitor
                                   Package  Package      Package            Package   Package        Package

Physical - Symmetrix               Halted   Not used     Halted or Running  Running   Not used       Halted or Running
Physical - XP Series               Halted   Not used     Halted or Running  Running   Not used       Halted or Running
Physical - EVA Series              Halted   Not used     Halted or Running  Running   Not used       Halted or Running
Logical - Oracle Standby Database  Halted   Not used     Halted or Running  Running   Halted         Halted or Running

How the cmrecovercl Command Works

The cmrecovercl command uses the configuration file to loop through each defined recovery group of the remote cluster to be recovered. For each group, the command communicates with the monitor package (ccmonpkg) and verifies that the remote cluster is unreachable or down. If the group has a data replication package, the command halts it; the command then enables the recovery package on the recovery cluster. The recovery package can then start up on the appropriate node of the local cluster, as determined by the FAILOVER_POLICY configured for the package.

The process continues with the next recovery group even if there are problems with one recovery group. After processing a recovery group, if the command discovers that the remote cluster is back up, the command exits, since the alarm or alert state no longer exists. This prevents the primary packages on the remote cluster and the recovery packages on the local cluster from running at the same time, which would result in data corruption.
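The loop can be pictured with a small simulation. This is only a sketch of the behavior described in this section, not the command's implementation: it runs no cluster commands, the argument format and function name are invented for illustration, and the check that ends the loop is modeled as a per-group test of the remote cluster's state. The names dbgroup and webgroup are made up.

```shell
#!/bin/sh
# Hypothetical simulation of the per-group loop; nothing here is a real
# Continentalclusters interface. Each argument describes one recovery group:
#   group:receiver:remote:result
#     receiver  data receiver package name, or "-" if the group has none
#     remote    simulated remote cluster state when the group is reached:
#               "down" or "up"
#     result    simulated outcome of enabling the recovery package:
#               "ok" or "fail"
simulate_recovery() {
    for entry in "$@"; do
        # Split the four colon-separated fields.
        group=${entry%%:*};   rest=${entry#*:}
        receiver=${rest%%:*}; rest=${rest#*:}
        remote=${rest%%:*};   result=${rest#*:}

        # Stop as soon as the remote cluster is seen to be up again.
        if [ "$remote" = "up" ]; then
            echo "remote cluster is up; stopping before $group"
            return 0
        fi

        # Halt the data replication package, if the group has one.
        if [ "$receiver" != "-" ]; then
            echo "halt $receiver"
        fi

        if [ "$result" = "ok" ]; then
            echo "enable recovery package for $group"
        else
            # A problem with one group does not stop the loop.
            echo "problem with $group; continuing"
        fi
    done
}

simulate_recovery "nfsgroup:nfsreceiverpkg:down:ok" \
                  "dbgroup:-:down:fail" \
                  "webgroup:-:up:ok"
```

In this run the first group is recovered, the second reports a problem without ending the loop, and the third is never processed because the remote cluster has come back up.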

NOTE: If the remote cluster comes back up following a cluster event but the primary packages cannot run, halt the primary cluster with the cmhaltcl command, then issue cmrecovercl with the -f option.
© Hewlett-Packard Development Company, L.P.