Designing Disaster Tolerant High Availability Clusters: > Chapter 5 Building Disaster-Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP

Completing and Running a Continental Cluster Solution with Continuous Access XP


The following section describes how to configure a continental cluster solution using Continuous Access XP, which requires the Metrocluster CA product.

Setting up a Primary Package on the Primary Cluster

Use the procedures in this section to configure a primary package on the primary cluster. Consult the Serviceguard documentation for more detailed instructions on setting up Serviceguard with packages, and for instructions on how to start, halt, and move packages and their services between nodes in a cluster.

NOTE: Neither the primary cluster nor the recovery cluster may configure an XP series paired volume, PVOL or SVOL, as a cluster lock disk. A cluster lock disk must always be writable. Since it cannot be guaranteed that either half of a paired volume is always writable, neither half may be used as a cluster lock disk. A configuration with a cluster lock disk that is part of a paired volume is not a supported configuration.
  1. Create and test a standard Serviceguard cluster using the procedures described in the Managing Serviceguard user’s guide.

  2. Install Continentalclusters on all the cluster nodes in the primary cluster (skip this step if the software has been preinstalled).

    NOTE: Serviceguard should already be installed on all the cluster nodes.

    Run swinstall(1m) to install Continentalclusters and Metrocluster Continuous Access (CA) products from an SD depot.

  3. When swinstall(1m) has completed, create a directory as follows for the new package in the primary cluster.

    # mkdir /etc/cmcluster/<package_name>

    Create a Serviceguard package configuration file in the primary cluster.

    # cd /etc/cmcluster/<package_name>

    # cmmakepkg -p <package_name>.ascii

    Customize it as appropriate to your application. Be sure to include the pathname of the control script (/etc/cmcluster/<package_name>/<package_name>.cntl) for the RUN_SCRIPT and HALT_SCRIPT parameters. Set the AUTO_RUN flag to NO so that the package does not start when the cluster starts. Use cmmodpkg to enable package switching on the primary packages only after they have started. If package switching were enabled in the package configuration, the primary package would start automatically when the cluster starts. That is unsafe after a primary cluster disaster: if the recovery package is running on the recovery cluster, the primary package must not be started until the recovery package has been stopped.

  4. Create a package control script.

    # cmmakepkg -s pkgname.cntl

    Customize the control script as appropriate to your application using the guidelines in the Managing Serviceguard user’s guide. Standard Serviceguard package customizations include modifying the VG, LV, FS, IP, SUBNET, SERVICE_NAME, SERVICE_CMD, and SERVICE_RESTART parameters. Set LV_UMOUNT_COUNT to 1 or greater.

  5. Add customer-defined run and halt commands in the appropriate places according to the needs of the application. See the Managing Serviceguard user’s guide for more information on these functions.

  6. Copy the environment file template /opt/cmcluster/toolkit/SGCA/xpca.env to the package directory, naming it pkgname_xpca.env.

    # cp /opt/cmcluster/toolkit/SGCA/xpca.env \

      /etc/cmcluster/pkgname/pkgname_xpca.env

  7. Edit the environment file <pkgname>_xpca.env as follows:

    1. If necessary, add the path where the Raid Manager software binaries have been installed to the PATH environment variable. If the software is in the usual location, /usr/bin, you can just uncomment the line in the script.

    2. Uncomment the behavioral configuration environment variables starting with AUTO_. It is recommended that you retain the default values of these variables unless you have a specific business requirement to change them. See Appendix A for an explanation of these variables.

    3. Uncomment the PKGDIR variable and set it to the full path name of the directory where the control script has been placed. This directory, which is used for status data files, must be unique for each package. For example, set PKGDIR to /etc/cmcluster/package_name, removing any quotes around the file names.

    4. Uncomment the DEVICE_GROUP variable and set it to this package’s Raid Manager device group name, as specified in the Raid Manager configuration file.

    5. Uncomment the HORCMPERM variable and use the default value MGRNOINST if the Raid Manager protection facility is not used or is disabled. If the Raid Manager protection facility is enabled, set the variable to the name of the HORCM permission file.

    6. Uncomment the HORCMINST variable and set it to the Raid Manager instance name used by Metrocluster/CA.

    7. Uncomment the FENCE variable and set it to ASYNC, NEVER, or DATA according to your business requirements or special Metrocluster requirements. The value is compared with the actual fence level returned by the array.

    8. If using asynchronous data replication, set the HORCTIMEOUT variable to a value greater than the side file timeout value configured with the Service Processor (SVP), but less than the RUN_SCRIPT_TIMEOUT set in the package configuration file. The default setting is the side file timeout value + 60 seconds.

    9. Uncomment the CLUSTER_TYPE variable and set it to CONTINENTAL.

  8. Distribute Metrocluster/CA configuration, environment and control script files to other nodes in the cluster by using ftp or rcp:

    # rcp -p /etc/cmcluster/pkgname/* \

    other_node:/etc/cmcluster/pkgname

    See the example script Samples/ftpit to see how to semi-automate the copy using ftp. This script assumes the package directories already exist on all nodes.

    Using ftp may be preferable at your organization, since it does not require the use of a .rhosts file for root. Root access via .rhosts may create a security issue.

  9. Apply the Serviceguard configuration using the cmapplyconf command or SAM.

  10. Verify that each node in the Serviceguard cluster has the following files in the directory /etc/cmcluster/pkgname:

    pkgname.cntl        Metrocluster/CA package control script
    pkgname_xpca.env    Metrocluster/CA environment file
    pkgname.ascii       Serviceguard package ASCII configuration file
    pkgname.sh          Package monitor shell script, if applicable
    other files         Any other scripts you use to manage Serviceguard packages

    The Serviceguard cluster is ready to automatically switch packages to nodes in remote data centers using Metrocluster/CA.

  11. Edit the file /etc/rc.config.d/raidmgr, specifying the Raid Manager instance to be used for Continentalclusters, and specify that the instance be started at boot time.

    The appropriate Raid Manager instance used by Continentalclusters must be running before the package is started. This normally means the Raid Manager instance must be started before starting Serviceguard.

  12. Using standard Serviceguard commands (cmruncl, cmhaltcl, cmrunpkg, cmhaltpkg), test the primary cluster for cluster and package startup and package failover.

  13. Any running package on the primary cluster that will have a counterpart on the recovery cluster must be halted at this time.
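
The environment file edits in step 7 can be sanity-checked before the files are distributed in step 8. The following is a minimal sketch, not part of the Metrocluster toolkit. It assumes the environment file uses plain NAME=value lines with "#" marking commented-out entries; the function names env_value and validate_xpca_env are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes <pkgname>_xpca.env holds NAME=value lines and that
# a leading "#" means the variable is still commented out.

# Print the value of an uncommented NAME=value line, with quotes removed.
# $1 = environment file, $2 = variable name
env_value() {
    sed -n "s/^[[:space:]]*$2=//p" "$1" | tail -n 1 | tr -d "\"'"
}

# Report variables from step 7 that are missing or hold unexpected values.
# Returns 0 when the file passes every check, 1 otherwise.
validate_xpca_env() {
    file=$1
    status=0
    for name in PKGDIR DEVICE_GROUP HORCMPERM HORCMINST FENCE CLUSTER_TYPE; do
        if [ -z "$(env_value "$file" "$name")" ]; then
            echo "missing or commented out: $name"
            status=1
        fi
    done
    # FENCE must be one of the three fence levels named in step 7.
    case $(env_value "$file" FENCE) in
        ASYNC|NEVER|DATA|"") ;;
        *) echo "FENCE must be ASYNC, NEVER, or DATA"; status=1 ;;
    esac
    # A continental cluster package requires CLUSTER_TYPE=CONTINENTAL.
    ctype=$(env_value "$file" CLUSTER_TYPE)
    if [ -n "$ctype" ] && [ "$ctype" != CONTINENTAL ]; then
        echo "CLUSTER_TYPE must be CONTINENTAL"
        status=1
    fi
    return $status
}
```

For example, validate_xpca_env /etc/cmcluster/pkgname/pkgname_xpca.env prints one line per problem and returns a non-zero status if any check fails.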

Setting up a Recovery Package on the Recovery Cluster

Use the procedures in this section to configure a recovery package on the recovery cluster. Consult the Serviceguard documentation for more detailed instructions on setting up Serviceguard with packages, and for instructions on how to start, halt, and move packages and their services between nodes in a cluster.

NOTE: Neither the primary cluster nor the recovery cluster may configure an XP series paired volume, PVOL or SVOL, as a cluster lock disk. A cluster lock disk must always be writable. Since it cannot be guaranteed that either half of a paired volume is always writable, neither half may be used as a cluster lock disk. A configuration with a cluster lock disk that is part of a paired volume is not a supported configuration.
  1. Create and test a standard Serviceguard cluster using the procedures described in the Managing Serviceguard user’s guide.

  2. Install Continentalclusters on all the cluster nodes in the recovery cluster (skip this step if the software has been preinstalled).

    NOTE: Serviceguard should already be installed on all the cluster nodes.

    Run swinstall(1m) to install Continentalclusters and Metrocluster Continuous Access (CA) products from an SD depot. The toolkit integration scripts, environment file and contributed scripts will reside in the /opt/cmcluster/toolkit/SGCA and /usr/sbin directories.

  3. When swinstall(1m) has completed, create a directory as follows for the new package in the recovery cluster.

    # mkdir /etc/cmcluster/<package_name>

    Create a Serviceguard package configuration file in the recovery cluster.

    # cd /etc/cmcluster/<package_name>

    # cmmakepkg -p <package_name>.ascii


    Customize it as appropriate to your application. Be sure to include the pathname of the control script (/etc/cmcluster/<package_name>/<package_name>.cntl) for the RUN_SCRIPT and HALT_SCRIPT parameters. Set the AUTO_RUN flag to NO so that the package does not start when the cluster starts. Do not use cmmodpkg to enable package switching on any recovery package, because enabling package switching automatically starts the recovery package. The cmrecovercl command sets package switching on a recovery package automatically when it successfully starts that package on the recovery cluster.

  4. Create a package control script.

    # cmmakepkg -s pkgname.cntl

    Customize the control script as appropriate to your application using the guidelines in the Managing Serviceguard user’s guide. Standard Serviceguard package customizations include modifying the VG, LV, FS, IP, SUBNET, SERVICE_NAME, SERVICE_CMD and SERVICE_RESTART parameters. Be sure to set LV_UMOUNT_COUNT to 1 or greater.

    NOTE: Some of the control script variables, such as VG and LV, must be the same on the recovery cluster as on the primary cluster. Others, such as FS, SERVICE_NAME, SERVICE_CMD and SERVICE_RESTART, are probably the same as on the primary cluster. Still others, such as IP and SUBNET, are probably different on the recovery cluster. Review all the variables accordingly.
  5. Add customer-defined run and halt commands in the appropriate places according to the needs of the application. See the Managing Serviceguard user’s guide for more information on these functions.

  6. Copy the environment file template /opt/cmcluster/toolkit/SGCA/xpca.env to the package directory, naming it pkgname_xpca.env.

    # cp /opt/cmcluster/toolkit/SGCA/xpca.env \

      /etc/cmcluster/pkgname/pkgname_xpca.env

  7. Edit the environment file <pkgname>_xpca.env as follows:

    1. If necessary, add the path where the Raid Manager software binaries have been installed to the PATH environment variable. If the software is in the usual location, /usr/bin, you can just uncomment the line in the script.

    2. Uncomment the behavioral configuration environment variables starting with AUTO_. It is recommended that you retain the default values of these variables unless you have a specific business requirement to change them. See Appendix A for an explanation of these variables.

    3. Uncomment the PKGDIR variable and set it to the full path name of the directory where the control script has been placed. This directory, which is used for status data files, must be unique for each package. For example, set PKGDIR to /etc/cmcluster/package_name, removing any quotes around the file names.

    4. Uncomment the DEVICE_GROUP variable and set it to this package’s Raid Manager device group name, as specified in the Raid Manager configuration file.

    5. Uncomment the HORCMPERM variable and use the default value MGRNOINST if the Raid Manager protection facility is not used or is disabled. If the Raid Manager protection facility is enabled, set the variable to the name of the HORCM permission file.

    6. Uncomment the HORCMINST variable and set it to the Raid Manager instance name used by Metrocluster/CA.

    7. Uncomment the FENCE variable and set it to ASYNC, NEVER, or DATA according to your business requirements or special Metrocluster requirements. The value is compared with the actual fence level returned by the array.

    8. If you are using asynchronous data replication, set the HORCTIMEOUT variable to a value greater than the side file timeout value configured with the Service Processor (SVP), but less than the RUN_SCRIPT_TIMEOUT set in the package configuration file. The default setting is the side file timeout value + 60 seconds.

    9. Uncomment the CLUSTER_TYPE variable and set it to CONTINENTAL.

  8. Distribute Metrocluster/CA configuration, environment and control script files to other nodes in the cluster by using ftp or rcp:

    # rcp -p /etc/cmcluster/pkgname/* \

    other_node:/etc/cmcluster/pkgname

    See the example script Samples/ftpit to see how to semi-automate the copy using ftp. This script assumes the package directories already exist on all nodes.

    Using ftp may be preferable at your organization, since it does not require the use of a .rhosts file for root. Root access via .rhosts may create a security issue.

  9. Apply the Serviceguard configuration using the cmapplyconf command or SAM.

  10. Verify that each node in the Serviceguard cluster has the following files in the directory /etc/cmcluster/pkgname:

    bkpkgname.cntl        Metrocluster/CA package control script
    bkpkgname_xpca.env    Metrocluster/CA environment file
    bkpkgname.ascii       Serviceguard package ASCII configuration file
    bkpkgname.sh          Package monitor shell script, if applicable
    other files           Any other scripts you use to manage Serviceguard packages

  11. Edit the file /etc/rc.config.d/raidmgr, specifying the Raid Manager instance to be used for Continentalclusters, and specify that the instance be started at boot time.

    NOTE: The appropriate Raid Manager instance used by Continentalclusters must be running before the package is started. This normally means that the Raid Manager instance must be started before Serviceguard is started.
  12. Make sure the packages on the primary cluster are not running. Using standard Serviceguard commands (cmruncl, cmhaltcl, cmrunpkg, cmhaltpkg) test the recovery cluster for cluster and package startup and package failover.

  13. Any running package on the recovery cluster that has a counterpart on the primary cluster should be halted at this time.
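
The NOTE in step 4 requires the VG and LV settings in the recovery control script to match those in the primary control script. The following sketch shows one way to compare them. It assumes that copies of both control scripts are available on one host and that the scripts use the usual VG[n]=value and LV[n]=value array assignments. The function names are hypothetical, and grep -o is assumed to be available.

```shell
#!/bin/sh
# Sketch only: compares VG[n] and LV[n] assignments in two control scripts.

# Print the uncommented VG[n]=... and LV[n]=... assignments, one per line,
# with quotes removed and sorted so that ordering does not matter.
# $1 = package control script
vg_lv_settings() {
    grep -v '^[[:space:]]*#' "$1" |
        grep -Eo '(VG|LV)\[[0-9]+\]=[^;[:space:]]*' |
        tr -d '"' |
        sort
}

# Succeed when both control scripts define the same VG and LV entries.
# $1 = primary control script, $2 = recovery control script
same_vg_lv() {
    [ "$(vg_lv_settings "$1")" = "$(vg_lv_settings "$2")" ]
}
```

Variables such as IP and SUBNET are deliberately ignored, since they are expected to differ between the two clusters.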

Setting up the Continental Cluster Configuration

The steps below are the basic procedure for setting up the Continentalclusters configuration file and the monitoring packages on the two clusters. For complete details on creating and editing the configuration file, refer to Chapter 4 “Designing a Continental Cluster”.

  1. Generate the Continentalclusters configuration.

    # cmqueryconcl -C cmconcl.config

  2. Edit the configuration file cmconcl.config with the names of the two clusters, the nodes in each cluster, the recovery groups and the monitoring definitions. The recovery groups define the primary and recovery packages. When data replication is done using Continuous Access XP, there are no data sender and receiver packages.

    Define the monitoring parameters, the notification mechanism (ITO, email, console, SNMP, syslog or tcp) and notification type (alert or alarm) based on the cluster status (unknown, down, up or error). Descriptions for these can be found in the configuration file generated in the previous step.

  3. Edit the continental cluster security file /etc/opt/cmom/cmomhosts to allow or deny hosts read access by the monitor software.

  4. On all nodes in both clusters, copy the monitor package files from /opt/cmconcl/scripts to /etc/cmcluster/ccmonpkg. Edit the monitor package configuration as needed in the file /etc/cmcluster/ccmonpkg/ccmonpkg.config. Set the AUTO_RUN flag to YES. This is in contrast to the flag setting for the application packages: the monitor package should start automatically when the cluster is formed.

  5. Apply the monitor package to both cluster configurations.

    # cmapplyconf -P /etc/cmcluster/ccmonpkg/ccmonpkg.config

  6. Apply the continental cluster configuration file using cmapplyconcl. Files are placed in /etc/cmconcl/instances. There is no change to /etc/cmcluster/cmclconfig nor is there an equivalent file for Continentalclusters.

    # cmapplyconcl -C cmconcl.config

  7. Start the monitor package on both clusters.

    NOTE: The monitor package for a cluster checks the status of the other cluster and issues alerts and alarms, as defined in the Continentalclusters configuration file, based on the other cluster’s status.
  8. Check /var/adm/syslog/syslog.log for messages. Also check the ccmonpkg package log file.

  9. Start the primary packages on the primary cluster using cmrunpkg. Test local failover within the primary cluster.

  10. View the status of the Continentalclusters primary and recovery clusters, including configured event data.

    # cmviewconcl -v

The continental cluster is ready for testing. (See “Testing the Continental Cluster”.)
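
The AUTO_RUN settings are easy to mix up: the application packages use NO and the monitor package uses YES. The following sketch reads the flag from a package configuration file. It assumes the file holds whitespace-separated "AUTO_RUN <value>" lines with "#" starting a comment; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: reads AUTO_RUN from a Serviceguard package configuration file.

# Print the last uncommented AUTO_RUN value found in the file.
# $1 = package configuration file
auto_run_value() {
    awk '$1 == "AUTO_RUN" { value = $2 } END { print value }' "$1"
}

# Succeed when the AUTO_RUN setting equals the expected value.
# $1 = package configuration file, $2 = expected value (YES or NO)
expect_auto_run() {
    [ "$(auto_run_value "$1")" = "$2" ]
}
```

For example, expect_auto_run /etc/cmcluster/ccmonpkg/ccmonpkg.config YES should succeed for the monitor package, while the same check with NO should succeed for each primary and recovery package configuration file.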

Switching to the Recovery Cluster in Case of Disaster

It is vital the administrator verify that recovery is needed after receiving a cluster alert or alarm. Network failures may produce false alarms. After validating a failure, start the recovery process using the cmrecovercl [-f] command. Note the following:

  • During an alert, cmrecovercl will not start the recovery packages unless the -f option is used.

  • During an alarm, cmrecovercl will start the recovery packages without the -f option.

  • When there is neither an alert nor an alarm condition, cmrecovercl cannot start the recovery packages on the recovery cluster. This condition applies not only when no alert or alarm was issued, but also applies to the situation where there was an alert or alarm, but the primary cluster recovered and its current status is Up.
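
The three cases above can be summarized as a small decision helper. This is a sketch for operator runbooks, not part of Continentalclusters. The condition labels alert, alarm and none and the function name recovery_command are chosen for illustration. The helper only prints the command, so the administrator still decides whether to run it after verifying the failure.

```shell
#!/bin/sh
# Sketch only: maps the monitor condition to the cmrecovercl invocation
# described in the list above. It prints the command; it does not run it.

# $1 = condition reported for the primary cluster: alert, alarm, or none
recovery_command() {
    case $1 in
        alarm) echo "cmrecovercl" ;;       # alarm: -f is not needed
        alert) echo "cmrecovercl -f" ;;    # alert: recovery must be forced
        *)
            echo "no alert or alarm: cmrecovercl cannot start recovery packages" >&2
            return 1
            ;;
    esac
}
```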

Failback Scenarios

The goal of HP Continentalclusters is to maximize system and application availability. However, even systems configured with Continentalclusters can experience hardware failures at the primary site or the recovery site, as well as failures of the hardware or network connecting the two sites. The following discussion addresses some of those failures and suggests recovery approaches applicable to environments using data replication provided by HP StorageWorks XP series disk arrays and Continuous Access (CA). For a discussion of failback mechanisms and methodologies, see “Restoring Disaster Tolerance” in Chapter 4 “Designing a Continental Cluster”.

Scenario 1

The primary site has lost power, including backup power (UPS), to both the systems and disk arrays that make up the Serviceguard Cluster at the primary site. There is no loss of data on either the XP disk array or the operating systems of the systems at the primary site.

Scenario 2

The primary site XP disk array experienced a catastrophic hardware failure and all data was lost on the array.

Failback in Scenarios 1 and 2

After receiving the Continentalclusters alerts and alarm, the administrators at the recovery site follow the prescribed processes and recovery procedures to start the protected applications on the recovery cluster. Each Continentalclusters package control script that invokes Metrocluster CA XP evaluates the status of the XP paired volumes. Since neither the systems nor the XP disk array at the primary site are accessible, the control script initially reports the paired volumes with a local status of SVOL_PAIR or SVOL_PSUE (in ASYNC mode) and a remote status of EX_ENORMT, PSUE or PSUS, indicating an error accessing the primary site. The control script is programmed to handle this condition: it enables the volume groups, mounts the logical volumes, assigns floating IP addresses and starts any processes coded into the script.

NOTE: In ASYNC mode, the package will halt unless a force flag is present or unless the auto variable AUTO_SVOLPSUE is set to 1.

The fence level of the paired volume (NEVER, ASYNC, or DATA) does not affect the starting of the packages at the recovery site. The Metrocluster CA XP pre-integrated solution runs the following command on the paired volume:

# horctakeover -g <dev-grp-name> -S

Subsequently, the paired volume will have a status of SVOL_SWSS. To view the local status of the paired volumes:

# pairvolchk -g <dev-grp-name> -s

To view the remote status of the paired volumes:

# pairvolchk -g <dev-grp-name> -c

(While the remote XP disk array and primary cluster systems are down, the command will time out with an error code of 242.)

After power is restored to the primary site, or when a newly configured array is brought online, the XP paired volumes may have either a status of PVOL_PSUE on the primary site or SVOL_SWSS on the secondary site. The following procedure applies to this situation:

  1. While the package is still running, run the following command from the recovery host:

    # pairresync -g <dgname> -c 15 -swaps

    This starts the resynchronization, which can take a long time if the entire primary disk array was lost or a short time if the primary array was intact at the time of failover.

  2. When resynchronization is complete, halt the Continentalclusters recovery packages at the recovery site.

    # cmhaltpkg <pkg_name>

    This will halt any applications, remove any floating IP addresses, unmount file systems and deactivate volume groups as programmed into the package control files. The status of the paired volumes will remain SVOL_PAIR at the recovery site and PVOL_PAIR at the primary site.

  3. Start the cluster at the primary site. Assuming they have been properly configured, the Continentalclusters primary packages should not start. The monitor package should start automatically.

  4. Manually start the Continentalclusters primary packages at the primary site.

    # cmrunpkg <pkg_name>

  5. Ensure that the monitor packages at the primary and recovery sites are running.

Failback when the Primary has SMPL Status

Use the following procedure when the primary site paired volumes have a status set to SMPL, possibly through manual intervention:

  1. Halt the Continentalclusters recovery packages at the recovery site.

    # cmhaltpkg <pkg_name>

    This will halt any applications, remove any floating IP addresses, unmount file systems and deactivate volume groups as programmed into the package control files. The status of the paired volumes will remain SMPL at the recovery site and PSUE at the primary site.

  2. Start the cluster at the primary site. Assuming they have been properly configured, the Continentalclusters primary packages should not start. The monitor package should start automatically.

  3. Since the paired volumes have a status of SMPL at both the primary and recovery sites, the XP views the two halves as unmirrored. From a system at the primary site, manually create the paired volume.

    # paircreate -g <dev-grp-name> -f <fence-level> -vr -c 15

    See the XP Raid Manager user’s guide for more paircreate command options.

    Since the most current data will be at the remote or recovery site, this will synchronize the data from the remote or recovery site (use of the -vr option directs the command to synchronize from the remote site). Wait for the synchronization process to complete before proceeding to the next step. Failure to wait for the synchronization to complete will result in the package failing to start in the next step.

  4. Manually start the Continentalclusters primary packages at the primary site.

    # cmrunpkg <pkg_name>

    The control script is programmed to handle this case. The control script recognizes that the paired volume is synchronized and will proceed with the programmed package startup.

  5. Ensure that monitor packages are running at both sites.
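
Both failback procedures depend on waiting until resynchronization has finished before the next step. The following is a minimal polling sketch. The function name wait_for_status is hypothetical, and the command that reports the pair status is supplied by the caller. How that status is obtained (for example, by wrapping a Raid Manager status command) is an assumption left to the site and is not shown here.

```shell
#!/bin/sh
# Sketch only: polls a caller-supplied status command until it prints the
# wanted status or the attempt limit is reached.

# $1 = wanted status string (for example PAIR)
# $2 = maximum number of attempts
# $3 = seconds to sleep between attempts
# remaining arguments = command that prints the current status on stdout
wait_for_status() {
    wanted=$1
    attempts=$2
    interval=$3
    shift 3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if [ "$("$@")" = "$wanted" ]; then
            return 0    # status reached: safe to continue with the next step
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1            # gave up: do not start or halt packages yet
}
```

A caller might use wait_for_status PAIR 120 30 my_pair_status, where my_pair_status is a site-specific function that prints the current pair state, and proceed to cmrunpkg or cmhaltpkg only if the call succeeds.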

Maintaining the Continuous Access XP Data Replication Environment

Resynchronizing

After certain failures, data are no longer remotely protected. In order to restore disaster-tolerant data protection after repairing or recovering from the failure, you must manually run the command pairresync. This command must successfully complete for disaster-tolerant data protection to be restored. Following is a partial list of failures that require running pairresync to restore disaster-tolerant data protection:

  • failure of ALL CA links without restart of the application

  • failure of ALL CA links with Fence Level DATA with restart of the application on a primary host

  • failure of the entire recovery Data Center for a given application package

  • failure of the recovery XP disk array for a given application package while the application is running on a primary host

Following is a partial list of failures that require full resynchronization to restore disaster-tolerant data protection. Full resynchronization is automatically initiated by moving the application package back to its primary host after repairing the failure.

  • failure of the entire primary Data Center for a given application package

  • failure of all of the primary hosts for a given application package

  • failure of the primary XP disk array for a given application package

  • failure of all CA links with application restart on a secondary host

NOTE: The preceding steps are automated provided the default value of 1 is being used for the auto variable AUTO_PSUEPSUS. Once the CA link failure has been fixed, the user only needs to halt the package at the failover site and restart on the primary site. However, if you want to reduce the amount of application downtime, you should manually invoke pairresync before failback.

Full resynchronization must be manually initiated (as described in the next section) after repairing the following failures:

  • failure of the recovery XP disk array for a given application package followed by application startup on a primary host

  • failure of all CA links with Fence Level NEVER or ASYNC with restart of the application on a primary host

Pairs must be manually recreated if both the primary and recovery XP disk arrays are in the SMPL (simplex) state.

Make sure you periodically review the following files for messages, warnings and recommended actions. It is recommended to review these files after system, data center and/or application failures:

  • /var/adm/syslog/syslog.log

  • /etc/cmcluster/<package-name>/<package-name>.log

  • /etc/cmcluster/<bkpackage-name>/<bkpackage-name>.log

Using the pairresync Command

The pairresync command can be used with special options after a failover in which the recovery site has started the application and has processed transaction data on the disk at the recovery site, but the disks on the primary site are intact. After the CA link is fixed, depending on which site you are on, use the pairresync command in one of the following two ways:

  • pairresync -swapp—from the primary site.

  • pairresync -swaps—from the failover site.

These options take advantage of the fact that the recovery site maintains a bit-map of the modified data sectors on the recovery array. Either version of the command will swap the personalities of the volumes, with the PVOL becoming the SVOL and the SVOL becoming the PVOL. With the personalities swapped, any data that has been written to the volume on the failover site (now PVOL) is then copied back to the SVOL, which is now located at the primary site. During this time the package continues running on the failover site. After resynchronization is complete, you can halt the package on the failover site, and restart it on the primary site. Metrocluster will then swap the personalities between the PVOL and the SVOL, returning PVOL status to the primary site.
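
The choice between the two options depends only on where the command is run. The sketch below prints the matching command line. The function name resync_command and the site labels are hypothetical. The -g and -c 15 arguments follow the pairresync example in the failback procedure, and using the same arguments with -swapp is an assumption.

```shell
#!/bin/sh
# Sketch only: prints the pairresync command for the site it is run from.
# It does not execute pairresync.

# $1 = site running the command: primary or failover
# $2 = Raid Manager device group name
resync_command() {
    case $1 in
        primary)  echo "pairresync -g $2 -c 15 -swapp" ;;
        failover) echo "pairresync -g $2 -c 15 -swaps" ;;
        *)
            echo "unknown site: $1" >&2
            return 1
            ;;
    esac
}
```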

Some Further Points

  • This toolkit may increase package startup time by 5 minutes or more. Packages with many disk devices will take longer to start up than those with fewer devices due to the time needed to get device status from the XP disk array or to synchronize.

    NOTE: Long delays in package startup time will occur in those situations when recovering from broken pair affinity.
  • The value of RUN_SCRIPT_TIMEOUT in the package ASCII file should be set to NO_TIMEOUT or to a value large enough to allow for the extra startup time needed to get status from the XP disk array. (See the previous point for more information on the extra startup time.)

  • Online cluster configuration changes may require a Raid Manager configuration file to be changed. Whenever the configuration file is changed, the Raid Manager instance must be stopped and restarted. The Raid Manager instance must be running before any Continentalclusters package movement occurs.

  • A given file system must not reside on more than one XP frame for either the PVOL or the SVOL. A given LVM Logical Volume (LV) must not reside on more than one XP frame for either the PVOL or the SVOL.

  • The application is responsible for data integrity, and must use the O_SYNC flag when ordering of I/Os is important. Most relational database products are examples of applications that ensure data integrity by using the O_SYNC flag.

  • Each host must be connected to only the XP disk array that contains either the PVOL or the SVOL. A given host must not be connected to both the PVOL and the SVOL of a continuous access pair.
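
The timeout guidance in this chapter reduces to a simple ordering: for asynchronous replication, HORCTIMEOUT must be greater than the side file timeout and less than RUN_SCRIPT_TIMEOUT, unless RUN_SCRIPT_TIMEOUT is NO_TIMEOUT. The sketch below checks that ordering. The function names are hypothetical and all values are assumed to be whole seconds.

```shell
#!/bin/sh
# Sketch only: checks side file timeout < HORCTIMEOUT < RUN_SCRIPT_TIMEOUT.

# Default described in the text: side file timeout plus 60 seconds.
# $1 = side file timeout in seconds
default_horctimeout() {
    echo $(( $1 + 60 ))
}

# Succeed when HORCTIMEOUT lies strictly between the two bounds.
# $1 = HORCTIMEOUT, $2 = side file timeout,
# $3 = RUN_SCRIPT_TIMEOUT in seconds, or NO_TIMEOUT for no upper bound
horctimeout_ok() {
    horctimeout=$1
    sidefile=$2
    run_timeout=$3
    [ "$horctimeout" -gt "$sidefile" ] || return 1
    if [ "$run_timeout" = NO_TIMEOUT ]; then
        return 0
    fi
    [ "$horctimeout" -lt "$run_timeout" ]
}
```

For example, with a side file timeout of 300 seconds the default HORCTIMEOUT is 360, which passes the check against a RUN_SCRIPT_TIMEOUT of 600 but would fail against one of 360 or less.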

© Hewlett-Packard Development Company, L.P.