VERITAS Volume Manager 3.1 Administrator's Guide: for HP-UX 11i and HP-UX 11i Version 1.5 > Chapter 8 Recovery

Detecting and Replacing Failed Disks

This section describes how to detect disk failures and replace failed disks. It begins with the hot-relocation feature, which automatically attempts to restore redundant Volume Manager objects when a failure occurs.

Hot-Relocation

NOTE: You may need an additional license to use this feature.

Hot-relocation automatically reacts to I/O failures on redundant (mirrored or RAID-5) Volume Manager objects and restores redundancy and access to those objects. The Volume Manager detects I/O failures on objects and relocates the affected subdisks to disks designated as spare disks and/or free space within the disk group. Volume Manager then reconstructs the objects that existed before the failure and makes them redundant and accessible again. See “Hot-Relocation” for more information.

NOTE: Hot-relocation is only performed for redundant (mirrored or RAID-5) subdisks on a failed disk. Non-redundant subdisks on a failed disk are not relocated, but the system administrator is notified of their failure.

Hot-relocation is enabled by default and goes into effect without system administrator intervention when a failure occurs. The vxrelocd hot-relocation daemon detects and reacts to Volume Manager events that signify the following types of failures:

  • disk failure—this is normally detected as a result of an I/O failure from a Volume Manager object. Volume Manager attempts to correct the error. If the error cannot be corrected, Volume Manager tries to access configuration information in the private region of the disk. If it cannot access the private region, it considers the disk failed.

  • plex failure—this is normally detected as a result of an uncorrectable I/O error in the plex (which affects subdisks within the plex). For mirrored volumes, the plex is detached.

  • RAID-5 subdisk failure—this is normally detected as a result of an uncorrectable I/O error. The subdisk is detached.

When such a failure is detected, the vxrelocd daemon informs the system administrator by electronic mail of the failure and which Volume Manager objects are affected. The vxrelocd daemon then determines which subdisks (if any) can be relocated. If relocation is possible, the vxrelocd daemon finds suitable relocation space and relocates the subdisks.

Hot-relocation space is chosen from the disks reserved for hot-relocation in the disk group where the failure occurred. If no spare disks are available or additional space is needed, free space in the same disk group is used. Once the subdisks are relocated, each relocated subdisk is reattached to its plex.

Finally, the vxrelocd daemon initiates appropriate recovery procedures. For example, recovery includes mirror resynchronization for mirrored volumes or data recovery for RAID-5 volumes. The system administrator is notified of the hot-relocation and recovery actions taken.

If relocation is not possible, the system administrator is notified and no further action is taken. Relocation is not possible in the following cases:

  • If subdisks are not redundant (that is, they do not belong to mirrored or RAID-5 volumes), they cannot be relocated.

  • If enough space is not available (from spare disks and free space) in the disk group, failing subdisks cannot be relocated.

  • If the only available space is on a disk that already contains a mirror of the failing plex, the subdisks in that plex cannot be relocated.

  • If the only available space is on a disk that already contains the RAID-5 log plex or one of the healthy subdisks of the RAID-5 plex, the failing subdisk in the RAID-5 plex cannot be relocated.

  • If a mirrored volume has a Dirty Region Logging log subdisk as part of its data plex, subdisks belonging to that plex cannot be relocated.

  • If a RAID-5 volume log plex or a mirrored volume DRL log plex fails, a new log plex is created elsewhere (so the log plex is not actually relocated).

You can prepare for hot-relocation by designating one or more disks per disk group as hot-relocation spares. For information on how to designate a disk as a spare, see “Placing Disks Under Volume Manager Control”. If no spares are available at the time of a failure or if there is not enough space on the spares, free space is automatically used.

By designating spare disks, you have control over which space is used for relocation in the event of a failure. If the combined free space and space on spare disks is not sufficient or does not meet the redundancy constraints, the subdisks are not relocated.

After a successful relocation occurs, you need to remove and replace the failed disk (see “Replacing Disks”). Depending on the locations of the relocated subdisks, you can choose to move the relocated subdisks elsewhere after hot-relocation occurs (see “Moving Relocated Subdisks”).

Modifying the vxrelocd Process

Hot-relocation is turned on as long as the vxrelocd process is running. As a rule, leave hot-relocation turned on to take advantage of this feature if a failure occurs. However, if you choose to disable this feature (for example, because you do not want the free space on some of your disks used for relocation), you must prevent the vxrelocd process from starting at system startup time.

You can stop hot-relocation at any time by killing the vxrelocd process (this should not be done while a hot-relocation attempt is in progress).

You can make some minor changes to the way the vxrelocd process behaves by either editing the vxrelocd line in the startup file that invokes the vxrelocd process (/sbin/rc2.d/S095vxvm-recover) or killing the existing vxrelocd process and restarting it with different options. After making changes to the way the vxrelocd process is invoked in the startup file, reboot the system so that the changes go into effect. If you choose to kill and restart the daemon instead, make sure that hot-relocation is not in progress when you kill the vxrelocd process. Also restart the daemon immediately so that hot-relocation can take effect if a failure occurs.

You can alter the vxrelocd process in the following ways:

  • By default, the vxrelocd process sends electronic mail to root when failures are detected and relocation actions are performed. You can instruct the vxrelocd process to notify additional users by adding the appropriate user names and invoking the vxrelocd process using the following command:

    		# vxrelocd root user_name1 user_name2 &

  • To reduce the impact of recovery on system performance, you can instruct the vxrelocd process to increase the delay between the recovery of each region of the volume using the following command:

    		# vxrelocd -o slow[=IOdelay] root & 

    where the optional IOdelay indicates the desired delay (in milliseconds). The default value for the delay is 250 milliseconds. See the vxrelocd(1M) manual page for more information.
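The two invocations above can be combined when the daemon is restarted by hand. The following is only a sketch: the helper name vxrelocd_cmd is hypothetical, and it assumes that the -o slow option and a list of mail recipients can be given on the same command line (check the vxrelocd(1M) manual page before relying on this). It prints the command line instead of running it, so that the result can be reviewed first.

```shell
# vxrelocd_cmd DELAY_MS USER...
# Hypothetical helper: print (do not run) a vxrelocd command line.
# Pass an empty string as DELAY_MS to omit the -o slow option.
# Assumption: -o slow=N and several recipients may be combined.
vxrelocd_cmd() {
    delay=$1
    shift
    cmd="vxrelocd"
    if [ -n "$delay" ]; then
        cmd="$cmd -o slow=$delay"
    fi
    for user in "$@"; do
        cmd="$cmd $user"
    done
    printf '%s &\n' "$cmd"
}

# Example: vxrelocd_cmd 500 root user_name1
```

Printing the command keeps the sketch safe to try on a system where hot-relocation might be in progress.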

Displaying Spare Disk Information

Use the vxdg spare command to display information about all of the spare disks available for relocation. The following is an example of the output:

GROUP        DISK         DEVICE       TAG          OFFSET    LENGTH    FLAGS
rootdg       disk02       c0t2d0       c0t2d0       0         658007    s

In this example, disk02 is the only disk designated as a spare. The LENGTH field indicates how much spare space is currently available on this disk for relocation.

To display information about disks that are currently designated as spares, use the following commands:

  • vxdisk list—lists disk information and displays spare disks with a SPARE flag.

  • vxprint—lists disk and other information and displays spare disks with a SPARE flag.
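When several disks are designated as spares, it can help to total the LENGTH column of the vxdg spare listing for each disk group. The following is a minimal sketch, assuming the column layout shown in the example above (GROUP in the first field, LENGTH in the sixth). The function name spare_space_by_group is hypothetical and is not part of Volume Manager.

```shell
# spare_space_by_group
# Hypothetical helper: read "vxdg spare" output on standard input and
# print the total of the LENGTH column for each disk group.
# Assumption: fields are GROUP DISK DEVICE TAG OFFSET LENGTH FLAGS.
spare_space_by_group() {
    awk '$1 != "GROUP" && NF >= 6 { total[$1] += $6 }
         END { for (g in total) print g, total[g] }'
}

# Typical use on a Volume Manager host:
#   vxdg spare | spare_space_by_group
```

The totals are reported in the same units as the LENGTH column. The order in which disk groups are printed is not defined.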

Moving Relocated Subdisks

When hot-relocation occurs, subdisks are relocated to spare disks and/or available free space within the disk group. The new subdisk locations may not provide the same performance or data layout that existed before hot-relocation took place. To improve performance, move the relocated subdisks (after hot-relocation is complete).

You can also move the relocated subdisks off the spare disk(s) to keep the spare disk space free for future hot-relocation needs. Another reason for moving subdisks is to recreate the configuration that existed before hot-relocation occurred.

During hot-relocation, an email message is sent to root, as shown in the following example:

To: root
Subject: Volume Manager failures on host teal
Attempting to relocate subdisk disk02-03 from plex home-02.
Dev_offset 0 length 1164 dm_name disk02 da_name c0t5d0.
The available plex home-01 will be used to recover the data.

This message contains information about the subdisk before relocation that can be used to decide where to move the subdisk after relocation.
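If these messages are saved to a file, the original location can be extracted from them for later use. The following is a sketch only, assuming the message wording is exactly as in the example above. The function name reloc_origin and its output format are hypothetical.

```shell
# reloc_origin
# Hypothetical helper: read a relocation mail message on standard input
# and print where the subdisk lived before it was relocated.
# Assumption: the two lines are worded as in the example message.
reloc_origin() {
    awk '
        $1 == "Attempting" && $4 == "subdisk" {
            sd = $5
            plex = $8
            sub(/\.$/, "", plex)        # drop the trailing period
        }
        $1 == "Dev_offset" {
            printf "subdisk=%s plex=%s disk=%s offset=%s length=%s\n",
                   sd, plex, $6, $2, $4
        }'
}

# Typical use, with the message saved in a file:
#   reloc_origin < saved_message
```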

For example, the following message indicates the new location for the relocated subdisk:

To: root
Subject: Attempting VxVM relocation on host teal
Volume home Subdisk disk02-03 relocated to disk05-01,
but not yet recovered.

Before you move any relocated subdisks, fix or replace the disk that failed (as described in “Replacing Disks”). Once this is done, you can move a relocated subdisk back to the original disk. For example, move the relocated subdisk disk05-01 back to disk02 using the following command:

# vxassist -g rootdg move home !disk05 disk02

NOTE: During subdisk move operations, RAID-5 volumes are not redundant.

Moving Hot-Relocated Subdisks

NOTE: You may need an additional license to use this feature.

After the disk that experienced the failure is fixed or replaced, vxunreloc can be used to move all the hot-relocated subdisks back to the disk. When a subdisk is hot-relocated, its original disk media (dm) name and its offset into the disk are saved in the subdisk record in the configuration database. When the subdisk is moved back to the original disk or to a new disk using vxunreloc, this information is erased. To print all of the subdisks that were hot-relocated from disk01 in the rootdg disk group, use the following command:

# vxprint -g rootdg -se 'sd_orig_dmname="disk01"'

To move all the subdisks that were hot-relocated from disk01 back to the original disk, type:

# vxunreloc -g rootdg disk01

The vxunreloc utility provides the -n option to move the subdisks to a different disk from where they were originally relocated. For example, when disk01 failed, all the subdisks that resided on it were hot-relocated to other disks. After the disk is repaired, it is added back to the disk group using a different name, for example, disk05. To move all the hot-relocated subdisks to the new disk, use the following command:

# vxunreloc -g rootdg -n disk05 disk01

The destination disk should have at least as much storage capacity as was in use on the original disk. If there is not enough space, the unrelocate operation fails and none of the subdisks are moved.

When the vxunreloc utility moves the hot-relocated subdisks, it moves them to the original offsets. However, if some subdisks occupy part or all of the area on the destination disk, the vxunreloc utility fails. If the vxunreloc utility fails, perform one of the following procedures:

  • move the existing subdisks somewhere else, and then re-run the vxunreloc utility

  • use the -f option provided by the vxunreloc utility to move the subdisks to the destination disk, but allow the vxunreloc utility to find the space on the disk.

As long as the destination disk is large enough so that the region of the disk for storing subdisks can accommodate all subdisks, all the hot-relocated subdisks are "unrelocated" without using the original offsets.

A subdisk that has been hot-relocated more than once due to multiple disk failures can still be unrelocated back to its original location. For example, if disk01 fails, a subdisk named disk01-01 is moved to disk02. If disk02 then experiences disk failure, all the subdisks residing on disk02, including the one that was hot-relocated to it, are moved again. When disk02 is replaced, an unrelocate operation for disk02 does not affect the hot-relocated subdisk disk01-01. However, if disk01 is replaced and the vxunreloc utility is run immediately after the replacement, disk01-01 is moved back to disk01.

Restart vxunreloc After Errors

Internally, the vxunreloc utility moves the subdisks in three phases. The first phase creates as many subdisks on the specified destination disk as there are subdisks to be unrelocated. When the subdisks are created, the vxunreloc utility fills in the comment field in the subdisk record with the string UNRELOC as an identification. The second phase moves the data. If all the subdisk moves are successful, the third phase cleans up the comment field of the subdisk records.

Creating the subdisks is an all-or-none operation. If the vxunreloc utility cannot create all the subdisks successfully, no subdisk is created and the vxunreloc utility exits. The subdisk move operation is not all-or-none. Each subdisk move is independent of the others. If one subdisk move fails, the vxunreloc utility prints an error message and exits, and all of the subsequent subdisks remain on the disk where they were hot-relocated and are not moved back. For subdisks that were already returned home, the comment field in their subdisk records is still marked as UNRELOC because the cleanup phase is never executed.

If the system goes down after the new subdisks are made on the destination, but before they are moved back, the unrelocate utility can be executed again after the system comes back. As described above, when a new subdisk is created, the vxunreloc utility sets the comment field of the subdisk as UNRELOC. When the vxunreloc utility is re-executed, it checks the offset, the len, and the comment fields of the subdisk on the destination disk to determine if it was left on the disk at a previous execution of the vxunreloc utility.

NOTE: Do not manually modify the string UNRELOC in the comment field.

If one out of a series of subdisk moves fails, the vxunreloc utility exits. Under this circumstance, you should check the error that caused the subdisk move to fail and determine if the unrelocation can proceed. When you re-execute the vxunreloc utility to resume the subdisk moves, it uses the subdisks created at a previous run.

The cleanup phase requires one transaction. The vxunreloc utility resets the comment field to a NULL string for all the subdisks marked UNRELOC that reside on the destination disk. This includes cleanup of subdisks that were unrelocated in a previous invocation of the vxunreloc utility that did not successfully complete.

Detecting Failed Disks

The Volume Manager hot-relocation feature automatically detects disk failures and notifies the system administrator of the failures by email. If hot-relocation is disabled or you miss the email, view disk failures using the vxprint command or by using the graphical user interface to view the status of the disks. Driver error messages are also displayed on the console or in the system messages file.

If a volume has a disk I/O failure (for example, because the disk has an uncorrectable error), the Volume Manager can detach the plex involved in the failure.

If a plex is detached, I/O stops on that plex but continues on the remaining plexes of the volume. If a disk fails completely, the Volume Manager can detach the disk from its disk group.

If a disk is detached, all plexes on the disk are disabled. If there are any unmirrored volumes on a disk when it is detached, those volumes are also disabled.

Partial Disk Failure

If hot-relocation is enabled when a plex or disk is detached by a failure, mail listing the failed objects is sent to root. If a partial disk failure occurs, the mail identifies the failed plexes. For example, if a disk containing mirrored volumes fails, you can receive mail information as shown in the following example:

To: root
Subject: Volume Manager failures on host teal
Failures have been detected by the VERITAS Volume Manager:
failed plexes:
home-02
src-02

See “Modifying the vxrelocd Process” for information on how to send the mail to users other than root.

To determine which disk is causing the failures in the above example, use the following command:

# vxstat -s -ff home-02 src-02

The following is a typical output display:

                      FAILED
TYP  NAME           READS   WRITES
sd   disk01-04          0        0
sd   disk01-06          0        0
sd   disk02-03          1        0
sd   disk02-04          1        0

This display indicates that the failures are on disk02 (and that subdisks disk02-03 and disk02-04 are affected).
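The same conclusion can be reached mechanically by filtering the vxstat output for subdisks with a non-zero failure count. The following is a minimal sketch, assuming the column layout shown above and the default subdisk naming convention (disk name followed by a hyphen and a number). The function name failing_disks is hypothetical.

```shell
# failing_disks
# Hypothetical helper: read "vxstat -s -ff ..." output on standard input
# and print each disk that has a subdisk with failed reads or writes.
# Assumption: subdisks are named <disk>-<number>, for example disk02-03.
failing_disks() {
    awk '$1 == "sd" && ($3 + $4) > 0 {
             disk = $2
             sub(/-[0-9]+$/, "", disk)      # disk02-03 becomes disk02
             if (!(disk in seen)) {         # report each disk once
                 seen[disk] = 1
                 print disk
             }
         }'
}

# Typical use on a Volume Manager host:
#   vxstat -s -ff home-02 src-02 | failing_disks
```

If subdisks have been renamed so that they no longer follow the default convention, map them to disks with vxprint instead.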

Hot-relocation automatically relocates the affected subdisks and initiates any necessary recovery procedures. However, if relocation is not possible or the hot-relocation feature is disabled, you have to investigate the problem and attempt to recover the plexes. These errors can be caused by cabling failures, so check the cables connecting your disks to your system. If there are obvious problems, correct them and recover the plexes with this command:

# vxrecover -b home src 

This command starts recovery of the failed plexes in the background (the command returns before the operation is done). If the disk has become detached, this command does not perform any recovery. If an error message appears later, or if the plexes become detached again and there are no obvious cabling failures, replace the disk (see “Replacing Disks”).

Complete Disk Failure

If a disk fails completely and hot-relocation is enabled, the email lists the disk that failed and all plexes that use the disk. The following is an example of an email:

To: root
Subject: Volume Manager failures on host teal
Failures have been detected by the VERITAS Volume Manager:
failed disks:
disk02
failed plexes:
home-02
src-02
mkting-01
failing disks:
disk02

This message shows that disk02 was detached by a failure. When a disk is detached, I/O cannot get to that disk. The plexes home-02, src-02, and mkting-01 were also detached (probably because of the failure of the disk).
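When many objects are involved, the lists in the message can be pulled out by section heading. The following is a sketch only, assuming the layout shown in the example above, where each list is introduced by a line ending in a colon. The function name mail_section is hypothetical.

```shell
# mail_section "SECTION NAME"
# Hypothetical helper: read a failure mail message on standard input and
# print the entries listed under the named section, one per line.
# Assumption: each section starts with a line of the form "name:".
mail_section() {
    awk -v want="$1:" '
        {
            line = $0
            gsub(/^[ \t]+|[ \t]+$/, "", line)   # trim surrounding blanks
        }
        line ~ /:$/ { in_section = (line == want); next }
        in_section && line != "" { print line }'
}

# Typical use, with the message saved in a file:
#   mail_section "failed disks"  < saved_message
#   mail_section "failed plexes" < saved_message
```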

Again, the problem can be a cabling error. If the problem is not a cabling error, replace the disk (see “Replacing Disks”).

Replacing Disks

Disks that have failed completely (that have been detached by failure) can be replaced by running the vxdiskadm utility and selecting item 4 (Replace a failed or removed disk) from the main menu. If any initialized but unadded disks are available, select one of those disks as a replacement.

NOTE: Do not choose the old disk drive as a replacement even though it appears in the selection list. If there are no suitable initialized disks, initialize a new disk.

If a disk failure caused a volume to be disabled, the volume must be restored from backup after the disk is replaced. To identify volumes that wholly reside on disks that were disabled by a disk failure, use the following command:

# vxinfo

Any volumes that are listed as Unstartable must be restored from backup.
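On a system with many volumes, the vxinfo listing can be filtered for the volumes that need to be restored. The following is a minimal sketch, assuming that each vxinfo output line gives the volume name in the first field and its condition in the last field. That layout is an assumption and should be checked against the output on your system. The function name unstartable_volumes is hypothetical.

```shell
# unstartable_volumes
# Hypothetical helper: read vxinfo output on standard input and print
# the names of the volumes reported as Unstartable.
# Assumption: volume name is the first field, condition is the last.
unstartable_volumes() {
    awk '$NF == "Unstartable" { print $1 }'
}

# Typical use on a Volume Manager host:
#   vxinfo | unstartable_volumes
```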

To restart the volume mkting so that it can be restored from backup, use the following command:

# vxvol -o bg -f start mkting 

The -o bg option combination resynchronizes plexes as a background task.

If failures are starting to occur on a disk, but the disk has not yet failed completely, replace the disk. To replace the disk, use the following procedure:

  1. Detach the disk from its disk group.

  2. Replace the disk with a new one.

To detach the disk, run the vxdiskadm utility and select item 3 (Remove a disk for replacement) from the main menu. If initialized disks are available as replacements, specify the disk as part of this operation. Otherwise, specify the replacement disk later by selecting item 4 (Replace a failed or removed disk) from the main menu.

When you select a disk to remove for replacement, all volumes that can be affected by the operation are displayed. The following is an example display:

The following volumes will lose mirrors as a result of this operation:
home src
No data on these volumes will be lost.
The following volumes are in use, and will be disabled as a result of this operation:
mkting
Any applications using these volumes will fail future accesses. These volumes will require restoration from backup.
Are you sure you want do this? [y,n,q,?] (default: n)

If any volumes are likely to be disabled, quit from the vxdiskadm utility and save the volume. Either back up the volume or move the volume off of the disk. For example, to move the volume mkting to a disk other than disk02, use the following command:

# vxassist move mkting !disk02

After the volume is backed up or moved, run the vxdiskadm utility again and continue to remove the disk for replacement.

After the disk has been removed for replacement, a replacement disk can be specified by selecting item 4 (Replace a failed or removed disk) from the vxdiskadm main menu.

© 1983-2000 Hewlett-Packard Development Company, L.P.