VERITAS Volume Manager 3.1 Administrator's Guide: for HP-UX 11i and HP-UX 11i Version 1.5 > Chapter 8 Recovery
Detecting and Replacing Failed Disks
This section describes how to detect disk failures and replace failed disks. It begins with the hot-relocation feature, which automatically attempts to restore redundant Volume Manager objects when a failure occurs.
Hot-relocation automatically reacts to I/O failures on redundant (mirrored or RAID-5) Volume Manager objects and restores redundancy and access to those objects. The Volume Manager detects I/O failures on objects and relocates the affected subdisks to disks designated as spare disks and/or free space within the disk group. Volume Manager then reconstructs the objects that existed before the failure and makes them redundant and accessible again. See “Hot-Relocation” for more information.
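The choice between spare disks and free space can be pictured with a small model. As this section goes on to state, space on spare disks is tried first, and free space in the same disk group is used only when no spare can take the subdisk. The following Python sketch illustrates that ordering only; it is not how vxrelocd is implemented. The Disk class, the pick_relocation_disk function, the block counts, and the excluded set (a stand-in for the redundancy constraints) are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class Disk:
    name: str
    free: int            # free space, in blocks (hypothetical unit)
    spare: bool = False  # True if the disk is designated as a hot-relocation spare


def pick_relocation_disk(disks: List[Disk], needed: int,
                         excluded: FrozenSet[str] = frozenset()) -> Optional[Disk]:
    """Return a disk in the disk group that can hold `needed` blocks.

    Spare disks are preferred; ordinary free space is used only when no
    spare qualifies. `excluded` names disks that must not be used
    (an assumed stand-in for the redundancy constraints).
    """
    usable = [d for d in disks if d.name not in excluded and d.free >= needed]
    # list.sort() is stable, so spares (key False) move ahead of non-spares
    # while the original order is kept within each group.
    usable.sort(key=lambda d: not d.spare)
    return usable[0] if usable else None


disks = [Disk("disk03", 4000), Disk("disk02", 2000, spare=True), Disk("disk04", 500)]
choice = pick_relocation_disk(disks, 1500)
print(choice.name if choice else "relocation not possible")
```

When the function returns None, the model corresponds to the case in which the subdisk is not relocated and the administrator is only notified.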
Hot-relocation is enabled by default and goes into effect without system administrator intervention when a failure occurs. The vxrelocd hot-relocation daemon detects and reacts to Volume Manager events that signify the following types of failures:
When such a failure is detected, the vxrelocd daemon informs the system administrator by electronic mail of the failure and of which Volume Manager objects are affected. The vxrelocd daemon then determines which subdisks (if any) can be relocated. If relocation is possible, the vxrelocd daemon finds suitable relocation space and relocates the subdisks.
Hot-relocation space is chosen from the disks reserved for hot-relocation in the disk group where the failure occurred. If no spare disks are available or additional space is needed, free space in the same disk group is used.
Once the subdisks are relocated, each relocated subdisk is reattached to its plex. Finally, the vxrelocd daemon initiates appropriate recovery procedures, such as mirror resynchronization for mirrored volumes or data recovery for RAID-5 volumes. The system administrator is notified of the hot-relocation and recovery actions taken.
If relocation is not possible, the system administrator is notified and no further action is taken. Relocation is not possible in the following cases:
You can prepare for hot-relocation by designating one or more disks per disk group as hot-relocation spares. For information on how to designate a disk as a spare, see “Placing Disks Under Volume Manager Control”. If no spares are available at the time of a failure, or if there is not enough space on the spares, free space is used automatically. Designating spare disks gives you control over which space is used for relocation in the event of a failure. If the combined free space and space on spare disks is not sufficient or does not meet the redundancy constraints, the subdisks are not relocated.
After a successful relocation, you need to remove and replace the failed disk (see “Replacing Disks”). Depending on where the relocated subdisks were placed, you can choose to move them elsewhere after hot-relocation occurs (see “Moving Relocated Subdisks”).
Hot-relocation is turned on as long as the vxrelocd process is running. As a rule, leave hot-relocation turned on so that it can act if a failure occurs. If you decide to disable it, for example because you do not want the free space on some of your disks used for relocation, you must prevent the vxrelocd process from starting at system startup time. You can stop hot-relocation at any time by killing the vxrelocd process, but do not do this while a hot-relocation attempt is in progress.
You can make some minor changes to the way the vxrelocd process behaves either by editing the vxrelocd line in the startup file that invokes the vxrelocd process (/sbin/rc2.d/S095vxvm-recover) or by killing the existing vxrelocd process and restarting it with different options. After changing the way the vxrelocd process is invoked in the startup file, reboot the system so that the changes go into effect. If you choose to kill and restart the daemon instead, make sure that hot-relocation is not in progress when you kill the vxrelocd process, and restart the daemon immediately so that hot-relocation can take effect if a failure occurs. You can alter the vxrelocd process in the following ways:
Use the vxdg spare command to display information about all of the spare disks available for relocation. The output displays the following information:
In this example, disk02 is the only disk designated as a spare. The LENGTH field indicates how much spare space is currently available on this disk for relocation. To display information about disks that are currently designated as spares, use the following commands:
When hot-relocation occurs, subdisks are relocated to spare disks and/or available free space within the disk group. The new subdisk locations may not provide the same performance or data layout that existed before hot-relocation took place. To improve performance, you can move the relocated subdisks after hot-relocation is complete. You can also move the relocated subdisks off the spare disk(s) to keep the spare disk space free for future hot-relocation needs. Another reason for moving subdisks is to recreate the configuration that existed before hot-relocation occurred.
During hot-relocation, an email message is sent to root, as shown in the following example:
This message contains information about the subdisk before relocation that can be used to decide where to move the subdisk after relocation. For example, the following message indicates the new location for the relocated subdisk:
Before you move any relocated subdisks, fix or replace the disk that failed (as described in “Replacing Disks”). Once this is done, you can move a relocated subdisk back to the original disk. For example, move the relocated subdisk disk05-01 back to disk02 using the following command:
After the disk that experienced the failure is fixed or replaced, vxunreloc can be used to move all the hot-relocated subdisks back to the disk. When a subdisk is hot-relocated, its original disk media (dm) name and its original offset into the disk are saved in the subdisk records in the configuration database. When the subdisk is moved back to the original disk or to a new disk using vxunreloc, this information is erased. To print all of the subdisks that were hot-relocated from disk01 in the rootdg disk group, use the following command:
To move all the subdisks that were hot-relocated from disk01 back to the original disk, type:
The vxunreloc utility provides the -n option to move the subdisks to a different disk from where they were originally relocated. For example, when disk01 failed, all the subdisks that resided on it were hot-relocated to other disks. After the disk is repaired, it is added back to the disk group using a different name, for example, disk05. To move all the hot-relocated subdisks to the new disk, use the following command:
The destination disk should have at least as much storage capacity as was in use on the original disk. If there is not enough space, the unrelocate operation fails and none of the subdisks are moved. When the vxunreloc utility moves the hot-relocated subdisks, it moves them to the original offsets. However, if some subdisks occupy part or all of the area on the destination disk, the vxunreloc utility fails. If the vxunreloc utility fails, perform one of the following procedures:
As long as the destination disk is large enough so that the region of the disk for storing subdisks can accommodate all subdisks, all the hot-relocated subdisks are "unrelocated" without using the original offsets.
A subdisk that has been hot-relocated more than once due to multiple disk failures can still be unrelocated back to its original location. For example, if disk01 fails, a subdisk named disk01-01 is moved to disk02. If disk02 then experiences a disk failure, all the subdisks residing on disk02, including the one that was hot-relocated to it, are moved again. When disk02 is replaced, an unrelocate operation for disk02 does not affect the hot-relocated subdisk disk01-01. However, replacing disk01 and then running the vxunreloc utility immediately after the replacement moves disk01-01 back to disk01.
Internally, the vxunreloc utility moves the subdisks in three phases. The first phase creates as many subdisks on the specified destination disk as there are subdisks to be unrelocated. When the subdisks are created, the vxunreloc utility fills in the comment field in the subdisk record with the string UNRELOC as an identification. The second phase moves the data. If all the subdisk moves are successful, the third phase cleans up the comment field of the subdisk records.
Creating the subdisks is an all-or-none operation. If the vxunreloc utility cannot create all the subdisks successfully, no subdisk is created and the vxunreloc utility exits. The subdisk move operation is not all-or-none. Each subdisk move is independent of the others, so if one subdisk move fails, the vxunreloc utility prints an error message and then exits. All of the subsequent subdisks then remain on the disk where they were hot-relocated and are not moved back. For subdisks that were already moved back, the comment field in their subdisk records is still marked as UNRELOC because the cleanup phase is never executed.
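The three phases and their different failure behavior can be illustrated with a short model. The following Python sketch is an assumption-laden illustration of the description above, not the vxunreloc source: the Subdisk class, the MoveError exception, the unrelocate function, the temporary subdisk names, and the capacity check used to decide whether creation can succeed are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Subdisk:
    name: str
    disk: str
    offset: int
    length: int
    comment: str = ""


class MoveError(Exception):
    """Raised by the caller-supplied move function when a data move fails."""


def unrelocate(relocated: List[Tuple[Subdisk, int]], dest: str, dest_size: int,
               move: Callable[[Subdisk, Subdisk], None]) -> List[Subdisk]:
    """Model the three phases for (subdisk, original_offset) pairs."""
    # Phase 1: create the destination subdisks, all-or-none.
    # A simple size check stands in for "cannot create all the subdisks".
    if any(orig_off + sd.length > dest_size for sd, orig_off in relocated):
        print("cannot create all subdisks; none created")
        return []
    created = [Subdisk(f"{dest}-tmp{i}", dest, orig_off, sd.length, "UNRELOC")
               for i, (sd, orig_off) in enumerate(relocated)]

    # Phase 2: move the data. Moves are independent, but the first
    # failure ends the run, so later subdisks stay where they are.
    for (sd, _), target in zip(relocated, created):
        try:
            move(sd, target)
        except MoveError as err:
            print(f"move of {sd.name} failed: {err}")
            return created  # cleanup is skipped; comments remain UNRELOC

    # Phase 3: cleanup, reached only if every move succeeded.
    for target in created:
        target.comment = ""
    return created
```

In this model a failed move leaves every created subdisk marked UNRELOC, including those whose data was already moved, which matches the behavior described for an interrupted run.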
If the system goes down after the new subdisks are made on the destination, but before they are moved back, the vxunreloc utility can be executed again after the system comes back up. As described above, when a new subdisk is created, the vxunreloc utility sets the comment field of the subdisk to UNRELOC. When the vxunreloc utility is re-executed, it checks the offset, the length (len), and the comment fields of the subdisk on the destination disk to determine whether it was left on the disk by a previous execution of the vxunreloc utility.
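The check made on re-execution can be sketched as a comparison of three fields. The following Python fragment is only an illustration of that comparison under stated assumptions: the find_leftover function and the dictionary layout of a subdisk record (keys offset, len, and comment) are hypothetical and do not reflect the actual record format.

```python
from typing import Dict, List, Optional


def find_leftover(dest_subdisks: List[Dict], offset: int, length: int) -> Optional[Dict]:
    """Return a subdisk left on the destination disk by an earlier run, if any.

    A record counts as a leftover only when its offset and length match the
    subdisk about to be unrelocated and its comment field is UNRELOC.
    """
    for sd in dest_subdisks:
        if (sd["offset"] == offset and sd["len"] == length
                and sd["comment"] == "UNRELOC"):
            return sd
    return None


# Hypothetical records found on the destination disk after a restart.
on_dest = [
    {"name": "disk01-01", "offset": 0, "len": 2048, "comment": "UNRELOC"},
    {"name": "disk01-02", "offset": 2048, "len": 1024, "comment": ""},
]
print(find_leftover(on_dest, 0, 2048))
```

A subdisk with a matching offset and length but an empty comment is not treated as a leftover here, since nothing marks it as having been created by the utility.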
If one out of a series of subdisk moves fails, the vxunreloc utility exits. In this case, check the error that caused the subdisk move to fail and determine whether the unrelocation can proceed. When you re-execute the vxunreloc utility to resume the subdisk moves, it uses the subdisks created during the previous run.
The cleanup phase requires one transaction. The vxunreloc utility resets the comment field to a NULL string for all the subdisks marked UNRELOC that reside on the destination disk. This includes cleanup of subdisks that were unrelocated in a previous invocation of the vxunreloc utility that did not complete successfully.
The Volume Manager hot-relocation feature automatically detects disk failures and notifies the system administrator of the failures by email. If hot-relocation is disabled or you miss the email, you can see disk failures by using the vxprint command or by viewing the status of the disks in the graphical user interface. Driver error messages are also displayed on the console or in the system messages file.
If a volume has a disk I/O failure (for example, because the disk has an uncorrectable error), the Volume Manager can detach the plex involved in the failure. If a plex is detached, I/O stops on that plex but continues on the remaining plexes of the volume.
If a disk fails completely, the Volume Manager can detach the disk from its disk group. If a disk is detached, all plexes on the disk are disabled. If there are any unmirrored volumes on a disk when it is detached, those volumes are also disabled.
If hot-relocation is enabled when a plex or disk is detached by a failure, mail listing the failed objects is sent to root. If a partial disk failure occurs, the mail identifies the failed plexes. For example, if a disk containing mirrored volumes fails, you can receive mail as shown in the following example:
See “Modifying the vxrelocd Process” for information on how to send the mail to users other than root. To determine which disk is causing the failures in the above example, use the following command:
The following is a typical output display:
This display indicates that the failures are on disk02 (and that subdisks disk02-03 and disk02-04 are affected). Hot-relocation automatically relocates the affected subdisks and initiates any necessary recovery procedures. However, if relocation is not possible or the hot-relocation feature is disabled, you have to investigate the problem and attempt to recover the plexes. These errors can be caused by cabling failures, so check the cables connecting your disks to your system. If there are obvious problems, correct them and recover the plexes with this command:
This command starts recovery of the failed plexes in the background (the command returns before the operation is done). If the disk has become detached, this command does not perform any recovery. If an error message appears later, or if the plexes become detached again and there are no obvious cabling failures, replace the disk (see “Replacing Disks”). If a disk fails completely and hot-relocation is enabled, the email lists the disk that failed and all plexes that use the disk. The following is an example of an email:
This message shows that disk02 was detached by a failure. When a disk is detached, I/O cannot reach that disk. The plexes home-02, src-02, and mkting-01 were also detached (probably because of the failure of the disk). Again, the problem can be a cabling error. If the problem is not a cabling error, replace the disk (see “Replacing Disks”).
Disks that have failed completely (that have been detached by failure) can be replaced by running the vxdiskadm utility and selecting item 4 (Replace a failed or removed disk) from the main menu. If any initialized but unadded disks are available, select one of those disks as a replacement.
If a disk failure caused a volume to be disabled, the volume must be restored from backup after the disk is replaced. To identify volumes that wholly reside on disks that were disabled by a disk failure, use the following command:
Any volumes that are listed as Unstartable must be restored from backup. To restart the volume mkting so that it can be restored from backup, use the following command:
The -o bg option combination resynchronizes plexes as a background task. If failures are starting to occur on a disk, but the disk has not yet failed completely, replace the disk. To replace the disk, use the following procedure:
To detach the disk, run the vxdiskadm utility and select item 3 (Remove a disk for replacement) from the main menu. If initialized disks are available as replacements, specify the disk as part of this operation. Otherwise, specify the replacement disk later by selecting item 4 (Replace a failed or removed disk) from the main menu. When you select a disk to remove for replacement, all volumes that can be affected by the operation are displayed. The following is an example display:
If any volumes are likely to be disabled, quit the vxdiskadm utility and preserve the volume first: either back up the volume or move it off the disk. For example, to move the volume mkting to a disk other than disk02, use the following command:
After the volume is backed up or moved, run the vxdiskadm utility again and continue to remove the disk for replacement. After the disk has been removed for replacement, a replacement disk can be specified by selecting item 4 (Replace a failed or removed disk) from the vxdiskadm main menu.