The tolerance level for failed disks is set with disk_failure_limit on page 811 in the configuration file. This parameter defaults to 20 and defines the maximum percentage of disk capacity per node that may fail before node failover is triggered. For example, if a node has 8 local disks (disk0 through disk7) and you set the threshold to 26, up to two disks may fail without triggering node failure; however, if a third disk fails, full node failover will occur. On a node with 8 disks and the default limit of 20, the node will survive the failure of one disk, but a second disk failure will trigger node failover. Setting this limit to 0 causes a default mode of node failover whenever any one disk fails on any compute node.

Note: If a disk fails and the remaining disks do not contain enough free space to accommodate re-replication, Matrix triggers the node failover process.

Tolerating a high percentage of disk failures on a node is not recommended. As the number of failed disks increases, the performance of the entire cluster degrades, and the time required for re-replication lengthens. Also, the failure of multiple disks might indicate a serious problem with the compute node; therefore, it is recommended to swap in the standby node and determine the root cause of the failures. See Disabling a failed compute node manually on page 216.

Note: Setting the tolerance level for failed disks is not required on clusters with a SAN, and Matrix ignores the disk_failure_limit setting. In practice, these systems can continue to operate even if all of the local disks have failed. Nonetheless, operating Matrix under such extreme conditions is not recommended; for optimal performance, always replace failed disks as soon as possible.

Mirroring temporary data

If mirror_temp_data on page 840 is set to OFF and a disk failure occurs, the database continues to operate without interruption unless some temporary data was lost. If temporary data was lost, the database automatically restarts: all running queries are cancelled and query progress is lost.
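As a quick check of the disk_failure_limit arithmetic described above, the following sketch restates the percentage rule as integer arithmetic. It is an illustration only, not Matrix code: the function name tolerated_disk_failures is hypothetical, and it assumes all local disks on the node have equal capacity, so that a percentage of capacity equals a percentage of disks.

```shell
#!/bin/sh
# Sketch only: restates the percentage rule from the text; not part of Matrix.

# tolerated_disk_failures DISKS LIMIT
# Prints how many disks on a node may fail without triggering node failover,
# assuming equal-size disks and failover once the failed share of disks
# exceeds LIMIT percent (hypothetical helper).
tolerated_disk_failures() {
    # Integer division rounds down, which matches the examples in the text.
    echo $(( $1 * $2 / 100 ))
}

tolerated_disk_failures 8 26   # prints 2: a third failure triggers failover
tolerated_disk_failures 8 20   # prints 1: a second failure triggers failover
```

With 8 disks, each disk accounts for 12.5 percent, so a limit of 26 covers two failed disks (25 percent) but not three (37.5 percent).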
If you want to ensure that the database runs without interruption when a disk fails, set mirror_temp_data to ON. Note that setting mirror_temp_data to ON has a performance cost when writing temporary data to disk.

Disabling a disk manually

If performance for a disk has degraded or you want to prepare for a future disk replacement, you can disable a disk manually. Use the DISABLE LUN on page 368 command to disable a disk. For information about replacing a disk, see Recovering a failed disk on page 235.

Recovering a failed disk

Follow the instructions in this section to replace a failed disk and configure the new disk by recreating the disk storage partitions, as shown in the image below. The first partition, the raw partition, is for data storage. The second partition, the file system partition, is used for loading and unloading data (COPY and UNLOAD commands). The third partition hosts the operating system. The fourth partition, the swap partition, hosts virtual memory for data that does not fit into physical memory (RAM).

1. Using the RAID controller command line tool for your server type, replace the failed disk with a working disk. The steps below include the commands to use for HP and LSI MegaRAID SAS controllers. For more information about the commands, see the documentation for your server. Perform these steps as the root user.

a) Determine which disk failed, along with its corresponding logic array and physical disk.

   HP controller
   Use hpacucli as the root user to run the following commands:
   # ctrl slot=controller_slot_number ld logical_disk_number show
   Review the results to make sure that the disk name returned is the same one that Matrix identified.
   # ctrl slot=controller_slot_number pd all show
   In the results, match up the array name with the array name from the first command. For example, 1I:1:2 indicates port 1I, box 1, bay 2.

   LSI MegaRAID SAS controller
   Run the following commands:
   # /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aALL
   # /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
b) Light up the LED on the disk that needs to be replaced.

   HP controller
   Use hpacucli as the root user to run the following command:
   # ctrl slot=controller_slot_number array array_name modify led=on

   LSI MegaRAID SAS controller
   Run the following command. This command might be optional for some controller drivers or enclosures:
   # /opt/MegaRAID/MegaCli/MegaCli64 -AdpSetProp UseDiskActivityforLocate -1 -aALL
   Run the following command:
   # /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -start -physdrv[enclosure_device_id:slot_number] -aALL

c) As the root user, run the following commands to unmount all OS references to the failed disk.

   HP controller
   Unmount the file system on the failed disk. For example:
   # umount /dev/cciss/c1d1p2
   Disable the swap partition for the failed disk. For example:
   # swapoff /dev/cciss/c1d1p4
   Fail the OS partition on the failed disk. For example:
   # mdadm --manage /dev/md0 --fail /dev/cciss/c1d1p3
   Remove the OS partition on the failed disk. For example:
   # mdadm --manage /dev/md0 --remove /dev/cciss/c1d1p3

   LSI MegaRAID SAS controller
   Unmount the file system on the failed disk. For example:
   # umount /dev/sdc2
   Disable the swap partition for the failed disk. For example:
   # swapoff /dev/sdc4

Administrator's Guide and SQL Reference

   Fail the OS partition on the failed disk. For example:
   # mdadm --manage /dev/md0 --fail /dev/sdc3
   Remove the OS partition on the failed disk.
   For example:
   # mdadm --manage /dev/md0 --remove /dev/sdc3

d) Remove the logic array from the RAID controller.

   HP controller
   Use hpacucli as the root user to run the following command:
   # ctrl slot=controller_slot_number ld logical_disk_number delete forced

   LSI MegaRAID SAS controller
   If the drive is online, run the following command to set it offline:
   # /opt/MegaRAID/MegaCli/MegaCli64 -PDOffline -PhysDrv[enclosure_device_id:slot_number] -acontroller_slot_number
   Run the following command to mark the drive as missing:
   # /opt/MegaRAID/MegaCli/MegaCli64 -PDMarkMissing -PhysDrv[enclosure_device_id:slot_number] -acontroller_slot_number
   Run the following command to prepare the drive for removal:
   # /opt/MegaRAID/MegaCli/MegaCli64 -PDPrpRmv -PhysDrv[enclosure_device_id:slot_number] -acontroller_slot_number

e) Physically replace the failed disk with a working disk.

f) Add the logic array back to the RAID controller.

   HP controller
   Use hpacucli as the root user to run the following command:
   # ctrl slot=controller_slot_number create type=ld drives=1I:1:2 raid=0 stripsize=256

   LSI MegaRAID SAS controller
   Run the following command to clear any foreign configuration from the new disk:
   # /opt/MegaRAID/MegaCli/MegaCli64 -CfgForeign -Clear -aALL
   Run the following commands to clear the preserved cache for the original virtual disk. These commands might be optional:
   # /opt/MegaRAID/MegaCli/MegaCli64 -GetPreservedCacheList -aALL
   # /opt/MegaRAID/MegaCli/MegaCli64 -DiscardPreservedCache -Lall -force -aALL
   Run the following command to add the single-disk RAID0 volume to the new disk:
   # /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0[enclosure_device_id:slot_number] WB ADRA Direct -strpsz256 -acontroller_slot_number

The newly added disk should now be accessible and have an empty partition table.

2. As the root user, run the following commands to copy the partition table from a working disk to the new disk.
The partition table contains information about the sizes and locations of the partitions on the disk.

Note: The first disk on each node has a slightly different partition table than the partition tables on the other disks. Therefore, do not use the partition table from the first disk as a template. For the same reason, when recovering the partition table on the first disk, do not use the partition table from another disk on the same node; use the partition table from the first disk on another compute node.

   # sfdisk -d working_disk > /tmp/part.out
   # sfdisk --force new_disk < /tmp/part.out

For example, to copy the partition table from /dev/sde to the new disk /dev/sdf, run the following commands:

   # sfdisk -d /dev/sde > /tmp/part.out
   # sfdisk --force /dev/sdf < /tmp/part.out
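Because picking the wrong template disk is easy to do under pressure, the partition table copy in step 2 can be rehearsed first. The following is a minimal dry-run sketch: it only prints the two sfdisk commands after checking the rules from the note, and never touches a disk. The function name plan_partition_copy and its third argument (the node's first disk, shown here as /dev/sda purely as an assumption) are hypothetical and not part of the product.

```shell
#!/bin/sh
# Sketch only: prints the sfdisk commands from step 2 after basic sanity
# checks. It does not run sfdisk or modify any disk.

# plan_partition_copy TEMPLATE_DISK NEW_DISK FIRST_DISK
# FIRST_DISK is the node's first disk (hypothetical parameter; identify it
# for your own node). Its partition table differs from the other disks.
plan_partition_copy() {
    template=$1
    new=$2
    first=$3
    if [ "$template" = "$new" ]; then
        echo "error: template and new disk are the same device" >&2
        return 1
    fi
    if [ "$template" = "$first" ]; then
        echo "error: do not use the first disk as a template" >&2
        return 1
    fi
    if [ "$new" = "$first" ]; then
        echo "error: recover the first disk from the first disk of another node" >&2
        return 1
    fi
    # Same commands as in the text, with the device names filled in.
    echo "sfdisk -d $template > /tmp/part.out"
    echo "sfdisk --force $new < /tmp/part.out"
}

plan_partition_copy /dev/sde /dev/sdf /dev/sda
```

Review the printed commands against the disk names reported by the RAID controller before running them as the root user.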
