Bare Metal Restore
Introduction
This document describes the procedure to perform a bare metal restore of a Sun Oracle Database Machine database server
(Sun Fire X4170) or an HP Oracle Database Machine database server (HP DL360 G5).
Scope
This document describes the procedure to rebuild a compute node that was determined to have been irretrievably
damaged and replace it with a new, unconfigured compute node (Bare Metal) that must be re-imaged to the
proper specifications. At the end of the procedure, the Bare Metal compute node is synchronized with the
surviving members of the cluster with respect to the Exadata and Oracle stack components.
This document does not include the diagnostics to determine that hardware has failed, the procedures to mechanically replace the failed hardware, the replacement of any customer scripting, cron jobs, maintenance actions, or
non-Oracle software that may have been placed on the compute nodes of the cluster. After the Bare Metal Restore Procedure is complete, the customer is responsible for restoring scripting, cron jobs, maintenance actions,
and non-Oracle software.
Impact
During the Bare Metal Restore Procedure, the databases running on the database machine remain available.
When the failed database server is added back into the cluster, the software is copied from one of the surviving
database servers to the replacement database server over the TCP/IP network. Apart from this copy, the system resources on the surviving database servers are not significantly impacted.
Conventions
The examples in this document use the standard naming and deployment conventions of other Oracle documentation. If the environment in which the Bare Metal Restore Procedure is to be performed uses other naming and
location conventions, then you must modify the commands presented in this document to fit the environment.
The host dm01db01 refers to the replacement database server. Host dm01db02 refers to the surviving database server. In this example, there are only two nodes in the cluster (this is a Quarter Rack configuration).
[root@replacement] indicates that the command should be run as root on the replacement database server, while
[oracle@surviving] indicates that the command should be run on a surviving database server while logged in as the oracle user.
The environment variable ORACLE_HOME is set to the directory where the database software was installed previously.
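For example, with the default deployment location used later in this document (adjust the path to match the actual installation):

[oracle@surviving]$ export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1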
Default username and passwords used during the installation of the database machine are used throughout the
example syntax.
As part of the re-imaging process, a blank 2 GB USB flash drive is needed; the Image Maker writes the bootable
imaging software onto it.
The repair phase is performed by an HP or Oracle engineer or, in the case of a customer-replaceable unit (CRU), by you.
The reconfiguration steps are performed after the failed database server is repaired.
Table 1 shows time estimates for performing each step in the pre-repair and reconfiguration phases. Each step
links to the section that provides more information.
TABLE 1. TIME ESTIMATES FOR THE PRE-REPAIR AND RECONFIGURATION PHASES

STEP                                                               TIME (MINUTES)

PRE-REPAIR STEPS
…                                                                  15
…                                                                  30 to 60
…                                                                  30

RECONFIGURATION STEPS
…                                                                  30 to 60
…                                                                  30
… cluster                                                          60
Apply Exadata patch bundles to replacement database server         30 to 60
Clone Oracle Grid Infrastructure to replacement database server    30 to 60
Clone Oracle Database home to replacement database server          30 to 60
Remove the Failed Database Server from the Cluster
Stop the VIP resource for the failed database server, and then remove it:
[root@surviving]# srvctl stop vip -i dm01db01-vip
PRCC-1016 : dm01db01-vip.acme.com was already stopped
[root@surviving]# srvctl remove vip -i dm01db01-vip
Please confirm that you intend to remove the VIPs dm01db01-vip (y/[n]) y
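As an optional check that the VIP resource is gone, the following can be run; once removed, srvctl reports that the VIP does not exist:

[root@surviving]# srvctl config vip -n dm01db01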
6. Identify the original image so that the correct Linux kernel is installed. The imageinfo command reports the
latest patch applied; re-imaging the node with that version may install a newer Linux kernel that differs from
the kernel used on the other nodes of the cluster. Master Note 888828.1 includes a matrix of the available
images and their Linux kernel versions (see the sketch after these steps).
7. Apply the latest convenience package included in the latest image applied to the storage cells.
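For reference, the installed image and kernel can be checked on a surviving database server; a minimal sketch (the exact output fields vary by image version, but they include the image version and kernel version):

[root@surviving]# imageinfo | grep -i version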
Note: To make sure dualboot is forced to no, modify the makeImageMedia.sh file. Search for a line
like dualboot= and set it to no, like this:
dualboot=no
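One way to make this change non-interactively (a sketch; verify the result before building the image):

# sed -i 's/^dualboot=.*/dualboot=no/' makeImageMedia.sh
# grep '^dualboot=' makeImageMedia.sh
dualboot=no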
a. # cd dl360
b. # ./makeImageMedia.sh
Configure the service processor. This must include the IP address, subnet mask, gateway, NTP server, and
time zone.
Configure the BIOS boot order. This should be the RAID controller first and the USB flash drive next; if the
system is unable to boot from the RAID controller, the BIOS falls through to the USB flash drive.
1. Insert the USB flash drive prepared in the previous step into the USB port on the replacement database
server.
2. Log in to the console through the service processor or via the KVM to monitor the progress.
3. Power on the database server using either the service processor interface or by physically pressing the
server power button.
4. The system boots and should detect the CELLUSBINSTALL media. Allow the system to boot.
a. The first phase of the imaging process identifies any BIOS or firmware that is out of date and
upgrades the components to the level expected by the ImageMaker. If any components must be
upgraded (or downgraded), then the system is rebooted automatically.
b. The second phase of the imaging process installs the factory image onto the replacement database server. At the end of the imaging process, a message requests that you remove the USB
flash drive from the server and then press Enter to power off the server.
5. Remove the USB flash drive from the replacement database server.
6. Press Enter to power off the server.
Name servers
NTP servers
On the replacement node, use the groupadd command to add the group information:
[root@replacement]# groupadd -g 1001 oinstall
[root@replacement]# groupadd -g 1002 dba
[root@replacement]# groupadd -g 1003 oper
[root@replacement]# groupadd -g 1004 asmdba
b) Add the user (or users) for the Oracle environment (typically, this is oracle).
On the surviving node, obtain the current user information:
[root@surviving]# id oracle
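The user is then created on the replacement node with the same uid and group memberships that id reports on the surviving node; a sketch, assuming uid 1000 (substitute the actual value from the id output):

[root@replacement]# useradd -u 1000 -g oinstall -G dba,oper,asmdba -m oracle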
c) Set the password for the Oracle software owner (typically the oracle user, as configured during deployment):
[root@replacement]# passwd oracle
Changing password for user oracle.
New UNIX password:
Retype new UNIX password:
passwd: all authentication tokens updated successfully.
d) Create the ORACLE_BASE and Grid Infrastructure directories such as /u01/app/oracle and
/u01/app/11.2.0/grid, as follows:
[root@replacement]# mkdir -p /u01/app/oracle
[root@replacement]# mkdir -p /u01/app/11.2.0/grid
[root@replacement]# chown -R oracle:oinstall /u01/app
e) Change the ownership of the cellip.ora and cellinit.ora files. This is typically oracle:dba:
[root@replacement]# chown -R oracle:dba /etc/oracle/cell/networkconfig
ii. Create a dcli group file listing the nodes in the Oracle cluster (see the sketch after these steps).
iii. Run the setup SSH script (this assumes the oracle password on all servers in the dbs_group list
is set to welcome):
[oracle@replacement]$ /opt/oracle.SupportTools/onecommand/setssh.sh -s -u oracle -p welcome -n N -h dbs_group
.........................
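As a minimal sketch of the dcli group file from step ii, assuming the two-node example used throughout this document, the file is plain text with one host per line:

[oracle@replacement]$ cat > dbs_group <<EOF
dm01db01
dm01db02
EOF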
At the end of the report, you should see the text Post-check for hardware and operating
system setup was successful.
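This report comes from the Cluster Verification Utility; a sketch of a typical invocation, assuming the Grid home location used in this document:

[oracle@surviving]$ /u01/app/11.2.0/grid/bin/cluvfy stage -post hwos -n dm01db01,dm01db02 -verbose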
2. Compare the replacement database server with a surviving database server. Physical memory and swap
space may be reported as mismatched, similar to the following:

Status                    Ref. node status          Comment
------------------------  ------------------------  ----------
31.02GB (3.2527572E7KB)   29.26GB (3.0681252E7KB)   mismatched

Status                    Ref. node status          Comment
------------------------  ------------------------  ----------
55.52GB (5.8217472E7KB)   51.82GB (5.4340608E7KB)   mismatched
If the only components that failed are related to physical memory, swap space, and disk space, then it is
safe to continue.
If the only component that fails is related to swap space, then it is safe to continue.
4. This initiates the OUI to copy the clusterware software to the replacement database server.
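As a sketch, extending Oracle Grid Infrastructure to a new node in 11.2 typically uses addNode.sh from a surviving node's Grid home; the node and VIP names below follow this document's conventions:

[oracle@surviving]$ cd /u01/app/11.2.0/grid/oui/bin
[oracle@surviving]$ ./addNode.sh -silent "CLUSTER_NEW_NODES={dm01db01}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={dm01db01-vip}"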
WARNING: A new inventory has been created on one or more nodes in this session.
However, it has not yet been registered as the central inventory of this system.
To register the new inventory please run the script at
'/u01/app/oraInventory/orainstRoot.sh' with root privileges on nodes 'dm01db01'.
If you do not register the inventory, you may not be able to update or patch the
products you installed.
The following configuration scripts need to be executed as the "root" user in
each cluster node:
/u01/app/oraInventory/orainstRoot.sh #On nodes dm01db01
/u01/app/11.2.0/grid/root.sh #On nodes dm01db01
5. Run the orainstRoot.sh and root.sh scripts for the replacement database server:
[root@replacement]# /u01/app/oraInventory/orainstRoot.sh
Creating the Oracle inventory pointer file (/etc/oraInst.loc)
Changing permissions of /u01/app/oraInventory.
Adding read,write permissions for group.
Removing read,write,execute permissions for world.
Changing groupname of /u01/app/oraInventory to oinstall.
The execution of the script is complete.
[root@replacement]# /u01/app/11.2.0/grid/root.sh
Check /u01/app/11.2.0/grid/install/root_dm01db01.acme.com_2010-03-10_17-59-15.log for the output of the root script
The output file created above will report that the LISTENER resource on the
replaced database server failed to start. This is the expected output.
PRCR-1013 : Failed to start resource ora.LISTENER.lsnr
PRCR-1064 : Failed to start resource ora.LISTENER.lsnr on node dm01db01
CRS-2662: Resource 'ora.LISTENER.lsnr' is disabled on server 'dm01db01'
start listener on node=dm01db01 ... failed
Reenable the listener resource that was stopped and disabled in the Remove the Failed Database Server
from the Cluster section earlier in this white paper.
[root@replacement]# /u01/app/11.2.0/grid/bin/srvctl enable listener -l LISTENER -n dm01db01
[root@replacement]# /u01/app/11.2.0/grid/bin/srvctl start listener -l LISTENER -n dm01db01
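To confirm that the listener is now running on the replacement node (an optional check):

[root@replacement]# /u01/app/11.2.0/grid/bin/srvctl status listener -l LISTENER -n dm01db01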
These commands initiate the OUI (Oracle Universal Installer) to copy the Oracle Database software to
the replacement database server. However, to complete the installation, you must run the root scripts on
the replacement database server after the command completes.
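As with the Grid Infrastructure clone, a sketch of the typical addNode.sh invocation, run from the database home on a surviving node (paths and node names follow this document's conventions):

[oracle@surviving]$ cd /u01/app/oracle/product/11.2.0/dbhome_1/oui/bin
[oracle@surviving]$ ./addNode.sh -silent "CLUSTER_NEW_NODES={dm01db01}"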
WARNING: The following configuration scripts need to be executed as the root
user in each cluster node.
/u01/app/oracle/product/11.2.0/dbhome_1/root.sh #On nodes dm01db01
To execute the configuration scripts:
Open a terminal window.
Log in as root.
Run the scripts on each cluster node.
After the scripts are finished, you should see the following informational messages:
The Cluster Node Addition of /u01/app/oracle/product/11.2.0/dbhome_1 was
successful.
Please check '/tmp/silentInstall.log' for more details.
2. Verify that the init<SID>.ora file under $ORACLE_HOME/dbs references the spfile in the ASM shared
storage.
3. Verify the password file that is copied to $ORACLE_HOME/dbs during the add-node operation; it
needs to be renamed to orapw<SID>.
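A brief sketch of both checks; the SID dbm1, database name dbm, and spfile path below are illustrative assumptions only:

[oracle@replacement]$ cat $ORACLE_HOME/dbs/initdbm1.ora
SPFILE='+DATA/dbm/spfiledbm.ora'
[oracle@replacement]$ mv $ORACLE_HOME/dbs/orapwdbm $ORACLE_HOME/dbs/orapwdbm1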