
1) Rack up the 4 nodes and connect the cabling to all the nodes.

2) A cable must be connected to port 3 or port 4 (which is BOND0), and a cable is mandatory on port 5, which is our IPMI port.

3) Once the cabling to the switch is done from all 4 nodes, we can power on the nodes.
4) Each node should be sitting at its login screen.
5) There will be no communication between the nodes or to the IPMI IPs, as they are still holding the
older IP configuration.
6) Connect to each node using a KVM.
7) Log in to each node as the “rksupport” user.
8) We will be sharing the passwords for all the nodes when we start the procedure.
9) After logging in as rksupport, run the commands below to set up the new IPMI IP
configuration. This step must be done on each node (a filled-in example follows step 11).
• sudo ipmitool lan set 1 ipsrc static
• sudo ipmitool lan set 1 ipaddr <ip_address>
• sudo ipmitool lan set 1 netmask <netmask>
• sudo ipmitool lan set 1 defgw ipaddr <gateway_ip>
10) Keep a record of the new set of IPs used for the IPMI configuration.
11) Once the IPMI IP is changed on all 4 nodes, we should be able to log in to the IPMI of each node
using a browser.
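A filled-in example for one node, using assumed values only (IPMI IP 10.20.25.11, a /24 netmask, and gateway 10.20.25.254 are placeholders; substitute the values allocated for your environment):

    sudo ipmitool lan set 1 ipsrc static
    sudo ipmitool lan set 1 ipaddr 10.20.25.11
    sudo ipmitool lan set 1 netmask 255.255.255.0
    sudo ipmitool lan set 1 defgw ipaddr 10.20.25.254
    # optional: verify the new settings
    sudo ipmitool lan print 1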

FROM THE IPMI, WE NEED TO FOLLOW THE PROCEDURE BELOW:

Step-by-step Guide
1. Stop services on all nodes

sdservice.sh "*" stop 

2. On nodes with the wrong IP, fix the bond configuration. Connect to each node through its IPv6 link-
local address and perform the following steps.

2.1)  Make a copy of /etc/network/interfaces.d/, just in case
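For example (the backup destination is just a suggestion; any location outside /etc/network works):

    sudo cp -a /etc/network/interfaces.d /root/interfaces.d.bak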

2.2)  Go to /etc/network/interfaces.d/ and remove all files except bond0.cfg and bond1.cfg.
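One possible way to do this, assuming nothing else in that directory needs to be preserved:

    # list what would be removed first, then delete everything except the two bond configs
    sudo find /etc/network/interfaces.d/ -maxdepth 1 -type f ! -name 'bond0.cfg' ! -name 'bond1.cfg'
    sudo find /etc/network/interfaces.d/ -maxdepth 1 -type f ! -name 'bond0.cfg' ! -name 'bond1.cfg' -delete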

 2.3) Update bond configuration.


If management and data are on the same network:

    sudo /opt/rubrik/src/scripts/forge/configure_network.py -a -f -g <GATEWAY_IP> -mvl <MANAGEMENT_VLAN> -mip <MANAGEMENT_IP> -mn <MANAGEMENT_NETMASK>

or, if management and data are split:

    sudo /opt/rubrik/src/scripts/forge/configure_network.py -a -f -g <GATEWAY_IP> -mvl <MANAGEMENT_VLAN> -mip <MANAGEMENT_IP> -mn <MANAGEMENT_NETMASK> -dip <DATA_IP> -dn <DATA_NETMASK> -dvl <DATA_VLAN>

This step can also be done manually if preferred. In that case, refer to the section "Standard
form of network configuration file" below for the expected configuration.
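As an illustration only, for the combined management/data case, using the example new IP from the mapping in step 4.7 (the gateway 10.20.26.254, VLAN 100, and /24 netmask are assumptions; substitute your environment's values):

    sudo /opt/rubrik/src/scripts/forge/configure_network.py -a -f -g 10.20.26.254 -mvl 100 -mip 10.20.26.1 -mn 255.255.255.0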

2.4) Restart networking and confirm that the correct IPs are now in use. It is recommended to do this
one node at a time, and to stay logged in through the IPv6 link-local address during the process.

      sudo systemctl restart networking.service

If the restart does not go through, retry it 2-3 times. If networking still fails to restart, reboot the
node.

3. Fix ansible host vars (if multiple nodes need to be fixed):

Pick a driving node.

Manually update /var/lib/rubrik/ansible/host_vars to reflect the new DATA IPs.
This step makes it possible to use rkcl to run the following node-specific steps from the driving
node.

Note: these files are actively managed by the node monitor based on information from the metadata
store. Keep services down so that the manual edits are not overwritten.
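A quick sanity check after editing (the old data IPs shown are the example ones from the mapping in step 4.7; use your cluster's actual old IPs):

    # should return no matches once every host_vars file references only the new DATA IPs
    grep -rn "10.10.222." /var/lib/rubrik/ansible/host_vars/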

4. On all nodes, update cockroach:

4.1 Stop cockroachdb service on node (sudo service cockroachdb stop)

4.2 Update listen_address in /etc/cassandra/cassandra.yaml to indicate the new IP for the node.

4.3 Update the seeds in the seeds_provider section. Randomly pick two of the new IP addresses as
seeds (NOTE: use the same seeds on all nodes).
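For orientation, a minimal sketch of how the relevant cassandra.yaml entries typically look, assuming the stock Cassandra key names (the exact layout of the file on the appliance may differ; the IPs are the example ones from step 4.7):

    # new IP of this node
    listen_address: 10.20.26.1
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              # same two seeds on every node
              - seeds: "10.20.26.1,10.20.26.2"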

4.4 Take a backup of the old cockroachdb certificates (just in case). Then rm -f
/var/lib/rubrik/certs/cockroachdb/node.*
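One way to take that backup (the .bak destination is just a suggestion):

    # keep a copy of the old certificates, then remove the node certificate files
    sudo cp -a /var/lib/rubrik/certs/cockroachdb /var/lib/rubrik/certs/cockroachdb.bak
    sudo rm -f /var/lib/rubrik/certs/cockroachdb/node.*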

4.5 sudo /opt/rubrik/dist/gen_tls_cert_cockroachdb.pex --mode=node --certs-dir='/var/lib/rubrik/certs/cockroachdb'
(this will use the new listen_address listed in cassandra.yaml)

4.6 sudo touch /var/lib/rubrik/flags/cockroach_certs_backed_up


4.7 Create /var/lib/cockroachdb/kronos/re_ip_host_mapping.json (the format of each entry is
OLD_IP : NEW_IP). Note that NEW_IP is what you want to see after the re_ip recovery. The following
example assumes you are moving forward (manually continuing the re_ip). If you are going
BACK (restoring the cluster to its previous IPs), the mappings will be different.
{
    "10.10.222.1": "10.20.26.1",
    "10.10.222.2": "10.20.26.2",
    "10.10.222.3": "10.20.26.3",
    "10.10.222.4": "10.20.26.4"
}

4.8 sudo /opt/rubrik/src/scripts/cockroachdb/rkcockroach kronos cluster backup --data-dir=/var/lib/cockroachdb/kronos

4.9 sudo touch /var/lib/rubrik/flags/kronos_metadata_backed_up

4.10 sudo /opt/rubrik/src/scripts/cockroachdb/rkcockroach kronos cluster re_ip --mapping-file=/var/lib/cockroachdb/kronos/re_ip_host_mapping.json --data-dir=/var/lib/cockroachdb/kronos

4.11 Start cockroachdb service on node (sudo service cockroachdb start)

If 4.10 or 4.11 fails, double-check whether the new IPs include IP(s) reused from previously removed
nodes. If that is the case, apply the recovery steps in that section.

Note: until the nodes reboot, the cockroachdb services will not be able to talk to each other across
nodes. This is OK and expected. Only the local cockroach service is needed for the following steps
until the reboot.

After the nodes reboot (step 6 at the end), the Node Monitor service will add the necessary iptables
rules to allow cockroach (and all other services) to communicate between nodes.

4.12 Update IP config in cockroach node table

        cqlsh -e "consistency local_quorum; update sd.node set data_ip_address='XXXX', management_ip_address='XXXX' where node_id='XXXX' and cluster_id='cluster'"
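For example, for the node whose new data IP is 10.20.26.1 in the step 4.7 mapping (node_id is left as a placeholder here; use the actual node_id of the node being updated, and set management_ip_address to that node's management IP, which equals the data IP only when management and data are not split):

        cqlsh -e "consistency local_quorum; update sd.node set data_ip_address='10.20.26.1', management_ip_address='10.20.26.1' where node_id='<THIS_NODE_ID>' and cluster_id='cluster'"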

In a few cases it was observed that iptables was blocking the cockroach ports at this step, which
makes the above command fail.

To overcome this, relax iptables rules as follows on each node.

sudo iptables -A IN-INTERNODE-WHITELIST -s x.x.x.x/32 -d x.x.x.x/32 -m comment --comment "Manual" -j ACCEPT

Here the -d option is the local node's data IP and -s is the data IP of another node (add one rule for
each of the other nodes).
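As an illustration, on the node whose new data IP is 10.20.26.1 (example IPs from step 4.7; repeat with -s 10.20.26.3 and -s 10.20.26.4 for the remaining peers):

    sudo iptables -A IN-INTERNODE-WHITELIST -s 10.20.26.2/32 -d 10.20.26.1/32 -m comment --comment "Manual" -j ACCEPT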
Alternatively, we can flush iptables altogether. After the node reboot, all iptables rules will be
recreated. (Use this only as a last resort; try the above first.)

sudo iptables -P INPUT ACCEPT
sudo iptables -P FORWARD ACCEPT
sudo iptables -P OUTPUT ACCEPT
sudo iptables -F

4.13 Check Cockroachdb and Kronos status:

rkcockroach node status --all

rkcockroach kronos status

5. Start services on all nodes, confirm the services come up OK, and confirm that "rknodestatus" shows all
nodes in the OK state.

On all nodes:

sdservice.sh "*" start

6. Reboot all nodes and confirm everything is still working.
