NRS configuration, maintenance, and other routine operations for products.
[Figure: product software stack, from top to bottom]
− VNF (HAF, SDR, SCF, ..., DCF)
− Service layer: consists of the modules for service processing.
− Data & network access layer: provides data storage & access functions for services.
− PaaS (FusionStage…) platform layer: provides communication, status awareness, token allocation, and resource monitoring functions for the upper layers.
− CloudOS (FusionSphere OpenStack…)
− Data center
Background:
The 5G Core products use pods as the entities that run services. The VNF layer does not use a VM (referred to as a node at the PaaS layer) as a management object for fault monitoring and recovery. The NFVI layer checks NFVI network and board faults to detect and recover from VM faults, but does not monitor internal VM status or perform fault detection and recovery from the service perspective. As a result, the reliability requirements of the VNF layer are not met.
To close this gap, the VNF layer provides a node management service at the software level. It monitors intra- and inter-VM faults and takes different recovery measures based on the monitoring results, ensuring quick VM fault detection and restoration.
Working principle:
The server/client mode is used.
The client functionality is deployed on each node to detect faults.
Types of intra-node faults:
− Storage: storage read/write errors (failure or suspension), slow disk response, I/O overload, read-only partition/volume faults, partition exhaustion...
− Computing: CPU overload, memory overload or exhaustion, node overload, PID overload, key system service faults…
− Network: IP/route loss, network faults, IP address conflicts, abnormal port status…
Types of inter-node faults:
− Communication link faults or sub-health issues
The server functionality is deployed on control nodes. It diagnoses and rectifies faults reported by the client, uses the heartbeat mechanism to detect node faults (possibly caused by VM suspension, reboot, or repeated reboots), and provides node-specific O&M commands (such as query, reboot, and rebuild commands). Once it detects a fault, the server functionality interacts with services to determine whether to start node self-healing.
[Figure: node management service in server/client mode]
− Client, on each node: (1) fault detection; (2a) fault info report to the server; (2b) heartbeat detection + ping between the server and the client.
− Server, on control nodes: (3) fault info report to the VNF HA service and service status/self-healing policy query; (4) node info retrieval and (5) self-healing policy issuance between the server and the VNFM.
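As a concrete illustration of the client side, here is a minimal Go sketch of one intra-node storage check ("partition exhaustion") feeding the fault report of step 2a. Everything in it is an illustrative assumption rather than the product's actual API: the FaultReport type, the 90% threshold, the 10-second poll interval, and printing in place of the real report RPC. It also assumes a Linux node (golang.org/x/sys/unix).

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

// FaultReport is a hypothetical message the client sends to the server
// (step 2a in the figure); the field names are illustrative only.
type FaultReport struct {
	Node     string
	Category string // "storage", "computing", or "network"
	Detail   string
}

// checkPartitionUsage illustrates one intra-node storage check
// ("partition exhaustion"): it returns a report when usage of the
// given mount point exceeds the threshold.
func checkPartitionUsage(node, mount string, threshold float64) *FaultReport {
	var st unix.Statfs_t
	if err := unix.Statfs(mount, &st); err != nil {
		// A failing statfs is itself a storage fault worth reporting.
		return &FaultReport{node, "storage", fmt.Sprintf("statfs %s failed: %v", mount, err)}
	}
	if st.Blocks == 0 {
		return nil
	}
	used := 1 - float64(st.Bavail)/float64(st.Blocks)
	if used > threshold {
		return &FaultReport{node, "storage",
			fmt.Sprintf("partition %s is %.0f%% full", mount, used*100)}
	}
	return nil
}

func main() {
	// The client polls its local checks and reports any fault to the server.
	for range time.Tick(10 * time.Second) {
		if r := checkPartitionUsage("node-1", "/", 0.90); r != nil {
			fmt.Printf("report to server: %+v\n", r) // stand-in for the real RPC
		}
	}
}
```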
[Figure: fault detection points on a simplified VM container network (overlay/SR-IOV with VF 1-3 and PF 1-2 over the NFVI network)]
Faults on:
− Running entities: (7) host; (6) node or VM; (5) pod; (2) container or RU; (1) cell or process
− Networks: (3)(4) VM container network; (8)(9) NFVI network
Fast detection:
For example, a running entity detects a normal exit and reports it to the remote server.
Scope: running entities' normal exits can be detected.
Slow detection:
It must work with the heartbeat mechanism implemented between the remote server and running entities (deployed in processes).
Scope: running entities' unexpected suspensions or exits, as well as network faults, can be detected.
Note: There are no nodes on a bare-metal container network.
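Slow detection hinges on timing out heartbeats. The Go sketch below shows one plausible server-side heartbeat table; the heartbeatTable type, the 3-second timeout, and the entity names are assumptions for illustration, not the product's implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// heartbeatTable tracks the last heartbeat per running entity. An entity
// that stays silent longer than the timeout is treated as suspended,
// unexpectedly exited, or unreachable (a network fault) -- exactly the
// cases slow detection catches and fast detection cannot.
type heartbeatTable struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
	timeout  time.Duration
}

// beat records a heartbeat (or ping reply) from an entity.
func (t *heartbeatTable) beat(entity string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[entity] = time.Now()
}

// sweep returns the entities whose heartbeats have expired.
func (t *heartbeatTable) sweep() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var suspect []string
	for e, seen := range t.lastSeen {
		if time.Since(seen) > t.timeout {
			suspect = append(suspect, e)
		}
	}
	return suspect
}

func main() {
	tbl := &heartbeatTable{lastSeen: map[string]time.Time{}, timeout: 3 * time.Second}
	tbl.beat("pod-a") // in the real flow, entities beat periodically
	time.Sleep(4 * time.Second)
	fmt.Println("suspect entities:", tbl.sweep()) // [pod-a]
}
```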
Pod
Fault detection:
− Product layer: The HA-Server determines whether all processes in the same container are faulty based on the process fault detection result of step 1, summarizes the data, and reports the container faults to the HA-Governance (step 2 in the figure). The HA-Governance then determines whether the faults are pod faults.
− PaaS layer: The pod running status is monitored.
Self-healing:
− Product layer: If there is a pod fault, pod self-healing operations (issued via the VNFM) are triggered.
− PaaS layer: If the number of running instances is not equal to the configured number, pod self-healing is triggered.
Node
Fault detection:
− Product layer: The NRS-Agent detects faults on the node and checks the network between nodes. If a fault occurs, the NRS-Agent notifies the NRS-Master, which in turn notifies the HA-Governance (steps 1 and 2 in the figure).
− PaaS layer: Node resource (CPU/memory/disk) usage is checked against thresholds and the network connection is checked; these are the criteria for determining whether a node is still available.
− NFVI layer: The NFVI layer status is monitored for VMs.
Self-healing:
− Product layer: If the fault is deterministic, the NRS-Master performs node-level self-healing (already available). For nondeterministic faults, the HA-Governance performs node-level self-healing based on service status diagnosis (planned).
− PaaS layer: If a node is unavailable, its pods are migrated.
− NFVI layer: HA is triggered at the NFVI layer.
Host
Fault detection:
− Product layer: Faults are checked through heartbeat or network detection. If there are service instance, process, container, or node faults, the information is released.
− PaaS layer: The nodes on the host are checked for faults.
− NFVI layer: Hosts are checked.
Self-healing:
− Product layer: Node-level self-healing is triggered.
− PaaS layer: The pods are migrated.
− NFVI layer: HA is triggered at the NFVI layer.
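The self-healing rules above reduce to a few small decisions. The Go sketch below encodes two of them, the PaaS-layer instance-count rule for pods and the deterministic/nondeterministic routing for node faults; all names and strings are hypothetical.

```go
package main

import "fmt"

// needPodSelfHealing encodes the PaaS-layer rule: if the number of running
// pod instances differs from the configured number, self-healing is triggered.
func needPodSelfHealing(running, configured int) bool {
	return running != configured
}

// routeNodeFault encodes the product-layer rule: deterministic faults are
// healed directly by the NRS-Master, while nondeterministic faults go to the
// HA-Governance for service-status diagnosis before node-level self-healing.
func routeNodeFault(deterministic bool) string {
	if deterministic {
		return "NRS-Master: perform node-level self-healing"
	}
	return "HA-Governance: diagnose service status, then self-heal"
}

func main() {
	fmt.Println(needPodSelfHealing(2, 3)) // true -> rebuild the missing pod
	fmt.Println(routeNodeFault(false))
}
```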
[Figure: failure handling example. Service 1 becomes faulty; two database service instances are shown.]
Split brain:
A network splits into two separate partitions when nodes in a cluster can no longer communicate with each other. The two partitions may interfere with each other because each has an active node, affecting services across the network.
Principle:
To resolve this dual-active problem, an arbitration center serves as a lighthouse at the product layer. Once the network is split, nodes in the partition that remains well connected to the lighthouse keep running properly, whereas nodes in the other partition, disconnected from the lighthouse, are isolated.
Isolation measures:
Active control node: demoted to standby via a reboot if it is disconnected from the arbitration center.
Non-control node: reboots or keeps running after being disconnected from the active control node.
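A minimal Go sketch of the lighthouse check and the isolation measures above; the arbitration center address, the TCP probe, and the returned policy strings are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// canReach is a stand-in for the lighthouse connectivity probe; a real
// arbitration client would use the product's own protocol, not a bare TCP dial.
func canReach(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// isolationAction encodes the measures above: an active control node that
// loses the arbitration center reboots and rejoins as standby; a non-control
// node reacts to losing the active control node per its policy.
func isolationAction(isActiveControl, seesArbiter, seesActiveControl bool) string {
	switch {
	case isActiveControl && !seesArbiter:
		return "reboot and rejoin as a standby control node"
	case !isActiveControl && !seesActiveControl:
		return "reboot, or keep running, per policy"
	default:
		return "keep running"
	}
}

func main() {
	seesArbiter := canReach("arbitration.example:9000") // hypothetical address
	fmt.Println(isolationAction(true, seesArbiter, true))
}
```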
[Figure: split-brain arbitration with the arbitration center as a lighthouse. Before the network split, the ETCD cluster (ETCD 1-3) has one leader and two followers, and the service cluster has one active control node and two standby control nodes. After the split into subnet A and subnet B, the leader and active control roles stay with the partition that remains connected to the arbitration center; in the disconnected partition, the former active control node is demoted to standby.]
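The figure's behavior follows etcd's majority-quorum rule: only the partition holding a majority of the cluster's members can keep or elect a leader. A tiny Go sketch of that rule (the 3-member split is illustrative):

```go
package main

import "fmt"

// hasQuorum reports whether a partition that can reach `reachable` of the
// cluster's `total` members still holds a majority, and so can keep or
// elect an etcd leader.
func hasQuorum(reachable, total int) bool {
	return reachable > total/2
}

func main() {
	// A three-member cluster split 2/1:
	fmt.Println("2-member partition:", hasQuorum(2, 3)) // true  -> can keep a leader
	fmt.Println("1-member partition:", hasQuorum(1, 3)) // false -> cannot
}
```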
1. A split brain is discovered within a VNF, with the network split into partitions A and B. The ETCD cluster has three nodes (1, 2, and 3), where nodes 1 and 2 are in partition A and node 3 is in partition B. How will the master instance in partition B respond in this case? ( )
A. Reboots
B. Continues running
2. An AMF keeps subscribers online using the ( ) mechanism if a single NF becomes faulty.
A. Backup redundancy
B. Pool redundancy
Features and descriptions:
− Disk array bypass: When a disk array is faulty, services are not affected and limited O&M operations are allowed.
− Solution to no-master termination in case of a split brain: Instances in the smaller partition will keep running, and services will not be affected during or after network consolidation.
− Enhanced multi-level self-healing: More fault and fault-escalation scenarios will be considered.