NRS configuration, maintenance, and other routine operations for products.
[Figure: product software stack, from top to bottom]
− VNF (HAF, SDR, SCF, ..., DCF)
− Service layer: consists of the modules for service processing.
− Data & network access layer: provides data storage & access functions for services.
− PaaS (FusionStage…) platform layer: provides communication, status awareness, token allocation, and resource monitoring functions for the upper layers.
− CloudOS (FusionSphere OpenStack…)
− Data center
Background:
The 5G Core products use pods as the entities that run services. The VNF layer does not use a VM (referred to as a node at the PaaS layer) as a management object for fault monitoring and recovery. The NFVI layer checks NFVI network and board faults to detect and recover from VM faults, but does not monitor internal VM status or perform fault detection and recovery from the service perspective. As a result, the reliability requirements of the VNF layer are not met.
To close this gap, the VNF layer provides a node management service at the software level. It monitors intra- and inter-VM faults and takes different recovery measures based on the monitoring results, ensuring quick VM fault detection and restoration.
Working principle:
The server/client mode is used.
The client functionality is deployed on each node to detect faults.
Types of intra-node faults:
− Storage: storage read/write errors (failure or suspension), slow disk response, I/O overload, read-only partition/volume faults, partition exhaustion...
− Computing: CPU overload, memory overload or exhaustion, node overload, PID overload, key system service faults…
− Network: IP/route loss, network faults, IP address conflicts, abnormal port status…
Types of inter-node faults:
− Communication link faults or sub-health issues
The server functionality is deployed on control nodes. It diagnoses and rectifies faults reported by the client, uses the heartbeat mechanism to detect node faults (possibly caused by VM suspension, reboot, or repeated reboots), and provides node-specific O&M commands (such as query, reboot, and rebuild commands). Once it detects a fault, the server functionality interacts with services to determine whether to start node self-healing.
[Figure: node management service in server/client mode]
− Client, on each node: (1) fault detection; (2a) fault info report to the server; (2b) heartbeat detection + ping between the server and the client.
− Server, on control nodes: (3) fault info report to the VNF HA service and service status/self-healing policy query; (4) node info retrieval and (5) self-healing policy issuance between the server and the VNFM.
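As a concrete illustration of the client side, here is a minimal Go sketch of one intra-node storage check ("partition exhaustion") feeding the fault report of step 2a. Everything in it is an illustrative assumption rather than the product's actual API: the FaultReport type, the 90% threshold, the 10-second poll interval, and printing in place of the real report RPC. It also assumes a Linux node (golang.org/x/sys/unix).

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

// FaultReport is a hypothetical message the client sends to the server
// (step 2a in the figure); the field names are illustrative only.
type FaultReport struct {
	Node     string
	Category string // "storage", "computing", or "network"
	Detail   string
}

// checkPartitionUsage illustrates one intra-node storage check
// ("partition exhaustion"): it returns a report when usage of the
// given mount point exceeds the threshold.
func checkPartitionUsage(node, mount string, threshold float64) *FaultReport {
	var st unix.Statfs_t
	if err := unix.Statfs(mount, &st); err != nil {
		// A failing statfs is itself a storage fault worth reporting.
		return &FaultReport{node, "storage", fmt.Sprintf("statfs %s failed: %v", mount, err)}
	}
	if st.Blocks == 0 {
		return nil
	}
	used := 1 - float64(st.Bavail)/float64(st.Blocks)
	if used > threshold {
		return &FaultReport{node, "storage",
			fmt.Sprintf("partition %s is %.0f%% full", mount, used*100)}
	}
	return nil
}

func main() {
	// The client polls its local checks and reports any fault to the server.
	for range time.Tick(10 * time.Second) {
		if r := checkPartitionUsage("node-1", "/", 0.90); r != nil {
			fmt.Printf("report to server: %+v\n", r) // stand-in for the real RPC
		}
	}
}
```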
[Figure: fault detection points on a simplified VM container network (overlay/SR-IOV with VF 1-3 and PF 1-2 over the NFVI network)]
Faults on:
− Running entities: (7) host; (6) node or VM; (5) pod; (2) container or RU; (1) cell or process
− Networks: (3)(4) VM container network; (8)(9) NFVI network
Fast detection:
For example, a running entity detects a normal exit and reports it to the remote server.
Scope: running entities' normal exits can be detected.
Slow detection:
It must work with the heartbeat mechanism implemented between the remote server and running entities (deployed in processes).
Scope: running entities' unexpected suspensions or exits, as well as network faults, can be detected.
Note: There are no nodes on a bare-metal container network.
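Slow detection hinges on timing out heartbeats. The Go sketch below shows one plausible server-side heartbeat table; the heartbeatTable type, the 3-second timeout, and the entity names are assumptions for illustration, not the product's implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// heartbeatTable tracks the last heartbeat per running entity. An entity
// that stays silent longer than the timeout is treated as suspended,
// unexpectedly exited, or unreachable (a network fault) -- exactly the
// cases slow detection catches and fast detection cannot.
type heartbeatTable struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
	timeout  time.Duration
}

// beat records a heartbeat (or ping reply) from an entity.
func (t *heartbeatTable) beat(entity string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[entity] = time.Now()
}

// sweep returns the entities whose heartbeats have expired.
func (t *heartbeatTable) sweep() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var suspect []string
	for e, seen := range t.lastSeen {
		if time.Since(seen) > t.timeout {
			suspect = append(suspect, e)
		}
	}
	return suspect
}

func main() {
	tbl := &heartbeatTable{lastSeen: map[string]time.Time{}, timeout: 3 * time.Second}
	tbl.beat("pod-a") // in the real flow, entities beat periodically
	time.Sleep(4 * time.Second)
	fmt.Println("suspect entities:", tbl.sweep()) // [pod-a]
}
```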
Pod
Fault detection:
− Product layer: The HA-Server determines whether all processes in the same container are faulty based on the process fault detection result of step 1, summarizes the data, and reports the container faults to the HA-Governance (step 2 in the figure). The HA-Governance then determines whether the faults are pod faults.
− PaaS layer: The pod running status is monitored.
Self-healing:
− Product layer: If there is a pod fault, pod self-healing operations (issued via the VNFM) are triggered.
− PaaS layer: If the number of running instances is not equal to the configured number, pod self-healing is triggered.
Node
Fault detection:
− Product layer: The NRS-Agent detects faults on the node and checks the network between nodes. If a fault occurs, the NRS-Agent notifies the NRS-Master, which in turn notifies the HA-Governance (steps 1 and 2 in the figure).
− PaaS layer: Node resource (CPU/memory/disk) usage is checked against thresholds and the network connection is checked; these are the criteria for determining whether a node is still available.
− NFVI layer: The NFVI layer status is monitored for VMs.
Self-healing:
− Product layer: If the fault is deterministic, the NRS-Master performs node-level self-healing (already available). For nondeterministic faults, the HA-Governance performs node-level self-healing based on service status diagnosis (planned).
− PaaS layer: If a node is unavailable, its pods are migrated.
− NFVI layer: HA is triggered at the NFVI layer.
Host
Fault detection:
− Product layer: Faults are checked through heartbeat or network detection. If there are service instance, process, container, or node faults, the information is released.
− PaaS layer: The nodes on the host are checked for faults.
− NFVI layer: Hosts are checked.
Self-healing:
− Product layer: Node-level self-healing is triggered.
− PaaS layer: The pods are migrated.
− NFVI layer: HA is triggered at the NFVI layer.
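The self-healing rules above reduce to a few small decisions. The Go sketch below encodes two of them, the PaaS-layer instance-count rule for pods and the deterministic/nondeterministic routing for node faults; all names and strings are hypothetical.

```go
package main

import "fmt"

// needPodSelfHealing encodes the PaaS-layer rule: if the number of running
// pod instances differs from the configured number, self-healing is triggered.
func needPodSelfHealing(running, configured int) bool {
	return running != configured
}

// routeNodeFault encodes the product-layer rule: deterministic faults are
// healed directly by the NRS-Master, while nondeterministic faults go to the
// HA-Governance for service-status diagnosis before node-level self-healing.
func routeNodeFault(deterministic bool) string {
	if deterministic {
		return "NRS-Master: perform node-level self-healing"
	}
	return "HA-Governance: diagnose service status, then self-heal"
}

func main() {
	fmt.Println(needPodSelfHealing(2, 3)) // true -> rebuild the missing pod
	fmt.Println(routeNodeFault(false))
}
```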
[Figure: failure handling example. Service 1 becomes faulty; two database service instances are shown.]
Split brain:
A network splits into two separate partitions when nodes in a cluster can no longer communicate with each other. The two partitions may interfere with each other because each has an active node, affecting services across the network.
Principle:
To resolve this dual-active problem, an arbitration center serves as a lighthouse at the product layer. Once the network is split, nodes in the partition that remains well connected to the lighthouse keep running properly, whereas nodes in the other partition, disconnected from the lighthouse, are isolated.
Isolation measures:
Active control node: demoted to standby via a reboot if it is disconnected from the arbitration center.
Non-control node: reboots or keeps running after being disconnected from the active control node.
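A minimal Go sketch of the lighthouse check and the isolation measures above; the arbitration center address, the TCP probe, and the returned policy strings are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// canReach is a stand-in for the lighthouse connectivity probe; a real
// arbitration client would use the product's own protocol, not a bare TCP dial.
func canReach(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// isolationAction encodes the measures above: an active control node that
// loses the arbitration center reboots and rejoins as standby; a non-control
// node reacts to losing the active control node per its policy.
func isolationAction(isActiveControl, seesArbiter, seesActiveControl bool) string {
	switch {
	case isActiveControl && !seesArbiter:
		return "reboot and rejoin as a standby control node"
	case !isActiveControl && !seesActiveControl:
		return "reboot, or keep running, per policy"
	default:
		return "keep running"
	}
}

func main() {
	seesArbiter := canReach("arbitration.example:9000") // hypothetical address
	fmt.Println(isolationAction(true, seesArbiter, true))
}
```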
[Figure: split-brain arbitration with the arbitration center as a lighthouse. Before the network split, the ETCD cluster (ETCD 1-3) has one leader and two followers, and the service cluster has one active control node and two standby control nodes. After the split into subnet A and subnet B, the leader and active control roles stay with the partition that remains connected to the arbitration center; in the disconnected partition, the former active control node is demoted to standby.]
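The figure's behavior follows etcd's majority-quorum rule: only the partition holding a majority of the cluster's members can keep or elect a leader. A tiny Go sketch of that rule (the 3-member split is illustrative):

```go
package main

import "fmt"

// hasQuorum reports whether a partition that can reach `reachable` of the
// cluster's `total` members still holds a majority, and so can keep or
// elect an etcd leader.
func hasQuorum(reachable, total int) bool {
	return reachable > total/2
}

func main() {
	// A three-member cluster split 2/1:
	fmt.Println("2-member partition:", hasQuorum(2, 3)) // true  -> can keep a leader
	fmt.Println("1-member partition:", hasQuorum(1, 3)) // false -> cannot
}
```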
1. A split brain is discovered within a VNF, with the network split into partitions A and B. The ETCD cluster has three nodes (1, 2, and 3), where nodes 1 and 2 are in partition A and node 3 is in partition B. How will the master instance in partition B respond in this case? ( )
A. Reboots
B. Continues running
2. An AMF keeps subscribers online using the ( ) mechanism if a single NF becomes faulty.
A. Backup redundancy
B. Pool redundancy
Features and descriptions:
− Disk array bypass: When a disk array is faulty, services are not affected and limited O&M operations are allowed.
− Solution to no-master termination in case of a split brain: Instances in the smaller partition will keep running, and services will not be affected during or after network consolidation.
− Enhanced multi-level self-healing: More fault and fault-escalation scenarios will be considered.