
2017 International Conference on High Performance Computing & Simulation

Adaptive Root Cause Analysis for Self-Healing in 5G Networks

Harrison Mfula
Nokia Networks, Karaportti, Espoo, Finland
Email: harrison.mfula@nokia.com

Jukka K. Nurminen
Department of Computer Science, Aalto University, Espoo, Finland
Email: jukka.k.nurminen@aalto.fi

Abstract—Root cause analysis (RCA) is a common and recurring task performed by operators of cellular networks. It is done mainly to keep customers satisfied with the quality of offered services and to maximize return on investment (ROI) by minimizing and, where possible, eliminating the root causes of faults in cellular networks. Currently, the actual detection and diagnosis of faults or potential faults is still a manual and slow process, often carried out by network experts who manually analyze and correlate various pieces of network data such as alarms, call traces, configuration management (CM) and key performance indicator (KPI) data in order to come up with the most probable root cause of a given network fault. In this paper, we propose an automated fault detection and diagnosis solution called adaptive root cause analysis (ARCA). The solution uses measurements and other network data together with Bayesian network theory to perform automated evidence based RCA. Compared to the current common practice, our solution is faster due to automation of the entire RCA process. The solution is also cheaper because it needs fewer or no personnel in order to operate and it improves efficiency through domain knowledge reuse during adaptive learning. As it uses a probabilistic Bayesian classifier, it can work with incomplete data and it can handle large datasets with complex probability combinations. Experimental results from stratified synthesized data affirmatively validate the feasibility of using such a solution as a key part of self-healing (SH), especially in emerging self-organizing network (SON) based solutions in LTE Advanced (LTE-A) and 5G.

Keywords: Root cause analysis, self-healing, LTE-A, 5G

I. INTRODUCTION

The lack of automation, rising costs and complexity of fixing faults, especially those that are related to emerging cellular network technologies such as LTE-A and 5G, were the main motivational factors of this research.

Root cause analysis (RCA) is a common task performed by operators for detecting and diagnosing the root causes of faults or problems in cellular networks. In spite of the various improvements in the operations support systems (OSS) over the years, RCA remains an ad-hoc, manual and slow task which is heavily dependent on the scarce knowledge of a few domain experts who actually perform fault detection and diagnosis in a way similar to what is shown in Fig. 1. It is also strange to note that even though RCA is such a common and recurring task among network operators, there is a shortage of sample datasets of faulty cases and accompanying RCA results currently available to the radio network research community. Furthermore, despite being the most widely adopted practice in current networks, manual fault detection and diagnosis continues to attract a lot of criticism, especially in discussions debating the challenges and opportunities of its possible application in emerging technologies such as LTE-A and 5G.

Fig. 1: Manual Root Cause Analysis

Many operators and domain experts agree that the current fault detection and diagnosis procedure shown in Fig. 1 is too slow and no longer viable technically and financially due to the ever growing size and complexity of modern cellular networks. As a result, some industry and academic studies aimed at addressing these challenges have been conducted [8] [9]. The most prominent among these studies is [10], which resulted in the standardization of self-healing (SH) by 3GPP since Release 10 [11]. Self-healing aims at automating network fault troubleshooting and recovery. The standard provides the business cases and theoretical guidelines of self-healing while leaving the implementation details open to vendor preferences. In [2], for example, the authors developed an automatic detection and diagnosis framework for mobile communication systems, while in a related study, the authors in [4] implemented an RCA solution based on self-organizing map (SOM). Despite all these efforts, studies on this topic are still scarce and fragmented. This fragmentation prompted the authors of [1] to focus on unifying available information and key concepts, which they presented in the form of a common framework for self-healing.

978-1-5386-3250-5/17 $31.00 © 2017 IEEE
DOI 10.1109/HPCS.2017.31
While previous studies have made several contributions to self-healing, our work contributes field experience and practical domain knowledge to both the industry and research community on this emerging topic through the introduction of:

I. A fault detection algorithm which uses a Chi-squared test to detect significant cell performance deviation from expected profile values.
II. An automated fault root cause diagnosis algorithm based on Bayesian network theory.
III. A ranking algorithm based on the Pareto principle, used to prioritize the diagnosed faults in order to maximize return on investment with respect to business policies and severity.

The rest of the paper is organized as follows: Section II summarizes some relevant inference methods investigated. Section III describes the proposed automated root cause analysis solution. Evaluation of the proposed solution is done in Section IV. Section V summarizes relevant literature. Finally, Section VI concludes the paper and gives some details about future plans.

II. OVERVIEW OF POPULAR INFERENCE METHODS

Artificial neural networks (ANNs), fuzzy set theory (FST), rule based systems (RBS) and Bayesian networks (BNs) are among the most important and popular methods used by experts to map observations to most probable root causes in various disciplines.

Other than our own comparative summary of characteristics relevant to this research, which is presented in the next subsection, detailed descriptions of these methods are outside the scope of this paper.

A. Comparative summary of popular inference methods

TABLE I: Comparison of Popular Inference Methods

Parameter                   ANN                 FST         RBS      BN
Uncertainty type            Fuzzy, probability  Fuzzy       Rule     Probability
Uncertainty representation  Weight vector       Membership  Rule     Probability
Explanation                 Hard                Hard        Depends  Easy
Relation explanation        Hard                Easy        Depends  Easy
Training                    Huge                Huge        n/a      Small

B. Selected inference method

Based on the analysis in Table I, BN was seen as the most suitable inference method because:

I. In cellular networks, relationships among different probable causes of faults are uncertain and thus, they can be expressed in terms of probability.
II. Training data related to faults is scarce and often incomplete; BNs provide a good fit in this context as well because they need less training data.
III. BNs have a proven reputation of coping well within applications that deal with a lot of data in various industries [3] [6]. Likewise, their selection as a key part of our diagnosis solution was aimed at building an automated troubleshooting solution that can cope with huge and ever growing data amounts as expected in LTE-A and 5G cellular networks.

III. ADAPTIVE ROOT CAUSE ANALYSIS (ARCA)

In this section, we introduce ARCA, our proposed solution for automating root cause analysis of cellular network faults based on continuous adaptive learning.

Like a human expert, the developed solution uses continuous adaptive learning for smart decision making. Details about the key components of the solution are presented in the next subsections.

A. Chi-squared test based fault detection

Time-dependent statistical profiles are used to capture and present expected values and distribution of monitored cell KPIs such as Reference Signal Received Power (RSRP) while taking into account their temporal variations. Often profile information is captured from expert knowledge or learned from statistical network measurements. In this research, we created three profiles which are commonly used in cellular networks:

• Urban weekday
• Urban weekend
• Rural

Further, we used the sliding window concept to capture the temporal aspect of processing measurements in defined intervals. The actual fault detection is done using a Chi-squared test ($\chi^2$).

Fig. 2: $\chi^2$-based Profile Deviation Detection

We use a Chi-squared test as a statistical test for determining whether the deviations found, if any, between the observed and the expected KPI values captured in the profiles are significantly different. Formally, it is written as:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \tag{1}$$

where $O$ represents the observed value and $E$ the expected value. The degrees of freedom (d.f) is calculated as:

$$v = n - k \tag{2}$$

where $v$, $n$ and $k$ represent the d.f, the number of possible nominal categories a KPI can take and the number of restrictions respectively.
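The detection step defined by (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the ARCA implementation: the function names are hypothetical, the observed and expected counts are made-up values, and the critical values are the standard Chi-squared table entries for a significance level of 0.05.

```python
# Critical values of the chi-squared distribution at alpha = 0.05,
# indexed by degrees of freedom (standard statistical table values).
CRITICAL_VALUES_005 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def chi_squared_statistic(observed, expected):
    """Equation (1): sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def deviates_from_profile(observed, expected, restrictions=1):
    """Return (statistic, d.f., flag); flag is True when the statistic
    exceeds the critical value, i.e. the deviation is significant."""
    dof = len(observed) - restrictions  # equation (2): v = n - k
    stat = chi_squared_statistic(observed, expected)
    return stat, dof, stat > CRITICAL_VALUES_005[dof]


# Hypothetical KPI counts for three categories (not taken from the paper).
observed = [18, 30, 12]
expected = [20, 20, 20]
stat, dof, flagged = deviates_from_profile(observed, expected)
print(f"chi2 = {stat:.3f}, d.f. = {dof}, deviates = {flagged}")
```

A cell for which the flag is True would be handed over to the diagnosis stage. The lookup table is used here only to keep the sketch free of external dependencies.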

TABLE II: Example Chi-squared Statistic Calculation

Day        Observed (O)  Expected (E)  O - E  (O - E)^2  (O - E)^2 / E
Monday     30            25            5      25         1.000
Tuesday    14            25            -11    121        4.840
Wednesday  34            30            4      16         0.533
Thursday   45            40            5      25         0.625
Friday     57            60            -3     9          0.150
Total      180           180           0      196        7.148

Table II shows an example of how to calculate and use a $\chi^2$ statistic. For this example, we assume a significance level of $\alpha = 0.05$ and use a weekday profile which results in 5 possible categories and 1 restriction, therefore, d.f is calculated as $5 - 1 = 4$. From the Chi-square table, we read the critical value (c.v) of 9.488. Since the test statistic from Table II ($\chi^2 = 7.148$) is less than the c.v (9.488), we cannot conclude that the cell performance has deviated significantly from what is typically expected, so this cell is not added to the list of cells to be diagnosed. Whenever the test statistic exceeds the c.v, the cell is included in that list. The actual diagnosis is done as described in the next subsection.

B. Naive Bayes classifier (NBC) based diagnosis

NBC is a probabilistic classifier which is based on Bayes' theorem with a strong (naive) assumption of independence among predictors. As a probabilistic model, it is suitable for handling missing or uncertain data and inputs of high dimensions such as cellular network features. Bayes' theorem describes the probability of an event based on conditions and symptoms that might be related to the event. For example, let us assume that the RSRP KPI directly influences the coverage of a cell in an LTE cellular network. Then using Bayes' theorem, a cell's RSRP KPI value can be used to more accurately assess the probability of an eventual cell coverage problem by exploiting the mathematical probability dependence between the cause and the symptom recorded in the service area. Formally, Bayes' theorem is formulated mathematically as:

$$P(h|D) = \frac{P(D|h)P(h)}{P(D)} \tag{3}$$

where $P(h|D)$ is called the posterior probability. It represents the probability of $h$ given that $D$ has been observed. $P(D|h)$ is called the likelihood. It represents the probability of $D$ given $h$. $P(h)$ represents the prior probability. $P(D)$ represents the probability of observing $D$.

From the diagnosis perspective, (3) is used as the basis for calculating the maximum a posteriori (MAP) among alternative hypotheses. For example, given $H = \{h_1, h_2, \ldots, h_n\}$ hypotheses representing root causes, we need to compute the probability of each hypothesis and then select the hypothesis with the highest probability as the root cause. In other words:

$$P(h_1|D) = \frac{P(D|h_1)P(h_1)}{P(D)}, \quad P(h_2|D) = \frac{P(D|h_2)P(h_2)}{P(D)}, \quad \ldots, \quad P(h_n|D) = \frac{P(D|h_n)P(h_n)}{P(D)}$$

Since $P(D)$ is the same in all calculations, we can ignore it and then simplify the calculation of the MAP hypothesis using the following ratio:

$$P(h_i|D) \propto P(D|h_i)P(h_i) \tag{4}$$

Using the naive independence assumption between every pair of features, and writing the observation as a set of $n$ features $D = (D_1, D_2, \ldots, D_n)$, the posterior based on (4) can be calculated as:

$$P(h_i|D) \propto P(h_i) \prod_{j=1}^{n} P(D_j|h_i) \tag{5}$$

From the classification perspective, this means that we need to solve:

$$h_{MAP} = \arg\max_{h_i \in H} P(h_i) \prod_{j=1}^{n} P(D_j|h_i) \tag{6}$$

In practice, solving (6) involves multiplying many conditional probabilities which are usually small values, therefore, we switch to logarithm notation to avoid floating point underflow errors:

$$h_{MAP} = \arg\max_{h_i \in H} \left[ \log P(h_i) + \sum_{j=1}^{n} \log P(D_j|h_i) \right] \tag{7}$$

where $\log P(D_j|h_i)$ is interpreted as the weight that indicates how good an indicator $D_j$ is for $h_i$. In the same way, $\log P(h_i)$ is a weight that indicates the prior relative frequency of $h_i$. Together they represent the evidence used by our diagnosis algorithm. They are calculated using Bayesian estimation.

In order to avoid bias and probability estimates which equal zero during the calculation of likelihood, we use the m-estimate of probability, which is written as:

$$P(D_j|h_i) = \frac{n_c + mp}{n + m} \tag{8}$$

where $n$ is the total number of instances of a given class in the training dataset and $n_c$ is the number of instances of that given class that have a given value. $m$ is a constant called the equivalent sample size. It can be interpreted as augmenting the $n$ actual observations by an additional $m$ virtual samples distributed according to $p$. In our case, a KPI observation may take one of three values, L, H or −−, which represent a low, high, and missing or negligible KPI value respectively, therefore, our $m = 3$. Finally, $p$ is our prior estimate of the probability we wish to determine. In the absence of other prior information, it is commonly assumed to be uniform. Thus, in our case it can be assumed to be 1/3.

(7) forms the mathematical basis of the NBC based implementation of the diagnosis component used by ARCA. The high level architecture and control flow of the entire solution is shown in Fig. 3.

C. Pareto analysis based fault fixing impact ranking

Pareto analysis is a statistical technique used in strategic decision making processes. It is loosely based on the idea that by doing 20% of the most important work, you can generate 80% of the benefit of doing the entire job [13].

In our solution, we use the Pareto Analyzer component shown in Fig. 3 to intelligently maximize return on investment by creating a list of the top n faults which, when fixed, will bring the most value to the network operators and subscribers while taking into account business policies and severity.
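The ranking idea can be sketched as follows. This is an illustrative sketch, not the Pareto Analyzer itself: the function name and the 0.8 cutoff are assumptions, the fault counts are tallied from the Diagnosis row of Table V, and the weights are the business priorities listed in Table IV, both of which appear later in the paper.

```python
def pareto_top_faults(fault_counts, priorities=None, cutoff=0.8):
    """Rank faults by (optionally weighted) frequency and return the shortest
    prefix of the ranking whose cumulative share of the total reaches `cutoff`."""
    priorities = priorities or {}
    # Weighted score per fault; a fault without a priority gets weight 1.
    scores = {fault: n * priorities.get(fault, 1.0) for fault, n in fault_counts.items()}
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

    selected, cumulative = [], 0.0
    for fault, score in ranked:
        if total == 0 or cumulative / total >= cutoff:
            break
        selected.append(fault)
        cumulative += score
    return selected


# Number of cells per diagnosed fault (tallied from Table V).
counts = {"EADT": 4, "CH": 2, "ISI": 3, "TLHO": 9, "EAUT": 1, "CO": 2}
# Business priority per fault (last column of Table IV).
priority = {"EADT": 1, "CH": 1, "ISI": 1.5, "TLHO": 2.5, "EAUT": 1, "CO": 3}

print(pareto_top_faults(counts, priority))
```

Passing no priorities ranks the faults by plain frequency, while passing the business priorities moves severe but less frequent faults such as CO up the list.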

Fig. 3: Adaptive Root Cause Analysis Architecture

D. SON function (SF) based corrective action

After ranking of all diagnosed faults, relevant SFs are triggered in order to perform corrective actions as shown in Fig. 3. From the operator's point of view, an SF is a specialized tool that can be used for automatic configuration, optimization, diagnosis and healing of cellular networks.

The proposed solution is designed to work in both open and closed loop mode.

E. Measurement based corrective action evaluation

Using the Evaluator shown in Fig. 3, the impact of every corrective action is assessed using all applicable KPIs collected after every fault corrective action. In other words, the evaluation results are used to balance the handling of false positives and false negatives with respect to operator goals.

IV. EVALUATION

In this section, the test network environment, test scenarios and main experimental results of the ARCA solution are presented.

A. Test network configuration

The experiment was done using a simulated LTE network based on [12] with simulation parameters shown in Table III. In the simulation, a supervised learning approach was used on 84 cells. The KPIs used and fault distribution data were collected from manual RCA sessions conducted by experts in the field using the stratified sampling method. Though not used in the training phase, the training data also contains an expert assigned root cause per cell (ground truth). This information is used during the verification of ARCA diagnosis results. Cells with normal (HC abbreviation in Table IV) KPIs are filtered out by the Profile Manager. In this simulation only 21 cells met the prerequisites of the diagnosis phase. The diagnosed faults are prioritized by the Pareto Analyzer and fed into respective SFs that perform the actual corrective actions. The implementation details of SFs which are used to perform corrective actions are outside the scope of this paper. Details about them can be found in [8][9] under self-optimization chapters.

TABLE III: Simulation Network Setup

Parameter                         Assumption
System frequency                  2000 MHz
System bandwidth                  10 MHz
Physical resource blocks (PRBs)   50
Thermal noise per PRB             -121.4 dBm
Frequency reuse factor            1
Total active user equipment (UE)  500
Number of eNodeBs                 28
Cells                             Monitored: 84, Diagnosed: 21
Cell layout                       Hexagonal grid with wrap-around
Inter-site distance               0.5 km
Path loss model                   L = 128.1 + 37.6 log10(R), R is user distance from BS, in km
Shadowing STD                     8 dB
Shadowing decorrelation           0.5 (sites), 1 (sectors)
Shadowing decorr. distance        50 m
Maximum eNodeB TX power           46 dBm
UE velocity                       10 km/h
Channel model                     Typical Urban (TU)
UE height h_UE                    1.5 m
eNB antenna height h_eNB          18.5 m (mean), 16 m (min.), 20 m (max.)
Packet scheduler(s)               Resource-fair
Traffic type                      Full-buffer
UE TX power                       Minimum: -30 dBm, Maximum: 23 dBm
Antenna gain after cable loss     15 dBi
UE antenna gain                   0 dBi
White noise power density         -174 dBm/Hz
Noise figures                     eNodeB: 5 dB, UE: 9 dB

The following KPIs were used in the simulations:

• RSRP: Reference Signal Received Power. Measured in dBm, this is a standard KPI that depicts the strength of the pilot signals received in the downlink (DL). It is defined as the average downlink received power on the resource elements that carry cell specific reference signals from the serving cell within the considered bandwidth.

• RSRQ: Reference Signal Received Quality. The KPI depicts the quality of the pilot signals received in the DL. Measured in dB, it is defined as the ratio between RSRP and the wideband received signals from all base stations in the carrier bandwidth plus thermal noise (RSSI).

• HOSR: Handover Success Rate. Measured in percentages, this KPI describes the network's ability to enable users to continue receiving the service and maintain connection during movement. It is calculated as the ratio between successful handover attempts to the total number of handover attempts.

• E RAB RET: Used to evaluate network capability to retain the service requested by a user for a desired duration once the user is connected to the service. It is measured in percentages and calculated as the ratio of connections that have completed successfully to the total number of completed connections.

• SINR: Signal to Interference plus Noise Ratio. Measured in dB, this KPI is defined as the ratio between

the power of the desired data signal to the sum of the inter cell interference powers plus noise. Generally, this KPI gives an assessment of the long term signal quality in the cell.

• AVG THROUGHPUT: Calculated as an average of user throughput ($T_u$) on the basis of their SINR using the following equation [4]:

$$T_u = \frac{D_u}{TTI} \left(1 - BLER(SINR)\right) \tag{9}$$

where BLER is the block error probability that depends on the SINR of user $u$, $D_u$ is its data block payload in bits and TTI is the transmission time interval.

• 50 PERC DIST: Depicts where the majority of users are located with respect to the center of the serving cell base station.

B. Experimental scenarios

1) Excessive antenna downtilt (EADT): This is a common case especially in urban areas where, due to increased traffic, operators tend to focus the antenna beam close to the base station in order to increase cell capacity. Typical manifestations of this fault include high RSRP values close to the base station, while RSRQ and E RAB RET values on average drop, especially for users further away from the base station. Frequent handovers of users at the cell edge are often also experienced and subsequently, the number of users captured in 50 PERC DIST also tends to be below average.

2) Coverage hole (CH): Represents an area where the signal levels of both the serving and allowed neighbor cells are below the mandatory level required to maintain the service. This might be caused, for example, by physical obstructions such as new buildings in urban areas and hills and mountains in rural areas. In some situations, unsuitable antenna parameters or bad radio frequency (RF) planning may also create coverage holes. The typical symptoms of this fault include low RSRP reported by users in a certain part of the service area, which results in frequent call drops and radio link failures (RLFs).

3) Inter-system interference (ISI): It may occur in areas covered by overlapping radio access technologies (RATs) such as LTE, WCDMA and GSM. Areas experiencing this problem typically register degradation in SINR, RSRQ, E RAB RET and AVG THROUGHPUT KPIs due to excessive interference among different RATs.

4) Too late handover (TLHO): It occurs during UE mobility when a source cell initiates a handover too late, or not at all. Typically, this leads to a radio link failure in the source cell. In the modeled case, the handover never happened although a handover to the neighbor cell would have been possible and desirable. Even though the ratio of connections that have completed successfully to the total number of completed connections shows a high HOSR value, the RSRQ degradation indicates that the service quality the users are experiencing has degraded, while the RSRP in the cell also shows significant degradation due to the too long average distance of UEs from the serving cell base station center as shown by the 50 PERC DIST KPI. Furthermore, since the cell is not performing handovers as early as required, it has higher than usual E RAB RET, which means that it is retaining users even with bad quality instead of performing handovers to neighbor cells with better signal quality.

5) Excessive antenna uptilt (EAUT): It appears as a side effect of excessive uptilting of cell antennas. It is typically experienced when the cell coverage starts reaching far beyond its planned coverage area, often causing islands of coverage within the coverage area of another cell. During diagnosis, cells having this fault typically show a 50 PERC DIST KPI value larger than normal. Consequently, the cell serves more distant users than usual and this leads to degradation of SINR, RSRQ and AVG THROUGHPUT. While this fault is common in rural areas, it is also typical in urban areas, for example near highways and water bodies. In both cases, this happens as an unwanted side effect of cell coverage extension.

6) Cell outage (CO): Appears when a cell ceases to handle any traffic. Common causes include hardware failure or parameter misconfiguration. Troubleshooting information includes hardware alarms and KPIs which typically show very low, negligible or missing values for RSRP, RSRQ and SINR. Actual KPI values depend on the type of outage, which can be partial or complete.

TABLE IV: ARCA Diagnosis Mapping

Diagnosis ID (DID)  Fault Name                  Fault Abbreviation  Business Priority
0                   Healthy cell                HC                  0
1                   Excessive antenna downtilt  EADT                1
2                   Coverage hole               CH                  1
3                   Inter-system interference   ISI                 1.5
4                   Too late handover           TLHO                2.5
5                   Excessive antenna uptilt    EAUT                1
6                   Cell outage                 CO                  3

C. Performance evaluation metrics

Table IV is used to map internal diagnosis identifiers (DIDs) used by the NBC to human readable names for visualization and other purposes. The actual cell KPIs and obtained diagnosis for each individual cell in the test scope are shown in the columns of Table V. Normalization of KPIs is a prerequisite for our implementation of the NBC. KPI values are normalized and mapped from numeric to categorical values such as low (L), high (H) and negligible or missing (−−) using if ... then ... else expert mapping rules or default ranges presented in Table VI. The preferences are a way that the operator can influence the operation of ARCA by defining operating preferences that suit their environment. In case ARCA needs more information, such as neighbor or alarm information, before completing a diagnosis, additional information can be provided using an additional input file or data source.

The performance of the proposed classifier is evaluated using a confusion matrix table, accuracy and the Kappa statistic. Because we are using supervised learning, the confusion matrix makes it easy to see if the classifier is misclassifying one class for another. Each column in the table represents the instances in the predicted class while each row represents the

TABLE V: ARCA Performance Evaluation Test Cases
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 C21
E RAB RET L H - - H H H H H - - L L L L L H H H L -
HOSR L H - - H H H H H - - - - - - - H H H L -
RSRP H - H H - H H - - L L H H - H H - H - H -
RSRQ L L L L - - - L - - - H H H - H - - L L L
SINR - L - - - - H - H L L H H L L L - - - - L
AVG THROUGHPUT H H H H H - H - - L H - - L L L H - - - H
50 PERC DIST L L L L H H - H H H L H H H H - H H H H L
Diagnosis (DID) 1 1 1 1 4 4 4 4 4 5 6 2 2 3 3 3 4 4 4 4 6
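To make the NBC decision rule concrete, the sketch below applies the log form of the MAP rule in (7) with the m-estimate in (8) to the categorical rows of Table V. It is an illustration under stated assumptions, not the ARCA implementation: the function names and the underscore spelling of the KPI identifiers are hypothetical, the prior is taken as the relative class frequency, and the Table V test cases are reused as training data only because the paper's actual training set is not reproduced here.

```python
import math
from collections import Counter, defaultdict

# KPI rows of Table V (cells C1..C21); "-" marks a negligible or missing value.
KPIS = ["E_RAB_RET", "HOSR", "RSRP", "RSRQ", "SINR", "AVG_THROUGHPUT", "50_PERC_DIST"]
TABLE_V = """
L H - - H H H H H - - L L L L L H H H L -
L H - - H H H H H - - - - - - - H H H L -
H - H H - H H - - L L H H - H H - H - H -
L L L L - - - L - - - H H H - H - - L L L
- L - - - - H - H L L H H L L L - - - - L
H H H H H - H - - L H - - L L L H - - - H
L L L L H H - H H H L H H H H - H H H H L
"""
DIAGNOSES = [1, 1, 1, 1, 4, 4, 4, 4, 4, 5, 6, 2, 2, 3, 3, 3, 4, 4, 4, 4, 6]

rows = [line.split() for line in TABLE_V.strip().splitlines()]
cells = list(zip(*rows))  # one tuple of 7 categorical KPI values per cell


def m_estimate(n_c, n, m=3, p=1 / 3):
    """Equation (8): smoothed likelihood with equivalent sample size m and prior p."""
    return (n_c + m * p) / (n + m)


def train(cells, labels):
    """Count class frequencies and per-class, per-KPI value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, KPI index) -> value counts
    for features, label in zip(cells, labels):
        for j, value in enumerate(features):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts


def classify(features, class_counts, value_counts):
    """Equation (7): pick the class with the largest log prior plus log likelihoods."""
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior as relative class frequency
        for j, value in enumerate(features):
            score += math.log(m_estimate(value_counts[(label, j)][value], n))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


class_counts, value_counts = train(cells, DIAGNOSES)
# Classify cell C12, which Table V lists with DID 2 (coverage hole).
print(classify(cells[11], class_counts, value_counts))
```

Because the m-estimate never returns zero, a KPI value that was never seen for a class lowers that class's score without eliminating it, which is what allows the classifier to cope with missing values.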

TABLE VI: ARCA Algorithm Parameter Preferences

Parameter                           Operator Policy A  Operator Policy B  Parameter Description
Scope                               21 cells           21 cells           Number of cells in test area
Exclusion scope                     []                 []                 Ids of cells excluded from analysis
Scope name prefix                   ARCA MAN           ARCA CLOP          RCA scope name prefix
Corrective action trigger strategy  Manual             Automatic          Mode of triggering corrective action SON functions
Data statistical validity           1 week             1 week             Minimum period of gathering network measurements
KPI summarization period            Last hour          Last hour          KPI summarization period

KPI Thresholds (expert defined thresholds)

KPI                    Threshold     Description
RSRP [dBm]             [-140,-44]    Signal strength measurement range required to get LTE service
RSRQ [dB]              [-19.5,-3.0]  Signal quality measurement range required to maintain LTE service
SINR [dB]              [13, >20]     Recommended range for signal to interference and noise ratio
HOSR [%]               95.00         Recommended handover success rate threshold
E RAB RET [%]          98.00         Recommended retainability threshold
AVG THROUGHPUT [Kbps]  [100, >150]   Average cell throughput range threshold
50 PERC DIST [Km]      [1.00, 5.00]  50th percentile distance of user location from the base station center

Additional Input

Additional ARCA input file: /home/arca/shared/additional arca input.json  Optional input file with alarm, call trace and neighbor information
instances in an actual class (ground truth). According to the rules of confusion matrices, correct diagnoses lie along the main diagonal of the confusion matrix. All other non zero entries in the table are misclassifications. Though it is easy to derive various performance metrics from the confusion matrix shown in Table VII, for the sake of this demonstration, we highlight only the most important metrics:

TABLE VII: ARCA Classifier Confusion Matrix

DID  1  2  3  4  5  6
1    4  0  0  0  0  0
2    0  2  0  0  0  0
3    0  0  3  0  0  0
4    0  0  0  9  0  0
5    0  0  0  0  1  0
6    1  0  0  0  0  1

1) Accuracy: Describes how often the classifier is correct. It is calculated as:

$$P(c) = \frac{tp + tn}{\text{total instances}} \tag{10}$$

where $P(c)$, $tp$ and $tn$ represent classifier accuracy, true positive and true negative classification counts respectively. Therefore, ARCA classification accuracy is calculated as:

$$P(c) = \frac{20}{21} = 0.95$$

2) Kappa Statistic: This metric is used in order to more accurately assess the capability of the classifier by taking into account correct classifications that might occur by chance.

According to the confusion matrix of the random classifier shown in Table VIII, we can see that only 1 out of 4 faults with DID 1 was correctly diagnosed and 4 out of 9 faults with DID 4 were correctly diagnosed while the rest were all misdiagnosed. A simple comparison of the two confusion matrices (Tables VII and VIII) clearly shows that ARCA has better diagnosis results (only 1 misdiagnosis) on the test dataset compared to the random classifier.

TABLE VIII: Random Classifier Confusion Matrix

DID  1  2  3  4  5  6
1    1  0  1  2  0  0
2    0  0  0  2  0  0
3    1  0  0  2  0  0
4    2  1  1  4  0  1
5    0  0  0  1  0  0
6    0  0  0  2  0  0

Kappa is calculated using:

$$K = \frac{P(c) - P(r)}{1 - P(r)} \tag{11}$$
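Both metrics can be computed directly from the two confusion matrices. The sketch below is illustrative only: the function names are hypothetical, the matrices are copied from Tables VII and VIII, and P(r) is taken to be the accuracy of the random classifier of Table VIII.

```python
# Rows are actual classes, columns are predicted classes (DID 1..6).
ARCA_MATRIX = [
    [4, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 0],
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 9, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1],
]
RANDOM_MATRIX = [
    [1, 0, 1, 2, 0, 0],
    [0, 0, 0, 2, 0, 0],
    [1, 0, 0, 2, 0, 0],
    [2, 1, 1, 4, 0, 1],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 2, 0, 0],
]


def accuracy(matrix):
    """Share of instances on the main diagonal, i.e. correctly diagnosed cells."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def kappa(p_c, p_r):
    """Equation (11): agreement of the classifier corrected for chance agreement."""
    return (p_c - p_r) / (1 - p_r)


p_c = accuracy(ARCA_MATRIX)    # 20 of 21 cells
p_r = accuracy(RANDOM_MATRIX)  # 5 of 21 cells
print(f"P(c) = {p_c:.3f}, P(r) = {p_r:.3f}, K = {kappa(p_c, p_r):.3f}")
```

With unrounded accuracies the Kappa statistic evaluates to 15/16, which is about 0.94. Rounding the two accuracies to two decimals before applying (11) gives a marginally smaller value.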

Fig. 4: Pareto Analysis

In (11), the accuracy of the random classifier is:

$$P(r) = \frac{5}{21} = 0.24$$

so that

$$K = \frac{0.95 - 0.24}{1 - 0.24} = \frac{0.71}{0.76} = 0.93$$

where $K$ represents the Kappa statistic, $P(c)$ represents the accuracy of the ARCA classifier and $P(r)$ represents the accuracy of the random classifier. A common evaluation reference scale of the Kappa statistic is: 0.41-0.60: Low, 0.61-0.80: Good, 0.81-1.00: High.

Compared to previous studies, for example [4], where the authors evaluated their results based on detection error rate or on percentage of agreement only, we use confusion matrices from which all other important metrics, such as percent agreement ($P(c) = 0.95$) and Kappa ($K = 0.93$), can be calculated to thoroughly evaluate the proposed solution.

D. Pareto analysis results

Fig. 4 shows the ranking of diagnosed faults. From top left to right, the first graph shows the total diagnosed faults while the second graph shows the proposed ranking according to basic (frequency) Pareto analysis. Using this method, ARCA proposed to fix TLHO, EADT and ISI in order to achieve about 80% of the cumulative value of fault fixing. The bottom graphs show the ranking where each fault frequency is multiplied by a weighting factor which corresponds to its priority to the business as defined in the last column of Table IV. The severity minimum is 0 and the maximum is 3. Using this approach, we note that ARCA regards CO to be a more severe fault than EADT, consequently, it proposes fixing this issue first according to the bottom Pareto chart. One interpretation of this behavior would be that from the operator's point of view, they would lose more money when cells remain in outage for a longer time than when they are over tuned to increase capacity through excessive downtilting of antennas.

In summary, it can be seen that by only fixing TLHO, CO and ISI from the total number of diagnosed faults, the business can gain about 80% of the total benefit of fixing all the diagnosed faults. From this, we can conclude that the business can maximize ROI by applying Pareto analysis [13] to RCA results.

V. RELATED WORK AND DISCUSSION

Self-healing (SH) studies are scarce and fragmented. In order to unify the fragmented literature and concepts, R. Barco et al. [1] proposed, among other things, a conceptual framework where the main terms and functions of self-healing are clarified and unified. The authors also outlined the current state of the art and concluded by describing the key research challenges still open in the field. In [2], the authors presented an integrated detection and diagnosis framework. They use profiles constructed from historical network data as the foundation of their detection mechanism. Their diagnosis mechanism uses reports. Reports are expert associations of faults to root causes based on analysis of historical network data. Their work is similar to ours as we both aim to automate RCA. We both use the profile concept to learn expected values. On the other hand, our work differs from theirs in many ways, for example, they use averages in profile calculations while we use percentiles to make our calculations in order to prevent outliers from distorting the distribution of actual training data values. As a contrasting example between the two methods, consider the following user to site distances measured in kilometers in a

rural area D = {16, 14, 14, 56}. The average is 25 km while the 50th percentile is 15 km. The two results differ by 10 km, and by visually inspecting the data we can validate the benefit of our approach: at least half the users are indeed less than 15 km from the site, not 25 km. This information can be used, for example, to improve cell capacity and signal quality by downtilting the antenna in question so that the antenna beam covers less distance while continuing to serve the majority of users in the service area. Furthermore, we also use a Chi-squared test to detect significant deviations of observed KPI values from the expected values learned by the detection mechanism. Unlike setting strict threshold values, this makes the system more stable, because it triggers only when there has been a significant deviation from expected values. The significance level used in our research is 5% (α = 0.05). In addition, they use reports while we use a probabilistic classifier based on Bayes' theorem. The NBC makes it possible to build an RCA classifier that can continuously learn and adapt the diagnosis as more evidence emerges in order to increase RCA accuracy. Unlike the reports-based classifier, the NBC can work with missing and uncertain data expressed in terms of probabilities, which makes it more suitable for automated RCA diagnosis, where it is common to have input data missing or incomplete. On the other hand, a major shortcoming of NBC is its naive assumption of independence among the features used to build the classification model. This is typically not true in reality; nevertheless, NBC has yielded very successful results in various applications which require high performance and accuracy with large datasets [3] [6]. One possible explanation for its success could be that, in order to function, the NBC does not need very accurate probabilities, which are in any case very hard to calculate in most real use cases. For NBC-based diagnosis, it is enough that the classifier can find the MAP hypothesis based on the given representative probabilities.

In [4] a self-organizing map (SOM) based root cause analysis tool for LTE networks is proposed. In terms of goals, this work is quite similar to ours. The major differences relate to architecture and implementation; for example, we use NBC while they use SOM for diagnosis. Their solution uses the silhouette index to refine the diagnosis decisions while we use Pareto analysis to do the same.

In [5] the authors proposed a solution for SON conflict diagnosis in heterogeneous networks. In [7] the authors explain the ideas behind the expectation maximization (EM) algorithm, which can be useful for solving classification problems. In [6] the author explored the use of Bayes belief networks in industrial failure mode and effects analysis (BN-FMEA). The author concluded that BN met the requirements of industrial FMEA standards, including the requirement to represent both statistical and causal dependencies of a system and its stakeholders. In [3] a root cause deconvolution (RCD) framework for learning the distribution of root causes of failing devices from production test results was developed using a layout-aware diagnosis solution based on a Bayesian network model.

VI. CONCLUSION

In this paper, we presented a solution for automating root cause analysis of cellular network faults called ARCA. It uses a Chi-squared test to detect significant cell performance deviation from expected values captured in statistical profiles. The solution uses a naive Bayesian classifier to perform automatic root cause diagnosis of detected faults. Diagnosed faults are ranked using Pareto analysis. SON functions are then triggered to compensate for or eliminate the root cause of diagnosed faults. Finally, every root cause corrective action performed is evaluated based on the applicable monitored KPIs.

The main benefits of the proposed solution include:

• Adaptive continuous learning which increases RCA accuracy and expert knowledge reuse.

• Ability to work with missing and uncertain data expressed in terms of probabilities.

• High performance and accuracy even with large datasets.

In lab tests, ARCA showed a diagnosis accuracy of 95.2% and a very good Kappa statistic of 93.0%.

Building on this research, we plan to look into Deep Learning algorithms that can be used to advance the capabilities of emerging network optimization tools in LTE-A and 5G.

ACKNOWLEDGMENT

The authors would like to thank Nokia Networks and the iSON Manager Self-Optimization team for their support during this research.

REFERENCES

[1] R. Barco, P. Lazaro, P. Muñoz, A unified framework for self-healing in wireless networks, IEEE Communications Magazine, vol. 50, no. 12, pp. 134-142, 2012.
[2] P. Szilágyi, S. Nováczki, An automatic detection and diagnosis framework for mobile communication systems, IEEE Transactions on Network and Service Management, vol. 9, no. 2, pp. 184-197, 2012.
[3] B. Benware, C. Schuermyer, M. Sharma, Determining a Failure Root Cause Distribution From a Population of Layout-Aware Scan Diagnosis Results, IEEE Design & Test of Computers, vol. 29, no. 1, pp. 8-18, 2012.
[4] A. Gómez-Andrades, P. Muñoz, I. Serrano, R. Barco, Automatic Root Cause Analysis for LTE Networks Based on Unsupervised Techniques, IEEE Transactions on Vehicular Technology, vol. 65, no. 4, pp. 2369-2386, April 2016.
[5] O. Iacoboaiea, B. Sayrac, S. B. Jemaa, P. Bianchi, SON Conflict Diagnosis in Heterogeneous Networks, PIMRC, pp. 1459-1463, September 2015.
[6] B. H. Lee, Using Bayes Belief Networks in Industrial FMEA Modeling and Analysis, IEEE Proceedings on Annual Reliability and Maintainability Symposium, pp. 7-15, 2001.
[7] C. B. Do, S. Batzoglou, What is the expectation maximization algorithm?, Nature Biotechnology, vol. 26, pp. 897-899, 2008.
[8] S. Hämäläinen, H. Sanneck, C. Sartori, LTE Self-Organising Networks (SON): Network Management Automation for Operational Efficiency. John Wiley & Sons, 2011.
[9] J. Ramiro, K. Hamied, Self-Organizing Networks (SON): Self-Planning, Self-Optimization and Self-Healing for GSM, UMTS and LTE. John Wiley & Sons, October 2011.
[10] 3GPP TR 32.823, Telecommunication management, Self-Organizing Networks (SON), Study on self-healing, v.9.0.0, September 2009.
[11] 3GPP TS 32.521, "Technical Specification, Release 11", v11.1.0, December 2012.
[12] I. Maravić, "LTE Simulator source code", accessed from: https://github.com/i-maravic/LTE-Simulator, November 2016.
[13] R. Koch, The 80/20 Principle: The Secret to Achieving More with Less. Nicholas Brealey Publishing, London, 2001.
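
The average-versus-percentile comparison from the related work discussion (user-to-site distances D = 16, 14, 14, 56 km) can be reproduced with a minimal sketch such as the one below. It is only an illustration using the Python standard library, not the ARCA implementation; the variable names and the helper `share_within` are hypothetical and introduced here for the example.

```python
from statistics import mean, median

# User-to-site distances in km, taken from the example in the text.
distances_km = [16, 14, 14, 56]

# Average-based profile value: pulled upward by the single 56 km outlier.
average_km = mean(distances_km)

# Percentile-based profile value: the 50th percentile (median).
p50_km = median(distances_km)

# Gap between the two candidate "expected" distances.
gap_km = average_km - p50_km


def share_within(limit_km, values):
    """Fraction of users whose distance is at most limit_km (hypothetical helper)."""
    return sum(v <= limit_km for v in values) / len(values)


print(f"average = {average_km} km, 50th percentile = {p50_km} km, gap = {gap_km} km")
print(f"share of users within the percentile value: {share_within(p50_km, distances_km):.0%}")
print(f"share of users within the average value: {share_within(average_km, distances_km):.0%}")
```

With one distant user in the sample, the average describes a distance that three of the four users never reach, whereas the 50th percentile stays close to where most users actually are; this is the property the percentile-based profile relies on.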
