Few-shot Learning and Self Training for eNodeB Log Analysis for Service Level Assurance in LTE Networks

Shogo Aoki, Kohei Shiomoto, Senior Member, IEEE, and Chin Lam Eng, Member, IEEE

Abstract—With the increasing network topology complexity and the continuous evolution of new wireless technology, it is challenging to address network service outages with traditional methods. In long-term evolution (LTE) networks, a large number of base stations called eNodeBs are deployed to cover entire service areas spanning various kinds of geographical regions. Each eNodeB generates a large number of key performance indicators (KPIs). Hundreds of thousands of eNodeBs are typically deployed to cover a nation-wide service area, so operators need to handle hundreds of millions of KPIs. It is impractical to handle such a huge amount of KPI data manually, and automation of data processing is therefore desired. To improve network operation efficiency, a suitable machine learning technique is used to learn and classify individual eNodeBs into different states based on multiple performance metrics during a specific time window. However, supervised learning requires a large labeled dataset, and annotating data takes costly human labor and time. To mitigate the cost and time issues, we propose a method based on few-shot learning that uses the Prototypical Networks algorithm to complement the eNodeB state analysis. Using a dataset from a live LTE network consisting of thousands of eNodeBs, our experimental results show that the proposed technique provides high performance while using a small number of labeled data.

Index Terms—LTE, eNodeB, KPI, service assurance, few-shot learning, Prototypical Network, Self Training

S. Aoki was with the Department of Information and Communication Engineering, Faculty of Knowledge Engineering, Tokyo City University, Tokyo, Japan.
K. Shiomoto is with the Department of Intelligent Systems, Faculty of Information Technology, Tokyo City University. E-mail: shiomoto@tcu.ac.jp
C. L. Eng is with Ericsson Japan, Kanagawa, Japan. E-mail: chin.lam.eng@ericsson.com

I. INTRODUCTION

In the long-term evolution (LTE) networks, the radio communication services to the mobile users are provided by base stations called eNodeBs. A large number of eNodeBs are deployed to cover the entire service areas spanning various kinds of geographical regions. The state of the areas covered by eNodeBs can be diverse; radio channel conditions, user mobility, and traffic load conditions vary from area to area. Each eNodeB generates a large number of key performance indicators (KPIs): radio coverage and channel interference, traffic load in control and data planes, system hardware and software issues, and service management issues. The state of the eNodeB that covers a specific area impacts the service level offered to the customers in that area. It is, therefore, crucial to understand the state of the eNodeB that covers the area. By analyzing the KPIs generated by an eNodeB, we expect to identify the state of the area that the eNodeB covers. Hundreds of thousands of eNodeBs are typically deployed to cover a nation-wide service area, and operators need to handle hundreds of millions of KPIs to cover the areas. It is impractical to handle such a huge amount of KPI data manually, and automation of data processing is therefore desired.

Machine learning has been extensively used in computer science for voice, face, and image recognition. Recently, machine learning has been applied to various network management tasks [31]. The natural question is how we can apply machine learning to the task of analyzing eNodeB KPIs to assure the service level offered to the mobile customer. A typical workflow of applying machine learning comprises (1) the pre-training phase, where the classification is done using unsupervised learning, (2) the data annotation phase, where human operators manually examine the data and annotate it with labels, and (3) the training phase, where the classifier is trained on the labeled data using a supervised machine learning model. In order to build a good classifier using a supervised machine learning model, we need a large number of quality human-annotated data as training data.

Unfortunately, preparing a large number of quality human-annotated data is a costly task. A human operator examines the data, classifies them, and annotates them with an appropriate label. To minimize this labor-intensive and time-consuming dataset annotation task, it is thus required to find a data-efficient learning algorithm to build a classifier model. We should also note that anomalies rarely occur in practice, so the anomaly classes are usually sparse in the dataset. As such, it is extremely important for operators to deal with an unbalanced dataset where some classes have only a handful of data samples while others have many.

Current research areas such as few-shot learning show positive results in reducing the amount of annotated data required during training while maintaining the prediction accuracy [25], [27], [30]. Few-shot learning aims to train the model using a few samples per class. In this paper, in order to address the above issues, we propose to use few-shot learning. Taking advantage of a few-shot learning algorithm, a classifier model is trained with a small number of labeled data. We investigate the few-shot learning that uses Prototypical Network [30]. By using trace data from eNodeBs deployed in production networks, we evaluate the proposed few-shot learning method that leverages Prototypical Network. We confirm that the proposed method yields better performance than the existing methods. We also extend the proposed method to the semi-supervised Prototypical Network to further reduce the number of labeled data.


In our earlier work [2], we proposed an eNodeB performance metric analysis method that leverages Prototypical Network. We extend the earlier work [2] in this paper by providing a detailed performance analysis, discussing the design of embedding functions, and introducing Self Training.

The remainder of the paper is structured as follows. Section II presents related work. Section III describes eNodeB log data. Section IV introduces few-shot learning and Prototypical Network. Section V presents the proposed method. Section VI presents the experiments and results. Section VII concludes the paper.

II. RELATED WORK

A. Applications of Machine Learning to Mobile Network Management

Operators need to offer high quality services while dealing with the complexity of managing and operating networks [4]. Machine learning has been applied to the management of mobile networks [1], [3], [21]. Several works applied machine learning to analyze Minimization of Drive Tests (MDT) measurement data. Zoha et al. [39] proposed a local outlier factor based detector and a support vector machine based detector to analyze MDT data for autonomous cell outage detection. Onireti et al. [22] applied a k-nearest neighbor (k-NN) and a local outlier factor anomaly detector to analyze MDT data to detect control and data plane problems in cells. Chernogorov et al. [7] proposed an MDT data mining method that leverages N-gram analysis and a k-NN anomaly score algorithm to detect radio access channel (RACH) malfunction. Masood et al. [19] proposed a deep learning based method for detecting sleeping cells. MDT reports are collected from the end users and an autoencoder is used to train the normal network model. They demonstrated that the autoencoder obtains higher detection accuracy and precision than a support vector machine.

Several works applied machine learning to analyze measurement data obtained from networks. Chernogorov et al. [6] proposed a Diffusion Map (DM) data mining technique to reduce false detection. They reduced the dimension of high-dimensional radio signal power and quality data obtained from the network and performed anomaly detection with a k-means method. Parwez et al. [23] utilize mobile call detail records (CDR) to analyze anomalous behavior of mobile wireless networks. Deb et al. [9] proposed an automatic policy learning method that uses eNodeB performance metrics for predicting and mitigating network service impairments. Hussain et al. [13], [14] proposed a semi-supervised statistical anomaly detection technique that mines area traffic data to identify sleeping cells, an undetected cell outage case which makes mobile service unavailable for subscribers.

B. Few-shot Learning and Self Training

Existing supervised machine learning algorithms need plenty of training data, while humans can recognize new object classes from very few samples. The goal of few-shot learning is to classify new data having seen only a few training data samples. Few-shot learning [11], [15], [17], [18], [20] is defined as a type of machine learning problem where only a limited number of samples with supervised information for the target are available [35].

Most few-shot learning methods follow a meta-learning approach that aims to gain experience from other similar problems [12], [24], [28]. In the meta-learning framework, we aim to learn how to learn given the task of classification. The meta-learner extracts meta-knowledge across tasks. The learner uses the meta-learner for a new task using task-specific information. In other words, we use a set of classification problems to help solve other unrelated sets. Meta-learning uses the prior knowledge to constrain the hypothesis space so that standard machine learning models can be applied with a small number of labeled samples. Embedding learning [5], [32], [34] falls in this approach.

The embedding function is mainly learned from prior knowledge, and can additionally use task-specific information from the training dataset. Embedding learning embeds each sample into a low-dimensional representation space such that similar samples are close together while dissimilar samples are far apart. In this low-dimensional space, one can then construct a smaller hypothesis space, which subsequently requires fewer training samples. Matching Net [34] uses different embedding functions for training and test data. Prototypical Network [30] uses an embedding function for only training data and calculates a simple prototype per class for test data. A semi-supervised variant of Prototypical Network assigns unlabeled samples to augment the training data via soft-assignment during learning [25].

Self Training [10], [26], [36]–[38] is a classic and simple method of semi-supervised learning, which increases the number of labeled data by repeatedly training a model and using it to create newly labeled data. Yarowsky [36] tackled a self-training problem in the domain of natural language processing: word sense disambiguation. Rosenberg et al. [26] built a self-training model for object detection in image recognition.

C. Positioning of Our Work

The above related work on applications of machine learning to mobile network management can be classified along two dimensions: the analysis method and the data used in the analysis. In terms of analysis methods, k-NN [7], [22], local outlier factor based detectors [39], support vector machines [39], diffusion mapping [6], deep learning [13], [19], semi-supervised learning [14], and k-means [23] have been used. In terms of data used in the analysis, call data records (CDR) [6], [13], [14], [23], high-dimensional radio signal power and quality data obtained from the network [6], Minimization of Drive Tests (MDT) data [7], [19], [22], [39], and eNodeB data [9] have been used.

Few-shot learning methods including Matching Net and Prototypical Network have been applied to the tasks of handwritten character and image recognition [25], [30], [34]. To the best of our knowledge, this paper is the first work on the application of few-shot learning to the task of mobile network management. This paper studies the application of a few-shot learning method that uses Prototypical Network to reduce the number of required labeled data in the task of mobile network management. Moreover, this paper also introduces Self Training, which has been successfully applied to natural language processing [36] and object detection in image recognition [26].


III. ENODEB LOG DATA

A. eNodeB Log

In the long-term evolution (LTE) networks, the radio communication services to the mobile users are provided by base stations called eNodeBs. An eNodeB consists of a Base Band Unit (BBU) and a Remote Radio Head (RRH). The BBU is connected to the core network through the S1 interface and to adjacent eNodeBs through the X2 interface [29]. The BBU converts the digital baseband signals from the RRH to IP packets and sends them to the core network. Conversely, the BBU converts the IP packets from the core network to digital baseband signals and sends them to the RRH. The BBU also performs signal processing, call processing, monitoring, and eNodeB equipment management. The RRH is a device that transmits and receives wireless signals to/from User Equipment (UE). The RRH converts the digital baseband signals from the BBU into Radio Frequency (RF) signals and transmits them to the UE. The RRH amplifies the RF signals received from the UE, converts them into digital baseband signals, and transmits them to the BBU.

The state of the eNodeB that covers a specific area impacts the service level offered to the customers in that area. It is, therefore, crucial to understand the state of the eNodeB that covers the area. Each eNodeB generates a large number of key performance indicators (KPIs): the number of users, connection attempts, data/signaling packet rates, resource occupancy, radio/channel quality, etc. By analyzing the performance metrics generated by an eNodeB, we expect to identify the state of the area that the eNodeB covers. Hundreds of thousands of eNodeBs are typically deployed to cover a nation-wide service area, and operators need to handle hundreds of millions of KPIs to cover the areas. It is impractical to handle such a huge amount of KPI data manually, and automation of data processing is therefore desired.

Fig. 1. Example of eNodeB KPI record data.

TABLE I
LABEL ASSOCIATED WITH ENODEB STATUS.

Label      Status
label_00   High traffic condition
label_01   Weak radio wave coverage
label_02   High transmission of uplink and downlink data
label_03   High transmission of downlink data
label_04   High transmission of uplink data
label_05   Normal operating condition
label_06   Uplink control channel interference
label_07   Uplink traffic channel interference
label_08   Both control and traffic channel interference
label_09   Uplink control channel radio quality
label_10   System hardware processor load
label_11   Service setup issue
label_12   Control channel signaling

B. Dataset and Issues

Each eNodeB reports 33 KPIs including the number of users, connection attempts, data/signaling packet rates, resource occupancy, and radio/channel quality. Since each eNodeB generates those 33 KPIs every hour, a record of each eNodeB per day consists of 792 KPIs. Fig. 1 shows an example of a KPI data record generated by an eNodeB per day (each KPI normalized to the range zero to one).
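For concreteness, the sketch below (ours, not from the paper's code) shows how one day of hourly measurements could be normalized and flattened into the 792-dimensional record described above; the array shapes follow the paper, while the helper name and per-KPI min-max normalization are assumptions.

```python
# Illustrative sketch: one day of eNodeB KPIs as a 24-hour x 33-KPI matrix,
# flattened to the 792-dimensional record the classifiers consume.
import numpy as np

HOURS, NUM_KPIS = 24, 33  # 24 hourly reports of 33 KPIs = 792 values


def to_daily_record(hourly_kpis: np.ndarray) -> np.ndarray:
    """Normalize each KPI to [0, 1] over the day and flatten to 792 values."""
    assert hourly_kpis.shape == (HOURS, NUM_KPIS)
    lo = hourly_kpis.min(axis=0)
    hi = hourly_kpis.max(axis=0)
    normalized = (hourly_kpis - lo) / np.maximum(hi - lo, 1e-9)
    return normalized.reshape(-1)  # shape (792,)


day = np.random.rand(HOURS, NUM_KPIS)  # stand-in for real measurements
record = to_daily_record(day)
print(record.shape)  # (792,)
```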
The state of the eNodeB is categorized into 13 labels as listed in Table I. The list of eNodeB statuses includes the normal operating condition (label_05), traffic-related conditions (labels 00, 02, 03, 04), radio-related conditions (labels 01, 06, 07, 08, 09), and service, system, and control related conditions (labels 10, 11, 12). We collected 5,052 KPI data samples from eNodeBs deployed in production networks. The number of labeled data for each eNodeB status is shown in Fig. 2. We notice that label_10 (System hardware processor load) has only 19 samples whereas the other labels have from 119 to 696 samples. The normal operating condition (label_05) has 583 samples, whereas high transmission of uplink and downlink data (label_02), weak radio wave coverage (label_01), high traffic condition (label_00), high transmission of uplink data (label_04), and service setup issue (label_11) are the top-5 anomaly classes in terms of the number of samples in the dataset. Anomalies rarely occur in practice, so the anomaly classes (e.g., label_10) are usually sparse in the dataset, as shown in Fig. 2. The operators need to deal with an unbalanced dataset where a few classes have only a handful of data samples while others have many.

IV. FEW-SHOT LEARNING

A. Few-shot Learning

Supervised learning models require a large amount of labeled data, a dataset of high quality and quantity of human-annotated data, for training. It takes costly human labor and time to annotate data for the set of labeled data. We should also note that anomalies rarely occur in practice, so the anomaly classes are usually sparse in the dataset. As such, it is extremely important for the operators to deal with an unbalanced dataset.

Current research areas such as few-shot learning show positive results in reducing the amount of annotated data required during training while maintaining the prediction accuracy [25], [27], [30]. Few-shot learning aims to train the model using a few samples per class.


Fig. 2. The number of labeled data for each eNodeB status.

Fig. 3. Left: Embedding function. Right: Prototype vector calculation using the training set.

Fig. 4. Left: Embedding function (same as Fig. 3). Right: Loss calculation according to distance using the test set.

Few-shot learning is a task that requires adapting the classifier to accommodate new classes not seen in training, given only a few examples. The purpose is to train a model with high accuracy when the data of the target task is small, by using prior knowledge. Taking advantage of the few-shot learning algorithm, a classifier model can be trained with a minimum number of labeled data. Thus we propose an eNodeB performance metric analysis method that leverages the few-shot learning algorithm called Prototypical Network [30].

B. Prototypical Network

Prototypical Network [30] is a few-shot learning algorithm that uses neural networks (NNs). Prototypical Network is based on the idea that the data for each class are distributed near their average. It uses the training set to extract a prototype vector for each class as shown in Fig. 3, and then classifies the inputs in the test set based on their distance to the prototype of each class as shown in Fig. 4. Prototypical Network learns an embedding function $h_\phi(x)$, parameterized by $\phi$ in a NN, that maps examples into a space where examples from the same class are close and those from different classes are far.

Prototypical Network uses episodic training [24] to split the data into a training set $S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{A \times N}, y_{A \times N})\}$ and a test set $Q = \{(\mathbf{x}^*_1, y^*_1), \ldots, (\mathbf{x}^*_{T \times N}, y^*_{T \times N})\}$, where $A$ denotes the number of training set observations of each class, $T$ denotes the number of test set observations of each class, and $N$ denotes the number of classes. They are used for computing prototype vectors and losses as explained below.

To compute the prototype $\mathbf{p}_c$ of each class $c$, a per-class average of the embedded examples is performed:

$$\mathbf{p}_c = \frac{\sum_i h_\phi(\mathbf{x}_i)\, z_{i,c}}{\sum_i z_{i,c}}, \qquad (1)$$

where $z_i$ is a one-hot vector. These prototypes are used as a predictor for the class of any test set example $\mathbf{x}^*$, which assigns a probability over any class $c$ through a softmax function based on the distances between $\mathbf{x}^*$ and each prototype, as follows:

$$p(c \mid \mathbf{x}^*, \{\mathbf{p}_c\}) = \frac{\exp\left(-\lVert h_\phi(\mathbf{x}^*) - \mathbf{p}_c \rVert_2^2\right)}{\sum_{c'} \exp\left(-\lVert h_\phi(\mathbf{x}^*) - \mathbf{p}_{c'} \rVert_2^2\right)} \qquad (2)$$

The loss function used to update Prototypical Network is then simply the average negative log-probability of the correct class assignments, for all test set examples:

$$\frac{1}{T \times N} \sum_i -\log p(y^*_i \mid \mathbf{x}^*_i, \{\mathbf{p}_c\}) \qquad (3)$$

Training proceeds by minimizing the average loss, iterating over training episodes and performing a gradient descent update for each.
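Eqs. (1)–(3) reduce to a few array operations. The following NumPy sketch (ours; the authors' implementation uses Keras/TensorFlow, see Section VI-A) illustrates the prototype computation, the distance-based softmax, and the episode loss, where `embedded` holds the $h_\phi$ outputs of the training (support) set and `one_hot` its labels.

```python
import numpy as np


def prototypes(embedded, one_hot):
    # Eq. (1): p_c = sum_i h(x_i) z_{i,c} / sum_i z_{i,c}
    return (one_hot.T @ embedded) / one_hot.T.sum(axis=1, keepdims=True)


def class_probabilities(embedded_queries, protos):
    # Eq. (2): softmax over negative squared Euclidean distances
    d2 = ((embedded_queries[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)


def episode_loss(probs, query_labels):
    # Eq. (3): mean negative log-probability of the correct classes
    return -np.log(probs[np.arange(len(query_labels)), query_labels]).mean()
```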
1) Embedding Function: The embedding function maps a discrete object, the eNodeB KPIs, to a vector of real numbers so that similar objects are close. This design is extremely important. We investigate which neural network is suitable for classifying the status of eNodeBs in Prototypical Network in Section VI-E. We focus on three types of neural networks: multilayer perceptron (MLP) [13], two-dimensional convolutional neural networks (2D-CNN) [16], and one-dimensional convolutional neural networks (1D-CNN).

2) Semi-supervised Prototypical Network: While the original Prototypical Network is supervised, it can be extended to semi-supervised learning [25]. A semi-supervised learning algorithm draws lines separating classes using a small number of labeled data and modifies those lines using a large number of unlabeled data during training. Unlabeled data does not require costly human annotation tasks. We hope to improve the performance of the classifier with the support of unlabeled data in the training dataset. This is a fundamental motivation for semi-supervised learning.

Fig. 5 illustrates a comparative example of semi-supervised learning and supervised learning when there is a small amount of labeled data. We are trying to solve a two-class classification task, where each data sample is represented by a set of two features.


We have a set of labeled data in the training dataset as shown in Fig. 5. Since the data associated with labels are scattered, we may want to draw the straight line (in red) shown in Fig. 5. Then, since we have a large amount of unlabeled data in the training dataset, we can instead draw a curve (in green) that nicely separates the entire training dataset into two clusters, as shown in Fig. 5. Note that semi-supervised learning does not require a large amount of labeled data samples, which would require costly human labor.

Fig. 5. Description of the behavior of the semi-supervised learning approaches.

Prototypical Network is naturally suited to the concept of semi-supervised learning. Semi-supervised Prototypical Network uses soft k-means to refine the prototypes using unlabeled data. The whole training procedure is the same as for the supervised Prototypical Network except that the prototype $\tilde{\mathbf{p}}_c$ is used instead of $\mathbf{p}_c$, as follows:

$$\tilde{\mathbf{p}}_c = \frac{\sum_i h_\phi(\mathbf{x}_i)\, z_{i,c} + \sum_j h_\phi(\tilde{\mathbf{x}}_j)\, \tilde{z}_{j,c}}{\sum_i z_{i,c} + \sum_j \tilde{z}_{j,c}}, \qquad (4)$$

where

$$\tilde{z}_{j,c} = \frac{\exp\left(-\lVert h_\phi(\tilde{\mathbf{x}}_j) - \mathbf{p}_c \rVert_2^2\right)}{\sum_{c'} \exp\left(-\lVert h_\phi(\tilde{\mathbf{x}}_j) - \mathbf{p}_{c'} \rVert_2^2\right)}$$

$\tilde{z}_{j,c}$ indicates a partial assignment of the unlabeled data to each class and refines the prototype vectors accordingly.
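As an illustration, the refinement of Eq. (4) can be sketched in NumPy under the same assumptions as the sketch in Section IV-B (ours, not the authors' code):

```python
import numpy as np


def refined_prototypes(emb_labeled, one_hot, emb_unlabeled, protos):
    # z~_{j,c}: soft assignment of each unlabeled embedding to the prototypes
    d2 = ((emb_unlabeled[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 - (-d2).max(axis=1, keepdims=True)  # stable softmax
    z_tilde = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Eq. (4): mix labeled one-hot counts with the soft unlabeled counts
    numer = one_hot.T @ emb_labeled + z_tilde.T @ emb_unlabeled
    denom = one_hot.sum(axis=0) + z_tilde.sum(axis=0)
    return numer / denom[:, None]
```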
V. PROPOSED METHOD

A. Classifier of eNodeB Log Data

A large number of eNodeBs are deployed to cover the entire service areas spanning various kinds of geographical regions. Each eNodeB generates a large number of KPIs. Fig. 6 shows the proposed service quality management in the LTE network. The operators leverage machine learning to build a model that classifies the eNodeB KPIs into the states of the areas that the eNodeBs cover. By analyzing the KPIs generated by the eNodeBs, the operators identify the state of each area. Once they find that an area has some problems to be solved, the operators address the problems. In order to build a good classifier, supervised learning models require a large amount of labeled data, a dataset of high quality and quantity of human-annotated data, for training. It takes costly human labor and time to annotate data for the set of labeled data. In addition, anomalies rarely occur in practice, so the anomaly classes are usually sparse in the dataset. As such, it is extremely important for the operators to deal with an unbalanced dataset where a few classes have only a handful of data samples while others have many.

Fig. 6. Inference Mode of the Proposed Management Method.

B. Prototypical Network

In order to address the above issues in applying machine learning algorithms to the eNodeB performance metric analysis, we propose to use Prototypical Network [30]. In the training phase, we repeatedly train the classifier with a handful of labeled data in episodic training. To further reduce the number of labeled data, we can optionally take advantage of the semi-supervised version of Prototypical Network, which trains a classifier with a small amount of labeled data and re-trains the classifier with a number of unlabeled data to reflect the structure of the dataset distribution.

Prototypical Network uses an embedding function to map a discrete object to a vector of real numbers so that similar objects are close. The design of the embedding function is extremely important. We use an output dimension of the embedding function of M = 30. We evaluated three types of neural network architecture: multilayer perceptron (MLP), two-dimensional convolutional neural networks (2D-CNN), and one-dimensional convolutional neural networks (1D-CNN).

MLP is one of the most famous machine learning algorithms for data classification. An MLP is composed of an input layer, hidden layers, and an output layer with many connected nodes. The MLP architecture we used in our experiments is shown in Fig. 7. The first stage consists of a fully-connected Affine layer that receives 792 inputs and has 512 neuron units, and a rectified linear unit (ReLU) activation function followed by a drop-out layer that drops connections with a drop rate of 0.75. The second stage consists of a fully-connected Affine layer that receives 512 inputs and has 256 neuron units, and a ReLU activation function followed by a drop-out layer with a drop rate of 0.7. The third stage consists of a fully-connected Affine layer that receives 256 inputs and has 128 neuron units, and a ReLU activation function followed by a drop-out layer with a drop rate of 0.5. The fourth stage consists of a fully-connected Affine layer that receives 128 inputs and has 30 neuron units.

Fig. 7. Parameters of MLP.

A convolutional neural network (CNN) is a neural network algorithm frequently used for image recognition. It consists of an input and an output layer, and multiple hidden layers. The hidden layers consist of a series of convolutional layers, activation functions, pooling layers, fully connected layers, and normalization layers. The main idea of 2D-CNN is to capture partial features of two-dimensional data by convolution computation. It performs convolution of the input data with two-dimensional filters, which are randomly initialized, and generates feature maps. The vertical and horizontal dimensions represent KPIs and hours in our application. 1D-CNN has input data and filters for convolution in one dimension. The architectures and parameters for 2D-CNN and 1D-CNN are shown in Fig. 8 and Fig. 9, respectively. The parameters used in the 2D-CNN are explained in what follows. The first stage consists of a convolution layer that receives the 792 inputs and has 32 2×2 filters, a rectified linear unit (ReLU) activation function, and a 2×2 Max-Pool layer followed by a drop-out layer that drops connections with a drop rate of 0.75. The second stage consists of a convolution layer with 64 2×2 filters, a ReLU activation function, and a 2×2 Max-Pool layer followed by a drop-out layer with a drop rate of 0.70. The third stage consists of a convolution layer with 128 2×2 filters, a ReLU activation function, and a 2×2 Max-Pool layer followed by a drop-out layer with a drop rate of 0.5. The fourth stage consists of a fully-connected Affine layer that receives 128 inputs and has 64 neuron units followed by a ReLU activation function. The fifth stage consists of a fully-connected Affine layer that has 30 neuron units.

Fig. 8. Parameters of 2D-CNN.

Fig. 9. Parameters of 1D-CNN.
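The exact 1D-CNN hyperparameters appear only in Fig. 9 and are not reproduced in the text, so the Keras sketch below mirrors the 2D-CNN staging described above while convolving along the time axis only; treat the filter counts and kernel sizes as illustrative assumptions rather than the authors' settings.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_1dcnn_embedding(hours=24, num_kpis=33, output_dim=30):
    # Convolve along the time axis only; the 33 KPIs act as input channels.
    return keras.Sequential([
        layers.Conv1D(32, 2, activation="relu", padding="same",
                      input_shape=(hours, num_kpis)),
        layers.MaxPooling1D(2),
        layers.Dropout(0.75),
        layers.Conv1D(64, 2, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.Dropout(0.70),
        layers.Conv1D(128, 2, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(output_dim),  # 30-dimensional embedding
    ])
```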

C. Self Training

We also leverage another semi-supervised learning method called Self Training [10], [36] to augment the number of labeled data. Self Training is a classic and simple method of semi-supervised learning, which increases the number of labeled data by repeatedly training a model and using it to create newly labeled data. First, Self Training performs supervised learning with a small amount of labeled data. Then it infers the class of unlabeled data using the trained model, a feed-forward neural network that outputs a set of probabilities that the input data belongs to the individual classes. This probability is called the confidence. If the highest confidence value among all classes is higher than the threshold p_thresh, the inference is assumed to be correct, and the input unlabeled data is labeled with that class. Fig. 10 illustrates the idea of Self Training. Suppose that the threshold p_thresh is equal to 0.9. On the left-hand side of Fig. 10, since the confidence value of class D exceeds the threshold, labeling is performed as a result of Self Training. Conversely, on the right-hand side, the class with the highest confidence value has no more than 0.3 and the one with the second-highest 0.25; as such, we could only perform an ambiguous classification. As a result, the data remains unlabeled.

Fig. 10. Explanation of confidence in Self Training.

Once we obtain a sufficient number K of new labeled data, that is, data converted from unlabeled to labeled, we re-train the classifier with the new set of labeled data. We repeat the re-training of Self Training up to L times. This method has the challenge of how to set the threshold p_thresh. If the threshold p_thresh is too high, no data will be added. If the threshold p_thresh is too low, we may wrongly label unlabeled data.
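The loop can be sketched schematically as follows; `train` and `predict_proba` are placeholders for the classifier described above, not functions from the paper's code:

```python
import numpy as np


def self_training(x_lab, y_lab, x_unlab, train, predict_proba,
                  p_thresh=0.999, K=100, L=5):
    for _ in range(L):                               # up to L re-trainings
        model = train(x_lab, y_lab)
        if len(x_unlab) == 0:
            break
        probs = predict_proba(model, x_unlab)        # shape (U, n_classes)
        conf = probs.max(axis=1)
        order = np.argsort(-conf)                    # most confident first
        picked = order[conf[order] >= p_thresh][:K]  # at most K new labels
        if picked.size == 0:                         # threshold too high
            break
        x_lab = np.concatenate([x_lab, x_unlab[picked]])
        y_lab = np.concatenate([y_lab, probs[picked].argmax(axis=1)])
        x_unlab = np.delete(x_unlab, picked, axis=0)
    return train(x_lab, y_lab)
```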


VI. EVALUATION

A. Neural Network Implementation

We developed a Python program that implements the Prototypical Network using Keras/TensorFlow [8]. As previously mentioned, Prototypical Network uses episodic training [24] that selects A samples for the training set S and T samples for the test set Q for each class from the training dataset. We use (A, T) = (3, 3) in the following evaluation. We iterate over m episodes in training the proposed Prototypical Network by randomly picking those samples until we confirm its convergence (m = 20,000 for MLP and 1D-CNN, and m = 3,500 for 2D-CNN). We compute the average of the results of five experiments.
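For illustration, one episode with (A, T) = (3, 3) can be sampled as in the following sketch; the helper names are ours, not from the released program:

```python
import numpy as np


def sample_episode(x, y, classes, A=3, T=3, rng=None):
    rng = rng or np.random.default_rng()
    xs, ys, xq, yq = [], [], [], []
    for c in classes:
        idx = rng.permutation(np.where(y == c)[0])[: A + T]
        xs.append(x[idx[:A]]); ys += [c] * A   # training (prototype) set S
        xq.append(x[idx[A:]]); yq += [c] * T   # test (loss) set Q
    return (np.concatenate(xs), np.array(ys),
            np.concatenate(xq), np.array(yq))
```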
B. Dataset

By using trace data from eNodeBs deployed in production networks, we evaluate the proposed method. We collected 5,052 KPI data samples from eNodeBs deployed in production networks. Each eNodeB generates 33 KPIs every hour, so a record of each eNodeB per day consists of 792 KPIs, as shown in Fig. 1. The state of the eNodeB is categorized into 13 labels as listed in Table I. The number of labeled data for each eNodeB status is shown in Fig. 2.

In order to evaluate the supervised and semi-supervised learning methods, we prepared the training dataset using labeled and unlabeled data as follows. We pick at most G samples per class from the training dataset C_train to make the labeled training dataset C'_train. Note that the number of labeled data per class in C'_train can be less than G, since some classes have a small number of data samples in the original dataset. We combine the unlabeled data with the labeled dataset C'_train to make the entire set of training data C_train (see Fig. 11).

Fig. 11. Evaluation Data Flow.
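A sketch of this preparation step (variable names are illustrative; the prime on C'_train marks the labeled subset):

```python
import numpy as np


def split_labeled(x_train, y_train, G, rng=None):
    rng = rng or np.random.default_rng(0)
    labeled_idx = []
    for c in np.unique(y_train):
        idx = np.where(y_train == c)[0]
        labeled_idx.extend(rng.permutation(idx)[:G])  # classes may have < G
    mask = np.zeros(len(y_train), dtype=bool)
    mask[np.array(labeled_idx)] = True
    return (x_train[mask], y_train[mask],  # labeled C'_train
            x_train[~mask])                # unlabeled remainder
```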

C. Performance Measure

To compute the performance measures of the classifier we use the following four counts: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Precision is computed as the ratio of the data that actually belongs to a class to all the data predicted as that class:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (5)$$

Recall is computed as the ratio of the data correctly predicted as a class to all the data that belongs to that class:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (6)$$

F1-Score is computed as the harmonic mean of precision and recall:

$$F1\text{-}Score = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (7)$$
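These per-class metrics can be computed directly from the counts, as in the following sketch (ours):

```python
import numpy as np


def per_class_scores(y_true, y_pred, n_classes):
    # Eqs. (5)-(7) computed per class from TP/FP/FN counts
    scores = {}
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores
```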
D. Effect of Prototypical Networks

The aim of this experiment is to investigate the minimum number of labeled data required to build a few-shot learning classifier that leverages Prototypical Network. Prototypical Network uses an embedding function to map a discrete object to a vector of real numbers so that similar objects are close. We use an output dimension of the embedding function of M = 30. We use the multilayer perceptron (MLP) [13] as the embedding function. We tested various parameters of the MLP and found that the MLP architecture shown in Fig. 7 produces the best performance as far as we tested.

In order to investigate the effect of Prototypical Network, we also evaluate the performance of normal MLP-based classifiers as the baseline method. We use the same neural network architecture for the baseline method as for the embedding function of Prototypical Network.

Fig. 12 shows the F1-Scores that both the few-shot learning method and the baseline method yield. The vertical bars represent the confidence interval with a confidence level of 95% (the same applies to the rest of the paper). We observe that the proposed few-shot learning method (labeled "proto_mlp" in Fig. 12) outperforms the baseline method (labeled "base_mlp"). The difference grows when the number of labeled data per class G is small; in particular, when G ≤ 10, the proposed method significantly outperforms the baseline method. As G increases, the proposed method continues to outperform the baseline method even though the difference becomes small. As such, we confirm the effect of the proposed few-shot learning method; it significantly outperforms the baseline method when we have only a handful of labeled data per class.

Fig. 12. Comparison between Prototypical Network and baseline (MLP).

Fig. 13 shows the confusion matrix of the baseline method that uses MLP with G = 5.


From Fig. 13, we observe that the baseline method classifies labels poorly; the Recall of label_02, label_03, label_04, label_05, label_07, label_11, and label_12 is less than 0.3. Only three classes (label_01, label_08, label_09) achieve a Recall larger than 0.5.

Fig. 13. Confusion matrix of baseline MLP (G = 5).

Fig. 14 shows the confusion matrix of the Prototypical Network that uses MLP as the embedding function with G = 5. From Fig. 14, we observe that the proposed few-shot learning method classifies labels far better than the baseline method; the lowest class (label_03) achieves a Recall of 0.49, and label_01, label_07, label_09, and label_10 achieve a Recall higher than 0.8. As such, we confirm the effect of the proposed few-shot learning method in the case where we have only a handful of labeled data per class (e.g., G = 5).

Fig. 14. Confusion matrix of the Prototypical Network that uses MLP as the embedding function with G = 5.

E. Design of Embedding Function

Prototypical Network uses an embedding function implemented as a neural network to map a discrete object to a vector of real numbers so that similar objects are close. The design of the Prototypical Network, i.e., the structure that the neural network should have, is extremely important. One important parameter is the architecture of the neural network and another is the output dimension M of the embedding function. We evaluate the architecture of the neural network used as the embedding function and the output dimension M of the embedding function of Prototypical Network.

1) Neural Network Architecture: Prototypical Network uses an embedding function to map a discrete object to a vector of real numbers so that similar objects are close. This design is extremely important. First, we investigate which neural network is suitable for the embedding function of Prototypical Network. We focus on three types of neural networks: multilayer perceptron (MLP) [13], two-dimensional convolutional neural networks (2D-CNN) [16], and one-dimensional convolutional neural networks (1D-CNN). We tested various parameters of 2D-CNN and 1D-CNN and found that the architectures and parameters for 2D-CNN and 1D-CNN shown in Fig. 8 and Fig. 9 produce the best performance as far as we tested.

We evaluate the F1-score of the Prototypical Network with embedding functions of different NN architectures whose output dimension is M = 30. Fig. 15 shows the results. We notice that the Prototypical Network that uses 1D-CNN outperforms the Prototypical Networks that use MLP and 2D-CNN. 1D-CNN captures correlation only along the horizontal axis (correlation in time for each KPI), while 2D-CNN captures correlation along both the vertical axis (i.e., correlation between adjacent KPIs) and the horizontal axis (i.e., correlation in time). The correlation between adjacent KPIs depends on the order in which the KPIs are arranged in the columns. We use an arbitrary order of the KPIs in the columns; as such, there is not necessarily correlation between adjacent KPIs in a column. On the other hand, the correlation in a row represents the time correlation of the same KPI. We consider that this is a reason why the Prototypical Network that uses 1D-CNN outperforms the Prototypical Network that uses 2D-CNN. We also observe that the performance of the Prototypical Network that uses MLP degrades more quickly than the Prototypical Networks that use 2D-CNN and 1D-CNN as the number of labeled data per class G decreases, in particular, when the number of labeled data for each class G is less than 50. As such, we conclude that the Prototypical Network that leverages 1D-CNN as an embedding function yields better performance than those leveraging 2D-CNN and MLP.

Fig. 15. NN architecture comparison with M = 30.


Fig. 16 shows the F1-Scores that both the few-shot learning methods and the baseline methods yield. Compared with the baseline methods, the proposed few-shot learning methods yield better performance. We observe a trend that the Prototypical Network yields, on average, a slightly higher F1-score than the baseline method when the number of labeled data G is small.

Fig. 16. Comparison between Prototypical Network and baseline (1D-CNN).

In order to investigate the mechanism by which the Prototypical Network successfully classifies data, we visualize how well the embedding function maps a discrete object, the eNodeB KPIs, to a vector of real numbers so that similar objects are close. For visualization of the multidimensional embedding points, we employed T-distributed Stochastic Neighbor Embedding (t-SNE) [33], which can embed high-dimensional data into low-dimensional data for visualization. t-SNE is a nonlinear dimensionality reduction technique well suited for embedding high-dimensional data for visualization in a low-dimensional space. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. Figs. 17, 18, and 19 show how the data points of each class are distributed when we apply the Prototypical Network that uses 1D-CNN as an embedding function with G = 2, 5, and 50. The boundaries between classes are obscure when the number of labeled data is small, e.g., G = 2 (see Fig. 17). We notice that, as the number of labeled data increases, the boundaries between classes become conspicuous (see Figs. 18 and 19).

Fig. 17. Visualization by t-SNE with G = 2.

Fig. 18. Visualization by t-SNE with G = 5.

Fig. 19. Visualization by t-SNE with G = 50.
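The visualization step can be sketched as follows, assuming a trained Keras embedding model as in Section V-B; the plotting details are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_embeddings(embedding_model, x, y):
    z = embedding_model.predict(x)              # (n, 30) learned embeddings
    z2 = TSNE(n_components=2).fit_transform(z)  # (n, 2) points for plotting
    plt.scatter(z2[:, 0], z2[:, 1], c=y, cmap="tab20", s=8)
    plt.show()
```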
2) Dimension of Embedding Function: We evaluate the output dimension M of the embedding function of the Prototypical Network that uses 1D-CNN as the embedding function. We used output dimensions M = 1, 3, 5, 30, and 50. Fig. 20 shows the F1-score as a function of the output dimension of the embedding function M. We observe that the F1-score is low with M = 1 compared with M = 3, 5, 30, and 50 for all G. We also observe that there is little difference in F1-score among M = 3, 5, 30, and 50. If G ≥ 10, the F1-score with M = 3, 5, 30, and 50 is higher than 0.8. As such, we conclude that M should be at least 3 and that there is little difference in F1-score for M ≥ 3.

Fig. 20. F1-score as a function of the output dimension of the embedding function.

In order to further investigate how many labeled data per class (G) are needed, we evaluate how the data points of each class are distributed with Prototypical Network using different dimensions (M) of the embedding function. Fig. 21 shows how the data points of each class are distributed with the Prototypical Network that uses 1D-CNN as an embedding function with G = 2, 5, 10 (M = 3). When the number of labeled data is small (G = 2), the boundaries between classes are obscure. On the other hand, when the number of labeled data is large (G = 10), they become conspicuous.

Fig. 21. Visualization of class classification using t-SNE (M = 3).

Next, we investigate the impact of the dimension M on how the data points of each class are distributed when the number of data per class is extremely small, G = 2.


Fig. 22 shows how the data points of each class are distributed with the Prototypical Network that uses 1D-CNN as an embedding function with G = 2. Compared with Fig. 19, Fig. 22 does not show conspicuous boundaries between classes when the number of data per class is extremely small, G = 2, with embedding functions of any of the dimensions M = 5, 10, and 50. As such, accurate classification is not possible for G = 2 with the embedding functions of any dimension.

Fig. 22. Visualization of class classification using t-SNE for each dimension (G = 2).

F. Semi-supervised Prototypical Network

In order to further reduce the required number of labeled data, we evaluated the performance of the semi-supervised Prototypical Network. We used the same number of unlabeled data in each class for semi-supervised learning. We investigate the impact of the number of labeled data G on the performance of both the Prototypical Network and the semi-supervised Prototypical Network. We use the Prototypical Network that uses 1D-CNN as the embedding function. We iterate over m episodes in training the proposed Prototypical Network by randomly picking those samples until we confirm its convergence (m = 20,000 for supervised learning as pre-training, and m = 15,000 for semi-supervised learning). We use G = 2, 3, 4, and 5 while keeping the total number of labeled and unlabeled data at nine. Recall that Prototypical Network uses episodic training [24] that selects A samples for the training set S and T samples for the test set Q for each class from the training dataset. We used one or three data samples per class for computing prototype vectors and losses in order to compare the performance of the Prototypical Network and that of the semi-supervised Prototypical Network; namely, we use (A, T) = (1,1) and (3,3).

The semi-supervised Prototypical Network yields, on average, a slightly higher F1-score than the supervised Prototypical Network for both (A, T) = (1, 1) and (3, 3). By leveraging the unlabeled data, the semi-supervised Prototypical Network improves on the supervised version. When we compare (A, T) = (3, 3) and (A, T) = (1, 1), the former achieves, on average, a higher F1-score than the latter for both the supervised and semi-supervised Prototypical Networks. In few-shot learning, the number of data used in the episode affects the performance.

Fig. 23. Performance of semi-supervised Prototypical Network.
Next, we investigate the effect of the dimension of the embedding function M on the performance of the semi-supervised Prototypical Network. We use G = 2, 3, 4, and 5 while keeping the total number of labeled and unlabeled data at nine. We evaluate the F1-score with (A, T) = (3,3) and embedding functions with different dimensions M = 1, 3, 5, 30, and 50. Fig. 24 shows the results. We observe that the F1-score is low with M = 1 compared with M = 3, 5, 30, and 50 for all G = 2, 3, 4, 5. We observe that there is little difference in F1-score among M = 3, 5, 30, and 50. We conclude that M should be at least 3 and that there is little difference in F1-score for M ≥ 3.

Fig. 24. F1-score comparison for each dimension with the semi-supervised Prototypical Network.

G. Effect of Self Training

Self Training [10], [36] is a classic and simple method of semi-supervised learning, which increases the number of labeled data by repeatedly training a model and using it to create newly labeled data. If the highest confidence level among all classes is higher than the confidence threshold p_thresh, the inference is assumed to be correct, and the unlabeled data is labeled with that label and added to the training dataset. Once a number of new labeled data are obtained, we re-train the classifier with the new set of labeled data. The caveat of Self Training is that the accuracy could be degraded if unlabeled data wrongly labeled in the inference phase is added to the new training dataset.

This could happen when the classifier is not trained sufficiently well to yield accurate results with high confidence due to the scarcity of the labeled dataset for initial training. As demonstrated earlier, Prototypical Network classifies data accurately even with a small number of labeled data. As such, we believe that Prototypical Network compensates for this drawback of Self Training.

We evaluated the performance of a combination of Self Training and the semi-supervised Prototypical Network. The semi-supervised Prototypical Network prototype set S and loss set Q were set such that (A, T) = (1,1). We use G = 2, 3, 4, 5, 10, 50, and 100 while keeping the total number of labeled and unlabeled data at nine. We conducted two experiments with different numbers of additional data in Self Training, K = 10 and 100. The number of re-training iterations of Self Training L is set to 5 for both experiments. First we train the classifier using the semi-supervised Prototypical Network. Then we train the classifier using Self Training with the confidence threshold p_thresh = 99.9%. The performance of the classifier is shown in Fig. 25. In Fig. 25, we assume that Self Training with p_thresh = 100% achieves the same performance as the semi-supervised Prototypical Network. We observe that Self Training improves the average performance of the semi-supervised Prototypical Network for G = 2 and 3 when we increase the additional data in Self Training K from 10 to 100. We observe that the improvement is marginal when we apply Self Training with K = 10 for G = 2. We conjecture that the number of newly labeled data is not enough to improve the performance of the classifier with the small number K = 10. We calculated the number of data that are labeled by Self Training. The number of newly labeled data is shown in Table II. Self Training with K = 10 newly labels 50 data samples (average of four simulation runs) including 21.25 wrongly labeled data samples when G = 2. Self Training with K = 100 newly labels 396.75 data samples (average of four simulation runs) including 159 wrongly labeled data samples when G = 2. As such, Self Training with K = 100 successfully increases the additional data and improves the performance. To achieve a practical F1-score, we would use higher numbers of pre-labeled data (G = 10, 50, 100). We notice that Self Training offers diminishing returns with respect to gained accuracy when G = 10, 50, and 100. To obtain practical performance with a smaller number of pre-labeled data (G ≤ 10) with Self Training, we need to investigate optimization of data selection and re-training. We raise this issue as future work.

Fig. 25. Effect of Self Training and labeling accuracy of Prototypical Networks on results.

TABLE II
NUMBER OF NEWLY LABELED DATA.

          K = 10              K = 100
G       Total   Wrong      Total    Wrong
2       50      21.25      396.75   159
3       50      14.5       402      94.25
4       50      12         388.5    94.75
5       50      12         401.5    90.25
10      42.2    5.4        413.5    54.30
50      41.1    2.2        385.6    15.53
100     33.5    0.5        346.5    9.61

We investigate the effect of the confidence threshold p_thresh. We expect that Self Training correctly labels the data when we use a higher confidence threshold p_thresh. We use the different confidence thresholds p_thresh = 99.9% and 80% with the number of additional data in Self Training K = 100. The number of re-training iterations of Self Training L is set to 5. Fig. 26 shows the relationship between the performance and the confidence threshold p_thresh. We observe that Self Training degrades the performance when the threshold p_thresh is lowered from 99.9% to 80%. We conjecture that the number of wrongly labeled data samples increases with the lower threshold p_thresh = 80%. We calculated the number of data that are labeled by Self Training. The number of newly labeled data is shown in Table III. Self Training with p_thresh = 80% newly labels 472.5 data samples (average of four simulation runs) including 215.5 wrongly labeled data samples when G = 2. Even though Self Training with p_thresh = 80% increases the number of newly labeled data samples, it also increases the number of wrongly labeled data samples, which could be harmful for the (re-)training.


• How to select data to be labeled to further reduce the


required number of labeled data per class.

ACKNOWLEDGMENT
The authors would like to thank Atsushi Morohoshi and
Sebastian Backstad for useful discussion. The authors are
grateful to Yuki Shibuya, Hiromi Shimizu, Shouta Yoshida,
Yuta Moroishi, and Souta Shimizu for assistance with the
numerical simulations.

R EFERENCES
[1] O. G. Aliu, A. Imran, M. A. Imran, and B. Evans. A survey of self
organisation in future cellular networks. IEEE Communications Surveys
Fig. 26. Relationship between 𝑝𝑡 ℎ𝑟 𝑒𝑠ℎ and performance. Tutorials, 15(1):336–361, 2013.
[2] S. Aoki, K. Shiomoto, C. L. Eng, and S. Backstad. Few-shot learning
for enodeb performance metric analysis for service level assurance in
TABLE III lte networks. In NOMS 2020 - 2020 IEEE/IFIP Network Operations
R ELATIONSHIP BETWEEN 𝑝𝑡 ℎ𝑟 𝑒𝑠ℎ AND NUMBER OF NEWLY LABELED and Management Symposium, pages 1–4, 2020.
DATA . [3] A. Asghar, H. Farooq, and A. Imran. Self-healing in emerging cellular
networks: Review, challenges, and research directions. IEEE Communi-
cations Surveys Tutorials, 20(3):1682–1709, 2018.
𝑝𝑡 ℎ𝑟 𝑒𝑠ℎ = 99.9% 𝑝𝑡 ℎ𝑟 𝑒𝑠ℎ = 80%
[4] R. Barco, P. Lazaro, and P. Munoz. A unified framework for self-healing
𝐺 Total Wrong Total Wrong
in wireless networks. IEEE Communications Magazine, 50(12):134–142,
2 396.75 159 472.5 215.5
December 2012.
3 402 94.25 480.5 204.5
[5] Luca Bertinetto, João F. Henriques, Jack Valmadre, Philip H. S. Torr,
4 388.5 94.75 474.75 175.5
and Andrea Vedaldi. Learning feed-forward one-shot learners. In
5 401.5 90.25 480.25 136
Proceedings of the 30th International Conference on Neural Information
10 413.5 54.3 484.6 101.4
Processing Systems, NIPS’16, pages 523–531, Red Hook, NY, USA,
50 385.6 15.5 486.6 48.1
2016. Curran Associates Inc.
100 346.5 9.6 486.6 42.46
[6] F. Chernogorov, J. Turkka, T. Ristaniemi, and A. Averbuch. Detection
of sleeping cells in lte networks using diffusion maps. In 2011 IEEE
73rd Vehicular Technology Conference (VTC Spring), pages 1–5, May
2011.
VII. CONCLUSION

We investigated machine learning algorithms to analyze eNodeB performance metrics for service level assurance in LTE networks. In order to reduce the costly human labor and time for data annotation, and to deal with an unbalanced data set, we proposed a few-shot learning method that uses a Prototypical Network. Using trace data from eNodeBs deployed in production networks, we evaluated the proposed few-shot learning method.
The proposed method outperforms the baseline method when the number of labeled data per class 𝐺 decreases; when 𝐺 ≤ 10, the proposed method significantly outperforms the baseline method. By comparing confusion matrices, we confirm that the proposed method classifies labels far better than the baseline method. We investigated the design of the embedding function that the Prototypical Network leverages, and confirm that a 1D-CNN is the best suited among the neural network architectures we evaluated (1D-CNN, 2D-CNN, and MLP). We also confirm that the dimension 𝑀 of the embedding function output should be larger than three. We further reduce the number of labeled data by using the semi-supervised Prototypical Network. Finally, we confirm that a combination of Self Training and the semi-supervised Prototypical Network improves the performance over the semi-supervised Prototypical Network alone when only a handful of labeled data is available. Self Training with a higher 𝐾 and/or a higher 𝑝_thresh achieves better performance.
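For reference, the classification rule summarized above can be sketched as follows. The embedding function f is a placeholder for the trained 1D-CNN with 𝑀-dimensional output, and the NumPy code is an illustrative reconstruction under that assumption, not the authors' implementation.

    # Prototypical-Network inference: each class prototype is the mean
    # embedding of its G labeled support samples; a query is assigned to
    # the class of the nearest prototype (squared Euclidean distance).
    import numpy as np

    def class_prototypes(f, support_x, support_y):
        """support_x: (N, ...) array; support_y: (N,) integer labels."""
        classes = np.unique(support_y)
        protos = np.stack([f(support_x[support_y == c]).mean(axis=0)
                           for c in classes])          # (C, M) prototypes
        return classes, protos

    def classify_queries(f, classes, protos, query_x):
        z = f(query_x)                                  # (n, M) embeddings
        d = ((z[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
        return classes[np.argmin(d, axis=1)]            # nearest prototype

With a trivial stand-in such as f = lambda X: X.reshape(len(X), -1), the two functions run end to end; in the paper, f is the trained 1D-CNN.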
Future directions of our research include:
• How to handle unknown classes, since we addressed the task of classifying only known classes in this paper.
• How to select data to be labeled, so as to further reduce the required number of labeled data per class.

ACKNOWLEDGMENT

The authors would like to thank Atsushi Morohoshi and Sebastian Backstad for useful discussions. The authors are grateful to Yuki Shibuya, Hiromi Shimizu, Shouta Yoshida, Yuta Moroishi, and Souta Shimizu for assistance with the numerical simulations.

Shogo Aoki received his Bachelor of Engineering from Tokyo City University in 2019. His research focused on applying machine learning to mobile network management. Currently, he researches data science methods for the analysis of web user behavior log data at Waseda University as a Master of Management Engineering student graduating in 2021.
Kohei Shiomoto [M'90, SM'15] is a Professor at Tokyo City University, Tokyo, Japan. He has been engaged in R&D in the data communication industry for over 28 years, since he joined NTT Laboratories in 1989. He has been active in the areas of network virtualization, data mining for network management, and traffic & QoE management since he joined Tokyo City University in 2017. He served as Guest Co-Editor for a series of special issues established in IEEE TNSM on Management of Softwarized Networks. He has served in various roles organizing IEEE ComSoc high-profile conferences such as IEEE NOMS, IEEE IM, and IEEE NetSoft.
Chin Lam Eng is a principal solution architect at Ericsson Japan. He received the Bachelor of Engineering from Nanyang Technological University, Singapore, and the M.Sc. from the National University of Singapore, both in electrical engineering. His current research interests lie in the application of machine learning techniques to intelligent wireless communication networks, including energy-efficient networks, cognitive radio network systems, and self-organizing heterogeneous radio access networks. He is a member of the IEEE.
