
Network Security Situation Factor Extraction Based on Random Forest of Information Gain


Yongcheng Duan
College of Information Technology and Network Security, People’s Public Security University of China, Beijing 100038, China
443130851@qq.com

Xin Li*
College of Information Technology and Network Security, People’s Public Security University of China, Beijing 100038, China
ndlixin@sina.com

Xue Yang
College of Information Technology and Network Security, People’s Public Security University of China, Beijing 100038, China
812334777@qq.com

Le Yang
College of Information Technology and Network Security, People’s Public Security University of China, Beijing 100038, China
yl0x00@icloud.com

ABSTRACT
Aiming at the problem of situational element extraction, a method based on random forest of information gain for network security situation factor extraction is proposed. First, the importance of each attribute is determined by its information gain; after a threshold is set, the attribute set is reduced and redundant attributes are deleted. Second, the processed data are classified using a random forest classifier. Finally, in order to verify the efficiency of the algorithm, the improved method is tested on an intrusion detection data set. The experimental results show that, compared with the traditional method, the algorithm effectively improves the accuracy and achieves efficient extraction of network security situation elements.

CCS Concepts
• Security and privacy ➝ Intrusion/anomaly detection and malware mitigation ➝ Intrusion detection systems • Computing methodologies ➝ Machine learning ➝ Machine learning approaches ➝ Classification and regression trees.

Keywords
Situational awareness; situational extraction; random forest; information gain.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
ICBDC 2019, May 10–12, 2019, Guangzhou, China
© 2019 Association for Computing Machinery
ACM ISBN 978-1-4503-6278-8/19/05…$15.00
http://doi.org/10.1145/3335484.3335486

1. INTRODUCTION
In recent years, the network security field has developed rapidly. With the continuous expansion of the network scale, the topology structure has become increasingly complex, and attack behavior has increased rapidly as well. It is increasingly necessary to master the security situation of the entire network. In order to cope with increasingly complex and hidden network threats and grasp the security status of the entire network in a macroscopic way, network security situation awareness has received more and more attention. Network security situation awareness is a cognitive process for the security state of network systems [1], comprising situation factor extraction, situation assessment and situation prediction. The extraction of situation factors is the basis of network security situation awareness and an important part of the whole situation awareness process. The quality of the situation factors directly affects the performance of the entire security system. The main purpose of situation factor extraction is to delete redundant data, extract high-value situation elements, and provide a data foundation for situation assessment and situation prediction.

At present, there are two research hotspots for the extraction of situation factors at home and abroad [1]. Methods based on prior knowledge rely on expert experience and knowledge to define a knowledge base, as in scene-based methods; data-driven methods use data mining, machine learning and other techniques to analyze the relationships between behaviors, as in similarity methods, causal association methods and cross-correlation methods. Bass [2] proposed the concept of network security situation awareness in 1999, the purpose of which is to correlate IDS detection results and analyze attack information for the evaluation of network security. The National Center for Advanced Secure Systems Research [3] developed a suite of security event fusion tools that combines security data with professional cognitive capabilities to provide a security visualization of the network. The Lawrence Berkeley National Laboratory developed the "Spinning Cube of Potential Doom" system in 2003 [4], which uses points in three-dimensional space to represent network connections and thereby supports network security situation feature extraction. Domestically, the extraction of situation factors is mainly studied from the aspect of intrusion detection, which not only removes redundant features but also effectively detects attacks. Li Dongyin [5] proposed a situation feature extraction model based on particle swarm optimization and logistic regression, neighborhood rough sets and a MapReduce distributed framework: the global optimization ability of the improved particle swarm optimization algorithm is used to calculate the parameters of the logistic regression model, improving the accuracy of factor acquisition. Wang Huiqiang [6] proposed a method based on a

neural network combined with an evolutionary strategy: the neural network parameters are optimized by the evolutionary strategy, and the neural network is then used to extract situation elements, greatly improving classification accuracy. In view of the problem that network security situation information cannot be uniformly described, shared and reused, Si Cheng [7] proposed an ontology-based knowledge base model of network security situation factors, which first classifies and extracts situation elements and then builds the knowledge ontology model of network security situation factors following the principles of ontology construction. Liu Xiaowu [8] proposed a fusion-based network security situation awareness control model, which combines the deduction of security event threat levels and threat element relationships with a multi-source fusion algorithm, overcoming the difficulty of dealing with the complex affiliations between network components in the process of acquiring situation elements.

The above methods require a large amount of prior knowledge in the process of extracting situation factors and are strongly subjective. They are effective in specific fields, but when the sample base is too large, subjective analysis loses its effect. Therefore, this paper proposes a method based on random forest of information gain for situation feature extraction: information gain is introduced into network security situation factor extraction to calculate attribute weights, attributes with higher importance are selected, redundant attributes are deleted, and random forest classification is then carried out, which effectively improves the accuracy.

2. EXTRACTION OF SITUATION FACTORS BASED ON RANDOM FOREST OF INFORMATION GAIN

2.1 Information Gain
The primary task of situation factor extraction is to accurately discover abnormal behaviors in a complex network environment. Its essence is the process of screening the situation feature sets by reducing the dimensionality of the attributes, deleting redundant attributes, and selecting key situation elements [9]. In a complex network environment, network situation element information is characterized by a large data volume and diverse attribute types [10]. Therefore, reducing the dimensionality, cutting down redundancy, and extracting the key situation feature sets from the large amount of situation element information is an important step in the extraction of situation elements. Current attribute reduction algorithms include the principal component analysis (PCA) method [11], the singular value decomposition (SVD) method, rough sets (RS) [12, 13] and information gain. The information gain based attribute reduction process is shown in Figure 1.

Figure 1. Situation element extraction process based on information gain.

Information gain is a feature selection criterion. The information gain of a known feature X represents the reduction in the uncertainty of the class Y. Different features have different information gains, and features with a large information gain have a stronger classification capability. Information gain is therefore used to analyze the correlation in the data: the information gain of each attribute is calculated as a measure of that attribute's impact on the classification result, i.e. on the class label.

Its calculation formula is as follows: $\mathrm{Gain}(S, A) = H(S) - H(S|A)$, with $H(S) = -\sum_{i=1}^{m} \frac{C_i}{D} \log_2 \frac{C_i}{D}$, where H(S) is the information entropy of the sample set S, H(S|A) is the information entropy after the sample set S is divided by the feature A, S is the collection of samples, C_i is the number of samples in the i-th category (i = 1, 2, ..., m), and D is the number of samples in the sample set S.
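To make the formula concrete, the following is a minimal Python sketch (not the authors' code) that computes Gain(S, A) for discrete attributes and keeps only the attributes whose gain reaches a threshold; the attribute names, toy data and default threshold are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S): entropy of the class labels of sample set S."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A) = H(S) - H(S|A) for one discrete attribute A."""
    total = len(labels)
    conditional = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional

def select_attributes(records, labels, threshold=0.5):
    """Keep the attributes whose information gain reaches the threshold."""
    gains = {a: information_gain([r[a] for r in records], labels)
             for a in records[0]}
    return [a for a, g in gains.items() if g >= threshold], gains

# Toy example with two hypothetical attributes.
records = [{"protocol": "tcp", "flag": "SF"}, {"protocol": "udp", "flag": "S0"},
           {"protocol": "tcp", "flag": "S0"}, {"protocol": "udp", "flag": "SF"}]
labels = ["normal", "attack", "attack", "normal"]
kept, gains = select_attributes(records, labels)
print(gains)   # flag separates the classes perfectly, protocol does not
print(kept)    # -> ['flag']
```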
2.2 Random Forest
The random forest is an ensemble learning algorithm built on decision trees, proposed by Leo Breiman by combining the random subspace method with Bagging ensemble learning theory [14]. Decision trees are a simple but widely used classification technique [15]. Each node of a decision tree keeps splitting the data according to a splitting criterion until the conditions for terminating the growth of the tree are reached. The most important step in constructing a decision tree is determining the optimal splitting feature according to the splitting criterion. Common decision tree algorithms are the ID3 algorithm [16], the C4.5 algorithm [17] and the CART algorithm [18]: ID3 is based on the information gain of Shannon information theory [19], C4.5 is based on the gain ratio, and CART is based on Gini impurity.

A random forest is equivalent to a combination of multiple decision trees. The random forest uses the bootstrap method to draw samples of the same size from the original data set to form M sub-data sets, and then builds the corresponding M decision trees to form the forest; there is no correlation between the individual decision trees. When a node splits, the decision tree randomly extracts a feature subset (usually of size log2 K) from all K features and then selects the optimal split feature from this subset to grow the tree. When a new sample arrives, every decision tree in the forest makes a judgment, and the final classification result is determined by voting over the outputs of the M decision trees. Random forests are less likely to overfit than ordinary decision trees, and they have good noise immunity, fast training speed, and the ability to process very high-dimensional data. Figure 2 is a schematic diagram of the classification of random forests.

Figure 2. Schematic diagram of the classification of random forests.
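As an illustration of the mechanism described above (bootstrap sampling, a random feature subset of roughly log2 K features at each split, and voting over M trees), the sketch below trains a forest with scikit-learn's RandomForestClassifier on synthetic data; it is a generic usage example under assumed parameters, not the experimental setup of this paper.

```python
# Minimal random forest sketch; the synthetic data and parameters are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14))           # e.g. 14 features, as in a reduced attribute set
y = (X[:, 0] + X[:, 3] > 0).astype(int)   # synthetic binary class label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # M decision trees, each grown on a bootstrap sample
    max_features="log2",  # random feature subset of size ~log2(K) at each split
    criterion="gini",     # CART-style splitting on Gini impurity
    random_state=0,
)
forest.fit(X_train, y_train)

# The predicted class is obtained by aggregating the votes of the M trees.
print("test accuracy:", forest.score(X_test, y_test))
```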

2.3 Situation Factor Extraction Method Based on Random Forest of Information Gain
Different attributes differ in their importance and in their contribution to classification, so applying a traditional random forest to data without feature dimensionality reduction and noise reduction degrades the classification effect. Through information gain, large-scale data can be reduced in dimensionality, so that the amount of data is reduced while the key information is retained and the noise in the data is removed. Therefore, this paper proposes a method based on random forest of information gain to extract situation elements, combining information gain with random forests to improve accuracy. The method is shown in Figure 3.

Figure 3. Random forest situation factor extraction method based on information gain.

The steps of the situation element extraction algorithm proposed in this paper are as follows (a pipeline sketch follows the list):

Step 1: Pre-process the raw data, e.g. by discretizing continuous attributes, to form the set of situation elements.

Step 2: Calculate the information gain of each attribute and set the threshold.

Step 3: Reduce the attributes whose information gain is below the threshold, delete the redundant attributes, and determine the optimized subset of situation elements.

Step 4: Classify the optimized subset of situation elements with the random forest classifier, obtaining a random forest classifier for the extraction of situation elements.

Step 5: Test the random forest classifier and obtain the test results.
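A minimal end-to-end sketch of these steps in Python is given below. It assumes pandas and scikit-learn rather than the Weka environment used later in the paper, reuses the information_gain helper from the Section 2.1 sketch, and its binning scheme, column names and default threshold are illustrative assumptions.

```python
# Sketch of the proposed pipeline: discretize -> information gain -> threshold
# reduction -> random forest training and testing.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def run_pipeline(train_df, test_df, label_col="label", threshold=0.5):
    # Step 1: pre-process; equal-width binning for numeric attributes,
    # integer codes for categorical ones. (A real experiment should fit the
    # binning/encoding on the training set only and reuse it on the test set.)
    def discretize(df):
        out = df.copy()
        for col in out.columns:
            if col == label_col:
                continue
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = pd.cut(out[col], bins=10, labels=False, duplicates="drop")
            else:
                out[col] = pd.factorize(out[col])[0]
        return out

    train_d, test_d = discretize(train_df), discretize(test_df)

    # Step 2: information gain of every attribute with respect to the class label.
    labels = train_d[label_col].tolist()
    gains = {c: information_gain(train_d[c].tolist(), labels)
             for c in train_d.columns if c != label_col}

    # Step 3: keep only the attributes whose gain reaches the threshold.
    kept = [c for c, g in gains.items() if g >= threshold]

    # Steps 4-5: train the random forest on the reduced subset and test it.
    clf = RandomForestClassifier(n_estimators=100, max_features="log2", random_state=0)
    clf.fit(train_d[kept], train_d[label_col])
    predictions = clf.predict(test_d[kept])
    return kept, accuracy_score(test_d[label_col], predictions)

# Example call with assumed NSL-KDD file names:
# train_df = pd.read_csv("KDDTrain+.csv"); test_df = pd.read_csv("KDDTest+.csv")
# kept, accuracy = run_pipeline(train_df, test_df, threshold=0.5)
```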

3. EXPERIMENT AND ANALYSIS

3.1 Experimental Data
The KDD-99 dataset is the most classic of the many publicly available intrusion detection datasets and is currently recognized as the most influential one. The NSL-KDD data set is the version from which redundant records have been removed. The data set contains 41 feature attributes and 1 label attribute. The label attribute is divided into five categories: Normal, Probe, DoS, User to Root (U2R) and Remote to Local (R2L). Table 1 shows the overall distribution of the NSL-KDD data set.

Table 1. Data distribution of NSL-KDD

Data set       Normal   DoS     Probe   U2R   R2L
Training set   67343    45927   11656   52    995
Test set       9711     7456    2421    200   2756

3.2 Experimental Methods and Results Analysis
The experiment uses Weka 2.7.2, an open-source machine learning tool that is widely used in data mining and related fields. In this paper the random forest is built from CART decision trees. Much of the data in the NSL-KDD data set is continuous, so the continuous data must first be discretized. The information gain of each attribute is then calculated on the discretized data, for example Gain(service) = 0.8601 and Gain(flag) = 0.7057. The threshold is initially set to 0.5. After the redundant attributes are removed, the data are fed into the traditional random forest classification model and the improved random forest classification model, and the threshold is tuned to select the best value. The effect of the threshold on the classification result is shown in Table 2.

Table 2. The effect of threshold on classification

Threshold       0       0.25    0.5     0.75    1
Accuracy rate   0.744   0.761   0.769   0.747   0.716
Error rate      0.256   0.239   0.231   0.253   0.284

The optimal threshold obtained from Table 2 is 0.5. After attribute selection, the number of attributes is reduced from 41 to 14: src_bytes, service, diff_srv_rate, flag, dst_bytes, same_srv_rate, dst_host_diff_srv_rate, count, dst_host_srv_count, dst_host_same_srv_rate, dst_host_serror_rate, serror_rate, dst_host_srv_serror_rate, srv_serror_rate.
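The threshold tuning summarized in Table 2 can be reproduced in outline with a loop such as the sketch below; it reuses the hypothetical run_pipeline helper from Section 2.3 and the candidate thresholds of Table 2, while the data-loading details remain assumptions.

```python
# Sweep the information-gain threshold and record accuracy and error rate,
# mirroring the structure of Table 2 (run_pipeline is the Section 2.3 sketch).
import pandas as pd

train_df = pd.read_csv("KDDTrain+.csv")   # assumed NSL-KDD file names
test_df = pd.read_csv("KDDTest+.csv")

rows = []
for threshold in [0, 0.25, 0.5, 0.75, 1]:
    kept, accuracy = run_pipeline(train_df, test_df, threshold=threshold)
    rows.append({"Threshold": threshold,
                 "Kept attributes": len(kept),
                 "Accuracy rate": round(accuracy, 3),
                 "Error rate": round(1 - accuracy, 3)})

print(pd.DataFrame(rows).to_string(index=False))
```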
Table 3 shows the comparison between the classification of the NSL-KDD dataset using the traditional random forest classification model and using the information gain based random forest classification model.

Table 3. Comparison of classification effects

Classification   Random forest   Random forest of information gain
Normal           0.927           0.97
Probe            0.617           0.663
DoS              0.766           0.819
U2R              0.02            0.035
R2L              0.099           0.068
Weighted Avg     0.731           0.769

It can be seen from the experiment that on the NSL-KDD data set the random forest algorithm based on information gain is superior to the traditional random forest algorithm in four of the five classification categories.

Table 4 shows the detection results on the NSL-KDD dataset obtained with the traditional naive Bayes, KNN, SVM and random forest classifiers, and with the same classifiers after information gain attribute reduction, and compares their classification performance.

Table 4. Comparison of classification effects

Classification   Naive Bayes   KNN     SVM     Random forest   IG-Naive Bayes   IG-KNN   IG-SVM   IG-Random forest
Accuracy rate    0.715         0.742   0.741   0.744           0.703            0.759    0.743    0.769
Error rate       0.285         0.258   0.259   0.256           0.297            0.241    0.257    0.231

It can be seen from Table 4 that, among the traditional algorithms, the random forest classifier has the highest classification accuracy, and among the classifiers applied after information gain attribute reduction, the information gain random forest classifier again has the highest accuracy.

In summary, based on the experimental results, we can conclude that the information gain based random forest algorithm proposed in this paper can effectively improve the accuracy and achieve efficient extraction of network security situation elements.
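A comparison in the spirit of Table 4 can be sketched as follows: each classifier is trained once on all attributes and once on the information gain reduced subset. The estimators and their parameters are illustrative scikit-learn defaults, not the Weka configuration used in the paper, and the compare helper is hypothetical.

```python
# Compare naive Bayes, KNN, SVM and random forest with and without the
# information-gain-reduced attribute subset (cf. Table 4).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def compare(train_d, test_d, kept, label_col="label"):
    """train_d/test_d: discretized frames; kept: attributes selected by information gain."""
    models = {
        "Naive Bayes": GaussianNB(),
        "KNN": KNeighborsClassifier(),
        "SVM": SVC(),
        "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    all_attrs = [c for c in train_d.columns if c != label_col]
    results = {}
    for name, model in models.items():
        for prefix, attrs in (("", all_attrs), ("IG-", kept)):
            model.fit(train_d[attrs], train_d[label_col])
            predictions = model.predict(test_d[attrs])
            results[prefix + name] = round(accuracy_score(test_d[label_col], predictions), 3)
    return results
```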
4. CONCLUSION
In this paper, combining information gain with random forest, an algorithm based on random forest of information gain for situation extraction is proposed. First, information gain is used to reduce the dimensionality of the set of situation elements and the redundant situation elements are deleted; the reduced set of situation elements is then used to train a random forest classifier that extracts the important situation elements. The experimental results show that, compared with the traditional random forest algorithm, the proposed algorithm effectively improves the accuracy, realizes the efficient extraction of network security situation elements, and provides a data basis for situation assessment and situation prediction. There is still room for optimization: improving the efficiency of the algorithm and reducing its demand for storage space is the focus of future research.

5. ACKNOWLEDGMENT
Supported by the National Key R&D Program of China (No. 2017YFC0803700) and by the construction and development of a key laboratory of the Ministry of Public Security.

6. REFERENCES
[1] Gong J, Zang XD, Su Q, Hu XY, Xu J. Survey of Network Security Situation Awareness[J]. Journal of Software, 2017, 28(4): 1010-1026 (in Chinese). DOI=http://www.jos.org.cn/1000-9825/5142.htm.
[2] Bass T, Gruber D. A glimpse into the future of id[J]. ;login: The Magazine of USENIX & SAGE, 1999, 24: 40-45.
[3] Yurcik W. Visualizing NetFlows for security at line speed: the SIFT tool suite[C]// Conference on Systems Administration. DBLP, 2005: 169-176.
[4] Lau S. The Spinning Cube of Potential Doom[J]. Communications of the ACM, 2004, 47(6): 25-26.
[5] Li D, Liu Z. Situation element extraction of network security based on Logistic Regression and Improved Particle Swarm Optimization[C]// Natural Computation (ICNC), 2013 Ninth International Conference on. IEEE, 2013: 569-573.
[6] Wang Huiqiang, Liang Ying, Ye Haizhi. An Extraction Method of Situation Factors for Network Security Situation Awareness[C]// Internet Computing in Science and Engineering (ICICSE '08), 2008.
[7] Si Cheng, Zhang Hongqi, Wang Yongwei, Yang Yingjie. Research on Knowledge Base Model of Network Security Situational Elements Based on Ontology[J]. Computer Science, 2015, 42(05): 173-177.
[8] Liu Xiaowu, Wang Huiqiang, Lü Hongwu, Yu Jiguo, Zhang Shuwen. Fusion-Based Cognitive Awareness-Control Model for Network Security Situation[J]. Journal of Software, 2016, 27(8): 2099-2114 (in Chinese). DOI=http://www.jos.org.cn/1000-9825/4611.htm.
[9] Guo Jian. Research on the acquisition technology of situational factors in network security situational awareness[D]. Northeastern University, 2011.
[10] Lai Jibao, Wang Ying, Wang Huiqiang, Zheng Fengbin, Zhou Bing. Research on the Structure of Network Security Situation Awareness System Based on Multi-source Heterogeneous Sensors[J]. Computer Science, 2011, 38(03): 144-149+158.
[11] Lin Weining, Chen Mingzhi, Zhan Yunqing, et al. Research on an Intrusion Detection Algorithm Based on PCA and Random-forest Classification[J]. Netinfo Security, 2017(11): 50-54.
[12] Liang Ying, Wang Huiqiang, Lai Jibao. A Network Security Situational Awareness Method Based on Rough Set Theory[J]. Computer Science, 2007(08): 95-97+147.
[13] Li Hong. Research on Extraction of Network Security Situation Factors Based on Rough Sets[D]. Hebei Normal University, 2017.
[14] Kwok SW, Carter C. Multiple decision trees[EB/OL]. [2018-01-31]. https://arxiv.org/abs/1304.2363.
[15] Ho TK. The random subspace method for constructing decision forests[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20(8): 832-844.
[16] Quinlan JR. Induction of decision trees[J]. Machine Learning, 1986, 1(1): 81-106.
[17] Quinlan JR. C4.5: Programs for Machine Learning[M]. Elsevier, 2014.
[18] Breiman L, Friedman J, Stone CJ, et al. Classification and Regression Trees[M]. CRC Press, 1984.
[19] Qi Ben, Wang Mengdi. A Method Using Information Gain and Naive Bayes to Extract Network Situation Information[J]. Netinfo Security, 2017(9): 54-57.

