Role of Data Mining in Cyber Security Detection

ROLE OF DATA MINING IN CYBER SECURITY 1NH18MCA24
CHAPTER 1
INTRODUCTION
1.1 GENERAL INTRODUCTION

Data mining is the process of posing queries and extracting patterns, often previously
unknown, from large quantities of data using pattern matching or other reasoning
techniques. Cyber security is the area that deals with protecting systems against attacks
such as cyber terrorism. Cyber attacks are expected to cost corporations billions of dollars.
For example, an attacker could masquerade as a legitimate user and swindle, say, a bank out
of billions of dollars. Data mining and web mining may be used to detect and possibly
prevent security attacks, including cyber attacks. For example, anomaly detection
techniques could be used to detect unusual patterns and behaviors. Link analysis may be
used to trace viruses to their perpetrators. Classification may be used to group various
cyber attacks and then use the resulting profiles to detect an attack when it occurs.
Prediction may be used to determine potential future attacks, based in part on information
learnt about terrorists through email and phone conversations. For some threats,
non-real-time data mining may suffice, while for others, such as network intrusions, we may
need real-time data mining. Many researchers are investigating the use of data mining
for intrusion detection. Real-time data mining requires not only that results be generated
in real time but also that models be built in real time. For example, credit card fraud
detection is a form of real-time processing; however, there the models are built ahead of
time. Building models in real time remains a challenge. Data mining can also be used for
analyzing web logs and audit trails. Based on the results of the data mining tool, one can
then determine whether any unauthorized intrusions have occurred and/or whether any
unauthorized queries have been posed. Data mining may also be applied to biometrics-related
applications. Finally, data mining has applications in national security, including
detecting and preventing terrorist activities.
The presentation will provide an overview of data mining and security threats and then
discuss the applications of data mining for cyber security and national security, including
intrusion detection and biometrics. Privacy considerations, including a discussion of
privacy-preserving data mining, will also be given.
Department of MCA, NHCE 2020-21

CHAPTER 2
ALGORITHMS AND METHODOLOGIES
2.1 K – MEANS CLUSTERING ALGORITHM
This algorithm can be defined as a form of cluster analysis in which a number of
observations is divided into K clusters, with each observation assigned to the cluster
with the nearest mean. The resulting partition of the data space forms Voronoi cells.
K-means is considered one of the simplest learning algorithms that solve the well-known
clustering problem. It partitions a dataset into a number of clusters, K, that is fixed
a priori. After defining the K centroids, one per cluster, they must be placed with care,
because different initial positions lead to different results. So, the first priority should
be to put them as distant from one another as possible. In the next step, every point of the
given dataset is associated with its nearest centroid. When no point is left unassigned, the
first step is over and an initial grouping is complete.
Next, k new centroids are computed for the resulting clusters, and the data points are
assigned again to the nearest new centroid, which produces a loop. As a result of this loop,
the k centroids change position step by step until no further changes occur, that is, until
the centroids no longer move. It can be described in the steps below:
• Place K points into the space represented by the objects that are being clustered.
These points represent the initial group centroids.
• Assign each object to the group that has the nearest centroid.
• When all the objects have been assigned, recalculate the positions of the K centroids.
• Repeat the second and third steps until the centroids no longer move.
This separates the objects into groups from which the metric to be minimized can be
calculated. Finding the optimal clustering is NP-hard, so heuristic algorithms that
converge quickly to a local optimum are commonly used. These heuristics resemble the
expectation-maximization (EM) algorithm for mixtures of Gaussian distributions: both use
an iterative refinement approach, and both use cluster centres to model the data. However,
K-means tends to find clusters of comparable spatial extent, whereas the
expectation-maximization mechanism allows clusters to have different shapes.
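The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in any cited work: the names farthest_first and kmeans, the sample points, and the farthest-first rule for spreading the initial centroids are all chosen for the example.

```python
import math


def farthest_first(points, k):
    """Pick k starting centroids that lie far from one another (assumed rule)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its closest chosen centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
```

Because the result depends on where the centroids start, a different initialization rule may converge to a different local optimum on less clearly separated data.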
2.2 EM ALGORITHM FOR PRIVACY

An EM algorithm is an iterative method for finding maximum likelihood or maximum a
posteriori (MAP) estimates of parameters in statistical models, where the model depends
on unobserved hidden (latent) variables. Here, the EM algorithm is applied once the
K-means algorithm has produced a satisfactory result. Each iteration of the
expectation-maximization algorithm alternates between an expectation (E) step, which
evaluates the expectation of the log-likelihood using the latest estimate of the
parameters, and a maximization (M) step, which computes the parameters that maximize the
expected log-likelihood found in the E step. The EM algorithm assigns to each data point a
probability of belonging to each cluster, based on the cluster's probability distribution.
The number of clusters to generate can be decided by cross-validation, or it can be
specified a priori.
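To make the alternation between the two steps concrete, the sketch below fits a mixture of two one-dimensional Gaussians. It is an illustrative example only: the function names, the sample data, the fixed iteration count, and the starting means (which stand in for centroids obtained from K-means) are assumptions, not details taken from the source.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, start_means, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM."""
    means = list(start_means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: probability that each component generated each point.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weights, means and variances from those
        # probabilities.
        for c in range(2):
            n_c = sum(r[c] for r in resp)
            weights[c] = n_c / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            variances[c] = max(var, 1e-6)  # floor guards against collapse
    return weights, means, variances, resp


# Hypothetical data drawn around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
weights, means, variances, resp = em_two_gaussians(data, start_means=(0.0, 6.0))
```

Unlike K-means, each point ends up with a probability for every cluster (the rows of resp) rather than a single hard label.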
2.3 HIERARCHICAL CLUSTERING

In this method, clusters are built up in a hierarchy and can be analyzed sequentially.
Hierarchical clustering falls into two categories: divisive and agglomerative.
2.3.1 Agglomerative (bottom up) - Agglomerative clustering applies a bottom-up approach,
in which clusters are paired up step by step to build the hierarchy.
• We start with every point in its own cluster, known as a singleton.
• In the second step, pairs of clusters are merged recursively as one moves up the
hierarchy. The method terminates when merging has reduced the clusters to the required
number K.
2.3.2 Divisive - In this approach clustering starts from the other end: all points begin in
one cluster, which is recursively split as one moves down the hierarchy. In general, the
splits and merges are determined in a greedy way. Agglomerative clustering becomes slow on
large data sets.

2.3.3 Top-down approach of divisive clustering
• The first step starts with one big cluster.
• In the second step, large clusters are partitioned into smaller ones, one at a time.
The process terminates when K clusters have been obtained.
In hierarchical clustering, a data set of N items to be clustered is given, and an N*N
distance matrix is prepared based on the distances between the data points.
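A minimal sketch of the agglomerative (bottom-up) variant is given below. It builds the N*N distance matrix and then greedily merges the two closest clusters until K remain. The choice of single linkage (distance between the closest members) is an assumption, since the text does not say how the distance between clusters is measured; the function name and sample points are likewise illustrative.

```python
import math


def agglomerative(points, k):
    """Merge the closest clusters until only k remain (single linkage)."""
    n = len(points)
    # N*N matrix of distances between all pairs of data points.
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Start with every point in its own singleton cluster.
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Greedy step: merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical points; clusters are returned as lists of point indices.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters = agglomerative(points, k=3)
```

The nested search over all cluster pairs at every merge is what makes this approach slow on large data sets.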

CHAPTER 3
ISSUES IN CYBER SECURITY
Cyber terrorism, as discussed here, relates to the theft and spoofing of confidential
information. This can happen through a security breach and access by an unauthorized user.
Malicious software such as viruses and Trojan horses is behind many security violations,
which can lead to antisocial activities in the world of cyber crime. Cyber security also
includes applications that analyze data for auditing computer systems. We can build
a data warehouse that contains the data to audit and then use existing data mining tools to
analyze whether potential anomalies are present. Using data mining techniques, we can
restrict confidential information or data to legitimate users and stop unauthorized access.
Data mining can thus be used effectively to detect and prevent cyber attacks; however, it
can also aggravate security issues such as privacy and inference. The security model is
shown in the figure below.
Figure 3.1: Security Model in Data Mining

3.1 MALICIOUS CODE AND INTRUSION DETECTION
Intrusion can be described as an unauthorized attack on the availability of resources or on
the integrity and confidentiality of data. These attacks fall into two types: network-based
attacks and host-based attacks. In a host-based attack, a specific machine is targeted and
the attacker attempts to gain unauthorized access to it. Detection schemes for this type of
attack primarily use simple routines to obtain system-call data from an audit process that
tracks the system calls performed by every user.
The other type is the network-based attack, which prevents authorized users from using
existing network services in a meaningful way. This type of attack can be detected by using
network traffic data and continuously monitoring the traffic at the addresses of the system
nodes. Intrusion detection systems can be categorized into two groups: misuse detection
systems and anomaly detection systems.
3.2 MALICIOUS INTRUSIONS IN DATA MINING
The targets of malicious intrusion include servers, web clients, operating systems, networks
and databases. Most cyber attacks and acts of cyber terrorism happen through malicious
intrusion. In a malicious intrusion, someone without any authorization tries to break into a
protected network and obtain confidential information. The intruder may be a human or
malicious automated software (a robot) written by a human. When discussing cyber attacks or
malicious intrusions, it is often useful to draw analogies with the non-cyber world, that
is, with terrorism that is not information related, and then to map those attacks onto the
world of computers and networks. Cyber terror is increasing worldwide day by day, as shown
in the figure below.
Figure 3.2 Graph of Increasing cyber Terror Worldwide.

3.3 EXTERNAL ATTACKS, INSIDER THREATS AND CYBER-TERRORISM
Cyber attacks are a major concern today. The threat is increasing day by day, helped by the
information available on the Internet.
Cyber threats and attacks on existing networks and computer infrastructure can disrupt
business. It is estimated that cyber terrorism could cause losses of millions of dollars.
Cyber threats can originate from inside or outside the organization. An attack on a computer
system by someone outside the organization is known as an external cyber attack. In such an
attack, hackers break into the system and cause chaos in the organization.

CHAPTER 4
DATA MINING FOR NETWORK SECURITY
Data mining has many applications in security including in national security (e.g.,
surveillance) as well as in cyber security (e.g., virus detection). The threats to national
security include attacking buildings and destroying critical infrastructures such as power
grids and telecommunication systems. Data mining techniques are being used to identify
suspicious individuals and groups, and to discover which individuals and groups are
capable of carrying out terrorist activities. Cyber security is concerned with protecting
computer and network systems from corruption due to malicious software including
Trojan horses and viruses. Data mining is also being applied to provide solutions such as
intrusion detection and auditing. In this paper we will focus mainly on data mining for
cyber security applications. Data mining is one of the four detection methods used today
for detecting malware. The other three are scanning, activity monitoring, and integrity
checking. When building a security app, developers use data mining methods to improve
the speed and quality of malware detection as well as to increase the number of detected
zero-day attacks.
There are five strategies for detecting malware:
• Anomaly detection
• Misuse detection
• Hybrid detection
• Text classification technique
• Cluster based technique

4.1 ANOMALY DETECTION
Anomaly detection (also outlier detection) is the identification of items, events or
observations that differ significantly from the remaining data. Anomalies are also
referred to as outliers, deviants or abnormalities in the data mining and statistics
literature. In most situations, the data is created by one or more generating processes,
which may reflect activity in the system or observations collected about entities.
Figure 4.1 Anomaly Detection
4.1.1 Anomaly detection techniques
• Supervised Anomaly Detection: Techniques of this kind assume that a training data set
with accurate and representative labels for both normal and anomalous instances is
available. In such cases, the usual approach is to develop a predictive model for the
normal and anomalous classes.
• Unsupervised Anomaly Detection: These techniques do not need a labeled training data set
and are thus the most widely used. Unsupervised anomaly detection methods "pretend" that
the entire data set contains the normal class, develop a model of the normal data, and
regard deviations from that normal model as anomalies.
• Semi-Supervised Anomaly Detection: Techniques of this kind assume that the training data
has labeled instances for just the normal class. Since they do not require labels for the
anomaly class, they are more widely applicable than supervised techniques.
HTTrack allows users to download World Wide Web sites from the Internet to a local
computer. By default, HTTrack arranges the downloaded site by the original site's
relative link structure. Here, HTTrack is used to download to our local system the web
site that we are going to attack. The experiment then proceeds as follows:
• Get All Request: collects all user requests, recording the IP address, the request and
the time for each user who accessed the web site, and how many times.
• Perform Training: opens the access log file and stores the observed IP, request and
timestamp in the trained.dat file.
• Attack on Web Document: opens the file that contains the URL we are going to attack.
• Perform Testing: opens the access log file and stores the observed IP and request in a
file.
• Calculate Difference: opens the file; the test output is then created and verified
against am_test.dat, after which the document rank is calculated and the result is shown.
• Software for anomaly detection

ELKI is an open-source Java data mining toolkit that contains several anomaly detection
algorithms, as well as index acceleration for them.
Figure 5.2 Real World use Cases of Anomaly Detection

4.1.2 WORKING OF ANOMALY DETECTION

Result of Anomaly Detection: Across all these applications, the data has a "normal" model,
and anomalies are recognized as deviations from this normal model. The output of anomaly
detection can be split into two types:
• Anomaly scores: many anomaly detection algorithms output a score quantifying the level
of "outlierness" of each data point. This kind of output can contain a variety of
parameters related to the data point.
• Binary labels: a binary label indicates whether a data point is an anomaly or not.
Although some anomaly detection algorithms return binary labels directly, outlier scores
can also be converted into binary labels. A binary label contains less information than a
scoring system. However, it is the final result that is usually needed for decision making.
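The relationship between the two output types can be illustrated with a short sketch. The z-score used as the anomaly score, the threshold of 2.0, and the request counts are assumptions picked for the example; real detectors use many other scoring functions.

```python
import statistics


def anomaly_scores(values):
    """Score each value by its absolute z-score against the whole data set."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values identical: nothing deviates from the "normal" model.
        return [0.0 for _ in values]
    return [abs(v - mean) / stdev for v in values]


def to_binary_labels(scores, threshold):
    """Convert scores to binary labels: True marks an anomaly."""
    return [score > threshold for score in scores]


# Hypothetical requests-per-minute counts with one burst.
requests_per_minute = [20, 22, 19, 21, 23, 20, 180, 22]
scores = anomaly_scores(requests_per_minute)
labels = to_binary_labels(scores, threshold=2.0)
```

The scores keep the information about how far each point deviates, while the labels keep only the decision, which is why the threshold has to be chosen with the application in mind.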
4.2 MISUSE DETECTION

Misuse detection, also called signature detection, is an approach in which attack patterns
or unauthorized and suspicious behaviors are learned from past activities, and the
knowledge of the learned patterns is then used to detect or predict subsequent similar
patterns in a network.
4.2.1 MISUSE DETECTION TECHNIQUE
The term "misuse" is herein defined in a broad sense as the use or behavior of a network
environment in any way that is not consistent with the system's expected functionality, as
perceived by the provider of the network service. Misuse detection is also sometimes
referred to as signature-based detection because alarms are generated based on specific
attack signatures. This work focuses on the detection of such misuse events. The misuse is
often unauthorized access to the system or use of the system in an unauthorized way. In
this case, the protection mechanism that detects such misuse is called an Intrusion
Detection System (IDS).

Figure 4.3 Misuse Detection Systems with Pattern Matching
4.2.2 WORKING OF MISUSE DETECTION TECHNIQUE

Input: Measurements from network traffic data and a threshold value for similarity
Output: Detected or null
Assumptions:
• The parameters for network intrusion are assumed, and they form the basis for the input.
• A trained normal data set exists (in the experiment conducted, the data from one time
period is chosen as the normal trained set).
Step 1: Identify and collect relevant data from network traffic.
Step 2: Convert the quantitative features of the data in step 1 into fuzzy sets.
Step 3: Define a membership function for each fuzzy variable.
Step 4: Apply a genetic algorithm to identify the best set of rules.
Step 5: For each of the rules identified in step 4 do
• Apply the fuzzy association rule algorithm to mine the correlation among them
• Apply the fuzzy frequency algorithm to mine sequential patterns
Step 6: For each test case, generate new patterns using the fuzzy association algorithm for
the same parameters.
Step 7: For each new pattern, compare it with the normal patterns created from the training
data for similarity.
Step 8: If the similarity > the threshold value, then report “Detected” and the pattern.
Because misuse detection relies on signatures of known attacks, it cannot recognize attacks
for which no signature exists. As a result, computer systems protected solely by misuse
detection systems face the risk of being compromised without the attacks being detected.
In addition, because attacks must be represented explicitly, misuse detection requires the
nature of the attacks to be well understood. This implies that human experts must work on
the analysis and representation of attacks, which is usually time consuming and error prone.
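Steps 2 and 3 of the procedure above, converting a quantitative traffic feature into fuzzy sets through membership functions, can be sketched as follows. The trapezoidal shape, the three sets "low", "medium" and "high", and their breakpoints are hypothetical; the source does not specify which membership functions were used.

```python
import math


def trapezoid(x, a, b, c, d):
    """Membership rises on [a, b], equals 1 on [b, c], and falls on [c, d]."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Hypothetical fuzzy sets for a feature such as "connections per second".
FUZZY_SETS = {
    "low": (0.0, 0.0, 20.0, 50.0),
    "medium": (20.0, 50.0, 50.0, 80.0),
    "high": (50.0, 80.0, math.inf, math.inf),
}


def fuzzify(value):
    """Map one quantitative measurement to its degree of membership in each set."""
    return {name: trapezoid(value, *params) for name, params in FUZZY_SETS.items()}


memberships = fuzzify(35.0)
```

A value of 35 belongs partly to "low" and partly to "medium", which is the property the later fuzzy association rule mining relies on.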
4.3 HYBRID INTRUSION DETECTION SYSTEM (H-IDS)
The H-IDS designed within this paper is based on an original approach, in which the outputs
of an anomaly-based detector and a signature-based detector are collected. The parameters
of the detectors are controlled by a centralized node, referred to as the hybrid
detection engine (HDE). The design goal of this intrusion detection system is to enhance the
overall performance of DDoS attack detection by shortening the detection delay while
increasing the detection accuracy.
Figure 4.4 Hybrid Intrusion Detection (H-IDS)
The block diagram of the proposed H-IDS is shown in Figure 4.4. As can be seen from this
figure, the observed data, containing normal traffic and DDoS attacks, is processed to
extract features; the processed data is then passed to the signature-based and
anomaly-based detector blocks to detect attacks. The outputs of these detectors are examined
by a decision combiner, and an alarm is produced according to a sensitivity parameter.
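One way a decision combiner might work is sketched below. The combination rule (an equal-weight average compared against one minus the sensitivity) is purely an assumption for illustration; the text does not describe the rule used by the H-IDS, and the function name and value ranges are hypothetical.

```python
def combine(signature_hit, anomaly_score, sensitivity):
    """Decide whether to raise an alarm from the two detector outputs.

    signature_hit: bool, output of the signature-based detector.
    anomaly_score: float in [0, 1], output of the anomaly-based detector.
    sensitivity:   float in [0, 1]; higher values raise alarms more readily.
    """
    # Assumed rule: a signature match counts as a full-strength vote and
    # the two votes are averaged with equal weight.
    signature_vote = 1.0 if signature_hit else 0.0
    combined = 0.5 * signature_vote + 0.5 * anomaly_score
    return combined >= 1.0 - sensitivity


alarm_both = combine(signature_hit=True, anomaly_score=0.9, sensitivity=0.5)
alarm_quiet = combine(signature_hit=False, anomaly_score=0.2, sensitivity=0.5)
alarm_sensitive = combine(signature_hit=False, anomaly_score=0.8, sensitivity=0.7)
```

Raising the sensitivity lets an anomaly score alone trigger an alarm, which shortens detection delay at the cost of more false alarms.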
4.3.1 WORKING OF HYBRID INTRUSION DETECTION SYSTEM (H-IDS)
For this work, we decided to use log patterns within a Linux operating system as our basis
for data collection. After creating an Ubuntu virtual machine, we allowed the system to
run for a certain amount of time with no attacks and stored the log file. Afterwards we
performed simulated Internet Control Message Protocol (ICMP) flood attacks, which disable a
computer by sending large amounts of “pings”. We also conducted brute-force password
attacks. We formatted the normal log and attack log files (see Figure 1 for a sample) to run
through our data mining tool. The normal log is used to define normal activities in a
normal-log database so that the pattern generator will ignore them as findings. Any
patterns that do not appear in the normal logs but do appear in the attack logs are sent to
the user for analysis and, if confirmed, are placed in the attack database.
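The comparison between the normal log and the attack log can be sketched as a set difference over log patterns. The log lines, the digit-masking rule used to turn a line into a pattern, and the function names are assumptions for illustration; they are not the format or tool used in the experiment.

```python
import re
from collections import Counter


def to_pattern(line):
    """Reduce a log line to a pattern by masking digits (times, PIDs, IPs)."""
    return re.sub(r"\d+", "#", line.strip())


def new_patterns(normal_lines, attack_lines):
    """Count patterns that appear in the attack log but never in the normal log."""
    normal = {to_pattern(line) for line in normal_lines}
    counts = Counter(to_pattern(line) for line in attack_lines)
    return {pattern: n for pattern, n in counts.items() if pattern not in normal}


# Hypothetical log excerpts.
normal_log = [
    "Jan 10 10:00:01 host sshd[811]: Accepted password for alice",
    "Jan 10 10:05:12 host cron[902]: job started",
]
attack_log = [
    "Jan 11 02:00:01 host sshd[1200]: Failed password for root",
    "Jan 11 02:00:02 host sshd[1201]: Failed password for root",
    "Jan 11 02:03:00 host cron[1300]: job started",
]
findings = new_patterns(normal_log, attack_log)
```

Patterns already known from the normal log are ignored, so only the repeated failed logins remain as candidates for the attack database.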
4.4 TEXT CLASSIFICATION TECHNIQUE

In this research, the author created a text classification system to detect which
documents contain information related to cyber terrorism. In addition, this research
focuses on text classification analysis that can be utilized in Web mining. The analysis
includes a performance comparison of Naïve Bayes, Nearest Neighbor, Support Vector Machine
(SVM), Decision Tree, and Multilayer Perceptron neural network classifiers in the context of
cyber terrorism.
4.4.1 WORKING OF TEXT CLASSIFICATION TECHNIQUE
Data acquisition is the first phase of this experiment. The data sets used in this
research study consist of English text documents downloaded manually from the Internet. For
the experiment, the author chose the holdout method for performing the text classification.
Thus, the data was separated into a training set and a test set, distributed as follows:
Training Set: 400 samples (200 Cyber Terrorism samples + 200 Non-Cyber Terrorism samples)
Test Set: 200 samples (100 Cyber Terrorism samples + 100 Non-Cyber Terrorism samples)
After the documents are collected as the data set, there are three main phases within this
research: text pre-processing, training, and classification. In the text pre-processing
phase the author conducts:
Tokenization: the process of handling a text document by breaking its stream of characters
into words or, more precisely, tokens.
Feature Selection: the author studied and created a list of terms (dictionary) related to
cyber terrorism. In addition, the author applied the Best First algorithm in order to
compare the performance of the classifiers.
Vector Generation: there are two types of vector representation in this research, the term
frequency vector and the binary vector. The results of the classifiers are then compared
in order to find out which algorithm performs best for this research topic.
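The pre-processing phase can be illustrated with a small sketch that tokenizes a document and builds both vector representations over a term dictionary. The dictionary terms, the tokenization rule, and the sample sentence are hypothetical; the source does not list the terms the author selected.

```python
import re
from collections import Counter

# Hypothetical dictionary of terms; the real list is not given in the text.
DICTIONARY = ["attack", "malware", "network", "exploit"]


def tokenize(text):
    """Break a stream of characters into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency_vector(tokens):
    """One entry per dictionary term: how often the term occurs."""
    counts = Counter(tokens)
    return [counts[term] for term in DICTIONARY]


def binary_vector(tokens):
    """One entry per dictionary term: 1 if the term occurs at all, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in DICTIONARY]


tokens = tokenize("The attack spread malware; the network attack continued.")
tf_vector = term_frequency_vector(tokens)
bin_vector = binary_vector(tokens)
```

Either vector can then be passed to a classifier; the term frequency vector keeps repetition counts that the binary vector discards.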
4.5 CLUSTERING BASED TECHNIQUE
To understand the topology of cyber terrorist networks and discover their methods of
operation, their sub-committees or cells should first be identified using cyber community
detection approaches. This would assist investigators in extracting valuable knowledge
about the structures and strategies of cyber terrorist groups from a vast amount of
gathered data. The efficient use of organizational data contributes considerably to
developing the network map that describes the structure of the cyber terrorist group, as
well as to understanding individuals' roles within the group. In addition, detecting some
actors (nodes) in every subgroup of the cyber terrorist group would support the clustering
process. This process aims to group a set of objects that share characteristics and meet
the same criteria into groups called clusters.

4.5.1 WORKING OF CLUSTER BASED TECHNIQUE
• One line of work introduces a theoretical framework with a twofold aim: first, to
discover criminal organizations in networks derived from phone call records and, second,
to help investigators analyze their structure and discover hidden roles and
relationships. This work thus contributes to terrorist network visualization from
mobile phone calls and to community detection and analysis. The authors in [12]
proposed a novel framework that aims to classify and predict the activities of terrorist
groups. It is based on four classifiers, namely naive Bayes (NB), K-nearest neighbor
(KNN), Iterative Dichotomiser 3 (ID3) and decision stump (DS). To combine these
classifiers, the authors applied a majority-vote-based ensemble technique.
• Another approach uses clustering to detect the closeness of a citizen to the concept of
terrorism, or to a terrorist, based on specific personality characteristics, in
particular social, intellectual and regional orientations. In another recent work, the
authors introduced a technique for terrorist network destabilization based on two main
steps: first, community detection and, second, the analysis of these groups through the
identification of key actors (nodes) and links.
• We focused on partitioning the data set into three main sets, including the training
set, which is used in the automatic learning of the CECM algorithm. This data set must
contain prior knowledge about some terrorist sub-groups, to be used in identifying
similarities with them in the clustering process. CECM extends the ECM algorithm
by applying Must-Link and Cannot-Link constraints to define similarities and
dissimilarities respectively.

Figure 4.5 Cluster Based Technique
• The Evidential C-Means (ECM) algorithm, which is based on belief functions and on the
notion of a credal partition, is used to detect potential terrorist sub-communities. This
algorithm is flexible and offers the possibility of computing, for each cluster
(community), a set of individuals that certainly belong to it and a set of others that
possibly belong to it. Two main constraints, Must-link and Cannot-link, have been used to
enhance the clustering process. Based on these two constraints, a new version of
the algorithm, named constrained Evidential C-Means (CECM), is proposed and applied to
this goal, with an accuracy of around 86%. To the best of our knowledge, both ECM and
CECM have been applied to traditional data sets and not to graphs, so an enhanced version
of CECM that detects communities in a given network is a promising direction for our
work. Indeed, our approach can effectively exploit relationships between individuals in
the JJATT and can be refined by a combined link mining approach that takes into account
the semantics of the links between nodes, to make the clustering process more effective.
Another interesting direction is to detect potential key players within sub-communities
using SNA measures and other data mining techniques. Finally, the algorithm used cannot
efficiently handle large-scale networks, so a new version that enables the processing of
complex graphs would be an interesting contribution.

CHAPTER 5
IMPLEMENTATION ADVANTAGES AND DISADVANTAGES
5.1 IMPLEMENTATION
In this research paper, I have shown how data mining techniques can be used to identify
cyber-attacks. My focus is on “finding patterns” in a log file (the record of events that
occur in the system), which shows the sequence of events. From this log file I identify
patterns. To start with, I use the clustering technique to discover one type of cyber-crime,
Denial of Service (DoS) attacks. Clustering is the grouping of data that has similar
features, so this grouping helps to discover similar patterns of data that occur
repeatedly in the log file.
Step 1: Evaluate the log file.
Step 2: Mine the date with time.
Step 3: Scan the data.
Step 4: Add the found data to the main file.
When the above procedure is carried out, we record data that contains both normal patterns
and abnormal (malicious) patterns. By using the clustering technique we identify the data
that occurs repeatedly [9].
System Configuration: To run our obtained data, we use Windows Server to maintain the
database. Initially we run the data that contains zero attacks and add it to the master
file or log file. The ICMP (Internet Control Message Protocol) attack makes the system
unresponsive by sending a voluminous number of “ping” commands. The data that contains the
normal activities and the data that contains attacks are then passed through the technique
that we have proposed. If the observations in the log file show normal behavior, they are
ignored. If the observations show multiple requests for the same transaction, the data is
passed through the “Apriori” algorithm and shown in the attack logs. Before treating a
pattern as an attack, the algorithm checks whether similar patterns of requests exist in
the normal records. If the algorithm finds the pattern and/or finds that the number of
requests for the same transaction exceeds the threshold value, it is considered an attack,
and a signal or message about the suspected attack is sent to the administrator.

Figure 5.1 Sample log File
In Figure 5.1 above, we can see a DoS attack made by an anonymous user (intruder) who first
gains access to the system (server) by posing as an authenticated user. In a denial of
service attack, the attacker gains access through vulnerabilities present in the system,
copies a message sent by an authenticated user, makes multiple copies of the same request
or query, and sends them to the server. The server therefore processes the same query or
request multiple times and is kept busy doing so. This is called a denial of service attack.
Another example is the “ping” attack, where multiple ping requests are sent from one user
or multiple users and the server is again overloaded with processing the same request. This
type of attack is severe. We apply data mining techniques to identify these types of
attacks by finding similar patterns of requests from the users. In our approach, we define
a minimum support threshold of 5. If the server receives the same request more times than
the threshold value, the request is treated as an attack and the administrator is notified.
The threshold value can be adjusted to suit the working environment.
Procedures:
Step 1: Start
Step 2: Let Count = 0 and set the threshold value. The threshold value can be set based on
the working environment.
Step 3: Check whether the count of matched rules has crossed the threshold value.
If true, notify the administrator, treating it as an attack.
If false, continue.
Step 4: Check whether a new event has been recorded in the log file.
If no new event is found, wait.
If an event is found, go to step 2.
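The counting idea behind this procedure can be sketched as follows, using the minimum support threshold of 5 given in the text. The event format, the function name monitor, and the sample events are assumptions; the sketch processes a finished list of events rather than waiting on a live log, and it keeps a running count per request instead of resetting the count for each event.

```python
from collections import Counter

THRESHOLD = 5  # minimum support threshold stated in the text


def monitor(events, threshold=THRESHOLD):
    """Yield each request once, at the moment its count exceeds the threshold."""
    counts = Counter()
    for event in events:
        counts[event] += 1
        # "More than the threshold" is first true at threshold + 1 occurrences.
        if counts[event] == threshold + 1:
            yield event


# Hypothetical events as (source address, request) pairs.
events = [("10.0.0.7", "PING")] * 6 + [("10.0.0.9", "GET /index.html")] * 2
alerts = list(monitor(events))
```

In a deployment, each yielded event would trigger the notification to the administrator described above.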
5.2 ADVANTAGES AND DISADVANTAGES

5.2.1 ADVANTAGES
Using data mining in cyber security lets you
• Process large datasets faster.
• Create a unique and effective model for each particular use case.
• Apply certain data mining techniques to detect zero-day attacks.
It was originally published on https://www.apriorit.com/

5.2.2 DISADVANTAGES
While this list of benefits is impressive, there are also certain drawbacks
you need to know about:
• Data mining is complex, resource-intensive, and expensive
• Building an appropriate classifier may be a challenge
• Potentially malicious files need to be inspected manually
• Classifiers need to be constantly updated to include samples of new malware
• There are certain data mining security issues, including the risk of unauthorized
disclosure of sensitive information
Data mining helps you quickly analyze huge datasets and automatically discover hidden
patterns, which is crucial when it comes to creating an effective anti-malware solution
that’s able to detect previously unknown threats. However, the final result of using data
mining methods always depends on the quality of the data you use.
When using data mining in cyber security, it’s crucial to use only quality data. However,
preparing databases for analysis requires a lot of time, effort, and resources. You need to
clear all your records of duplicate, false, and incomplete information before working with
them. Lack of information or the presence of duplicate records or errors can significantly
decrease the effectiveness of complex data mining techniques. Only accurate and
complete data can ensure high-quality analysis.

CHAPTER 6
CONCLUSIONS
In terms of future work, this study suggests continuing the work on mining security risks
in massive datasets. For instance, this work examined only deep-link hijacking risks. Since
deep links could be used to attack mobile browsers or applications, this paper suggests
working on the detection of vulnerable communication applications and malicious deep links
on the web. Additionally, the methods carried out need to be refined to make them more
applicable to diverse real-world apps, for instance by updating the existing data-leak
detection scheme so that the procedure can handle streaming network traffic.
It can be concluded that this research has developed a proof of concept of a methodology
to detect documents that contain information related to cyber terrorism, using text
classification techniques on English textual documents. In addition, applying the
feature selection method known as the Best First algorithm avoids computational expense
and cuts execution time without decreasing the performance of the classifiers, and it can
even improve their accuracy.
Lastly, a comparison of the results of each classifier shows that the Support Vector Machine
algorithm performs best, achieving 100% accuracy with the term-frequency
representation and feature selection. This result demonstrates the capability of the Support
Vector Machine in high-dimensional input spaces. Future work in relation to this
research is the use of TF-IDF (Term Frequency-Inverse Document Frequency) as the
vector representation.
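To make the difference between the term-frequency representation and TF-IDF concrete, the following is a minimal Python sketch. It assumes one common TF-IDF variant (term count divided by document length, multiplied by log(N / document frequency)); the study does not state which variant it intends, and the token lists are invented purely for illustration.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each term within one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def tf_idf(documents):
    """Weight each term by tf * log(N / df), one dict per document.

    This is one common variant, chosen here as an assumption.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in documents:
        tf = term_frequency(tokens)
        vectors.append(
            {term: weight * math.log(n_docs / doc_freq[term])
             for term, weight in tf.items()}
        )
    return vectors


if __name__ == "__main__":
    # Invented example documents, already tokenized.
    docs = [["attack", "network", "attack"], ["attack", "report"]]
    for vector in tf_idf(docs):
        print(vector)
```

A term that appears in every document receives a weight of zero under this variant, which is the intended effect: TF-IDF downweights words that do not help distinguish one document from another, whereas plain term frequency does not.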

CHAPTER 7
TEXT REFERENCES
[1]. Bhavani Thuraisingham, Latifur Khan, Mohammad M. Masud, and Kevin W. Hamlen.
Data Mining for Security Applications.
[2]. Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules
between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD
International Conference on Management of Data.
[3]. Daniel Barbara and Sushil Jajodia, editors. Applications of Data Mining in
Computer Security. Kluwer Academic Publishers.
[4]. Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and J. Sander. LOF:
identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD
International Conference on Management of Data.
[5]. Varun Chandola and Vipin Kumar. Summarization - compressing data into an
informative representation. In Fifth IEEE International Conference on Data Mining.
WEB REFERENCES
[6]. https://ieeexplore.ieee.org/document/5946881
[7]. https://sci-hub.se/
[8]. https://www.apriorit.com/dev-blog/527-data-mining-cyber-security#:~:text=Data%20mining%20has%20great%20potential,known%20and%20zero%2Dday%20attacks.
[9]. https://www.cs.odu.edu/~mukka/cs795sum10dm/Lecturenotes/Day7/Barbara%20Jajodia%20Data%20Mining%20Book.pdf

