
2016 1st International Conference on Innovation and Challenges in Cyber Security (ICICCS 2016)

Data-Mining a Mechanism against Cyber Threats: A Review
Shipra Ravi Kumar, Computer Science & Engineering, shipra.chaudhary85@gmail.com
Prof. J.S. Jassi, Mechanical Engineering, jsjassi@gn.amity.edu
Suman Avdhesh Yadav, Information Technology, sumanavdheshyadav@gmail.com
Ravi Sharma, Information Tech., rsharma@gn.amity.edu

978-1-5090-2084-3/16/$31.00 ©2016 IEEE

Abstract - Data mining is the process of analyzing data from different perspectives and summarizing it into useful information that can be used, for example, to increase revenue or cut costs. Cluster formation plays a vital role in data mining: the data are divided into groups, and clustering is the technique in which that grouping is based on the similarity of the data with respect to different attributes. WEKA is a widely used data mining tool for allocating and clustering data with a variety of machine learning algorithms. The purpose of this paper is to compare different machine learning algorithms with respect to the type of data set, its size, the number of clusters and the cyber privacy platform. We also discuss different types of cyber threats in the computing world.

Keywords - MAP, WEKA, Cyber Threats, Intrusion Detection

I. INTRODUCTION

Clustering can be defined as the collection of objects according to the similarities of their attributes. Each such collection is called a cluster. Clustering can be used for pattern recognition, bioinformatics, information retrieval and statistical data analysis.

Different algorithms can be used for clustering; they have different properties and suit different circumstances. Some of the notions on which clustering algorithms are based are the distance between data points, dense areas of the data space and statistical distributions. Suitable parameters, such as the distance function (for example Euclidean or Minkowski distance) or the threshold, depend on the data set. Clustering is an automatic process, but it sometimes requires changes to the preprocessing.

According to Vladimir Estivill-Castro, there are several types of clusters that cannot be generalized, so various machine learning algorithms are used depending on the type of cluster [1]. Different researchers have proposed different models for different types of data sets, supported by various machine learning algorithms. As a result, the clusters found by different algorithms vary significantly in their properties.

There are three types of clustering algorithms in data mining that preserve security: the hierarchical clustering algorithm, the Expectation-Maximization (EM) algorithm and the K-means algorithm.

II. K – MEANS CLUSTERING ALGORITHM

K-means is a form of cluster analysis in which n observations are divided into K clusters so that each observation belongs to the cluster with the nearest mean. This partitions the data space into Voronoi cells.

K-means is considered one of the simplest learning algorithms that solve the well-known clustering problem. It classifies a data set into a number of clusters fixed a priori. One centroid is defined for each cluster, and the centroids should be placed carefully because different starting positions lead to different results. The first priority is therefore to place them as far from one another as possible. The next step takes every point in the data set and associates it with the nearest centroid. When no point is left unassigned, this step is complete and an initial grouping has been obtained.

K new centroids are then computed from the clusters produced by the previous step, and each data point is again assigned to its nearest new centroid, which creates a loop. As the loop runs, the K centroids change position step by step until no more changes occur, that is, until the centroids no longer move. The procedure can be described in the following steps:
1. Place K points into the space represented by the objects to be clustered. These points represent the initial group centroids.
2. Assign each object to the group with the nearest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.
This separates the objects into groups from which the metric to be minimized can be calculated. The problem is NP-hard, so it is computationally difficult to solve exactly; the commonly used heuristic algorithms, however, converge quickly to a local optimum. These heuristics resemble the expectation-maximization algorithm for mixtures of Gaussian distributions in that both use an iterative refinement approach. Both techniques also use cluster centres to model the data; K-means, however, tends to find clusters of comparable spatial extent, whereas expectation-maximization allows clusters to have different shapes.

III. EM ALGORITHM FOR PRIVACY

An EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved hidden variables. The EM algorithm is applied once the K-means algorithm has produced a satisfactory result. Each iteration alternates between an expectation (E) step, which evaluates the expectation of the log-likelihood using the latest parameter estimates, and a maximization (M) step, which computes the parameters that maximize this expected log-likelihood. EM assigns to each instance a probability distribution that indicates the probability of it belonging to each cluster. The number of clusters to generate can be determined by cross-validation or specified a priori.

IV. HIERARCHICAL CLUSTERING

In this method clusters are built up in a hierarchy and can be analyzed sequentially. Hierarchical clustering falls into two categories: divisive and agglomerative.

Agglomerative (bottom up) - Agglomerative clustering applies a bottom-up approach: clusters are paired up hierarchically, one merge at a time.
1. We start with each point in its own cluster, known as a singleton.
2. Two or more clusters are then merged recursively as one moves up the hierarchy.
The method terminates when the merging has reduced the data to K clusters.

Divisive (top down) - Divisive clustering applies a top-down approach: starting from a single cluster, clusters are split recursively, one at a time, down the hierarchy. In general, both the splits and the merges are determined in a greedy manner. Agglomerative clustering becomes slow on large data sets.
1. We start with one big cluster.
2. Large clusters are then partitioned into smaller ones, one at a time.
The process terminates when K clusters have been obtained.

In hierarchical clustering, a data set of N items to be clustered is given, and an N×N distance matrix is prepared from the distances between the data points.
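As an illustration, the agglomerative procedure (singletons merged over an N×N distance matrix until K clusters remain) can be sketched in Python as follows. This is a minimal sketch and not the implementation compared in this paper: the function names distance_matrix and agglomerative, the toy points, the Euclidean distance and the single-linkage rule (the distance between two clusters is the smallest distance between their members) are all assumptions made for the example.

```python
import math


def distance_matrix(points):
    """Return the N x N matrix of Euclidean distances between the points."""
    return [[math.dist(p, q) for q in points] for p in points]


def agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain.

    Single linkage is an assumption of this sketch: the distance between
    two clusters is the smallest distance between any two of their members.
    """
    dist = distance_matrix(points)
    # Step 1: every point starts as its own singleton cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None  # (distance, index of first cluster, index of second)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Step 2: merge the closest pair, moving one level up the hierarchy.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Clusters are reported as sorted lists of point indices.
    return [sorted(c) for c in clusters]


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters = agglomerative(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. The nested search over all cluster pairs at every merge is the reason agglomerative clustering becomes slow on large data sets.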

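The four K-means steps listed in Section II can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the WEKA implementation: the function name kmeans, the max_iter safeguard, the Euclidean distance and the choice of the first K points as initial centroids are introduced here only for illustration.

```python
import math


def kmeans(points, k, max_iter=100):
    """Heuristic K-means on a list of equal-length numeric tuples."""
    # Step 1: place K initial centroids. Taking the first k points is an
    # assumption made for reproducibility, not a recommended strategy.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

Because the result depends on where the initial centroids are placed, a practical run would normally repeat the procedure from several starting positions and keep the grouping with the smallest total distance to the centroids.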

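The alternation between the E-step and the M-step described in Section III can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a one-dimensional mixture of Gaussian distributions is assumed as the statistical model, and the function names normal_pdf and em_gaussian_mixture, the starting parameters, the fixed iteration count and the variance floor are introduced here only for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture(data, means, variances, weights, n_iter=50):
    """Run a fixed number of EM iterations for a 1-D Gaussian mixture."""
    # Work on copies so the caller's starting values are not modified.
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    memberships = []
    for _ in range(n_iter):
        # E-step: using the latest parameter estimates, compute for each
        # instance the probability of belonging to each cluster.
        memberships = []
        for x in data:
            joint = [weights[c] * normal_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(joint)
            memberships.append([j / total for j in joint])
        # M-step: re-estimate the parameters that maximize the expected
        # log-likelihood under those membership probabilities.
        for c in range(k):
            n_c = sum(m[c] for m in memberships)
            means[c] = sum(m[c] * x for m, x in zip(memberships, data)) / n_c
            spread = sum(m[c] * (x - means[c]) ** 2 for m, x in zip(memberships, data)) / n_c
            variances[c] = max(spread, 1e-6)  # floor is an assumption of this sketch
            weights[c] = n_c / len(data)
    return means, variances, weights, memberships


# Hypothetical toy data drawn around two values, with rough starting guesses.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
means, variances, weights, memberships = em_gaussian_mixture(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

Each row of memberships is the probability distribution over clusters for one instance, which is the soft assignment that distinguishes EM from the hard assignment of K-means. The sketch does not guard against a cluster that receives no probability mass; a robust implementation would need to.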
Table 1: Comparison of Algorithms for Security Applications in Data Mining

V. ISSUES IN CYBER SECURITY

Cyber terrorism is discussed here in relation to the spoofing of confidential information, which can happen through a security breach and access by an unauthorized user. Malicious software and viruses such as Trojan horses are behind these security violations, which can lead to antisocial activities in the world of cyber crime.

Cyber security also includes applications that analyze data in order to audit computer applications. We can build a data warehouse containing the audit data and then use existing data mining tools to determine whether potential anomalies are present.

Data mining techniques can be used to restrict confidential information to legitimate users and to stop unauthorized access. Data mining can be used effectively to detect and prevent cyber attacks, but it also aggravates security issues such as privacy and inference. The security model is shown in Fig. 1.

Fig 1: Security Model in Data Mining

VI. MALICIOUS CODE AND INTRUSION DETECTION

An intrusion can be described as an unauthorized attack on the availability of resources, on integrity or on data confidentiality. These attacks fall into two types: network-based attacks and host-based attacks. In a host-based attack, a particular system is targeted and the attacker attempts to gain unauthorized access to that machine. Detection schemes for such attacks primarily use simple routines that obtain system call data from an audit process, which tracks the system calls performed for every user.

The other type is the network-based attack, which prevents authorized users from using the existing network services in a meaningful way. Such attacks can be detected by using network traffic data and continuously monitoring the traffic addressed to the system nodes. Detection systems fall into two groups: misuse detection systems and anomaly detection systems.

VII. MALICIOUS INTRUSIONS IN DATA MINING

The targets of malicious intrusions include servers, web clients, operating systems, networks and databases. Most cyber attacks and acts of cyber terrorism happen because of malicious intrusions. In a malicious intrusion, someone without authorization tries to break into a secure network and obtain confidential information. The intruder may be a human or malicious automated software, such as a robot written by a human. When discussing cyber attacks or malicious intrusions, it is often useful to draw analogies with the non-cyber world, that is, with terrorism unrelated to computing, and to map those attacks onto computers and networks. Cyber terror is increasing day by day worldwide, as shown in Fig. 2.

Fig 2: Graph of Increasing Cyber Terror Worldwide.

VIII. EXTERNAL ATTACKS, INSIDER THREATS AND CYBER-TERRORISM

Cyber attacks are a major concern today. The cyber threat is increasing day by day, helped by the information available on the Internet.

Cyber threats and cyber attacks on existing networks and computer infrastructure can disrupt business; it is estimated that cyber terrorism can cause millions of dollars in damage. Cyber threats can come from inside or outside the organization. An attack on a computer by someone from outside the organization is known as an external cyber attack. In such an attack, hackers break into the system and cause chaos in the organization.

CONCLUSION:

We can conclude that data mining used for intrusive purposes makes it possible to identify confidential data. The clustering techniques described above could be used to protect cyber security against malicious software. The EM algorithm ensures privacy without compromising the accuracy of the results with respect to computation and communication cost. For real-world data, EM generally gives appropriate results.

Data mining for cyber security and national security is a very wide and active area of research. Other data mining techniques, such as association rule mining and link analysis, can be used to identify abnormal patterns. Data mining lets users make all kinds of correlations, which leads to privacy concerns.

REFERENCES

[1] Masud, M. M., Khan, L., Thuraisingham, B., Wang, X., Liu, P., and Zhu, S., “A Data Mining Technique to Detect Remote Exploits”, In Proc. IFIP WG 11.9 International Conference on Digital Forensics, Japan, Jan 27-30, 2008.
[2] Khan, L., Awad, M., and Thuraisingham, B., “A New Intrusion Detection System using Support Vector Machines and Hierarchical Clustering”, The VLDB Journal: ACM/Springer-Verlag, 16(1), page 507-521, 2007.
[3] Lazarevic, A., et al., “Data Mining for Computer Security Applications”, Tutorial Proc. IEEE Data Mining Conference, 2003.
[4] Abedin, M., Nessa, S., Khan, L., Thuraisingham, B., “Detection and Resolution of Anomalies in Firewall Policy Rules”, In Proc. 20th IFIP WG 11.3 Working Conference on Data and Applications Security (DBSec 2006), Springer-Verlag, July 2006, Sophia Antipolis, France, page 15-29.
[5] S. Hofmeyr, S. Forrest, and A. Somayaji, “Intrusion Detection Using Sequences of System Calls”, Journal of Computer Security, Vol. 6, pp. 151-180 (1998).
[6] Thuraisingham, B., “Database and Applications Security”, CRC Press, 2005.
[7] R. Agrawal and R. Srikant, “Privacy-Preserving Data Mining”, Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.
[8] M. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim, and V. S. Verykios, “Disclosure Limitation of Sensitive Rules”, In Proceedings of 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'99), pp. 45-52, November 1999, Chicago, IL.
[9] Dakshi Agrawal and Charu C. Aggarwal, “On the Design and Quantification of Privacy Preserving Data Mining Algorithms”, in Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2001.
[10] S. Rath, D. Jones, J. Hale, S. Shenoi, “A Tool for Inference Detection and Knowledge Discovery in Databases”, in Proceedings of the 9th IFIP WG11.3 Workshop on Database Security.
