
Tentative title – Detecting interesting patterns

Introduction and problem area


Anomalies or intrusions are activities that violate the security policy of a system. A typical example is a system that automatically detects a compromised or hacked account using anomaly detection algorithms, which learn to distinguish the behavior of the legitimate user from that of an attacker. In other domains, anomaly detection helps find patterns in the data that might be useful in some sense (Prasad et al., 2009). Anomalies are therefore also referred to as outliers, noise, exceptions, or deviations from a standard.
Anomalies are occurrences in the data that are very rare and whose features differ markedly from the norm; they are observations that are inconsistent with the remaining data points (Zerbet and Nikulin, 2003). Outliers can also take the form of a drift or gradual change in the values of a dataset, or of sudden, dramatic changes in the established behavior of the system.
Anomaly detection is usually paired with other protection mechanisms, such as secondary authentication and account access control. It becomes the main line of defense for systems without any other protection, or for systems with bugs that leave accounts vulnerable to hacking.
Anomaly detection compares all activities to “normal” activities, i.e., a user’s usual behavior. This approach has clear advantages: it builds a model of “normal” behavior and detects attacks or suspicious patterns as deviations from that model, and the model can be customized for every user in every system. At the same time, anomaly detection algorithms make the system more complex, which can increase false alarms or lead to missed detections (Koren et al., 2022).
It is evident that anomaly detection cannot replace other security measures, although it can easily complement them. Other challenges that one might face in finding anomalies are:
- Anomalies can look like normal data points unless there is some form of human intervention; the clearest example is hackers adapting their behavior until it is indistinguishable from normal user behavior.
- Established anomalies or outliers can be meaningful at one point in time but lose that significance later, or vice versa; for example, what counts as normal in a business system changes over time due to external factors.
- Detection algorithms have to be adapted to different fields; the medical and business domains, for example, treat outliers in vastly different ways.
- Model training and validation are completely dependent upon the availability of reliable data.
- The boundary between what should be tagged as normal and what should be tagged as an outlier has to be precise, which is not the case for most detection problems. The main consequence is the misclassification of data points as outliers or as normal.

Real-world data or simulated data


Given the quality of the datasets typically available for training models, simulation is becoming an appealing option. Simulated data, along with real-world data, can be used as a valid gauge of competence in multiple domains (Isaak et al., 2018), and simulation is increasingly accepted as an assessment criterion for performance-based tasks. Many professionals consider simulation to be on par with real-world assessment, because activity patterns can be similar in both settings. In the medical field, for example, simulation-based patient assessment has been shown to be on par with the usual oral assessment and portfolio-based evaluation scores (Cook et al., 2014). Resuscitation scores, however, continue to puzzle professionals, as simulation-based evaluations do not always correlate with real-world scores.
Limitations arise when the available datasets are small, which biases the algorithm towards particular aspects or predictions. This is very common in fields where datasets are generated through experimental methods (Datta et al., 2004; McCluney et al., 2007).
Simulation-based model training and outcome prediction create a safe environment in which individual variables can be studied and experimented with, a process that is difficult and time-consuming in real-world situations. But, as the resuscitation example shows, real-world data collection remains important in rare situations (Zimek and Filzmoser, 2018).
Real-world data is also highly dynamic, so a great deal of information is generated in a short time. Prime examples are traffic monitoring, communication systems, production processes, and customer behavior analyses. What counts as an anomalous data point changes rapidly in such settings, which is why unsupervised methods are often the most suitable approach.
Computational methods used to analyze data

Anomaly detection plays a vital role in spotting abnormal changes, drifts, fraudulent activities, and diseases (in medical data), and in handling class imbalances. For this, simple statistical methods such as quantiles, the median, and the mean can be employed, as can visual representation of the data and exploratory data analysis methods.
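As a concrete illustration of the quantile-based approach just mentioned, the following minimal Python sketch (assuming NumPy is available and using a purely synthetic sample) flags values lying beyond the conventional 1.5 × IQR fences around the quartiles.

```python
import numpy as np

# Synthetic sample: mostly "normal" values plus a few injected extremes
# (illustrative data only, not taken from any real dataset).
rng = np.random.default_rng(seed=42)
values = np.concatenate([rng.normal(loc=50, scale=5, size=200), [5.0, 95.0, 120.0]])

# Quantile-based rule: points beyond 1.5 * IQR from the quartiles are flagged.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outlier_mask = (values < lower_fence) | (values > upper_fence)
print(f"Median: {np.median(values):.2f}, fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Flagged {outlier_mask.sum()} of {values.size} points as outliers")
```

The same fences can be read directly off a box plot, which is why this rule pairs naturally with the visual, exploratory methods mentioned above.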
According to the literature, anomaly detection methods fall into three categories: unsupervised, supervised, and semi-supervised techniques (Laskov et al., 2005; Omar et al., 2013; Ruff et al., 2019).
Unsupervised and supervised anomaly detection methods
Unsupervised techniques are the most common methods employed in outlier detection. Here, an unlabeled dataset is used to train the machine learning model, which fits a representation of normal behavior. Data scientists make a crucial assumption: that most data points belong to normal behavior. Data points that do not conform to this norm are then isolated and can be labeled as outliers.
In supervised anomaly detection, researchers use labeled datasets, categorized into normal and abnormal classes, to train a classifier. When a new data point is encountered, the classifier sorts it into one of these two classes.
Of course, both methods have advantages and drawbacks. Supervised methods only work well on very large datasets containing a substantial number of anomalous data points, and since anomalies are rarely encountered, obtaining such datasets is hard. Also, because supervised methods are modeled on specified abnormalities, it is often difficult to train the algorithm for multiple anomalous classes. This is why some professionals prefer unsupervised methods: modeling normal behavior is far easier than modeling the outliers (Laskov et al., 2005).
Nevertheless, the supervised method remains the technique of choice for many researchers. Data scientists work with neatly labeled datasets, so all anomalous data points are known before an in-depth study is performed; the model must still learn during training how to recognize them in new data. Popular algorithms for the supervised method are listed below, followed by a minimal sketch:
- Decision trees
- KNN or k-nearest neighbors
- SVM or support vector machine
- Bayesian networks
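As an illustrative sketch only, assuming scikit-learn and NumPy are installed and using a small synthetic labeled dataset (class 0 = normal, class 1 = anomalous), the snippet below trains one of the listed algorithms, a k-nearest-neighbors classifier, and uses it to sort new points into the two classes; any of the other listed supervised algorithms could be swapped in.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic labeled data: a dense "normal" cluster (label 0) and a few
# scattered anomalies (label 1). Purely illustrative.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
anomalies = rng.uniform(low=4.0, high=8.0, size=(15, 2))
X = np.vstack([normal, anomalies])
y = np.concatenate([np.zeros(len(normal)), np.ones(len(anomalies))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Train the classifier on the labeled examples.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# A new data point is sorted into the "normal" or "anomalous" class.
print("Test accuracy:", clf.score(X_test, y_test))
print("Prediction for a far-away point:", clf.predict([[6.0, 6.0]]))
```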
The unsupervised method works on unlabeled data that contains both normal points and anomalies. In this setting, a set of methods is employed to find structure in the data: the main goal becomes finding clusters and isolating the groups or points that do not belong to the bulk of the data. Popular methods in this category are listed below, followed by a minimal sketch:
- K-means
- One-class SVM
- SOM or self-organizing maps
- EM or expectation maximization
- C-means
- ART or adaptive resonance theory
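The following minimal sketch (again assuming scikit-learn, with purely synthetic, unlabeled data) illustrates the clustering idea using K-means: points that lie unusually far from their nearest cluster centre are treated as candidate outliers. The 2% distance threshold is an arbitrary illustrative choice, not a recommended setting.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled synthetic data: two dense clusters plus a few stray points.
rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(200, 2))
cluster_b = rng.normal(loc=(5.0, 5.0), scale=0.5, size=(200, 2))
strays = rng.uniform(low=-4.0, high=10.0, size=(8, 2))
X = np.vstack([cluster_a, cluster_b, strays])

# Fit K-means, then measure each point's distance to its nearest centre.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
distances = np.min(kmeans.transform(X), axis=1)

# Treat the most distant 2% of points as candidate outliers.
threshold = np.quantile(distances, 0.98)
outlier_indices = np.where(distances > threshold)[0]
print(f"Candidate outliers: {len(outlier_indices)} of {len(X)} points")
```

A one-class SVM or any of the other listed methods could replace K-means here; the common thread is that no labels are used and an "outlier" is simply a point far from the structure found in the bulk of the data.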
In semi-supervised learning, labels exist for both normal and outlier data points, but the training set consists only of normal points, while the testing set contains both classes. Model performance eventually suffers when a small percentage of anomalous data is hidden among the supposedly normal, unlabeled points, and better algorithms have been created to counteract this issue (Villa-Pérez et al., 2021).
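A minimal sketch of this setup, assuming scikit-learn and purely synthetic data: a one-class SVM (standing in here for the many dedicated semi-supervised algorithms) is fitted only on points taken to be normal, and a test set containing both classes is scored afterwards.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic setup: the training set holds only "normal" points,
# while the test set mixes normal points with a few anomalies.
rng = np.random.default_rng(7)
train_normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
test_normal = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
test_anomalies = rng.uniform(low=5.0, high=9.0, size=(5, 2))
X_test = np.vstack([test_normal, test_anomalies])

# Learn the boundary of "normal" behaviour from normal-only training data.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
model.fit(train_normal)

# predict() returns +1 for points inside the learned region, -1 for outliers.
predictions = model.predict(X_test)
print("Flagged as anomalous:", int((predictions == -1).sum()), "of", len(X_test))
```

If a few anomalies contaminate the supposedly normal training set, the learned boundary widens and performance drops, which is exactly the weakness described above.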
As can be seen by now, each type of detection has its advantages. When the data is unlabeled, unsupervised detection works best. When the data has labels, using the labels of both categories of data points can help maximize the performance of the model, which is what supervised and semi-supervised methods do. However, semi-supervised detection is better in various situations, such as when the data is not very balanced or contains multiple subgroups of outlier data points.

Pattern detection, what do anomalies look like in your area?


Machine learning algorithms are especially adept at finding patterns in data and the connections between data points. Essentially, the objective of these algorithms is to classify a data point either as an anomaly or as conforming to the established standard or existing pattern, which helps in finding deviations from that pattern.
Ultimately, the goal of machine learning is to understand the data and draw conclusions from it. Rather than trying to create a system that follows a fixed set of rules, machine learning creates a system that can learn from the data over time and is not limited by the number of possible outcomes (Smith and Martinez, 2011).
Anomalies can hence be classified broadly into the following categories (a short sketch of a contextual anomaly follows the list):
- Contextual anomalies
A data point is anomalous only in a certain context, for example when it is unusual for its position in a periodic or seasonal pattern.
- Collective anomalies
A collection of data points is anomalous when considered as a group. Importantly, the individual points might not look anomalous on their own, but taken together they form an outlying pattern.
- Point anomalies
An individual data point is anomalous when it differs markedly from the other data points, so each object can be studied as an outlier on its own. This is the simplest and most common category.
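To make the contextual category concrete, the short sketch below (assuming pandas and NumPy, with a purely synthetic hourly series) flags a value that sits well inside the global range of the series but deviates strongly from its local, rolling context; the 3-sigma threshold is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Synthetic two-week hourly series with a rising trend and mild noise.
# One value is nudged so it stays inside the global range of the series
# (so it is not a point anomaly) but clearly breaks its local context.
rng = np.random.default_rng(3)
hours = np.arange(24 * 14)
series = pd.Series(10 + hours * (20 / hours[-1]) + rng.normal(0, 0.2, hours.size))
series.iloc[200] += 4.0  # contextual anomaly: ordinary globally, odd locally

# Rolling z-score: compare each point with the mean/std of a centred 24-hour window.
window = 24
rolling_mean = series.rolling(window, center=True).mean()
rolling_std = series.rolling(window, center=True).std()
z_scores = (series - rolling_mean) / rolling_std

flagged = z_scores[z_scores.abs() > 3].index.tolist()
print("Indices flagged as contextual anomalies:", flagged)
```

A collective anomaly would instead require examining a window of points jointly, while a point anomaly could be caught by the simple global fences shown earlier.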
There are now many fields where machine learning and artificial intelligence can be applied to find patterns or anomalies in data. Fraud detection is one such area: fraudulent activities are located by finding anomalous actions made by a user, which the system flags for later human review. Another example is using time series data for making predictions; time series are widely employed for anomaly detection (Aggarwal, 2017). Depending on the business model, this can surface useful anomalies in churn rate, cost per click, website views, active users, app installations, customer retention rate, transaction volume, and so on. Anomaly detection with machine learning algorithms helps in seamlessly correlating data with the performance of an application, product, user experience, or cloud cost management. These algorithms are useful in many domains and are now considered an important aspect of unsupervised machine learning, although anomaly detection was originally devised for intrusion detection systems back in 1986 (Jabez and Muthukumar, 2015). The most common use of identified anomalies or outliers is their removal from the dataset, which increases the accuracy of the model, especially in supervised learning (Smith and Martinez, 2011).

Conclusion 
Anomaly detection methods are essential for drawing vital inferences from data, whether real-world or simulated. This article has explained how anomaly/outlier detection has become indispensable in fields such as medicine and fraud detection, and how handling anomalies also makes model predictions more accurate.
Finding outliers in a dataset is difficult when the analyst cannot establish a precise definition of what an outlier is, which is why so many different methods for classifying data points now exist.
Further studies of anomaly detection techniques are needed, with a simultaneous focus on performance and accuracy. Data scientists are creating new and improved algorithms and statistical techniques for utilizing the anomalies present in natural and synthetic datasets.

References:

Aggarwal, C.C. (2017), “An Introduction to Outlier Analysis”, Outlier Analysis, Springer
International Publishing, pp. 1–34.
Cook, D.A., Zendejas, B., Hamstra, S.J., Hatala, R. and Brydges, R. (2014), “What counts as
validity evidence? Examples and prevalence in a systematic review of simulation-based
assessment”, Advances in Health Sciences Education, Springer Science and Business
Media Netherlands, Vol. 19 No. 2, pp. 233–250.
Datta, V., Bann, S., Beard, J., Mandalia, M. and Darzi, A. (2004), “Comparison of bench test
evaluations of surgical skill with live operating performance assessments”, Journal of the
American College of Surgeons, Vol. 199 No. 4, pp. 603–606.
Isaak, R., Chen, F., et al. (2018), “Validity of Simulation-Based Assessment for Accreditation Council for Graduate Medical Education Milestone Achievement”, available at: https://pubmed.ncbi.nlm.nih.gov/29373383/ (accessed 13 November 2022).
Jabez, J. and Muthukumar, B. (2015), “Intrusion detection system (ids): Anomaly detection
using outlier detection approach”, Procedia Computer Science, Elsevier B.V., Vol. 48 No.
C, pp. 338–346.
Koren, O., Koren, M. and Peretz, O. (2022), “A procedure for anomaly detection and
analysis”, Engineering Applications of Artificial Intelligence, Elsevier Ltd, Vol. 117,
available at: https://doi.org/10.1016/j.engappai.2022.105503.
Laskov, P., Düssel, P., Schäfer, C. and Rieck, K. (2005), “Learning intrusion detection:
Supervised or unsupervised?”, Lecture Notes in Computer Science (Including Subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 3617
LNCS, pp. 50–57.
McCluney, A.L., Vassiliou, M.C., Kaneva, P.A., Cao, J., Stanbridge, D.D., Feldman, L.S. and
Fried, G.M. (2007), “FLS simulator performance predicts intraoperative laparoscopic
skill”, Surgical Endoscopy and Other Interventional Techniques, Vol. 21 No. 11, pp.
1991–1995.
Omar, S., Ngadi, A., et al. (2013), “Machine learning techniques for anomaly detection: an overview”, International Journal of Computer Applications, Vol. 79 No. 2.
Prasad, N.R., Almanza-Garcia, S. and Lu, T.T. (2009), “Anomaly detection”, Computers,
Materials and Continua, Vol. 14 No. 1, pp. 1–22.
Ruff, L., Vandermeulen, R.A., Görnitz, N., Binder, A., Müller, E., Müller, K.-R. and Kloft, M. (2019), “Deep semi-supervised anomaly detection”, arXiv, available at: https://arxiv.org/abs/1906.02694 (accessed 13 November 2022).
Smith, M.R. and Martinez, T. (2011), “Improving classification accuracy by identifying and
removing instances that should be misclassified”, Proceedings of the International Joint
Conference on Neural Networks, pp. 2690–2697.
Villa-Pérez, M., et al. (2021), “Semi-supervised anomaly detection algorithms: A comparative summary and future research directions”, Knowledge-Based Systems, Elsevier, available at: https://www.sciencedirect.com/science/article/pii/S0950705121001416 (accessed 13 November 2022).
Zerbet, A. and Nikulin, M. (2003), “A new statistic for detecting outliers in exponential
case”, Communications in Statistics - Theory and Methods, Vol. 32 No. 3, pp. 573–583.
Zimek, A. and Filzmoser, P. (2018), “There and back again: Outlier detection between
statistical reasoning and data mining algorithms”, Wiley Interdisciplinary Reviews: Data
Mining and Knowledge Discovery, Wiley-Blackwell, Vol. 8 No. 6, available at: https://doi.org/10.1002/WIDM.1280.
