
2017 IEEE 33rd International Conference on Data Engineering

Fast Memory Efficient Local Outlier Detection in Data Streams (Extended Abstract)

Mahsa Salehi∗ , Christopher Leckie† , James C. Bezdek† , Tharshan Vaithianathan† and Xuyun Zhang‡
∗ IBM Research Melbourne, Victoria 3006, Australia, Email: mahsalehi@au1.ibm.com
† University of Melbourne, Victoria 3010, Australia, Email: {caleckie, jbezdek, tva}@unimelb.edu.au
‡ University of Auckland, Auckland 1010, New Zealand, Email: xyzhanggz@gmail.com

Abstract—Outlier detection is an important task in data mining. With the growing need to analyze high-speed data streams, the task of outlier detection becomes even more challenging, as traditional outlier detection techniques can no longer assume that all the data can be stored for processing. While the well-known Local Outlier Factor (LOF) algorithm has an incremental version (called iLOF), it assumes unbounded memory to keep all previous data points. In this paper, we propose a memory efficient incremental local outlier (MiLOF) detection algorithm for data streams, and a more flexible version (MiLOF_F), both of which have an accuracy close to iLOF but within a fixed memory bound. In addition, MiLOF_F is robust to changes in the number of data points, underlying clusters and dimensions in the data stream.
Fig. 1: Three phases of MiLOF
I. INTRODUCTION
Outlier detection has received considerable attention in the field of data mining because of the need to detect unusual events in a variety of applications. Many outlier detection algorithms have been proposed for use on static data sets, which have a finite number of samples. However, outlier detection on streaming data is particularly challenging, since the volume of data to be analyzed is effectively unbounded and cannot be stored indefinitely in memory for processing [1]. Data streams are also generated at a high data rate, making the task of outlier detection even more challenging. This is a significant problem that arises in many real applications. Motivated by these challenges, we propose an unsupervised memory efficient algorithm and its extension to tackle this problem.

In particular, we have developed a novel algorithm called memory efficient incremental local outlier detection (MiLOF) for data streams in environments with limited memory. The goal is to compute Local Outlier Factor (LOF) [4] values that are similar to those computed by iLOF [5], within the given memory constraints. The algorithm summarizes the previous data points that exceed the memory constraint, taking their LOF scores into consideration. We have also proposed a more efficient approach, an extension of MiLOF called MiLOF_F, which is capable of dynamically determining the number of summaries to keep in memory through a novel adaptive clustering based approach for building summaries of data points. In addition, it is more stable with respect to changes in the available memory size and more accurate in detecting outliers. Our experiments on both synthetic and various real-life data sets show that MiLOF/MiLOF_F can successfully detect outliers while decreasing both the memory requirements and time complexity of iLOF.

Related work: A number of approaches have been proposed to detect outliers in data streams [2]. Distance based outlier detection techniques [3] have become popular as they are intuitive and do not assume an underlying distribution for the data. These techniques can be categorized into 'global' and 'local' approaches. The advantage of local outlier detection approaches is their ability to detect outliers in datasets with nonhomogeneous densities. However, the literature pays less attention to finding local outliers (LOF) [4] in data streams. While the researchers in [5] proposed the iLOF technique, it needs all previous data points to detect outliers precisely and its time complexity is not linear. Hence, such an approach has limited use in practice. In contrast, both MiLOF and MiLOF_F use only a fixed size memory while detecting outliers in near linear time.

II. MiLOF: MEMORY EFFICIENT iLOF ALGORITHM AND THE MiLOF_F VARIANT

Problem Definition: Given a data vector p_t ∈ R^D collected at time t ∈ T, the goal of MiLOF and MiLOF_F is to assign an LOF value to p_t, under the constraint that the available memory stores only a fraction m << n of the n points that have been observed up to time T. In other words, it is impractical to store all n data vectors and their corresponding LOF values in memory. Hence we need to choose a strategy to summarize the previous data points so that the LOF values of new points can be calculated. Specifically, the goal of MiLOF and MiLOF_F is to detect outliers over the whole stream duration, and not just over the last m data points, where the available memory is limited to m points. MiLOF_F is an extension of MiLOF that aims for greater flexibility (it does not require the number of clusters) and higher detection accuracy.

MiLOF Algorithm: MiLOF consists of three phases: 1) Revised Insertion, 2) Summarization and 3) Merging. Fig. 1 shows these phases on a 2-dimensional data stream.
2375-026X/17 $31.00 © 2017 IEEE
DOI 10.1109/ICDE.2017.32
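Before detailing these phases, the batch LOF computation [4] that iLOF maintains incrementally can be sketched as follows. This is a minimal illustration under our own assumptions: the function names and the brute-force neighbour search are ours, not from the paper.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i] (excluding itself)."""
    order = sorted(range(len(points)), key=lambda j: math.dist(points[i], points[j]))
    return [j for j in order if j != i][:k]

def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    return math.dist(points[i], points[knn(points, i, k)[-1]])

def reach_dist(points, i, j, k):
    """Reachability distance of points[i] with respect to neighbour points[j]."""
    return max(k_distance(points, j, k), math.dist(points[i], points[j]))

def lrd(points, i, k):
    """Local reachability density: inverse of the mean reachability distance."""
    nbrs = knn(points, i, k)
    return len(nbrs) / sum(reach_dist(points, i, j, k) for j in nbrs)

def lof(points, i, k):
    """LOF is near 1 for inliers and substantially above 1 for local outliers."""
    nbrs = knn(points, i, k)
    return sum(lrd(points, j, k) for j in nbrs) / (len(nbrs) * lrd(points, i, k))
```

In iLOF, and hence in MiLOF, these quantities (k-distances, local reachability densities and LOF values) are updated incrementally as points arrive rather than recomputed from scratch.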
First, we assume that the memory limit allows m data points with their corresponding LOF values to be stored in memory. We allocate one part of the memory to store b data points, and the remaining part to store c summaries of older data points. For each incoming data point, the LOF value is computed using the revised insertion algorithm of iLOF [5]. Once the memory limit is reached, the algorithm summarizes the first b/2 data points and deletes them from memory. The summary of these deleted points is in the form of a set of c prototype vectors in R^D. There is a virtually unlimited number of choices for the clustering algorithm [6]; MiLOF uses the c-means algorithm. If the number of clusters is unknown (and possibly evolving with time), we try to overestimate c. Here we are not necessarily looking to find the apparent cluster substructure in the evolving data. Instead, we want to find similar data points and average them, keeping only one point (the cluster center) as a representative of each cluster. MiLOF_F introduces a new flexible clustering algorithm based on c-means which summarizes the historical data points in a better way. The proposed algorithm leads to a higher detection accuracy and is independent of the number of clusters. It is called Flexible c-means and is described in the next paragraph. If there exists a set of summary prototypes from a previous window in time, the two sets of prototypes are merged to build a single summary of past data points. The algorithm continues until it reaches the end of the data stream. This ensures that the number of data points stored in memory to detect outliers is no more than m = b + c. The choice of b depends on the memory limitation of the implementation platform. The time complexity of MiLOF has been investigated theoretically as a theorem in [7].

Fig. 2: Comparison of MiLOF/MiLOF_F with the iLOF baseline: (a) AUC, (b) Computation Time, (c) Memory Consumption.
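The summarize-and-merge bookkeeping described above can be sketched as follows. This is a simplified illustration under our own assumptions, not the authors' implementation: a plain Lloyd-style c-means stands in for the clustering step, and the merge simply re-clusters the union of old and new prototypes.

```python
import math
import random

def c_means(points, c, iters=10):
    """Plain Lloyd-style c-means; returns up to c prototype vectors."""
    centers = random.sample(points, min(c, len(points)))
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            jj = min(range(len(centers)), key=lambda t: math.dist(p, centers[t]))
            clusters[jj].append(p)
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers

class StreamSummarizer:
    """Keeps at most b raw points plus c prototypes, so memory stays near m = b + c."""
    def __init__(self, b, c):
        self.b, self.c = b, c
        self.window = []       # recent raw data points
        self.prototypes = []   # cluster centers summarizing older points

    def insert(self, p):
        self.window.append(p)
        if len(self.window) >= self.b:            # memory limit reached
            oldest = self.window[: self.b // 2]   # summarize the first b/2 points
            self.window = self.window[self.b // 2:]
            new_protos = c_means(oldest, self.c)
            # Merging: fold the new prototypes into the existing summary so
            # only c prototypes survive for the whole history seen so far.
            self.prototypes = c_means(self.prototypes + new_protos, self.c)
```

In MiLOF itself the insertion step would also compute the point's LOF value (via the revised insertion of iLOF) before any summarization is triggered; this sketch only shows the memory-bounding cycle.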
Flexible c-means in MiLOF_F: To increase the detection accuracy of MiLOF, we store previous inputs more selectively, retaining those points that are most useful for subsequent outlier detection. To this end, we consider the density distribution of the previous data stream. There may be some regions in the previous data stream with a high probability of containing outliers. Flexible c-means prunes these regions during summarization, while keeping the less probable ones. The intuition behind this approach is density reduction of outlier regions. Since the local outlier factor indicates whether a data point is similar to its neighbors in terms of local density, by pruning the regions with a high probability of containing outliers, while keeping the summary of inliers, we can further decrease the similarity between a future outlier data point and its neighbors, and hence increase its LOF. This means the probability of assigning higher LOF values to future outliers increases, which makes them more distinguishable from inliers.
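The density-reduction idea can be illustrated with the following sketch. This is our own simplification, not the authors' Flexible c-means: we simply prune the highest-LOF points before summarization and run a plain c-means pass over the survivors, and all names and the keep_frac parameter are hypothetical.

```python
import math
import random

def prune_then_cluster(points, lof_scores, c, keep_frac=0.75, iters=10):
    """Illustrative pruning-based summarization: drop the points most likely
    to be outliers, then cluster the surviving inlier-like points."""
    # 1) Pruning: discard the tail with the highest LOF scores, so regions
    #    likely to contain outliers are under-represented in the summary.
    ranked = sorted(zip(lof_scores, points))              # low LOF first
    kept = [p for _, p in ranked[: max(1, int(len(ranked) * keep_frac))]]
    # 2) Clustering: Lloyd-style c-means over the kept points only.
    centers = random.sample(kept, min(c, len(kept)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in kept:
            jj = min(range(len(centers)), key=lambda t: math.dist(p, centers[t]))
            groups[jj].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers
```

Because the prototypes come only from inlier-like regions, a future point landing in a pruned region finds no dense summarized neighbourhood, which is exactly what pushes its LOF value up.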
III. RESULTS AND EVALUATION

We chose three synthetic data sets and seven real-life data sets with different properties, including five publicly available datasets from the UCI machine learning repository (Vowel, Pendigit, Letter, KDD Cup 1999 and Forest Cover). We change the parameter b and investigate its effect on the Area Under the ROC Curve (AUC), time and memory of MiLOF and MiLOF_F over all real-life data sets. In the experiments we increase b until the time consumption of MiLOF exceeds iLOF. We have also applied iLOF on each real-life data set once, using the assumption of unlimited memory. Figs. 2a-c show the results on the Vowel data. The horizontal axis in all three views is MiLOF/MiLOF_F's step size b. The graphs in the three views show AUC (Fig. 2a), computation time (Fig. 2b) and memory consumption (Fig. 2c) over increasing parameter values of b for MiLOF and MiLOF_F respectively. There are also dashed horizontal lines in each figure, which are the results of iLOF. In Fig. 2a, the AUC of iLOF is 99%, while the AUCs for different values of b for MiLOF and MiLOF_F are slightly lower than the baseline. However, the difference between iLOF's baseline and MiLOF's mean value is 1.64%, whereas the difference between iLOF's baseline and MiLOF_F's mean value is 1.48%. These results suggest that the detection accuracy of our methods is comparable with iLOF. The mean AUC of MiLOF_F is slightly better than MiLOF, and it is less sensitive to changes in b (based on their standard deviations). On the other hand, Figs. 2b and 2c illustrate that the computation time and memory consumption of MiLOF and MiLOF_F are always lower than the iLOF baseline in the chosen b range, which is a considerable improvement. The same experiment was done on the other six real-life data sets; their computation time and memory consumption results are similar to the Vowel data set's diagrams. Experiments highlighting the comparability of MiLOF/MiLOF_F with iLOF, the stability of MiLOF/MiLOF_F with respect to choosing different values for b and c, and the robustness of MiLOF_F to changes in the number of data points, the number of underlying clusters and the number of dimensions in the data stream can be found in [7].

REFERENCES

[1] S. Sadik and L. Gruenwald, "Research issues in outlier detection for data streams," ACM SIGKDD Explorations Newsletter, vol. 15, no. 1, pp. 33-40, 2014.
[2] C. C. Aggarwal, Outlier Analysis. Springer, 2013.
[3] F. Angiulli and F. Fassetti, "Detecting distance-based outliers in streams of data," in CIKM, 2007, pp. 811-820.
[4] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," in SIGMOD, vol. 29, no. 2, 2000, pp. 93-104.
[5] D. Pokrajac, A. Lazarevic, and L. J. Latecki, "Incremental local outlier detection for data streams," in CIDM, 2007, pp. 504-515.
[6] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications. CRC Press, 2013.
[7] M. Salehi, C. Leckie, J. C. Bezdek, T. Vaithianathan, and X. Zhang, "Fast memory efficient local outlier detection in data streams," TKDE, vol. 28, no. 12, 2016.
