Professional Documents
Culture Documents
Fast Memory Efficient Local Outlier Detection
Fast Memory Efficient Local Outlier Detection
Mahsa Salehi∗ , Christopher Leckie† , James C. Bezdek† , Tharshan Vaithianathan† and Xuyun Zhang‡
∗ IBM Research Melbourne, Victoria 3006, Australia, Email: mahsalehi@au1.ibm.com
† University of Melbourne, Victoria 3010, Australia, Email: {caleckie, jbezdek, tva}@unimelb.edu.au
‡ University of Auckland, Auckland 1010, New Zealand, Email: xyzhanggz@gmail.com
allocate one part of the memory to store b data points, and the
1
Time(mins)
96
AUC(%)
remaining part to store c summaries of older data points. For 94
0.5
each incoming data point the LOF value is computed using 92 MiLOF
MiLOF_F
MiLOF
MiLOF_F
limit is reached, the algorithm summarizes the first 2b data 100 200 300 400
b
500 600 700 100 200 300 400
b
500 600 700
points and deletes them from memory. The summary of these (a) AUC (b) Computation Time
deleted points is in the form of a set of c prototype vectors 0.2
Memory(MB)
0.1
algorithm. If the number of clusters is unknown (and possibly
evolving with time), we try to overestimate c. Here we are not 0.05
MiLOF
points and average them to keep only one point (the cluster (c) Memory Consumption
center) as a representative of each cluster. MiLOF F introduces
a new flexible clustering algorithm based on c-means which Fig. 2: Comparison of MiLOF/MiLOF F with iLOF baseline
summarizes the historical data points in a better way. The
proposed algorithm leads to a higher detection accuracy and
is independent of the number of clusters. It is called Flexible
(Fig. 2b) and memory consumption (Fig. 2c) over increasing
c-means and it is described in the next paragraph. If there
parameter values of b for MiLOF and MiLOF F respectively.
exists a set of summary prototypes from a previous window in
There are also dashed horizontal lines in each figure which
time, the two sets of prototypes are merged to build a single
are the results of iLOF. In Fig. 2a, the AUC of iLOF is
summary of past data points. The algorithm continues until it
99% while the AUC for different values b for MiLOF and
reaches the end of the data stream. This insures that the number
MiLOF F are slightly lower than the baseline. However, the
of data points stored in memory to detect outliers is no more
difference between iLOF’s baseline and MiLOF’s mean value
than m = b+c. The choice of b depends on memory limitation
is 1.64% whereas the difference between iLOF’s baseline and
of the implementation platform. Time complexity of MiLOF
MiLOF F’s mean value is 1.48%. These results suggest that
has been further investigated theoretically as a theorem in [7].
the detection accuracy of our methods are comparable with
Flexible c-means in MiLOF F To increase the detection iLOF. The mean AUC of MiLOF F is slightly better than
accuracy in MiLOF, we store previous inputs more selectively, MiLOF, and is less sensitive to changes in b (based on their
by retaining those points that are most useful for subsequent standard deviations). On the other hand, Figs. 2b and 2c illus-
outlier detection. To achieve this end, we consider the density trate that the computation time and memory consumption in
distribution of the previous data stream. There may be some MiLOF and MiLOF F are always lower than the iLOF baseline
regions in the previous data stream with a high probability in the chosen b range, which is a considerable improvement.
of containing outliers. Flexible c-means prunes these regions The same experiment was done on the other six real life
during summarization, while keeping the less probable ones. data sets. The computation time and memory consumption
The intuition behind this approach is density reduction of results for the six data sets are similar to the Vowel data
outlier regions. Since the local outlier factor shows whether a set’s diagrams. Experiments highlighting the comparability of
data point is similar to its neighbors in terms of local density, MiLOF/MiLOF F with iLOF, stability of MiLOF/MiLOF F
by pruning the regions with high probability of being outliers, with respect to choosing different values for b and c, and
while keeping the summary of inliers, we can further decrease robustness of MiLOF F to changes in the number of data
the similarity of a future outlier data point and its neighbors, points, the number of underlying clusters and the number of
and hence increase the LOF . This means, the probability dimensions in the data stream can be found in [7].
of assigning higher LOF values to future outliers increases, R EFERENCES
which make them more distinguishable from inliers.
[1] S. Sadik and L. Gruenwald, “Research issues in outlier detection for
III. R ESULTS AND E VALUATION data streams,” ACM SIGKDD Explorations Newsletter, vol. 15, no. 1,
We chose three synthetic data sets and seven real life pp. 33–40, 2014.
data sets with different properties including five publicly [2] C. C. Aggarwal, “Outlier Analysis,” 2013.
available datasets from the UCI machine learning repository [3] F. Angiulli and F. Fassetti, “Detecting distance-based outliers in streams
of data,” in CIKM, 2007, pp. 811–820.
(Vowel, Pendigit, Letter, KDD Cup 1999 and Forest Cover).
We change the parameter b and investigate its effect on the [4] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying
density-based local outliers,” in SIGMOD, vol. 29, no. 2, 2000, pp. 93–
Area Under ROC Curve (AUC), time and memory of MiLOF 104.
and MiLOF F over all real life data sets. In the experiments [5] D. Pokrajac, A. Lazarevic, and L. J. Latecki, “Incremental local outlier
we increase b until the time consumption of MiLOF exceeds detection for data streams,” in CIDM, 2007, pp. 504–515.
iLOF. We have also applied iLOF on each real life data sets [6] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and
once, using the assumption of unlimited memory. Figs. 2a-c Applications. CRC Press, 2013.
show the results on the Vowel data. The horizontal axis in all [7] M. Salehi, C. Leckie, J. C. Bezdek, T. Vaithianathan, and X. Zhang,
three views is MiLOF/MiLOF F’s step size b. The two graphs “Fast memory efficient local outlier detection in data streams,” TKDE,
in the three views show AUC (Fig. 2a), computation time vol. 28, no. 12, 2016.
52
54