Abstract - In this paper, a novel empirical data analysis approach (abbreviated as EDA) is introduced which is entirely data-driven and free from restricting assumptions and pre-defined problem- or user-specific parameters and thresholds. It is well known that the traditional probability theory is restricted by strong prior assumptions which are often impractical and do not hold in real problems. Machine learning methods, on the other hand, are closer to the real problems but they usually rely on problem- or user-specific parameters or thresholds, making it rather art than science. In this paper we introduce a theoretically sound yet practically unrestricted and widely applicable approach that is based on the density in the data space. Since the data may have exactly the same value multiple times, we distinguish between the data points and unique locations in the data space. The number of data points, k, is larger than or equal to the number of unique locations, l, and at least one data point occupies each unique location. The number of different data points that have exactly the same location in the data space (equal value), f, can be seen as frequency. Through the combination of the spatial density and the frequency of occurrence of discrete data points, a new concept called multimodal typicality, τ^MM, is proposed in this paper. It offers a closed analytical form that represents ensemble properties derived entirely from the empirical observations of the data. Moreover, it is very close to (yet different from) the histograms, the probability density function (pdf), as well as fuzzy set membership functions. Remarkably, there is no need to perform complicated pre-processing like clustering to get the multimodal representation. Moreover, the closed form for the case of Euclidean or Mahalanobis type of distance, as well as some other forms (e.g. cosine-based dissimilarity), can be expressed recursively, making it applicable to data streams and online algorithms. Inference/estimation of the typicality of data points that were not present in the data so far can be made. This new concept allows us to rethink the very foundations of statistical and machine learning as well as to develop a series of anomaly detection, clustering, classification, prediction, control and other algorithms.

Keywords - empirical data analysis; multimodal typicality; data-driven; recursive calculation; inference; estimation.

I. INTRODUCTION

Most human activities have already been largely changed in the recent decades because of the very fast development of information technologies and the Internet. An astronomical and ever increasing amount of data is being generated every day. As a result, data analysis is a rapidly growing field due to the strong need for processing large data sets or streams and converting the data into useful information.

Traditional approaches are based on a number of restrictive assumptions which usually do not hold in reality. This applies to the fuzzy sets theory [1] with its subjective way of defining membership functions and its assumption of smooth and pre-defined membership functions. It also applies to the probability theory [2] and statistical learning [3], [10]. They are an essential and widely used tool for quantitative analysis and data representation of stochastic types of uncertainties. However, they rely on a number of strong assumptions, including pre-defined smooth and "convenient to use" types of probability distribution, an infinite amount of observations/data points, independence between data points (the so called iid assumption: independent and identically distributed data), etc. However, in most practical problems, these assumptions are not satisfied.

Till now, several alternative methods [5], [6] have been proposed aiming to avoid the problem of unrealistic prior assumptions and get closer to the data rather than stick to the theoretical prior assumptions, but these methods still use at some point the assumption of an (albeit local) Gaussian/normal distribution.

In this paper, we introduce a novel empirical data analysis (EDA) approach. It is a further development of the recently introduced TEDA (typicality and eccentricity data analytics) framework [7]-[9]. It does not require any prior assumptions, which are usually unrealistic and restrictive, or parameters. Instead, it is entirely based on the empirical observations of discrete data points and their mutual position forming a unique pattern in the data space. It starts with calculating (recursively) the cumulative proximity, π, then the standardized eccentricity, ε, the density, D, and finally, the multimodal typicality, τ^MM. Moreover, unlike the pdf, where identifying the number and position of modes is a well known problem usually solved by clustering, the multimodal typicality, τ^MM, is derived from the data automatically and without clustering or any other additional algorithm. It is often quite close to the histograms but is not the same, since it does take into account the mutual position of the data as well as the frequency of their occurrence, f.

The advantages of the multimodal typicality, τ^MM, are obvious because it combines the pdf, the histogram and the mutual distributions of the data points together and has a closed analytical form that can be manipulated further. In addition, the multimodal typicality, τ^MM, can be calculated recursively, which makes it suitable for applications to online data streams.

This work was supported by The Royal Society (Grant number IE141329/2014).
Moreover, one can infer the typicality, τ(x), for any feasible value x by interpolation or extrapolation, and the typicality can be used in a manner similar to probability because it has the same properties of being between 0 and 1 and summing up to 1. It is demonstrated in this paper that it can be used to build a naive typicality-based EDA classifier [10] which provides better results not only than the naive Bayes classifier (due to taking into account the actual distribution of the data) but also in comparison with the SVM classifier [11]. In addition to that, the naive EDA classifier does not require any prior assumption on the distribution of the data, selection of the type of kernel, threshold constants, or iterative optimization, as the SVM approach does.

Another very interesting feature is that this new method is free from some paradoxes that the traditional probability density function has [12]. Additionally, it combines effectively the two different types of randomness representation: a) the one based on the ratio of the number of times a discrete random variable occurs, p = \lim_{k \to \infty} k^*/k, where k^* is the number of times of occurrence and k is the total number of data points/observations, see Fig. 2; and b) the one based on the pdf, see Fig. 1(d). In the probability theory and statistics literature the transition between a) and b) is often neglected. In fact, a) (the multimodal version, τ^MM) is suitable for games such as dices, coins, balls etc. or "pure" random processes of white noise type, while b) (the unimodal typicality, τ) is applicable to most real processes, which are a more complex mixture of deterministic and random or, more correctly, complex phenomena. The proposed typicality works well for both a) and b) and combines them in a unique closed form representation without assuming k → ∞, a type of the distribution, or any user- or problem-specific parameters.

The rest of this paper is organized as follows: section II introduces the novel empirical data analysis (EDA) framework, including the theoretical basis, the multimodal typicality and the recursive calculations. Section III describes the method of making inference within this novel framework. An additional example of multimodal typicality and inference of a benchmark dataset is presented in section IV, the naive EDA classifier is introduced in section V, and section VI concludes the paper.

II. EMPIRICAL DATA ANALYSIS (EDA) FRAMEWORK

This novel non-parametric and assumption-free empirical data analysis framework is based on the following quantities:

1) cumulative proximity, π;
2) standardized eccentricity, ε;
3) density, D, and finally
4) typicality, τ.

In addition, the calculations can be recursive, and the multimodal form of the typicality is also introduced.

A. Theoretical Basis

First of all, let us consider the real Hilbert space R^d and assume a particular data set or stream denoted as {x_1, x_2, ..., x_k} ∈ R^d, where the subscripts denote the time instance at which the data point arrives. Within the data set/stream, some of the data points may repeat more than once, namely, ∃ i, j | x_i = x_j. The set of unique data point locations at time instance k can be defined as {u_1, u_2, ..., u_l} ∈ R^d, and {f_1, f_2, ..., f_l} denotes the corresponding numbers of times different data points occupy the same unique locations (f_i can be viewed as frequency if it is divided by k). Obviously, always l ≤ k, but more often in real problems, because of having exactly the same values many times, l << k. Based on {u_1, u_2, ..., u_l} and {f_1, f_2, ..., f_l}, we can reconstruct the data set {x_1, x_2, ..., x_k} exactly if needed, regardless of the order of arrival of the data points. Further in this paper, all derivations are conducted at the k-th time instance by default if there is no special declaration.

1) Cumulative proximity

Cumulative proximity is a measure indicating the degree of closeness/similarity of a particular data point to all other existing data points [7]-[9]:

\pi_k(u_i) = \sum_{j=1}^{l} d^2(u_i, u_j)

The distance d(u_i, u_j) can be of Euclidean or Mahalanobis type, based on cosine [12], or any other metric.

2) Standardized eccentricity

Eccentricity was introduced in [7]-[9] to represent the association of a data point with the tail of the distribution and the property of being an outlier/anomaly [8].

2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016), October 9-12, 2016, Budapest, Hungary
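The quantities of subsection A can be computed directly from a sample. The following sketch (ours, not the authors' code) assumes squared Euclidean distance and takes the cumulative proximity over the unique locations; with this normalization the standardized eccentricities sum to 2l:

```python
import numpy as np

def unique_locations(x):
    """Split a data set into unique locations u and their frequencies f."""
    u, f = np.unique(np.asarray(x, dtype=float), axis=0, return_counts=True)
    return u, f

def cumulative_proximity(u):
    """pi(u_i) = sum_j d^2(u_i, u_j), squared Euclidean distance."""
    d2 = ((u[:, None, :] - u[None, :, :]) ** 2).sum(axis=2)
    return d2.sum(axis=1)

def standardized_eccentricity(u):
    """eps(u_i) = 2 * l * pi(u_i) / sum_j pi(u_j); the values sum to 2 * l."""
    pi = cumulative_proximity(u)
    return 2 * len(u) * pi / pi.sum()

x = [[2.0], [5.0], [5.0]]           # k = 3 data points, l = 2 unique locations
u, f = unique_locations(x)          # u = [[2], [5]], f = [1, 2]
eps = standardized_eccentricity(u)  # sums to 2 * l = 4
```

The same bookkeeping works for data of any dimensionality, since `np.unique(..., axis=0)` treats each row as one location.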
It is interesting to note that \sum_{i=1}^{l} \varepsilon_k(u_i) = 2l, or, in terms of the distance:

\varepsilon_k(u_i) = 2l \sum_{j=1}^{l} d^2(u_i, u_j) \left( \sum_{p=1}^{l} \sum_{j=1}^{l} d^2(u_p, u_j) \right)^{-1}

Within the data set, different numbers of data points, {f_1, f_2, ..., f_l}, occupy the same unique locations. The multimodal typicality, τ_k^MM, is defined as:

\tau_k^{MM}(u_i) = \frac{f_i \pi_k^{-1}(u_i)}{\sum_{j=1}^{l} f_j \pi_k^{-1}(u_j)} = \frac{f_i D_k(u_i)}{\sum_{j=1}^{l} f_j D_k(u_j)}    (7)

where \sum_{j=1}^{l} f_j \pi_k^{-1}(u_j) > 0; \sum_{j=1}^{l} f_j D_k(u_j) > 0; k > 1; l > 1.

As a simple example, a set of 3 data points is considered in which two of them have exactly the same value: x = {2; 5; 5}. Obviously, u = {2; 5}; l = 2; k = 3; l < k. Naturally, the value of τ^MM = {1/3; 2/3}, while for such a small number of data points, k, it is not possible to get a meaningful pdf.

Fig. 2. A simple example of multimodal typicality.

To further investigate the advantages of the multimodal typicality, the 3-dimensional multimodal typicality, τ^MM, of the climate dataset [13] is shown in Fig. 3. A comparison of τ^MM with the traditional histogram [4], [10] as well as with the normal (Gaussian) pdf [4], [10] is also made in a 2-D graph for visual clarity, shown in Fig. 4.

Fig. 3. 3-D multimodal typicality of the climate dataset (axes: Temperature (°C), Windspeed (mph)).

Fig. 4. Comparison of the multimodal typicality, histogram and Gaussian pdf of the climate dataset in 2-D: (a) Temperature (°C); (b) Windspeed (mph).

The advantages of the multimodal typicality, τ^MM, can be summarized as follows:

1. This typicality takes the spatial density, D, into consideration.
2. The frequency of the occurrence, f, of a certain data sample is also taken into account.
3. There is no need for a clustering algorithm, thresholds, any parameters or complicated pre-processing technology to generate the multimodal typicality distribution.
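The three-point example above (x = {2; 5; 5}) can be reproduced directly from equation (7). A minimal 1-D sketch (ours, not the authors' code), assuming squared Euclidean distance and D proportional to the inverse cumulative proximity over the unique locations:

```python
import numpy as np

def multimodal_typicality(x):
    """tau_MM(u_i) = f_i * pi(u_i)^-1 / sum_j f_j * pi(u_j)^-1, equation (7)."""
    u, f = np.unique(np.asarray(x, dtype=float), return_counts=True)  # 1-D case
    d2 = (u[:, None] - u[None, :]) ** 2
    pi = d2.sum(axis=1)   # cumulative proximity of each unique location
    w = f / pi            # f_i * pi^{-1}(u_i); requires l > 1
    return u, w / w.sum()

u, tau = multimodal_typicality([2, 5, 5])
# u = [2, 5]; tau = [1/3, 2/3], matching the example in the text
```

Note that the normalization makes the result invariant to the constant relating D and π^{-1}, which is why either form of (7) gives the same values.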
The multimodal typicality, τ^MM, is a function having the following properties:

a) it sums up to 1;
b) it is very close to (but not the same as) the histogram;
c) its value is within the range [0; 1];
d) it combines the two completely different forms used so far, the histogram and the pdf, in one expression;
e) it combines the two representations of the probability (frequency-based and distribution-based);
f) it is free from the paradoxes that the pdf is related to [12].

For small values of k, the multimodal typicality is exactly the same as the frequentistic form of probability (see Fig. 2), and with large k, it tends to the pdf. For l < k or l << k (when there are different data points with the same value), different modes will appear automatically, while for cases when l = k or l ≈ k one may still need clustering to get a multi-modal representation [12].

Fig. 5. Examples of inference of multimodal typicality: (a) interpolation; (b) extrapolation; (c) 3-D multimodal typicality with inferences of 5 arbitrary data points.

C. Recursive Calculations

Recursive calculations play a very significant role in online data stream processing [14]. With the utilization of recursive calculation, there is no need to keep large amounts of data in memory. When a new data point arrives at a new unique location, u_{l+1}, its frequency is initialized by f_{l+1} = 1, the frequencies becoming {f_1, f_2, ..., f_l, f_{l+1}} or {f_1, f_2, ..., f_l, 1}.

The cumulative proximity can be updated as follows:

\pi_{k+1}(u_i) = \pi_k(u_i) + d^2(u_i, u_{l+1}), \quad i = 1, 2, ..., l    (9)
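The update in equation (9) can be cross-checked against the batch computation. A minimal sketch (ours, with squared Euclidean distance over the unique locations): a point arriving at a new location u_{l+1} extends the stored proximities instead of recomputing the whole distance matrix.

```python
import numpy as np

def batch_proximity(u):
    """pi(u_i) = sum_j d^2(u_i, u_j) computed from scratch."""
    d2 = ((u[:, None, :] - u[None, :, :]) ** 2).sum(axis=2)
    return d2.sum(axis=1)

def recursive_update(u, pi, f, u_new):
    """Equation (9): pi_{k+1}(u_i) = pi_k(u_i) + d^2(u_i, u_{l+1});
    the new location gets frequency f_{l+1} = 1 and its own proximity."""
    d2_new = ((u - u_new) ** 2).sum(axis=1)
    pi = np.append(pi + d2_new, d2_new.sum())  # update old points, add pi(u_{l+1})
    u = np.vstack([u, u_new])
    f = np.append(f, 1)                        # frequency of the new location
    return u, pi, f

u = np.array([[2.0], [5.0]])
pi = batch_proximity(u)
f = np.array([1, 2])
u, pi, f = recursive_update(u, pi, f, np.array([7.0]))
# the recursively updated pi coincides with batch_proximity(u)
```

If the new point falls on an existing location, only its frequency needs to be incremented; the proximities are unchanged in that case.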
For the case when the Euclidean distance is considered, it becomes [7]-[9]:

\pi_{k+1}(u_{l+1}) = (l+1)\left( \|u_{l+1} - \mu_{l+1}\|^2 + X_{l+1} - \|\mu_{l+1}\|^2 \right)    (10)

where μ_{l+1} is the recursively updated mean and X_{l+1} the recursively updated average scalar product:

\mu_{l+1} = \frac{l}{l+1}\mu_l + \frac{1}{l+1} u_{l+1}    (11)

X_{l+1} = \frac{l}{l+1} X_l + \frac{1}{l+1} u_{l+1}^T u_{l+1}    (12)

After updating the cumulative proximities, the typicality of all the unique data points, including the new one, at time instance k+1 can be updated using equation (6), and their corresponding multimodal typicality at time instance k+1 can be updated using equation (7).

III. INFERENCE

In this section, the inference mechanism based on the existing data points is described. For a feasible data point u_{l+1} at which an estimate of the typicality is required, the set of unique locations is extended to {u_1, u_2, ..., u_l, u_{l+1}}. Then μ_{l+1} and X_{l+1} are recursively updated using equations (11) and (12), and then, most importantly, π(u_{l+1}) is calculated using equations (6)-(10).

If the data point at which an estimate/inference of the typicality is made is within the range of the dataset (that is, we perform an interpolation, see Fig. 5(a), the red line), then the frequency f_{l+1} is estimated as the average of the frequencies of the two nearest data points in each dimension:

f_{l+1} = \frac{1}{d}\sum_{i=1}^{d} \frac{(u_{l+1,i} - u_{L,i}) f_{R,i} + (u_{R,i} - u_{l+1,i}) f_{L,i}}{u_{R,i} - u_{L,i}}    (13)

where u_{R,i} and u_{L,i} are the i-th dimensional values of u_R and u_L, which are the nearest to u_{l+1} in the i-th dimension and satisfy u_{L,i} < u_{l+1,i} < u_{R,i}.

If the data point is outside the range of the dataset (that is, we perform an extrapolation, see Fig. 5(b), the red line), the frequency is set to 1: f_{l+1} = 1. Because such points for which inference is made do not actually exist (they are virtual), they should not influence the typicality of the other, really existing data points. The density of the virtual data point is:

D(u_{l+1}) = \frac{\sum_{p=1}^{l}\sum_{j=1}^{l} d^2(u_p, u_j)}{2l \sum_{j=1}^{l} d^2(u_{l+1}, u_j)}    (14)

Fig. 5(c) depicts a 3-D example with five feasible points for which an inference is made, shown in red. In Fig. 6, inference for many points is made (in red), which leads to a typicality graph that looks continuous (it is not continuous because the number of actual data points, k, plus the points for which the inference is made, is not infinite).

Fig. 6. Curves of empirical likelihood of the climate dataset in 2-D (Temperature (°C) and Windspeed (mph)).

IV. EXAMPLE

In this section, another example of multimodal typicality and inference for a benchmark dataset is presented. The benchmark dataset in question is a wrist-worn accelerometer (3-dimensional) data set for activities of daily life (ADL) recognition described in [15]. In this paper, only part of the data is used (5 clusters with 150 data points per cluster). The
data is real and interesting because it is clearly not uni-modal. The proposed multimodal typicality can be determined without any clustering. It is visually very similar to the histograms that can be obtained. The 3-D graph of the multimodal typicality and inference of the wrist-worn dataset is shown in Fig. 7(a), and the data in the 3 dimensions (x-axis, y-axis, z-axis) are shown respectively in Fig. 7(b)-(d).

V. NAIVE EDA CLASSIFIER

The so called naive Bayes classifier has been extensively studied and widely used in various fields [6], [10]. Assuming there are C classes, the class label for newly arriving data points will be determined by:

C(x) = \arg\max_{v=1,...,C}\left(\tau_{k,v}^{MM}(x)\right)    (16)

where the multimodal typicality is defined per class as:

\tau_{k,v}^{MM}(x) = \frac{f_{l_v+1} D_{k,v}(x)}{\sum_{j=1}^{C}\sum_{i=1}^{l_j} f_{j,i} D_{k,j}(u_{j,i})}    (17)

where the index l_v indicates the number of unique points in the v-th class.
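The decision rule of equations (16)-(17) can be sketched in a few lines. This is a simplified illustration (ours, not the authors' implementation): it takes D proportional to the inverse cumulative proximity to each class's unique locations and sets f_{l+1} = 1 for the candidate point, so the class-independent denominator of (17) drops out of the argmax.

```python
import numpy as np

def class_typicality(x_new, u):
    """Unnormalized per-class score: D_v(x) taken proportional to the inverse
    cumulative proximity of x to the unique locations u of class v."""
    d2 = ((u - x_new) ** 2).sum(axis=1)   # squared Euclidean distances
    return 1.0 / d2.sum()

def eda_classify(x_new, classes):
    """Equation (16): the label is the argmax over classes of the typicality."""
    return max(classes, key=lambda v: class_typicality(x_new, classes[v]))

# two toy 1-D classes given by their unique locations (names are ours)
classes = {
    "A": np.array([[0.0], [1.0]]),
    "B": np.array([[9.0], [10.0]]),
}
label = eda_classify(np.array([0.5]), classes)   # nearer to class A
```

A full implementation would also carry the per-class frequencies and use the interpolated f_{l+1} of equation (13) inside the range of each class.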
The performance of the proposed naive EDA classifier was tested on a well-known challenging problem, the PIMA dataset [16]. The performance of the proposed naive EDA classifier was compared with the best known classifier, SVM, and with the naive Bayes classifier, which it resembles. First, 90% (691 points) of the data set were used for training. The PIMA dataset is described in [16]; in this paper we only use the following attributes:

1) number of times pregnant;
2) plasma glucose concentration at 2 hours in an oral glucose tolerance test;
3) diastolic blood pressure (mm Hg);
4) triceps skin fold thickness (mm).

The results are depicted in Table I in the form of a confusion matrix. The proposed naive EDA classifier provides 80.5% accuracy, compared with 79.2% for the SVM classifier using a linear kernel function [11] and 77.9% for the naive Bayes classifier using a Gaussian distribution. Obviously, the performance of the proposed naive EDA classifier is the best, which is not unexpected, because it does take into account the real data distribution rather than idealize it. Moreover, it does not require the decision maker to make a choice of a distribution or parameters, or to perform iterative optimization, but is an entirely data driven (objective) approach and is free from problem- and user-specific parameters and assumptions.

TABLE I. CONFUSION MATRIX FOR THE VALIDATION DATA

                                   Predicted Positive   Predicted Negative
Naive EDA Classifier
  Actual Positive (31 Samples)     28 (90.32%)          3 (9.68%)
  Actual Negative (46 Samples)     8 (17.39%)           38 (82.61%)
Naive Bayes Classifier
  Actual Positive (31 Samples)     22 (70.97%)          9 (29.03%)
  Actual Negative (46 Samples)     8 (17.39%)           38 (82.61%)

VI. CONCLUSION AND FUTURE DIRECTION

In this paper, a novel non-parametric empirical data analysis approach is introduced which is free from any prior assumptions. This approach is entirely based on empirical observations of discrete data points, and the ensemble data properties are extracted from the data directly and automatically. A new concept, called multimodal typicality, is also proposed within this framework. Multimodal typicality takes the mutual distributions and the frequencies of occurrence of the data points into account at the same time and combines the classical pdf and histogram into one expression. It also combines the frequentistic and the distribution-based interpretations of probability. There is no need for any complicated preprocessing technology or clustering algorithm to generate the multimodal typicality, and it also can be calculated recursively. Within this new framework, inferences of the multimodal typicality of virtual data points can be made entirely based on the existing data points, and a curve of empirical likelihood can be drawn optionally. This new empirical data analysis approach is very powerful in real cases and will be a promising tool in the field of data analytics.

As future work, this novel empirical data analysis approach will be applied to various machine learning problems such as anomaly detection, clustering, more sophisticated classification algorithms, data prediction, etc.

REFERENCES

[1] L. A. Zadeh, "Fuzzy sets", Information and Control, vol. 8 (3), pp. 338-353, 1965.
[2] T. Bayes, "An essay towards solving a problem in the doctrine of chances", Phil. Trans. Roy. Soc. London, vol. 53, p. 370, 1763.
[3] A. Gammerman, V. Vovk, H. Papadopoulos (Eds.), Statistical Learning and Data Sciences, Springer, 2015, pp. 169-178.
[4] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007, ISBN-13: 978-0387310732.
[5] P. Del Moral, "Non linear filtering: interacting particle solution", Markov Processes and Related Fields, vol. 2 (4), pp. 555-580, 1996.
[6] J. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives, Springer, 2010, ISBN: 978-1-4419-1569-6.
[7] P. Angelov, "Typicality distribution function: a new density-based data analytics tool", in IEEE International Joint Conference on Neural Networks (IJCNN), 2015.
[10] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, Springer, 2009, ISBN-13: 978-0387952840.
[11] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000, ISBN: 0521780195 (hb).
[12] P. Angelov, X. Gu and D. Kangin, "TEDA: typicality based empirical data analysis", unpublished.
[13] http://www.worldweatheronline.com