Intelligent Granulation of Machine-generated Data
Dominik Ślȩzak∗† and Marcin Kowalski∗
∗ Institute of Mathematics, University of Warsaw
ul. Banacha 2, 02-097 Warsaw, Poland
† Infobright Inc.
ul. Krzywickiego 34/219, 02-078 Warsaw, Poland
slezak@infobright.com; mkowal@mimuw.edu.pl

Abstract—We discuss how the specifics of data granulation methodology can influence Infobright's database system performance. We put together our two previous research paths related to machine-generated data sets, namely, dynamic reorganization of data during load and efficient handling of alphanumeric columns with compound values. We emphasize the role of domain knowledge while tuning data granulation processes.

Index Terms—Analytic Databases, Domain Knowledge, Compound Values, Stream Clustering, Outliers.

I. Introduction

Information granulation is one of the fundamental aspects of granular computing [1], [2]. Granules can correspond to collections of entities gathered during, e.g., data acquisition or generation. Such collections may contain elements arranged together due to, e.g., their adjacency or similarity. Operations on granules, or rather their high-level representations, underlie many applications of approximate reasoning and hierarchical concept modeling [3], [4].

Via calculations on granules, one can scale operations over large and complex data. For instance, in data mining, there is a popular approach of searching for meaningful clusters based on statistical summaries of heuristically pre-computed micro-clusters [5], [6]. There is no guarantee that clusters obtained by grouping together micro-clusters with similar representations will reflect all nuances of the input data. Still, the computational savings gained by operating on summaries instead of real data are often worth it.

Representations of dynamically assembled granules can also be useful in database systems aimed at maintaining the speed of data analysis in the face of massive data generated by various types of sensors and devices. Following the principles of rough sets [7], [8], one can compute approximate or – if a system is designed to store both statistical snapshots and original contents of granules – exact answers to standard SQL queries [9], [10]. Generally, scaling computations may mean both speeding up typical operations thanks to more intelligent work with granulated data, as well as introducing approximate variants of those operations, analogously to the above-mentioned analysis of micro-cluster representations.

Regardless of whether we consider a database, a data clustering framework, or any other form of employing granular computations, their efficiency will depend on the quality of granules. In [11], [12], we investigated how to define the quality of granulation for a database system performing operations with the assistance of a metadata layer containing simplified representations of granules resulting from vertical and horizontal data decompositions. We also showed how to implement a mechanism of organizing the incoming tabular data into such blocks of rows that maximize the quality of their granular representations.

On the other hand, it is not so obvious how to translate the quality of granules into the quality of the whole system. If the goal is to improve database performance, we can judge a given granular environment by its ability to reduce the computational effort related to accessing detailed granule contents instead of working only with their high-level representations. However, it may significantly depend on domain knowledge about, e.g., specific types of analytic SQL statements that end users may wish to execute.

Domain knowledge is useful also within single granules. Information granulation is usually assumed to involve numeric content, but this is not the case for analytic databases storing machine-generated data. In [13], [14], we focused on data sets with varchar columns corresponding to sub-collections of values fitting different structural patterns identified by data providers, representing web addresses, online transactions and so on. We designed new methods of creating structure-based statistics for such granules.

The primary objective of this paper is to put together our methods of utilizing domain knowledge to better annotate such internally compound granules with our previously developed methods of controlling the flow of loaded rows, in order to provide a more complete framework for handling massive amounts of machine-generated data.

The paper is organized as follows: In Section II, we recall Infobright's system architecture, which is an example of a more general approach to granular computational scalability. In Section III, we illustrate how domain knowledge about data content and end user activities can shape the principles of data granulation optimization. In Section IV, we revisit our previous algorithmic framework for on-load data granulation with respect to three types of outliers usually occurring in machine-generated data. In Section V, we outline the related work in the areas of databases and data mining. Section VI concludes our research.
Fig. 1. Loading and querying in the Infobright database system. Data is partitioned onto row packs, each consisting of 2^16 rows. Values of each of the data columns are stored separately, in data packs. Row packs are labeled with rough values providing high-level information about data packs. This leads to a layer of granulated information systems where objects correspond to row packs and attributes correspond to rough values. Rough values and compressed data packs are stored on disk. When querying, applicable rough values are put into memory. Query performance depends on the efficiency of rough values in minimizing the access to data packs stored on disk. In the case of, e.g., data filtering, rough values are applied to quickly exclude data packs that do not satisfy a given SQL condition. Only the packs that are not excluded need to be decompressed and examined value by value.

Fig. 2. SELECT MAX(A) FROM T WHERE B>15; [10]. Min/max rough values for numeric data columns A and B in table T are presented at the left side. Data packs are classified into three categories: Irrelevant (I) with no elements relevant for further execution; Relevant (R) with all elements relevant for further execution; Suspect (S) that cannot be classified as either R or I. The R/I/S areas correspond to the rough set positive, negative and boundary regions. The first execution stage classifies data packs into R/I/S regions with respect to the condition B>15. The second stage employs the third row pack to approximate the result as MAX(A) ≥ 18. Hence, all row packs except the first and the third ones become unnecessary to complete the task. At the third stage, the approximation is changed to MAX(A) ≥ X, where X depends on the result of exact (E) processing of the first row pack. If X ≥ 22, then there is no need to access the third row pack.

II. Infobright – A Granular Database System

The approach presented in this section is an example of a general strategy of scaling computations on large data sets. Following the principles of rough sets and granular computing [1], [8], this methodology can be expressed by the four following steps: 1) Decompose data onto granules; 2) Create statistical snapshots for each of the granules; 3) Do approximate computations on the snapshots; 4) Whenever there is no other choice, access some of the granules.

Infobright's engine is entirely based on the above strategy, in combination with the principles of columnar data stores [15], [16], adaptive data compression [17], [18], and pipelined query processing [19], [20]. Figure 1 outlines a scheme of decomposing data onto so-called data packs and producing their high-level representations, called rough values. It also illustrates a simple way of employing rough values in query execution, analogous to some other cases of using statistics over partitioned data [21], [22].

There are many ways of using rough values in database operations [9], [23]. Figure 2 shows the importance of interactions between layers of approximate computing with rough values and exact computing with data pack contents. In the case of queries involving many columns, it is important that we know which rough values describe the same blocks of rows, called row packs. This is helpful when resolving, e.g., multidimensional ad-hoc filters. On the other hand, it becomes a challenge when attempting to organize rows into row packs in a way that would improve rough value descriptions for many columns at the same time.

Infobright's rough values can be treated as a metadata layer. Besides the min/max statistics displayed above, they may contain, e.g., a sum of numeric values or total length of string values, a number of NULL values and so on [9], [14]. The efficiency of rough values can be measured in two ways. First of all, the overall size of statistics maintained within rough values needs to be far smaller than the size of the underlying data. Secondly, rough values should be able to prevent accessing data regions that would be unnecessary to answer a given query when having direct access to each of the single data elements. In other words, there should be a low risk of occurrence of so-called false suspects, which are data packs classified as suspects that turn out to be fully relevant or irrelevant after accessing them.

The remaining metadata layers are responsible for accessing and interpreting the contents of data packs and rough values. They may contain, e.g., dictionary structures for columns with a relatively small number of unique values or the already-mentioned domain-specific descriptions of values of long varchar data columns. All such kinds of metadata need to be consistently taken into account while parsing, granulating and compressing newly loaded data.

Metadata needs to be considered also while designing and interpreting rough values. However, we assume that rough values' functionality, such as the ability to identify rough-set-inspired regions as illustrated by Figure 2, should remain the same for all kinds of internal structures and high-level descriptions used for data packs.
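To make the mechanics of Figure 2 concrete, the following minimal sketch (in Python) shows how min/max rough values can drive the execution of SELECT MAX(A) FROM T WHERE B>15. It is not Infobright's actual code, and the pack contents and numbers used below are illustrative assumptions rather than the figure's exact data.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class RoughValue:
    min: float
    max: float

@dataclass
class RowPack:
    a: RoughValue                       # min/max rough value for column A
    b: RoughValue                       # min/max rough value for column B
    rows: List[Tuple[float, float]]     # (A, B) pairs, accessed only when necessary

def classify(b: RoughValue, threshold: float) -> str:
    """Classify a pack against the condition B > threshold as R, I or S."""
    if b.min > threshold:
        return "R"   # all rows of the pack satisfy the condition
    if b.max <= threshold:
        return "I"   # no row of the pack satisfies the condition
    return "S"       # suspect: exact verification may be needed

def rough_max_a(packs: List[RowPack], threshold: float) -> Optional[float]:
    labels = [classify(p.b, threshold) for p in packs]
    # Relevant packs contribute their exact A-maximum without being accessed.
    result = max((p.a.max for p, lab in zip(packs, labels) if lab == "R"), default=None)
    for p, lab in zip(packs, labels):
        if lab == "I":
            continue
        # Skip any pack whose A-wise maximum cannot improve the current bound.
        if result is not None and p.a.max <= result:
            continue
        # Exact (E) processing: decompress and scan the suspect pack value by value.
        exact = max((a for a, b in p.rows if b > threshold), default=None)
        if exact is not None and (result is None or exact > result):
            result = exact
    return result

packs = [
    RowPack(RoughValue(5, 25),  RoughValue(12, 30), [(25, 12), (5, 30)]),    # Suspect
    RowPack(RoughValue(1, 15),  RoughValue(2, 14),  [(15, 2), (1, 14)]),     # Irrelevant
    RowPack(RoughValue(18, 22), RoughValue(16, 50), [(22, 16), (18, 50)]),   # Relevant
]
print(rough_max_a(packs, 15))   # -> 22

In this sketch only the suspect pack is ever decompressed, which reflects the role of rough values in minimizing disk access described above.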
Fig. 3. Infobright's granulation parameters [24]. Granularity controls the amount of rows in every row pack (2^16 by default). Finer granularity yields higher rough value precision and more selective data accessibility but also increases the overall metadata size. Origin reflects scenarios where data is acquired from different sources by data load processors, which parse, granulate and compress data locally, and send the assembled data packs and their corresponding rough values to the main database server. In such a case, row packs gathered in the main server are homogeneous with respect to attributes denoting data origin. Quality stands for the on-load row reorganization procedures aimed at the improvement of row packs with respect to the precision of rough values over some selected data columns.

Fig. 4. Utilizing knowledge about attribute values [13]. Rules stored in the structure dictionary enable parsing values onto smaller particles. The match table keeps information about which rules were triggered for which values. Structure sub-collections contain values matched by the same rule. Each sub-collection is handled independently, within a single data pack. The outlier sub-collection contains items that were not matched by any rule. For example, table T includes the varchar column MaybeURL. The structure dictionary for MaybeURL contains a rule defining how strings representing URL addresses are usually composed. The match table contains the following codes: 1 if a value is recognized as a URL; 0 for NULL values; -1 for unrecognized values. Sub-collections of URL addresses are then decomposed with respect to their substrings representing the scheme, hierarchical part and so on. Such substrings are often easier to represent at the level of rough values.
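The matching step described in the caption of Figure 4 can be sketched as follows. This is a minimal illustration assuming a single regex-based rule for the MaybeURL column; the actual rule language and structure dictionary of [13], [14] are richer.

import re
from urllib.parse import urlsplit

URL_RULE = re.compile(r"^[a-z][a-z0-9+.-]*://\S+$", re.IGNORECASE)

def build_match_table(values):
    """Return match codes (1 = URL, 0 = NULL, -1 = unrecognized) and sub-collections."""
    codes, urls, outliers = [], [], []
    for v in values:
        if v is None:
            codes.append(0)
        elif URL_RULE.match(v):
            codes.append(1)
            parts = urlsplit(v)
            # Matched values are decomposed into substrings (scheme, hierarchical part,
            # query) that are easier to summarize by rough values within a data pack.
            urls.append((parts.scheme, parts.netloc + parts.path, parts.query))
        else:
            codes.append(-1)
            outliers.append(v)   # outlier sub-collection: not matched by any rule
    return codes, urls, outliers

maybe_url = ["http://example.com/a?x=1", None, "not a url", "ftp://host/file"]
print(build_match_table(maybe_url))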

III. Importance of Domain Knowledge

Employing domain knowledge is crucial at each step of the computational strategy outlined in Section II. For instance, when creating granules, one needs to develop fast decomposition methods assuring reasonable granular quality with respect to dynamically changing criteria. As another example, we need to guarantee that snapshots of granules are small but also informative enough, where the criterion of being informative is again subject to the specifics of information needed in expected data operations.

In the case of Infobright, informativeness corresponds to how precisely rough values describe data packs. Indeed, more precise descriptions quickly lead to better computation approximations. Figure 3 illustrates factors that can influence rough value precision, with one of them referring to the on-load row reorganization developed in [11], [12].

Row reorganization means that row packs are formed by not necessarily consecutive rows. Besides some obvious computational complexity challenges addressed in the next section, the two remaining questions are how to make a transition from the precision of single rough values to the precision of row packs represented by rough values corresponding to multiple data columns, and how to mathematically grasp rough value precision for various data content types.

Let us give just a few examples. For numeric columns, rough value precision can be inversely proportional to the difference between the max and min shown in Figure 2. For so-called low cardinality columns, it may be inversely proportional to the number of distinct values in a data pack, which corresponds to generalized decision functions used in rough sets [7], [24]. However, a general task of expressing rough value precision is certainly far more difficult.

In [13], [14], we investigated how knowledge about data origin may assist in seeking optimal storage and representation of machine-generated columns with compound values. Figure 4 illustrates how so-called domain rules can help in internal decompositions of alphanumeric values onto their smaller pieces grouped into sub-collections within a data pack. Each sub-collection can be represented by a more informative snapshot component, although some values may not follow discovered or predefined rules, which could decrease rough value precision.

Another problem is how to create row packs with precise rough values for multiple columns. Usually, rows are naturally organized with respect to columns correlated with the specifics of data generation. Rough values for such columns are often far more precise than for others. If, according to domain knowledge, the task is to reorganize rows to improve rough values for some of those other columns, then we need to be prepared for losing precision elsewhere.

There are many ways to evaluate row packs with respect to the rough values of their corresponding data packs [11], [12]. For example, for ad-hoc query scenarios, one may reorganize newly loaded rows in order to improve rough values over arbitrary columns, but with a penalty for too significant a loss of precision on other columns compared to previously loaded data. In other cases, following the idea of ranking columns by their expected role in queries [25], [26], one might consider row pack quality in terms of a weighted average precision of the involved rough values.
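The precision examples above can be formalized in many ways. The following sketch shows one possible formulation; the normalization by column-wide ranges and the choice of weights are illustrative assumptions, not the exact measures used in [11], [12].

def numeric_precision(pack_min, pack_max, col_min, col_max):
    """Inversely related to the max-min spread of a data pack, scaled by the column range."""
    col_range = col_max - col_min
    if col_range == 0:
        return 1.0
    return 1.0 - (pack_max - pack_min) / col_range

def low_cardinality_precision(distinct_in_pack, distinct_in_column):
    """Inversely related to the number of distinct values inside a data pack."""
    return 1.0 - (distinct_in_pack - 1) / max(distinct_in_column - 1, 1)

def row_pack_quality(precisions, weights):
    """Weighted average precision over the rough values of one row pack."""
    return sum(w * p for w, p in zip(weights, precisions)) / sum(weights)

# Example: a row pack described by one numeric and one low-cardinality column,
# with the numeric column ranked twice as important for expected queries.
p_num = numeric_precision(pack_min=10, pack_max=30, col_min=0, col_max=100)   # 0.8
p_low = low_cardinality_precision(distinct_in_pack=3, distinct_in_column=11)  # 0.8
print(row_pack_quality([p_num, p_low], weights=[2.0, 1.0]))                   # 0.8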
Fig. 5. A flow of loaded rows. The parser identifies values of particular data columns and checks them against available domain rules (see Section III). The granulator organizes parsed rows into row packets, which are then used to assemble row packs sent to the loader for rough value calculation and data pack compression. A new row pack is pushed to the loader once the granulator is ready to assemble it according to its constraints. Alternatively, the loader might request a new row pack from the granulator once it has finished processing a previous portion of data. All considered modules rely on the metadata layers discussed in Section II. The granulator's historical metadata summarizes the rough value precisions of already loaded data packs for particular data columns, which may influence strategies of evaluating row packets and final row packs. For instance, optimization with respect to columns whose rough values are especially difficult to improve because of the specifics of a loaded data stream can gradually become less important even if they were initially prioritized by domain experts.

OLD BUFFER CONTENT
  +------+------+------+
  |  A   |  B   |  C   |
  +------+------+------+
1 | '?'  |  3   |  1   |
  +------+------+------+
2 | '4'  | NULL |  2   |
3 | '1'  | NULL |  4   |
  +------+------+------+
  |      |      |      |
  +------+------+------+
4 | '4'  |  5   |  3   |
  +------+------+------+

A ROW TO BE ADDED
  +------+------+------+
5 | NULL | 1500 |  3   |
  +------+------+------+

NEW BUFFER CONTENT
  +------+------+------+
  |  A   |  B   |  C   |
  +------+------+------+
1 | '?'  |  3   |  1   |
  +------+------+------+
2 | '4'  | NULL |  2   |
3 | '1'  | NULL |  4   |
5 | NULL | 1500 |  3   |
  +------+------+------+
  |      |      |      |
  +------+------+------+
4 | '4'  |  5   |  3   |
  +------+------+------+

Fig. 6. An example of the process of adding rows to a row buffer. Consider a data table with a varchar column A and integer columns B and C. Note that most of the values of A are integers as well, which may be represented by a simple domain rule [13], [14]. All three types of outliers considered in Section IV are present. The buffer contains four row packets, aimed at gathering rows with outliers on A, B and C, as well as rows with no outliers at all. A row to be added has a NULL outlier on A and a so-called out-of-range outlier on B. The process needs to heuristically decide which row packet the new row should join. According to a simple evaluation based on min and max values over numeric columns [11], [12], it is added to the row packet collecting outliers for B, as it does not decrease rough value precision for C. On the other hand, it would severely decrease rough value precision for B when added to the row packet gathering outliers on A.
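A minimal sketch of the packet choice just described, assuming that only min/max spans over numeric columns are simulated; the evaluation functions actually used in [11], [12] are more elaborate, and the varchar column A is simply skipped here.

def widening(pack_spans, row, skip_col):
    """Total relative growth of the packet's min/max spans caused by adding the row,
    computed over numeric columns other than the packet's dedicated outlier column."""
    cost = 0.0
    for col, value in row.items():
        if col == skip_col or not isinstance(value, (int, float)):
            continue   # the dedicated column, NULLs and non-numeric values are skipped
        lo, hi = pack_spans.get(col, (value, value))
        new_lo, new_hi = min(lo, value), max(hi, value)
        cost += ((new_hi - new_lo) - (hi - lo)) / ((hi - lo) or 1.0)
    return cost

def choose_packet(packets, row, outlier_cols):
    """packets maps a dedicated outlier column to the current min/max spans of its packet;
    candidates are the packets dedicated to columns on which the new row is an outlier."""
    candidates = [c for c in packets if c in outlier_cols]
    return min(candidates, key=lambda c: widening(packets[c], row, skip_col=c))

# Row 5 from Figure 6: NULL outlier on A, out-of-range outlier on B, regular value on C.
packets = {
    "A": {"B": (3, 3), "C": (1, 1)},   # packet gathering outliers on A (row 1)
    "B": {"C": (2, 4)},                # packet gathering outliers on B (rows 2 and 3)
}
row = {"A": None, "B": 1500, "C": 3}
print(choose_packet(packets, row, outlier_cols={"A", "B"}))   # -> B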
IV. Dynamic Granulation of Complex Data

Data granulation algorithms should work over a stream of loaded data with no need of re-accessing already processed granules. This is in analogy to data clustering and micro-clustering techniques relying on in-memory structures, which receive new elements and dynamically reorganize them into final result components [5], [27]. As displayed in Figure 5, such a repackaging stage should be located between the modules responsible for parsing incoming rows and compressing the resulting row packs.

In [11], [12], we considered a data granulator with a very simple buffer structure containing packets interpreted as incomplete row packs. For each new entry, the choice of row packet was based on simulating the decrease of weighted rough value precision of particular packets that would result from adding the considered row. Every time 2^16 rows were collected in one of the row packets, its contents were sent to further stages in order to finalize the next row pack, making space for receiving more data from the parser.

If a new row could not match any packet without too drastic a decrease of rough value precisions in comparison to the history of previous matches, it was moved to a trash packet producing row packs of lower quality. Such an approach turned out to be useful for handling data fluctuations, noise and outliers [28], [29]. However, some issues may occur with a growing intensity and variety of outliers. If there are too many cases of values forcing their corresponding rows to be sent to trash, the average quality of produced row packs becomes insufficient.

In the case of machine-generated data sets, besides out-of-range values not fitting column distributions, there are also structural outliers discussed in Section III. NULL values can be treated as outliers too. Although there is no problem with handling NULL values at the data storage level [13], [17], their occurrence may block some benefits of computations over high-level representations of data packs.

In order to better handle rows with outlier values, let us redefine the objectives of Infobright's data granulation process by means of the two following levels: 1) Organize rows into row packs in such a way that the percentage of data packs including outliers is minimal; 2) Additionally, maximize the rough value precisions computed for sub-collections of non-outlier values within each of the data packs.

Figure 6 outlines a simple way of choosing a packet for each new row. Assuming specified packets responsible for gathering rows with outliers over each single data column, the remaining question is how to deal with rows with multiple outliers. For this particular case, we can follow the same heuristic approach as proposed in [11], [12], now restricted to sub-collections of non-outlier values.

Figure 7 shows a mechanism of selecting rows from one or many data packets in order to assemble row packs according to the above objectives. Such a mechanism should also simulate the quality of the buffer content after removing selected rows, i.e., its expected ability to form good row packs in the next iterations. Moreover, it requires introducing more advanced data structures within each of the data packets, to retrieve their specific fragments efficiently.
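The heuristic rules listed in the caption of Figure 7 below can be sketched roughly as follows. The 2^16 pack size and the greedy selection are simplifying assumptions; in particular, rule 1 would in a more faithful implementation pick the subset maximizing rough value precision over the other columns, not simply the first rows, as in [11], [12].

PACK_SIZE = 2 ** 16

def form_row_pack(packets, pack_size=PACK_SIZE):
    """packets maps a dedicated outlier column (or None for outlier-free rows) to its rows.
    Returns the rows selected for the next row pack and removes them from the buffer."""
    # Rule 1: one packet already holds a full pack of rows with outliers on a single
    # column, so the whole pack can be cut from it, keeping those outliers together.
    for col, rows in packets.items():
        if col is not None and len(rows) >= pack_size:
            selected, packets[col] = rows[:pack_size], rows[pack_size:]
            return selected
    # Rule 2: otherwise, assemble the pack from as few packets as possible, preferring
    # outlier-free rows, so that as few columns as possible carry outliers in the pack.
    selected = []
    for col in sorted(packets, key=lambda c: (c is not None, -len(packets[c]))):
        take = min(pack_size - len(selected), len(packets[col]))
        selected.extend(packets[col][:take])
        packets[col] = packets[col][take:]
        if len(selected) == pack_size:
            break
    return selected

buffer = {
    None: [{"A": "4", "B": 5, "C": 3}],                    # rows with no outliers
    "A":  [{"A": "?", "B": 3, "C": 1}],                    # outliers on A
    "B":  [{"A": "4", "B": None, "C": 2},
           {"A": "1", "B": None, "C": 4},
           {"A": None, "B": 1500, "C": 3}],                # outliers on B
}
print(form_row_pack(buffer, pack_size=2))   # rule 1 fires: two rows from the B packet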
Fig. 7. Row pack production. The buffer is being filled with newly parsed rows until it reaches its maximum capacity (e.g., 2^18 rows) or the loaded data stream is finished. In both cases, a procedure of forming a new row pack is launched. A heuristic algorithm decides which rows, and from which row packets, are going to be used. Selected rows are deleted from their packets and sent to the loader. One can consider the following examples of heuristic rules for the task of locating outliers in a possibly minimum number of produced data packs: 1) If there is a row packet with more than 2^16 rows having outlier values on one of the columns, then take a subset of 2^16 rows with possibly highest precision of rough values over other columns; 2) Else, compose a row pack from a subset of rows taken from multiple row packets, trying to maximize the number of columns with no outliers.

Fig. 8. Comparison of data sorting and granulation [12]. All three data tables contain the same rows. For simplicity, each of the row packs in the last table contains only two elements. The first two tables show that sorting with respect to particular columns destroys the rough values of others. The last table shows that a more thoughtful granulation provides a better balance between the precisions of rough values corresponding to different columns. Rough values get more precise whenever data pack elements become even slightly more homogeneous, at least for some columns on some row packs. Compared to classical clustering criteria in data mining, the value ranges of row packs can intersect with each other. Compared to other techniques of database tuning, the proposed granulation objectives are less rigorous, with reorganization of already loaded content (subject to evolving data regularities or end user expectations) treated as optional rather than compulsory.

V. Related Work

In analytic databases, various forms of summaries are commonly used at the level of query optimization [25], [30]. However, their utilization during query execution is usually restricted to simple operations aimed at avoiding access to irrelevant data while resolving WHERE conditions [18], [21]. Operations on data summaries are more popular in the area of granular computing [1], [4]. However, granular computation models usually do not assume iterative work with both high-level representations and original data. In this respect, Infobright's approach to dynamic enhancement of SQL execution is quite unique.

A need for processing massive machine-generated data sets is well recognized, with respect to both their size and mixed content [6], [29]. However, it is usually assumed that values of particular columns follow unique generic types. For practical reasons, allowing collections of compound alphanumeric values to get internally decomposed into more homogeneous sub-granules is worth considering.

In both of the above cases, it is crucial to take advantage of available knowledge about data sources and analytical goals. At the system design level, this requires a distinction between data providers, who have domain expertise that they usually want to share, and end users, who do not need to be aware how such expertise influences internal computations [14], [31]. There is significant research on how to adaptively translate feedback from both data providers and end users into logical and physical specifications of database models [32], [33]. However, there are no interfaces letting domain experts share their knowledge without the need of mastering database administration tools.

There are many techniques that can be applied to improve machine-generated data layout. For instance, one might sort rows with respect to columns expected to occur most frequently in SQL statements, creating multiple data projections ordered in different ways and, while querying, choosing those optimally fitting particular data operations [16], [26]. However, sorting may not be so beneficial for long varchar columns. Moreover, it is often better to avoid such restrictive data organization, which comes at a significant computational cost (see also Figure 8).

The presented data granulation framework has some obvious analogies to stream clustering [5], [27]. In data mining, clustering is understood as approximately organizing rows into practically meaningful groups [34], [35]. Although our criteria for putting rows into row packs are different, it is still a good example of adopting data mining principles in order to improve database internals.

Last but not least, it is important to refer to the rich literature on handling outliers in data clustering and stream analysis [28], [36]. Although our way of understanding outliers is not so standard, the proposed strategy of locating them in a minimum number of data packs while maximizing rough value precision over other columns could be regarded as a kind of outlier-robust micro-clustering.
VI. Conclusions

We discussed various aspects of intelligent data granulation within Infobright's framework for database analytics. We focused on the dynamic organization of tabular data rows into row packs whose summaries would be most efficient while assisting standard SQL operations with fast granular calculations. We noted that for machine-generated data sets the usefulness of such summaries depends primarily on the ability to isolate some specific types of outliers from more regular content during the data load process.

Actually, such a kind of granulation may be considered not only for newly loaded entries. In applications related to machine-generated data processing, there is often a need to produce and transmit huge query results obtained as partial aggregations or projections of the original content. Appropriate dynamic granulation of query outputs may be highly beneficial both with respect to their compression and representation. For such scenarios, in our future research, we are going to take advantage of metadata describing query inputs, which was unavailable in the case of an on-load organization of brand new data.

The proposed framework is also worth extending toward more compound types of data columns, with values fitting different patterns or regularities for different rows. Such internally heterogeneous columns can occur in data sets resulting from, e.g., integration or tokenization of original sources. It is then important to locate values of such different kinds in possibly separate data packs, which is obviously not an easy task when optimizing row packs with respect to multiple columns at the same time.

References

[1] W. Pedrycz, Granular Computing – Analysis and Design of Intelligent Systems. CRC Press, 2013.
[2] L. A. Zadeh, Computing with Words – Principal Concepts and Ideas, ser. Studies in Fuzziness and Soft Computing. Springer, 2012, vol. 277.
[3] S. K. Pal, L. Polkowski, and A. Skowron, Eds., Rough-neurocomputing: Techniques for Computing with Words. Springer, 2003.
[4] W. Pedrycz, A. Skowron, and V. Kreinovich, Eds., Handbook of Granular Computing. Wiley-Interscience, 2008.
[5] C. C. Aggarwal, Ed., Data Streams: Models and Algorithms. Springer, 2007.
[6] U. Fayyad, P. Bradley, and C. Reina, "A Scalable System for Clustering of Large Databases having Mixed Data Attributes," US Patent 6,581,058 B1, 2003.
[7] Z. Pawlak and A. Skowron, "Rudiments of Rough Sets," Inf. Sci., vol. 177, no. 1, pp. 3–27, 2007.
[8] Z. Pawlak and A. Skowron, "Rough Sets: Some Extensions," Inf. Sci., vol. 177, no. 1, pp. 28–40, 2007.
[9] D. Ślȩzak and V. Eastwood, "Data Warehouse Technology by Infobright," in SIGMOD, 2009, pp. 841–846.
[10] D. Ślȩzak, J. Wróblewski, V. Eastwood, and P. Synak, "Brighthouse: An Analytic Data Warehouse for Ad-hoc Queries," Proc. VLDB Endow., vol. 1, no. 2, pp. 1337–1345, 2008.
[11] D. Ślȩzak and M. Kowalski, "Intelligent Data Granulation on Load: Improving Infobright's Knowledge Grid," in FGIT, 2009, pp. 12–25.
[12] D. Ślȩzak, M. Kowalski, V. Eastwood, and J. Wróblewski, "Methods and Systems for Database Organization," US Patent 8,266,147 B2, 2012.
[13] M. Kowalski, D. Ślȩzak, G. Toppin, and A. Wojna, "Injecting Domain Knowledge into RDBMS – Compression of Alphanumeric Data Attributes," in ISMIS, 2011, pp. 386–395.
[14] D. Ślȩzak, G. Toppin, M. Kowalski, and A. Wojna, "System and Method for Managing Metadata in a Relational Database," US Patent Application 2011/0307472 A1, 2011.
[15] S. Idreos, F. Groffen, N. Nes, S. Manegold, K. S. Mullender, and M. L. Kersten, "MonetDB: Two Decades of Research in Column-oriented Database Architectures," IEEE Data Eng. Bull., vol. 35, no. 1, pp. 40–45, 2012.
[16] M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik, "C-Store: A Column-oriented DBMS," in VLDB, 2005, pp. 553–564.
[17] C. Apanowicz, V. Eastwood, D. Ślȩzak, P. Synak, A. Wojna, M. Wojnarski, and J. Wróblewski, "Method and System for Data Compression in a Relational Database," US Patent Application 2008/0071818 A1, 2008.
[18] P. White and C. French, "Database System with Methodology for Storing a Database Table by Vertically Partitioning all Columns of the Table," US Patent 5,794,229, 1998.
[19] A. Ailamaki, D. J. DeWitt, and M. D. Hill, "Data Page Layouts for Relational Databases on Deep Memory Hierarchies," VLDB J., vol. 11, no. 3, pp. 198–215, 2002.
[20] M. Zukowski and P. A. Boncz, "Vectorwise: Beyond Column Stores," IEEE Data Eng. Bull., vol. 35, no. 1, pp. 21–27, 2012.
[21] R. Grondin, E. Fadeitchev, and V. Zarouba, "Searchable Archive," US Patent 7,243,110, 2007.
[22] J. K. Metzger, B. M. Zane, and F. D. Hinshaw, "Limiting Scans of Loosely Ordered and/or Grouped Relations Using Nearly Ordered Maps," US Patent 6,973,452, 2005.
[23] P. Synak, "Rough Set Approach to Optimisation of Subquery Execution in Infobright Data Warehouse," in SCKT (PRICAI Workshop), 2008.
[24] D. Ślȩzak, P. Synak, A. Wojna, and J. Wróblewski, "Two Database-related Interpretations of Rough Approximations: Data Organization and Query Execution," Fundamenta Informaticae, 2013.
[25] N. Bruno, S. Chaudhuri, and L. Gravano, "STHoles: A Multidimensional Workload-Aware Histogram," in SIGMOD, 2001, pp. 211–222.
[26] A. Rasin, S. Zdonik, O. Trajman, and S. Lawande, "Automatic Vertical-Database Design," WO Patent Application 2008/016877 A3, 2008.
[27] M. Charikar, C. Chekuri, T. Feder, and R. Motwani, "Incremental Clustering and Dynamic Information Retrieval," SIAM J. Comput., vol. 33, no. 6, pp. 1417–1440, 2004.
[28] C. C. Aggarwal, Outlier Analysis. Springer, 2013.
[29] A. Deligiannakis, Y. Kotidis, V. Vassalos, V. Stoumpos, and A. Delis, "Another Outlier Bites the Dust: Computing Meaningful Aggregates in Sensor Networks," in ICDE, 2009, pp. 988–999.
[30] B. J. Oommen and L. G. Rueda, "The Efficiency of Histogram-like Techniques for Database Query Optimization," Comput. J., vol. 45, no. 5, pp. 494–510, 2002.
[31] L. T. Moss and S. Atre, Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-support Applications. Addison-Wesley, 2003.
[32] S. Chaudhuri and V. R. Narasayya, "Self-Tuning Database Systems: A Decade of Progress," in VLDB, 2007, pp. 3–14.
[33] S. Idreos, "Cracking Big Data," ERCIM News, vol. 2012, no. 89, 2012.
[34] G. Peters, F. Crespo, P. Lingras, and R. Weber, "Soft Clustering – Fuzzy and Rough Approaches and Their Extensions and Derivatives," Int. J. Approx. Reasoning, vol. 54, no. 2, pp. 307–322, 2013.
[35] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[36] C. Böhm, C. Faloutsos, and C. Plant, "Outlier-robust Clustering using Independent Components," in SIGMOD, 2008, pp. 185–198.
