Dominik Ślęzak
University of Warsaw
Abstract—We discuss how the specifics of the data granulation methodology can influence the performance of Infobright's database system. We put together our two previous research paths related to machine-generated data sets, namely, dynamic reorganization of data during load and efficient handling of alphanumeric columns with compound values. We emphasize the role of domain knowledge in tuning data granulation processes.

Index Terms—Analytic Databases, Domain Knowledge, Compound Values, Stream Clustering, Outliers.

I. Introduction

Information granulation is one of the fundamental aspects of granular computing [1], [2]. Granules can correspond to collections of entities gathered during, e.g., data acquisition or generation. Such collections may contain elements arranged together due to, e.g., their adjacency or similarity. Operations on granules, or rather on their high-level representations, underlie many applications of approximate reasoning and hierarchical concept modeling [3], [4].

Via calculations on granules, one can scale operations over large and complex data. For instance, in data mining, a popular approach is to search for meaningful clusters based on statistical summaries of heuristically pre-computed micro-clusters [5], [6]. There is no guarantee that clusters obtained by grouping together micro-clusters with similar representations will reflect all nuances of the input data. Still, the computational savings of operating with summaries instead of the real data are often worth it.

Representations of dynamically assembled granules can also be useful in database systems aimed at maintaining the speed of data analysis in the face of massive data generated by various types of sensors and devices. Following the principles of rough sets [7], [8], one can compute approximate or – if a system is designed to store both statistical snapshots and original contents of granules – exact answers to standard SQL queries [9], [10]. Generally, scaling computations may mean both speeding up typical operations thanks to more intelligent work with granulated data and introducing approximate variants of those operations, analogously to the above-mentioned analysis of micro-cluster representations.

Regardless of whether we consider a database, a data clustering framework, or any other form of employing granular computations, their efficiency will depend on the quality of granules. In [11], [12], we investigated how to define the quality of granulation for a database system performing operations with the assistance of a metadata layer containing simplified representations of granules resulting from vertical and horizontal data decompositions. We also showed how to implement a mechanism of organizing the incoming tabular data into blocks of rows that maximize the quality of their granular representations.

On the other hand, it is not so obvious how to translate the quality of granules into the quality of the whole system. If the goal is to improve database performance, we can judge a given granular environment by its ability to reduce the computational effort related to accessing detailed granule contents instead of working only with their high-level representations. However, this ability may significantly depend on domain knowledge about, e.g., the specific types of analytic SQL statements that end users may wish to execute.

Domain knowledge is also useful within single granules. Information granulation is usually assumed to involve numeric content, but this is not the case for analytic databases storing machine-generated data. In [13], [14], we focused on data sets with varchar columns corresponding to sub-collections of values fitting different structural patterns identified by data providers, representing web addresses, online transactions and so on. We designed new methods of creating structure-based statistics for such granules.

The primary objective of this paper is to put together our methods of utilizing domain knowledge to better annotate such internally compound granules with our previously developed methods of controlling the flow of loaded rows, in order to provide a more complete framework for handling massive amounts of machine-generated data.

The paper is organized as follows: In Section II, we recall Infobright's system architecture, which is an example of a more general approach to granular computational scalability. In Section III, we illustrate how domain knowledge about data content and end user activities can shape the principles of data granulation optimization. In Section IV, we revisit our previous algorithmic framework for on-load data granulation with respect to three types of outliers usually occurring in machine-generated data. In Section V, we outline related work in the areas of databases and data mining. Section VI concludes our research.
Fig. 1. Loading and querying in the Infobright database system. Data is partitioned into row packs, each consisting of 2^16 rows. Values of each data column are stored separately, in data packs. Row packs are labeled with rough values providing high-level information about data packs. This leads to a layer of granulated information systems where objects correspond to row packs and attributes correspond to rough values. Rough values and compressed data packs are stored on disk. When querying, applicable rough values are loaded into memory. Query performance depends on the efficiency of rough values in minimizing the access to data packs stored on disk. In the case of, e.g., data filtering, rough values are applied to quickly exclude data packs that do not satisfy a given SQL condition. Only the packs that are not excluded need to be decompressed and examined value by value.

Fig. 2. SELECT MAX(A) FROM T WHERE B>15; [10]. Min/max rough values for numeric data columns A and B in table T are presented on the left side. Data packs are classified into three categories: Irrelevant (I), with no elements relevant for further execution; Relevant (R), with all elements relevant for further execution; and Suspect (S), which cannot be classified as either R or I. The R/I/S areas correspond to the rough set positive, negative and boundary regions. The first execution stage classifies data packs into R/I/S regions with respect to the condition B>15. The second stage employs the third row pack to approximate the result as MAX(A) ≥ 18. Hence, all row packs except the first and the third become unnecessary to complete the task. At the third stage, the approximation is changed to MAX(A) ≥ X, where X depends on the result of exact (E) processing of the first row pack. If X ≥ 22, then there is no need to access the third row pack.
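The staged execution from Fig. 2 can be sketched in a few lines. The pack statistics and the plain min/max representation below are illustrative assumptions chosen to mirror the figure's logic, not data or code taken from the actual system:

```python
# Sketch of rough-set-style pack classification for
# SELECT MAX(A) FROM T WHERE B > 15 over min/max rough values.

from dataclasses import dataclass

@dataclass
class Pack:
    a_min: int
    a_max: int   # min/max rough values for column A
    b_min: int
    b_max: int   # min/max rough values for column B

def classify(pack: Pack, threshold: int) -> str:
    """R/I/S status of a pack with respect to the condition B > threshold."""
    if pack.b_min > threshold:
        return "R"   # every row satisfies the filter
    if pack.b_max <= threshold:
        return "I"   # no row satisfies the filter
    return "S"       # suspect: must be decompressed to decide

packs = [Pack(0, 25, 10, 30),   # S: B straddles the threshold
         Pack(1, 10, 0, 12),    # I: B never exceeds 15
         Pack(5, 18, 16, 40)]   # R: B always exceeds 15

statuses = [classify(p, 15) for p in packs]

# Lower bound: A's max over an R pack is attained by a relevant row.
lower = max(p.a_max for p, s in zip(packs, statuses) if s == "R")

# Only suspect packs that could still beat the bound need exact access.
needed = [p for p, s in zip(packs, statuses)
          if s == "S" and p.a_max > lower]
```

With these numbers, only the first pack remains to be decompressed, matching the figure's observation that all other row packs become unnecessary once the bound MAX(A) ≥ 18 is known.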
II. Infobright – A Granular Database System

The approach presented in this section is an example of a general strategy of scaling computations on large data sets. Following the principles of rough sets and granular computing [1], [8], this methodology can be expressed by the following four steps: 1) Decompose data into granules; 2) Create statistical snapshots for each granule; 3) Do approximate computations on the snapshots; 4) Whenever there is no other choice, access some of the granules.

Infobright's engine is entirely based on the above strategy, in combination with the principles of columnar data stores [15], [16], adaptive data compression [17], [18], and pipelined query processing [19], [20]. Figure 1 outlines a scheme of decomposing data into so-called data packs and producing their high-level representations, called rough values. It also illustrates a simple way of employing rough values in query execution, analogous to some other cases of using statistics over partitioned data [21], [22].

There are many ways of using rough values in database operations [9], [23]. Figure 2 shows the importance of interactions between the layers of approximate computing with rough values and exact computing with data pack contents. In the case of queries involving many columns, it is important that we know which rough values describe the same blocks of rows, called row packs. This is helpful when resolving, e.g., multidimensional ad-hoc filters. On the other hand, it becomes a challenge when attempting to organize rows into row packs in a way that would improve rough value descriptions for many columns at the same time.

Infobright's rough values can be treated as a metadata layer. Besides the min/max statistics displayed above, they may contain, e.g., a sum of numeric values or total length of string values, a number of NULL values, and so on [9], [14]. The efficiency of rough values can be measured in two ways. First of all, the overall size of the statistics maintained within rough values needs to be far smaller than the size of the underlying data. Secondly, rough values should be able to prevent accessing data regions that would be unnecessary for answering a given query if one had direct access to each single data element. In other words, there should be a low risk of so-called false suspects, which are data packs classified as suspects that turn out to be fully relevant or irrelevant after accessing them.

The remaining metadata layers are responsible for accessing and interpreting the contents of data packs and rough values. They may contain, e.g., dictionary structures for columns with a relatively small number of unique values, or the already-mentioned domain-specific descriptions of values of long varchar data columns. All such kinds of metadata need to be consistently taken into account while parsing, granulating and compressing newly loaded data.

Metadata also needs to be considered while designing and interpreting rough values. However, we assume that rough values' functionality, such as the ability to identify rough-set-inspired regions as illustrated by Figure 2, should remain the same for all kinds of internal structures and high-level descriptions used for data packs.
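The notion of false suspects can be made concrete with a toy statistic. Real rough values are richer than this; the bucket-aligned bounds below are an invented coarsening, introduced solely to make false suspects observable (exact per-pack min/max would never produce one for a simple one-sided filter):

```python
# Toy illustration of a false suspect: a pack classified as suspect
# by its (coarsened) rough value although its exact content is fully
# relevant for the filter 'value > threshold'.

def coarse_summary(values, bucket=10):
    """Store only bucket-aligned bounds to keep metadata small."""
    lo = (min(values) // bucket) * bucket
    hi = -((-max(values)) // bucket) * bucket   # round up to a bucket edge
    return lo, hi

def status(lo, hi, threshold):
    """R/I/S classification of a pack w.r.t. 'value > threshold'."""
    if lo > threshold:
        return "R"
    if hi <= threshold:
        return "I"
    return "S"

vals = [26, 27, 28]
lo, hi = coarse_summary(vals)               # coarsened to (20, 30)
s = status(lo, hi, 25)                      # "S": pack gets accessed
truly_relevant = all(v > 25 for v in vals)  # True: a false suspect
```

The trade-off sketched here is exactly the one named in the text: the smaller the stored statistics, the higher the risk that a pack is needlessly decompressed.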
Fig. 3. Infobright's granulation parameters [24]. Granularity controls the number of rows in every row pack (2^16 by default). Finer granularity yields higher rough value precision and more selective data accessibility, but also increases the overall metadata size. Origin reflects scenarios where data is acquired from different sources by data load processors, which parse, granulate and compress data locally, and send the assembled data packs and their corresponding rough values to the main database server. In such a case, row packs gathered in the main server are homogeneous with respect to attributes denoting data origin. Quality stands for the on-load row reorganization procedures aimed at improving row packs with respect to the precision of rough values over some selected data columns.

Fig. 4. Utilizing knowledge about attribute values [13]. Rules stored in the structure dictionary enable parsing values into smaller particles. The match table records which rules were triggered for which values. Structure sub-collections contain values matched by the same rule. Each sub-collection is handled independently, within a single data pack. The outlier sub-collection contains items that were not matched by any rule. For example, table T includes a varchar column MaybeURL. The structure dictionary for MaybeURL contains a rule defining how strings representing URL addresses are usually composed. The match table contains the following codes: 1 if a value is recognized as a URL; 0 for NULL values; -1 for unrecognized values. Sub-collections of URL addresses are then decomposed with respect to their substrings representing the scheme, hierarchical part and so on. Such substrings are often easier to represent at the level of rough values.
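The match-table mechanism of Fig. 4 can be sketched as follows. The regular expression standing in for a structure-dictionary rule is an assumption made for illustration; Infobright's actual rule machinery is not reproduced here:

```python
# Toy version of Fig. 4's structure dictionary for a MaybeURL column:
# code each value (1 = URL, 0 = NULL, -1 = unrecognized) and decompose
# recognized URLs into scheme / hierarchical-part sub-collections.

import re

URL_RULE = re.compile(r"(?P<scheme>[a-z]+)://(?P<hier>\S+)")

def match_code(value):
    """Match-table codes from Fig. 4."""
    if value is None:
        return 0
    return 1 if URL_RULE.fullmatch(value) else -1

def decompose(value):
    """Split a recognized URL into its sub-collection components."""
    m = URL_RULE.fullmatch(value)
    return (m.group("scheme"), m.group("hier"))

values = ["http://example.com/a", None, "not a url", "ftp://host/x"]
codes = [match_code(v) for v in values]
parts = [decompose(v) for v, c in zip(values, codes) if c == 1]
```

Each component stream (all schemes together, all hierarchical parts together) is far more homogeneous than the raw column, which is why such substrings are easier to represent at the rough value level.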
III. Importance of Domain Knowledge

Employing domain knowledge is crucial at each step of the computational strategy outlined in Section II. For instance, when creating granules, one needs to develop fast decomposition methods assuring reasonable granular quality with respect to dynamically changing criteria. As another example, we need to guarantee that snapshots of granules are small but also informative enough, where the criterion of being informative is again subject to the specifics of the information needed in expected data operations.

In the case of Infobright, informativeness corresponds to how precisely rough values describe data packs. Indeed, more precise descriptions quickly lead to better approximations of computations. Figure 3 illustrates factors that can influence rough value precision, with one of them referring to the on-load row reorganization developed in [11], [12].

Row reorganization means that row packs are formed from not necessarily consecutive rows. Besides some obvious computational complexity challenges addressed in the next section, the two remaining questions are how to make a transition from the precision of single rough values to the precision of row packs represented by rough values corresponding to multiple data columns, and how to mathematically grasp rough value precision for various data content types.

Let us give just a few examples. For numeric columns, rough value precision can be inversely proportional to the differences between max and min shown in Figure 2. For so-called low cardinality columns, it may be inversely proportional to the number of distinct values in a data pack, which corresponds to generalized decision functions used in rough sets [7], [24]. However, the general task of expressing rough value precision is certainly far more difficult.

In [13], [14], we investigated how knowledge about data origin may assist in seeking the optimal storage and representation of machine-generated columns with compound values. Figure 4 illustrates how so-called domain rules can help in internal decompositions of alphanumeric values into smaller pieces grouped into sub-collections within a data pack. Each sub-collection can be represented by a more informative snapshot component, although some values may not follow the discovered or predefined rules, which could decrease rough value precision.

Another problem is how to create row packs with precise rough values for multiple columns. Usually, rows are naturally organized with respect to columns correlated with the specifics of data generation. Rough values for such columns are often far more precise than for others. If, according to domain knowledge, the task is to reorganize rows to improve rough values for some of those other columns, then we need to be prepared for losing precision elsewhere.

There are many ways to evaluate row packs with respect to the rough values of their corresponding data packs [11], [12]. For example, for ad-hoc query scenarios, one may reorganize newly loaded rows in order to improve rough values over arbitrary columns, but with a penalty for too significant a loss of precision for other columns compared to previously loaded data. In other cases, following the idea of ranking columns according to their expected role in queries [25], [26], one might consider row pack quality in terms of a weighted average precision of the involved rough values.
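The precision measures named above can be written down directly. The exact formulas (including the +1 smoothing and the weights) are illustrative choices for this sketch, not the ones used in [11], [12]:

```python
# Sketch of per-column rough value precision and a weighted row pack
# quality, following the column-ranking idea mentioned in the text.

def numeric_precision(vmin, vmax):
    """Inversely proportional to the pack's min-max spread."""
    return 1.0 / (1 + vmax - vmin)

def low_cardinality_precision(values):
    """Inversely proportional to the number of distinct values."""
    return 1.0 / len(set(values))

def row_pack_quality(precisions, weights):
    """Weighted average precision over the pack's columns."""
    total = sum(weights)
    return sum(p * w for p, w in zip(precisions, weights)) / total

q = row_pack_quality(
    [numeric_precision(3, 7),                # spread 4 -> 0.2
     low_cardinality_precision([1, 1, 2])],  # 2 distinct -> 0.5
    weights=[1.0, 3.0])
```

Ranking columns simply means assigning larger weights to columns expected to appear in filters more often, so that losing precision there hurts the quality score more.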
Fig. 5. A flow of loaded rows. The parser identifies values of particular data columns and checks them against available domain rules (see Section III). The granulator organizes parsed rows into row packets, which are then used to assemble row packs sent to the loader for rough value calculation and data pack compression. A new row pack is pushed to the loader once the granulator is ready to assemble it according to its constraints. Alternatively, the loader might request a new row pack from the granulator once it has finished processing a previous portion of data. All considered modules rely on the metadata layers discussed in Section II. The granulator's historical metadata summarizes the rough value precisions of already loaded data packs for particular data columns, which may influence the strategies of evaluating row packets and final row packs. For instance, optimization with respect to columns whose rough values are especially difficult to improve because of the specifics of a loaded data stream can gradually become less important, even if they were initially prioritized by domain experts.

OLD BUFFER CONTENT
+------+------+------+
|  A   |  B   |  C   |
+------+------+------+
1 | ’?’  |  3   |  1   |
+------+------+------+
2 | ’4’  | NULL |  2   |
3 | ’1’  | NULL |  4   |
+------+------+------+
|      |      |      |
+------+------+------+
4 | ’4’  |  5   |  3   |
+------+------+------+

A ROW TO BE ADDED
+------+------+------+
5 | NULL | 1500 |  3   |
+------+------+------+

NEW BUFFER CONTENT
+------+------+------+
|  A   |  B   |  C   |
+------+------+------+
1 | ’?’  |  3   |  1   |
+------+------+------+
2 | ’4’  | NULL |  2   |
3 | ’1’  | NULL |  4   |
5 | NULL | 1500 |  3   |
+------+------+------+
|      |      |      |
+------+------+------+
4 | ’4’  |  5   |  3   |
+------+------+------+

Fig. 6. An example of the process of adding rows to a row buffer. Consider a data table with a varchar column A and integer columns B and C. Note that most of the values of A are integers as well, which may be represented by a simple domain rule [13], [14]. All three types of outliers considered in Section IV are present. The buffer contains four row packets, aimed at gathering rows with outliers on A, B and C, as well as rows with no outliers at all. The row to be added has a NULL outlier on A and a so-called out-of-range outlier on B. The process needs to heuristically decide which row packet the new row should join. According to a simple evaluation based on min and max values over numeric columns [11], [12], it is added to the row packet collecting outliers for B, as it does not decrease the rough value precision for C. On the other hand, it would severely decrease the rough value precision for C if added to the row packet gathering outliers on A.

IV. Dynamic Granulation of Complex Data
Data granulation algorithms should work over a stream of loaded data with no need to re-access already processed granules. This is in analogy to data clustering and micro-clustering techniques relying on in-memory structures, which receive new elements and dynamically reorganize them into final result components [5], [27]. As displayed in Figure 5, such a repackaging stage should be located between the modules responsible for parsing incoming rows and compressing the resulting row packs.

In [11], [12], we considered a data granulator with a very simple buffer structure containing packets interpreted as incomplete row packs. For each new entry, the choice of a row packet was based on simulating the decrease of weighted rough value precision of particular packets that would result from adding the considered row. Every time 2^16 rows were collected in one of the row packets, its contents were sent to further stages in order to finalize the next row pack, making space for receiving more data from the parser.

If a new row could not match any packet without too drastic a decrease of rough value precisions in comparison to the history of previous matches, it was moved to a trash packet producing row packs of lower quality. Such an approach turned out to be useful for handling data fluctuations, noise and outliers [28], [29]. However, some issues may occur with a growing intensity and variety of outliers. If there are too many cases of values forcing their corresponding rows to be sent to trash, the average quality of produced row packs becomes insufficient.

In the case of machine-generated data sets, besides out-of-range values not fitting column distributions, there are also the structural outliers discussed in Section III. NULL values can be treated as outliers too. Although there is no problem with handling NULL values at the data storage level [13], [17], their occurrence may block some benefits of computations over high-level representations of data packs.

In order to better handle rows with outlier values, let us redefine the objectives of Infobright's data granulation process by means of the two following levels: 1) Organize rows into row packs in such a way that the percentage of data packs including outliers is minimal; 2) Additionally, maximize the rough value precisions computed for sub-collections of non-outlier values within each data pack.

Figure 6 outlines a simple way of choosing a packet for each new row. Assuming specified packets responsible for gathering rows with outliers over each single data column, the remaining question is how to deal with rows with multiple outliers. For this particular case, we can follow the same heuristic approach as proposed in [11], [12], now restricted to the sub-collections of non-outlier values.

Figure 7 shows a mechanism of selecting rows from one or many data packets in order to assemble row packs according to the above objectives. Such a mechanism should also simulate the quality of the buffer content after removing the selected rows, i.e., its expected ability to form good row packs in subsequent iterations. Moreover, it requires introducing more advanced data structures within each of the data packets, to retrieve their specific fragments efficiently.
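The packet-choice heuristic described above can be sketched as follows. The min/max-based loss function, the numeric-only columns and the trash threshold are illustrative simplifications, not the evaluation actually used in [11], [12]:

```python
# Sketch of choosing a row packet for a new row: simulate how much
# each packet's per-column min-max bounds would have to widen, pick
# the least harmful packet, and divert hopeless rows to trash.

def width_loss(packet, row):
    """Total growth of per-column min-max spreads if row joins packet."""
    loss = 0
    for col, v in enumerate(row):
        lo, hi = packet[col]
        loss += max(lo - v, 0) + max(v - hi, 0)
    return loss

def choose_packet(packets, row, trash_threshold=100):
    """Index of the best packet, or "trash" if every option is too costly."""
    losses = [width_loss(p, row) for p in packets]
    best = min(range(len(packets)), key=lambda i: losses[i])
    return best if losses[best] <= trash_threshold else "trash"

# Two packets over two numeric columns; bounds given as (min, max).
packets = [[(0, 10), (0, 10)], [(90, 100), (0, 5)]]
target = choose_packet(packets, (95, 3))   # fits the second packet
```

A row such as (50, 500), which would severely widen the bounds of every packet, exceeds the threshold and lands in the trash packet, mirroring the lower-quality fallback discussed above.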
Fig. 8. Comparison of data sorting and granulation [12]. All three data tables contain the same rows. For simplicity, each of the row packs in the last table contains only two elements. The first two tables show that sorting with respect to particular columns destroys the rough values of others. The last table shows that more thoughtful granulation provides a better balance between the precisions of rough values corresponding to different columns. Rough values get more precise whenever data pack elements become even slightly more homogeneous, at least for some columns on some row packs. Compared to classical clustering criteria in data mining, the value ranges of row packs can intersect with each other. Compared to other techniques of database tuning, the proposed granulation objectives are less rigorous, with reorganization of already loaded content (subject to evolving data regularities or end user expectations) treated as optional rather than compulsory.
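The contrast drawn in Fig. 8 can be reproduced with a numeric toy. The data and the two-row pack size (as in the figure's last table) are invented for illustration; only the qualitative effect matches the figure:

```python
# Sorting by column A tightens A's rough values but can blow up B's;
# a more balanced granulation keeps both columns' spreads moderate.

def pack_widths(rows, col, pack_size=2):
    """Sum of min-max spreads of `col` over consecutive packs."""
    packs = [rows[i:i + pack_size] for i in range(0, len(rows), pack_size)]
    return sum(max(r[col] for r in p) - min(r[col] for r in p)
               for p in packs)

rows = [(1, 9), (2, 1), (3, 8), (4, 2)]

by_a = sorted(rows)                           # sorted on column A
balanced = [(1, 9), (3, 8), (2, 1), (4, 2)]   # grouped by similar B
```

Here sorting on A gives spreads of 2 on A but 14 on B, while the balanced layout gives 4 on A and 2 on B: a small loss on one column buys a large gain on the other, which is exactly the trade-off the figure illustrates.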