https://sites.google.com/site/journalofcomputing/
www.journalofcomputing.org
Lossless Compression and Aggregation Analysis
for Numerical and
Categorical Data in Data Cubes
K. Bhaskar Naik, M. Supriya, Ch. Prathima, B. Ramakantha Reddy
Abstract— Logistic regression is an important technique for analyzing and predicting data with categorical attributes, and linear regression models are important techniques for analyzing and predicting data with numerical attributes. We propose a novel scheme to compress the data in such a way that we can construct regression models to answer any OLAP query without accessing the raw data. Through these regression models we develop new compressible measures for compressing and aggregating both numerical and categorical data effectively. Based on a first-order approximation to the maximum likelihood estimating equations, and on ordinary least squares, we develop a compression scheme that compresses each base cell into a small compressed data block carrying the essential information to support the aggregation models. Aggregation formulas for deriving high-level regression models from lower-level component cells are given. We prove that the compression is lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed compression and aggregation scheme makes OLAP of regression feasible in a data cube. Further, it supports real-time analysis of stream data, which can only be scanned once and cannot be permanently retained.
Index Terms— Data cubes, Aggregation, Compression, OLAP, Linear Regression, Logistic Regression.
1 INTRODUCTION
The fast development of OLAP technology has led to high demand for more sophisticated data analyzing capabilities, such as prediction, trend monitoring, and exception detection of multidimensional data. Oftentimes, existing simple measures such as sum() and average() become insufficient, and more sophisticated statistical models, such as regression analysis, are desired to be supported in OLAP. Moreover, there are many applications with dynamically changing stream data generated continuously in a dynamic environment, with huge volume, infinite flow, and fast-changing behavior. When collected, such data is almost always at a rather low level, consisting of various kinds of detailed temporal and other features. To find interesting or unusual patterns, it is essential to perform regression analysis at certain meaningful abstraction levels, discover critical changes in the data, and drill down to more detailed levels for in-depth analysis when needed.
In this paper, we propose two regression analysis models, logistic regression and nonlinear regression, in data cubes to handle categorical and numerical data, respectively. Logistic regression is an important statistical method for modeling and predicting categorical data. When we conduct logistic regression analysis in real-world data mining applications, we often encounter the difficulty of not having the complete set of data in advance. It is often necessary to recover logistic regression models of a large data set with access not to the raw data but only to sketchy information about divided chunks of the data set. Nonlinear regression analysis is likewise an important statistical method for effectively modeling and predicting numerical data, which may be discrete or continuous.

Example 1: Suppose a nationwide bank wants to study the likelihood of customers applying for a new credit card. Suppose that for each day, for each regional branch of the bank, there is a data set containing (y1, x11, x12), (y2, x21, x22), ..., (yn, xn1, xn2), where n is the number of customers, xn1 represents the age of the nth customer, xn2 represents the account balance of the nth customer, and yn is a binary indicator of whether the customer applied for the new credit card (0 for no and 1 for yes). To model the relationship between credit card application and user information, the bank manager can assume that the probability p of a customer applying for the new credit card depends on the customer age x1 and account balance x2 as follows:

logit(p) = log(p/(1 − p)) = β0 + β1x1 + β2x2 (1)

The above model (1) is called a logistic regression model. After the logit transformation, logit(p) ranges over the entire real line, which makes it reasonable to be
• K. Bhaskar Naik, M.Tech., Assistant Professor, Dept. of CSE, Sree Vidyanikethan Engg. College, Tirupati.
• M. Supriya, M.Tech., Teaching Assistant, Dept. of CSE, Sree Vidyanikethan Engg. College, Tirupati.
• Ch. Prathima, M.Tech., Assistant Professor, Dept. of CSE, Sree Vidyanikethan Engg. College, Tirupati.
• B. Ramakantha Reddy, M.Tech., Teaching Assistant, Dept. of CSE, Sree Vidyanikethan Engg. College.
© 2011 Journal of Computing Press, NY, USA, ISSN 21519617
http://sites.google.com/site/journalofcomputing/
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 5, MAY 2011, ISSN 2151-9617
modeled as a linear function of x = (x1, x2)^T. The regression coefficients β = (β0, β1, β2)^T are often estimated using maximum likelihood.
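The logit transform in (1) can be sketched in a few lines. The coefficient values below are purely illustrative assumptions (they do not come from the paper); the point is that logit maps (0, 1) onto the whole real line, so its right-hand side can be any linear function.

```python
import numpy as np

def logit(p):
    """Logit transform from Eq. (1): maps a probability in (0, 1)
    onto the entire real line."""
    return np.log(p / (1.0 - p))

def inverse_logit(z):
    """Inverse transform: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for Example 1 (illustrative only):
beta0, beta1, beta2 = -4.0, 0.05, 0.001

def p_apply(age, balance):
    """Probability that a customer applies for the new card under
    the logistic model logit(p) = beta0 + beta1*age + beta2*balance."""
    return inverse_logit(beta0 + beta1 * age + beta2 * balance)

# logit and inverse_logit are mutual inverses on (0, 1):
p = 0.3
assert abs(inverse_logit(logit(p)) - p) < 1e-12
```

Whatever the coefficient values, p_apply always returns a valid probability, which is exactly why the logit link makes the linear formulation reasonable.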
A key issue for the aggregation operation is: can we generate the high-level models without accessing the raw data? Since computing the regression model requires solving a nonlinear numerical optimization problem, solving such a problem from scratch over a large aggregated data set for each roll-up operation is computationally very expensive.
It is far more desirable to derive high-level regression models from low-level model parameters without accessing the raw data. In this paper, we propose a compression scheme and its associated theory to support high-quality aggregation of regression models in a multidimensional data space. In the proposed approach, we compress each data segment by retaining only the model parameters and a small amount of auxiliary measures. We then develop an aggregation formula that allows us to reconstruct the regression models from partitioned segments with a small approximation error. The error is theoretically bounded and asymptotically convergent to zero as the sample size grows.
We propose the concept of regression cubes and two data cell compression techniques, LCRN (Lossless Compression Representation for Numerical data) and LCRC (Lossless Compression Representation for Categorical data), to support efficient OLAP operations in regression cubes. LCRC is a compression technique designed for categorical data: it uses logistic regression analysis for compressing and aggregating data cubes that contain categorical data. LCRN is a compression technique for numerical data: it uses nonlinear regression analysis for compressing and aggregating data cubes that contain numerical data. Nonlinear regression models are effectively used for modeling and predicting numerical data.
2 LOSSLESS COMPRESSION AND AGGREGATION
We propose lossless compression techniques to support efficient computation of the OLS estimates for linear regression and the MLEs for logistic regression models in data cubes. We elaborate the notion of lossless as follows:

Definition 1. In data cube analysis, a cell function g is a function that takes the data records of any cell of arbitrary size as input and maps them into a fixed-length vector as output. That is,

g(c) = v, for any data cell c (2)
where the output vector v has a fixed size. Suppose that we have a regression model y = f(x, θ), where y and x are attributes and θ is the coefficient vector. Suppose ca is a cell aggregated from the component cells c1, ..., ck. We define a cell function g2 to obtain mi = g2(ci), i = 1, ..., k, and use an aggregation function g1 to obtain an estimate of the regression coefficients for ca by

θ̃ = g1(m1, ..., mk) (3)

The dimension of mi is independent of the number of tuples in ci. We call mi a lossless compression representation (LCR) of the cell ci, i = 1, ..., k. We show that the difference between the estimates obtained from aggregating the linearized equations in component cells and the OLS and MLE in the aggregated cell approaches zero when the number of tuples in the component cells is sufficiently large. Further, the space complexity of an LCR is independent of the number of tuples.
3 LCRN : LCR FOR NUMERICAL DATA
3.1 Theory of GMLR
We now briefly review the theory of GMLR. Suppose we have n tuples in a cell: (xi, yi), for i = 1, ..., n, where xi^T = (xi1, xi2, ..., xip) contains the p regression dimensions of the ith tuple, and each scalar yi is the measure of the ith tuple. To apply multiple linear regression, from each xi we compute a vector of k terms ui:

ui = (u0, u1(xi), ..., u_{k−1}(xi))^T = (u_{i,0}, u_{i,1}, ..., u_{i,k−1})^T (4)

The first element of ui is u0 = 1 for fitting an intercept, and the remaining k − 1 terms uj(xi), j = 1, ..., k − 1, are derived from the regression attributes xi and are often written as u_{i,j} for simplicity. uj(xi) can be any kind of function of xi: it could be as simple as a constant, or it could be a complex nonlinear function of xi.

The nonlinear regression function is defined as follows:

E(yi) = η0 u_{i,0} + η1 u_{i,1} + ... + η_{k−1} u_{i,k−1} (5)

where η = (η0, η1, ..., η_{k−1})^T is a k×1 vector, y = (y1, y2, ..., yn)^T, and u_{i,j} = uj(xi). We can now write the regression function as E(y|U) = Uη.

Definition 2. The OLS estimate of η is the argument that minimizes the residual sum of squares function RSS(η) = (y − Uη)^T (y − Uη). If the inverse of (U^T U) exists, the OLS estimates of the regression parameters are unique and are given by:

η̂ = (U^T U)^{−1} U^T y (6)

We only consider the case where the inverse of (U^T U) exists. If the inverse of (U^T U) does not exist, then the matrix (U^T U) is of less than full rank and we can always use a subset of the u terms in fitting the model, so that the reduced model matrix has full rank.

The memory size of U is nk, and the size of U^T U is k², where n is the number of tuples of a cell and k is the number of regression terms, which is usually a very small
constant independent of the number of tuples. For example, k is two for linear regression and three for quadratic regression. Therefore, the overall space complexity is O(n), i.e., linear in n.
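A minimal numerical sketch of the OLS estimate in Eq. (6), for the quadratic case k = 3 with regression terms u = (1, x, x²). The synthetic data and true coefficients below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cell: n tuples with one regression dimension x and measure y.
n = 500
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.1, size=n)

# Quadratic regression: k = 3 terms u = (1, x, x^2), as in Section 3.1.
U = np.column_stack([np.ones(n), x, x**2])

# OLS estimate from Eq. (6): eta = (U^T U)^{-1} U^T y,
# computed by solving the normal equations rather than inverting.
eta = np.linalg.solve(U.T @ U, U.T @ y)

print(eta)  # close to the true coefficients (1.0, 2.0, -0.5)
```

Note that only the k×k matrix U^T U and the k-vector U^T y enter the estimate, which is what makes the compressed representation of the next subsection possible.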
3.2 LCR for Numerical data of a Data Cell
We propose a compressed representation of data cells
to support multidimensional GMLR analysis. The com
pressed information for the materialized cells will be suf
ficient for deriving regression models of all other cells.
Definition 3. For multidimensional online regression analysis in data cubes, the Lossless Compression Representation for Numerical data (LCRN) of a data cell c is defined as the following set:

LCRN(c) = { η̂i : i = 0, ..., k−1 } ∪ { θij : i, j = 0, ..., k−1, i ≤ j } (7)

where

θij = Σ_{h=1}^{n} u_{hi} u_{hj} (8)

It is useful to write an LCRN in the form of matrices. In fact, the elements of an LCRN can be arranged into two matrices, η̂ and Θ, where

η̂ = (η̂0, η̂1, ..., η̂_{k−1})^T and Θ = [θij], i, j = 0, ..., k−1 (i.e., Θ = U^T U).

We can write an LCRN in matrix form as LCRN = (η̂, Θ).
For aggregation of numerical data, suppose LCRN(c1) = (η̂1, Θ1), LCRN(c2) = (η̂2, Θ2), ..., LCRN(cm) = (η̂m, Θm) are the LCRNs of the m component cells, respectively, and suppose LCRN(ca) = (η̂a, Θa) is the LCRN of the aggregated cell. Then LCRN(ca) can be derived from those of the component cells using the following equations:

a. η̂a = (Σ_{i=1}^{m} Θi)^{−1} Σ_{i=1}^{m} Θi η̂i, and

b. Θa = Σ_{i=1}^{m} Θi
Note that, since θij = θji and thus Θ^T = Θ, we only need to store the upper triangle of Θ in an LCRN. Therefore, the size of an LCRN is S(k) = (k² + 3k)/2. The following property of LCRNs indicates that this representation is economical in space and scalable for large data cubes: the size S(k) of the LCRN of a data cell is quadratic in k, the number of regression terms, and is independent of n, the number of tuples in the data cell.
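The aggregation formulas (a) and (b) above can be checked numerically. Because Θi η̂i = Ui^T Ui (Ui^T Ui)^{−1} Ui^T yi = Ui^T yi, the aggregated estimate coincides (up to rounding) with the OLS fit on the pooled raw data, which is the losslessness claim for numerical data. The synthetic cells below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def lcrn(U, y):
    """Compress a cell into its LCRN = (eta_hat, Theta), Theta = U^T U."""
    Theta = U.T @ U
    eta_hat = np.linalg.solve(Theta, U.T @ y)
    return eta_hat, Theta

# Three component cells sharing the same k = 2 regression terms (1, x).
cells = []
for _ in range(3):
    n = int(rng.integers(50, 200))
    x = rng.uniform(0, 10, size=n)
    U = np.column_stack([np.ones(n), x])
    y = 3.0 - 1.5 * x + rng.normal(0, 1, size=n)
    cells.append((U, y))

lcrns = [lcrn(U, y) for U, y in cells]

# Aggregation formulas (a) and (b): no raw data needed.
Theta_a = sum(Theta for _, Theta in lcrns)
eta_a = np.linalg.solve(Theta_a, sum(Theta @ e for e, Theta in lcrns))

# Compare against OLS on the pooled raw data: identical up to rounding.
U_all = np.vstack([U for U, _ in cells])
y_all = np.concatenate([y for _, y in cells])
eta_full, _ = lcrn(U_all, y_all)
assert np.allclose(eta_a, eta_full)
```

Each cell is summarized by k + k(k+1)/2 numbers regardless of its tuple count, matching S(k) = (k² + 3k)/2.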
4 LCRC: LCR FOR CATEGORICAL DATA
In this section, we review the theory of logistic regression and propose our lossless compression technique to support the construction of regression cubes over categorical data.
4.1 Logistic Regression Model
Suppose we have n independent observations (y1, x1), ..., (yn, xn), where yi is a binary variable assumed to have a Bernoulli distribution with parameter pi = P(yi = 1), and xi ∈ ℝ^d are some explanatory variables. An intercept term can easily be included by setting the first element of xi to be 1. Logistic regression models are widely used to model binary responses using the following formulation:

log(pi/(1 − pi)) = β^T xi (9)
where β ∈ ℝ^d is a vector of unknown regression coefficients, often estimated using maximum likelihood. The maximum likelihood estimates (MLEs) are chosen so as to maximize the following likelihood function:

L(β) = Π_{i=1}^{n} pi^{yi} (1 − pi)^{1−yi} (10)

According to (9), we have

pi = exp(β^T xi) / (1 + exp(β^T xi)) (11)

Writing π(t) = exp(t)/(1 + exp(t)), the likelihood function can thus be written as

L(β) = Π_{i=1}^{n} π(β^T xi)^{yi} (1 − π(β^T xi))^{1−yi} (12)

Maximizing L(β) is equivalent to maximizing the log-likelihood function

l(β) = Σ_{i=1}^{n} [yi β^T xi − log(1 + exp(β^T xi))] (13)

The parameter that maximizes l(β) is the solution to the following likelihood equation:

l′(β) = ∂l(β)/∂β = Σ_{i=1}^{n} xi [yi − π(β^T xi)] = 0 (14)
The likelihood equation (14) is usually solved by the Newton-Raphson method, and the MLE has been proved to be strongly consistent under certain regularity conditions: it converges to the true parameter with probability one as n goes to infinity.
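A minimal sketch of solving the likelihood equation (14) by Newton-Raphson. The gradient is Σ xi (yi − π(β^T xi)) and the negative Hessian is Σ π′(β^T xi) xi xi^T with π′(t) = π(t)(1 − π(t)). The synthetic data and the true β are illustrative assumptions.

```python
import numpy as np

def pi(t):
    """Logistic function pi(t) = exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + np.exp(-t))

def logistic_mle(X, y, iters=25):
    """Solve the likelihood equation (14) by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = pi(X @ beta)
        grad = X.T @ (y - p)             # l'(beta), Eq. (14)
        W = p * (1.0 - p)                # pi'(beta^T x_i)
        H = (X * W[:, None]).T @ X       # negative Hessian of l(beta)
        beta = beta + np.linalg.solve(H, grad)
    return beta

# Synthetic binary data generated from a known beta (illustrative only).
rng = np.random.default_rng(2)
n, true_beta = 5000, np.array([-1.0, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < pi(X @ true_beta)).astype(float)

beta_hat = logistic_mle(X, y)
print(beta_hat)  # close to (-1.0, 2.0)
```

Since l(β) is concave, Newton-Raphson converges rapidly here; at the solution the gradient in (14) is numerically zero.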
4.2 Compression and Aggregation Scheme
Denote the observations in the kth component cell ck by {(y_{k1}, x_{k1}), ..., (y_{kn_k}, x_{kn_k})}, where each y_{kj} is a binary categorical attribute and x_{kj} is a d-dimensional vector of explanatory variables. For the logistic regression model in (9), we denote by β̂k the MLE of β based on the data cell ck. Therefore, β̂k is the solution to the likelihood equation

l′k(β) = Σ_{j=1}^{n_k} x_{kj} [y_{kj} − π(β^T x_{kj})] = 0

Its Taylor expansion at β̂k is given by

l′k(β) = −Σ_{j=1}^{n_k} [π′(β̂k^T x_{kj}) x_{kj}^T (β − β̂k)] x_{kj} − Σ_{j=1}^{n_k} [½ π″(β̃^T x_{kj}) ((β − β̂k)^T x_{kj})²] x_{kj} (15)
where the first term uses the fact that l′k(β̂k) = 0, and β̃ is some vector between β and β̂k, by Taylor's theorem.

Let Fk(β) be the first-order approximation to the likelihood equation; it follows from (15) that

Fk(β) = −Σ_{j=1}^{n_k} π′(β̂k^T x_{kj}) x_{kj} x_{kj}^T (β − β̂k) = −Ak β + Ak β̂k

where

Ak = Σ_{j=1}^{n_k} π′(β̂k^T x_{kj}) x_{kj} x_{kj}^T = Σ_{j=1}^{n_k} [exp(β̂k^T x_{kj}) / (1 + exp(β̂k^T x_{kj}))²] x_{kj} x_{kj}^T

It is clear that Fk(β) depends only on Ak and β̂k, whose dimensions are independent of the number of data records in ck. We therefore solve the linearized equation Σk Fk(β) = 0 in the aggregated cell instead of Σk l′k(β) = 0, by saving (β̂k, Ak) in each cell ck. Using this compression scheme, we can approximate the MLE of ca by the solution to

Fa(β) = Σ_{k=1}^{K} Fk(β) = Σ_{k=1}^{K} (−Ak β + Ak β̂k) = 0 (16)

which leads to

β̃a = (Σ_{k=1}^{K} Ak)^{−1} Σ_{k=1}^{K} Ak β̂k (17)

In addition, in the aggregated cell ca, we can also obtain its Aa from the Ak of the component cells by observing

Aa = Σ_{k=1}^{K} Ak
To summarize, our asymptotically lossless compression technique is as follows. Compression of LCRC:

LCRC(ck) = (β̂k, Ak) (18)

in each component cell ck. For aggregation, the aggregated LCRC (β̃a, Aa) is calculated using (16) and (17). Such a process can be used to aggregate base cells at the lowest level as well as cells at intermediate levels; for any non-base cell, β̃ is used in place of β̂k in its LCRC.
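The compression (18) and aggregation (16)–(17) steps above can be sketched end to end: fit β̂k per cell, store (β̂k, Ak), aggregate without raw data, and compare with the exact MLE on the pooled data. The cell data and true β are illustrative assumptions; the small gap between the two estimates is the asymptotically vanishing approximation error.

```python
import numpy as np

def pi(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_mle(X, y, iters=25):
    """Per-cell MLE beta_hat_k via Newton-Raphson (Section 4.1)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = pi(X @ beta)
        W = p * (1.0 - p)
        beta = beta + np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return beta

def lcrc(X, y):
    """Compress a cell into its LCRC = (beta_hat_k, A_k), Eq. (18)."""
    b = logistic_mle(X, y)
    p = pi(X @ b)
    w = p * (1.0 - p)                    # pi'(beta_hat_k^T x_kj)
    A = (X * w[:, None]).T @ X           # A_k
    return b, A

rng = np.random.default_rng(3)
true_beta = np.array([-0.5, 1.5])

cells = []
for _ in range(4):
    n = 2000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = (rng.uniform(size=n) < pi(X @ true_beta)).astype(float)
    cells.append((X, y))

lcrcs = [lcrc(X, y) for X, y in cells]

# Aggregation via Eq. (17): beta_a = (sum A_k)^{-1} sum A_k beta_hat_k.
A_a = sum(A for _, A in lcrcs)
beta_a = np.linalg.solve(A_a, sum(A @ b for b, A in lcrcs))

# Exact MLE on the pooled raw data, for comparison.
X_all = np.vstack([X for X, _ in cells])
y_all = np.concatenate([y for _, y in cells])
beta_full = logistic_mle(X_all, y_all)
print(beta_a, beta_full)
```

Unlike the numerical case, the agreement here is approximate rather than exact, but the error is second-order in the spread of the per-cell estimates and shrinks as the cells grow.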
5 CONCLUSION
In this paper, we have developed a compression scheme that compresses a data cell into a compressed representation whose size is independent of the size of the cell. We have developed LCRN and LCRC, lossless compression techniques for the aggregation of linear and nonlinear regression parameters and of logistic regression parameters in data cubes, respectively, so that only a small number of data values (numerical as well as categorical), instead of the complete raw data, need to be registered for multidimensional regression analysis. Lossless aggregation formulas are derived based on the compressed LCRN and LCRC representations. The aggregation is efficient in terms of time and space complexity. The proposed technique allows us to quickly perform OLAP operations and generate regression models at any level in a data cube without retrieving or storing the raw data. We are currently extending the technique to more general situations, such as quasi-likelihood estimation for generalized statistical models.
K. Bhaskar Naik, M.Tech., is an Assistant Professor in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. He received B.Tech and M.Tech degrees in Computer Science from JNTU Hyderabad and JNTU Anantapur and is pursuing a Ph.D. His research interests are Data Mining, Knowledge Engineering, Image Processing and Pattern Recognition.

M. Supriya, M.Tech., is a Teaching Assistant in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. She received an M.Tech degree in Computer Science. Her academic interests are Data Mining and Computer Networks.

Ch. Prathima, M.Tech., is an Assistant Professor in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. She received an M.Tech degree in Computer Science. Her academic interests are Computer Networks and Cloud Computing.

B. Ramakantha Reddy, M.Tech., is a Teaching Assistant in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. He received an M.Tech degree in Computer Science. His academic interests are Data Mining and Computer Networks.