
Lossless Compression and Aggregation Analysis
for Numerical and
Categorical Data in Data Cubes
K. Bhaskar Naik, M. Supriya, Ch. Prathima, B. Ramakantha Reddy
Abstract— Logistic regression is an important technique for analyzing and predicting data with categorical attributes, and linear regression models are important techniques for analyzing and predicting data with numerical attributes. We propose a novel scheme that compresses the data in such a way that regression models can be constructed to answer any OLAP query without accessing the raw data. Through these regression models we develop new compressible measures for effectively compressing and aggregating both numerical and categorical data. Based on a first-order approximation to the maximum likelihood estimating equations and on ordinary least squares, we develop a compression scheme that compresses each base cell into a small compressed data block carrying the essential information to support the aggregation models. Aggregation formulas for deriving high-level regression models from lower-level component cells are given. We prove that the compression is lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed compression and aggregation scheme makes OLAP of regression models feasible in a data cube. Further, it supports real-time analysis of stream data, which can only be scanned once and cannot be permanently retained.
Index Terms— Data cubes, Aggregation, Compression, OLAP, Linear Regression, Logistic Regression.
——————————

——————————
1 INTRODUCTION
The fast development of OLAP technology has led to high demand for more sophisticated data analysis capabilities, such as prediction, trend monitoring, and exception detection of multidimensional data. Oftentimes, simple existing measures such as sum() and average() are insufficient, and more sophisticated statistical models, such as regression analysis, need to be supported in OLAP. Moreover, many applications involve dynamically changing stream data generated continuously in a dynamic environment, with huge volume, infinite flow, and fast-changing behavior. When collected, such data is almost always at a rather low level, consisting of various kinds of detailed temporal and other features. To find interesting or unusual patterns, it is essential to perform regression analysis at meaningful abstraction levels, discover critical changes in the data, and drill down to more detailed levels for in-depth analysis when needed.

In this paper, we propose two regression analysis models, logistic regression and nonlinear regression, for handling categorical and numerical data in data cubes, respectively. Logistic regression is an important statistical method for modeling and predicting categorical data. When we conduct logistic regression analysis in real-world data mining applications, we often encounter the difficulty of not having the complete data set in advance. It is often necessary to recover the logistic regression model of a large data set with access not to the raw data but only to sketchy information about divided chunks of the data set. Nonlinear regression analysis is likewise an important statistical method for effectively modeling and predicting numerical data, which may be discrete or continuous.

Example 1: Suppose a nationwide bank wants to study the likelihood of customers applying for a new credit card. Suppose that for each day, for each regional branch of the bank, there is a data set containing (y_1, x_11, x_12), (y_2, x_21, x_22), …, (y_n, x_n1, x_n2), where n is the number of customers, x_n1 represents the age of the nth customer, x_n2 represents the account balance of the nth customer, and y_n is a binary indicator of whether the customer applied for the new credit card (0 for no and 1 for yes). To model the relationship between credit card application and user information, the bank manager can assume that the probability p of a customer applying for the new credit card depends on the customer age x1 and account balance x2 as follows:

logit(p) = log(p / (1 − p)) = β0 + β1x1 + β2x2. (1)

Model (1) is called a logistic regression model. After the logit transformation, logit(p) ranges over the entire real line, which makes it reasonable to model it as a linear function of x = (x1, x2)^T. The regression coefficients β = (β0, β1, β2)^T are often estimated using maximum likelihood.
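To make Example 1 concrete, the following sketch evaluates the model in (1) for one customer. It is a minimal illustration in Python; the coefficient values are made up for demonstration, since in practice they would be estimated from the branch data by maximum likelihood.

    import numpy as np

    # Hypothetical coefficients (beta0, beta1 for age, beta2 for balance);
    # real values would come from maximum likelihood estimation.
    beta = np.array([-4.0, 0.03, 0.0005])

    def application_probability(age, balance):
        # Invert the logit in (1): p = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2)))
        eta = beta[0] + beta[1] * age + beta[2] * balance
        return 1.0 / (1.0 + np.exp(-eta))

    print(application_probability(45.0, 3000.0))  # about 0.24 for these made-up betas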
————————————————
- K. Bhaskar Naik, M.Tech., Assistant Professor, Dept. of CSE, Sree Vidyanikethan Engg. College, Tirupati.
- M. Supriya, M.Tech., Teaching Assistant, Dept. of CSE, Sree Vidyanikethan Engg. College, Tirupati.
- Ch. Prathima, M.Tech., Assistant Professor, Dept. of CSE, Sree Vidyanikethan Engg. College, Tirupati.
- B. Ramakantha Reddy, M.Tech., Teaching Assistant, Dept. of CSE, Sree Vidyanikethan Engg. College.





A key issue for the aggregation operation is: can we generate the high-level models without accessing the raw data? Since computing a regression model requires solving a nonlinear numerical optimization problem, solving such a problem from scratch over a large aggregated data set for each roll-up operation is computationally very expensive.

It is far more desirable to derive high-level regression models from low-level model parameters without accessing the raw data. In this paper, we propose a compression scheme and its associated theory to support high-quality aggregation of regression models in a multidimensional data space. In the proposed approach, we compress each data segment by retaining only the model parameters and a small number of auxiliary measures. We then develop an aggregation formula that allows us to reconstruct the regression models from partitioned segments with a small approximation error. The error is theoretically bounded and asymptotically convergent to zero as the sample size grows.

We propose the concept of regression cubes and two data cell compression techniques, LCRN (Lossless Compression Representation for Numerical data) and LCRC (Lossless Compression Representation for Categorical data), to support efficient OLAP operations in regression cubes. LCRC is a compression technique for categorical data: it uses logistic regression analysis for compressing and aggregating data cubes that contain categorical data. LCRN is a compression technique for numerical data: it uses nonlinear regression analysis for compressing and aggregating data cubes that contain numerical data, since nonlinear regression models are effective for modeling and predicting numerical data.
2 LOSSLESS COMPRESSION AND AGGREGATION
We propose lossless compression techniques to support efficient computation of the ordinary least squares (OLS) estimates for linear regression and the maximum likelihood estimates (MLEs) for logistic regression models in data cubes. We elaborate the notion of losslessness as follows:

Definition 1. In data cube analysis, a cell function g is a function that takes the data records of any cell of arbitrary size as input and maps them to a fixed-length vector as output. That is,

g(c) = v, for any data cell c, (2)

where the output vector v has a fixed size. Suppose that we have a regression model y = f(x, θ), where y and x are attributes and θ is the coefficient vector. Suppose c_a is a cell aggregated from the component cells c_1, …, c_k. We define a cell function g_2 to obtain m_i = g_2(c_i), i = 1, …, k, and use an aggregation function g_1 to obtain an estimate of the regression coefficients of c_a by

θ̃ = g_1(m_1, …, m_k). (3)

The dimension of m_i is independent of the number of tuples in c_i. We call m_i a lossless compression representation (LCR) of the cell c_i, i = 1, …, k. We show that the difference between the estimates obtained by aggregating the linearized equations of the component cells and the OLS and MLE of the aggregated cell approaches zero when the number of tuples in the component cells is sufficiently large. Further, the space complexity of an LCR is independent of the number of tuples.
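Before turning to regression, note that Definition 1 is easiest to see for a simple measure: average() admits an exact, fixed-size LCR, namely the pair (sum, count). The following toy sketch (our own illustration, in Python, not from the paper) shows a cell function g2 and an aggregation function g1 for this measure; the regression schemes developed below follow the same pattern with richer m_i.

    def g2(cell):
        # Cell function: compress a cell's records into a fixed-length vector m.
        return (sum(cell), len(cell))  # size 2, independent of the cell size

    def g1(*ms):
        # Aggregation function: recover the aggregated average from the m_i alone.
        total = sum(m[0] for m in ms)
        count = sum(m[1] for m in ms)
        return total / count

    c1, c2 = [1.0, 2.0, 3.0], [10.0, 20.0]
    # Lossless: matches the answer computed from the raw, combined data.
    assert g1(g2(c1), g2(c2)) == sum(c1 + c2) / len(c1 + c2)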
3 LCRN : LCR FOR NUMERICAL DATA
3.1 Theory of GMLR

We now briefly review the theory of GMLR. Suppose we have n tuples in a cell: (x_i, y_i), for i = 1, …, n, where x_i^T = (x_{i1}, x_{i2}, …, x_{ip}) holds the p regression dimensions of the ith tuple, and each scalar y_i is the measure of the ith tuple. To apply multiple linear regression, from each x_i we compute a vector of k terms u_i:

u_i = (u_{i,0}, u_{i,1}, …, u_{i,k−1})^T = (1, u_1(x_i), …, u_{k−1}(x_i))^T. (4)

The first element of u_i is u_{i,0} = 1 for fitting an intercept, and the remaining k − 1 terms u_j(x_i), j = 1, …, k − 1, are derived from the regression attributes x_i and are often written as u_{i,j} for simplicity. u_j(x_i) can be any kind of function of x_i: it could be as simple as a constant, or it could be a complex nonlinear function of x_i.

The nonlinear regression function is defined as follows:

E(y_i | u_i) = η_0 u_{i,0} + η_1 u_{i,1} + ⋯ + η_{k−1} u_{i,k−1}, (5)

where η = (η_0, η_1, …, η_{k−1})^T is a k×1 vector of regression parameters, y = (y_1, y_2, …, y_n)^T, and u_{i,j} = u_j(x_i). We can now write the regression function as E(y | U) = Uη, where U is the n×k model matrix whose ith row is u_i^T.

Definition 2. The OLS estimate of η is the argument that minimizes the residual sum of squares function RSS(η) = (y − Uη)^T (y − Uη). If the inverse of U^T U exists, the OLS estimates of the regression parameters are unique and are given by:

η̂ = (U^T U)^{−1} U^T y. (6)

We only consider the case where the inverse of U^T U exists. If it does not, then U^T U is of less than full rank and we can always use a subset of the u terms in fitting the model, so that the reduced model matrix has full rank.
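As a numerical illustration of (6), the sketch below (Python with numpy, our own example) builds a small model matrix U and computes the OLS estimate. We solve the normal equations with numpy.linalg.solve rather than forming the inverse explicitly, which is equivalent here but numerically safer.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 1000, 3                            # n tuples, k regression terms
    U = np.column_stack([np.ones(n),          # u_{i,0} = 1 fits the intercept
                         rng.normal(size=(n, k - 1))])
    eta_true = np.array([1.0, -2.0, 0.5])
    y = U @ eta_true + rng.normal(scale=0.1, size=n)

    # OLS estimate from (6): eta_hat = (U^T U)^{-1} U^T y
    eta_hat = np.linalg.solve(U.T @ U, U.T @ y)
    print(eta_hat)                            # close to eta_true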

The memory size of U is nk and the size of U^T U is k², where n is the number of tuples in a cell and k is the number of regression terms, which is usually a very small constant independent of the number of tuples.
For example, k is two for linear regression and three for quadratic regression. Therefore, the overall space complexity is O(n), i.e., linear in n.

3.2 LCR for Numerical data of a Data Cell
We propose a compressed representation of data cells
to support multidimensional GMLR analysis. The com-
pressed information for the materialized cells will be suf-
ficient for deriving regression models of all other cells.

Definition 3. For multidimensional online regression analysis in data cubes, the Lossless Compression Representation for Numerical data (LCRN) of a data cell c is defined as the following set:

LCRN(c) = {η̂_i | i = 0, …, k−1} ∪ {θ_{ij} | i, j = 0, …, k−1, i ≤ j}, (7)

where

θ_{ij} = Σ_{h=1}^n u_{hi} u_{hj}. (8)

It is useful to write an LCRN in the form of matrices. In fact, the elements of an LCRN can be arranged into two matrices, η̂ and Θ, where η̂ = (η̂_0, η̂_1, …, η̂_{k−1})^T and Θ is the k×k matrix whose (i, j) entry is θ_{ij}:

Θ = [ θ_{00}     θ_{01}     …  θ_{0,k−1}
      θ_{10}     θ_{11}     …  θ_{1,k−1}
      …          …          …  …
      θ_{k−1,0}  θ_{k−1,1}  …  θ_{k−1,k−1} ].

We can write an LCRN in matrix form as LCRN = (η̂, Θ).
For aggregation of numerical data, suppose LCRN(c_1) = (η̂_1, Θ_1), LCRN(c_2) = (η̂_2, Θ_2), …, LCRN(c_m) = (η̂_m, Θ_m) are the LCRNs of the m component cells, and suppose LCRN(c_a) = (η̂_a, Θ_a) is the LCRN of the aggregated cell. Then LCRN(c_a) can be derived from those of the component cells using the following equations:

a. η̂_a = (Σ_{i=1}^m Θ_i)^{−1} Σ_{i=1}^m Θ_i η̂_i, and

b. Θ_a = Σ_{i=1}^m Θ_i.

Note that since θ_{ij} = θ_{ji} and thus Θ^T = Θ, we only need to store the upper triangle of Θ in an LCRN. Therefore, the size of an LCRN is S(k) = (k² + 3k)/2. The following property of LCRNs indicates that this representation is economical in space and scalable for large data cubes: the size S(k) of the LCRN of a data cell is quadratic in k, the number of regression terms, and is independent of n, the number of tuples in the data cell.
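The following sketch (our own numpy illustration, not code from the paper) implements LCRN compression per Definition 3 and the aggregation formulas a. and b. For linear models the aggregation is exact, because Θ_i η̂_i = U_i^T y_i, so summing the component LCRNs reproduces the normal equations of the combined data; the final assertion checks this.

    import numpy as np

    def lcrn(U, y):
        # Compress a cell (U: n x k model matrix, y: n measures) into (eta_hat, Theta).
        Theta = U.T @ U                            # theta_ij = sum_h u_hi u_hj, as in (8)
        eta_hat = np.linalg.solve(Theta, U.T @ y)  # OLS estimate (6)
        return eta_hat, Theta

    def aggregate(lcrns):
        # Formulas a. and b.: derive the aggregated LCRN from component LCRNs.
        Theta_a = sum(Th for _, Th in lcrns)          # b.
        rhs = sum(Th @ eta for eta, Th in lcrns)      # Theta_i @ eta_hat_i = U_i^T y_i
        return np.linalg.solve(Theta_a, rhs), Theta_a # a.

    rng = np.random.default_rng(1)
    k, cells = 3, []
    for n in (50, 80, 120):                           # three component cells
        U = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
        y = U @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
        cells.append((U, y))

    eta_a, _ = aggregate([lcrn(U, y) for U, y in cells])
    U_all = np.vstack([U for U, _ in cells])
    y_all = np.concatenate([y for _, y in cells])
    assert np.allclose(eta_a, np.linalg.solve(U_all.T @ U_all, U_all.T @ y_all))

Note that each cell stores only S(k) numbers; for k = 3 that is S(3) = (9 + 9)/2 = 9 values, whether the cell holds 50 tuples or 50 million.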
4 LCRC : LCR FOR CATEGORICAL DATA
In this section, we review the theory of logistic regres-
sion and propose our lossless compression technique to
support the construction of regression cubes having cate-
gorical data.
4.1 Logistic Regression Model
Suppose we have n independent observations (y_1, x_1), …, (y_n, x_n), where y_i is a binary variable assumed to have a Bernoulli distribution with parameter p_i = P(y_i = 1), and x_i ∈ ℝ^d are explanatory variables. An intercept term can easily be included by setting the first element of x_i to 1. Logistic regression models are widely used to model binary responses using the following formulation:

log(p_i / (1 − p_i)) = x_i^T β, (9)

where β ∈ ℝ^d is a vector of unknown regression coefficients, often estimated using maximum likelihood. The maximum likelihood estimates (MLEs) are chosen so as to maximize the following likelihood function:

L(β) = Π_{i=1}^n p_i^{y_i} (1 − p_i)^{1−y_i}. (10)

According to (9), we have

p_i = ψ(x_i^T β) = exp(x_i^T β) / (1 + exp(x_i^T β)). (11)

Thus, the likelihood function can be written as

L(β) = Π_{i=1}^n ψ(x_i^T β)^{y_i} (1 − ψ(x_i^T β))^{1−y_i}. (12)

Maximizing L(β) is equivalent to maximizing the log-likelihood function

l(β) = Σ_{i=1}^n [y_i x_i^T β − log(1 + exp(x_i^T β))]. (13)

The parameter that maximizes l(β) is the solution to the following likelihood equation:

l′(β) = ∂l(β)/∂β = Σ_{i=1}^n x_i [y_i − ψ(x_i^T β)] = 0. (14)

The likelihood equation (14) is typically solved numerically, for example by the Newton-Raphson method. Under certain regularity conditions, the MLE is strongly consistent: it converges to the true parameter with probability one as n goes to infinity.
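A minimal Newton-Raphson sketch for solving (14) is given below (Python with numpy, our own illustration; it assumes the Hessian stays invertible and a fixed iteration count suffices, and omits the convergence checks a production implementation would need).

    import numpy as np

    def psi(t):
        return 1.0 / (1.0 + np.exp(-t))        # psi(t) = exp(t) / (1 + exp(t))

    def logistic_mle(X, y, iters=25):
        # Newton-Raphson iteration for the likelihood equation (14).
        beta = np.zeros(X.shape[1])
        for _ in range(iters):
            p = psi(X @ beta)
            score = X.T @ (y - p)              # l'(beta), the left side of (14)
            W = p * (1.0 - p)                  # psi'(x_i^T beta) for each observation
            hessian = -(X * W[:, None]).T @ X  # l''(beta)
            beta -= np.linalg.solve(hessian, score)  # Newton step
        return beta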
4.2 Compression and Aggregation Scheme
Denote the observations in the kth component cell c_k by {(y_{k1}, x_{k1}), …, (y_{kn_k}, x_{kn_k})}, where each y_{kj} is a binary categorical attribute and x_{kj} is a d-dimensional vector of explanatory variables. For the logistic regression model in (9), we denote by β̂_k the MLE of β in (10) based on the data in cell c_k. Therefore, β̂_k is the solution to the likelihood equation

l′_k(β) = ∂l_k(β)/∂β = Σ_{j=1}^{n_k} x_{kj} [y_{kj} − ψ(x_{kj}^T β)] = 0.

Its Taylor expansion at β̂_k is given by

l′_k(β) = Σ_{j=1}^{n_k} x_{kj} [−ψ′(x_{kj}^T β̂_k)(β − β̂_k)^T x_{kj}] + Σ_{j=1}^{n_k} x_{kj} [−(1/2) ψ″(x_{kj}^T β*)((β − β̂_k)^T x_{kj})²], (15)


where we used the fact that l′_k(β̂_k) = 0, and β* is some vector between β and β̂_k given by Taylor's theorem. Let F_k(β) be the first-order approximation to the likelihood equation; it follows from (15) that

F_k(β) = Σ_{j=1}^{n_k} x_{kj} [−ψ′(x_{kj}^T β̂_k)(β − β̂_k)^T x_{kj}] = −A_k β + A_k β̂_k,

where

A_k = Σ_{j=1}^{n_k} x_{kj} x_{kj}^T ψ′(x_{kj}^T β̂_k) = Σ_{j=1}^{n_k} x_{kj} x_{kj}^T exp(x_{kj}^T β̂_k) / (1 + exp(x_{kj}^T β̂_k))².

It is clear that F_k(β) depends only on A_k and β̂_k, whose dimensions are independent of the number of data records in c_k. We therefore solve the linearized equation

Σ_{k=1}^K F_k(β) = 0

in the aggregated cell instead of Σ_{k=1}^K l′_k(β) = 0, by saving (β̂_k, A_k) in each cell c_k. Using this compression scheme, we can approximate the MLE of c_a by the solution to

F_a(β) = Σ_{k=1}^K F_k(β) = Σ_{k=1}^K (−A_k β + A_k β̂_k) = 0, (16)


which leads to

β̃_a = (Σ_{k=1}^K A_k)^{−1} Σ_{k=1}^K A_k β̂_k. (17)

In addition, in the aggregated cell c_a, we can also obtain A_a from the A_k of the component cells by observing that

A_a = Σ_{k=1}^K A_k = Σ_{k=1}^K Σ_{j=1}^{n_k} x_{kj} x_{kj}^T ψ′(x_{kj}^T β̂_k).
To summarize, our asymptotically lossless compression technique can be described as follows.

Compression: in each component cell c_k, we store the LCRC

LCRC(c_k) = (β̂_k, A_k). (18)

Aggregation: the aggregated LCRC (β̃_a, A_a) is calculated using (16) and (17). Such a process can be used to aggregate base cells at the lowest level as well as cells at intermediate levels; for any non-base cell, β̃ is used in place of β̂_k in its LCRC.
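As a sketch of the scheme in (16)-(18), the following Python fragment (our own illustration, reusing psi and logistic_mle from the sketch in Section 4.1) compresses each component cell into its LCRC and aggregates the LCRCs with (17). Unlike the linear case, the aggregated β̃_a is an approximation whose error vanishes as the cells grow.

    import numpy as np

    def lcrc(X, y):
        # Compress a component cell into LCRC = (beta_hat_k, A_k), as in (18).
        beta_k = logistic_mle(X, y)            # per-cell MLE beta_hat_k
        p = psi(X @ beta_k)
        w = p * (1.0 - p)                      # psi'(x_kj^T beta_hat_k)
        A_k = (X * w[:, None]).T @ X           # A_k = sum_j psi'(...) x_kj x_kj^T
        return beta_k, A_k

    def aggregate_lcrc(lcrcs):
        # Equation (17): beta_tilde_a = (sum_k A_k)^{-1} sum_k A_k beta_hat_k.
        A_a = sum(A for _, A in lcrcs)         # A_a = sum_k A_k
        rhs = sum(A @ b for b, A in lcrcs)
        return np.linalg.solve(A_a, rhs), A_a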
5 CONCLUSION
In this paper, we have developed a compression scheme that compresses a data cell into a compressed representation whose size is independent of the size of the cell. We have developed LCRN and LCRC, lossless compression techniques for aggregating linear and nonlinear regression parameters and logistic regression parameters in data cubes, respectively, so that only a small number of data values (numerical as well as categorical), rather than the complete raw data, need to be registered for multidimensional regression analysis.

Lossless aggregation formulas are derived based on the compressed LCRN and LCRC representations. The aggregation is efficient in terms of time and space complexity. The proposed technique allows us to quickly perform OLAP operations and generate regression models at any level in a data cube without retrieving or storing the raw data. We are currently extending the technique to more general situations, such as quasi-likelihood estimation for generalized statistical models.

K. Bhaskar Naik, M.Tech., is an Assistant Professor in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. He received B.Tech and M.Tech degrees in Computer Science at JNTU Hyderabad and JNTU Anantapur and is pursuing a Ph.D. His research interests are Data Mining, Knowledge Engineering, Image Processing, and Pattern Recognition.

M. Supriya, M.Tech., is a Teaching Assistant in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. She received an M.Tech degree in Computer Science. Her academic interests are Data Mining and Computer Networks.

Ch. Prathima, M.Tech., is an Assistant Professor in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. She received an M.Tech degree in Computer Science. Her academic interests are Computer Networks and Cloud Computing.

B. Ramakantha Reddy, M.Tech., is a Teaching Assistant in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. He received an M.Tech degree in Computer Science. His academic interests are Data Mining and Computer Networks.