Definition
Data reduction is a way to obtain a compressed representation of the data with less volume. The condensed data maintains the integrity of the original data and yields the same, or very similar, analytical results as the actual data.
Below we have the quarterly sales data of a company across four item types: entertainment, keyboard, mobile, and locks, for one of its locations, Delhi. The data is in two-dimensional (2D) format and reports sales on a quarterly basis.
1. Dimensionality reduction
2. Numerosity reduction
(1) Dimension reduction
In the language of data, dimensions are also known as features or attributes. These dimensions are simply properties of the data, i.e., they describe what the data is about.
For instance, suppose we have data on the employees of a company: their name, age, gender, location, education, and income. Each of these variables helps us to know, understand, and describe a data point.
As the number of features increases, so does the sparsity of the dataset. Sparsity means that a relatively high percentage of the cells contain no actual data. These “empty” cells, or NA values, take up unnecessary storage.
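As a concrete toy illustration, the fraction of empty cells can be measured directly. The records and field names below are made up for the example; `None` stands in for an NA value:

```python
# Hypothetical employee records; None marks missing ("empty") cells.
rows = [
    {"name": "Asha", "age": 31, "income": None, "education": None},
    {"name": "Ravi", "age": None, "income": 52000, "education": "BSc"},
    {"name": "Meena", "age": 28, "income": None, "education": None},
]

cells = [v for row in rows for v in row.values()]
sparsity = sum(v is None for v in cells) / len(cells)
print(f"sparsity: {sparsity:.0%}")  # 5 of 12 cells hold no actual data -> 42%
```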
(2) Numerosity reduction
Numerosity reduction replaces the original data volume with an alternative, smaller form of representation. Its methods can be parametric or non-parametric.
1. Parametric Reduction
Regression and log-linear models can be used to approximate the given data. In
(simple) linear regression, the data are modeled to fit a straight line.
For example, a random variable, y (called a response variable), can be modeled
as a linear function of another random variable, x (called a predictor variable),
with the equation y = wx + b,
where the variance of y is assumed to be constant. x and y are numeric
database attributes. The coefficients, w and b (called regression coefficients),
specify the slope of the line and the y-intercept, respectively. These
coefficients can be solved for by the method of least squares, which minimizes
the error between the actual data points and the estimated
line. Multiple linear regression is an extension of (simple) linear regression,
which allows a response variable, y, to be modeled as a linear function of two
or more predictor variables.
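A minimal sketch of solving for w and b by least squares, using made-up sample points and only the standard library:

```python
# Least-squares fit of y = w*x + b (illustrative data, pure stdlib).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.2, 9.9]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Closed-form least squares: w = cov(x, y) / var(x); b = mean_y - w * mean_x.
w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - w * mean_x
print(f"y = {w:.2f}x + {b:.2f}")  # y = 1.97x + 0.13
```

Once w and b are stored, the five (x, y) pairs can be discarded: the two coefficients are the reduced representation of the data.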
• Log-Linear: This technique is useful when the purpose is to
explore and understand the relationship between two or more
discrete variables. For understanding purposes, suppose we have a
set of tuples in an n-dimensional space, each tuple recording
discrete attribute values (for example, a fruit type such as
‘apple’, ‘banana’, ‘cherry’, or ‘dragon fruit’, together with other
attributes). The joint probability of each tuple can then be
approximated from lower-dimensional marginal distributions; under
the simplest (independence) model, the joint probability of a tuple
(a, b, …, n) is the product of the marginal probabilities
p(a) · p(b) · … · p(n).
Regression and log-linear models can both be used on sparse data, although
their application may be limited. While both methods can handle skewed data,
regression does exceptionally well. Regression can be computationally intensive
when applied to high-dimensional data, whereas log-linear models show good
scalability for up to 10 or so dimensions.
In this way, the log-linear model can be used to estimate the probability of each tuple in the multidimensional space.
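As an illustrative sketch, the simplest log-linear model (full independence) approximates each joint probability as a product of one-way marginals; the two-attribute fruit/size data below is invented for the example:

```python
from collections import Counter
from itertools import product

# Toy 2-D discrete data: (fruit, size) tuples; values are illustrative only.
data = [("apple", "small"), ("apple", "large"), ("banana", "large"),
        ("banana", "large"), ("cherry", "small"), ("cherry", "small")]

n = len(data)
# One-way marginal probabilities for each dimension.
p_fruit = {k: v / n for k, v in Counter(f for f, _ in data).items()}
p_size = {k: v / n for k, v in Counter(s for _, s in data).items()}

# Independence (simplest log-linear) model: p(f, s) ~ p(f) * p(s).
# Only the small marginal tables need to be stored, not the full joint table.
for f, s in product(p_fruit, p_size):
    print(f"p({f}, {s}) = {p_fruit[f] * p_size[s]:.3f}")
```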
2. Non-parametric Reduction
Non-parametric methods, on the other hand, do not assume that the data fits any particular model. Unlike parametric methods, they may not achieve as high a degree of reduction, but they produce a uniform, systematic reduced representation regardless of the size of the data.
The main non-parametric data reduction techniques are:
• Histogram
• Clustering
• Sampling
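As a quick sketch of the first technique, an equal-width histogram replaces individual values with per-bucket counts; the values and bucket count below are arbitrary:

```python
# Equal-width histogram: replace individual values with bucket counts.
values = [3, 5, 7, 8, 12, 14, 15, 18, 22, 24, 27, 29]
num_buckets = 3
lo, hi = min(values), max(values)
width = (hi - lo) / num_buckets

counts = [0] * num_buckets
for v in values:
    # Clamp so the maximum value falls into the last bucket.
    idx = min(int((v - lo) / width), num_buckets - 1)
    counts[idx] += 1

for i, c in enumerate(counts):
    print(f"[{lo + i * width:.1f}, {lo + (i + 1) * width:.1f}): {c} values")
```

Three bucket counts now stand in for twelve raw values; finer buckets trade storage for fidelity.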
Clustering
Clustering techniques consider data tuples as objects. They partition the objects
into groups, or clusters, so that objects within a cluster are “similar” to one
another and “dissimilar” to objects in other clusters. Similarity is commonly
defined in terms of how “close” the objects are in space, based on a distance
function. The “quality” of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster. Centroid distance is
an alternative measure of cluster quality and is defined as the average distance
of each cluster object from the cluster centroid (denoting the “average object,”
or average point in space for the cluster).
In data reduction, the cluster representations of the data are used to replace the
actual data. The effectiveness of this technique depends on the data’s nature. It
is much more effective for data that can be organized into distinct clusters than
for smeared data.
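The idea can be sketched with a minimal one-dimensional k-means: the raw points are replaced by k centroids. This toy implementation and its data are illustrative, not a production clustering routine:

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Cluster 1-D points and return k centroids (a minimal k-means sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two well-separated groups; the two centroids stand in for the six raw points.
points = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
print(sorted(kmeans_1d(points, k=2)))  # close to [1.0, 10.0]
```

Because these points form distinct clusters, the two centroids summarize the data well; for smeared data the same replacement would lose much more information.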