
Data Reduction
• Data reduction is a capacity optimization
technique in which data is reduced to its
simplest possible form to free up capacity on
a storage device. There are many ways to
reduce data, but the idea is very simple—
squeeze as much data into physical storage as
possible to maximize capacity.
Example
• An example in astronomy is the on-board data reduction on the
Kepler satellite. The satellite records 95-megapixel
images once every six seconds, generating dozens of
megabytes of data per second, which is orders of
magnitude more than the downlink bandwidth of 550
KBps. The on-board data reduction co-adds the raw
frames over thirty-minute intervals (about 300 six-second
frames combined into one), reducing the bandwidth by a
factor of 300. Furthermore, interesting targets are
pre-selected and only the relevant pixels are processed,
about 6% of the total. This reduced data is then sent to
Earth, where it is processed further.
Types of Data Reduction
• Dimensionality Reduction
• Numerosity Reduction
• Statistical modelling
Dimensionality Reduction
• When dimensionality increases, data becomes
increasingly sparse, and the density and distances between
points, which are critical to clustering and outlier analysis, become
less meaningful. Dimensionality reduction helps reduce
noise in the data and allows for easier visualization, for
example when 3-dimensional data is projected into 2
dimensions to reveal structure that was hidden. One
method of dimensionality reduction is the wavelet transform,
in which data is transformed so as to preserve relative distances
between objects at different levels of resolution; it is
often used for image compression.
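A minimal sketch of this idea, assuming the PyWavelets library (pywt) and a synthetic 1-D signal in place of real measurements; one level of the discrete wavelet transform splits the signal into coarse and fine coefficients, and keeping only the coarse half reduces the stored data while preserving the overall shape:

```python
import numpy as np
import pywt  # PyWavelets

# synthetic 1-D signal standing in for real measurements
signal = np.sin(np.linspace(0, 8 * np.pi, 1024)) + 0.1 * np.random.randn(1024)

# one level of the discrete wavelet transform: approximation (coarse)
# and detail (fine) coefficients, 512 values each
approx, detail = pywt.dwt(signal, "haar")

# data reduction: keep only the approximation coefficients and
# reconstruct, treating the discarded detail as zeros
reduced = pywt.idwt(approx, None, "haar")
```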
Numerosity Reduction
• This method of data reduction reduces the data volume by choosing alternative,
smaller forms of data representation. Numerosity reduction can be split into two
groups: parametric and non-parametric methods. Parametric methods (regression,
for example) assume the data fits some model, estimate the model parameters, store
only the parameters, and discard the data; a sketch of this idea follows below.
Another example is a log-linear model, which obtains the value at
a point in m-dimensional space as the product of values on appropriate marginal
subspaces. Non-parametric methods do not assume a model; examples include
histograms, clustering, and sampling.
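A minimal sketch of the parametric case, assuming synthetic, roughly linear (x, y) data; a regression line is fitted and only its two parameters are kept in place of all the points:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=10_000)
y = 3.5 * x + 12.0 + rng.normal(scale=2.0, size=x.size)  # noisy linear data

# fit y ~ slope * x + intercept and store only the two parameters
slope, intercept = np.polyfit(x, y, deg=1)

# 10,000 (x, y) pairs reduce to a 2-number model; any y can now be
# approximated from its x on demand
y_at_40 = slope * 40.0 + intercept
```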
Statistical modelling
• Data reduction can be obtained by assuming a
statistical model for the data. Classical
principles of data reduction include:
sufficiency, likelihood, conditionality and
equivariance. A sketch of the sufficiency idea follows below.
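A minimal sketch of the sufficiency principle, assuming the data are modelled as i.i.d. normal draws; under that model the sample mean, sample variance and count summarize everything needed for inference about the parameters, so the raw values can be discarded:

```python
import numpy as np

data = np.random.normal(loc=10.0, scale=2.0, size=100_000)

# under a normal model, (mean, variance, n) is a sufficient summary
# for inference about the distribution's parameters
summary = (data.mean(), data.var(ddof=1), data.size)

# 100,000 raw values reduced to three numbers
mean, variance, n = summary
```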
Methods of data reduction: 
• These are explained below. 
• 1. Data Cube Aggregation: 
This technique is used to aggregate data into a simpler form. For example, imagine that the information
you gathered for your analysis covers the years 2012 to 2014 and includes the revenue of your
company every three months. If the analysis concerns annual sales rather than the quarterly
figures, we can summarize the data so that the resulting data set records the total
sales per year instead of per quarter; a sketch follows below. 
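A minimal sketch of this aggregation, assuming pandas and made-up quarterly revenue figures; grouping by year rolls twelve quarterly rows up into three annual totals:

```python
import pandas as pd

quarterly = pd.DataFrame({
    "year": [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "revenue": [200, 220, 210, 250, 230, 240, 235, 260, 255, 270, 265, 290],
})

# roll the cube up from quarter level to year level
annual = quarterly.groupby("year")["revenue"].sum()
print(annual)  # one total per year instead of four rows per year
```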
• 2. Dimension reduction: 
Whenever we come across data that is only weakly relevant, we keep just the attributes required
for our analysis. This reduces data size by eliminating outdated or redundant features. 
• Step-wise Forward Selection – 
The selection begins with an empty set of attributes; at each step, the best of the remaining original
attributes is added to the set, judged by its statistical relevance (in statistics, typically via a p-value).
Suppose there are the following attributes in the data set, a few of which are
redundant (a sketch of both selection directions follows after the backward example). 
• Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
Step-wise Backward Selection – 
This selection starts with the complete set of attributes in the original data
and at each step eliminates the worst remaining attribute in the
set. Suppose there are the following attributes in the data set, a few of
which are redundant. 
• Initial attribute Set: {X1, X2, X3, X4, X5, X6}
• Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }
• Step-1: {X1, X2, X3, X4, X5}
• Step-2: {X1, X2, X3, X5}
• Step-3: {X1, X2, X5}
• Final reduced attribute set: {X1, X2, X5}
Combination of Forward and Backward Selection – 
It allows us to remove the worst attributes and select the best ones, saving time
and making the process faster. 
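A minimal sketch of step-wise selection, assuming scikit-learn and a synthetic data set in which only three of six attributes carry signal; SequentialFeatureSelector implements both directions described above:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# 6 candidate attributes, only 3 of which are informative
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       random_state=0)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",   # "backward" gives step-wise backward elimination
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the retained attributes
```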
•  Data Compression: 
The data compression technique reduces the size of files using
different encoding mechanisms (Huffman encoding and run-length
encoding). We can divide it into two types based on the
compression technique. 
• Lossless Compression – 
Encoding techniques such as run-length encoding allow a simple and
modest reduction in data size. Lossless data compression uses
algorithms that restore the precise original data from the compressed
data; a sketch follows below. 
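A minimal sketch of run-length encoding in plain Python; the decoder restores the exact original string, which is what makes the scheme lossless:

```python
from itertools import groupby

def rle_encode(text):
    # store each run as a (symbol, run-length) pair
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    # expand each pair back into its run of symbols
    return "".join(ch * count for ch, count in pairs)

original = "aaaabbbccd"
packed = rle_encode(original)          # [('a', 4), ('b', 3), ('c', 2), ('d', 1)]
assert rle_decode(packed) == original  # lossless round trip
```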
• Lossy Compression – 
Methods such as the discrete wavelet transform and PCA
(principal component analysis) are examples of this kind of compression. For
example, the JPEG image format uses lossy compression, but the decompressed
image remains meaningfully equivalent to the original. In lossy data
compression, the decompressed data may differ from the original data
but are useful enough to retrieve information from; a sketch follows below. 
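A minimal sketch of lossy reduction with PCA, assuming scikit-learn and randomly generated correlated data; projecting onto two components and reconstructing yields data close to, but not identical to, the original:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 10 correlated features driven by 2 underlying factors, plus noise
factors = rng.normal(size=(500, 2))
X = factors @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(500, 10))

pca = PCA(n_components=2)
compressed = pca.fit_transform(X)             # 500 x 10 -> 500 x 2
restored = pca.inverse_transform(compressed)  # back to 500 x 10

# lossy: the reconstruction differs from X, but only slightly
print(np.abs(X - restored).max())
```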
•  Numerosity Reduction: 
In this reduction technique the actual data is replaced with a mathematical model or a
smaller representation of the data, in which case only the model parameters need to be
stored, or with a non-parametric representation such as clustering, histograms, or
sampling (see the regression sketch under Numerosity Reduction above). 

Discretization & Concept Hierarchy Operation: 


Techniques of data discretization are used to divide attributes of a continuous
nature into data with intervals. We replace the many continuous values of an attribute with
labels for a small number of intervals. This means that mining results are shown in a concise and
easily understandable way. 
• Top-down discretization – 
If we first consider one or a couple of points (so-called breakpoints or split points) to
divide the whole range of the attribute and repeat this method on the resulting intervals, the
process is known as top-down discretization, also known as splitting. 
• Bottom-up discretization – 
If we first consider all the continuous values as potential split points and then discard some
by merging neighbouring values into intervals, the process is
called bottom-up discretization, also known as merging. 
Concept Hierarchies: 
• It reduces the data size by collecting and then replacing the
low-level concepts (such as the age 43) with high-level
concepts (categorical variables such as middle aged or
senior). 
• For numeric data the following techniques can be used (a sketch follows this list): 
• Binning – 
Binning is the process of changing numerical variables into
categorical counterparts. The number of categorical
counterparts depends on the number of bins specified by
the user. 
• Histogram Analysis
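A minimal sketch of binning as a concept hierarchy, assuming pandas and made-up ages; the numeric ages are replaced by the coarser labels used in the example above:

```python
import pandas as pd

ages = pd.Series([12, 19, 25, 37, 43, 51, 58, 64, 70, 81])

# replace low-level ages with high-level concept labels
labels = pd.cut(ages,
                bins=[0, 20, 40, 60, 120],
                labels=["youth", "young adult", "middle aged", "senior"])
print(labels.value_counts())  # e.g. 43 now falls in "middle aged"
```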
Histograms
• A histogram is a graphical display of data using bars of different heights, in
which each bar groups numbers into ranges. Taller bars show that more data falls
in that range. A histogram displays the shape and spread of continuous sample data. 
• It is a plot that lets you discover, and show, the underlying frequency distribution
(shape) of a set of continuous data. This allows inspection of the data for its
underlying distribution (e.g., normal distribution), outliers, skewness, etc. It is an
accurate representation of the distribution of numerical data, and it relates to only one
variable. It involves bins or buckets: ranges of values that divide the entire range of
values into a series of intervals, together with a count of how many values fall into each interval.
• Bins are consecutive, non-overlapping intervals of a variable. As adjacent bins
leave no gaps, the rectangles of the histogram touch each other to indicate that the
original variable is continuous.
• Histograms are based on area, not height of bars
• In a histogram, the height of the bar does not necessarily
indicate how many occurrences of scores there were
within each bin. It is the product of the height and
the width of the bin that indicates the frequency of
occurrences within that bin (a numeric check follows below). One of the reasons the
height of the bars is often incorrectly read as
indicating the frequency rather than the area of the bar is
that many histograms have equally spaced bars
(bins), and under these circumstances the height of the
bin does reflect the frequency.
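A minimal numeric check of this, assuming NumPy and deliberately unequal bin widths; the count in each bin equals the bar's area (density height times width) scaled by the sample size, not the height alone:

```python
import numpy as np

data = np.random.randn(1000)
edges = np.array([-4.0, -1.0, 0.0, 0.5, 1.0, 4.0])  # unequal widths

density, _ = np.histogram(data, bins=edges, density=True)
widths = np.diff(edges)

# frequency per bin = height (density) * width * sample size
freq = density * widths * len(data)
counts, _ = np.histogram(data, bins=edges)
assert np.allclose(freq, counts)
```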
Histogram Vs Bar Chart
• The major difference is that a histogram is
only used to plot the frequency of score
occurrences in a continuous data set that has
been divided into classes, called bins.
• Bar charts, on the other hand, can be used for
a lot of other types of variables including
ordinal and nominal data sets.
Properties
• Taller bars indicate that more data falls in that range.
• A histogram shows the shape and spread of continuous sample data.
• It is a plot that allows you to find, and show, the underlying frequency distribution
(shape) of a set of continuous data.
• This permits assessment of the data for its underlying distribution, skewness,
outliers, and so on.
• It is an accurate portrayal of the distribution of numerical data, and it relates
to just a single variable.
• It incorporates bins or buckets: ranges of values that partition the whole range
of values into a series of intervals, together with a count of the number of
values that fall into each interval.
• Bins are consecutive, non-overlapping intervals of a variable. As the adjacent bins
leave no gaps, the rectangles of the histogram touch each other to
demonstrate that the underlying variable is continuous.
Histogram analysis
• Like the process of binning, histogram analysis is used to
partition the values of an attribute X into disjoint
ranges called buckets. There are several partitioning
rules (a sketch follows this list): 
• Equal-frequency partitioning: partition the values so that each bucket
holds roughly the same number of data points. 
• Equal-width partitioning: partition the values into buckets of a
fixed width, determined by the number of bins (e.g., buckets
spanning 0-20, 20-40, and so on). 
• Clustering: group similar data together. 
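A minimal sketch contrasting the first two rules, assuming pandas and a small made-up series; cut produces equal-width buckets while qcut produces equal-frequency ones:

```python
import pandas as pd

values = pd.Series([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])

# equal-width: each bucket spans the same range of values
equal_width = pd.cut(values, bins=3)
# equal-frequency: each bucket holds roughly the same number of values
equal_freq = pd.qcut(values, q=3)

print(equal_width.value_counts())
print(equal_freq.value_counts())
```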
Common mistakes
• Try several bin sizes; different choices can lead to very different
conclusions (a sketch follows this list).
• Don't use a weird color scheme. It does not give any more
insight.
• Don't confuse it with a bar plot. A bar plot gives a value for
each group of a categorical variable. Here, we have only a
numeric variable and we check its distribution.
• Don't compare more than ~3 groups in the same histogram.
The graphic gets cluttered and hardly understandable.
Instead use a violin plot, a box plot, a ridgeline plot, or
small multiples.
• Don't use unequal bin widths unless bar heights are scaled to
densities (see the area discussion above).
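A minimal sketch of the first pitfall, assuming Matplotlib and NumPy; the same bimodal sample can look unimodal with too few bins, which is why several bin sizes should be tried:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# bimodal sample: two overlapping normal distributions
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [4, 20, 80]):
    ax.hist(data, bins=bins)
    ax.set_title(f"{bins} bins")
plt.show()  # with 4 bins the two modes can disappear entirely
```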
Cross Filtering
• Cross-filtering makes it easier and more
intuitive for viewers of dashboards to interact
with a dashboard's data and understand how
one metric affects another. With cross-
filtering, users can click a data point in one
dashboard tile to have all dashboard tiles
automatically filter on that value.
• https://www.youtube.com/watch?v=nRTOHmqyhoo
