
Data Reduction

Muchake Brian
Phone: 0701178573
Email: bmuchake@gmail.com, bmuchake@cis.mak.ac.ug

Do not Keep Company With Worthless People


Psalms 26:11
Introduction to Data Reduction
• Data reduction is the transformation of numerical or alphabetical digital information derived
empirically or experimentally into a corrected, ordered, and simplified form. The basic concept
is the reduction of large volumes of data down to their meaningful parts.
• Data reduction is the process of reducing the amount of capacity required to store data. Data
reduction can increase storage efficiency and reduce costs. Storage vendors will often
describe storage capacity in terms of raw capacity and effective capacity, which refers to data
after the reduction.
• Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume but still contains the critical information.
• Data reduction can be achieved in several ways. In storage systems, the main techniques are data
deduplication, compression and single-instance storage.
Introduction to Data Reduction
Data Reduction Strategies:
• 1. Data Cube Aggregation: Aggregation operations are applied to the data in the
construction of a data cube.
• 2. Dimensionality Reduction: Redundant or irrelevant attributes are detected and removed, which
reduces the data set size.
• 3. Data Compression: Encoding mechanisms are used to reduce the data set size.
• 4. Numerosity Reduction: The data are replaced or estimated by alternative, smaller representations.
• 5. Discretisation and concept hierarchy generation: Raw data values for attributes are
replaced by ranges or higher conceptual levels.
Data Cube Aggregation
• A data cube is generally used to easily interpret data. It is especially useful for
representing data along dimensions against measures of business interest.
• Each dimension of a cube represents a certain characteristic of the data, for example,
daily, monthly or yearly sales. The data included inside a data cube make it possible to
analyse almost all the figures for virtually any or all customers, sales agents, products, and
much more.
• Thus, a data cube can help to establish trends and analyze performance.
• Data cubes are mainly categorized into two types: MOLAP and ROLAP.
Data Cube Aggregation [Cont’d]
1. MOLAP
•Multidimensional Data Cube: Most OLAP products are developed based on a structure where
the cube is patterned as a multidimensional array.
•These multidimensional OLAP (MOLAP) products usually offer improved performance compared
to other approaches, mainly because subsets of data can be gathered by indexing directly into the
structure of the data cube. As the number of dimensions grows, the cube becomes sparser:
many cells that represent particular attribute combinations will not contain any aggregated data.
•This in turn inflates the storage requirements, which may reach undesirable levels,
making the MOLAP solution untenable for huge data sets with many dimensions. Compression
techniques might help; however, their use can damage the natural indexing of MOLAP.
Data Cube Aggregation [Cont’d]
2. ROLAP
•Relational OLAP: Relational OLAP makes use of the relational database model. The ROLAP
data cube is implemented as a collection of relational tables (approximately twice as many as the
number of dimensions) rather than as a multidimensional array. Each of these tables,
known as a cuboid, represents a specific view.
Data Cube Aggregation [Cont’d]
Data Cube/Hypercube
• Hypercubes summarise data into dimensions
• Multidimensional hypercubes enable managers to analyse values at the intersection of these
dimensions
[Figure: 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A., Canada, Mexico, sum); the highlighted aggregate cell shows the total annual sales of TVs in the U.S.A.]
Data Cube Aggregation [Cont’d]
Illustration of Data Cube
• Suppose a company wants to keep track of sales records with the help of a sales data warehouse with
respect to time, item, branch, and location.
• These dimensions allow the company to keep track of monthly sales and of the branch at which the items
were sold. There is a table associated with each dimension, known as a dimension table. For example, the "item"
dimension table may have attributes such as item_name, item_type, and item_brand.
• The following table represents the 2-D view of sales data for a company with respect to the time, item,
and location dimensions.
Data Cube Aggregation [Cont’d]

• In this 2-D table, we have records with respect to time and item only. The sales for New Delhi are shown with
respect to the time and item dimensions, according to the type of items sold.
• If we want to view the sales data with one more dimension, say the location dimension, then a 3-D view is useful.
The 3-D view of the sales data with respect to time, item, and location is shown in the table below:
Data Cube Aggregation [Cont’d]

• The above 3-D table can be represented as a 3-D data cube, as shown in the following figure:
Data Cube Aggregation [Cont’d]

Architecture of MOLAP Model


Data Cube Aggregation [Cont’d]

Architecture of ROLAP Model


Data Cube Aggregation [Cont’d]
Architecture of ROLAP Model [Cont’d]
• The analytical server in the middle tier application layer creates multidimensional views on the fly.
• The multidimensional system at the presentation layer provides a multidimensional view of the data to
the users.
• When the users issue complex queries based on this multidimensional view, the queries are
transformed into complex SQL directed to the relational database.
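As a rough illustration of the kind of aggregation a data cube performs over the time, item and location dimensions of the earlier sales example, here is a minimal sketch using pandas; the sales records, column names and values are purely illustrative assumptions, not data from the original example.

```python
import pandas as pd

# Illustrative sales records with three dimensions and one measure.
sales = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["TV", "PC", "TV", "PC", "TV", "PC"],
    "location": ["New Delhi", "New Delhi", "New Delhi", "Mumbai", "Mumbai", "Mumbai"],
    "units":    [605, 825, 680, 952, 310, 512],
})

# 3-D "cuboid": aggregate the measure over all three dimensions.
cube = sales.groupby(["time", "item", "location"])["units"].sum()

# 2-D view (time x item), aggregating the location dimension away - one roll-up of the cube.
view_2d = sales.pivot_table(values="units", index="time", columns="item",
                            aggfunc="sum", margins=True)   # margins adds 'All' totals
print(view_2d)
```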
Attribute Subset Selection
• Attribute subset selection is a technique used for data reduction in the data mining
process. Data reduction reduces the size of the data so that it can be used for analysis purposes
more efficiently.
Need of Attribute Subset Selection
• The data set may have a large number of attributes. But some of those attributes can be
irrelevant or redundant.
• The goal of attribute subset selection is to find a minimum set of attributes such that dropping
the irrelevant attributes does not significantly affect the utility of the data, while the cost of data
analysis is reduced. Mining on a reduced data set also makes the discovered patterns
easier to understand.
Attribute Subset Selection [Cont’d]
Process of Attribute Subset Selection
• The brute force approach, in which every subset (2^n possible subsets for n attributes) of the
data is analysed, can be very expensive.
• A more practical way to do the task is to use statistical significance tests so that the best (or worst)
attributes can be recognized.
• A statistical significance test assumes that attributes are independent of one another. This is a
greedy approach in which a significance level is chosen (5% is a common convention) and the
models are refitted again and again until the p-value (probability value) of every retained
attribute is less than or equal to the selected significance level. Attributes with a p-value
higher than the significance level are discarded.
• This procedure is repeated until every attribute remaining in the data set has a p-value less than
or equal to the significance level. This gives us a reduced data set with no irrelevant attributes
(a minimal sketch of this procedure follows below).
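A minimal sketch of this greedy, p-value-based elimination, assuming a numeric feature matrix X, a target y and the statsmodels library; the column names, the generated data and the 0.05 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y, alpha=0.05):
    """Greedy backward elimination: drop the attribute with the highest
    p-value until every remaining attribute is significant at level alpha."""
    features = list(X.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvalues = model.pvalues.drop("const")      # ignore the intercept
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:                # every attribute is significant
            break
        features.remove(worst)                     # discard the worst attribute
    return features

# Illustrative data: y depends on x1 and x2 only; x3 is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y = 3 * X["x1"] - 2 * X["x2"] + rng.normal(scale=0.5, size=200)
print(backward_eliminate(X, y))   # expected to keep ['x1', 'x2']
```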
Attribute Subset Selection [Cont’d]
Methods of Attribute Subset Selection
1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction.
Stepwise Forward Selection
• This procedure starts with an empty set of attributes as the minimal set. The most relevant
attribute (the one with the lowest p-value) is chosen and added to the minimal set. In each
iteration, one attribute is added to the reduced set.
Attribute Subset Selection [Cont’d]
Stepwise Backward Elimination
• Here the initial set contains all the attributes. In each iteration, the attribute whose p-value is
highest above the significance level is eliminated from the set.
Combination of Forward Selection and Backward Elimination
• Stepwise forward selection and backward elimination are combined so as to select the relevant attributes
most efficiently. This is the most commonly used technique for attribute selection.
Decision Tree Induction
• This approach uses a decision tree for attribute selection. It constructs a flow-chart-like structure whose
internal nodes denote tests on attributes, whose branches correspond to the outcomes of those tests, and
whose leaf nodes give a class prediction.
• Attributes that are not part of the tree are considered irrelevant and hence discarded (a small sketch follows below).
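A minimal sketch of decision-tree-based attribute selection, assuming scikit-learn and its bundled iris data set; the depth limit and the "importance greater than zero" rule are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a decision tree and keep only the attributes the tree actually uses.
data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# feature_importances_ is zero for attributes that never appear in the tree.
selected = [name for name, imp in zip(data.feature_names, tree.feature_importances_) if imp > 0]
print(selected)   # attributes not in the tree are treated as irrelevant
```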
Numerosity Reduction in Data Mining
• This is a technique of choosing smaller forms of data representation to reduce the volume of
data.
• Data reduction process reduces the size of data and makes it suitable and feasible for
analysis. In the reduction process, integrity of the data must be preserved and data volume is
reduced. There are many techniques that can be used for data reduction. Numerosity
reduction is one of them.
• Numerosity reduction is a data reduction technique which replaces the original data by a
smaller form of data representation. There are two families of numerosity reduction techniques,
namely, 1. Parametric and 2. Non-Parametric methods.
Numerosity Reduction in Data Mining [Cont’d]
Parametric Methods
• For parametric methods, the data are represented using some model. The model is used to estimate
the data, so that only the parameters of the model need to be stored, instead of the actual data.
• Regression and Log-Linear methods are used for creating such models.
1.Regression
• Regression can be a simple linear regression or a multiple linear regression. When there is only a
single independent attribute, the regression model is called simple linear regression; if there
are multiple independent attributes, the model is called multiple linear regression.
• In linear regression, the data are modeled to fit a straight line. For example, a random variable y
can be modeled as a linear function of another random variable x with the equation
Numerosity Reduction in Data Mining [Cont’d]
• y = ax + b
• where a and b (the regression coefficients) specify the slope and y-intercept of the line, respectively
(a minimal sketch of this appears after the log-linear model below).
• In multiple linear regression, y is modeled as a linear function of two or more
predictor (independent) variables.
2. Log-Linear Model
• Log-linear model can be used to estimate the probability of each data point in a multidimensional
space for a set of discretized attributes, based on a smaller subset of dimensional combinations.
This allows a higher-dimensional data space to be constructed from lower-dimensional attributes.
• Regression and log-linear model can both be used on sparse data, although their application may
be limited.
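A minimal sketch of simple linear regression used as a parametric reduction, assuming numpy; the generated x and y arrays are illustrative, and only the two fitted coefficients are kept in place of the raw values.

```python
import numpy as np

# Illustrative data: y is roughly linear in x.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 1000)
y = 2.5 * x + 4.0 + rng.normal(scale=0.3, size=x.size)

# Fit y = a*x + b and keep only the two regression coefficients.
a, b = np.polyfit(x, y, deg=1)
print(a, b)                       # the stored "reduced" representation

# Any value can later be estimated from the model instead of the raw data.
y_estimate = a * x + b
```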
Numerosity Reduction in Data Mining [Cont’d]
Non-Parametric Methods
• These methods store reduced representations of the data and include histograms, clustering, sampling
and data cube aggregation.
1. Histograms: A histogram represents data in terms of frequency. It uses binning to approximate the data
distribution and is a popular form of data reduction (a small sketch of histograms and sampling follows this list).
2. Clustering: Clustering divides the data into groups/clusters. This technique partitions the whole data set into
different clusters. In data reduction, the cluster representation of the data is used to replace the actual data.
It also helps to detect outliers in the data.
3. Sampling: Sampling can be used for data reduction because it allows a large data set to be represented by a
much smaller random data sample (or subset).
4. Data Cube Aggregation: Data cube aggregation involves moving the data from a detailed level to a summarized
level with fewer dimensions. The resulting data set is smaller in volume, without loss of the information necessary
for the analysis task.
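A minimal sketch of histogram-based and sampling-based reduction, assuming numpy; the generated values, bin count and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
values = rng.normal(loc=50, scale=10, size=100_000)   # the "large" data set

# Histogram: keep only bin edges and counts instead of 100,000 raw values.
counts, edges = np.histogram(values, bins=20)

# Sampling: represent the data by a much smaller random subset.
sample = rng.choice(values, size=1_000, replace=False)

print(counts.size + edges.size, sample.size)   # sizes of the reduced representations
```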
Dimensionality Reduction in Data Mining
• Dimensionality reduction or dimension reduction is the process of reducing the number of random
variables under consideration by obtaining a set of principal variables. Approaches can be divided
into feature selection and feature extraction.
• Feature selection approaches try to find a subset of the input variables (also called features or
attributes). The three strategies are: the filter strategy (e.g. information gain), the wrapper strategy
(e.g. search guided by accuracy), and the embedded strategy (features are added or removed while
the model is being built, based on prediction errors).
• Data analysis such as regression or classification can often be done more accurately in the reduced
space than in the original space.
• Feature projection (also called Feature extraction) transforms the data in the high-dimensional
space to a space of fewer dimensions. The data transformation may be linear, as in principal
component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.
Dimensionality Reduction in Data Mining [Cont’d]
Components of Dimensionality Reduction
• Feature selection: In this, we try to find a subset of the original set of variables, or features, to
get a smaller subset which can be used to model the problem. It usually involves three ways:
1.Filter
2.Wrapper
3.Embedded
• Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional
space, i.e. a space with fewer dimensions.
Dimensionality Reduction in Data Mining [Cont’d]
Methods of Dimensionality Reduction
• The various methods used for dimensionality reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
• Dimensionality reduction may be either linear or non-linear, depending upon the method used. The
prime linear method, called Principal Component Analysis, or PCA, is discussed below.
1. Principal Component Analysis
• This method was introduced by Karl Pearson. It works on the condition that when the data in a higher-
dimensional space are mapped to data in a lower-dimensional space, the variance of the data in the lower-
dimensional space should be maximal.
Dimensionality Reduction in Data Mining [Cont’d]

• It involves the following steps:
a) Construct the covariance matrix of the data.
b) Compute the eigenvectors of this matrix.
c) Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance
of the original data.
Dimensionality Reduction in Data Mining [Cont’d]
• Hence, we are left with a smaller number of eigenvectors, and there might have been some
data loss in the process. But the most important variances should be retained by the
remaining eigenvectors (a minimal sketch of these steps follows below).
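A minimal sketch of the PCA steps above, assuming numpy; the generated data, the number of attributes and the choice of two retained components are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))                 # 500 samples, 5 attributes (illustrative)
X_centered = X - X.mean(axis=0)

# a) Construct the covariance matrix of the data.
cov = np.cov(X_centered, rowvar=False)

# b) Compute the eigenvectors (and eigenvalues) of this matrix.
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: the covariance matrix is symmetric

# c) Keep the eigenvectors with the largest eigenvalues and project onto them.
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]            # keep 2 principal components
X_reduced = X_centered @ components           # lower-dimensional representation
print(X_reduced.shape)                        # (500, 2)
```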
2. Linear Discriminant Analysis
• Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique. As the name
implies, dimensionality reduction techniques reduce the number of dimensions (i.e. variables)
in a dataset while retaining as much information as possible. For instance, suppose that we
plotted the relationship between two variables where each color represents a different class;
LDA seeks the projection that best separates those classes.
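A minimal sketch of LDA as a dimensionality reduction step, assuming scikit-learn and its bundled iris data set; projecting onto two discriminant axes is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Project a 4-attribute, 3-class data set down to 2 discriminant axes.
data = load_iris()
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(data.data, data.target)   # uses the class labels
print(X_reduced.shape)                                   # (150, 2)
```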
Dimensionality Reduction in Data Mining [Cont’d]
3. General Discriminant Analysis
•General Discriminant Analysis (GDA) is called a "general" discriminant analysis because it applies the
methods of the general linear model to the discriminant function analysis problem.
•A general overview of discriminant function analysis, and the traditional methods for fitting linear models
with categorical dependent variables and continuous predictors, is provided in the context of
Discriminant Analysis.
•In GDA, the discriminant function analysis problem is "recast" as a general multivariate linear model,
where the dependent variables of interest are (dummy-) coded vectors that reflect the group membership
of each case.
•GDA deals with nonlinear discriminant analysis using a kernel function operator. The underlying theory is
close to that of support vector machines (SVM), insofar as the GDA method provides a mapping of the input
vectors into a high-dimensional feature space.
Dimensionality Reduction in Data Mining [Cont’d]
Advantages of Dimensionality Reduction
• It helps in data compression, and hence reduces the required storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction
• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to characterize the data set.
Data Discretization in Data Mining
• Data discretization is defined as a process of converting continuous data attribute values into a finite
set of intervals with minimal loss of information.
• Discretization is the process of transferring continuous functions, models, variables, and equations into
discrete counterparts. This process is usually carried out as a first step toward making them suitable for
numerical evaluation and implementation on digital computers.
• Dichotomization is the special case of discretization in which the number of discrete classes is 2, which can
approximate a continuous variable as a binary variable (creating a dichotomy for modeling purposes, as in
binary classification).
• Discretization Methods include:
1. Binning
• Binning is a top-down splitting technique based on a specified number of bins. Binning is an
unsupervised discretization technique (a small sketch appears after the list of methods below).
Data Discretization in Data Mining
2. Histogram Analysis
•Because histogram analysis does not use class information, it is an unsupervised discretization
technique. Histograms partition the values of an attribute into disjoint ranges called buckets.
3. Cluster Analysis
•Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to
discretize a numerical attribute A by partitioning the values of A into clusters or groups.
•Data discretization techniques can be used to divide the range of a continuous attribute into
intervals. Numerous continuous attribute values are then replaced by a small number of interval labels.
•Top-down discretization: If the process starts by first finding one or a few points (called split
points or cut points) to split the entire attribute range, and then repeats this recursively on the
resulting intervals, it is called top-down discretization or splitting.
Data Discretization in Data Mining
• Bottom-up discretization
• If the process starts by considering all of the continuous values as potential split points, and removes some
by merging neighboring values to form intervals, it is called bottom-up discretization or merging.
4. Entropy-Based Discretization: This uses the concept of information gain and is a supervised,
top-down splitting technique. Most entropy-based discretization methods consider splits locally,
so some valuable information in the data can be lost.
5. Discretization by Intuitive Partitioning: This uses the 3-4-5 rule to segment numerical data into
relatively uniform, natural-seeming intervals.
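A minimal sketch of unsupervised binning (equal-width and equal-frequency), assuming pandas and numpy; the "age" attribute, its generated values and the choice of four bins are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
ages = pd.Series(rng.integers(18, 90, size=1_000), name="age")   # illustrative attribute

# Equal-width binning: split the attribute range into 4 intervals of equal width.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: 4 intervals each holding roughly the same number of values.
equal_freq = pd.qcut(ages, q=4)

# Raw values are replaced by interval labels (the discretized attribute).
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```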
Data Concept Hierarchy Generation in Data Mining
• A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts.
• Concept hierarchies can be generated automatically based on the number of distinct values per attribute:
the attribute with the most distinct values is placed at the lowest level of the hierarchy. Suppose a
user selects a set of location-oriented attributes—street, country, province_or_state, and city—from the
AllElectronics database, but does not specify the hierarchical ordering among the attributes
(a small sketch of this heuristic follows below).
• Concept hierarchies may also be defined by discretizing or grouping values for a given
dimension or attribute, resulting in a set-grouping hierarchy.
• Concept hierarchies may be provided manually by system users, domain experts, or
knowledge engineers, or may be automatically generated based on statistical analysis of the
data distribution.
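A minimal sketch of generating a concept hierarchy from distinct value counts, assuming pandas; the location records below are illustrative, and real data would make street by far the most distinct attribute.

```python
import pandas as pd

# Illustrative location data; real attribute values would come from the database.
df = pd.DataFrame({
    "country":           ["USA", "USA", "USA", "Canada", "Canada", "Mexico"],
    "province_or_state": ["NY", "CA", "CA", "ON", "ON", "DF"],
    "city":              ["New York", "Los Angeles", "San Diego", "Toronto", "Toronto", "Mexico City"],
    "street":            ["5th Ave", "Sunset Blvd", "Harbor Dr", "Yonge St", "Bay St", "Reforma"],
})

# Heuristic: the attribute with the fewest distinct values sits at the top of the hierarchy.
order = df.nunique().sort_values().index.tolist()
print(" < ".join(reversed(order)))   # street < city < province_or_state < country
```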
Data Concept Hierarchy Generation in Data Mining [Cont’d]
• Concept hierarchies can be used to reduce the data by collecting and replacing low-level
concepts with higher-level concepts.
• In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives.
• Data mining on a reduced data set means fewer input/output operations and is more efficient
than mining on a larger data set.
• Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.
Data Concept Hierarchy Generation in Data Mining [Cont’d]
Birch in Data Mining
• BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data
mining algorithm used to perform hierarchical clustering over particularly large data sets.
• An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-
dimensional metric data points in an attempt to produce the best quality clustering for a given
set of resources (memory and time constraints). In most cases, BIRCH only requires a single
scan of the database.
• It is local in that each clustering decision is made without scanning all data points and currently
existing clusters. It exploits the observation that data space is not usually uniformly occupied
and not every data point is equally important.
• It makes full use of available memory to derive the finest possible sub-clusters while minimizing
I/O costs. It is also an incremental method that does not require the whole data set in advance.
Birch in Data Mining [Cont’d]
• BIRCH deals with large datasets by first generating a more compact summary that retains as
much distribution information as possible, and then clustering the data summary instead of the
original dataset.
• Its I/O cost is linear in the dataset size: a single scan of the dataset yields a good clustering,
and one or more additional passes can (optionally) be used to improve the quality further.
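A minimal sketch of BIRCH building a compact summary and clustering it, assuming scikit-learn; the generated blobs, the threshold and the number of clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(5)
# Three well-separated blobs of 2-D points (illustrative data).
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(1_000, 2)) for c in ((0, 0), (5, 5), (0, 5))])

# BIRCH builds a compact CF-tree summary in a single pass, then clusters that summary.
birch = Birch(threshold=0.5, n_clusters=3)
labels = birch.fit_predict(X)
print(len(birch.subcluster_centers_), np.bincount(labels))   # summary size and cluster sizes
```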
OPTIC in Data Mining
• Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-
based clusters in spatial data.
• It was presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander.
Its basic idea is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the
problem of detecting meaningful clusters in data of varying density. To do so, the points of the
database are (linearly) ordered such that spatially closest points become neighbors in the
ordering.
• Additionally, a special distance (the reachability distance) is stored for each point; it represents the
density that must be accepted for a cluster so that both points belong to the same cluster. This can be
represented as a dendrogram.
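A minimal sketch of OPTICS handling clusters of varying density, assuming scikit-learn; the generated points and the min_samples value are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(6)
# Two clusters of very different densities plus background noise.
dense  = rng.normal(loc=(0, 0), scale=0.2, size=(300, 2))
sparse = rng.normal(loc=(6, 6), scale=1.5, size=(300, 2))
noise  = rng.uniform(low=-4, high=10, size=(50, 2))
X = np.vstack([dense, sparse, noise])

# OPTICS orders the points and stores a reachability distance for each one.
optics = OPTICS(min_samples=10)
labels = optics.fit_predict(X)          # -1 marks points treated as noise
print(np.unique(labels), optics.reachability_[optics.ordering_][:5])
```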
OPTIC in Data Mining [Cont’d]
Extensions
• OPTICS-OF is an outlier detection algorithm based on OPTICS. The main use is the extraction of
outliers from an existing run of OPTICS at low cost compared to using a different outlier detection
method. The better known version LOF is based on the same concepts.
• DeLi-Clu (Density-Link-Clustering) combines ideas from single-linkage clustering and OPTICS,
eliminating the ε parameter and offering performance improvements over OPTICS.
• HiSC is a hierarchical subspace clustering (axis-parallel) method based on OPTICS.
• HiCO is a hierarchical correlation clustering algorithm based on OPTICS.
• DiSH is an improvement over HiSC that can find more complex hierarchies.
• FOPTICS is a faster implementation using random projections.
• HDBSCAN* is based on a refinement of DBSCAN, excluding border points from the clusters and thus
following more strictly the basic definition of density levels by Hartigan.
