
Data Preprocessing:
Data Integration, Data Reduction, Clustering, Data Discretization

Dr. Atul Garg


Data Preprocessing: Data Integration

Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Can reduce and avoid redundancies and inconsistencies in the resulting data set
• Can improve accuracy and speed

• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Schema integration: e.g., A.cust-id ≡ B.cust-# (or roll-no ≡ student-id)
• Integrate metadata from different sources
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources may differ
• Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple databases


• Object identification: The same attribute or object may have different names in
different databases
• Derivable data: One attribute may be a “derived” attribute in another table, e.g.,
annual revenue
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
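One common way to detect such redundancy between numeric attributes is correlation analysis: a coefficient near ±1 suggests one attribute is derivable from the other. A minimal sketch, with hypothetical revenue columns:

```python
# Sketch: detecting a derivable (redundant) attribute via correlation
# analysis. The column names and figures are made up for illustration.

def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

monthly_revenue = [10, 20, 30, 40]
annual_revenue = [120, 240, 360, 480]  # exactly 12x monthly, i.e., derivable

r = pearson_corr(monthly_revenue, annual_revenue)
print(round(r, 2))  # 1.0 -> annual_revenue carries no extra information
```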

Data Preprocessing:
Data Reduction

Data Reduction Strategies

• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the
complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Numerosity reduction (some simply call it: Data Reduction)
• Data compression

Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms: transform a data vector into a numerically different vector of wavelet coefficients
• Principal Component Analysis (PCA): often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still captures most of the variation
• Supervised and nonlinear techniques (e.g., feature selection)
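As a rough illustration of how PCA reduces dimensionality, the sketch below projects hypothetical 2-D points onto their single principal component; for a 2x2 covariance matrix the principal axis has a closed-form angle, so no linear-algebra library is needed:

```python
import math

# Minimal PCA sketch (pure Python, 2-D -> 1-D) on hypothetical points
# lying roughly along the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0), (5.0, 9.9)]

n = len(points)
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n
sxx = sum((p[0] - mx) ** 2 for p in points) / n          # variance of x
syy = sum((p[1] - my) ** 2 for p in points) / n          # variance of y
sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n # covariance

# For a 2x2 covariance matrix, the first principal axis has this angle:
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
ux, uy = math.cos(theta), math.sin(theta)

# Project each centered point onto the axis: 2 attributes become 1.
reduced = [round((p[0] - mx) * ux + (p[1] - my) * uy, 2) for p in points]
print(reduced)
```

The projections preserve the ordering of the points along the dominant direction while halving the number of stored attributes.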

Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• Parametric methods
• Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
• Ex.: Log-linear models
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and Log-Linear Models

• Linear regression
• Data modeled to fit a straight line
• Often uses the least-square method to fit the line
• Multiple regression
• Allows a response variable Y to be modeled as a linear function of multidimensional
feature vector
• Log-linear model
• Approximates discrete multidimensional probability distributions
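A minimal sketch of parametric reduction with linear regression: the toy (x, y) pairs below are replaced by just the two least-squares parameters a and b of the line y = a·x + b, after which the raw data can be discarded.

```python
# Parametric numerosity reduction sketch (toy data assumed):
# store only the two parameters of a least-squares fit.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]   # roughly y = 2x

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
# Least-squares slope and intercept.
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Only (a, b) need be stored; the original pairs can be dropped.
print(round(a, 2), round(b, 2))  # 1.97 0.11
```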

Histogram Analysis

• Divide data into buckets and store average (sum) for each bucket
• Partitioning rules:
• Equal-width: equal bucket range
• Equal-frequency (or equal-depth)
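The two partitioning rules can be sketched as follows (the price list is hypothetical):

```python
# Sketch of the two histogram partitioning rules on a toy price list.
data = sorted([5, 7, 8, 12, 15, 18, 22, 25, 28, 30, 31, 34])
k = 3  # number of buckets

# Equal-width: each bucket spans the same value range.
lo, hi = data[0], data[-1]
width = (hi - lo) / k
equal_width = [[] for _ in range(k)]
for v in data:
    idx = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bucket
    equal_width[idx].append(v)

# Equal-frequency (equal-depth): each bucket holds the same number of values.
depth = len(data) // k
equal_frequency = [data[i * depth:(i + 1) * depth] for i in range(k)]

print(equal_width)      # [[5, 7, 8, 12], [15, 18, 22], [25, 28, 30, 31, 34]]
print(equal_frequency)  # [[5, 7, 8, 12], [15, 18, 22, 25], [28, 30, 31, 34]]
```

In practice only a summary per bucket (e.g., its average or sum) would be stored, not the members themselves.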

Clustering
• Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is “smeared”
• Can have hierarchical clustering and be stored in multi-dimensional index tree
structures
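A minimal sketch of this idea, with hand-assigned toy clusters rather than ones produced by a clustering algorithm: each group is reduced to its centroid and diameter.

```python
# Sketch: represent each cluster only by (centroid, diameter).
# The cluster memberships here are assumed, not computed.
clusters = [[1.0, 1.2, 0.9], [10.1, 9.8, 10.4], [25.0, 24.6]]

summary = []
for c in clusters:
    centroid = sum(c) / len(c)      # cluster representative
    diameter = max(c) - min(c)      # spread of the cluster
    summary.append((round(centroid, 2), round(diameter, 2)))

print(summary)  # three (centroid, diameter) pairs replace eight values
```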

Sampling

• Sampling: obtaining a small sample s to represent the whole data set N


• Allow a mining algorithm to run in complexity that is potentially sub-linear to
the size of the data
• Key principle: Choose a representative subset of the data
• Simple random sampling may have very poor performance in the presence
of skew
• Develop adaptive sampling methods, e.g., stratified sampling
• Note: Sampling may not reduce database I/Os (page at a time)
Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
• Used in conjunction with skewed data
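The sampling types above can be sketched as follows (the data set, the strata, and the 80/20 split are all hypothetical):

```python
import random

# Sketch of the sampling schemes on a toy data set of 100 items.
random.seed(42)
N = list(range(100))                      # the "whole data set"

srs_wor = random.sample(N, 10)            # simple random, WITHOUT replacement
srs_wr = [random.choice(N) for _ in range(10)]  # WITH replacement (duplicates possible)

# Stratified: partition the data, then sample each stratum proportionally.
strata = {"low": [x for x in N if x < 80],    # 80% of the data
          "high": [x for x in N if x >= 80]}  # 20% of the data
stratified = []
for name, stratum in strata.items():
    k = round(10 * len(stratum) / len(N))     # proportional share of 10 samples
    stratified += random.sample(stratum, k)

print(len(srs_wor), len(srs_wr), len(stratified))  # 10 10 10
```

With skewed strata like these, plain random sampling could easily miss the small "high" group entirely, while the stratified sample always keeps its 20% share.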

Sampling: With or without Replacement

[Figure: raw data sampled with replacement (SRSWR) and without replacement (SRSWOR)]
Sampling: Cluster or Stratified Sampling
Stratified random sampling is a sampling method that involves taking samples of a population
subdivided into smaller groups called strata.

[Figure: raw data vs. a cluster/stratified sample]
Data Cube Aggregation

A data cube enables data to be modeled and viewed in multiple dimensions.


• The lowest level of a data cube (base cuboid)
• The aggregated data for an individual entity of interest
• E.g., a customer in a phone calling data warehouse
• Multiple levels of aggregation in data cubes
• Further reduce the size of data to deal with
• Reference appropriate levels
• Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using data cube, when
possible
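A minimal sketch of climbing one aggregation level, with hypothetical sales records rolled up from per-quarter detail to per-year totals:

```python
# Sketch: aggregating a toy fact table up one data-cube level
# (quarter -> year). The records are made up for illustration.
records = [  # (year, quarter, sales)
    (2023, "Q1", 200), (2023, "Q2", 300), (2023, "Q3", 250), (2023, "Q4", 350),
    (2024, "Q1", 220), (2024, "Q2", 310),
]

yearly = {}
for year, quarter, sales in records:
    yearly[year] = yearly.get(year, 0) + sales  # roll up over quarters

print(yearly)  # {2023: 1100, 2024: 530}
```

A query about yearly sales can now be answered from the smaller aggregate instead of scanning the detailed records.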
Data Reduction 3: Data Compression
• String compression
• There are extensive theories and well-tuned algorithms
• Typically lossless, but only limited manipulation is possible without expansion
• Audio/video compression
• Typically lossy compression, with progressive refinement
• Sometimes small fragments of signal can be reconstructed without reconstructing the
whole
• Time sequences are not audio: they are typically short and vary slowly with time
• Dimensionality and numerosity reduction may also be considered as forms of data
compression

Data Compression

[Figure: lossless compression maps the original data to a compressed form and back exactly; lossy compression recovers only an approximation of the original data]
Data Preprocessing:
Data Transformation and Data Discretization

Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement values, i.e., each old value can be identified with one of the new values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing
Normalization
Normalization is used to scale the values of an attribute so that they fall within a smaller range. It is generally required when attributes are measured on different scales; otherwise an equally important attribute measured on a smaller scale can be diluted by attributes whose values lie on a larger scale. In simple words, when multiple attributes have values on different scales, mining on the raw values may lead to poor data models.
• Min-Max Normalization – In this technique of data normalization, linear transformation
is performed on the original data.
• Z-score normalization – In this technique, values are normalized based on mean and
standard deviation of the data A.
• Decimal Scaling Method For Normalization – It normalizes by moving the decimal
point of values of the data. To normalize the data by this technique, we divide each value
of the data by the maximum absolute value of data.
Min-Max Normalization

v' = ((v − min(A)) / (max(A) − min(A))) × (new_max(A) − new_min(A)) + new_min(A)

Where A is the attribute,
min(A), max(A) are the minimum and maximum values of A respectively,
v is the old value and v' is the new value of each entry in the data,
new_min(A), new_max(A) are the minimum and maximum of the required range (i.e., its boundary values).

Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 = 0.716.
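The same computation as a small sketch:

```python
# Min-max normalization of the income example from the text.
def min_max(v, min_a, max_a, new_min, new_max):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

v_new = min_max(73600, 12000, 98000, 0.0, 1.0)
print(round(v_new, 3))  # 0.716
```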
Z-score Normalization

v' = (v − Ā) / σ_A

where v and v' are the old and new values of each entry in the data respectively,
and Ā and σ_A are the mean and standard deviation of attribute A.

Suppose the mean of the dataset is 21.2 and the standard deviation is 29.8.
To perform a z-score normalization on the first value in the dataset, we can use the
following formula:
•New value = (x – μ) / σ
•New value = (3 – 21.2) / 29.8
•New value = -0.61
Z-score Normalization

The mean of the normalized values is 0 and the standard deviation of the normalized
values is 1.
The normalized values represent the number of standard deviations that the original
value is from the mean.
For example:
•The first value in the dataset is 0.61 standard deviations below the mean.
•The second value in the dataset is 0.54 standard deviations below the mean.
•…
•The last value in the dataset is 3.79 standard deviations above the mean.
The benefit of performing this type of normalization is that the clear outlier in
the dataset (134) has been transformed in such a way that it’s no longer a
massive outlier.
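The worked example above as a small sketch:

```python
# Z-score normalization of the worked example from the text
# (mean 21.2, standard deviation 29.8).
def z_score(v, mean_a, std_a):
    """Number of standard deviations v lies from the mean."""
    return (v - mean_a) / std_a

print(round(z_score(3, 21.2, 29.8), 2))  # -0.61
```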
Decimal Scaling Method for Normalization
It normalizes by moving the decimal point of the values of the data. Each data value vi is normalized to vi' using the formula:

vi' = vi / 10^j

where j is the smallest integer such that max(|vi'|) < 1.

Example. Let the input data be: -10, 201, 301, -401, 501, 601, 701
Step 1: The maximum absolute value in the data is 701.
Step 2: Divide each value by 1000 (i.e., j = 3).
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
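The same example as a small sketch; since the values here are integers, j can be taken as the digit count of the maximum absolute value:

```python
# Decimal-scaling normalization of the example data from the text.
data = [-10, 201, 301, -401, 501, 601, 701]

m = max(abs(v) for v in data)   # maximum absolute value: 701
j = len(str(m))                 # digit count -> smallest j with 701 / 10**j < 1
normalized = [v / 10 ** j for v in data]

print(j)           # 3
print(normalized)  # [-0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701]
```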


Data Discretization
Data discretization techniques can be used to divide the range of a continuous attribute into intervals. Numerous continuous attribute values are replaced by a small number of interval labels. This leads to a concise, easy-to-use, knowledge-level representation of mining results.
• Top-down discretization: If the process starts by first finding one or a few points
(called split points or cut points) to split the entire attribute range, and then repeats this
recursively on the resulting intervals, then it is called top-down discretization or
splitting.
• Bottom-up discretization: If the process starts by considering all of the continuous
values as potential split-points, removes some by merging neighbourhood values to
form intervals, then it is called bottom-up discretization or merging.

Data Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification

Data Discretization Methods
Typical methods: All the methods can be applied recursively
• 1 Binning: Binning is a top-down splitting technique based on a specified number
of bins. Binning is an unsupervised discretization technique.
• 2 Histogram Analysis: Because histogram analysis does not use class information, it is an unsupervised discretization technique. Histograms partition the values of an attribute into disjoint ranges called buckets.
• 3 Cluster Analysis: Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numeric attribute A by partitioning the values of A into clusters or groups.
• Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy
• Decision-tree analysis (supervised, top-down split)
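A minimal binning sketch using equal-depth bins smoothed by bin means (the toy price data is illustrative):

```python
# Unsupervised equal-depth binning sketch: each value is replaced
# by the mean of its bin. The prices are toy data.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])

bins = [data[i:i + 3] for i in range(0, len(data), 3)]    # 3 bins, depth 3
smoothed = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```

After smoothing, the nine distinct values collapse to just three bin labels, which is exactly the kind of reduction discretization aims for.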
Concept Hierarchy Generation
• Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.
• Concept hierarchies can be used to reduce the data by collecting and
replacing low-level concepts with higher-level concepts.
• In the multidimensional model, data are organized into multiple
dimensions, and each dimension contains multiple levels of abstraction
defined by concept hierarchies.
• This organization provides users with the flexibility to view data from
different perspectives.
• Examples include geographic location, job category, item type, etc.
Data Preprocessing : Summary
• Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
