
Data Preprocessing

Data Preprocessing

• Why preprocess the data?

• Data cleaning

• Data integration and transformation

• Data reduction

• Discretization and concept hierarchy generation


Why preprocessing?

In the real world, data are often dirty:


• Incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!


• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
 Several data preprocessing techniques

• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

• Data integration: using multiple databases, data cubes, or files.

• Data transformation: normalization and aggregation.

• Data reduction: reducing the volume but producing the same or similar analytical results.

• Data discretization: part of data reduction, replacing numerical attributes with nominal ones.
An Overview
Data Quality:
• There are many factors comprising data quality
‒ Accuracy
‒ Completeness
‒ Consistency
‒ Timeliness
‒ Believability
‒ Interpretability
• Example: a manager at AllElectronics
• Task: analyze the company's data with respect to your branch's sales
• Some attribute values may not be recorded, or may be reported in error
• The data to be analyzed by data mining techniques are therefore incomplete, inaccurate (noisy), and inconsistent
Continue….
• This scenario illustrates 3 of the elements defining data
quality: Accuracy, Completeness and Consistency.

• There are many possible reasons for inaccurate data.


• Reasons: data collection instruments may be faulty, human or computer errors may occur, users may purposely submit incorrect data values in mandatory fields, etc.

• There are many reasons for incomplete data.


• Reasons: Not consider important at the time of entry, relevant
data may not be recorded, equipment malfunctions, other
recorded data may have been deleted, data history or
modifications may have been overlooked.
Continue….

• Data quality depends on the intended use of the data.

• Timeliness also affects data quality.
• Example: sales representatives fail to submit their sales records on time at the end of the month.
• Two other factors are believability and interpretability.
• Believability reflects how much the data are trusted by users.
• Interpretability reflects how easily the data are understood.
• Even if the database is accurate, complete, consistent, and timely, the sales department may still consider it to be of low quality due to poor believability and interpretability.
Major Tasks in Data Preprocessing
• Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
 Integration of multiple databases, data cubes, or files
• Data transformation
 Normalization and aggregation
• Data reduction
 Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
 Part of data reduction but with particular importance, especially for
numerical data
Forms of data preprocessing
Data Preprocessing
• Why preprocess the data?

• Data cleaning

• Data integration and transformation

• Data reduction

• Discretization and concept hierarchy generation


Data Cleaning (Data Cleansing)

Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data


Missing Values
Data is not always available
 E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data

Missing data may be due to


 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of entry
 not register history or changes of the data

Missing data may need to be inferred.


Continue…

How to Handle Missing Data?


• Ignore the tuple: usually done when class label is missing (assuming
the tasks in classification—not effective when the percentage of missing
values per attribute varies considerably)
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., "unknown", a
new class?!
• Use the attribute mean or median for all samples belonging to the
same class as the given tuple
• Use the most probable value to fill in the missing value: inference-
based such as Bayesian formula or decision tree
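As a rough illustration of some of these options, here is a minimal Python/pandas sketch; the data frame, attribute names, and sentinel value are made up for the example:

import pandas as pd

# Toy data with a missing income value (hypothetical attributes).
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "class":    ["gold", "gold", "silver", "silver"],
    "income":   [52000, None, 31000, 29000],
})

# Option 1: ignore (drop) tuples whose income is missing.
dropped = df.dropna(subset=["income"])

# Option 2: fill with a global constant (a sentinel playing the role of "unknown").
constant_filled = df.fillna({"income": -1})

# Option 3: fill with the attribute mean of samples in the same class as the tuple.
class_filled = df.copy()
class_filled["income"] = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(class_filled)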
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
• Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
Continue…
How to Handle Noisy Data?
• Binning
 first sort data and partition into (equi-depth) bins
 then smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
• Clustering
 detect and remove outliers
• Regression
 smooth by fitting the data into regression functions
• Combined computer and human inspection
 detect suspicious values and check by human
Continue…
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B-A)/N.
 The most straightforward
 But outliers may dominate presentation
 Skewed data is not handled well.

• Equal-depth (frequency) partitioning:


 It divides the range into N intervals, each containing approximately
same number of samples
 Good data scaling
 Managing categorical attributes can be tricky.
Continue…
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
* Partition into (equal-frequency) bins:
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29
* Smoothing by bin medians : replace each bin value by bin median
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 25, 34
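A minimal Python sketch of the same smoothing operations on the price data above (equal-frequency partitioning, then smoothing by bin means and by bin boundaries); rounding the bin means is an assumption made to reproduce the integer values on the slide:

import numpy as np

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]           # already sorted
bins = np.array_split(prices, 3)                      # equal-frequency partition into 3 bins

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[int(round(np.mean(b)))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer of min/max of its bin.
by_bounds = [[int(min(b)) if v - min(b) <= max(b) - v else int(max(b)) for v in b] for b in bins]

print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]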
Regression
• Linear regression involves finding the "best" line to fit two
attributes, so one attribute can be used to predict the
other.
• Multiple linear regression is an extension of linear
regression, where more than 2 attributes are involved and
data are fit to a multidimensional surface
Data Cleaning as a Process
• The first step is data discrepancy detection.
• Field overloading: a source of errors that typically occurs when developers compress new attribute definitions into unused portions of already defined attributes.
• Unique rule: a rule stating that each value of the given attribute must be different from all other values for that attribute.
• Consecutive rule: a rule stating that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique.
• Null rule: specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition, and how such values should be handled.
Data Preprocessing
• Why preprocess the data?

• Data cleaning

• Data integration and transformation

• Data reduction

• Discretization and concept hierarchy generation


Data Integration
Data integration:
 Combines data from multiple sources into a coherent store
Issues:
Schema integration
 Integrate metadata from different sources
Entity identification problem
 Identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different sources
are different
 Possible reasons: different representations, different scales, e.g., metric
vs. British units
Redundancy
 Object identification: The same attribute may have different names in
different databases.
 Derived data: One attribute may be derived from another attribute.
Continue…
Handling redundant data in data integration
• Redundant data often occur when integrating multiple
databases
The same attribute may have different names in different
databases
One attribute may be a "derived" attribute in another
table, e.g., annual revenue
• Redundant data may be able to be detected by correlational
analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Continue…
1.Correlation Analysis
• For numeric data,
 Use correlation coefficient and covariance
Covariance of numeric data
 Consider two numeric attributes A and B, and a set of n
observations {(a1, b1), ..., (an, bn)}. The mean values of A
and B, respectively, are also known as the expected
values of A and B, that is,
E(A) = Ā = (a1 + a2 + ... + an) / n   and   E(B) = B̄ = (b1 + b2 + ... + bn) / n
Continue…
• The covariance between A and B is defined as

Cov(A, B) = E((A - Ā)(B - B̄)) = Σi (ai - Ā)(bi - B̄) / n      --- eqn (A)
Correlation Coefficient for numeric data
• Evaluate the correlation between two attributes, A
and B , by computing the correlation coefficient
(also known as Pearson’s product moment
coefficient, named after its inventor, Karl Pearson).
This is

rA,B = Σi (ai - Ā)(bi - B̄) / (n σA σB) = Cov(A, B) / (σA σB)      --- eqn (B)
Continue…
• Compare eqn (B) for rA,B (the correlation coefficient) with eqn (A) for the
covariance: rA,B = Cov(A, B) / (σA σB), where σA and σB are the standard
deviations of A and B, respectively.

• It can also be shown that Cov(A, B) = E(A·B) - Ā·B̄.

• If A and B are independent, then E(A·B) = E(A)·E(B),
so the covariance is Cov(A, B) = E(A)·E(B) - Ā·B̄ = 0.
Continue…
Example:

Table 1: Stock prices (in dollars) observed at five time points

Time point    AllElectronics    HighTech
t1            6                 20
t2            5                 10
t3            4                 14
t4            3                 5
t5            2                 5

Consider Table 1, which presents a simplified example of stock prices
observed at 5 time points for AllElectronics and HighTech, a high-tech
company. If the stocks are affected by the same industry trends, will their
prices rise or fall together?
E(AllElectronics) = (6 + 5 + 4 + 3 + 2)/5 = 20/5 = $4 and
E(HighTech) = (20 + 10 + 14 + 5 + 5)/5 = 54/5 = $10.80
Using eqn (A), we compute
Cov(AllElectronics, HighTech) = (6×20 + 5×10 + 4×14 + 3×5 + 2×5)/5 - 4×10.80
= 50.2 - 43.2 = 7
Therefore, given the positive covariance, we can say that the stock prices for
both companies rise together.
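A quick check of this computation in Python/NumPy (a sketch; note that np.cov defaults to the unbiased 1/(n-1) estimator, so bias=True is passed to match the 1/n definition in eqn (A)):

import numpy as np

all_electronics = np.array([6, 5, 4, 3, 2], dtype=float)
high_tech       = np.array([20, 10, 14, 5, 5], dtype=float)

# Cov(A, B) = E(A*B) - mean(A)*mean(B), using the 1/n convention of eqn (A).
cov = np.mean(all_electronics * high_tech) - all_electronics.mean() * high_tech.mean()
print(cov)                                                   # 7.0 -> prices tend to rise together

print(np.cov(all_electronics, high_tech, bias=True)[0, 1])   # same value via np.cov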
Continue…
• For nominal data

 Use χ2 (chi-square) test.


• Suppose A has c distinct values, namely a1,a2,...ac and B has r distinct
values, namely b1,b2,...br. The data tuples described by A and B can be
shown as a contingency table, with the c values of A making up the
columns and the r values of B making up the rows.
• Let (Ai,Bj) denote the joint event that attribute A takes on value ai and
attribute B takes on value bj, that is, where (A = ai,B = bj). Each and
every possible (Ai,Bj) joint event has its own cell (or slot) in the table.
• The χ2 value (also known as the Pearson χ2 statistic) is computed as

χ2 = Σi Σj (oij - eij)2 / eij,   where eij = (count(A = ai) × count(B = bj)) / n

oij – observed frequency (actual count) of the joint event (Ai, Bj)

eij – expected frequency of (Ai, Bj)

n – number of data tuples (Example: a 2×2 contingency table)
Data Transformation
• Data are transformed or consolidated into forms
appropriate for mining.
• It involves:
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Attribute/feature construction
• New attributes constructed from the given ones
• Discretization
• raw values of a numeric attribute are replaced by interval
labels or conceptual labels.
Continue…
Data Transformation
Normalization
• Min-max normalization:
  v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA
• Z-score normalization:
  v' = (v - meanA) / stand_devA
• Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
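A small NumPy sketch of the three normalization formulas above; the attribute values and the target range [0, 1] are made up, and the decimal-scaling exponent is computed under the assumption that the values are at least 1 in magnitude:

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])          # made-up attribute values

# Min-max normalization into the new range [0, 1].
new_min, new_max = 0.0, 1.0
min_max = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization.
z_score = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, j being the smallest integer with max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal_scaled = v / 10 ** j

print(min_max)
print(z_score)
print(decimal_scaled)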
Data Preprocessing
• Why preprocess the data?

• Data cleaning

• Data integration and transformation

• Data reduction

• Discretization and concept hierarchy generation


Data Reduction
• Warehouse may store terabytes of data: Complex
data analysis/mining may take a very long time to
run on the complete data set.
• Data reduction analyzes a reduced representation of the
data set without compromising the integrity of the original
data, while still producing quality knowledge.
• Data reduction strategies
Data cube aggregation
Attribute subset selection
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
Continue…
Data Cube Aggregation
• The lowest level of a data cube
• the aggregated data for an individual entity of interest
• Multiple levels of aggregation in data cubes
• Further reduce the size of data to deal with
• Reference appropriate levels
• Use the smallest representation which is enough to solve
the task
Continue…

Data cube for sales at AllElectronics

• Lowest level of abstraction – base cuboid


• Highest level of abstraction – apex cuboid
• Varying level of abstraction- cuboids
Attribute subset selection / Feature
selection
• Reduces the data set size by removing irrelevant or
redundant attributes (or dimensions).
• Select a minimum set of features such that the
probability distribution of different classes given the
values for those features is as close as possible to
the original distribution given the values of all
features
• Basic Heuristic methods of feature selection:
• Step-wise forward selection
• Step-wise backward elimination
• Combining forward selection and backward elimination
• Decision-tree induction
Continue…

Greedy (heuristic) methods for attribute selection


Continue…
• Step-wise forward selection:
• The best single-feature is picked first
• Then the next best feature is added, conditioned on the first, ...
• Step-wise backward elimination:
• Repeatedly eliminate the worst feature
• Combination of forward selection &backward elimination :
• Select the best attribute and removes the worst
• Decision tree induction:
• Flow chart like structure
• Internal node – test on an attribute
• External node- class prediction
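As a sketch of how step-wise forward selection (or backward elimination, via direction="backward") might look in practice, here is a scikit-learn example on synthetic data; the wrapper classifier and all parameters are illustrative choices, not part of the slide:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 10 attributes, only a few of which are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

# Greedily add the single attribute that most improves cross-validated accuracy,
# until 3 attributes have been selected (step-wise forward selection).
selector = SequentialFeatureSelector(DecisionTreeClassifier(random_state=0),
                                     n_features_to_select=3, direction="forward", cv=5)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected attributes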
Dimensionality Reduction
• Data encoding or transformations are applied so as to
obtain a reduced or "compressed" representation of the
original data.
• If the original data can be reconstructed from the
compressed data without any loss of information, the
reduction is called lossless; if only an approximation of the
original data can be reconstructed, it is called lossy.
Continue…
Wavelet Transforms
Discrete wavelet transform (DWT): linear signal
processing
Compressed approximation: store only a small fraction
of the strongest of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
Examples of wavelet families: Haar-2, Daubechies-4
Continue…
General algorithm for DWT is as follows:
1. Length, L, must be an integer power of 2 (padding
with 0s, when necessary)
2. Each transform involves applying two functions.
The first applies some data smoothing, such as a
sum or weighted average. The second performs a
weighted difference, which acts to bring out the
detailed features of the data.
3. The two functions are applied to pairs of the input
data, resulting in two sets of data of length L/2.
4. The two functions are recursively applied to the
sets of data obtained in the previous loop, until the
resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the
above iterations are designated the wavelet
coefficients of the transformed data
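The following is a minimal, hand-rolled sketch of a Haar-style DWT following the smoothing/difference recipe above (it recurses all the way down to a single overall average rather than stopping at length 2); a production system would normally use a wavelet library instead:

import numpy as np

def haar_dwt(data):
    # Pairwise weighted averages (smoothing) and weighted differences (detail).
    x = np.asarray(data, dtype=float)
    assert len(x) & (len(x) - 1) == 0, "length must be a power of 2 (pad with 0s)"
    details = []
    while len(x) > 1:
        avg  = (x[0::2] + x[1::2]) / np.sqrt(2)    # smoothing of each input pair
        diff = (x[0::2] - x[1::2]) / np.sqrt(2)    # detailed features of each pair
        details.append(diff)
        x = avg                                    # recurse on the smoothed half (length L/2)
    details.append(x)                              # final overall average
    return np.concatenate(details[::-1])           # the wavelet coefficients

signal = [2, 2, 0, 2, 3, 5, 4, 4]                  # made-up input of length 8
print(np.round(haar_dwt(signal), 3))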
Continue…
Principal Component Analysis
 Also called the Karhunen-Loeve, or K-L, method.
Procedure:
1. The input data are normalized, so that each attribute
falls within the same range. This step helps ensure that
attributes with large domains will not dominate
attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a
basis for the normalized input data. These are unit
vectors that each point in a direction perpendicular to
the others. These vectors are referred to as the principal
components. The input data are a linear combination of
the principal components.
3. The principal components are sorted in order of
decreasing "significance" or strength. The principal
components essentially serve as a new set of axes for
the data, providing important information about
variance.
Continue…
4. Because the components are sorted according to
decreasing order of "significance," the size of the data
can be reduced by eliminating the weaker components,
that is, those with low variance. Using the strongest
principal components, it should be possible to
reconstruct a good approximation of the original data.
Principal components Y1 and Y2 for the given data, where X1 and X2 are the original axes.

• PCA is computationally inexpensive, can be applied to


ordered and unordered attributes, and can handle sparse
data and skewed data.
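A brief scikit-learn sketch of these PCA steps on synthetic data (the data, the number of components k = 3, and the use of StandardScaler for step 1 are assumptions for illustration):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                         # 200 tuples, 5 attributes
X[:, 3] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)    # one nearly redundant attribute

X_norm = StandardScaler().fit_transform(X)            # step 1: normalize the input data

pca = PCA(n_components=3)                             # steps 2-4: keep the k strongest components
X_reduced = pca.fit_transform(X_norm)

print(pca.explained_variance_ratio_)                  # "significance" (variance) per component
print(X_reduced.shape)                                # (200, 3): the reduced representation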
Numerosity Reduction

Data volume can be reduced by choosing alternative smaller


forms of data.
These techniques may be parametric or nonparametric
Parametric : Assume the data fits some model, then
estimate model parameters, and store only the parameters ,
instead of actual data.
Nonparametric: histograms, clustering, and sampling are
used to store a reduced form of the data.
Continue…
 Numerosity reduction techniques

• Regression and Log-Linear Models


 Can be used to approximate given data
Linear regression: Data are modeled to fit a straight
line
 Often uses the least-square method to fit the line
Multiple regression: allows a response variable Y to
be modeled as a linear function of multidimensional
feature vector
Log-linear model: approximates discrete
multidimensional probability distributions
Continue…
Linear regression: Y = α + βX
• The two parameters, α and β, specify the line and are to be
estimated by using the data at hand.
• They are estimated by applying the least-squares criterion to
the known values of Y1, Y2, ..., and X1, X2, ....
Multiple regression: Y = b0 + b1X1 + b2X2
• Many nonlinear functions can be transformed into the
above.
Log-linear models:
• The multi-way table of joint probabilities is
approximated by a product of lower-order tables.
• Probability: p(a, b, c, d) = αab βac χad δbcd
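A tiny NumPy sketch of parametric numerosity reduction with linear regression: the 50 made-up (x, y) points are replaced by just the two estimated parameters, from which approximate values can be regenerated:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(scale=1.0, size=x.size)   # made-up data near y = 2 + 3x

# Least-squares estimates of the parameters of Y = alpha + beta * X.
beta, alpha = np.polyfit(x, y, deg=1)                # np.polyfit returns [slope, intercept]
print(alpha, beta)

# Store only (alpha, beta); regenerate approximate values on demand.
y_approx = alpha + beta * x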
Continue…
• Histograms
 A popular data reduction technique
Divide data into buckets and store average (sum) for
each bucket
 If each bucket represents a single attribute-value/frequency
pair, the buckets are called singleton buckets
It can be constructed optimally in one dimension using
dynamic programming
It divides up the range of possible values in a data set
into classes or groups. For each group, a rectangle
(bucket) is constructed with a base length equal to the
range of values in that specific group, and an area
proportional to the number of observations falling into
that group.
Continue…
 The buckets are displayed along a horizontal axis, while the
height of a bucket represents the average frequency of the
values.

Histogram of price (x-axis: price values from 10,000 to 90,000; y-axis: frequency, 0-40)
Continue…
 Related to quantization problems.
Buckets can be determined based on the following
partitioning rules:
 Equal- width : Histogram with bars having same width
 Equal-frequency: Histogram with bars having same
height
 V- Optimal: Histogram with least variance
 MaxDiff: Bucket boundaries are placed between the pairs of
adjacent values with the largest differences, for a user-specified
number of buckets
 V-Optimal and MaxDiff histograms tend to be the
most accurate and practical.
Histograms are highly effective at approximating
both sparse and dense data, as well as highly
skewed and uniform data.
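A short NumPy sketch contrasting equal-width and equal-frequency bucket boundaries on some made-up price values:

import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10,
                   12, 14, 14, 15, 15, 15, 18, 18, 20, 20])   # made-up prices

# Equal-width: 4 buckets of identical width spanning the value range.
counts, edges = np.histogram(prices, bins=4)
print(edges)      # equal-width bucket boundaries
print(counts)     # observations per bucket (heights)

# Equal-frequency: boundaries at quantiles, so buckets hold roughly equal counts.
eq_freq_edges = np.quantile(prices, [0.0, 0.25, 0.5, 0.75, 1.0])
print(eq_freq_edges)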
Continue…
• Clustering
 Clustering techniques consider data tuples as objects.
They partition the objects into groups or clusters, so that
objects within a cluster are "similar" to one another and
"dissimilar" to objects in other clusters.
Similarity is commonly defined in terms of how "close"
the objects are in space, based on a distance function.
• The "quality" of a cluster may be represented by its
 Diameter - the maximum distance between any two
objects in the cluster.
 Centroid distance - the average distance of each cluster
object from the cluster centroid (denoting the "average
object," or average point in space for the cluster).
Continue…
• Sampling
 It allows a large data set to be represented by a much
smaller random sample (or subset) of the data.
Suppose that a large data set, D, contains N tuples. Let’s
look at the most common ways that we could sample D
for data reduction:

1. Simple random sample without replacement

(SRSWOR) of size s: This is created by drawing s of the
N tuples from D (s < N), where the probability of
drawing any tuple in D is 1/N, that is, all tuples are
equally likely to be sampled.
Continue…
2. Simple random sample with replacement
(SRSWR) of size s: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is
recorded and then replaced. That is, after a tuple is
drawn, it is placed back in D so that it may be
drawn again.
3. Cluster sample: If the tuples in D are grouped into
M mutually disjoint "clusters," then an SRS of s
clusters can be obtained, where s < M.
4. Stratified sample: If D is divided into mutually
disjoint parts called strata, a stratified sample of D is
generated by obtaining an SRS at each stratum. This
helps ensure a representative sample, especially
when the data are skewed.
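A pandas sketch of SRSWOR, SRSWR, and a stratified sample on a made-up, skewed data set (the attribute names, sample size, and 20% per-stratum fraction are all illustrative):

import pandas as pd

# Hypothetical data set D of N = 100 tuples with a skewed age_group attribute.
D = pd.DataFrame({"age_group": ["young"] * 70 + ["middle"] * 25 + ["senior"] * 5,
                  "amount": range(100)})
s = 10

srswor = D.sample(n=s, replace=False, random_state=0)   # simple random sample without replacement
srswr  = D.sample(n=s, replace=True,  random_state=0)   # simple random sample with replacement

# Stratified sample: an SRS (20%) drawn within each stratum, so the small
# "senior" group is still represented despite the skew.
stratified = pd.concat([g.sample(frac=0.2, random_state=0) for _, g in D.groupby("age_group")])
print(stratified["age_group"].value_counts())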
Continue…
• Advantages of sampling
1. The cost of obtaining a sample is proportional to
the size of the sample, s, as opposed to N, the data
set size. Hence, sampling complexity is
potentially sublinear in the size of the data.
2. When applied to data reduction, sampling is most
commonly used to estimate the answer to an
aggregate query.

Data Preprocessing
• Why preprocess the data?

• Data cleaning

• Data integration and transformation

• Data reduction

• Discretization and concept hierarchy generation


Discretization and concept hierarchy
generation
• Discretization:
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
• Concept hierarchies
reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior).
Continue…
Discretization for numeric data
• There are 5 methods for numeric concept hierarchy
generation. These include
1. Binning

2. Histogram analysis

3. Clustering analysis

4. Entropy-based discretization

5. Segmentation by natural partitioning


Continue…
• Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two
intervals S1 and S2 using boundary T, the entropy after
partitioning is
E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

The boundary that minimizes the entropy function over


all possible boundaries is selected as a binary
discretization.
The process is recursively applied to partitions obtained
until some stopping criterion is met, e.g., the information
gain Ent(S) - E(T, S) falls below a threshold δ
Experiments show that it may reduce data size and
improve classification accuracy
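A compact Python sketch of choosing the entropy-minimizing boundary T for one attribute (the toy values, class labels, and midpoint candidate boundaries are assumptions for illustration):

import numpy as np

def entropy(labels):
    # Ent(S): Shannon entropy of the class distribution in S.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    # Pick T minimizing E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2).
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    n = len(values)
    best_t, best_e = None, np.inf
    for i in range(1, n):                                  # candidate boundaries between points
        t = (values[i - 1] + values[i]) / 2
        e = (i / n) * entropy(labels[:i]) + ((n - i) / n) * entropy(labels[i:])
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

values = [21, 23, 25, 27, 35, 38, 41, 44]                  # made-up attribute values
labels = ["low", "low", "low", "low", "high", "high", "high", "high"]
t, e = best_split(values, labels)
print(t, e, entropy(labels) - e)                           # boundary, E(S,T), information gain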
Continue…
• Segmentation by natural partitioning
 3-4-5 rule can be used to segment numeric data into
relatively uniform, "natural" intervals.
 If an interval covers 3, 6, 7 or 9 distinct values at the
most significant digit, partition the range into 3 equi-
width intervals
 If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 intervals
 If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals

Eg: (profit data)

Step 1: Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700.

Step 2: msd = 1,000, so rounding gives Low = -$1,000 and High = $2,000.

Step 3: Since (-$1,000 - $2,000) covers 3 distinct values at the msd, partition it into 3 equi-width intervals:
(-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)

Step 4: Adjust the top level to cover Min and Max, giving (-$400 - $5,000): the first interval becomes (-$400 - 0) because Min = -$351, and a new interval ($2,000 - $5,000) is added because Max = $4,700.

Step 5: Recursively apply the 3-4-5 rule within each top-level interval:
(-$400 - 0) into 4 sub-intervals: (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
(0 - $1,000) into 5 sub-intervals: (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
($1,000 - $2,000) into 5 sub-intervals: ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
($2,000 - $5,000) into 3 sub-intervals: ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
 Concept hierarchy generation for categorical data
 Specification of a partial ordering of attributes explicitly
at the schema level by users or experts
 Specification of a portion of a hierarchy by explicit data
grouping
 Specification of a set of attributes, but not of their partial
ordering
 Specification of only a partial set of attributes
 When a set of attributes is specified without their partial
ordering, a concept hierarchy can be automatically generated
based on the number of distinct values per attribute in the
given attribute set: the attribute with the most distinct values
is placed at the lowest level of the hierarchy.
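A minimal pandas sketch of this automatic ordering by distinct-value counts; the location attributes and their values are made up:

import pandas as pd

# Hypothetical location attributes in a relation.
df = pd.DataFrame({
    "country":  ["Canada", "Canada", "USA", "USA", "USA", "Canada"],
    "province": ["BC", "ON", "NY", "NY", "CA", "BC"],
    "city":     ["Vancouver", "Toronto", "NYC", "NYC", "LA", "Victoria"],
    "street":   ["S1", "S2", "S3", "S4", "S5", "S6"],
})

# Order attributes by number of distinct values: most distinct -> lowest hierarchy level.
order = df.nunique().sort_values(ascending=False)
print(order)
print(" < ".join(order.index))   # e.g. street < city < province < country (lowest to highest level)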
