
336331 Data Mining

Chapter 3: Data Preprocessing

Data in the real world are often dirty:
- Incomplete data: missing attribute values, e.g., recorded as N/A
- Noisy data: errors and outliers
- Inconsistent data: conflicting values or codes across records and sources

Example of dirty data (blank cells are missing values):

Cust_ID | Name | Income | Age | Birthday
001     | n/a  | 200    |     | 12/10/79
002     |      | $2000  | 25  | 27 Dec 81
003     |      | -10000 | 27  | 18 Feb 20

The table shows missing names, a negative income (-10000), inconsistent income and date formats, and an age (27) that contradicts the birthday (18 Feb 20).


Major tasks in data preprocessing:
1) Data Cleaning
2) Data Integration
3) Data Transformation
4) Data Reduction


[Figure: forms of data preprocessing - data cleaning; data integration (merging multiple sources); data transformation (e.g., -2, 32, 100, 59, 48 normalized to -0.02, 0.32, 1.00, 0.59, 0.48); data reduction (e.g., attributes A1, A2, ..., A126 reduced to A1, A3, ..., A115, and transactions T1, T2, ..., T2000 reduced to a smaller set).]

1) Data Cleaning
Data cleaning fills in missing values, smooths noisy data, and corrects inconsistencies in the data.

Missing values: many tuples have no recorded value for one or more attributes (often shown as ??? or n/a).

How to handle missing values:
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value

Ignore the tuple: usually done when the class label is missing (for classification tasks); not effective when many attribute values are missing.

Fill in the missing value manually: tedious and often infeasible for large data sets.

Use a global constant to fill in the missing value: e.g., replace every missing value with a label such as "unknown".

Use the attribute mean to fill in the missing value: e.g., replace a missing income with the average income, such as 12,000.

Use the attribute mean for all samples belonging to the same class as the given tuple: the average is computed only over tuples of the same class.

Use the most probable value to fill in the missing value: the value is predicted with regression, a Bayesian formula, or a decision tree.
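A minimal sketch of several of these options using pandas; the column names, class labels, and values below are made up for illustration and are not part of the lecture:

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with missing income values
df = pd.DataFrame({
    "cust_id": [1, 2, 3, 4],
    "class":   ["gold", "gold", "silver", "silver"],
    "income":  [2000.0, np.nan, 1500.0, np.nan],
})

dropped = df.dropna(subset=["income"])                    # 1) ignore the tuple
constant = df["income"].fillna(-1)                        # 3) global constant (sentinel)
overall_mean = df["income"].fillna(df["income"].mean())   # 4) attribute mean
class_mean = df["income"].fillna(                         # 5) mean within the same class
    df.groupby("class")["income"].transform("mean"))

print(overall_mean.tolist())   # missing values replaced by 1750.0
print(class_mean.tolist())     # gold -> 2000.0, silver -> 1500.0
```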

Noisy data: random error or variance in a measured variable. Possible causes include:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations

Techniques for smoothing noisy data:
- Binning methods
- Regression
- Clustering

Binning methods
Binning smooths a sorted data value by consulting its neighborhood, i.e., the values around it (local smoothing). The sorted values are partitioned into a number of bins (buckets), and each bin is then smoothed by the bin means, bin medians, or bin boundaries.

Binning methods - example:

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into N = 3 (equi-depth) bins:
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34
Smoothing by bin means:
  Bin 1: 9, 9, 9
  Bin 2: 22, 22, 22
  Bin 3: 29, 29, 29
Smoothing by bin boundaries:
  Bin 1: 4, 4, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 25, 34
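A small sketch that reproduces this example in plain Python (equi-depth partitioning, then smoothing by bin means and by bin boundaries):

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted

def equi_depth_bins(values, n_bins):
    # split the sorted values into bins holding (roughly) the same number of items
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # replace each value by the nearer of the bin's two boundary values
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

bins = equi_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```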

Regression
Smooth by fitting the data into regression functions.
Linear regression:
  Y = b0 + b1 X
Multiple linear regression:
  Y = b0 + b1 X1 + b2 X2 + ... + bm Xm

Regression
The regression line (e.g., y = x + 1) is fitted so as to minimize the least-square error, and the observed values are then smoothed onto the line.

[Figure: data points in the (X1, Y1) plane fitted by the line y = x + 1.]
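A minimal least-squares fit with numpy; the data points are invented and only roughly follow the line y = x + 1 from the slide:

```python
import numpy as np

# Hypothetical noisy observations scattered around y = x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Fit Y = b0 + b1*X by minimizing the least-square error
b1, b0 = np.polyfit(x, y, deg=1)     # polyfit returns the slope first
print(f"y = {b1:.2f}x + {b0:.2f}")   # close to y = 1.00x + 1.00

# Smoothing: replace each observed value by the value predicted from the line
print((b0 + b1 * x).round(2))
```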

Clustering
Similar values are organized into groups (clusters). Values that fall outside the clusters are treated as outliers and can be smoothed or removed.

2) Data Transformation

The data are transformed into forms appropriate for mining, for example by normalization, which scales attribute values into a small, specified range.

Normalization methods:
- min-max normalization
- z-score normalization
- normalization by decimal scaling
- sigmoidal normalization

Min-max normalization
Maps a value v of attribute A to v' in the new range [new_minA, new_maxA]:

  v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA

Example: the minimum and maximum of income are 12,000 and 98,000. Map the value 73,600 to the range [0, 1]:

  v' = (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716
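A one-line check of this example (a plain-Python sketch of the min-max formula):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
```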

Z-score normalization
The values of attribute A are normalized based on the mean and standard deviation of A; the result has mean 0 and standard deviation 1:

  v' = (v - meanA) / stand_devA

Example: the mean and standard deviation of income are 54,000 and 16,000. The value 73,600 is transformed to:

  v' = (73,600 - 54,000) / 16,000 = 1.225
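The same kind of check for z-score normalization:

```python
def z_score(v, mean_a, std_a):
    # v' = (v - meanA) / stand_devA
    return (v - mean_a) / std_a

print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
```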

Normalization by decimal scaling
Normalizes by moving the decimal point of the values of attribute A:

  v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.

Example: the values of A range from -986 to 917. The maximum absolute value is 986, so the values are divided by 1,000 (j = 3): -986 becomes -986 / 10^3 = -0.986 and 917 becomes 0.917.
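A sketch of decimal scaling that searches for the smallest j:

```python
def decimal_scaling(values):
    # find the smallest j such that every |v| / 10**j is below 1
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

j, scaled = decimal_scaling([-986, 917])
print(j, scaled)   # 3 [-0.986, 0.917]
```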

Sigmoidal normalization
Normalizes the input into the range [-1, 1]:

  y' = (1 - e^(-x)) / (1 + e^(-x)),   where x = (y - mean) / std
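A sketch of this transform; the income figures are reused from the z-score example above and are not part of this slide:

```python
import math

def sigmoidal(y, mean, std):
    # x = (y - mean) / std, then y' = (1 - e^(-x)) / (1 + e^(-x)), which lies in (-1, 1)
    x = (y - mean) / std
    return (1 - math.exp(-x)) / (1 + math.exp(-x))

print(round(sigmoidal(73_600, 54_000, 16_000), 3))   # about 0.546
```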

3) Data Integration

Data integration combines data from multiple sources into a coherent data store, such as a data warehouse. Careful integration helps to:
1. reduce data redundancies and data inconsistencies
2. improve the speed and quality of the subsequent mining

Schema integration uses metadata to match up equivalent entities from different sources, e.g., to recognize that Cust_ID in database A and CustNumber in database B refer to the same entity.

Data Integration
Data integration: combines data from multiple sources into a coherent store.
Schema integration: integrate metadata from different sources.
Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-#.
Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons: different representations or different scales, e.g., metric vs. British units.

Data Integration (Cont.)
Redundant data occur often when multiple databases are integrated:
- The same attribute may have different names in different databases.
- One attribute may be a derived attribute in another table, e.g., annual revenue.
Redundant data may be detected by correlation analysis.
Careful integration of data from multiple sources helps reduce/avoid redundancies and inconsistencies and improves mining speed and quality.

Data Integration: Correlation analysis

The correlation between attributes A and B can be measured by

  r_A,B = sum((A - mean_A)(B - mean_B)) / ((n - 1) * std_A * std_B)

If r_A,B is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.

The mean of A is  mean_A = sum(A) / n
The standard deviation of A is  std_A = sqrt( sum((A - mean_A)^2) / (n - 1) )
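A sketch of this correlation measure with numpy, compared against numpy's built-in Pearson correlation; the attribute values are made up:

```python
import numpy as np

def correlation(a, b):
    # r_AB = sum((A - mean_A)(B - mean_B)) / ((n - 1) * std_A * std_B)
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    std_a, std_b = a.std(ddof=1), b.std(ddof=1)   # sample std, divides by n - 1
    return ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * std_a * std_b)

A = [2, 4, 6, 8, 10]          # hypothetical attribute values
B = [1, 3, 5, 7, 11]
print(round(correlation(A, B), 3))         # 0.986: A and B rise together
print(round(np.corrcoef(A, B)[0, 1], 3))   # same value from numpy
```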

4) Data Reduction

Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
4) Data Reduction (cont.)

Data reduction strategies:
- Data cube aggregation
- Dimensionality reduction: remove unimportant attributes
- Data compression
- Numerosity reduction: fit data into models
- Discretization and concept hierarchy generation

Data Reduction: Data cube aggregation

The data can be aggregated so that the resulting data summarize the original data at a coarser level.
Example: the data consist of the AllElectronics sales per quarter for the years 2002 to 2004. For an analysis of annual sales, the data can be aggregated to the total sales per year instead of per quarter, without loss of the information necessary for the analysis task.

Data Reduction: Data cube aggregation

Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple levels of abstraction.

[Figure: a data cube and its lattice of cuboids.]

Data Reduction: Dimensionality reduction

Feature selection (i.e., attribute subset selection):
- Select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features.
- Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand.

Heuristic methods:
- step-wise forward selection
- step-wise backward elimination
- combining forward selection and backward elimination
- decision-tree induction

Data Reduction: Dimensionality reduction

Step-wise forward selection
- Start with an empty set of attributes, called the reduced set.
- The best of the original attributes is determined and added to the reduced set.
- At each subsequent iteration, the best of the remaining original attributes is added to the reduced set.

Example:
Initial attribute set: {A1, A2, A3, A4, A5, A6}
Initial reduced set: {}
  -> {A1}
  -> {A1, A4}
Reduced attribute set: {A1, A4, A6}
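A sketch of the greedy loop; the score() function is a made-up stand-in for whatever criterion ranks attribute subsets (e.g., accuracy of a classifier trained on that subset):

```python
def forward_selection(all_attrs, score, k):
    """Step-wise forward selection of k attributes."""
    reduced = []                      # start with an empty reduced set
    remaining = list(all_attrs)
    while remaining and len(reduced) < k:
        # add the remaining attribute whose addition scores best
        best = max(remaining, key=lambda a: score(reduced + [a]))
        reduced.append(best)
        remaining.remove(best)
    return reduced

# Toy stand-in score: pretend A1, A4 and A6 are the informative attributes
def score(subset):
    return sum(a in {"A1", "A4", "A6"} for a in subset)

print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], score, k=3))
# ['A1', 'A4', 'A6']
```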

Data Reduction: Dimensionality reduction

Step-wise backward elimination
- Start with the full set of attributes.
- At each step, remove the worst attribute remaining in the set.

Example:
Initial attribute set: {A1, A2, A3, A4, A5, A6}
  -> {A1, A3, A4, A5, A6}
  -> {A1, A4, A5, A6}
Reduced attribute set: {A1, A4, A6}

Data Reduction: Dimensionality reduction

Combining forward selection and backward elimination
- At each step, select the best attribute and remove the worst from among the remaining attributes.

Data Reduction: Dimensionality reduction

Decision-tree induction
- A decision tree is built from the data; attributes that do not appear in the tree are considered irrelevant and removed.

Example:
Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree with internal nodes testing A4, A6, and A1 and leaves labelled Class 1 and Class 2.]

Reduced attribute set: {A1, A4, A6}

Data Reduction: Data Compression

String compression
- There are extensive theories and well-tuned algorithms.
- Typically lossless.
- But only limited manipulation is possible without expansion.
Audio/video compression
- Typically lossy compression, with progressive refinement.
- Sometimes small fragments of the signal can be reconstructed without reconstructing the whole.
Time sequence (not audio)
- Typically short and varies slowly with time.

Data Reduction: Numerosity reduction

Reduce data volume by choosing alternative, smaller forms of data representation.
Types of numerosity reduction:
- Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers). Example: regression.
- Non-parametric methods: do not assume models. Major families: histograms, clustering, sampling.

Data Reduction: Numerosity reduction

Histograms
- A popular data reduction technique.
- Divide data into buckets and store the average (or sum) for each bucket.
- Can be constructed optimally in one dimension using dynamic programming.
- Related to quantization problems.

[Figure: equi-width histogram of price values; counts (0-40) on the y-axis, price buckets from 10,000 to 100,000 on the x-axis.]
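A sketch with numpy: build an equi-width histogram over made-up price values and keep only the bucket counts as the reduced representation:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.integers(10_000, 100_000, size=1_000)   # hypothetical price data

# 9 equi-width buckets of width 10,000 between 10,000 and 100,000
counts, edges = np.histogram(prices, bins=np.arange(10_000, 100_001, 10_000))

# Reduced representation: one (bucket range, count) pair per bucket
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:>6}-{hi:<6}: {c}")
```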

Data Reduction: Numerosity reduction

Clustering
- Partition the data set into clusters, and store only a representation of each cluster.
- Can be very effective if the data are clustered, but not if the data are smeared.
- Can use hierarchical clustering and store the clusters in multi-dimensional index tree structures.
- There are many choices of clustering definitions and clustering algorithms.

Data Reduction: Numerosity reduction

Sampling
- Obtain a small sample s to represent the whole data set D of N tuples.
- Simple Random Sample Without Replacement (SRSWOR): the probability of drawing any tuple in D is 1/N.
- Simple Random Sample With Replacement (SRSWR): drawn tuples are placed back and may be drawn again.
- Cluster/stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.
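A sketch of the three sampling schemes with pandas; the table and class labels are made up, with a deliberately skewed class distribution:

```python
import pandas as pd

df = pd.DataFrame({
    "cust_id": range(1000),
    "class":   ["gold"] * 100 + ["silver"] * 900,   # skewed subpopulations
})

srswor = df.sample(n=50, replace=False, random_state=1)   # SRSWOR
srswr  = df.sample(n=50, replace=True,  random_state=1)   # SRSWR

# Stratified sample: keep roughly 5% of each class
stratified = df.groupby("class").sample(frac=0.05, random_state=1)
print(stratified["class"].value_counts())   # ~5 gold, ~45 silver
```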

Data Reduction: Numerosity reduction

[Figure: SRSWOR and SRSWR samples drawn from the raw data.]

Data Reduction: Numerosity reduction

[Figure: raw data vs. a cluster/stratified sample.]

Data Reduction: Numerosity reduction

Hierarchical reduction
- Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior).
- Example: suppose an index tree contains 10,000 tuples with keys starting from 1, divided into 6 buckets. Each bucket then contains roughly 10,000/6 items, and the tree holds pointers to the bucket-boundary keys 986, 3396, 5411, 8392, and 9544, respectively.
- The use of multidimensional index trees as a form of data reduction relies on an ordering of the attribute values in each dimension.

Data Reduction: Discretization

Three types of attributes:
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers

Discretization: divide the range of a continuous attribute into intervals.
- Some classification algorithms only accept categorical attributes.
- Reduces data size.
- Prepares the data for further analysis.

Data Reduction: Discretization

Typical methods (all can be applied recursively):
- Binning
- Histogram analysis
- Clustering analysis
- Entropy-based discretization
- Segmentation by natural partitioning

Data Reduction: Discretization

Entropy-based discretization
The entropy of a variable X with value set A_X is

  H(X) = - sum over x in A_X of P(x) log2 P(x)

Example: coin flip
- A_X = {heads, tails}
- P(heads) = P(tails) = 1/2
- -log2(1/2) = 1, so H(X) = 1/2 * 1 + 1/2 * 1 = 1
- What about a two-headed coin? (Its entropy is 0.)

Conditional entropy:

  H(X | Y) = sum over y in A_Y of P(y) H(X | y)

Data Reduction: Discretization

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

  H(S, T) = (|S1| / |S|) * H(S1) + (|S2| / |S|) * H(S2)

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., until the information gain falls below a small threshold d:

  H(S) - H(T, S) < d

Experiments show that entropy-based discretization may reduce data size and improve classification accuracy.
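A sketch of a single split step in plain Python, assuming each numeric value comes with a class label:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the boundary T minimizing |S1|/|S| * H(S1) + |S2|/|S| * H(S2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_h = None, float("inf")
    for i in range(1, n):
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        h = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if h < best_h:
            best_t, best_h = (pairs[i - 1][0] + pairs[i][0]) / 2, h
    return best_t, best_h

values = [1, 2, 3, 10, 11, 12]
labels = ["low", "low", "low", "high", "high", "high"]
print(best_split(values, labels))   # (6.5, 0.0): a perfect class split
# Recursion would continue on each interval until H(S) - H(T, S) drops below a threshold.
```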

Data Reduction: Discretization

Segmentation by natural partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals, based on the number of distinct values at the most significant digit:

Distinct values at the most significant digit | Natural (equi-width) intervals
3, 6, 9                                       | 3
7                                             | 3 (grouped 2-3-2)
2, 4, 8                                       | 4
1, 5, 10                                      | 5

Data Reduction: Discretization

[Figure: example of segmentation by natural partitioning (3-4-5 rule).]

Data Reduction: Concept Hierarchy

Ways to specify concept hierarchies:
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < state < country.
- Specification of a portion of a hierarchy by explicit data grouping, e.g., {Urbana, Champaign, Chicago} < Illinois.
- Specification of a set of attributes: the system automatically generates the partial ordering by analyzing the number of distinct values, e.g., street < city < state < country.
- Specification of only a partial set of attributes, e.g., only street < city, not others.

Data Reduction: Concept Hierarchy

Automatic concept hierarchy generation
- Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set.
- The attribute with the most distinct values is placed at the lowest level of the hierarchy.
- Note the exceptions: weekday, month, quarter, year.

Data Reduction: Concept Hierarchy

Automatic concept hierarchy generation - example:
  country            (15 distinct values)
  province_or_state  (65 distinct values)
  city               (3,567 distinct values)
  street             (674,339 distinct values)
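A sketch of the distinct-value heuristic with pandas; the location table below is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "country":  ["TH", "TH", "TH", "US", "US"],
    "province": ["Phuket", "Phuket", "Chonburi", "Texas", "Texas"],
    "city":     ["Kathu", "Kathu", "Pattaya", "Austin", "Columbus"],
    "street":   ["A St", "B Rd", "C Ave", "D Blvd", "E Ln"],
})

# Fewer distinct values -> higher level of the hierarchy; most distinct -> lowest level
order = df.nunique().sort_values()
print(" < ".join(reversed(order.index.tolist())))   # street < city < province < country
```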

HW#3
1. What is data preprocessing?
2. Why preprocess the data?
3. What are the major tasks in data preprocessing?
4. What is the data cleaning task?
5. How to handle missing data?
6. What are the normalization methods?

HW#3
7. The minimum and maximum values of the attribute income are $50,000 and $150,000. A value of $100,000 for income is to be mapped to the new range [3, 5]. Calculate the transformed income.
8. The mean and standard deviation of the attribute income are $76,000 and $12,500. A value of $95,000 for income is to be normalized with z-score normalization. Calculate the transformed income.
9. The attribute A ranges from -650 to 999 and is normalized by decimal scaling. Is j = 2 the correct scaling factor, and what is the transformed value of -650?
10. What are the tasks in data integration?
11. What are the data reduction strategies?

LAB 3
bank-data-missvalue.csv
