3: Data Preprocessing
Cust_ID  Name  Income  Age  Birthday
001            n/a     200  12/10/79
002            $2000   25   27 Dec 81
003            -10000  27   18 Feb 20
1) Data Cleaning
2) Data Integration
3) Data Transformation
4) Data Reduction
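[Figure: forms of data preprocessing. Data cleaning; data integration; data transformation; data reduction of a table with attributes A1, A2, A3, ..., A126 and transactions T1, T2, ..., T2000 to fewer attributes (A1, ..., A115) and fewer transactions]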
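1) Data Cleaning
Fill in missing values and smooth noisy data.
Missing values
Ways to handle a missing value:
- Ignore the record, typically when the class label is missing (Classification)
- Fill in a global constant such as "unknown"
- Fill in a value computed from the attribute, e.g. 12000
- Fill in the most probable value, predicted by Regression, a Bayesian formula, or a Decision tree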
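A minimal Python sketch of two of the fill-in strategies for missing values: a global constant such as "unknown", and a value computed from the attribute (here its mean). The income list is hypothetical data, `None` is assumed to mark a missing value, and the helper names are illustrative rather than taken from the slides.

```python
from statistics import mean

def fill_with_constant(values, constant="unknown"):
    """Replace every missing value (None) with a global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace every missing value (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    avg = mean(known)
    return [avg if v is None else v for v in values]

incomes = [10000, None, 14000, 12000, None]  # hypothetical data
print(fill_with_constant(incomes))  # [10000, 'unknown', 14000, 12000, 'unknown']
print(fill_with_mean(incomes))      # [10000, 12000, 14000, 12000, 12000]
```

Filling with a predicted value (regression, Bayesian formula, decision tree) follows the same pattern, with a model's prediction in place of the mean.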
Noisy data
Causes of noise include:
- errors in Data Transmission
Techniques for handling noisy data:
Binning Methods
Regression
Clustering
Binning Methods
Binning sorts the data and partitions it into bins (buckets). Each value is then smoothed locally using its neighborhood, that is, the other values in the same bin: smoothing by bin means, by bin medians, or by bin boundaries.
Example:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
(N = 3)
Partition into (equi-depth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
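The price example can be reproduced with a short Python sketch. The function names are illustrative. Two choices here are assumptions: bin means are rounded to integers to match the slide, and a value equally close to both boundaries goes to the lower one.

```python
def equi_depth_bins(sorted_values, depth):
    """Split already-sorted values into consecutive bins of `depth` items."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean (rounded to an integer)."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's lowest and highest value."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        # Ties go to the lower boundary (an assumption).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```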
Regression
Smooth by fitting the data into regression functions
Linear Regression
$$Y = \alpha + \beta X$$
Multiple Linear Regression
[Figure: Regression. The line y = x + 1 fitted by least-square error; the observed value Y1 at X1 is smoothed onto the line]
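Smoothing by linear regression can be sketched in Python as follows. The readings are hypothetical values scattered around y = x + 1, and `fit_line` and `smooth` are illustrative names. The fit uses the usual closed-form least-squares estimates for a single predictor.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def smooth(xs, alpha, beta):
    """Replace each observed y by the value on the fitted line."""
    return [alpha + beta * x for x in xs]

xs = [1, 2, 3, 4, 5]
ys = [2.2, 2.8, 4.1, 4.9, 6.0]  # hypothetical noisy readings near y = x + 1
alpha, beta = fit_line(xs, ys)
print(alpha, beta)
print(smooth(xs, alpha, beta))
```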
Clustering
Similar values are grouped into clusters; a value that falls outside every cluster is treated as an outlier.
2) Data Transformation
Normalization
min-max normalization
z-score normalization
normalization by decimal scaling
Sigmoidal
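Min-Max Normalization
Maps a value v of attribute A to v' in the range [new_minA, new_maxA]:
$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
Example: income ranges from 12,000 (min) to 98,000 (max). Mapping 73,600 to [0,1] gives
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$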
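The income example can be checked with a few lines of Python. The function name and parameter names are illustrative; the target range defaults to [0, 1].

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716
```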
Z-Score
Rescales attribute A to mean 0 and standard deviation 1:
$$v' = \frac{v - mean_A}{stand\_dev_A}$$
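A small Python sketch of z-score normalization. The input list is hypothetical. The population standard deviation (`statistics.pstdev`) is an assumption, since the slide does not say which estimator it intends; `statistics.stdev` can be swapped in for the sample version.

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Shift and scale values so they have mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5 and std 2
print(z_score_normalize(data))
```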
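Decimal scaling
$$v' = \frac{v}{10^j}$$
where j is the smallest integer such that max(|v'|) < 1.
Example: A ranges from -986 to 917. The largest absolute value is |-986| = 986, so divide by 1000 (j = 3); -986 becomes -0.986:
$$\frac{986}{10^3} = 0.986$$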
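The same example in Python, as a sketch. The function name is illustrative, and j is found by trying successive powers of 10 until every scaled magnitude is below 1.

```python
def decimal_scaling(values):
    """Return (j, scaled) where j is the smallest integer with max |v / 10**j| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

j, scaled = decimal_scaling([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```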
Sigmoidal
$$y' = \frac{1 - e^{-\alpha}}{1 + e^{-\alpha}}, \qquad \alpha = \frac{y - mean}{std}$$
The result lies between -1 and 1.
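A hedged Python sketch of sigmoidal normalization, assuming the form y' = (1 - e^(-a)) / (1 + e^(-a)) with a = (y - mean) / std, which maps every value into the open interval (-1, 1). The population standard deviation and the sample data are assumptions.

```python
import math
from statistics import mean, pstdev

def sigmoidal_normalize(values):
    """Squash values into (-1, 1); the mean maps to 0."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    result = []
    for y in values:
        a = (y - mu) / sigma
        result.append((1 - math.exp(-a)) / (1 + math.exp(-a)))
    return result

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5 and std 2
print(sigmoidal_normalize(data))
```

The expression is algebraically the same as tanh(a / 2), so values far from the mean are compressed toward -1 or 1 instead of stretching the range.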
3) Data Integration
Issues to consider:
1. Data Redundancies and Data Inconsistencies
2. Schema Integration: use metadata to match entities, e.g. Cusid in database A and CustNumber in database B
Data Warehousing
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real world entities from
databases
The same attribute may have different names in different
databases
One attribute may be a derived attribute in another
table, e.g., annual revenue
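Redundancy can be detected by correlation analysis:
$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B}$$
The mean of A is
$$\bar{A} = \frac{\sum A}{n}$$
and the standard deviation of A is
$$\sigma_A = \sqrt{\frac{\sum (A - \bar{A})^2}{n - 1}}$$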
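The correlation coefficient can be computed directly from these definitions. The sketch below uses the n - 1 form of the standard deviation and two hypothetical attributes, one of which is derived from the other (an annual figure equal to 12 times a monthly one), so the coefficient comes out at 1 and flags the redundancy.

```python
import math

def correlation(a, b):
    """Correlation coefficient r_{A,B} using sample (n - 1) standard deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * std_a * std_b)

monthly = [10, 20, 30, 40]      # hypothetical attribute
annual = [120, 240, 360, 480]   # derived attribute: 12 * monthly
print(round(correlation(monthly, annual), 6))  # 1.0
```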
4) Data Reduction
Lattice of cuboids
Heuristic methods:
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
Combining forward selection and backward elimination: at each step, select the best attribute and remove the worst from among the remaining attributes.
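Step-wise forward selection, the first heuristic in the list, can be sketched as a greedy loop. The `score` function is a caller-supplied, hypothetical quality measure for an attribute subset (higher is better); the toy score and attribute names below are made up for illustration. Backward elimination mirrors this by starting from the full set and dropping the attribute whose removal helps most.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that most improves score(subset); stop when none does."""
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining attribute improves the subset
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Toy score (hypothetical): useful attributes add value, the others cost 0.5 each.
usefulness = {"A1": 3.0, "A4": 2.0, "A6": 1.0}

def toy_score(subset):
    return sum(usefulness.get(a, -0.5) for a in subset)

print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score))
# ['A1', 'A4', 'A6']
```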
[Figure: decision tree induction; test node A1? with branches leading to Class 1 and Class 2 leaves]
Assume the data fits some model, estimate model parameters, store only the parameters.
[Figure: histogram; x-axis from 0 to 100000 in steps of 10000, y-axis counts up to 35]
[Figure: sampling; Raw Data versus Cluster/Stratified Sample]
Ex. Suppose that the tree contains 10,000 tuples with keys ranging from 1 to [...]
6 buckets for the key. Each bucket contains roughly 10,000/6 items.
Therefore, each bucket has pointers to the data keys 986, 3396, 5411, 8392
and 9544, respectively.
The use of multidimensional index trees as a form of data reduction relies on
an ordering of the attribute values in each dimension.
recursively
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
Segmentation by natural partitioning
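$$H(X) = -\sum_{x \in A_X} P(x) \log P(x)$$
$$H(X \mid Y) = \sum_{y \in A_Y} P(y)\, H(X \mid y)$$
For a boundary T that splits S into S1 and S2:
$$H(T, S) = \frac{|S_1|}{|S|} H(S_1) + \frac{|S_2|}{|S|} H(S_2)$$
The gain from the split is
$$H(S) - H(T, S)$$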
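One step of entropy-based discretization can be sketched in Python: compute H(S), evaluate H(T, S) for each candidate boundary T, and keep the boundary with the largest gain H(S) - H(T, S). Base-2 logarithms, midpoints between adjacent distinct values as candidate boundaries, and the sample ages and classes are all assumptions. The function expects at least two distinct values. Applying it again to each resulting partition gives the recursive procedure.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum of p * log2(p) over the class labels in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(values, labels, t):
    """H(T, S): size-weighted entropy of S1 (value <= t) and S2 (value > t)."""
    s1 = [lab for v, lab in zip(values, labels) if v <= t]
    s2 = [lab for v, lab in zip(values, labels) if v > t]
    n = len(labels)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

def best_split(values, labels):
    """Return (boundary, gain) for the candidate boundary with the largest gain."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    base = entropy(labels)
    best = max(candidates, key=lambda t: base - split_entropy(values, labels, t))
    return best, base - split_entropy(values, labels, best)

ages = [21, 25, 27, 35, 42, 50]                     # hypothetical attribute values
classes = ["no", "no", "no", "yes", "yes", "yes"]   # hypothetical class labels
print(best_split(ages, classes))  # (31.0, 1.0)
```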
Experiments show that it may reduce data size and improve
classification accuracy
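Natural interval
The 3-4-5 rule partitions a range into (equi-width) intervals according to the number of distinct values it covers at the most significant digit:
- 3, 6 or 9 distinct values: 3 intervals
- 7 distinct values: 3 intervals grouped 2-3-2
- 2, 4 or 8 distinct values: 4 intervals
- 1, 5 or 10 distinct values: 5 intervals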
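[Figure: concept hierarchy built by grouping attributes by their number of distinct values: country (15 distinct values), a level with 65 distinct values, city, street]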
HW#3
LAB 3
bank-data-missvalue.csv