2
Data Preprocessing
n Why preprocess the data?
n Data cleaning
n Data integration and transformation
n Data reduction
n Discretization
n Summary
3
Why Data Preprocessing?
4
Why Is Data Preprocessing Important?
5
Multi-Dimensional Measure of Data Quality
n A well-accepted multi-dimensional view:
q Accuracy
q Completeness
q Consistency
q Timeliness
q Believability
q Value added
q Interpretability
q Accessibility
6
Major Tasks in Data Preprocessing
n Data cleaning
q Fill in missing values, smooth noisy data, identify or remove
outliers and noisy data, and resolve inconsistencies
n Data integration
q Integration of multiple databases, or files
n Data transformation
q Normalization and aggregation
n Data reduction
q Obtains a reduced representation that is smaller in volume but
produces the same or similar analytical results
n Data discretization (for numerical data)
7
Data Preprocessing
n Why preprocess the data?
n Data cleaning
n Data integration and transformation
n Data reduction
n Discretization
n Summary
8
Data Cleaning
n Importance
q Data cleaning is the number one problem in data
warehousing
9
Missing Data
n Data is not always available
q E.g., many tuples have no recorded values for several
attributes, such as customer income in sales data
10
How to Handle Missing Data?
11
Noisy Data
n Noise: random error or variance in a measured
variable.
n Incorrect attribute values may be due to
q faulty data collection instruments
q data entry problems
q data transmission problems
q etc.
n Other data problems which require data cleaning
q duplicate records, incomplete data, inconsistent data
12
How to Handle Noisy Data?
n Binning method:
q first sort data and partition into (equi-size) bins
q then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
n Clustering
q detect and remove outliers
n Combined computer and human inspection
q detect suspicious values and check by human (e.g., deal with
possible outliers)
13
Binning Methods for Data Smoothing
n Sorted data for price (in dollars):
q E.g., 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
n Partition into (equi-size) bins:
q Bin 1: 4, 8, 9, 15
q Bin 2: 21, 21, 24, 25
q Bin 3: 26, 28, 29, 34
n Smoothing by bin means:
q Bin 1: 9, 9, 9, 9
q Bin 2: 23, 23, 23, 23
q Bin 3: 29, 29, 29, 29
n Smoothing by bin boundaries:
q Bin 1: 4, 4, 4, 15
q Bin 2: 21, 21, 25, 25
q Bin 3: 26, 26, 26, 34
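The worked example above can be reproduced in a few lines of Python (a sketch: function names are mine, bins are equi-depth with four values each, bin means are rounded to the nearest integer, and boundary ties break toward the lower boundary):

```python
def smooth_by_means(bins):
    # Replace every value in a bin with the (rounded) bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value with whichever bin boundary (min or max) is closer.
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equi-depth bins
```

Note that in Bin 2 the value 24 is closer to the boundary 25 than to 21, so boundary smoothing maps it to 25.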
14
Outlier Removal
n Data points inconsistent with the majority of data
n Different outliers
q Valid: a CEO's salary
q Noisy: one's age = 200; widely deviated points
n Removal methods
q Clustering
q Curve-fitting
q Hypothesis-testing with a given model
15
Data Preprocessing
n Why preprocess the data?
n Data cleaning
n Data integration and transformation
n Data reduction
n Discretization
n Summary
16
Data Integration
n Data integration:
q combines data from multiple sources
n Schema integration
q integrate metadata from different sources
q Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
n Detecting and resolving data value conflicts
q for the same real world entity, attribute values from different
sources are different, e.g., different scales, metric vs. British
units
n Removing duplicates and redundant data
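A toy sketch of entity identification and merging (the record contents and field names here are illustrative, not from the slides):

```python
# Two sources that name the same key differently (A.cust-id vs. B.cust-#).
source_a = [{"cust-id": 1, "name": "Ann"}, {"cust-id": 2, "name": "Bob"}]
source_b = [{"cust-#": 2, "city": "Oslo"}, {"cust-#": 3, "city": "Rome"}]

def integrate(a, b):
    # Entity identification: A.cust-id and B.cust-# identify the same
    # real-world entity, so attributes from both sources are merged
    # under one shared key.
    merged = {}
    for rec in a:
        merged.setdefault(rec["cust-id"], {}).update(
            {k: v for k, v in rec.items() if k != "cust-id"})
    for rec in b:
        merged.setdefault(rec["cust-#"], {}).update(
            {k: v for k, v in rec.items() if k != "cust-#"})
    return merged
```

Customer 2 appears in both sources, so the integrated record combines attributes from each.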
17
Data Transformation
n Smoothing: remove noise from data
n Normalization: scaled to fall within a small, specified
range
n Attribute/feature construction
q New attributes constructed from the given ones
n Aggregation: summarization
q Integrate data from different sources (tables)
n Generalization: concept hierarchy climbing
18
Data Transformation: Normalization
n Min-max normalization to [new_min, new_max]:
q v' = (v − min) / (max − min) × (new_max − new_min) + new_min
n Z-score normalization:
q v' = (v − mean) / standard_deviation
n Normalization by decimal scaling:
q v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
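The two most common normalization methods can be sketched as follows (a minimal version: population standard deviation, and it assumes max > min):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    # Min-max normalization: map [min, max] linearly onto [new_min, new_max].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    # Z-score normalization: subtract the mean, divide by the std deviation.
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]
```

Min-max preserves the shape of the data exactly; z-score is preferable when the true min and max are unknown or dominated by outliers.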
19
Data Preprocessing
n Why preprocess the data?
n Data cleaning
n Data integration and transformation
n Data reduction
n Discretization
n Summary
20
Data Reduction Strategies
n Data is too big to work with
q Too many instances
q too many features (attributes) – curse of dimensionality
n Data reduction
q Obtain a reduced representation of the data set that is much smaller
in volume but yet produce the same (or almost the same) analytical
results (easily said but difficult to do)
n Data reduction strategies
q Dimensionality reduction — remove unimportant attributes
q Aggregation and clustering –
n Remove redundant or closely associated ones
q Sampling
21
Dimensionality Reduction
n Feature selection (i.e., attribute subset selection):
q Direct methods –
n Select a minimum set of attributes (features) that is sufficient for the
data mining task.
q Indirect methods -
n Principal component analysis (PCA)
n Singular value decomposition (SVD)
n Independent component analysis (ICA)
n Various spectral and/or manifold embedding (active topics)
n Heuristic methods (due to exponential # of choices):
q step-wise forward selection
q step-wise backward elimination
q combining forward selection and backward elimination
n Combinatorial search – exponential computation cost
q etc.
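Step-wise forward selection can be sketched as a greedy loop; the `score` callback below is a stand-in for whatever measure of subset quality the data mining task uses (e.g., held-out accuracy):

```python
def forward_select(features, score, k):
    # Greedily grow the subset: at each step add the single remaining
    # feature that produces the highest-scoring subset.
    selected = []
    while len(selected) < k:
        best = max((f for f in features if f not in selected),
                   key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected
```

Backward elimination is the mirror image: start from all features and repeatedly drop the one whose removal hurts the score least. Both avoid the exponential cost of combinatorial search, at the price of possibly missing the optimal subset.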
22
Histograms
24
Sampling
25
Sampling
26
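The two basic flavours of simple random sampling can be sketched with the standard library (seeded here only to keep the sketch reproducible):

```python
import random

def sample_without_replacement(data, n, seed=0):
    # Each item can appear at most once in the sample.
    return random.Random(seed).sample(data, n)

def sample_with_replacement(data, n, seed=0):
    # Items are drawn independently, so the same item may appear repeatedly.
    return random.Random(seed).choices(data, k=n)
```

Sampling with replacement underlies bootstrap methods; sampling without replacement gives an exact subset of the data.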
Data Preprocessing
n Why preprocess the data?
n Data cleaning
n Data integration and transformation
n Data reduction
n Discretization
n Summary
27
Discretization
n Three types of attributes:
q Nominal — values from an unordered set
q Ordinal — values from an ordered set
q Continuous — real numbers
n Discretization:
q divide the range of a continuous attribute into intervals
because some data mining algorithms only accept
categorical attributes.
n Some techniques:
q Binning methods – equal-width, equal-frequency
q Entropy based
q etc.
28
Discretization and Concept Hierarchy
n Discretization
q reduce the number of values for a given continuous attribute
by dividing the range of the attribute into intervals. Interval
labels can then be used to replace actual data values
n Concept hierarchies
q reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior)
29
Binning
n Attribute values (for one attribute e.g., age):
q 0, 4, 12, 16, 16, 18, 24, 26, 28
n Equi-width binning – for bin width of e.g., 10:
q Bin 1: 0, 4 [−∞, 10) bin
q Bin 2: 12, 16, 16, 18 [10, 20) bin
q Bin 3: 24, 26, 28 [20, +∞) bin
q −∞ denotes negative infinity, +∞ positive infinity
n Equi-frequency binning – for bin density of e.g., 3:
q Bin 1: 0, 4, 12 [−∞, 14) bin
q Bin 2: 16, 16, 18 [14, 21) bin
q Bin 3: 24, 26, 28 [21, +∞) bin
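Both binning schemes above reduce to a few lines (a sketch assuming non-negative integer ages, so the equi-width bin index is simply v // width):

```python
ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]

def equi_width_bins(values, width):
    # Group values by bin index v // width (bins [0, 10), [10, 20), ...).
    bins = {}
    for v in values:
        bins.setdefault(v // width, []).append(v)
    return [bins[k] for k in sorted(bins)]

def equi_freq_bins(values, density):
    # Sort, then cut into consecutive runs of `density` values each.
    s = sorted(values)
    return [s[i:i + density] for i in range(0, len(s), density)]
```

Equi-width bins may end up with very uneven counts (skewed data); equi-frequency bins equalize counts but may split runs of identical values across bins.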
30
Entropy-based (1)
n Given attribute-value/class pairs:
q (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
n Entropy-based binning via binarization:
q Intuitively, find best split so that the bins are as pure as possible
q Formally characterized by maximal information gain.
n Let S denote the above 9 pairs, p=4/9 be fraction of P
pairs, and n=5/9 be fraction of N pairs.
n Entropy(S) = − p log₂ p − n log₂ n
q Smaller entropy means the set is relatively pure; the smallest value is 0.
q Larger entropy means the set is mixed; the largest value is 1.
31
Entropy-based (2)
n Let v be a possible split. Then S is divided into two sets:
q S1: value <= v and S2: value > v
n The best split is found after examining all possible splits.
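A brute-force version of this search (candidate splits taken at midpoints between adjacent distinct values) picks v = 14 for the example pairs, separating the three pure P pairs from the rest:

```python
from math import log2

def set_entropy(labels):
    # Binary entropy of a list of "P"/"N" labels (empty counts skipped).
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in (labels.count("P"), labels.count("N")) if c)

def best_split(pairs):
    # Try the midpoint between each adjacent pair of distinct values and
    # return the split minimizing the weighted entropy of S1 and S2
    # (equivalently, maximizing information gain).
    values = sorted({v for v, _ in pairs})
    best_v, best_e = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        v = (lo + hi) / 2
        s1 = [c for x, c in pairs if x <= v]
        s2 = [c for x, c in pairs if x > v]
        e = (len(s1) * set_entropy(s1)
             + len(s2) * set_entropy(s2)) / len(pairs)
        if e < best_e:
            best_v, best_e = v, e
    return best_v, best_e

pairs = [(0, "P"), (4, "P"), (12, "P"), (16, "N"), (16, "N"),
         (18, "P"), (24, "N"), (26, "N"), (28, "N")]
```

At v = 14, S1 = {0, 4, 12} is pure (entropy 0) and S2 has entropy about 0.65, giving a weighted entropy of about 0.43; the information gain is 0.991 − 0.43 ≈ 0.56. Applied recursively to S1 and S2, this yields multi-interval discretization.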
32
Summary
n Data preparation is a big issue for data mining
n Data preparation includes
q Data cleaning and data integration
q Data reduction and feature selection
q Discretization
n Many methods have been proposed, but data preparation remains
an active area of research
33