Prepared by
Avisek Dey
09Bs0000498
December 8, 2021 1
Chapter 3: Data Preprocessing
Preprocess Steps
Data cleaning
Data integration and transformation
Data reduction
Why Data Preprocessing?
names
No quality data, no quality mining results!
Quality decisions must be based on quality data
quality data
Multi-Dimensional Measure of Data Quality
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation that is smaller in volume but produces the
same or similar analytical results
Forms of data preprocessing
Data Cleaning
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming
the task is classification); not effective when the percentage of
missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to
fill in the missing value: smarter
Use the most probable value to fill in the missing value: inference-
based such as Bayesian formula or decision tree
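The class-conditional mean strategy listed above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `fill_with_class_mean`, the attribute names `income` and `cls`, and the sample rows are all hypothetical, and missing values are assumed to be stored as `None`.

```python
from collections import defaultdict
from statistics import mean


def fill_with_class_mean(rows, attr, label):
    """Return copies of rows where a missing (None) value of `attr` is
    replaced by the mean of `attr` over rows with the same `label` value.

    Falls back to the overall attribute mean when a class has no observed
    values. Assumes at least one observed value exists overall.
    """
    observed = defaultdict(list)
    for row in rows:
        if row[attr] is not None:
            observed[row[label]].append(row[attr])

    # Overall attribute mean, used only as a fallback.
    overall = mean(v for vals in observed.values() for v in vals)

    filled = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's data
        if new_row[attr] is None:
            vals = observed.get(new_row[label])
            new_row[attr] = mean(vals) if vals else overall
        filled.append(new_row)
    return filled


# Hypothetical sample data: two classes, one missing value in each.
rows = [
    {"income": 30.0, "cls": "low"},
    {"income": None, "cls": "low"},
    {"income": 50.0, "cls": "low"},
    {"income": 90.0, "cls": "high"},
    {"income": None, "cls": "high"},
]
filled = fill_with_class_mean(rows, "income", "cls")
```

Replacing the per-class lookup with `overall` for every row would give the plain attribute-mean strategy from the same list.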
Noisy Data
technology limitation
incomplete data
inconsistent data
Cluster Analysis
Clustering: detect and remove outliers
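One way to picture cluster-based outlier removal is the sketch below. The slide does not name a clustering algorithm, so this uses a deliberately simple assumption: one-dimensional values are grouped whenever neighbouring sorted values are within `max_gap` of each other, and clusters with fewer than `min_size` members are treated as outliers. The function names, both parameters, and the sample values are hypothetical.

```python
def cluster_1d(values, max_gap):
    """Group 1-D values into clusters: after sorting, a new cluster starts
    whenever the gap to the previous value exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def remove_outliers(values, max_gap, min_size):
    """Split values into (kept, outliers); values falling in clusters
    smaller than min_size are reported as outliers."""
    kept, outliers = [], []
    for cluster in cluster_1d(values, max_gap):
        target = kept if len(cluster) >= min_size else outliers
        target.extend(cluster)
    return kept, outliers


# Hypothetical sample: most values lie in one dense group.
values = [21, 22, 24, 25, 80, 26, 23, 5]
kept, outliers = remove_outliers(values, max_gap=3, min_size=2)
# kept     -> [21, 22, 23, 24, 25, 26]
# outliers -> [5, 80]
```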
Regression
Regression: smooth by fitting the data into regression functions
[Figure: a data point (X1, Y1) is smoothed to Y1’ on the fitted line y = x + 1; axes x and y]
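Smoothing by regression can be sketched as fitting a straight line and replacing each observed y with the value on the line. The sketch below assumes simple linear regression by ordinary least squares, which the slide does not specify; the function names and the noisy sample points (scattered around the line y = x + 1) are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth(xs, ys):
    """Replace each y with its fitted value on the regression line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations around y = x + 1.
xs = [0, 1, 2, 3]
ys = [1.2, 1.8, 3.1, 3.9]
smoothed = smooth(xs, ys)
```

Each smoothed value plays the role of Y1’ for its observed Y1.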
Data Integration
Data integration:
combines data from multiple sources.
Schema integration
Handling Redundant Data in Data Integration
Data Transformation: Normalization
min-max normalization
Min-max normalization performs a linear
transformation on the original data.
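As a sketch of that linear transformation, the usual form maps a value v of attribute A to v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min. The function name, the parameter names `new_min` and `new_max`, their default target range of [0, 1], and the sample values are assumptions made for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min(values), max(values)]
    to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("all values are equal; the range is zero")
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]


# Hypothetical attribute values.
incomes = [12000, 73600, 98000]
scaled = min_max_normalize(incomes)
# The smallest value maps to 0.0 and the largest to 1.0.
```

Because the minimum and maximum are taken from the data at hand, a later value outside that range would fall outside the target range too.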
Data Transformation: Normalization
Z-score Normalization:
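In z-score normalization, the values of attribute A are normalized
based on the mean and standard deviation of A. A
value v of A is normalized to v’ by computing:
$v' = (v - \bar{A}) / \sigma_A$
where $\bar{A}$ is the mean and $\sigma_A$ is the standard deviation of A.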
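A minimal sketch of z-score normalization, which subtracts the mean of A from each value and divides by the standard deviation of A. The slide does not say whether the population or the sample standard deviation is meant; this sketch assumes the population form (`statistics.pstdev`). The function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Return (v - mean) / standard deviation for each value.

    Uses the population standard deviation (an assumption; swap in
    statistics.stdev for the sample version).
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute values with mean 5 and population std. dev. 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_score_normalize(values)
# z -> [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```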
Data Transformation: Normalization
Data Reduction Strategies
A data warehouse may store terabytes of data: complex data
analysis/mining may take a very long time to run on the
complete data set
Data reduction
Obtains a reduced representation of the data set that is
much smaller in volume, yet produces the same or similar analytical results
Dimensionality reduction
Data Compression
String compression
Typically lossless
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
Thank You