Data Pre-Processing: Overview & Data Cleaning: Data Warehouse and Mining

DATA WAREHOUSE AND MINING
Data Pre-Processing:
Overview & Data Cleaning
Submitted To: Dr. Megha, Assistant Professor
Submitted By: Huma Naz, Roll No-1810981004, M.E Fellowship (C.S.E)
Table of Contents
• Data Pre-processing: An Overview
▫ Data Quality
▫ Why Is Data Preprocessing Important?
▫ Major Tasks of Data Pre-processing
• Data Cleaning
▫ Importance
▫ Data Cleaning Tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration
What Is Data Preprocessing?
• Data pre-processing is a data mining technique that transforms
raw data into an understandable format. Real-world data is often incomplete,
inconsistent, or lacking in certain behaviours or trends, and is likely to
contain many errors.
• Data pre-processing is a proven method of resolving such issues. It
prepares raw data for further processing.
• Its goals are to make data more suitable for data mining and to
improve the data mining analysis with respect to time, cost, and quality.
Data Quality
• Measures for data quality: A multidimensional view
▫ Accuracy: correct or wrong, accurate or not
▫ Completeness: not recorded, unavailable, …
▫ Consistency: some modified but some not, dangling, …
▫ Timeliness: timely update?
▫ Believability: how much are the data trusted to be correct?
▫ Interpretability: how easily can the data be understood?
Why Data Preprocessing?
• Data in the real world is dirty
▫ incomplete: missing attribute values, lacking certain
attributes of interest, or containing only aggregate data
  - e.g., occupation=“”
▫ noisy: containing errors or outliers
  - e.g., Salary=“-10”
▫ inconsistent: containing discrepancies in codes or names
  - e.g., Age=“42” Birthday=“03/07/1997”
  - e.g., Was rating “1,2,3”, now rating “A, B, C”
  - e.g., discrepancy between duplicate records
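The three kinds of dirty data listed above can be detected with simple rule checks. The following is a minimal sketch, not a general-purpose validator: the record layout, the `find_issues` helper, the reference date, and the day/month reading of the birthday are all assumptions made for illustration.

```python
from datetime import date

def age_on(birthday, today):
    """Age in whole years on the given day."""
    years = today.year - birthday.year
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1  # birthday has not happened yet this year
    return years

def find_issues(record, today):
    """Return a list of data-quality problems found in one record."""
    issues = []
    if not record.get("occupation"):
        issues.append("incomplete: occupation is missing")
    if record.get("salary") is not None and record["salary"] < 0:
        issues.append("noisy: salary is negative")
    if record.get("age") is not None and record.get("birthday") is not None:
        if record["age"] != age_on(record["birthday"], today):
            issues.append("inconsistent: age does not match birthday")
    return issues

# Hypothetical record combining the slide's examples
# (birthday 03/07/1997 is read here as 3 July 1997, which is an assumption).
record = {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
issues = find_issues(record, today=date(2020, 1, 1))
for issue in issues:
    print(issue)
```

Each check maps to one category from the list: an empty value is incomplete, an impossible value is noisy, and two attributes that contradict each other are inconsistent.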
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
▫ Quality decisions must be based on quality data
  - e.g., duplicate or missing data may cause incorrect or
even misleading statistics.
• Data preparation, cleaning, and transformation make up the
majority of the work in a data mining application (estimated at 90%).
Major Tasks in Data Preprocessing
• Data cleaning
▫ Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
▫ Integration of multiple databases, or files
• Data transformation
▫ Normalization and aggregation
• Data reduction
▫ Obtains a representation that is reduced in volume but produces the same or
similar analytical results
• Data discretization (for numerical data)
Data Cleaning
• Importance
▫ “Data cleaning is the number one problem in data
warehousing”
▫ Real-world data tend to be incomplete, noisy, and inconsistent.
▫ Data cleaning (or data cleansing) routines attempt to fill in
missing values, smooth out noise while identifying outliers, and
correct inconsistencies in the data.
• Data cleaning tasks: a cleaning routine attempts to
▫ Fill in missing values
▫ Identify outliers and smooth out noisy data
▫ Correct inconsistent data
▫ Resolve redundancy caused by data integration
Missing Data
• Data is not always available
▫ E.g., many tuples have no recorded values for several
attributes, such as customer income in sales data
• Missing data may be due to
▫ equipment malfunction
▫ values deleted because they were inconsistent with other recorded data
▫ data not entered due to misunderstanding
▫ data not considered important at the time of entry
How to Handle Missing Data?
1. Ignore the tuple
▫ Usually done when the class label is missing (classification)
▫ Not an effective method unless the tuple has several attributes with
missing values
2. Fill in the missing value manually: tedious (time consuming) and
infeasible for a large database
3. Fill it in automatically with a global constant
▫ e.g., “unknown”; the mining program may misunderstand it as a new class
4. Fill it in with the attribute mean
▫ e.g., if the average income of All Electronics customers is $28,000, use
this value to replace missing incomes
5. Fill it in with the attribute mean for all samples belonging to the same
class as the given tuple
6. Fill it in with the most probable value
▫ determined with regression or inference-based tools such as a
Bayesian formula or a decision tree (the most popular strategy)
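Three of the automatic strategies above (global constant, attribute mean, and class-wise attribute mean) can be sketched in a few lines. This is an illustrative sketch only: the customer records, the `risk` class label, and the helper names are hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records; None marks a missing income,
# and "risk" plays the role of the class label.
customers = [
    {"id": 1, "risk": "low", "income": 40000},
    {"id": 2, "risk": "low", "income": None},
    {"id": 3, "risk": "high", "income": 20000},
    {"id": 4, "risk": "high", "income": 24000},
    {"id": 5, "risk": "high", "income": None},
]

def fill_constant(rows, attr, constant="unknown"):
    """Replace missing values with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Replace missing values with the mean of the known values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]

def fill_class_mean(rows, attr, label):
    """Replace missing values with the mean of the tuple's own class.

    Assumes every class has at least one known value for attr.
    """
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    class_means = {c: mean(v) for c, v in by_class.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]

print(fill_mean(customers, "income")[1]["income"])                # 28000
print(fill_class_mean(customers, "income", "risk")[1]["income"])  # 40000
print(fill_class_mean(customers, "income", "risk")[4]["income"])  # 22000
```

The class-wise mean usually gives a closer estimate than the overall mean because it uses only tuples that resemble the one being repaired.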
Noisy Data
• Noise: random error or variance in a measured variable.
• Incorrect attribute values may be due to
▫ faulty data collection instruments
▫ data entry problems
▫ data transmission problems, etc.
• Other data problems that require data cleaning
▫ duplicate records, incomplete data, inconsistent data
How to Handle Noisy Data?
• Binning
▫ first sort the data and partition it into (equi-depth) bins
▫ then smooth by bin means, by bin medians, by bin boundaries, etc.
• Clustering
▫ Similar values are organized into groups (clusters).
▫ Values that fall outside the clusters are considered outliers.
• Combined computer and human inspection
▫ the computer detects suspicious values and a human checks them (e.g., to
deal with possible outliers)
• Regression
▫ Data can be smoothed by fitting it to a function, as in linear
regression or multiple linear regression.
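Smoothing by regression can be illustrated with an ordinary least-squares line: each noisy value is replaced by the value the fitted line predicts. This is a minimal sketch for a single predictor; the sample points and the function names are assumptions made for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = y_bar - slope * x_bar
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [3.2, 4.7, 7.1, 9.3, 10.7]
print(smooth_by_regression(xs, ys))
```

Multiple linear regression follows the same idea with more than one predictor attribute.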
Binning Methods for Data Smoothing
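A small worked sketch of equi-depth binning and two smoothing variants follows. The price list is illustrative data rather than data from the slides, and the helper names are hypothetical.

```python
from statistics import mean

def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
bins = equal_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by bin medians works the same way as smoothing by means, with `statistics.median` in place of `mean`.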
Outlier Removal
• Outliers are data points inconsistent with the majority of the data
• Types of outliers
▫ Valid: e.g., a CEO’s salary
▫ Noisy: e.g., a person’s age = 200, or other widely deviating points
• Removal methods
▫ Clustering
▫ Curve-fitting
▫ Hypothesis-testing with a given model
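As a simple stand-in for a statistical test, widely deviating points can be flagged with the modified z-score, which measures distance from the median in units of the median absolute deviation (MAD). Note that this is a swapped-in technique, not one of the methods named above; the threshold of 3.5 is a conventional choice and the age list is hypothetical.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

ages = [23, 25, 31, 35, 38, 41, 44, 52, 200]  # hypothetical data
suspicious = flag_outliers(ages)
print(suspicious)  # [200]
```

A flagged value is only a candidate for removal: a valid outlier such as a CEO's salary would also be flagged, so the result is best passed to a human for inspection.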
Summary
• Data preparation is a big issue for data mining
• Data preparation includes
▫ Data cleaning and data integration
▫ Data reduction and feature selection
▫ Discretization
• Many methods have been proposed, but data preparation is still an
active area of research
