
DATA WAREHOUSE AND MINING

Data Pre-Processing:
Overview & Data Cleaning

Submitted To: Dr. Megha, Assistant Professor

Submitted By: Huma Naz, Roll No-1810981004, M.E Fellowship (C.S.E)
Table of Contents
• Data Pre-processing: An Overview
➢ Data Quality
➢ Why Is Data Preprocessing Important?
➢ Major Tasks of Data Pre-processing
• Data Cleaning
➢ Importance
➢ Data Cleaning Tasks
✓ Fill in missing values
✓ Identify outliers and smooth out noisy data
✓ Correct inconsistent data
✓ Resolve redundancy caused by data integration
What is Data Preprocessing
• Data pre-processing is a data mining technique that involves transforming
raw data into an understandable format. Real-world data is often incomplete,
inconsistent, and/or lacking in certain behaviours or trends, and is likely to
contain many errors.
• Data pre-processing is a proven method of resolving such issues. It
prepares raw data for further processing, with two goals:
▫ to make the data more suitable for data mining
▫ to improve the data mining analysis with respect to time, cost, and quality
Data Quality
• Measures for data quality: A multidimensional view
▫ Accuracy: correct or wrong, accurate or not
▫ Completeness: not recorded, unavailable, …
▫ Consistency: some modified but some not, dangling, …
▫ Timeliness: timely update?
▫ Believability: how much are the data trusted to be correct?
▫ Interpretability: how easily the data can be understood?
Why Data Preprocessing?
• Data in the real world is dirty
▫ incomplete: missing attribute values, lacking certain
attributes of interest, or containing only aggregate data
➢ e.g., occupation=“”
▫ noisy: containing errors or outliers
➢ e.g., Salary=“-10”
▫ inconsistent: containing discrepancies in codes or names
➢ e.g., Age=“42” Birthday=“03/07/1997”
➢ e.g., ratings were “1, 2, 3”, now “A, B, C”
➢ e.g., discrepancy between duplicate records
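The three kinds of dirtiness listed above can often be detected mechanically. The following is a minimal sketch, assuming each record is a plain Python dict. The field names, the `find_problems` helper, the reference date, and the reading of “03/07/1997” as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: an attribute of interest is empty or absent.
    if not record.get("occupation"):
        problems.append("incomplete: occupation is missing")
    # Noisy: a value that cannot be correct, such as a negative salary.
    if record.get("salary") is not None and record["salary"] < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: the stored age disagrees with the stored birthday.
    birthday = record.get("birthday")
    if birthday is not None and record.get("age") is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        expected_age = today.year - birthday.year - (0 if had_birthday else 1)
        if expected_age != record["age"]:
            problems.append("inconsistent: age does not match birthday")
    return problems

# Hypothetical record combining the slide's three examples.
record = {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
problems = find_problems(record, today=date(2020, 1, 1))
for problem in problems:
    print(problem)
```

Checks like these only flag suspicious records. Deciding how to repair them is the job of the cleaning steps that follow.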
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
➢ Quality decisions must be based on quality data
➢ e.g., duplicate or missing data may cause incorrect or
even misleading statistics.
• Data preparation, cleaning, and transformation make up the
majority of the work in a data mining application (by some estimates, as much as 90%).
Major Tasks in Data Preprocessing
• Data cleaning
▫ Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
▫ Integration of multiple databases, or files
• Data transformation
▫ Normalization and aggregation
• Data reduction
▫ Obtains a representation that is reduced in volume but produces the same or
similar analytical results
• Data discretization (for numerical data)
Data Cleaning
• Importance
▫ “Data cleaning is the number one problem in data
warehousing”
▫ Real-world data tend to be incomplete, noisy, and inconsistent.
▫ Data cleaning (or data cleansing) routines attempt to fill in
missing values, smooth out noise while identifying outliers, and
correct inconsistencies in the data.
• Data cleaning tasks: a cleaning routine attempts to
▫ Fill in missing values
▫ Identify outliers and smooth out noisy data
▫ Correct inconsistent data
▫ Resolve redundancy caused by data integration
Missing Data
• Data is not always available
▫ E.g., many tuples have no recorded values for several
attributes, such as customer income in sales data
• Missing data may be due to
▫ equipment malfunction
▫ data that was inconsistent with other recorded data and was therefore deleted
▫ data not entered due to misunderstanding
▫ data not considered important at the time of entry
How to Handle Missing Data?
1. Ignore the tuple
▫ Usually done when the class label is missing (classification)
▫ Not effective unless the tuple has several attributes with missing
values
2. Fill in the missing value manually: tedious (time-consuming) and
infeasible for a large database
3. Fill it in automatically with a global constant
▫ e.g., “unknown”; the mining program may mistake it for a new class
4. Use the attribute mean
▫ e.g., if the average income of All Electronics customers is $28,000, use
this value to replace missing incomes
5. Use the attribute mean for all samples belonging to the same
class as the given tuple
6. Use the most probable value
▫ determined with regression or inference-based methods such as a
Bayesian formula or a decision tree (the most popular strategy)
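Options 3 to 5 above can be sketched in a few lines. This is a minimal illustration using only the standard library. The customer tuples, the `segment` class attribute, and the three helper functions are hypothetical and not part of any particular tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales tuples; None marks a missing income.
customers = [
    {"segment": "budget", "income": 20000},
    {"segment": "budget", "income": None},
    {"segment": "budget", "income": 24000},
    {"segment": "premium", "income": 40000},
    {"segment": "premium", "income": None},
]

def fill_with_constant(rows, attr, constant="unknown"):
    """Option 3: replace every missing value with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_with_mean(rows, attr):
    """Option 4: replace missing values with the mean of the known values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, class_attr):
    """Option 5: replace missing values with the mean of the tuple's own class."""
    known = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            known[r[class_attr]].append(r[attr])
    class_means = {label: mean(values) for label, values in known.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]

by_mean = fill_with_mean(customers, "income")
by_class = fill_with_class_mean(customers, "income", "segment")
print([r["income"] for r in by_mean])
print([r["income"] for r in by_class])
```

Each helper returns new rows and leaves the input untouched, so the filled data can be compared with the original. The class-based mean usually gives a closer estimate than the overall mean because it uses only tuples similar to the one being repaired.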
Noisy Data
• Noise: random error or variance in a measured variable.
• Incorrect attribute values may be due to
▫ faulty data collection instruments
▫ data entry problems
▫ data transmission problems, etc.
• Other data problems that require data cleaning
▫ duplicate records, incomplete data, inconsistent data
How to Handle Noisy Data?
• Binning method:
▫ first sort the data and partition it into equal-frequency (equi-depth) bins
▫ then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
▫ Similar values are organized into groups (clusters).
▫ Values that fall outside the clusters are considered outliers.
• Combined computer and human inspection
▫ the computer detects suspicious values and a human checks them (e.g., to
deal with possible outliers)
• Regression
▫ Data can be smoothed by fitting it to a function, for example with
linear regression or multiple linear regression.
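Smoothing by regression can be illustrated with a simple least-squares line. The sketch below assumes a single predictor, and the measurements and the `fit_line` helper are hypothetical. Multiple linear regression follows the same idea with more than one predictor.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy measurements scattered around a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
# Smoothing: replace each observed y with the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2))
```

The smoothed values keep the overall trend of the data while discarding the point-to-point noise.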
Binning Methods for Data Smoothing
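A minimal sketch of the binning steps, assuming hypothetical sorted price data and bins of four values each. The helper names are illustrative only.

```python
from statistics import mean, median

def equal_depth_bins(values, depth):
    """Sort the values and split them into bins of `depth` items each.

    If the count is not a multiple of `depth`, the last bin is shorter.
    """
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value with the median of its bin."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        # Ties go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, depth=4)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```

Each method consults only the values in the same bin, so binning performs local smoothing. Larger bins smooth more aggressively.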
Outlier Removal
• Outliers: data points inconsistent with the majority of the data
• Different kinds of outliers
▫ Valid: a CEO’s salary
▫ Noisy: a person’s age = 200, widely deviating points
• Removal methods
▫ Clustering
▫ Curve fitting
▫ Hypothesis testing with a given model
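As a simple stand-in for the methods listed above, the sketch below flags outliers with the interquartile-range (IQR) rule. This is a common statistical heuristic, not one of the three methods named on the slide. The age values, the `iqr_outliers` helper, and the conventional multiplier `k=1.5` are assumptions.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical ages with one noisy entry.
ages = [23, 25, 31, 35, 38, 41, 44, 52, 200]
flagged = iqr_outliers(ages)
print(flagged)
```

A flagged value is only a candidate for removal. A valid outlier such as a CEO's salary would be flagged in the same way, so a human check is still needed before deleting anything.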
Summary..
• Data preparation is a big issue for data mining
• Data preparation includes
▫ Data cleaning and data integration
▫ Data reduction and feature selection
▫ Discretization
• Many methods have been proposed, but data preparation is still an
active area of research
