
BIS2216/DBS1214

Data Mining & Knowledge Discovery

Data Preprocessing
Data Mining & Methodology

• Data mining is a process that uses a variety of data analysis tools to
  discover patterns and relationships in data that may be used to make
  valid predictions.
• A generic data mining process methodology:

Data Preprocessing
• The data preprocessing phase requires data understanding for
  preparation tasks.
• It involves transforming raw data into a clean and consistent format
  suitable for analysis.
• It is a crucial phase: data quality directly impacts the accuracy and
  effectiveness of subsequent data mining tasks.

Data Preprocessing

• Data pre-processing requires data understanding for data preparation
  tasks.
• Data preparation includes:
• Data cleaning (e.g., noisy data, inconsistent data formats)
• Data integration (e.g., combine multiple data sources)
• Data transformation (e.g., convert data into suitable formats)
• Data reduction (e.g., selecting relevant attributes)
• Derived/dummy data creation

Why Do We Need To Pre-process the Data?

• Much of the raw data contained in databases is unprocessed,
  incomplete, and noisy.
• For example, the databases may contain:
• Attributes that are obsolete, redundant, or no longer relevant
• Missing values
• Outliers
• Data in a form not suitable for data mining models
• Data with values that are not consistent with policy or common sense

Before Data Preparation

It is essential to understand the following before starting data
preparation:
• Types of data
• Noisy data
• Data sampling
• Data statistics
• Modelling techniques to be used

Missing Data Treatment
Methods to handle the missing values:
• Deletion
• If an attribute contains a lot of missing values, consider removing the attribute
• If only a few examples contain missing values, consider removing those cases/rows
• Imputation
• In a categorical attribute with missing values we can introduce a new category, e.g.
“unknown”.
• Mean/ Mode/ Median Imputation
• Prediction model
• Sophisticated method for handling missing data. Here, we create a predictive model
to estimate values that will substitute the missing data.
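The deletion and imputation options above can be sketched with pandas; the data frame, column names, and values below are hypothetical:

```python
import pandas as pd

# Hypothetical data set with missing values
df = pd.DataFrame({
    "age":  [25.0, None, 47.0, 31.0, None],
    "city": ["KL", "Penang", None, "KL", "KL"],
})

# Mean imputation for a numeric attribute
df["age"] = df["age"].fillna(df["age"].mean())

# For a categorical attribute, introduce a new "unknown" category
df["city"] = df["city"].fillna("unknown")

# Deletion alternative: drop rows that still contain missing values
# df = df.dropna()
```

Mode or median imputation follows the same pattern, substituting `df["age"].mode()[0]` or `df["age"].median()` for the mean.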

Outliers Treatment

• Deleting observations:
• We delete outlier values if they are due to data entry or data
  processing errors, or if the outlier observations are very few in
  number. We can also trim at both ends to remove outliers.
• Transforming variables can also eliminate outliers:
• Taking the natural log of a value reduces the variation caused by
  extreme values.
• Binning is also a form of variable transformation. Decision Tree
  algorithms deal with outliers well because they bin an attribute's
  values.
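A minimal sketch of the log transformation and trimming at both ends; the sample values are made up:

```python
import math

values = [12, 15, 14, 13, 900]  # 900 is an extreme value

# Log transformation compresses the variation caused by the extreme value
logged = [math.log(v) for v in values]

def trim(xs, prop=0.2):
    """Trim at both ends: drop the lowest and highest `prop` share of values."""
    xs = sorted(xs)
    k = int(len(xs) * prop)
    return xs[k:len(xs) - k] if k else xs

trimmed = trim(values)
```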

Data Transformation
• Data transformation consists of several approaches and can
  significantly improve modelling performance.
• Common approaches:
• Data Generalisation
• Aggregation, Binning (Discretization/Binarization)
• Data Normalisation
• Range Transformation
• Z-Transformation
• Log Transformations
• Square Root
• Square
Data Transformation - Aggregation

• Generalization at the attribute level
• Combining two or more attributes into a single attribute
• Purpose
• Data reduction
• Reduce the number of attributes or objects
• Change of scale
• Cities aggregated into regions, states, countries, etc.
• More “stable” data
• Aggregated data tends to have less variability (e.g. Age versus birthdate
with date-month-year)
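The city-to-region change of scale above can be sketched with a pandas group-by; the sales records are hypothetical:

```python
import pandas as pd

# Hypothetical sales records at city level
sales = pd.DataFrame({
    "city":   ["KL", "Penang", "KL", "Ipoh"],
    "region": ["Central", "North", "Central", "North"],
    "amount": [100, 80, 120, 60],
})

# Change of scale: aggregate cities into regions
# (fewer objects, and the aggregated data has less variability)
by_region = sales.groupby("region")["amount"].sum()
```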
Data Transformation - Binning (Grouping)

• Generalization at the value level
• Some algorithms need data in categorical or binary form, so it may be
  necessary to transform a continuous attribute into a categorical
  attribute:
• Discretization
• Transform a continuous attribute into a categorical attribute
• Binarization
• Both continuous & discrete attributes to be transformed into binary
attributes
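Both steps can be sketched with pandas; the ages, bin edges, and labels below are assumptions for illustration:

```python
import pandas as pd

ages = pd.Series([18, 22, 27, 35, 64])

# Discretization: continuous age -> categorical bins
age_group = pd.cut(ages, bins=[0, 25, 40, 100],
                   labels=["young", "adult", "senior"])

# Binarization: each category becomes its own 0/1 attribute
binary = pd.get_dummies(age_group).astype(int)
```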
Data Transformation

Figure: Which point has the larger distance from point A?

Range & Z Transformation

Range Transformation: x'ᵢ = (xᵢ − min(x)) / (max(x) − min(x))
Z-Transformation:     x'ᵢ = (xᵢ − mean(x)) / stdev(x)

Figure: Range Transformation and Z-Transformation
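The two formulas translate directly into code; a small sketch using the sample standard deviation for stdev(x):

```python
def range_transform(xs):
    """Scale values into [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_transform(xs):
    """Centre on the mean and scale by the sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    stdev = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return [(x - mean) / stdev for x in xs]
```

After either transformation, attributes measured on very different scales contribute comparably to distance calculations.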


Input Reduction – Redundancy and Irrelevancy

Figure: scatter plots illustrating redundancy (x2 vs x1) and irrelevancy (x4 vs x3)

• Redundancy: input x2 has the same information as input x1.
• Irrelevancy: input x3 has information that is irrelevant to input x4.
Selection of Attributes (Variable Selection)
• Data sets for analysis may contain hundreds of attributes, many of which may
be irrelevant to the mining task or redundant
• Although it may be possible for a domain expert to pick out some of the useful
attributes, this can be a difficult and time-consuming task
• Leaving out relevant attributes or keeping irrelevant attributes may be
detrimental, causing confusion/bias for the mining algorithm employed.
• Volume of irrelevant or redundant attributes can slow down the mining
process.
• Variable selection reduces the data set size by removing irrelevant or
  redundant attributes (or dimensions).
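One simple way to remove redundant attributes is a pairwise-correlation filter; the data, column names, and 0.95 threshold below are assumptions, and x2 is constructed to duplicate x1's information:

```python
import pandas as pd

# Hypothetical inputs: x2 is redundant (perfectly correlated with x1)
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],
    "x3": [5, 3, 8, 1, 9],
})

corr = df.corr().abs()
threshold = 0.95  # assumed cut-off for "redundant"
to_drop = set()
cols = list(df.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        # Keep the first attribute of a highly correlated pair, drop the other
        if a not in to_drop and b not in to_drop and corr.loc[a, b] > threshold:
            to_drop.add(b)

reduced = df.drop(columns=sorted(to_drop))
```

Correlation only detects linear redundancy; irrelevancy to the target is usually assessed separately, e.g. with a model-based importance measure.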

Attribute Creation
• A process to generate new attributes based on existing attribute(s).
• For example, given a date (dd-mm-yy) input variable, we can generate
  new variables such as day, month, year, week, and weekday that may
  have a better relationship with the target variable. This step
  highlights hidden relationships in a variable:
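The date example can be sketched with pandas; the column name and dates are hypothetical, and the format string assumes dd-mm-yyyy input:

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["03-01-2024", "15-06-2024"]})
df["order_date"] = pd.to_datetime(df["order_date"], format="%d-%m-%Y")

# Derive new attributes that may relate better to the target variable
df["day"]     = df["order_date"].dt.day
df["month"]   = df["order_date"].dt.month
df["year"]    = df["order_date"].dt.year
df["weekday"] = df["order_date"].dt.day_name()
```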

Attribute Creation Methods
• Creating derived attributes:
• This refers to creating new attributes from existing attribute(s)
  using a set of functions or different methods.
• Methods such as taking the log of attribute values, binning
  attributes, and other transformation methods can also be used to
  create new attributes.
• Creating dummy attributes:
• The most common application of dummy attributes is to convert a
  categorical variable into numerical variables.
• Dummy attributes are also called Indicator Variables.
• This is useful for taking a categorical variable as a predictor in
  statistical models. Each dummy attribute takes the values 0 and 1.
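Dummy attribute creation is a one-liner in pandas; the colour column is a hypothetical example:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["Green", "Red", "Yellow", "Red"]})

# One 0/1 indicator variable per category
dummies = pd.get_dummies(df["colour"], prefix="v").astype(int)
```

Each row has exactly one indicator set to 1, so a model can use the categorical variable as numeric predictors.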

Data Creation and Transformation
Existing Data Type → New Data Type | Methods | Example

• Nominal (Categorical) → Numerical | Dummy attribute creation
  If the existing variable is not multi-valued, replace the existing value with a number (NOTE: this might create a misleading meaning for the modelling).
  If the existing variable is multi-valued, dummy variable creation is required, e.g. {"Green", "Red", "Yellow"} to dummy variables:
  v_green: 1 if Green, else 0; v_red: 1 if Red, else 0; v_yellow: 1 if Yellow, else 0.
• Ordinal (Categorical) → Numerical | Derived attribute creation
  {"Poor", "Average", "Good"} to derived variable values {1, 2, 3} based on their rank.
• Numerical → Numerical | Binning/Aggregation/Normalization
  Performance marks {0-100} to CGPA points {0-4}; transform yearly salary using log.
• Numerical → Nominal (Ordinal) | Binning/Aggregation
  Age numbers grouped into a derived variable with age ranges, e.g. "18-25", "26-30".
  Performance score {1, 2, 3, 4, 5} discretized into three groups {"Poor", "Average", "Good"}.
• Numerical → Nominal (Categorical) | Derived attribute creation
  Acceptance choice {0, 1} to {"Yes", "No"}.
• Nominal (Categorical) → Ordinal (Categorical)
  NOTE: This transformation rarely happens because it does not produce meaningful or useful derived values.
• Ordinal (Categorical) → Ordinal/Nominal (Categorical) | Binning
  Workload level {"L1", "L2", "L3", "L4", "L5"} discretized into three groups {"Light", "Average", "Heavy"}.
• Nominal (Categorical) → Nominal (Categorical) | Binning/Aggregation
  {"Light Blue", "Blue", "Dark Blue", "Light Red", "Red", "Dark Red"} to derived variable values {"Blue", "Red"}.
Summary of Data Preparation Methods
• Missing Values treatment (treatment to avoid data exclusion or bias)
1. Deletion
2. Imputation
3. Prediction Model
• Outliers (treatment to avoid scale problem)
1. Deletion
2. Transformation (Generalization/Normalization)
• Selection of attributes (another way to reduce dimensionality of data to minimize bias)
1. Delete irrelevant/duplicate data
2. Select useful attributes for modelling
• Attribute/Data Creation (new attributes that can capture the important information in a data set
much more efficiently than the original attributes)
1. Derived attributes
2. Dummy attributes
