
BIS2216/DBS1214

Data Mining & Knowledge Discovery

Data Preprocessing
Data Mining & Methodology

• Data mining is a process that uses a variety of data analysis tools to
  discover patterns and relationships in data that may be used to make
  valid predictions.
• A generic data mining process methodology:

Data Preprocessing
• The data preprocessing phase requires data understanding for
  preparation tasks.
• It involves transforming raw data into a clean and consistent format
  suitable for analysis.
• It is a crucial phase: data quality directly impacts the accuracy and
  effectiveness of subsequent data mining tasks.

Data Preprocessing

• Data pre-processing requires data understanding for data preparation
  tasks.
• Data preparation includes:
• Data cleaning (e.g., noisy data, inconsistent data formats)
• Data integration (e.g., combine multiple data sources)
• Data transformation (e.g., convert data into suitable formats)
• Data reduction (e.g., selecting relevant attributes)
• Derived/dummy data creation

Why Do We Need To Pre-process the Data?

• Much of the raw data contained in databases is unprocessed,
  incomplete, and noisy.
• For example, the databases may contain:
• Attributes that are obsolete, redundant, or no longer relevant
• Missing values
• Outliers
• Data in a form not suitable for data mining models
• Data with values that are not consistent with policy or common sense

Before Data Preparation

It is essential to understand the following before starting data
preparation:
• Types of data
• Noisy data
• Data sampling
• Data statistics
• Modelling techniques to be used

Missing Data Treatment
Methods to handle the missing values:
• Deletion
• If an attribute contains a lot of missing values, consider removing the attribute
• If only a few examples contain missing values, consider removing those cases/rows
• Imputation
• In a categorical attribute with missing values we can introduce a new category, e.g.
“unknown”.
• Mean/ Mode/ Median Imputation
• Prediction model
• Sophisticated method for handling missing data. Here, we create a predictive model
to estimate values that will substitute the missing data.
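The deletion and imputation options above can be sketched with pandas; the data frame, column names, and values below are hypothetical:

```python
import pandas as pd

# Hypothetical data set with missing values
df = pd.DataFrame({
    "age":  [25.0, None, 47.0, 31.0, None],
    "city": ["KL", "Penang", None, "KL", "KL"],
})

# Mean imputation for a numeric attribute
df["age"] = df["age"].fillna(df["age"].mean())

# For a categorical attribute, introduce a new "unknown" category
df["city"] = df["city"].fillna("unknown")

# Deletion alternative: drop rows that still contain missing values
# df = df.dropna()
```

Mode or median imputation follows the same pattern, substituting `df["age"].mode()[0]` or `df["age"].median()` for the mean.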

Outliers Treatment

• Deleting observations:
• We delete outlier values if they are due to data entry or data
  processing errors, or if the outlier observations are very few in
  number. We can also trim at both ends to remove outliers.
• Transforming variables can also eliminate outliers:
• Taking the natural log of a value reduces the variation caused by
  extreme values.
• Binning is also a form of variable transformation. Decision Tree
  algorithms deal with outliers well because they bin an attribute's
  values.
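A minimal sketch of the log transformation and trimming at both ends; the sample values are made up:

```python
import math

values = [12, 15, 14, 13, 900]  # 900 is an extreme value

# Log transformation compresses the variation caused by the extreme value
logged = [math.log(v) for v in values]

def trim(xs, prop=0.2):
    """Trim at both ends: drop the lowest and highest `prop` share of values."""
    xs = sorted(xs)
    k = int(len(xs) * prop)
    return xs[k:len(xs) - k] if k else xs

trimmed = trim(values)
```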

Data Transformation
• Data transformation consists of several approaches and can
  significantly improve modelling performance.
• Common approaches:
• Data Generalisation
• Aggregation, Binning (Discretization/Binarization)
• Data Normalisation
• Range Transformation
• Z-Transformation
• Log Transformations
• Square Root
• Square
Data Transformation - Aggregation

• Generalization at the attribute level
• Combining two or more attributes into a single attribute
• Purpose
• Data reduction
• Reduce the number of attributes or objects
• Change of scale
• Cities aggregated into regions, states, countries, etc.
• More “stable” data
• Aggregated data tends to have less variability (e.g. Age versus birthdate
with date-month-year)
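The city-to-region change of scale above can be sketched with a pandas group-by; the sales records are hypothetical:

```python
import pandas as pd

# Hypothetical sales records at city level
sales = pd.DataFrame({
    "city":   ["KL", "Penang", "KL", "Ipoh"],
    "region": ["Central", "North", "Central", "North"],
    "amount": [100, 80, 120, 60],
})

# Change of scale: aggregate cities into regions
# (fewer objects, and the aggregated data has less variability)
by_region = sales.groupby("region")["amount"].sum()
```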
Data Transformation - Binning (Grouping)

• Generalization at the value level
• Some algorithms need data in categorical or binary form, so it may be
  necessary to transform a continuous attribute into a categorical
  attribute:
• Discretization
• Transform a continuous attribute into a categorical attribute
• Binarization
• Both continuous & discrete attributes to be transformed into binary
attributes
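Both steps can be sketched with pandas; the ages, bin edges, and labels below are assumptions for illustration:

```python
import pandas as pd

ages = pd.Series([18, 22, 27, 35, 64])

# Discretization: continuous age -> categorical bins
age_group = pd.cut(ages, bins=[0, 25, 40, 100],
                   labels=["young", "adult", "senior"])

# Binarization: each category becomes its own 0/1 attribute
binary = pd.get_dummies(age_group).astype(int)
```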
Data Transformation

Figure: Which point has the larger distance from point A?

Range & Z Transformation

Range Transformation: x'ᵢ = (xᵢ − min(x)) / (max(x) − min(x))
Z-Transformation:     x'ᵢ = (xᵢ − mean(x)) / stdev(x)

Figure: Range Transformation and Z-Transformation
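The two formulas translate directly into code; a small sketch using the sample standard deviation for stdev(x):

```python
def range_transform(xs):
    """Scale values into [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_transform(xs):
    """Centre on the mean and scale by the sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    stdev = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return [(x - mean) / stdev for x in xs]
```

After either transformation, attributes measured on very different scales contribute comparably to distance calculations.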


Input Reduction – Redundancy and Irrelevancy

Figure: scatter plots illustrating redundancy (x2 vs x1) and irrelevancy (x4 vs x3)

• Redundancy: input x2 has the same information as input x1.
• Irrelevancy: input x3 has information that is irrelevant to input x4.
Selection of Attributes (Variable Selection)
• Data sets for analysis may contain hundreds of attributes, many of which may
be irrelevant to the mining task or redundant
• Although it may be possible for a domain expert to pick out some of the useful
attributes, this can be a difficult and time-consuming task
• Leaving out relevant attributes or keeping irrelevant attributes may be
detrimental, causing confusion/bias for the mining algorithm employed.
• Volume of irrelevant or redundant attributes can slow down the mining
process.
• Variable selection reduces the data set size by removing irrelevant or
  redundant attributes (or dimensions).
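One simple way to remove redundant attributes is a pairwise-correlation filter; the data, column names, and 0.95 threshold below are assumptions, and x2 is constructed to duplicate x1's information:

```python
import pandas as pd

# Hypothetical inputs: x2 is redundant (perfectly correlated with x1)
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],
    "x3": [5, 3, 8, 1, 9],
})

corr = df.corr().abs()
threshold = 0.95  # assumed cut-off for "redundant"
to_drop = set()
cols = list(df.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        # Keep the first attribute of a highly correlated pair, drop the other
        if a not in to_drop and b not in to_drop and corr.loc[a, b] > threshold:
            to_drop.add(b)

reduced = df.drop(columns=sorted(to_drop))
```

Correlation only detects linear redundancy; irrelevancy to the target is usually assessed separately, e.g. with a model-based importance measure.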

Attribute Creation
• A process to generate new attributes based on existing attribute(s).
• For example, given a date (dd-mm-yy) input variable, we can generate
  new variables such as day, month, year, week, and weekday that may
  have a better relationship with the target variable. This step
  highlights hidden relationships in a variable:
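The date example can be sketched with pandas; the column name and dates are hypothetical, and the format string assumes dd-mm-yyyy input:

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["03-01-2024", "15-06-2024"]})
df["order_date"] = pd.to_datetime(df["order_date"], format="%d-%m-%Y")

# Derive new attributes that may relate better to the target variable
df["day"]     = df["order_date"].dt.day
df["month"]   = df["order_date"].dt.month
df["year"]    = df["order_date"].dt.year
df["weekday"] = df["order_date"].dt.day_name()
```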

Attribute Creation Methods
• Creating derived attributes:
• This refers to creating new attributes from existing attribute(s)
  using a set of functions or different methods.
• Methods such as taking the log of attribute values, binning
  attributes, and other transformation methods can also be used to
  create new attributes.
• Creating dummy attributes:
• The most common application of dummy attributes is to convert a
  categorical variable into numerical variables.
• Dummy attributes are also called Indicator Variables.
• This is useful for taking a categorical variable as a predictor in
  statistical models. Each dummy attribute takes the values 0 and 1.
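Dummy attribute creation is a one-liner in pandas; the colour column is a hypothetical example:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["Green", "Red", "Yellow", "Red"]})

# One 0/1 indicator variable per category
dummies = pd.get_dummies(df["colour"], prefix="v").astype(int)
```

Each row has exactly one indicator set to 1, so a model can use the categorical variable as numeric predictors.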

Data Creation and Transformation
Existing Data Type → New Data Type | Methods | Example

• Nominal (Categorical) → Numerical | Dummy attribute creation
  If the existing variable is not multi-valued, replace the existing value with a number (NOTE: this might create a misleading meaning for the modelling).
  If the existing variable is multi-valued, dummy variable creation is required, e.g. {"Green", "Red", "Yellow"} to dummy variables:
  v_green: 1 if Green, else 0; v_red: 1 if Red, else 0; v_yellow: 1 if Yellow, else 0.
• Ordinal (Categorical) → Numerical | Derived attribute creation
  {"Poor", "Average", "Good"} to derived variable values {1, 2, 3} based on their rank.
• Numerical → Numerical | Binning/Aggregation/Normalization
  Performance marks {0-100} to CGPA points {0-4}; transform yearly salary using log.
• Numerical → Nominal (Ordinal) | Binning/Aggregation
  Age numbers grouped into a derived variable with age ranges, e.g. "18-25", "26-30".
  Performance score {1, 2, 3, 4, 5} discretized into three groups {"Poor", "Average", "Good"}.
• Numerical → Nominal (Categorical) | Derived attribute creation
  Acceptance choice {0, 1} to {"Yes", "No"}.
• Nominal (Categorical) → Ordinal (Categorical)
  NOTE: This transformation rarely happens because it does not produce meaningful or useful derived values.
• Ordinal (Categorical) → Ordinal/Nominal (Categorical) | Binning
  Workload level {"L1", "L2", "L3", "L4", "L5"} discretized into three groups {"Light", "Average", "Heavy"}.
• Nominal (Categorical) → Nominal (Categorical) | Binning/Aggregation
  {"Light Blue", "Blue", "Dark Blue", "Light Red", "Red", "Dark Red"} to derived variable values {"Blue", "Red"}.
Summary of Data Preparation Methods
• Missing Values treatment (treatment to avoid data exclusion or bias)
1. Deletion
2. Imputation
3. Prediction Model
• Outliers (treatment to avoid scale problem)
1. Deletion
2. Transformation (Generalization/Normalization)
• Selection of attributes (another way to reduce dimensionality of data to minimize bias)
1. Delete irrelevant/duplicate data
2. Select useful attributes for modelling
• Attribute/Data Creation (new attributes that can capture the important information in a data set
much more efficiently than the original attributes)
1. Derived attributes
2. Dummy attributes
