
DATA PRE-PROCESSING


 Preprocessing refers to the steps and techniques used to prepare raw data for analysis or
modeling.

 It is a crucial step in the data science and machine learning workflow.

 Preprocessing aims to clean, transform, and organize the data into a suitable format for
further processing.

Common preprocessing tasks include:


1. Data Cleaning:
 This involves handling missing values and removing irrelevant data.

 Missing values might be imputed or removed based on the context and the impact on the
analysis.

 Missing values may appear as blanks, placeholder strings, or sentinel values such as ‘0’. They can be handled by imputing a statistic such as the mean, or by replacing or removing the affected records.

 Blank entries are sometimes filled with a default value such as ‘0’.

 Columns that mix alphabetic strings into otherwise numeric data must be encoded as numbers, for example as binary values 0 and 1.
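Mean imputation, described above, can be sketched in plain Python. The `ages` list and `None` as the missing-value marker are invented for illustration:

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 30, 35, None]
print(impute_mean(ages))  # missing ages replaced by the mean of 25, 30, 35
```

In real datasets the missing marker might instead be an empty string, NaN, or a sentinel like 0, so the filter condition would change accordingly.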

2. Transformation:

 Data transformation involves converting variables into a suitable format for analysis.

This can include –

• Normalization - In this method, the data is transformed to a specific range, usually between 0 and 1.

The formula for normalization is:

X_norm = (X − X_min) / (X_max − X_min)
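Min-max normalization can be sketched in a few lines of plain Python (the sample values are illustrative):

```python
def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30, 40, 50]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Note that a constant column (where max equals min) would divide by zero and needs special handling.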

• Standardization (Z-Score Scaling) - Standardization transforms the data to have a mean of 0 and a standard deviation of 1.

The formula for standardization is:

Z = (X − μ) / σ

where μ is the mean and σ is the standard deviation of the data.
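A minimal standardization sketch using the standard library (sample values are illustrative; this uses the population standard deviation):

```python
from statistics import mean, pstdev

def standardize(values):
    """Transform values to zero mean and unit standard deviation (z-scores)."""
    mu, sigma = mean(values), pstdev(values)  # population mean and std. dev.
    return [(v - mu) / sigma for v in values]

z = standardize([10, 20, 30, 40, 50])
print(z)  # resulting z-scores have mean 0 and standard deviation 1
```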


• Scaling (Robust Scaling) - Robust scaling is similar to standardization but uses the median and interquartile range (IQR) instead of the mean and standard deviation. This makes it more resistant to the influence of outliers.

The formula for robust scaling is:

X_scaled = (X − median) / IQR
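Robust scaling can be sketched in plain Python. Quartile conventions vary between libraries; this sketch uses the median-of-halves convention, and the sample values are invented:

```python
from statistics import median

def robust_scale(values):
    """Center on the median and scale by the interquartile range (IQR)."""
    med = median(values)
    s = sorted(values)
    q1 = median(s[: len(s) // 2])        # lower half -> first quartile
    q3 = median(s[(len(s) + 1) // 2 :])  # upper half -> third quartile
    return [(v - med) / (q3 - q1) for v in values]

print(robust_scale([10, 20, 30, 40, 50]))  # median maps to 0.0
```

Because the median and IQR barely move when one extreme value is added, a single outlier distorts these scaled values far less than it would distort z-scores.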

3. Data Integration

 Data integration is a crucial step in the data management process that involves combining and unifying
data from various sources into a single, coherent, and organized view.

 The goal of data integration is to provide a unified and comprehensive view of data, making it easier to
analyze, report on, and derive insights from the information contained in disparate datasets.
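In practice, integration is often done with database joins or tools such as pandas; the idea can be sketched in pure Python by joining two sources on a shared key (the records below are invented):

```python
# Two separate sources describing the same entities via a customer_id key.
customers = [{"customer_id": 1, "name": "Ana"}, {"customer_id": 2, "name": "Ben"}]
orders = [{"customer_id": 1, "total": 99.0}, {"customer_id": 2, "total": 42.5}]

# Index one source by key, then merge each order with its matching customer.
by_id = {c["customer_id"]: c for c in customers}
integrated = [{**by_id[o["customer_id"]], **o} for o in orders]
print(integrated)  # each record now carries fields from both sources
```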

4. Data Reduction

 Data reduction is a crucial technique in data analysis and data mining that involves reducing the volume of data while producing the same or similar analytical results.

 The primary goal of data reduction is to simplify the data while retaining the essential information and
patterns, which can be beneficial for various purposes, including improving efficiency, speeding up
algorithms, reducing storage requirements, and gaining a better understanding of the data.

Techniques for Data Reduction:

(a) Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the number of
variables or features in the dataset while preserving as much relevant information as possible.
Two common strategies for attribute selection are:
1. Stepwise forward selection - start with an empty attribute set and repeatedly add the most significant remaining attribute.
2. Stepwise backward elimination - start with the full attribute set and repeatedly remove the least significant attribute.
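Stepwise forward selection can be sketched as a greedy loop in plain Python. The `relevance` scores and `toy_score` function below are invented for illustration; in practice the score would be something like a model's validation accuracy on the candidate attribute subset:

```python
def forward_selection(features, score):
    """Greedy stepwise forward selection: start empty, repeatedly add the
    feature that most improves the score; stop when no addition helps."""
    selected, best = [], score([])
    remaining = list(features)
    while remaining:
        top_score, top_f = max((score(selected + [f]), f) for f in remaining)
        if top_score <= best:
            break                 # no remaining feature improves the score
        selected.append(top_f)
        remaining.remove(top_f)
        best = top_score
    return selected

# Toy score: total feature relevance minus a small penalty per extra feature.
relevance = {"age": 0.9, "income": 0.7, "noise": 0.05}
toy_score = lambda subset: sum(relevance[f] for f in subset) - 0.1 * len(subset)
print(forward_selection(relevance, toy_score))  # ['age', 'income']
```

Stepwise backward elimination is the mirror image: start from the full set and repeatedly drop the feature whose removal most improves (or least hurts) the score.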

(b) Data Compression:

 Data compression is the process of reducing the size of data files or streams while preserving as much of
the original information as possible.

 It is widely used in various applications to save storage space, reduce transmission time over networks,
and improve overall system efficiency.
There are two primary types of data compression:
1. Lossless - the original data can be reconstructed exactly (e.g., ZIP, PNG).
2. Lossy - some information is permanently discarded in exchange for smaller sizes (e.g., JPEG, MP3).
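Lossless compression is easy to demonstrate with Python's standard-library zlib module (the sample byte string is invented and deliberately repetitive):

```python
import zlib

text = b"data preprocessing " * 200      # highly repetitive, so it compresses well
packed = zlib.compress(text)
print(len(text), "->", len(packed))      # the compressed form is far smaller
assert zlib.decompress(packed) == text   # lossless: the original is recovered exactly
```

Lossy codecs trade this exact-recovery guarantee for much higher compression ratios on media such as images and audio.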

(c) Numerosity Reduction:

 It is a data mining technique used to reduce the number of data points in a dataset while retaining its
essential characteristics and patterns.

 The goal of numerosity reduction is to simplify complex datasets by representing them with a smaller set of
representative data points or summary statistics.

 This reduction in data volume can make it more manageable for analysis, visualization, and model building
while still preserving meaningful information.
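One standard numerosity-reduction technique is simple random sampling; a stdlib-only sketch (the data and sample size are invented):

```python
import random

random.seed(0)                    # fixed seed so the sketch is reproducible
data = list(range(1000))          # stand-in for a large dataset
sample = random.sample(data, 50)  # keep a 5% simple random sample, no replacement
print(len(sample))                # 50 points now represent the full dataset
```

Other numerosity-reduction approaches replace raw points with summaries instead, such as histograms, cluster representatives, or fitted model parameters.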

(d) Discretization

 Discretization is a data preprocessing technique used in data mining and machine learning to convert
continuous (numerical) data into discrete (categorical) intervals or bins.

 It involves grouping data points into specific ranges or categories based on their values.

 Discretization is primarily used for several reasons, including simplifying data analysis, improving model
performance, and addressing certain algorithms' requirements.
