
MODULE 2

What is Data Pre-processing? 

Data pre-processing is the process of transforming raw data into an understandable format.
It is an essential step in data mining, since we cannot work with raw data directly. The quality of
the data should be checked before applying machine learning or data mining algorithms.

Why is Data Pre-processing Important?

Data pre-processing is mainly concerned with checking the quality of the data. Quality can be
assessed along the following dimensions:

● Accuracy: To check whether the data entered is correct or not.
● Completeness: To check whether all required data is recorded and available.
● Consistency: To check whether the same data kept in different places matches.
● Timeliness: The data should be up to date.
● Believability: The data should be trustworthy.
● Interpretability: The data should be easy to understand.

Major Tasks in Data Pre-processing

There are 4 major tasks in data pre-processing – Data cleaning, Data integration, Data
reduction, and Data transformation.
Data Cleaning
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data
from the dataset; it also involves replacing missing values. Here are some techniques for data
cleaning:

Handling Missing Values

● Standard values like “Not Available” or “NA” can be used to replace the missing values.
● Missing values can also be filled in manually, but this is not recommended when the dataset is
big.
● The attribute’s mean value can be used to replace the missing value when the data is
normally distributed; for a non-normal (skewed) distribution, the attribute’s median value can
be used instead.
● Regression or decision tree algorithms can also be used to predict and fill in the most
probable value (see the sketch below).
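As a brief illustration of these strategies, the sketch below fills missing values with a constant, the mean, and the median using pandas; the small DataFrame and its column names ("age", "income") are invented for the example.

```python
# A minimal sketch of the imputation strategies above, using pandas.
# The DataFrame and its column names are made up for illustration.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [23, 25, np.nan, 31, 29, np.nan],
    "income": [40000, np.nan, 52000, 61000, np.nan, 45000],
})

# Replace missing values with a standard constant (e.g., -1 or "NA")
df_const = df.fillna(-1)

# Mean imputation (reasonable when the attribute is roughly normally distributed)
df_mean = df.fillna(df.mean(numeric_only=True))

# Median imputation (more robust when the distribution is skewed)
df_median = df.fillna(df.median(numeric_only=True))

print(df_mean)
```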

Handling Noisy Data


Noisy data generally means data containing random errors or unnecessary data points. Handling noisy
data is an important step, as it improves the quality of the model we are
building. Here are some of the methods to handle noisy data.

● Binning: This method is used to smooth noisy data. First the data is sorted, and then the
sorted values are divided and stored in the form of bins. There are three methods for
smoothing the data in a bin (see the sketch after this list). Smoothing by bin means: each
value in the bin is replaced by the mean value of the bin. Smoothing by bin medians: each
value in the bin is replaced by the median value of the bin. Smoothing by bin boundaries:
the minimum and maximum values of the bin are taken as the bin boundaries, and each
value is replaced by the closest boundary value.
● Regression: This is used to smooth the data and helps handle data containing unnecessary
points. For analysis purposes, regression also helps to decide which variables are suitable
for the analysis.
● Clustering: This is used for finding outliers and for grouping the data. Clustering is
generally used in unsupervised learning.
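The following sketch shows smoothing by bin means and by bin boundaries on a small sorted list of values; the numbers and the choice of three equal-frequency bins are illustrative assumptions.

```python
# Sketch of equal-frequency binning with smoothing by bin means and by bin
# boundaries; the data values are invented for illustration.
import numpy as np

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
data.sort()
bins = np.array_split(data, 3)          # three equal-frequency bins

# Smoothing by bin means: every value becomes its bin's mean
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of the bin's min/max
by_bounds = [
    np.where(np.abs(b - b.min()) <= np.abs(b - b.max()), b.min(), b.max())
    for b in bins
]

print([list(b) for b in by_means])
print([list(b) for b in by_bounds])
```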
Data Integration
Data integration is the process of combining data from multiple sources into a single dataset. It
is one of the main components of data management. There are some problems to be
considered during data integration.

● Schema integration: Integrates metadata(a set of data that describes other data) from
different sources.
● Entity identification problem: Identifying entities across multiple databases. For example, the
system or the user should recognize that the student ID in one database and the student name
in another database refer to the same entity.
● Detecting and resolving data value conflicts: The data taken from different databases may
differ when merged. The attribute values from one database may differ from those in another
database. For example, the date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY” (a
small sketch of this is shown below).
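A minimal, hypothetical sketch of these two issues with pandas: "student_id" and "stud_no" are assumed column names referring to the same entity, and the two sources use different date formats that are reconciled after the merge.

```python
# Hypothetical illustration of the entity-identification problem and a
# data-value conflict (differing date formats) during integration.
import pandas as pd

enrol = pd.DataFrame({"student_id": [1, 2],
                      "enrol_date": ["01/31/2024", "02/15/2024"]})   # MM/DD/YYYY
marks = pd.DataFrame({"stud_no": [1, 2],
                      "exam_date": ["31/01/2024", "15/02/2024"],      # DD/MM/YYYY
                      "score": [81, 74]})

# Resolve the entity/schema mismatch by renaming, then merge into one unified view
merged = enrol.merge(marks.rename(columns={"stud_no": "student_id"}), on="student_id")

# Resolve the data-value conflict by parsing both date columns to one format
merged["enrol_date"] = pd.to_datetime(merged["enrol_date"], format="%m/%d/%Y")
merged["exam_date"] = pd.to_datetime(merged["exam_date"], format="%d/%m/%Y")
print(merged)
```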

Data Reduction
This process helps in the reduction of the volume of the data, which makes the analysis easier
yet produces the same or almost the same result. This reduction also helps to reduce storage
space. Some of the data reduction techniques are dimensionality reduction, numerosity
reduction, and data compression.

● Dimensionality reduction: This process is necessary for real-world applications, as the data
size can be big. In this process, the number of random variables or attributes is reduced so that
the dimensionality of the data set decreases; attributes are combined and merged without
losing the original characteristics of the data. This also reduces storage space and
computation time. When the data is highly dimensional, a problem called the “curse of
dimensionality” occurs (a PCA sketch follows this list).
● Numerosity Reduction: In this method, the representation of the data is made smaller by
reducing the volume. There will not be any loss of data in this reduction.
● Data compression: Storing data in a compressed form is called data compression. The
compression can be lossless or lossy. When there is no loss of information during
compression, it is called lossless compression, whereas lossy compression reduces the
information but removes only unnecessary information.
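As a hedged example of dimensionality reduction, the sketch below applies PCA from scikit-learn to a synthetic matrix with redundant attributes; the data, the number of components, and the random seed are all assumptions made for illustration.

```python
# A minimal sketch of dimensionality reduction with PCA (scikit-learn).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                          # 100 samples, 10 attributes
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(100, 5))   # make half the attributes redundant

pca = PCA(n_components=2)               # keep only 2 derived attributes
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())
```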

Data Transformation
The change made in the format or the structure of the data is called data transformation. This
step can be simple or complex based on the requirements. There are some methods for data
transformation.

● Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps
reveal the important features of the dataset. Smoothing also makes small changes in the data
easier to detect, which helps in prediction.
● Aggregation: In this method, the data is stored and presented in the form of a summary. Data
coming from multiple sources is integrated and described together for data analysis. This is an
important step, since the accuracy of the results depends on the quantity and quality of the
data; when both are good, the results are more relevant.
● Discretization: Continuous data is split into intervals, which reduces the data size. For
example, rather than specifying the exact class time, we can use an interval such as (3 pm-5
pm) or (6 pm-8 pm).
● Normalization: This is the method of scaling the data so that it can be represented in a
smaller range, for example from -1.0 to 1.0 (a short sketch of discretization and normalization
follows this list).
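The short sketch below illustrates discretization (continuous values mapped to intervals) and min-max style normalization to the range [-1.0, 1.0]; the sample values, bin edges, and labels are invented for the example.

```python
# Sketch of discretization (continuous -> intervals) and scaling to [-1, 1].
# The class-time intervals and score values are illustrative assumptions.
import pandas as pd

times = pd.Series([15.5, 16.25, 18.0, 19.75, 17.0])      # class start times (24h clock)
intervals = pd.cut(times, bins=[15, 17, 19, 21],
                   labels=["3pm-5pm", "5pm-7pm", "7pm-9pm"])

scores = pd.Series([12, 40, 55, 73, 98])
normalized = 2 * (scores - scores.min()) / (scores.max() - scores.min()) - 1  # scale to [-1, 1]

print(intervals.tolist())
print(normalized.round(3).tolist())
```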

Data collection
Collecting data for training the ML model is the basic step in the machine learning pipeline.
The predictions made by ML systems can only be as good as the data on which they have
been trained. Following are some of the problems that can arise in data collection:
● Inaccurate data. The collected data could be unrelated to the problem statement.
● Missing data. Sub-data could be missing. That could take the form of empty values in
columns or missing images for some class of prediction.
● Data imbalance. Some classes or categories in the data may have a
disproportionately high or low number of corresponding samples. As a result, they
risk being under-represented in the model.
● Data bias. Depending on how the data, subjects and labels themselves are chosen,
the model could propagate inherent biases on gender, politics, age or region, for
example. Data bias is difficult to detect and remove.
Several techniques can be applied to address those problems:
● Pre-cleaned, freely available datasets. If the problem statement (for example, image
classification, object recognition) aligns with a clean, pre-existing, properly
formulated dataset, then take advantage of existing, open-source expertise.
● Web crawling and scraping. Automated tools, bots and headless browsers can crawl
and scrape websites for data.
● Private data. ML engineers can create their own data. This is helpful when the
amount of data required to train the model is small and the problem statement is too
specific to generalize over an open-source dataset.
● Custom data. Agencies can create or crowdsource the data for a fee.

Data pre-processing
Real-world raw data and images are often incomplete, inconsistent and lacking in certain
behaviours or trends. They are also likely to contain many errors. So, once collected, they are
pre-processed into a format the machine learning algorithm can use for the model.
Pre-processing includes a number of techniques and actions:
● Data cleaning. These techniques, manual and automated, remove data incorrectly
added or classified.
● Data imputations. Most ML frameworks include methods and APIs for balancing or
filling in missing data. Techniques generally include imputing missing values with
standard deviation, mean, median and k-nearest neighbours (k-NN) of the data in the
given field.
● Oversampling. Bias or imbalance in the dataset can be corrected by generating more
observations/samples with methods like repetition, bootstrapping or Synthetic
Minority Over-Sampling Technique (SMOTE), and then adding them to the
under-represented classes (see the sketch after this list).
● Data integration. Combining multiple datasets to get a large corpus can overcome
incompleteness in a single dataset.
● Data normalization. The size of a dataset affects the memory and processing required
for iterations during training. Normalization reduces the size by reducing the order
and magnitude of data.
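As a sketch of the oversampling idea mentioned above, the example below balances a synthetic, imbalanced dataset with SMOTE from the imbalanced-learn library; the 90/10 class split and the dataset itself are assumptions.

```python
# Hedged sketch of correcting class imbalance with SMOTE (imbalanced-learn).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)  # synthesize minority samples
print("after: ", Counter(y_res))
```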

I. DATA INTEGRATION:

Data integration is one of the steps of data pre-processing that involves combining data
residing in different sources and providing users with a unified view of these data.
• It merges the data from multiple data stores (data sources)
• It includes multiple databases, data cubes or flat files.
• Metadata, Correlation analysis, data conflict detection, and resolution of semantic
heterogeneity contribute towards smooth data integration.
• There are two major approaches for data integration, commonly known as the "tight
coupling approach" and the "loose coupling approach".
Tight Coupling
o Here data is pulled over from different sources into a single physical location through the
process of ETL - Extraction, Transformation and Loading.
o The single physical location provides a uniform interface for querying the data.
o The ETL layer helps to map the data from the sources so as to provide a uniform data
warehouse. This approach is called tight coupling, since the data is tightly coupled with the
physical repository at the time of query.
ADVANTAGES:
1. Independence (less dependency on the source systems, since data is physically copied over)
2. Faster query processing
3. Complex query processing
4. Advanced data summarization and storage possible
5. High-volume data processing
DISADVANTAGES:
1. Latency (since data needs to be loaded using ETL)
2. Costlier (data localization, infrastructure, security)
Loose Coupling
o Here a virtual mediated schema provides an interface that takes the query from the user,
transforms it in a way the source database can understand and then sends the query directly
to the source databases to obtain the result.
o In this approach, the data only remains in the actual source databases.
o However, mediated schema contains several "adapters" or "wrappers" that can connect
back to the source systems in order to bring the data to the front end.
ADVANTAGES:
1. Data freshness (low latency - almost real time)
2. Higher agility (when a new source system comes in or an existing source system changes,
only the corresponding adapter is created or changed, largely not affecting the other parts of
the system)
3. Lower cost (a lot of infrastructure cost can be saved, since data localization is not required)
DISADVANTAGES:
1. Semantic conflicts
2. Slower query response
3. High dependency on the data sources
For example, let's imagine that an electronics company is preparing to roll out a new mobile
device. The marketing department might want to retrieve customer information from a sales
department database and compare it to information from the product department to create
a targeted sales list. A good data integration system would let the marketing department
view information from both sources in a unified way, leaving out any information that didn't
apply to the search.

II. DATA TRANSFORMATION:


In data mining pre-processing, and especially in metadata and data warehousing, we use data
transformation in order to convert data from a source data format into the destination data format.
We can divide data transformation into 2 steps:
• Data Mapping:
It maps the data elements from the source to the destination and captures any
transformation that must occur.
• Code Generation:
It creates the actual transformation program.
Data transformation:
• Here the data are transformed or consolidated into forms appropriate for mining.
• Data transformation can involve the following:
➢ Smoothing:
• It works to remove noise from the data.
• It is a form of data cleaning where users specify transformations to correct data
inconsistencies.
• Such techniques include binning, regression, and clustering.
➢ Aggregation:
• Here summary or aggregation operations are applied to the data.
• This step is typically used in constructing a data cube for analysis of the data at multiple
granularities.
• Aggregation is a form of data reduction.
➢ Generalization:
• Here low-level or “primitive” (raw) data are replaced by higher-level concepts through the
use of concept hierarchies.
• For example, attributes, like age, may be mapped to higher-level concepts, like youth,
middle-aged, and senior.
• Generalization is a form of data reduction.
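A small sketch of generalization through a concept hierarchy: raw ages are replaced by the higher-level concepts named above. The cut points (29 and 59) are assumptions chosen for illustration, not values from this text.

```python
# Sketch of generalization via a concept hierarchy: raw ages -> higher-level concepts.
import pandas as pd

ages = pd.Series([19, 26, 35, 47, 58, 66, 72])
concepts = pd.cut(ages, bins=[0, 29, 59, 120],
                  labels=["youth", "middle-aged", "senior"])   # assumed cut points
print(list(zip(ages, concepts)))
```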
➢ Normalization:
• Here the attribute data are scaled so as to fall within a small specified range, such as -1.0 to
1.0, or 0.0 to 1.0.
• Normalization is particularly useful for classification algorithms involving neural networks,
or for distance measurements such as nearest-neighbor classification and clustering.
• For distance-based methods, normalization helps prevent attributes with initially large
ranges (e.g., income) from outweighing attributes with initially smaller ranges.
• There are three methods for data normalization:
1. min-max normalization:
o performs a linear transformation on the original data
o Suppose that min_A and max_A are the minimum and maximum values of an attribute A.
o Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by
computing

v' = ((v - min_A) / (max_A - min_A)) × (new_max_A - new_min_A) + new_min_A
o Min-max normalization preserves the relationships among the original data values.
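A tiny worked sketch of the formula above; the income values (minimum 12,000, maximum 98,000) and the target range [0.0, 1.0] are assumed illustration values.

```python
# Worked min-max sketch using the formula above; the values are assumptions.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(min_max(73_600, 12_000, 98_000))   # -> about 0.716
```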
2. z-score normalization:
o Here the values for an attribute, A, are normalized based on the mean and standard
deviation of A.
o A value v of A is normalized to v' by computing

v' = (v - mean_A) / σ_A

where mean_A and σ_A are the mean and standard deviation of A, respectively.
o This method of normalization is useful when the actual minimum and maximum of
attribute A are unknown, or when there are outliers that dominate the min-max
normalization.
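A corresponding worked sketch for z-score normalization; the mean (54,000) and standard deviation (16,000) are assumed illustration values.

```python
# Worked z-score sketch using the formula above; the mean and standard
# deviation are assumed illustration values.
def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

print(z_score(73_600, 54_000, 16_000))   # -> 1.225
```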
3. normalization by decimal scaling:
o Here the normalization is done by moving the decimal point of values of attribute A.
o The number of decimal points moved depends on the maximum absolute value of A.
o A value v of A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
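A short sketch of decimal scaling; the sample values are assumed, and j is computed so that every scaled absolute value is below 1.

```python
# Decimal-scaling sketch: divide by 10**j, where j is the smallest integer
# making every scaled |value| < 1. The sample values are assumptions.
values = [-986, 234, 917]

j = 0
while max(abs(v) for v in values) / 10**j >= 1:
    j += 1

scaled = [v / 10**j for v in values]
print(j, scaled)   # 3 [-0.986, 0.234, 0.917]
```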

III. DATA CLEANING

• Real-world data tend to be incomplete, noisy, and inconsistent.


• Data cleaning routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
• Data cleaning tasks include:
➢ Fill in missing values
1. Ignore the tuple: this is usually done when the class label is missing.
2. Fill in the missing value manually: this approach is time-consuming and may not be
feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: replace all missing attribute values
with the same constant.
4. Use the attribute mean to fill in the missing value: replace the missing value with the
mean value of the attribute.
5. Use the attribute mean for all samples belonging to the same class as the given
tuple: replace the missing value with the mean of the attribute computed over tuples of
the same class as the given tuple.
6. Use the most probable value to fill in the missing value: this may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree
induction.
➢ Identify outliers and smooth out noisy data
• Noise is a random error or variance in a measured variable. It can be smoothed using the
following techniques:
1. Binning: binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it.
2. Regression: data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the “best” line to fit two attributes (or
variables), so that one attribute can be used to predict the other (see the sketch after
this list). Multiple linear regression is an extension of linear regression, where more
than two attributes are involved and the data are fit to a multidimensional surface.
3. Clustering: outliers may be detected by clustering, where similar values are
organized into groups, or “clusters.”
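As a sketch of the regression-based smoothing mentioned in the list above, the example below fits a line to noisy synthetic data with scikit-learn and uses the fitted values as the smoothed attribute; the data and noise level are assumptions.

```python
# Sketch of smoothing by regression: fit a line to one attribute as a function
# of another and use the fitted values as the smoothed data. Synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3 * x.ravel() + 2 + rng.normal(scale=2.0, size=50)   # noisy linear relation

model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)                            # fitted (smoothed) values

print("slope:", round(model.coef_[0], 2), "intercept:", round(model.intercept_, 2))
```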
➢ Correct inconsistent data
➢ Resolve redundancy caused by data integration
