
Data Cleaning.

Introduction

• Data cleaning is the process of fixing or removing
  incorrect, corrupted, incorrectly formatted,
  duplicate, or incomplete data within a dataset.
Missing Values

• To handle missing values in the data, there are several
  techniques which can solve this problem:
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant to fill in the missing value
• Use a measure of central tendency for the attribute
  (e.g., the mean or median) to fill in the missing value
• Use the attribute mean or median for all samples
  belonging to the same class as the given tuple
• Use the most probable value to fill in the missing value
1. Ignore the tuple

Delete the row with missing or inconsistent data. This method should be applied only when the row
contains no important information.

For example:

id   sales   Month
1    $100    January
2            February

The sales value for February is missing, so we ignore or delete this tuple.
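In pandas, this deletion is a one-liner; the DataFrame below is a hypothetical reconstruction of the sales table above:

```python
import pandas as pd

# Hypothetical sales table from the example; None marks the missing value.
df = pd.DataFrame({
    "id": [1, 2],
    "sales": [100, None],
    "month": ["January", "February"],
})

# Drop every row that contains at least one missing value.
cleaned = df.dropna()
```

Only the January row survives; the tuple with the missing sales figure is discarded.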


2. Fill in the missing value manually

Find the missing values with the isnull() method and fill them in by hand. The problem arises with large
datasets: filling millions of values manually is impractical, so this method is not suggested.
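A minimal pandas sketch of detecting (rather than hand-filling) missing values, using a hypothetical sales column:

```python
import pandas as pd

sales = pd.Series([100, None, 250, None])

# isnull() flags every missing cell at once; sum() counts the flags,
# so no row needs to be inspected manually.
missing_count = sales.isnull().sum()
```

Here missing_count is 2, found without iterating over the rows by hand.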
3. Use a global constant to fill in the missing value

A global constant like "NA" or INFINITY can be used for the missing attributes. The problem with this
method is that the constant can be treated as a special value by the model, so the output or result of
the model can be wrong. To overcome this problem, the missing values can instead be predicted with
regression and classification algorithms such as Random Forest, linear regression, or logistic regression.
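Filling with a global constant is straightforward in pandas; the city column below is purely illustrative:

```python
import pandas as pd

cities = pd.Series(["Pune", None, "Delhi"])

# Replace every missing value with the single global constant "NA".
filled = cities.fillna("NA")
```

The caveat above still applies: a model may treat the literal "NA" as just another category.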
Questions about data cleaning

Q1:What is data cleaning? Data cleaning is the Q2:Can you explain why Ans:Data cleansing is Q3:Is it possible to detect Ans: Yes, it is possible to
process of identifying and data cleansing is important for machine missing values from a detect missing values from
cleaning up inaccuracies important for machine learning models because it data set without actually a data set without actually
and inconsistencies in learning models? can help to improve the going through each row going through each row
data. accuracy of the models. If manually? If yes, then manually. This can be
there are errors or how? done by using a technique
inconsistencies in the called imputation, which is
training data, then the a process of replacing
models may learn from missing values with
these and produce estimated values. There
inaccurate results. Data are a number of different
cleansing can help to methods that can be used
remove these errors and for imputation, but the
ensure that the models most common is probably
are learning from high- the mean imputation
quality data. method, which replaces
missing values with the
mean of the non-missing
values in the data set.
Q4: What are the various ways in which you can address missing data in a data set?
Ans: There are a few different ways that you can address missing data in a data set. One way is to simply
remove any rows or columns that contain missing data. Another way is to impute the missing data, which
means to replace the missing data with an estimated value.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the
missing value

• Another approach generally used for missing values is the mean, median, or mode.

• For example (mean):
• Suppose we have the following values for salary (in thousands of dollars),
  shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
• x̄ = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12
    = 696 / 12
    = 58
• Thus, the mean salary is $58,000.
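The arithmetic above can be reproduced with the standard library, and the resulting mean used to fill a missing entry (the record below is hypothetical):

```python
import statistics

# Salaries in thousands of dollars, as in the example above.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mean_salary = statistics.mean(salaries)   # 696 / 12 = 58

# Fill a missing salary with the attribute mean.
record = {"id": 13, "salary": None}
if record["salary"] is None:
    record["salary"] = mean_salary
```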
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple:

For example, if classifying customers according to credit risk, we may replace the missing value with
the mean income value for customers in the same credit risk category as that of the given tuple. If the
data distribution for a given class is skewed, the median value is a better choice.
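A pandas sketch of class-wise mean imputation, using a made-up credit-risk table:

```python
import pandas as pd

# Hypothetical customers; one income is missing in each risk class.
df = pd.DataFrame({
    "risk":   ["low", "low", "low", "high", "high", "high"],
    "income": [60,    80,    None,  30,     40,     None],
})

# Fill each missing income with the mean income of that row's risk class.
df["income"] = df.groupby("risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)
```

The missing low-risk income becomes 70 (the low-class mean) rather than the global mean, and the missing high-risk income becomes 35.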
6. Use the most probable value to fill in the missing value:

Filling in missing values with the most probable value can be a straightforward approach in certain
situations. This method involves replacing the missing value with the mode (the most common value) of
the variable. The mode is considered the most probable value because it occurs with the highest
frequency in the available data.
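A small standard-library sketch of mode imputation on a hypothetical categorical column:

```python
from collections import Counter

colors = ["red", "blue", "red", None, "green", "red"]

# The mode is the most frequent observed value; skip missing entries when counting.
observed = [c for c in colors if c is not None]
mode = Counter(observed).most_common(1)[0][0]   # "red" occurs three times

filled = [mode if c is None else c for c in colors]
```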
Noisy data

• Noisy data are data with a large amount of additional meaningless information called noise.
• This includes data corruption, and the term is often used as a synonym for corrupt data.
• Improper procedures (or improperly documented procedures) to subtract out the noise in data
  can lead to a false sense of accuracy or false conclusions.

Data = true signal + noise
Binning:

• Binning is a technique where we sort the data and then partition it into equal-frequency bins.
• There are three methods for smoothing data in a bin:
• Smoothing by bin mean: the values in the bin are replaced by the mean value of the bin.
• Smoothing by bin median: the values in the bin are replaced by the median value of the bin.
• Smoothing by bin boundary: the minimum and maximum values of the bin are taken as the bin
  boundaries, and each value is replaced by the closest boundary value.
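Smoothing by bin means can be sketched in plain Python; the price values below are illustrative, not data from these notes:

```python
# Sorted data partitioned into equal-frequency bins of 4 values each.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bin_size = 4

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([bin_mean] * len(bin_values))   # each value becomes its bin's mean
```

The three bins smooth to 9, 22.75, and 29.25 respectively; smoothing by median or by boundary replaces the mean with the bin median or with the nearer of the bin's minimum and maximum.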
Example:

Question: Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods:
(a) equal-frequency partitioning, (b) equal-width partitioning.

Answer:

(a) Equal-frequency partitioning:
Bin 1: 5, 10, 11, 13
Bin 2: 15, 35, 50, 55
Bin 3: 72, 92, 204, 215

(b) Equal-width partitioning:
The width of each interval is (215 - 5) / 3 = 70.
Bin 1 (5-75): 5, 10, 11, 13, 15, 35, 50, 55, 72
Bin 2 (75-145): 92
Bin 3 (145-215): 204, 215
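Both partitionings can be checked programmatically; this is a plain-Python sketch of the two rules:

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
k = 3

# (a) Equal-frequency: 12 sorted values / 3 bins = 4 values per bin.
size = len(data) // k
eq_freq = [data[i * size:(i + 1) * size] for i in range(k)]

# (b) Equal-width: width = (215 - 5) / 3 = 70, so the bin edges fall at 75 and 145.
lo, width = min(data), (max(data) - min(data)) / k
eq_width = [[] for _ in range(k)]
for v in data:
    idx = min(int((v - lo) // width), k - 1)   # clamp the maximum into the last bin
    eq_width[idx].append(v)
```

This reproduces the answer above and shows why the equal-width bins are so unbalanced: the two large prices, 204 and 215, stretch the range.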
Clustering:

• Clustering is used for finding outliers and also for grouping the data. Clustering is generally
  used in unsupervised learning.
Example:

Question: Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Find the clusters.
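One simple way to cluster one-dimensional sorted data (a largest-gap heuristic, not a method prescribed by these notes) is to cut where neighbouring values are farthest apart:

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
k = 3

# Cut the sorted data at the k-1 largest gaps between neighbours,
# so tightly packed values fall into the same cluster.
gap_positions = sorted(range(1, len(data)),
                       key=lambda i: data[i] - data[i - 1],
                       reverse=True)[:k - 1]

clusters, start = [], 0
for cut in sorted(gap_positions):
    clusters.append(data[start:cut])
    start = cut
clusters.append(data[start:])
```

With k = 3 the last cluster is [204, 215], which already hints that those two prices are outliers.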
Regression:

• Regression is used to smooth the data and helps handle data when unnecessary data is present.
• Linear regression refers to finding the best line to fit between two variables so that one can
  be used to predict the other.
• Multiple linear regression involves more than two variables.
Example of regression: (figure)
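A least-squares line fit can be sketched without any libraries; the x/y values here are invented for illustration:

```python
# Simple linear regression: fit y = slope * x + intercept by least squares.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x); the fitted line passes through the means.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    return slope * x + intercept
```

One variable can then predict the other, e.g. predict(6) extrapolates the trend; multiple linear regression generalizes this to more than two variables.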
Outlier Analysis

• Outliers may be detected by clustering, where similar or close values are organized into the
  same groups or clusters.
• Outliers can be of the following kinds:
• Univariate outliers can be found when looking at a distribution of values in a single feature space.
• Multivariate outliers can be found in an n-dimensional space (of n features).
• Point outliers are single data points that lie far from the rest of the distribution.
• Contextual outliers can be noise in data, such as punctuation symbols when performing text
  analysis or background noise when doing speech recognition.
• Collective outliers can be subsets of novelties in data, such as a signal that may indicate
  the discovery of new phenomena.
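For univariate point outliers, the common 1.5 × IQR rule (an assumption here, not a rule stated in these notes) can be applied with the standard library, reusing the sales prices from the binning example:

```python
import statistics

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# Quartiles via the statistics module; values beyond 1.5 * IQR
# from the quartiles are flagged as outliers.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < lower or v > upper]
```

Here 204 and 215 are flagged, matching the values that lie far from the rest of the data.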
Example:

• Outliers deviate significantly from the rest of the data points; in a scatter plot they show up
  as isolated points far from the main mass of the data. Outliers are also often referred to as
  anomalies, aberrations, or irregularities.
