Delete the row with missing or inconsistent data. This method is appropriate only when the row does not contain
any important information.
For example:
id   sales       Month
1    $100        January
2    (missing)   February
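The row-deletion approach can be sketched in a few lines of Python. This is only an illustration: the `rows` list mirrors the small table above, `None` is assumed to mark the missing sales value, and `drop_incomplete` is a hypothetical helper name.

```python
# Rows mirroring the example table; None is assumed to mark a missing value.
rows = [
    {"id": 1, "sales": "$100", "month": "January"},
    {"id": 2, "sales": None, "month": "February"},
]


def drop_incomplete(records):
    """Return only the records in which no field is missing (None or empty string)."""
    return [r for r in records if all(v not in (None, "") for v in r.values())]


clean_rows = drop_incomplete(rows)
print(clean_rows)  # only the row with id 1 remains
```

Note that the original list is left unchanged, so the deleted rows can still be inspected if they turn out to matter.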
Q1: What is data cleaning?
Ans: Data cleaning is the process of identifying and cleaning up inaccuracies and inconsistencies in data.
Q2: Can you explain why data cleansing is important for machine learning models?
Ans: Data cleansing is important for machine learning models because it can help to improve the accuracy of the
models. If there are errors or inconsistencies in the training data, then the models may learn from these and
produce inaccurate results. Data cleansing can help to remove these errors and ensure that the models are
learning from high-quality data.
Q3: Is it possible to detect missing values from a data set without actually going through each row manually? If
yes, then how?
Ans: Yes, it is possible to detect missing values in a data set without going through each row manually, for
example by programmatically counting the empty or null entries in each column. Once detected, the missing
values can be handled by a technique called imputation, which is the process of replacing missing values with
estimated values. There are a number of different methods that can be used for imputation, but the most
common is probably mean imputation, which replaces the missing values of an attribute with the mean of its
non-missing values.
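As a rough sketch of the answer to Q3, the snippet below counts missing entries per column and then applies mean imputation to one column. The data set is made up, `None` is assumed to mark a missing entry, and `count_missing` and `impute_mean` are hypothetical helper names.

```python
from statistics import mean

# Made-up data set; None is assumed to mark a missing entry.
records = [
    {"age": 25, "income": 3000},
    {"age": None, "income": 4000},
    {"age": 35, "income": None},
    {"age": 30, "income": 5000},
]


def count_missing(rows):
    """Count missing (None) entries per column without inspecting rows by hand."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


def impute_mean(rows, column):
    """Return a copy of rows with missing values in `column` replaced by the column mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


missing = count_missing(records)      # one missing value in each column
filled = impute_mean(records, "age")  # the missing age becomes 30
```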
Q4: What are the various ways in which you can address missing data in a data set?
Ans: There are a few different ways that you can address missing data in a data set. One way is to simply remove
any rows or columns that contain missing data. Another way is to impute the missing data, which means replacing
the missing data with an estimated value.
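Removing columns rather than rows could look like the sketch below. It assumes a made-up table in which `None` marks a missing entry; `drop_columns_with_missing` is a hypothetical helper, and in practice one would usually weigh how much of a column is missing before discarding it.

```python
# Made-up table; None is assumed to mark a missing entry.
table = [
    {"id": 1, "sales": 100, "region": "north"},
    {"id": 2, "sales": None, "region": "south"},
    {"id": 3, "sales": 300, "region": "east"},
]


def drop_columns_with_missing(rows):
    """Return a copy of rows without any column that has at least one missing value."""
    incomplete = {col for row in rows for col, val in row.items() if val is None}
    return [{c: v for c, v in row.items() if c not in incomplete} for row in rows]


reduced = drop_columns_with_missing(table)  # the "sales" column is removed
```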
• Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the
missing value.
• Another commonly used approach is to fill in the missing values with the mean, median, or
mode of the attribute.
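The choice of measure matters when the attribute contains extreme values, because the mean is pulled toward them while the median is not. The sketch below uses made-up sales figures and a hypothetical `fill_missing` helper to show both fills side by side.

```python
from statistics import mean, median

# Made-up sales figures with one missing entry (None) and one extreme value.
sales = [100, 120, None, 110, 1000]


def fill_missing(values, measure):
    """Replace None entries with measure(observed values), e.g. mean or median."""
    observed = [v for v in values if v is not None]
    fill = measure(observed)
    return [fill if v is None else v for v in values]


filled_with_mean = fill_missing(sales, mean)      # missing entry becomes 332.5
filled_with_median = fill_missing(sales, median)  # missing entry becomes 115.0
```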
Filling in missing values with the most probable value can be a straightforward approach in
certain situations. This method involves replacing the missing value with the mode (most
common value) of the variable. The mode is considered the most probable value because it
occurs with the highest frequency in the available data.
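For a categorical attribute, mode imputation might be sketched as follows. The `months` list is made up, `None` is assumed to mark a missing value, and `impute_mode` is a hypothetical helper name.

```python
from statistics import mode

# Made-up categorical column; None is assumed to mark a missing value.
months = ["January", "February", None, "January", "March", None]


def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    most_common = mode(observed)
    return [most_common if v is None else v for v in values]


filled_months = impute_mode(months)  # both gaps are filled with "January"
```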
Noisy data
• Noisy data are data with a large amount of
additional meaningless information called noise.
• This includes data corruption, and the term is often
used as a synonym for corrupt data.
• Improper procedures (or improperly documented
procedures) to subtract out the noise in data can
lead to a false sense of accuracy or false
conclusions.
Data = true signal + noise
Binning:
• Binning is a technique where we sort the data and
then partition the data into equal-frequency bins.
• There are three methods for smoothing the data in a
bin.
• Smoothing by bin mean: In this method,
the values in the bin are replaced by the mean value
of the bin.
• Smoothing by bin median: In this method, the
values in the bin are replaced by the median value
of the bin.
• Smoothing by bin boundary: In this method, the
minimum and maximum values of the bin are taken
as the bin boundaries, and each value in the bin is
replaced by the closest boundary value.
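The three smoothing methods can be sketched for a single bin as follows. The bin values are made up for illustration, the function names are hypothetical, and sending ties to the lower boundary is an assumption, since the text does not say how ties are resolved.

```python
from statistics import mean, median


def smooth_by_mean(bin_values):
    """Replace every value in the bin with the bin mean."""
    m = mean(bin_values)
    return [m] * len(bin_values)


def smooth_by_median(bin_values):
    """Replace every value in the bin with the bin median."""
    m = median(bin_values)
    return [m] * len(bin_values)


def smooth_by_boundary(bin_values):
    """Replace every value with the closer of the bin minimum and maximum."""
    low, high = min(bin_values), max(bin_values)
    # Assumption: a value equally close to both boundaries goes to the lower one.
    return [low if v - low <= high - v else high for v in bin_values]


example_bin = [4, 8, 9, 15]  # made-up, already sorted bin
by_mean = smooth_by_mean(example_bin)          # [9, 9, 9, 9]
by_median = smooth_by_median(example_bin)      # [8.5, 8.5, 8.5, 8.5]
by_boundary = smooth_by_boundary(example_bin)  # [4, 4, 4, 15]
```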
Example:
Question:
(a) equal-frequency partitioning
Answer:
bin 1: 5, 10, 11, 13
bin 2: 15, 35, 50, 55
bin 3: 72, 92, 204, 215
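The partitioning in the answer can be reproduced with a short sketch. The question's original data list is not shown here, so the input below is an assumption: it is simply the twelve values that appear in the three bins, in arbitrary order. `equal_frequency_bins` is a hypothetical helper that requires the number of values to divide evenly into the number of bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        raise ValueError("number of values must be divisible by n_bins")
    size = len(ordered) // n_bins
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Assumed input: the values from the answer's bins, deliberately unsorted.
data = [215, 5, 50, 13, 72, 10, 204, 15, 92, 11, 55, 35]
bins = equal_frequency_bins(data, 3)
for number, contents in enumerate(bins, start=1):
    print(f"bin {number}: {contents}")
```

With 12 values and 3 bins, each bin holds 4 values, which matches the bins listed in the answer.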