• Data preparation is the process of preparing raw data so that it is suitable for further
processing and analysis. Key steps include collecting, cleaning, and labeling raw data so
that it is in a form suitable for machine learning (ML) algorithms, and then exploring and
visualizing the data.
• Data preparation steps
1. Gather data. The data preparation process begins with finding the right data. ...
2. Discover and assess data. After collecting the data, it is important to discover each dataset. ...
3. Cleanse and validate data. ...
4. Transform and develop data. ...
5. Store data.
Data Cleaning
• In many real-world datasets, values are not recorded for every attribute of every instance.
• This can happen simply because there are some attributes that are not applicable for
some instances (e.g. certain medical data may only be meaningful for female patients
or patients over a certain age).
• The best approach here may be to divide the dataset into two (or more) parts, e.g.
treating male and female patients separately.
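The split described above can be sketched in a few lines of plain Python. This is only an illustration: the records, the attribute names (`sex`, `pregnancies`) and the helpers `split_by` and `drop_attribute` are hypothetical and do not come from the original notes. `None` stands for a value that was not recorded.

```python
# Hypothetical patient records; None marks a value that was not recorded.
records = [
    {"id": 1, "sex": "F", "age": 34, "pregnancies": 2},
    {"id": 2, "sex": "M", "age": 51, "pregnancies": None},
    {"id": 3, "sex": "F", "age": 29, "pregnancies": 0},
    {"id": 4, "sex": "M", "age": 45, "pregnancies": None},
]


def split_by(rows, key):
    """Group rows into separate datasets according to the value of `key`."""
    parts = {}
    for row in rows:
        parts.setdefault(row[key], []).append(row)
    return parts


def drop_attribute(rows, name):
    """Return copies of the rows without the attribute `name`."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


parts = split_by(records, "sex")
female = parts["F"]                               # "pregnancies" is meaningful here
male = drop_attribute(parts["M"], "pregnancies")  # attribute is not applicable here
```

After the split, neither part contains a "missing" value that is really just an inapplicable attribute, so each part can be analysed on its own.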
Missing Values
• It can also happen that there are attribute values that should be recorded that are
missing.
• This can occur for several reasons.
• There are several possible strategies for dealing with missing values. Two of
the most commonly used are as follows.
1) Discard Instances
2) Replace by Most Frequent/Average Value
Discard Instances
• This strategy deletes all instances where there is at least one missing value and uses
the remainder.
• Its disadvantage is that discarding data may damage the reliability of the results
derived from the data.
• Although it may be worth trying when the proportion of missing values is small, it
is not recommended in general.
• It is clearly not usable when all or a high proportion of all the instances have
missing values.
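As a minimal sketch of the discard strategy, assuming each instance is a dictionary and a missing value is stored as `None` (the sample data and the helper names `discard_incomplete` and `fraction_incomplete` are made up for illustration):

```python
# Hypothetical instances; None marks a missing attribute value.
rows = [
    {"age": 25, "colour": "red"},
    {"age": None, "colour": "blue"},
    {"age": 40, "colour": None},
    {"age": 31, "colour": "red"},
]


def discard_incomplete(rows):
    """Keep only the instances in which no attribute value is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]


def fraction_incomplete(rows):
    """Proportion of instances that have at least one missing value."""
    if not rows:
        return 0.0
    incomplete = sum(1 for row in rows if any(v is None for v in row.values()))
    return incomplete / len(rows)


lost = fraction_incomplete(rows)     # how much data the strategy would throw away
complete = discard_incomplete(rows)  # the remainder that is actually used
```

Checking `fraction_incomplete` first gives a rough idea of whether discarding is reasonable: a small proportion may be acceptable, while a large one means most of the data would be lost. With a library such as pandas, the same operation is commonly done with `DataFrame.dropna()`.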
Replace by Most Frequent/Average Value
• This technique replaces each missing value with the value that occurs most
frequently in that column, in other words with the mode of that column.
This technique is also referred to as mode imputation.
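The replacement strategy can be sketched as follows, using only Python's standard `statistics` module. The helper `impute`, its `strategy` parameter and the sample data are assumptions made for this example. The mode is used for a categorical column, and the average (mean), the other option named in the heading, for a numeric one.

```python
from statistics import mean, mode


def impute(rows, column, strategy):
    """Return new rows with missing (None) values in `column` filled in.

    strategy: "mode" uses the most frequent observed value,
              "mean" uses the average of the observed values.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        raise ValueError(f"column {column!r} has no observed values")
    if strategy == "mode":
        fill = mode(observed)
    elif strategy == "mean":
        fill = mean(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Build new dictionaries so the original rows are left untouched.
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical instances; None marks a missing attribute value.
rows = [
    {"age": 25, "colour": "red"},
    {"age": None, "colour": "blue"},
    {"age": 40, "colour": None},
    {"age": 31, "colour": "red"},
]

filled = impute(rows, "colour", "mode")  # categorical: most frequent value
filled = impute(filled, "age", "mean")   # numeric: average value
```

Unlike discarding, this keeps every instance, at the cost of inserting values that were never actually observed. In pandas, a comparable result is usually obtained with `fillna` together with `mode()` or `mean()`.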