: 16010120202
Experiment 01
Objective:
Course Outcome:
Resources used:
https://www.kaggle.com/datasets/tarundalal/100-richest-people-in-world
______________________________________________________________________
Data pre-processing transforms raw data into a format that can be processed more
easily and effectively in data mining, machine learning and other data science tasks.
These techniques are generally applied at the earliest stages of the machine learning
and AI development pipeline to help ensure accurate results.
• Finding missing, null values : A data set can be missing individual fields of
data for a variety of reasons. Data scientists need to decide whether it is better
to discard records with missing fields, ignore them, or fill them in with a
probable value. It is good practice to identify and handle the missing values in
each column of the input data before modeling the prediction task.
• Replacing missing, null values with statistical parameters : Handling
missing values is important because most machine learning algorithms do not
support data with missing values. If missing values are not handled properly,
inaccurate inferences may be drawn from the data. A popular approach is to
calculate a statistic for each column (such as the mean) and replace all missing
values in that column with it. The approach is popular because the statistic is
easy to calculate from the training dataset and because it often results in good
performance.
• Encoding categorical data : Encoding categorical data is the process of
converting categorical values into a numeric (typically integer) format so that
the data can be provided to models that expect numeric input. Data that takes
one of a finite set of possible values is considered categorical data. There are
2 kinds of categorical data:
• Nominal data (categories with no inherent order)
• Ordinal data (categories with a natural order)
• Normalization : Data normalization is the method of organizing data so that it
appears similar across all records and fields. This process includes eliminating
unstructured data and duplicates. When data normalization is performed
correctly, more useful insights can be drawn from the data. In machine
learning, some features take values many times larger than others. The
features with larger values tend to dominate the learning process. Data
normalization transforms features on different scales to the same scale. After
normalization, all variables carry a similar weight in the model, which can
improve the stability and performance of the learning algorithm.
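The four steps above can be sketched with pandas as follows. This is a minimal illustration, not the working of this experiment: the toy records and column names (net_worth, age, industry, tier) are hypothetical and are not taken from the Kaggle dataset listed under Resources used. The sketch also assumes mean imputation for numeric columns, most-frequent-value imputation for the categorical column, and min-max scaling as the normalization method.

```python
import pandas as pd

# Hypothetical toy records; the column names are illustrative only and are
# not taken from the dataset referenced in this report.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "net_worth": [200.0, None, 100.0, 150.0],
    "age": [50, 60, None, 70],
    "industry": ["Tech", "Retail", "Tech", None],
    "tier": ["high", "low", "medium", "low"],
})

# 1. Finding missing, null values: count the nulls in each column.
missing_counts = df.isna().sum()
print(missing_counts)

# 2. Replacing missing values with a statistic.
#    Numeric columns use the column mean; the categorical column uses the
#    most frequent value (mode). Both choices are assumptions of this sketch.
for col in ["net_worth", "age"]:
    df[col] = df[col].fillna(df[col].mean())
df["industry"] = df["industry"].fillna(df["industry"].mode()[0])

# 3. Encoding categorical data.
#    Ordinal data: map each category to an integer that preserves its order.
tier_order = {"low": 0, "medium": 1, "high": 2}
df["tier_code"] = df["tier"].map(tier_order)
#    Nominal data: one-hot encode so that no artificial order is implied.
df = pd.get_dummies(df, columns=["industry"], prefix="industry")

# 4. Normalization (min-max scaling): x' = (x - min) / (max - min),
#    which rescales each numeric column to the range [0, 1].
for col in ["net_worth", "age"]:
    lo, hi = df[col].min(), df[col].max()
    df[col + "_scaled"] = (df[col] - lo) / (hi - lo)

print(df)
```

In a real modeling workflow the mean, mode, minimum and maximum would normally be computed on the training split only and then reused on the test split, which this sketch skips for brevity.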
Working :
Conclusion (Students should write in their own words):
We learned how to prepare a dataset for further processing and the various steps
involved in data pre-processing.