DA1SHRUTI202

Batch: C3 Roll No.: 16010120202
Experiment 01
Title: Dataset preparation / pre-processing
Objective:
1. To learn how to prepare the dataset
2. To learn the various steps in data pre-processing
3. To perform descriptive analytics (obtain measures of central tendency,
dispersion, skewness and kurtosis, and a graphical representation of the given dataset)
4. To perform exploratory data analysis on the dataset
Course Outcome:
CO1 Understand basic concepts of data analytics to solve real-world problems
CO2 Experiment using advanced software techniques and tools to conduct thorough
and insightful analysis
Books/ Journals/ Websites referred:

https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/
https://online.hbs.edu/blog/post/how-to-analyze-datasets
https://www.geckoboard.com/blog/how-to-analyze-data/
Resources used:
https://www.kaggle.com/datasets/tarundalal/100-richest-people-in-world
______________________________________________________________________
Theory (About Data Preprocessing):
Data pre-processing, a component of data preparation, describes any type of
processing performed on raw data to prepare it for another data processing procedure.
It has traditionally been an important preliminary step in the data mining process.
More recently, data pre-processing techniques have been adapted for training machine
learning and AI models and for running inference against them.
Data pre-processing transforms the data into a format that can be processed more
easily and effectively in data mining, machine learning and other data science tasks.
The techniques are generally used at the earliest stages of the machine learning and AI
development pipeline to help ensure accurate results.
Following points should be written by students
Different steps in Data Preprocessing:
• Finding missing or null values: There are a variety of reasons a dataset might
be missing individual fields. Data scientists need to decide whether it is
better to discard records with missing fields, ignore them, or fill them in with a
probable value. It is good practice to identify and replace missing values in
each column of the input data before modeling the prediction task.
• Replacing missing or null values with statistical parameters: Handling
missing values is important because many machine learning algorithms
do not support data with missing values. If missing values are not handled
properly, inferences drawn from the data can be inaccurate. A popular
approach is to calculate a statistic for each column (such as the mean) and
replace all missing values in that column with that statistic. The approach is
popular because the statistic is easy to calculate from the training dataset
and often results in good performance.
• Encoding categorical data: Encoding categorical data is the process of
converting categorical values into a numeric format so that they can be
supplied to models, many of which accept only numeric input, and so improve
the predictions. Data that takes one of a finite set of possible values is
considered categorical data. There are two kinds of categorical data:
    • Nominal data (categories with no inherent order)
    • Ordinal data (categories with a meaningful order)
• Normalization: Data normalization is the method of organizing data so that it
appears consistent across all records and fields. In data cleaning, this
includes eliminating unstructured data and duplicates, and when it is performed
correctly the data yields more useful insights. In machine learning, the values
of some features can be many times larger than those of others, and in
scale-sensitive algorithms the features with larger values tend to dominate the
learning process. Normalization rescales such multi-scale data to a common
scale, for example the range [0, 1]. After normalization, all variables carry a
similar weight in the model, which can improve the stability and performance of
the learning algorithm.
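The four steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the notebook used in this experiment; the records, the column names (net_worth, age, industry) and the helper functions are hypothetical and are not taken from the Kaggle dataset.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"name": "A", "net_worth": 200.0, "age": 50, "industry": "Technology"},
    {"name": "B", "net_worth": None, "age": 60, "industry": "Retail"},
    {"name": "C", "net_worth": 100.0, "age": None, "industry": "Technology"},
    {"name": "D", "net_worth": 150.0, "age": 70, "industry": "Finance"},
]

def count_missing(rows, column):
    """Step 1: count how many rows have no value in `column`."""
    return sum(1 for row in rows if row[column] is None)

def impute_mean(rows, column):
    """Step 2: replace missing values in `column` with the column mean."""
    fill = mean(row[column] for row in rows if row[column] is not None)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]

def label_encode(rows, column):
    """Step 3: map each category to an integer code (sorted for repeatability)."""
    categories = sorted({row[column] for row in rows})
    codes = {category: i for i, category in enumerate(categories)}
    return [{**row, column: codes[row[column]]} for row in rows], codes

def min_max_scale(rows, column):
    """Step 4: rescale `column` to the range [0, 1]."""
    values = [row[column] for row in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant column
    return [{**row, column: (row[column] - lo) / span} for row in rows]

missing = {col: count_missing(records, col) for col in ("net_worth", "age")}
clean = impute_mean(impute_mean(records, "net_worth"), "age")
clean, industry_codes = label_encode(clean, "industry")
clean = min_max_scale(min_max_scale(clean, "net_worth"), "age")

print(missing)
print(industry_codes)
for row in clean:
    print(row)
```

With these sample values, the missing net worth is filled with 150.0 and the missing age with 60, and both numeric columns end up in [0, 1]. Integer codes imply an order between categories, so for nominal data one-hot encoding is often preferred over the label encoding shown here.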
Source of dataset: https://www.kaggle.com/datasets/tarundalal/100-richest-people-in-world
Platform used by the student: Google Colab
Working :
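The descriptive measures listed in the objectives can be sketched as follows. This is a minimal illustration, assuming population (not sample) formulas and a small hypothetical list of ages; it is not the output recorded for this experiment, and the function names are illustrative.

```python
from statistics import mean, median, mode, pstdev

def skewness(values):
    """Population skewness: the third standardized moment."""
    m, s = mean(values), pstdev(values)
    return sum((x - m) ** 3 for x in values) / (len(values) * s ** 3)

def excess_kurtosis(values):
    """Population excess kurtosis: the fourth standardized moment minus 3."""
    m, s = mean(values), pstdev(values)
    return sum((x - m) ** 4 for x in values) / (len(values) * s ** 4) - 3

ages = [50, 60, 60, 70]  # hypothetical values, not from the dataset

summary = {
    # Measures of central tendency
    "mean": mean(ages),
    "median": median(ages),
    "mode": mode(ages),
    # Measures of dispersion
    "range": max(ages) - min(ages),
    "std_dev": pstdev(ages),
    # Shape of the distribution
    "skewness": skewness(ages),
    "excess_kurtosis": excess_kurtosis(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A skewness near zero indicates a roughly symmetric distribution, and a negative excess kurtosis indicates lighter tails than a normal distribution. Library routines such as pandas' Series.skew() and Series.kurt() apply sample bias corrections, so their values can differ from these population formulas.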
Conclusion (Students should write in their own words):
We learned how to prepare a dataset for further processing and the various steps in
data pre-processing.
