
Batch: C3 Roll No.: 16010120202

Experiment 01

Title: Dataset preparation / pre-processing

Objective:

1. To learn how to prepare the dataset
2. To learn the various steps in data preprocessing
3. To perform descriptive analytics (obtain measures of central tendency,
dispersion, skewness, and kurtosis, and a graphical representation of the given dataset)
4. To perform exploratory data analysis on the dataset

Course Outcome:

CO1 Understand basic concepts of data analytics to solve real-world problems


CO2 Experiment using advanced software techniques and tools to conduct thorough
and insightful analysis

Books/ Journals/ Websites referred:


https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/
https://online.hbs.edu/blog/post/how-to-analyze-datasets
https://www.geckoboard.com/blog/how-to-analyze-data/

Resources used:

https://www.kaggle.com/datasets/tarundalal/100-richest-people-in-world
______________________________________________________________________

Theory (About Data Preprocessing):

Data pre-processing, a component of data preparation, describes any type of
processing performed on raw data to prepare it for another data processing procedure.
It has traditionally been an important preliminary step for the data mining process.
More recently, data pre-processing techniques have been adapted for training machine
learning models and AI models and for running inferences against them.

Data pre-processing transforms the data into a format that is more easily and
effectively processed in data mining, machine learning and other data science tasks.
The techniques are generally used at the earliest stages of the machine learning and AI
development pipeline, because later steps can only be as accurate as the data they receive.

Following points should be written by students

Different steps in Data Preprocessing:

• Finding missing (null) values: There are a variety of reasons a dataset might
be missing individual fields. Data scientists need to decide whether it is
better to discard records with missing fields, ignore them, or fill them in with a
probable value. It is good practice to identify and handle missing values in
each column of the input data before modeling the prediction task.
• Replacing missing (null) values with statistical parameters: Handling
missing values is important because many machine learning algorithms
do not accept data with missing values. If missing values are not handled
properly, inferences drawn from the data can be inaccurate.
A popular approach is to calculate a statistic for each column
(such as the mean) and replace all missing values in that column with it.
The approach is popular because the statistic is easy to calculate from
the training dataset and because it often results in good performance.
• Encoding categorical data: Encoding is the process of converting
categorical data into a numeric format so that it can be provided to
models, which generally expect numeric input. Data that takes one of a
finite set of possible values is considered categorical data. There are
two kinds of categorical data:
  • Nominal data: categories with no inherent order
  • Ordinal data: categories with a meaningful order
• Normalization: In the general sense, data normalization is the method of
organizing data so that it appears consistent across all records and fields,
which includes eliminating unstructured data and duplicates. Consistent data
is easier to analyze and yields more reliable insights. In machine
learning, normalization refers more specifically to feature scaling: some
features take values many times larger than others, and features with larger
values tend to dominate the learning process. Normalization transforms
features measured on different scales to a common scale. After
normalization, all variables carry a similar weight in the model, which
improves the stability and performance of the learning algorithm.
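
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook used in this experiment: the records, the column names (net_worth, age, industry) and the values are made up and are not taken from the Kaggle dataset. It assumes mean imputation, integer (label) encoding and min-max scaling as the concrete technique for each step.

```python
from statistics import mean

# Hypothetical toy records (NOT from the Kaggle dataset); None marks a missing value.
rows = [
    {"name": "A", "net_worth": 200.0, "age": 50, "industry": "Technology"},
    {"name": "B", "net_worth": 100.0, "age": None, "industry": "Retail"},
    {"name": "C", "net_worth": None, "age": 70, "industry": "Technology"},
    {"name": "D", "net_worth": 150.0, "age": 60, "industry": "Finance"},
]
numeric_cols = ["net_worth", "age"]

# 1. Find missing values: count the None entries in each column.
missing_counts = {col: sum(1 for r in rows if r[col] is None) for col in rows[0]}

# 2. Replace missing numeric values with the mean of the observed values.
for col in numeric_cols:
    col_mean = mean(r[col] for r in rows if r[col] is not None)
    for r in rows:
        if r[col] is None:
            r[col] = col_mean

# 3. Encode the categorical column as integer codes (label encoding).
#    Sorting makes the mapping deterministic; the codes imply no ranking.
categories = sorted({r["industry"] for r in rows})
industry_codes = {cat: i for i, cat in enumerate(categories)}
for r in rows:
    r["industry_code"] = industry_codes[r["industry"]]

# 4. Min-max normalization: rescale each numeric column to the range [0, 1].
for col in numeric_cols:
    lo = min(r[col] for r in rows)
    hi = max(r[col] for r in rows)
    for r in rows:
        r[col + "_scaled"] = (r[col] - lo) / (hi - lo) if hi > lo else 0.0

print(missing_counts)
print(industry_codes)
print([(r["name"], r["net_worth_scaled"], r["age_scaled"]) for r in rows])
```

Integer codes are the simplest encoding, but for nominal data they can suggest an order that does not exist; one-hot encoding is a common alternative in that case.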

Source of dataset: https://www.kaggle.com/datasets/tarundalal/100-richest-people-in-world

Platform used by the student: Google Colab

Working :
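
The descriptive measures named in the objectives (central tendency, dispersion, skewness and kurtosis) can be illustrated with a short sketch. It is not the notebook run for this experiment: the values are made up rather than taken from the Kaggle dataset, and it assumes the population (moment-based) definitions of skewness and excess kurtosis, which is only one of several conventions.

```python
from statistics import mean, median, mode, pstdev

# Hypothetical sample values (NOT from the Kaggle dataset).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)

# Measures of central tendency.
mu = mean(values)
med = median(values)
most_common = mode(values)

# Measures of dispersion.
value_range = max(values) - min(values)
sigma = pstdev(values)  # population standard deviation


def central_moment(k):
    """k-th central moment of the sample, divided by n (population form)."""
    return sum((x - mu) ** k for x in values) / n


# Shape: skewness is the standardized third moment; excess kurtosis is the
# standardized fourth moment minus 3 (so a normal distribution scores 0).
skewness = central_moment(3) / sigma ** 3
excess_kurtosis = central_moment(4) / sigma ** 4 - 3.0

print(mu, med, most_common, value_range, sigma)
print(skewness, excess_kurtosis)
```

A positive skewness indicates a longer right tail, and a negative excess kurtosis indicates lighter tails than a normal distribution.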

Conclusion (Students should write in their own words):

We learned how to prepare a dataset for further processing and the various steps in
data preprocessing.
