Data Cleaning and Preprocessing Techniques

The document discusses data cleaning and preprocessing techniques for machine learning models. It describes 3 main steps: 1) exploratory data analysis to understand the dataset, 2) dealing with missing values through methods like imputation, and 3) handling duplicates and outliers. The first step involves analyzing dataset structure, variable distributions and relationships. Missing numerical values can be filled with mean, median or group-based imputations while categorical values use the mode.

Uploaded by Sivam Chinna

• First, dig deep into the problem: understand which clues you are missing and what information you can extract.
• Once you understand the problem, prepare the dataset for your machine learning model, because data in its raw condition is rarely ready for modeling.

Step 1: Exploratory Data Analysis
• The first step in a data science project is exploratory analysis, which helps you understand the problem and make decisions in the later steps.
• It tends to be skipped, but that is a serious mistake: you will lose a lot of time later finding out why the model raises errors or does not perform as expected.

• Exploratory analysis can be split into three parts:
1. Check the structure of the dataset, the summary statistics, the missing values, the duplicates, and the unique values of the categorical variables.
2. Understand the meaning and the distribution of the variables.
3. Study the relationships between variables.

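• To analyse how the dataset is organised, the following Pandas methods can help:
    df.head()                # first rows of the dataset
    df.info()                # column dtypes and non-null counts
    df.isnull().sum()        # number of missing values per column
    df.duplicated().sum()    # number of fully duplicated rows
    df.describe(percentiles=[x / 10 for x in range(1, 10)])  # summary statistics with deciles
    for c in df.columns:
        print(df[c].value_counts())  # frequency of each unique value in column c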

• When trying to understand the variables, it's useful to split the analysis into two further parts: numerical features and categorical features.
• First, focus on the numerical features, which can be visualized through:
    - histograms
    - boxplots
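
As an illustration of the two plot types, here is a minimal sketch that draws a histogram and a boxplot for every numerical column. The toy DataFrame and its values are assumptions made for the example, and Matplotlib is an assumed plotting library (the slides do not name one).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; replace with your own DataFrame.
df = pd.DataFrame({
    "age": [22, 25, 31, 35, 41, 47, 52, 58, 63, 90],
    "price": [120.0, 135.5, 150.0, 160.0, 175.0, 180.0, 210.0, 230.0, 260.0, 900.0],
    "type_building": ["flat", "flat", "house", "house", "flat",
                      "house", "flat", "house", "flat", "house"],
})

# Keep only the numerical features.
numeric_cols = df.select_dtypes(include="number").columns

# One row per numerical feature: histogram on the left, boxplot on the right.
fig, axes = plt.subplots(len(numeric_cols), 2,
                         figsize=(8, 3 * len(numeric_cols)), squeeze=False)
for row, col in enumerate(numeric_cols):
    values = df[col].dropna().to_numpy()
    axes[row, 0].hist(values, bins=10)
    axes[row, 0].set_title(f"Histogram of {col}")
    axes[row, 1].boxplot(values)
    axes[row, 1].set_title(f"Boxplot of {col}")
fig.tight_layout()
```

In a notebook, drop the `matplotlib.use("Agg")` line and the figure is shown inline; in a script, `fig.savefig(...)` writes it to a file.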

• Next, it's the turn of the categorical variables.
• If it's a binary classification problem, start by checking whether the classes are balanced.
• Then look at the remaining categorical variables using bar plots.
• Finally, check the correlation between each pair of numerical variables.
• Other useful visualizations are scatter plots and boxplots, which help you observe the relationship between a numerical and a categorical variable.
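
The checks above can also be done numerically before plotting anything. The following is a minimal sketch, assuming a hypothetical toy DataFrame with a binary target column named `sold`; none of these names or values come from the slides.

```python
import pandas as pd

# Hypothetical toy data with a binary target column named "sold".
df = pd.DataFrame({
    "age": [5, 12, 3, 40, 25, 8, 30, 15],
    "price": [300.0, 220.0, 320.0, 90.0, 150.0, 280.0, 120.0, 200.0],
    "type_building": ["flat", "house", "flat", "house",
                      "house", "flat", "house", "flat"],
    "sold": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Class balance of the binary target: share of rows in each class.
class_share = df["sold"].value_counts(normalize=True)

# Counts behind a bar plot of a categorical variable.
building_counts = df["type_building"].value_counts()

# Pairwise Pearson correlation between the numerical features.
corr = df[["age", "price"]].corr()

# Numerical vs. categorical: per-group summary statistics of price.
price_by_type = df.groupby("type_building")["price"].describe()

print(class_share, building_counts, corr, price_by_type, sep="\n\n")
```

To get the bar plot itself, `building_counts.plot(kind="bar")` draws it through pandas' Matplotlib integration.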

Step 2: Deal with Missing Values
• As a first step, investigate the missing values in each variable.
• If there are missing values, we need to decide how to handle them.
• The easiest way would be to remove the variables or the rows that contain NaN values, but we prefer to avoid this because we risk losing useful information that could help our machine learning model solve the problem.
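
To see how much information dropping can cost, here is a minimal sketch on a hypothetical toy DataFrame (the data is an assumption made for the example). It compares removing rows with removing columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps.
df = pd.DataFrame({
    "age": [10, np.nan, 30, 40, np.nan],
    "price": [100.0, 150.0, np.nan, 200.0, 250.0],
    "type_building": ["flat", "house", "flat", None, "house"],
})

# Share of missing values per column, to judge how much each option would cost.
missing_share = df.isnull().mean()

# Option 1: drop every row that contains at least one NaN.
rows_dropped = df.dropna(axis=0)

# Option 2: drop every column that contains at least one NaN.
cols_dropped = df.dropna(axis=1)

print(missing_share)
print(f"rows kept: {len(rows_dropped)} of {len(df)}")
print(f"columns kept: {cols_dropped.shape[1]} of {df.shape[1]}")
```

In this toy case only one row survives and no column does, even though no column is more than 40% empty, which is why imputation is usually preferred.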

• If we are dealing with a numerical variable, there are several approaches to fill it.
• The most popular method consists of filling the missing values with the mean or the median of that feature (fillna returns a new Series, so assign the result back):
    df['age'] = df['age'].fillna(df['age'].mean())
    # or, more robust to outliers:
    df['age'] = df['age'].fillna(df['age'].median())

• Another way is to fill the blanks with group-by imputation, using the mean of the group each row belongs to:
    df['price'] = df['price'].fillna(df.groupby('type_building')['price'].transform('mean'))
• This can be a better option when there is a strong relationship between a numerical feature and a categorical feature.
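
The difference between the two strategies is easiest to see side by side. This is a minimal sketch on a hypothetical toy DataFrame: the column names follow the slides, the values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: flats are cheap, houses are expensive.
df = pd.DataFrame({
    "type_building": ["flat", "flat", "flat", "house", "house", "house"],
    "price": [100.0, 120.0, np.nan, 300.0, np.nan, 340.0],
})

# Global mean imputation: every gap gets the same value.
global_fill = df["price"].fillna(df["price"].mean())

# Group-by imputation: each gap gets the mean of its own building type.
group_mean = df.groupby("type_building")["price"].transform("mean")
group_fill = df["price"].fillna(group_mean)

print(pd.DataFrame({
    "type_building": df["type_building"],
    "global": global_fill,
    "by_group": group_fill,
}))
```

Here the global mean fills both gaps with 215, a price that matches neither building type, while the group-by version fills the flat with 110 and the house with 320.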
• Fill the missing values of a categorical variable with the mode of that variable:

    df['type_building'] = df['type_building'].fillna(df['type_building'].mode()[0])
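
A minimal runnable sketch of mode imputation, on hypothetical toy data (the values are assumptions made for the example):

```python
import pandas as pd

# Hypothetical toy data with two missing categories.
df = pd.DataFrame({
    "type_building": ["flat", "house", "flat", None, "house", "flat", None],
})

# mode() returns a Series because several values can tie;
# [0] takes the first one in sorted order.
most_frequent = df["type_building"].mode()[0]

# Assign the result back: fillna returns a new Series.
df["type_building"] = df["type_building"].fillna(most_frequent)
print(df["type_building"].value_counts())
```

Note that mode imputation inflates the most frequent class, so it is worth comparing the value counts before and after.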

Step 3: Deal with Duplicates and Outliers
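
The slides for this step did not survive extraction, so the following is only a sketch of one common way to approach it, not the author's method. It assumes that duplicates are dropped with `drop_duplicates` and that outliers are flagged with the 1.5 × IQR rule; the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one exact duplicate row and one extreme price.
df = pd.DataFrame({
    "type_building": ["flat", "flat", "house", "house", "flat", "house", "flat"],
    "price": [100.0, 100.0, 120.0, 130.0, 110.0, 125.0, 900.0],
})

# Duplicates: keep the first occurrence of each fully identical row.
deduped = df.drop_duplicates().reset_index(drop=True)

# Outliers (assumed rule, not from the slides): flag values outside
# [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = deduped["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = ~deduped["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(deduped[is_outlier])
```

Whether a flagged row should be removed, capped, or kept depends on the problem; the flag only marks it for inspection.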
