
Practical - 7

AIM:
Write a program using Pandas to load a CSV file containing sample data and perform data cleaning.

THEORY:
In this practical, we will demonstrate how to clean and preprocess data from a CSV file using Python's
Pandas library. We'll cover essential data cleaning techniques, such as handling missing values,
removing duplicates, converting data types, handling outliers, and standardizing or normalizing
numerical columns. The data will resemble a typical database structure, and we'll create a sample CSV
file to showcase the cleaning process.

Key Concepts
1. Data Cleaning and Its Importance
Data cleaning, also known as data preprocessing or data wrangling, is a critical step in the data
analysis and machine learning lifecycle. It involves identifying and correcting errors,
inconsistencies, and inaccuracies in the dataset to ensure that the data is accurate, reliable, and
suitable for analysis. The process often includes handling missing or incomplete data, removing
duplicates, transforming data types, dealing with outliers, and normalizing or standardizing
data.

Key Concepts in Data Cleaning:


1) Handling Missing Values: Missing data is a common issue in datasets. It can adversely
affect the quality and accuracy of any analysis or model. Data cleaning involves strategies
to either fill missing values or remove rows or columns with missing data.
2) Removing Duplicates: Duplicates can skew analyses by inflating the importance of certain
data points. Data cleaning identifies and removes these redundant records to maintain data
integrity.
3) Data Type Conversion: Data often needs to be converted to the appropriate data types for
accurate analysis. For instance, converting strings to numerical or datetime formats for
meaningful calculations.
4) Outlier Detection and Handling: Outliers are data points significantly different from
other observations and can distort statistical analyses. Data cleaning involves identifying
and either removing or handling outliers appropriately.
5) Normalization and Standardization: These techniques transform numerical data to a
standard scale, making it easier to compare and analyze. Normalization scales data to a
range of [0, 1], while standardization transforms data to have a mean of 0 and a standard
deviation of 1.
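The two scaling techniques above can be sketched directly in Pandas (the sample values below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample column for illustration
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization: rescale to the range [0, 1]
normalized = (s - s.min()) / (s.max() - s.min())

# Standardization: mean 0, standard deviation 1 (population std, ddof=0)
standardized = (s - s.mean()) / s.std(ddof=0)

print(normalized.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardized.mean(), standardized.std(ddof=0))
```

Normalization preserves the relative spacing of values, while standardization expresses each value as a number of standard deviations from the mean.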

Importance of Data Cleaning:


1) Improves Data Quality: Data cleaning enhances the overall quality and accuracy of the
dataset by eliminating errors, inconsistencies, and inaccuracies.
2) Increases Analysis Reliability: Clean data leads to more reliable and accurate analyses,
aiding in making informed decisions and drawing meaningful insights.
3) Enhances Model Performance: Preprocessing, including data cleaning, significantly
impacts model performance in machine learning. Clean, well-prepared data can lead to
better predictive models.
4) Saves Time and Resources: By addressing data quality issues early on, data cleaning saves
time during analysis and prevents wasted resources on faulty insights derived from flawed
data.
5) Ensures Data Consistency: Data cleaning ensures consistency in the dataset, making it
easier to combine and integrate with other datasets or systems.
6) Increases Data Usability: Cleaned data is more usable and understandable, facilitating
easier sharing, collaboration, and reuse of the dataset.
2. Pandas and Its Role in Data Preprocessing
Pandas is an open-source Python library that provides powerful data structures and data analysis
tools. It is widely used for data manipulation, analysis, and preparation. The primary data
structures in Pandas are Series (1-dimensional labeled arrays) and DataFrame (2-dimensional
labeled data structures), allowing for easy handling and processing of tabular data.

Key Features of Pandas:


1) DataFrame:
- A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure.
- Contains labeled axes (rows and columns) and supports arithmetic operations on rows and columns.
2) Series:
- A one-dimensional labeled array capable of holding any data type (integer, string, float, Python objects, etc.).
- Essentially, it's a single column of a DataFrame.
3) Data Manipulation:
- Supports various operations such as merging, reshaping, slicing, indexing, and more.
- Facilitates handling missing data and time-series data.
4) Data Input/Output:
- Provides utilities for reading and writing data in different formats, including CSV, Excel, SQL databases, and more.
5) Data Cleaning:
- Offers a wide array of functions to handle missing values, duplicates, and outliers.
- Allows for data type conversion, scaling, and normalization.
6) Data Analysis:
- Supports statistical, arithmetic, and mathematical operations on data.
- Provides tools for descriptive statistics, correlation, groupby operations, and more.
7) Integration with Other Libraries:
- Integrates seamlessly with other Python libraries used in data science, such as NumPy, Matplotlib, Seaborn, and Scikit-Learn.
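A minimal sketch of the two core structures (the column names and values below are invented for illustration):

```python
import pandas as pd

# A small, made-up DataFrame: labeled rows and columns
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "score": [85, 92, 78],
})

# Selecting a single column yields a Series
col = df["score"]
print(type(df).__name__, type(col).__name__)  # DataFrame Series

# Labeled axes support arithmetic directly on columns
df["score_plus_5"] = df["score"] + 5
print(df["score_plus_5"].tolist())  # [90, 97, 83]
```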

Role of Pandas in Data Preprocessing:


1) Data Loading: Pandas allows easy loading of data from various sources such as CSV,
Excel, SQL, JSON, HTML, and more into a DataFrame, providing a convenient starting
point for data preprocessing.
2) Handling Missing Data: Pandas offers functions like isnull(), notnull(), dropna(), and
fillna() to identify, drop, or fill missing data, ensuring the dataset is complete.
3) Removing Duplicates: The duplicated() and drop_duplicates() functions in Pandas help
identify and remove duplicate records from the dataset.
4) Data Transformation: Pandas facilitates data type conversions using functions like
astype(), making it easy to change data types as needed.
5) Data Normalization and Standardization: Pandas works smoothly alongside scikit-learn, whose MinMaxScaler and StandardScaler classes are commonly used to scale and standardize numerical columns during preprocessing for machine learning.
6) Exploratory Data Analysis (EDA): Pandas supports data exploration through various
functions such as describe(), value_counts(), groupby(), and more, aiding in understanding
the dataset's characteristics.
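The cleaning functions listed above can be sketched on a small made-up DataFrame (column names and values are assumptions for illustration, not the practical's actual dataset):

```python
import pandas as pd
import numpy as np

# Hypothetical DataFrame with missing values and a duplicate row
df = pd.DataFrame({
    "id": ["1", "2", "2", "3"],
    "value": [10.0, np.nan, np.nan, 30.0],
})

print(df["value"].isnull().sum())                     # count missing entries: 2

df = df.drop_duplicates()                             # remove the exact duplicate row
df["value"] = df["value"].fillna(df["value"].mean())  # fill gaps with the column mean
df["id"] = df["id"].astype(int)                       # convert strings to integers

print(df["value"].tolist())  # [10.0, 20.0, 30.0]
```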

Algorithm:
1. Import the Necessary Libraries:
- Import Pandas, NumPy, SciPy's stats module, and scikit-learn's MinMaxScaler and StandardScaler.
2. Load the CSV File:
- Define the file path to the CSV file.
- Load the CSV file into a Pandas DataFrame.
3. Display the Initial Data:
- Print a message indicating the display of the initial DataFrame.
- Display the first few rows of the DataFrame to understand the initial data.
4. Handle Missing Values:
- Fill missing numerical values with the mean of each respective column.
5. Drop Duplicate Rows:
- Remove duplicate rows from the DataFrame.
6. Drop Rows with Specific Missing Values:
- Drop rows with missing values in specific columns.
7. Fill Missing Values in Specific Columns:
- Fill missing values in specified columns with a default value.
8. Data Type Conversion:
- Convert a specific column to the desired data type.
9. Convert a Column to Datetime Type:
- Convert a column to the datetime data type.
10. Remove Outliers:
- Define a function to remove outliers using the z-score.
- Remove outliers from a specific numerical column.
11. Normalize a Numerical Column:
- Normalize a numerical column to the range [0, 1].
12. Standardize a Numerical Column:
- Standardize a numerical column (mean = 0, standard deviation = 1).
13. Display the Cleaned DataFrame:
- Print a message indicating the display of the cleaned DataFrame.
- Display the first few rows of the cleaned DataFrame.

CODE AND OUTPUT:
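One possible implementation of the algorithm above is sketched below. Since the practical's actual CSV is not reproduced here, the script first writes a small made-up sample_data.csv; all column names and values are illustrative assumptions, with an age of 120 planted as an outlier. The z-score threshold is lowered to 1.5 only because this sample is tiny; 3 is the conventional choice.

```python
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Create a small sample CSV so the script is self-contained
# (values are illustrative assumptions; 120 is a planted outlier)
pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 5],
    "name": ["A", "B", "B", None, "D", "E"],
    "dept": ["HR", None, None, "IT", "HR", None],
    "age": [25, np.nan, np.nan, 30, 120, 28],
    "salary": [50000, 60000, 60000, np.nan, 55000, 58000],
    "join_date": ["2021-01-05", "2020-07-19", "2020-07-19",
                  "2019-03-11", "2022-10-02", "2021-06-30"],
}).to_csv("sample_data.csv", index=False)

# Steps 2-3: load the CSV and display the initial data
df = pd.read_csv("sample_data.csv")
print("Initial DataFrame:")
print(df.head())

# Step 4: fill missing numerical values with each column's mean
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Step 5: drop duplicate rows
df = df.drop_duplicates()

# Step 6: drop rows with a missing 'name'
df = df.dropna(subset=["name"])

# Step 7: fill missing values in 'dept' with a default
df["dept"] = df["dept"].fillna("Unknown")

# Steps 8-9: data type conversions
df["id"] = df["id"].astype(int)
df["join_date"] = pd.to_datetime(df["join_date"])

# Step 10: remove outliers using the z-score (threshold 1.5 here
# only because the sample is tiny; 3 is the usual choice)
def remove_outliers(frame, column, threshold=1.5):
    z = np.abs(stats.zscore(frame[column]))
    return frame[z < threshold]

df = remove_outliers(df, "age").copy()

# Step 11: normalize 'salary' to the range [0, 1]
df[["salary"]] = MinMaxScaler().fit_transform(df[["salary"]])

# Step 12: standardize 'age' (mean 0, standard deviation 1)
df[["age"]] = StandardScaler().fit_transform(df[["age"]])

# Step 13: display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(df.head())
```

On this sample, the duplicate id-2 row is dropped, the row with the missing name is removed, the age of 120 is filtered out as an outlier, and the remaining salary values land exactly on [0, 1] after min-max scaling.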
