AIM:
Write a program using Pandas to load a CSV file containing sample data and perform data cleaning.
THEORY:
In this practical, we will demonstrate how to clean and preprocess data from a CSV file using Python's
Pandas library. We'll cover essential data cleaning techniques, such as handling missing values,
removing duplicates, converting data types, handling outliers, and standardizing or normalizing
numerical columns. The sample data is laid out like a typical database table, with one record per
row, and we'll create a sample CSV file to showcase the cleaning process.
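A minimal sketch of how such a sample CSV file might be created. The column names, the values, and the file name sample_data.csv are hypothetical. They are chosen only so that the file contains the kinds of problems cleaned later: a missing value, a duplicate row, and an extreme value. The file is written to a temporary directory so the sketch leaves nothing behind in the working directory.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical records; every column name and value is illustrative only.
raw = pd.DataFrame({
    "id": [1, 2, 3, 3, 4],
    "name": ["Asha", "Ben", "Chen", "Chen", None],    # one missing name
    "age": [34, np.nan, 29, 29, 41],                  # one missing age
    "salary": [52000, 48000, 51000, 51000, 950000],   # one extreme salary
    "join_date": ["2021-03-15", "2020-11-01", "2022-07-19",
                  "2022-07-19", "2019-05-30"],
})

# Write the sample file, then read it back the way the practical would.
csv_path = Path(tempfile.mkdtemp()) / "sample_data.csv"
raw.to_csv(csv_path, index=False)

reloaded = pd.read_csv(csv_path)
print(reloaded)
```

The two rows with id 3 are identical on purpose, so the duplicate-removal step has something to remove.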
Key Concepts
1. Data Cleaning and Its Importance
Data cleaning, often treated as part of data preprocessing or data wrangling, is a critical step in the data
analysis and machine learning lifecycle. It involves identifying and correcting errors,
inconsistencies, and inaccuracies in the dataset to ensure that the data is accurate, reliable, and
suitable for analysis. The process often includes handling missing or incomplete data, removing
duplicates, transforming data types, dealing with outliers, and normalizing or standardizing
data.
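Before any cleaning, it helps to measure the problems described above. The following is a small sketch, assuming a hypothetical DataFrame with made-up columns, of how missing values, duplicate rows, and unexpected data types might be counted with Pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data with typical defects: the column names are illustrative.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "age": ["34", "29", "29", None],              # numbers stored as text
    "salary": [52000.0, 48000.0, 48000.0, np.nan],
})

# Number of missing entries in each column.
missing_per_column = df.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Data type of each column; a text type for "age" signals a needed conversion.
column_types = df.dtypes

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print(column_types)
```

Each of these checks corresponds to one of the cleaning operations: filling or dropping missing values, removing duplicates, and converting data types.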
Algorithm:
1. Import the Necessary Libraries:
Import Pandas, NumPy, SciPy's stats module, and scikit-learn's MinMaxScaler and
StandardScaler.
2. Load the CSV File:
Define the file path to the CSV file.
Load the CSV file into a Pandas DataFrame.
3. Display the Initial Data:
Print a heading that announces the initial DataFrame.
Display the first few rows of the DataFrame to understand the initial data.
4. Handle Missing Values:
Fill missing numerical values with the mean of each respective column.
5. Drop Duplicate Rows:
Remove duplicate rows from the DataFrame.
6. Drop Rows with Specific Missing Values:
Drop rows with missing values in specific columns.
7. Fill Missing Values in Specific Columns:
Fill missing values in specified columns with a default value.
8. Data Type Conversion:
Convert a specific column to the desired data type.
9. Convert a Column to Datetime Type:
Convert a column to the datetime data type.
10. Remove Outliers:
Define a function that removes outliers based on the z-score, dropping rows whose absolute z-score exceeds a chosen threshold.
Apply the function to a specific numerical column.
11. Normalize a Numerical Column:
Normalize a numerical column to the range [0, 1].
12. Standardize a Numerical Column:
Standardize a numerical column (mean=0, standard deviation=1).
13. Display the Cleaned DataFrame:
Print a heading that announces the cleaned DataFrame.
Display the first few rows of the cleaned DataFrame.
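The steps above can be sketched as follows. This is one possible reading of the algorithm, not a definitive implementation. The sample data, the column names, the default value "Unknown", the helper remove_outliers_zscore, and the output columns salary_norm and salary_std are all assumptions made for illustration. To keep the sketch runnable with Pandas alone, the z-score and the two scalings are written as plain arithmetic in place of SciPy and scikit-learn. The formulas use the population standard deviation (ddof=0), which is also the default in scipy.stats.zscore and StandardScaler.

```python
import io

import pandas as pd

# Hypothetical sample data, inlined so the sketch needs no file on disk.
# pd.read_csv accepts a file path in the same way.
CSV_TEXT = """id,name,age,salary,join_date
1,Asha,34,52000,2021-03-15
2,Ben,,48000,2020-11-01
3,Chen,29,51000,2022-07-19
3,Chen,29,51000,2022-07-19
4,,41,61000,2019-05-30
5,Esi,38,55000,
6,Faro,45,950000,2018-09-12
7,Gita,31,50000,2021-06-01
8,Hugo,27,47000,2022-02-14
"""


def remove_outliers_zscore(frame, column, threshold=3.0):
    """Keep rows whose absolute z-score in `column` is below `threshold`."""
    values = frame[column]
    z_scores = (values - values.mean()) / values.std(ddof=0)
    return frame[z_scores.abs() < threshold].copy()


# Steps 2-3: load the data and display the initial rows.
df = pd.read_csv(io.StringIO(CSV_TEXT))
print("Initial DataFrame:")
print(df.head())

# Step 4: fill missing numerical values with the mean of each column.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Step 5: drop exact duplicate rows.
df = df.drop_duplicates()

# Step 6: drop rows that lack a value in a required column (assumed: join_date).
df = df.dropna(subset=["join_date"]).reset_index(drop=True)

# Step 7: fill missing values in a chosen column with a default (assumed: "Unknown").
df["name"] = df["name"].fillna("Unknown")

# Step 8: convert a column to the desired type (age: float to integer).
df["age"] = df["age"].round().astype("int64")

# Step 9: convert a column to datetime.
df["join_date"] = pd.to_datetime(df["join_date"])

# Step 10: remove outliers from salary. A threshold of 3 is a common choice,
# but with only 7 rows left no z-score can reach 3, so an illustrative
# threshold of 2 is used here.
df = remove_outliers_zscore(df, "salary", threshold=2.0)

# Step 11: min-max normalize salary to the range [0, 1].
salary = df["salary"]
df["salary_norm"] = (salary - salary.min()) / (salary.max() - salary.min())

# Step 12: standardize salary to mean 0 and standard deviation 1.
df["salary_std"] = (salary - salary.mean()) / salary.std(ddof=0)

# Step 13: display the cleaned rows.
print("Cleaned DataFrame:")
print(df.head())
```

Writing the scaled values to new columns, rather than overwriting salary, is a design choice that keeps the original values available for comparison. With real data, the z-score threshold should be chosen to suit the size and spread of the dataset.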