
Python Data Processing (with code)

1. Introduction
Definition of data pre-processing

Data preprocessing is the process of preparing data for analysis by cleaning, transforming, and
selecting relevant features. It involves identifying and handling missing or duplicate data, scaling
features, encoding categorical data, reducing dimensionality, and splitting data into training and
testing sets.

Proper data preprocessing helps to ensure data accuracy and consistency, and leads to more
reliable results.

Python is a popular programming language used in data analysis and machine learning. It offers a
wide range of libraries and tools that can be used for data preprocessing tasks, such as data cleaning,
feature scaling, encoding categorical data, and reducing dimensionality.

Some of the popular libraries used for data preprocessing in Python include NumPy, Pandas,
Scikit-learn, and Matplotlib. These libraries provide various functions and methods that make it
easier to perform data preprocessing tasks efficiently and effectively.

Importance of data pre-processing

Data preprocessing is an essential step in data analysis and machine learning. It helps to ensure
data accuracy, consistency, and suitability for downstream analysis. Some of the reasons why data
preprocessing is important are:

1. Improves data accuracy: By identifying and handling missing or duplicate data, data scientists
can improve data accuracy, reducing the risk of errors and inaccuracies in the results.
2. Handles outliers: Outliers can skew the results of data analysis or machine learning
models. Data preprocessing techniques such as normalization or standardization can
help to handle outliers and improve the performance of models.
3. Enables feature scaling: Scaling features is an important step in data preprocessing that
helps to ensure that all features have the same scale. This is important for some machine
learning algorithms that are sensitive to the scale of features.
4. Encodes categorical data: Many machine learning algorithms cannot handle categorical
data. Therefore, data preprocessing techniques such as one-hot encoding or label
encoding can be used to convert categorical data into numerical data that can be used in
machine learning models.
5. Reduces dimensionality: Data preprocessing techniques such as principal component
analysis (PCA) can be used to reduce the dimensionality of data, making it easier to
analyze or model (a minimal sketch follows the summary below).

Overall, data preprocessing is critical to ensure that data is suitable for analysis and to obtain
reliable and accurate results. It helps to eliminate errors and inaccuracies, improves the
performance of machine learning models, and supports better decisions based on the data.
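
As a quick illustration of the last point above, here is a minimal sketch of reducing dimensionality with scikit-learn's PCA. The DataFrame, its feature columns, and the choice of two components are illustrative assumptions, not a prescribed recipe.

import pandas as pd
from sklearn.decomposition import PCA

# illustrative DataFrame with three numeric features (hypothetical values)
df = pd.DataFrame({'f1': [1.0, 2.0, 3.0, 4.0],
                   'f2': [2.0, 1.0, 4.0, 3.0],
                   'f3': [0.5, 0.7, 0.2, 0.9]})

# project the three features onto two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(df)
print(components.shape)  # (4, 2)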


2. Data pre-processing techniques

2.1 Data cleaning

Data cleaning involves various techniques that can be used to identify and handle missing or
erroneous data. Some of the techniques used in data cleaning are:

1. Removing duplicates: Duplicates can skew the results of data analysis or machine learning
models. Removing duplicates can improve the accuracy of results and reduce the risk of
errors.
2. Handling missing data: Missing data can be handled using various techniques, such as
deleting missing data, imputing missing data, or replacing missing data with values such as
mean or median.
3. Handling outliers: Outliers can be handled in much the same way as missing data. Techniques
such as winsorization (capping extreme values) or replacing outliers with missing values and then
imputing them can be used to handle outliers.
4. Standardizing or normalizing data: Standardizing or normalizing data involves scaling the
data to a common scale. This is important for some machine learning algorithms that are
sensitive to the scale of features.
5. Encoding categorical data: Categorical data can be encoded into numerical data using
techniques such as one-hot encoding or label encoding. This is important for some machine
learning algorithms that cannot handle categorical data.
6. Feature selection: Feature selection involves selecting relevant features for analysis or
modeling. This is important for reducing dimensionality and improving the
performance of machine learning models.
7. Handling data errors: Data errors, such as data entry errors or formatting errors, can be
handled using various techniques, such as data validation or data profiling.

Overall, data cleaning is an important step in data preprocessing that ensures data accuracy,
consistency, and suitability for downstream analysis. It involves various techniques that can be
used to identify and handle missing or erroneous data and improve the performance of machine
learning models.
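
To make a couple of these techniques concrete, the sketch below removes duplicate rows and winsorizes an illustrative numeric column by clipping it to chosen percentiles; the column name and cut-offs are assumptions made up for the example.

import pandas as pd

# illustrative DataFrame with a duplicate row and an extreme value (hypothetical data)
df = pd.DataFrame({'numerical_column': [1.0, 2.0, 2.0, 3.0, 250.0]})

# removing duplicates
df = df.drop_duplicates()

# winsorization: clip values to the 5th and 95th percentiles (cut-offs are illustrative)
lower = df['numerical_column'].quantile(0.05)
upper = df['numerical_column'].quantile(0.95)
df['numerical_column'] = df['numerical_column'].clip(lower, upper)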

2.2 Data transformation

Data transformation is the process of converting data from one format or structure to another. It
is an important step in data preprocessing that can help to improve the quality of data and make
it more suitable for analysis or modeling. Some of the techniques used in data transformation are:

1. Scaling: Scaling involves rescaling the data to a common scale, such as between 0 and 1 or -1
and 1. This is important for some machine learning algorithms that are sensitive to the scale of
features.
2. Normalization: Normalization involves transforming the data so that it has a normal
distribution. This is important for some statistical analyses and machine learning algorithms
that assume a normal distribution.


3. Aggregation: Aggregation involves combining multiple data points into a single data point.
This can be useful for summarizing data and reducing dimensionality.
4. Discretization: Discretization involves converting continuous data into categorical data. This
can be useful for some machine learning algorithms that cannot handle continuous data.
5. Encoding: Encoding involves converting categorical data into numerical data. This is
important for some machine learning algorithms that cannot handle categorical data.
6. Feature engineering: Feature engineering involves creating new features from existing
features. This can be useful for improving the performance of machine learning models.

Overall, data transformation is an important step in data preprocessing that can help to
improve the quality of data and make it more suitable for downstream analysis or modeling.
It involves various techniques that can be used to rescale, normalize, aggregate, discretize,
encode, or engineer features.
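
As a brief, hedged sketch of two of these techniques, the snippet below aggregates an illustrative sales table per customer and engineers a new ratio feature; the DataFrame and column names are assumptions made up for the example.

import pandas as pd

# illustrative sales table (hypothetical data)
df = pd.DataFrame({'customer': ['a', 'a', 'b', 'b'],
                   'amount': [10.0, 15.0, 7.0, 3.0],
                   'quantity': [2, 3, 1, 1]})

# aggregation: combine several rows into one summary row per customer
per_customer = df.groupby('customer')['amount'].sum().reset_index()

# feature engineering: derive a new feature from existing ones
df['unit_price'] = df['amount'] / df['quantity']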

2.3 Data selection

Data selection is the process of selecting a subset of data from a larger dataset based on certain
criteria. It is an important step in data preprocessing that can help to reduce the size of the dataset
and focus on relevant data for analysis or modeling.

There are several techniques used for data selection, including:

1. Random sampling: Random sampling involves selecting a random subset of data from
the larger dataset. This is useful when the dataset is too large to process as a whole and
a representative sample is needed.
2. Stratified sampling: Stratified sampling involves dividing the dataset into subgroups based
on a specific variable and then selecting a random sample from each subgroup. This is
useful when the variable is important for analysis or modeling.
3. Feature selection: Feature selection involves selecting a subset of features from the dataset
based on their relevance to the analysis or modeling task. This is useful for reducing the
dimensionality of the dataset and improving the performance of the model.
4. Instance selection: Instance selection involves selecting a subset of instances from the
dataset based on their relevance to the analysis or modeling task. This is useful for reducing
the size of the dataset and focusing on relevant data.

Overall, data selection is an important step in data preprocessing that can help to reduce the size
of the dataset and focus on relevant data for analysis or modeling. There are several techniques
that can be used for data selection, including random sampling, stratified sampling, feature
selection, and instance selection. The choice of technique will depend on the specific needs of the
analysis or modeling task.
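
The following sketch illustrates random sampling, a simple stratified sample, and feature selection in pandas; the DataFrame, column names, and sample sizes are illustrative assumptions.

import pandas as pd

# illustrative DataFrame (hypothetical data)
df = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'x1': [1, 2, 3, 4, 5, 6],
                   'x2': [6, 5, 4, 3, 2, 1],
                   'y': [0, 1, 0, 1, 0, 1]})

# random sampling: keep roughly half of the rows
sample = df.sample(frac=0.5, random_state=42)

# stratified sampling: draw one random row from each value of 'group'
stratified = df.groupby('group', group_keys=False).apply(lambda g: g.sample(n=1, random_state=42))

# feature selection: keep only the columns considered relevant
selected = df[['x1', 'y']]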


3. Pandas for data pre-processing


Pandas is a popular Python library used for data manipulation and analysis. It provides data structures for
efficiently storing and processing large datasets, as well as tools for data cleaning, aggregation, and
transformation. In this section, we will provide an overview of the Pandas library, its main features, and
techniques for data preprocessing using Pandas, along with code examples.

3.1 Overview of the Pandas Library

Pandas provides two primary data structures for storing and manipulating data: Series and DataFrame. A Series
is a one-dimensional array-like object that can hold any data type, while a DataFrame is a two-dimensional
table-like object consisting of rows and columns.
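
For example, both structures can be created directly from ordinary Python objects (the values below are illustrative):

import pandas as pd

# a one-dimensional Series
s = pd.Series([10, 20, 30], name='values')

# a two-dimensional DataFrame with labelled columns
df = pd.DataFrame({'name': ['alice', 'bob'], 'age': [25, 30]})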

The main features of Pandas include:

1. Data cleaning and transformation: Pandas provides tools for cleaning and transforming data,
including methods for handling missing data, removing duplicates, and replacing values.
2. Data aggregation: Pandas can group data based on one or more variables and perform aggregate
operations on each group, such as sum, mean, and count.
3. Data merging and joining: Pandas can merge multiple datasets based on common columns or
indices, or join two datasets based on a common key.
4. Time series analysis: Pandas provides functionality for working with time series data, including resampling,
moving window statistics, and time zone handling.
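
As a brief, hedged illustration of the last feature, the snippet below resamples an illustrative daily series to monthly means and computes a 7-day moving average; the index, column, and window size are assumptions made up for the example.

import pandas as pd

# illustrative daily time series (hypothetical values)
ts = pd.DataFrame({'value': range(60)},
                  index=pd.date_range('2023-01-01', periods=60, freq='D'))

# resample the daily values to monthly frequency, taking the mean of each month
monthly = ts.resample('M').mean()

# moving window statistics: 7-day rolling mean
rolling_mean = ts['value'].rolling(window=7).mean()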

3.2 Techniques for Data Preprocessing using Pandas

3.2.1 Data Cleaning with Pandas

One common task in data preprocessing is cleaning the data, which involves handling missing values,
removing duplicates, and correcting errors. Pandas provides several methods for cleaning data,
including:

1. Handling missing data: Pandas provides methods for filling in missing data or dropping missing data points.
For example, the dropna() method drops any rows or columns that contain missing data, while the fillna()
method fills in missing data with a specified value.
2. Removing duplicates: Pandas provides a drop_duplicates() method that removes duplicate rows
from a DataFrame.
3. Correcting errors: Pandas provides methods for replacing or removing incorrect values. For example,
the replace() method can be used to replace specific values with new values.
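
A short sketch of these methods on an illustrative DataFrame follows; the column names, the typo being corrected, and the fill value are assumptions made up for the example.

import pandas as pd
import numpy as np

# illustrative DataFrame with a missing value, a duplicated row, and a data entry error
df = pd.DataFrame({'city': ['NY', 'NY', 'Lodnon'], 'temp': [21.0, 21.0, np.nan]})

df = df.fillna(df['temp'].mean())                 # fill the missing temperature with the mean
df = df.drop_duplicates()                         # remove the duplicated row
df = df.replace({'city': {'Lodnon': 'London'}})   # correct an erroneous value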


3.2.2 Data Transformation with Pandas

Another important task in data preprocessing is transforming the data to make it more suitable for analysis
or modeling. Pandas provides several methods for transforming data, including:

1. Filtering data: Pandas provides methods for selecting specific rows or columns based on criteria such as a
specific value, a range of values, or a boolean expression. For example, the loc[] indexer can be used to
select rows and columns by label, while the iloc[] indexer can be used to select rows and columns by index.
2. Sorting data: Pandas provides a sort_values() method for sorting a DataFrame by one or more columns or
indices.
3. Grouping data: Pandas provides a groupby() method for grouping a DataFrame by one or more variables
and performing aggregate operations on each group, such as sum, mean, and count.
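
A compact, hedged sketch of these three operations; the DataFrame and column names are illustrative assumptions.

import pandas as pd

# illustrative DataFrame (hypothetical data)
df = pd.DataFrame({'dept': ['it', 'it', 'hr'],
                   'salary': [70, 80, 60],
                   'age': [30, 40, 35]})

subset = df.loc[df['salary'] > 65, ['dept', 'salary']]   # filter rows with a boolean expression
first_two = df.iloc[0:2]                                  # select rows by integer position
ordered = df.sort_values(by='salary', ascending=False)    # sort by one column
by_dept = df.groupby('dept')['salary'].mean()             # aggregate per group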

3.2.3 Data Merging and Joining with Pandas

When working with multiple datasets, it is often necessary to merge or join them together based on a common
column or key. Pandas provides several methods for merging and joining data, including:

merge(): merges two DataFrames based on a common column or key.

join(): joins two DataFrames based on their indices.

concat(): concatenates multiple DataFrames along a specified axis.
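
A minimal sketch of the three operations on two illustrative DataFrames; the table names and keys are assumptions made up for the example.

import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['alice', 'bob']})
orders = pd.DataFrame({'customer_id': [1, 1, 2], 'amount': [10.0, 5.0, 7.0]})

# merge(): combine the two tables on a common column
merged = pd.merge(orders, customers, on='customer_id')

# join(): combine the two tables on their indices
joined = customers.set_index('customer_id').join(orders.set_index('customer_id'))

# concat(): stack DataFrames along an axis
stacked = pd.concat([customers, customers], axis=0)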


3.2.4 Examples of Data Cleaning and Transformation with Pandas

1. Reading a CSV file

import pandas as pd
df = pd.read_csv('filename.csv')

2. Checking the shape of the DataFrame

print(df.shape)

3. Checking the data types of the columns

print(df.dtypes)

4. Checking the number of missing values in each column

print(df.isnull().sum())


5. Dropping columns

df.drop(['column1', 'column2'], axis=1, inplace=True)

6. Renaming columns

df.rename(columns={'old_name': 'new_name'}, inplace=True)

7. Changing the data type of a column

df['column'] = df['column'].astype('float')

8. Handling missing data (dropping rows with missing values)

df.dropna(inplace=True)

9. Handling missing data (imputing missing values with the median)

df.fillna(df.median(numeric_only=True), inplace=True)


10. Handling missing data (imputing missing values with the mean)

df.fillna(df.mean(numeric_only=True), inplace=True)

11. Handling missing data (imputing missing values with a constant)

df.fillna(0, inplace=True)

12. Handling categorical data (creating dummy variables)

df = pd.get_dummies(df, columns=['categorical_column'])

13. Handling categorical data (label encoding)

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['encoded_column'] = encoder.fit_transform(df['categorical_column'])


14. Handling numerical data (binning)

df['binned_column'] = pd.cut(
df['numerical_column'],
bins=5,
labels=['very_low', 'low', 'medium', 'high', 'very_high'])

15. Handling numerical data (scaling to a range)

from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()
df['scaled_column'] = scaler.fit_transform(df[['numerical_column']])

16. Handling numerical data (standardization)

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
df['standardized_column'] = scaler.fit_transform(df[['numerical_column']])

17. Handling datetime data (converting to datetime format)

df['datetime_column'] = pd.to_datetime(df['datetime_column'])


18. Handling datetime data (extracting year)

df['year'] = df['datetime_column'].dt.year

19. Handling datetime data (extracting month)

df['month'] = df['datetime_column'].dt.month

20. Handling datetime data (extracting day)

df['day'] = df['datetime_column'].dt.day

21. Handling text data (converting to lowercase)

df['text_column'] = df['text_column'].str.lower()


22. Handling text data (removing punctuation)

import string
df['text_column'] = df['text_column'].str.translate(
str.maketrans('', '', string.punctuation))

23. Handling text data (removing stop words)

from nltk.corpus import stopwords


stop_words = set(stopwords.words('english'))
df['text_column'] = df['text_column'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

Jignesh Sanghvi
