
Introduction to Pandas: A Powerful Python Library

By: Logesh R
Agenda
• Overview of Pandas
• Pandas vs NumPy
• Installing and Importing Pandas
• Pandas Data Structures
• Operating on Data in Pandas
• Merging and Concatenating DataFrames
• Handling Missing Data

Introduction to Pandas
Definition and Importance
• Definition: Pandas is a Python library for data manipulation and analysis. It offers powerful data structures like Series and DataFrame, facilitating efficient handling and exploration of structured data.
• Importance: Pandas streamlines data cleaning, analysis, and manipulation tasks, enabling seamless exploration and extraction of insights from diverse datasets in Python.

Applications in Data Analysis
• Pandas finds applications in data analysis by streamlining tasks like data cleaning, exploration, and manipulation, facilitating efficient extraction of insights from diverse datasets in various industries.

Key Features
• Versatile data structures (Series, DataFrame)
• Seamless data manipulation
• Effective handling of missing data
• Integrated data visualization capabilities

Pandas vs NumPy
Differences and Complementary Roles

Aspect | Pandas | NumPy
Primary Use Case | Data manipulation and analysis, working with labeled data | Numerical operations, mathematical and logical operations on arrays
Data Structures | Series (1D labeled array), DataFrame (2D labeled table) | ndarray (N-dimensional array)
Indexing | Supports both label-based and integer-based indexing | Primarily integer-based indexing
Missing Data Handling | Built-in methods for identifying, dropping, and filling missing data | Limited support for handling missing data
Performance | Generally slower for numerical operations but optimized for data manipulation | Faster for numerical operations, especially on large datasets
Flexibility | Offers more flexibility for working with heterogeneous and labeled data | Limited flexibility when dealing with structured data
Functionality | Extensive functions for data cleaning, grouping, and analysis | Rich set of mathematical and array manipulation functions

When to Use Pandas vs NumPy

NumPy:
• Ideal for numerical operations, array manipulation, and mathematical functions.
Pandas:
• Suited for data manipulation, analysis, and cleaning, especially when dealing with
structured data in tabular form.
Use NumPy for:
• Mathematical and array operations.
Use Pandas for:
• Data cleaning, analysis, and manipulation in tabular datasets.
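The contrast above can be illustrated with a minimal sketch. The values and the labels 'a', 'b', 'c' are made up for illustration and are not part of the original slides.

```python
import numpy as np
import pandas as pd

# NumPy: a plain numeric array, addressed by integer position
arr = np.array([10.0, 20.0, 30.0])
arr_mean = arr.mean()            # numerical reduction over the array

# Pandas: the same values with labels attached (hypothetical labels)
s = pd.Series(arr, index=['a', 'b', 'c'])
by_label = s['b']                # label-based access
by_position = s.iloc[1]          # integer-based access

print(arr_mean, by_label, by_position)
```

Both libraries work together here: the Series is built directly from the NumPy array.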
Installing Pandas
Code:
!pip install pandas
Output:
Confirmation of successful installation.
Importing Pandas
Code:
import pandas as pd
Pandas Data Structures
Introduction to Series and DataFrame
Series:
Definition:
A one-dimensional labeled array capable of holding any data type.
Characteristics:
Indexed, Homogeneous data type, Size Immutable.
Use Cases:
Often used for representing a column in a dataset or a single-dimensional dataset.
DataFrame:
Definition:
A two-dimensional labeled data structure with columns that can be of different data types.
Characteristics:
Tabular structure, Indexed, Heterogeneous data types, Size Mutable.
Use Cases:
Represents a complete dataset, similar to a spreadsheet or SQL table.
Code to create a Pandas Series
Code:
import pandas as pd
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
Creating a DataFrame

Code:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)
Reading a Dataset
Code:
import pandas as pd
df = pd.read_csv('your_dataset.csv')
print(df.head())
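To try pd.read_csv() without a file on disk, one option is to read from an in-memory string. This is only a sketch: the CSV text below is hypothetical and stands in for 'your_dataset.csv'.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'your_dataset.csv'
csv_text = "Name,Age\nAlice,25\nBob,30\nCharlie,35\n"

# read_csv accepts a file path or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```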
Operating on Data in Pandas
Pandas in action! Perform basic operations,
statistical summaries, and mean calculations
effortlessly.

Basic Operations
• head(), tail(), sample()
• describe(), info(), dtypes
• min(), max(), mean()
head(), tail(), sample()
• df.head() - Displays the first rows of the DataFrame, in order from the top (5 rows by default).
• df.tail() - The opposite of head(). It displays the last rows of the DataFrame (5 rows by default).
• df.sample() - Displays randomly selected rows from your dataset (1 row by default).
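A short sketch of the three methods on a small made-up DataFrame. The column name 'value' and the choice of random_state are assumptions for illustration.

```python
import pandas as pd

# Ten rows so that head() and tail() show different slices
df = pd.DataFrame({'value': range(10)})

first_rows = df.head(3)                        # first 3 rows
last_rows = df.tail(3)                         # last 3 rows
random_rows = df.sample(n=3, random_state=0)   # 3 random rows; random_state makes the draw repeatable

print(first_rows)
print(last_rows)
print(random_rows)
```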
describe(), info(), dtypes
• df.describe() - The describe() method returns descriptive statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns of the DataFrame.
• df.info() - The info() method prints a concise summary of the DataFrame.
• df.dtypes - The dtypes attribute returns the data types in the DataFrame. It returns a Series with the data type of each column. Because it is an attribute, it is written without parentheses.
Code:
df.describe()
df.info()
df.dtypes
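As a rough illustration, the three summaries can be tried on the small Name/Age DataFrame used earlier in the deck:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

summary = df.describe()   # statistics for the numeric column (Age) only
df.info()                 # prints its summary directly and returns None
types = df.dtypes         # attribute, so no parentheses

print(summary)
print(types)
```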
min(), max(), mean()
• df['column'].mean() - Returns the mean of the values in the column.
• df['column'].min() - Finds the minimum of the values in the column and returns it.
• df['column'].max() - Finds the maximum of the values in the column and returns it.
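A minimal sketch of these column statistics, again using the Name/Age example data from the earlier slide:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

youngest = df['Age'].min()    # smallest value in the column
oldest = df['Age'].max()      # largest value in the column
average = df['Age'].mean()    # arithmetic mean of the column

print(youngest, oldest, average)
```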
Data Selection in Pandas
Code for data selection
df['Name'] - Selecting a column
df.loc[n] - Selecting a row by label
df.iloc[n] - Selecting a row by integer position
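The difference between loc and iloc is easiest to see when the row labels are not integers. In this sketch the row labels 'x', 'y', 'z' are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['x', 'y', 'z'],   # hypothetical row labels
)

names = df['Name']             # select a column
row_by_label = df.loc['y']     # select the row labeled 'y'
row_by_position = df.iloc[1]   # select the second row (position 1)

print(names)
print(row_by_label)
print(row_by_position)
```

Here both row selections return the same row, reached in two different ways.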
Filtering Data
df[condition] - Selecting rows that satisfy a condition.
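A small sketch of filtering, assuming the Name/Age example data. The threshold 28 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# The condition is a boolean Series; only rows where it is True are kept
over_28 = df[df['Age'] > 28]

# Conditions can be combined with & (and) or | (or); each needs its own parentheses
over_28_not_charlie = df[(df['Age'] > 28) & (df['Name'] != 'Charlie')]

print(over_28)
print(over_28_not_charlie)
```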
Grouping and Aggregating Data
The pandas DataFrame.groupby() function is used to split the data into groups based on some criteria. It also helps to aggregate data efficiently.
Code for grouping and aggregating data
grouped_df = df.groupby('column_name')
grouped_df.first()  # first entry of each group
grouped_df.get_group('group_name')  # rows belonging to one group
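The calls above can be sketched on a made-up dataset. The column names 'team' and 'score' are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'B', 'A', 'B'],
    'score': [10, 20, 30, 40],
})

grouped = df.groupby('team')

first_entries = grouped.first()        # first row of each group
team_a = grouped.get_group('A')        # all rows where team == 'A'
mean_scores = grouped['score'].mean()  # one aggregated value per group

print(first_entries)
print(team_a)
print(mean_scores)
```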
Merging and Concatenating DataFrames

Concatenating DataFrames
Code:
concatenated_df = pd.concat([df1, df2])
print(concatenated_df)
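A runnable version of the concatenation, with df1 and df2 filled in with made-up rows. The ignore_index option is shown as one way to renumber the result.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie'], 'Age': [35]})

# Rows of df2 are stacked below df1; original index values are kept
stacked = pd.concat([df1, df2])

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked)
print(renumbered)
```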
Merging DataFrames
The merge() method combines two DataFrames into a new DataFrame by matching rows on common columns or indices, using the specified join type.

There are five types of joins in pandas:
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join or simply Outer Join
• Index Join
• Inner Join - Returns a DataFrame with only those rows whose keys appear in both DataFrames.
• Left Outer Join - All the records from the first DataFrame are kept. From the second DataFrame, only the records whose keys are found in the first DataFrame are included.
• Right Outer Join - All the records from the second DataFrame are kept. From the first DataFrame, only the records whose keys are found in the second DataFrame are included.
• Full Outer Join or simply Outer Join - Returns all the rows from the left DataFrame and all the rows from the right DataFrame, matching up rows where possible, with NaNs elsewhere.
• Index Join - To merge the DataFrames on their indices, pass the left_index and right_index arguments as True. Both DataFrames are then merged on the index, using the default inner join.
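The five join types can be compared side by side in a small sketch. The DataFrames and the column names 'key', 'left_val', and 'right_val' are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

inner = pd.merge(left, right, on='key', how='inner')       # keys in both: b, c
left_join = pd.merge(left, right, on='key', how='left')    # all left keys: a, b, c
right_join = pd.merge(left, right, on='key', how='right')  # all right keys: b, c, d
outer = pd.merge(left, right, on='key', how='outer')       # all keys, NaN where no match

# Index join: move 'key' into the index, then merge on both indices (inner by default)
index_join = pd.merge(
    left.set_index('key'), right.set_index('key'),
    left_index=True, right_index=True,
)

print(inner, left_join, right_join, outer, index_join, sep='\n\n')
```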
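Handling Missing Data
Identifying Missing Data
Python code:
df.isnull()  # True where a value is missing
df.isnull().sum()  # number of missing values per column
Dropping Missing Values
Python code:
df.dropna()  # drops rows that contain missing values
Filling Missing Values
Python code:
df.fillna(value)  # value: the replacement to use for missing entries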
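The three steps can be tried together on a small example. The data is made up, and filling with the column mean is just one possible choice of fill value, not one prescribed by the slides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],   # Bob's age is missing
})

mask = df.isnull()                      # True where a value is missing
missing_per_column = df.isnull().sum()  # count of missing values per column

dropped = df.dropna()                          # removes the row with the missing age
filled = df.fillna({'Age': df['Age'].mean()})  # fills Age with the mean of the known ages

print(missing_per_column)
print(dropped)
print(filled)
```

Both dropna() and fillna() return new DataFrames here and leave df unchanged.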
Conclusion
Recap of Key Concepts
Pandas Fundamentals:
Series and DataFrame are the core structures for data manipulation and analysis.
Effortless installation and import with pip install pandas and import pandas as pd.
Data Exploration:
Reading datasets using pd.read_csv() to kickstart analysis.
Basic operations, statistical summaries, and mean calculations for quick insights.
Data Manipulation:
Data selection using labels (df.loc[]) and integer positions (df.iloc[]).
Filtering data based on specific conditions.
Advanced Operations:
Grouping and aggregating data for more in-depth analysis.
Merging and concatenating DataFrames to create comprehensive datasets.
Handling Missing Data:
Identifying missing values with df.isnull().
Dropping missing values using df.dropna() and filling them with df.fillna().
