L32, 33 Pandas

L 32
Pandas:
Pandas is a powerful open-source Python library for data manipulation and analysis. The name
"Pandas" is derived from "panel data." It provides two primary data structures: Series and DataFrame.
Pandas also includes functions to read data from various sources, such as CSV and other delimited
text files, Excel workbooks, and SQL databases. It simplifies data cleaning tasks such as handling
missing data, removing duplicates, and transforming data.
Series:
A Pandas Series is a one-dimensional array-like object that can hold any data type. It is similar to a
column in a table or a list in Python. Series automatically assigns an index to each element.
Example:
import pandas as pd
# Creating a Series from a list
data_list = [10, 20, 30, 40, 50]
series_from_list = pd.Series(data_list)
# Creating a Series from a NumPy array
import numpy as np
data_array = np.array([1, 2, 3, 4, 5])
series_from_array = pd.Series(data_array)
# Creating a Series with a custom index (the dictionary keys become the index labels)
data_dict = {'a': 100, 'b': 200, 'c': 300, 'd': 400, 'e': 500}
series_custom_index = pd.Series(data_dict)
# Displaying the Series
print("Series from List:")
print(series_from_list)
print("\nSeries from Array:")
print(series_from_array)
print("\nSeries with Custom Index:")
print(series_custom_index)
Output:
Series from List:
0 10
1 20
2 30
3 40
4 50
Series from Array:
0 1
1 2
2 3
3 4
4 5
Series with Custom Index:
a 100
b 200
c 300
d 400
e 500
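Once a Series has an index, its elements can be read back by label or by position. The following is a minimal sketch of that idea, reusing the custom-index data from the example above; the variable names (s, by_label, by_position, subset, doubled) are chosen here for illustration only.

```python
import pandas as pd

# Same label-indexed data as in the custom-index example
s = pd.Series({'a': 100, 'b': 200, 'c': 300, 'd': 400, 'e': 500})

by_label = s.loc['c']        # access by index label
by_position = s.iloc[0]      # access by integer position
subset = s.loc[['a', 'e']]   # select several labels at once
doubled = s * 2              # arithmetic is element-wise and keeps the index

print(by_label)          # 300
print(by_position)       # 100
print(subset.tolist())   # [100, 500]
print(doubled['b'])      # 400
```

Using .loc for labels and .iloc for positions keeps the two kinds of lookup explicit, which helps when the index labels are themselves integers.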
DataFrame:
A Pandas DataFrame is a two-dimensional labeled data structure with columns that can be of different
data types. It is similar to a spreadsheet or SQL table. You can create a DataFrame from dictionaries,
lists, or NumPy arrays.
Example:
import pandas as pd
# Creating a DataFrame from a dictionary of lists
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df_from_dict = pd.DataFrame(data)
# Creating a DataFrame from a NumPy array with custom index and columns
import numpy as np
data_array = np.random.randn(3, 3)
df_from_array = pd.DataFrame(data_array,
                             index=['Row1', 'Row2', 'Row3'],
                             columns=['Column1', 'Column2', 'Column3'])
# Displaying the DataFrames
print("DataFrame from Dictionary:")
print(df_from_dict)
print("\nDataFrame from NumPy Array:")
print(df_from_array)
Output:
DataFrame from Dictionary:
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
2 Charlie 35 Los Angeles
DataFrame from NumPy Array:
Column1 Column2 Column3
Row1 0.643650 0.162998 1.635538
Row2 0.093151 0.156120 0.667025
Row3 0.152770 -0.430975 -1.330561
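The index parameter specifies custom row labels, and the columns parameter specifies custom
column names. Because np.random.randn() generates random values, the numbers in the second
DataFrame will differ on every run.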
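To make the "columns of different data types" point concrete, here is a small sketch that inspects the column types of the dictionary-based DataFrame and selects data from it. The variable names (ages, older) and the age threshold of 26 are illustrative assumptions, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

# Each column carries its own data type (integers for Age, text for the others)
print(df.dtypes)

ages = df['Age']             # selecting a single column returns a Series
older = df[df['Age'] > 26]   # boolean filtering keeps only the matching rows

print(df.shape)                  # (3, 3)
print(older['Name'].tolist())    # ['Bob', 'Charlie']
```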
Reading Data:
Pandas provides a reader function for each common data source: pd.read_csv() for CSV files,
pd.read_excel() for Excel files, pd.read_json() for JSON files, and pd.read_sql() for SQL databases.
Example 1:
Let's say we have a CSV file named "example.csv" with the following content:
Name,Age,City
Alice,25,New York
Bob,30,San Francisco
Charlie,35,Los Angeles
import pandas as pd
# Reading from a CSV file
df = pd.read_csv('example.csv')
print(df)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
2 Charlie 35 Los Angeles
Example 2:
Suppose you have a tab-separated values (TSV) file named "example.txt" with the following content
(the columns are separated by tab characters):
Name Age City
Alice 25 New York
Bob 30 San Francisco
Charlie 35 Los Angeles
import pandas as pd
# Reading data from a TSV text file
df = pd.read_csv('example.txt', delimiter='\t')
# Displaying the DataFrame
print("DataFrame from TSV:")
print(df)
Output:
DataFrame from TSV:
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
2 Charlie 35 Los Angeles
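Both examples above assume the files already exist on disk. As a sketch that runs without creating any files, the same data can be passed to pd.read_csv() through an in-memory text buffer (io.StringIO from the standard library). The variable names csv_text, tsv_text, df_csv and df_tsv are illustrative.

```python
import io
import pandas as pd

# CSV content held in a string instead of example.csv
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,San Francisco
Charlie,35,Los Angeles
"""
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same data with tab characters as the separator, as in example.txt
tsv_text = csv_text.replace(',', '\t')
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df_csv)
print(df_csv.equals(df_tsv))   # True: both readers produce the same DataFrame
```

Here sep='\t' is used; delimiter='\t' in the example above is an alias for the same parameter.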
L 33
ANALYZING AND CLEANING DATA
Imagine a dataset of students that records their names, ages, and grades, with some values
missing. Assume you have a CSV file named student_data.csv with the following content:
Name,Age,Grade
John,18,A
Sara,20,B
Tom,,C
Alice,22,D
Bob,19,
Here's a step-by-step guide to analyzing and cleaning this dataset using Pandas:
Step 1: Import Pandas
import pandas as pd
Step 2: Load the Dataset
# Load the dataset into a Pandas DataFrame
df = pd.read_csv('student_data.csv')
Step 3: Explore the Dataset
# Display the first few rows of the DataFrame
print(df.head())
Output:
Name Age Grade
0 John 18.0 A
1 Sara 20.0 B
2 Tom NaN C
3 Alice 22.0 D
4 Bob 19.0 NaN
Step 4: Check for Missing Values
# Check for missing values in each column
print(df.isnull().sum())
Output:
Name 0
Age 1
Grade 1
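The per-column counts show how many values are missing, but not which rows are affected. A minimal sketch of locating those rows is given below; it loads the same student data from an in-memory string so that it runs on its own, and the names rows_with_missing and total_missing are illustrative.

```python
import io
import pandas as pd

csv_text = """Name,Age,Grade
John,18,A
Sara,20,B
Tom,,C
Alice,22,D
Bob,19,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the rows that have at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# Total number of missing cells in the whole DataFrame
total_missing = int(df.isnull().sum().sum())
print(total_missing)   # 2
```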
Step 5: Handle Missing Values
In this case, we'll fill missing ages with the median age (19.5 for this dataset) and missing grades
with a default value, say 'F'.
# Fill missing values in the 'Age' column with the median age
# (the result is assigned back to the column rather than modified in place)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Fill missing values in the 'Grade' column with 'F'
df['Grade'] = df['Grade'].fillna('F')
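As an alternative sketch, both columns can be filled in a single call by passing DataFrame.fillna() a dictionary that maps each column name to its fill value. The data is again loaded from an in-memory string so the block is self-contained; fill_values and cleaned are illustrative names.

```python
import io
import pandas as pd

csv_text = """Name,Age,Grade
John,18,A
Sara,20,B
Tom,,C
Alice,22,D
Bob,19,
"""
df = pd.read_csv(io.StringIO(csv_text))

# One fill value per column: the median for Age, the default 'F' for Grade
fill_values = {'Age': df['Age'].median(), 'Grade': 'F'}
cleaned = df.fillna(fill_values)   # returns a new DataFrame; df is unchanged

print(cleaned)
print(int(cleaned.isnull().sum().sum()))   # 0
```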
Step 6: Convert Data Types
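Ensure that the 'Age' column is of the correct data type. Because the column contained a missing
value, Pandas loaded it as floating point.
# Convert 'Age' column to integer (astype(int) truncates, so the filled value 19.5 becomes 19)
df['Age'] = df['Age'].astype(int)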
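The integer conversion drops the fractional part instead of rounding, which matters here because the median used to fill the missing age is 19.5. The sketch below compares truncation with rounding first; the Series values mirror the filled 'Age' column, and the names truncated and rounded are illustrative.

```python
import pandas as pd

# The 'Age' column after the missing value was filled with the median 19.5
ages = pd.Series([18.0, 20.0, 19.5, 22.0, 19.0])

truncated = ages.astype(int)          # fractional part is dropped: 19.5 -> 19
rounded = ages.round().astype(int)    # rounds first (ties go to the even value): 19.5 -> 20

print(truncated.tolist())   # [18, 20, 19, 22, 19]
print(rounded.tolist())     # [18, 20, 20, 22, 19]
```

Which behaviour is appropriate depends on the data; for ages, truncation is a common convention, but it is worth choosing deliberately.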
Step 7: Data Summary
Get a summary of the cleaned data.
# Display summary statistics of the DataFrame
print(df.describe())
Output:
Age
count 5.000000
mean 19.600000
std 1.516575
min 18.000000
25% 19.000000
50% 19.000000
75% 20.000000
max 22.000000
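By default, describe() summarizes only the numeric columns, which is why only 'Age' appears above. As a sketch, passing include='all' also covers text columns such as 'Grade', and value_counts() tallies how often each grade occurs. The DataFrame below is rebuilt by hand to match the cleaned data, and the names cleaned, summary_all and grade_counts are illustrative.

```python
import pandas as pd

# The student data after cleaning (missing age -> 19, missing grade -> 'F')
cleaned = pd.DataFrame({
    'Name': ['John', 'Sara', 'Tom', 'Alice', 'Bob'],
    'Age': [18, 20, 19, 22, 19],
    'Grade': ['A', 'B', 'C', 'D', 'F'],
})

# Summary of every column, numeric or not
summary_all = cleaned.describe(include='all')
print(summary_all)

# How many students received each grade
grade_counts = cleaned['Grade'].value_counts()
print(grade_counts)
```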
