
L 32

Pandas:

Pandas is a powerful open-source Python library for data manipulation and analysis. The name
"Pandas" is derived from "Panel Data." It provides two primary data structures: Series and DataFrame.
Pandas also includes functions to read data from various sources, such as CSV, TXT, and Excel files
and SQL databases, and it simplifies data cleaning tasks such as handling missing data, removing
duplicates, and transforming data.
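As a rough illustration of the cleaning tasks mentioned above, the sketch below removes a duplicate row, fills a missing value, and derives a new column. The sample data, the column names, and the pass mark of 50 are made up for this example.

```python
import pandas as pd

# Hypothetical sample data: one duplicated row and one missing score
raw = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Ben', 'Cal'],
    'score': [90.0, 85.0, 85.0, None],
})

deduped = raw.drop_duplicates()            # removing duplicates
filled = deduped.fillna({'score': 0.0})    # handling missing data
# Transforming data: derive a new boolean column from an existing one
filled = filled.assign(passed=filled['score'] >= 50)

print(filled)
```

Each call returns a new DataFrame, so the original raw data stays unchanged.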

Series:

A Pandas Series is a one-dimensional array-like object that can hold any data type. It is similar to a
column in a table or a list in Python. If no index is supplied, a Series assigns a default integer index
(0, 1, 2, ...) to its elements.

Example:

import pandas as pd

# Creating a Series from a list

data_list = [10, 20, 30, 40, 50]

series_from_list = pd.Series(data_list)

# Creating a Series from a NumPy array

import numpy as np

data_array = np.array([1, 2, 3, 4, 5])

series_from_array = pd.Series(data_array)

# Creating a Series with a custom index (the dictionary keys become the index labels)

data_dict = {'a': 100, 'b': 200, 'c': 300, 'd': 400, 'e': 500}

series_custom_index = pd.Series(data_dict)

# Displaying the Series

print("Series from List:")

print(series_from_list)

print("\nSeries from Array:")

print(series_from_array)

print("\nSeries with Custom Index:")

print(series_custom_index)

Output (the trailing dtype line that pandas prints for each Series is omitted):

Series from List:

0 10

1 20

2 30

3 40

4 50

Series from Array:

0 1

1 2

2 3

3 4

4 5

Series with Custom Index:

a 100

b 200

c 300

d 400

e 500
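Once a Series exists, its index is what you use to get values back out. The following is a minimal sketch of label-based and position-based access; the variable names are chosen only for this illustration.

```python
import pandas as pd

s = pd.Series({'a': 100, 'b': 200, 'c': 300})

by_label = s['b']          # look up a value by its index label
by_position = s.iloc[0]    # look up a value by its integer position
subset = s[['a', 'c']]     # select several labels at once
doubled = s * 2            # arithmetic is applied element by element

print(by_label, by_position)
print(subset)
print(doubled)
```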

DataFrame:

A Pandas DataFrame is a two-dimensional labeled data structure with columns that can be of different
data types. It is similar to a spreadsheet or SQL table. You can create a DataFrame from dictionaries,
lists, or NumPy arrays.

Example:

import pandas as pd

# Creating a DataFrame from a dictionary of lists

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df_from_dict = pd.DataFrame(data)

# Creating a DataFrame from a NumPy array with custom index and columns

import numpy as np

data_array = np.random.randn(3, 3)

df_from_array = pd.DataFrame(data_array,
                             index=['Row1', 'Row2', 'Row3'],
                             columns=['Column1', 'Column2', 'Column3'])

# Displaying the DataFrames

print("DataFrame from Dictionary:")

print(df_from_dict)

print("\nDataFrame from NumPy Array:")

print(df_from_array)

Output:

DataFrame from Dictionary:

Name Age City

0 Alice 25 New York

1 Bob 30 San Francisco

2 Charlie 35 Los Angeles

DataFrame from NumPy Array:

Column1 Column2 Column3

Row1 0.643650 0.162998 1.635538

Row2 0.093151 0.156120 0.667025

Row3 0.152770 -0.430975 -1.330561

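The index parameter specifies custom row labels, and the columns parameter specifies custom
column names. Because np.random.randn() draws random values, the numbers in the second
DataFrame differ on every run.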

Reading Data:

Pandas provides reader functions for common data sources: pd.read_csv() for CSV files,
pd.read_excel() for Excel files, pd.read_json() for JSON files, and pd.read_sql() for SQL databases.

Example 1:

Let's say we have a CSV file named "example.csv" with the following content:

Name,Age,City

Alice,25,New York

Bob,30,San Francisco

Charlie,35,Los Angeles

import pandas as pd

# Reading from a CSV file

df = pd.read_csv('example.csv')

print(df)

Output:

Name Age City

0 Alice 25 New York

1 Bob 30 San Francisco

2 Charlie 35 Los Angeles

Example 2:

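Suppose you have a tab-separated values (TSV) file named "example.txt" with the following content
(the columns are separated by tab characters, which is why the space in "New York" does not split the field):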

Name Age City

Alice 25 New York

Bob 30 San Francisco

Charlie 35 Los Angeles

import pandas as pd

# Reading data from a TSV text file


df = pd.read_csv('example.txt', delimiter='\t')

# Displaying the DataFrame

print("DataFrame from TSV:")

print(df)

Output:

DataFrame from TSV:

Name Age City

0 Alice 25 New York

1 Bob 30 San Francisco

2 Charlie 35 Los Angeles
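If you want to try both examples without creating files on disk, one option is to pass an in-memory text buffer to pd.read_csv(). This is only a sketch: the strings below stand in for the contents of "example.csv" and "example.txt", shortened to two data rows.

```python
import io
import pandas as pd

# In-memory stand-ins for the two example files
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,San Francisco\n"
tsv_text = "Name\tAge\tCity\nAlice\t25\tNew York\nBob\t30\tSan Francisco\n"

df_csv = pd.read_csv(io.StringIO(csv_text))            # comma is the default separator
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')  # sep is equivalent to delimiter

print(df_csv)
print(df_tsv)
```

Both calls should produce the same table, since only the separator differs.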

L 33

ANALYZING AND CLEANING DATA

Imagine a dataset of students with information like their names, ages, grades, and any missing or
inconsistent data. Assuming you have a CSV file named student_data.csv with the following content:

Name,Age,Grade

John,18,A

Sara,20,B

Tom,,C

Alice,22,D

Bob,19,

Here's a step-by-step guide to analyzing and cleaning this dataset using Pandas:

Step 1: Import Pandas

import pandas as pd

Step 2: Load the Dataset

# Load the dataset into a Pandas DataFrame

df = pd.read_csv('student_data.csv')

Step 3: Explore the Dataset

# Display the first few rows of the DataFrame

print(df.head())

Output:

Name Age Grade

0 John 18.0 A

1 Sara 20.0 B

2 Tom NaN C

3 Alice 22.0 D

4 Bob 19.0 NaN

Step 4: Check for Missing Values

# Check for missing values in each column

print(df.isnull().sum())

Output:

Name 0

Age 1

Grade 1
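The per-column counts say how many values are missing, but not which rows are affected. A possible follow-up, sketched here with the same student data loaded from an in-memory string instead of the file, is to list the incomplete rows directly.

```python
import io
import pandas as pd

# Same content as student_data.csv, held in memory for this sketch
csv_text = "Name,Age,Grade\nJohn,18,A\nSara,20,B\nTom,,C\nAlice,22,D\nBob,19,\n"
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the rows that have at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)

# Column types and non-null counts in one overview
df.info()
```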

Step 5: Handle Missing Values

In this case, we'll fill missing ages with the median age and missing grades with a default value, say
'F'.

# Fill missing values in the 'Age' column with the median age

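# (assign the result back; calling fillna with inplace=True on a selected column
# triggers a warning in recent pandas versions and may not update df)

df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing values in the 'Grade' column with 'F'

df['Grade'] = df['Grade'].fillna('F')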
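The two fills can also be written as a single call by passing a dictionary of column names and fill values to DataFrame.fillna(). The sketch below shows that variant next to the alternative of dropping incomplete rows; which one is appropriate depends on the analysis, and the in-memory data is only a stand-in for student_data.csv.

```python
import io
import pandas as pd

csv_text = "Name,Age,Grade\nJohn,18,A\nSara,20,B\nTom,,C\nAlice,22,D\nBob,19,\n"
df = pd.read_csv(io.StringIO(csv_text))

# One fill value per column, applied in a single call
filled = df.fillna({'Age': df['Age'].median(), 'Grade': 'F'})

# Alternative: discard every row that has a missing value
dropped = df.dropna()

print(filled)
print(dropped)
```

Neither call modifies df itself; both return new DataFrames.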

Step 6: Convert Data Types

Ensure that the 'Age' column is of the correct data type. The missing age was filled with the
median, 19.5, and astype(int) truncates it to 19.

# Convert 'Age' column to integer

df['Age'] = df['Age'].astype(int)
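Converting floats to integers with astype(int) drops the fractional part, which matters here because the median of the known ages is 19.5. The sketch below contrasts truncation with rounding first, and also shows the nullable 'Int64' dtype as one way to keep integer ages without filling the gap at all. The Series are built by hand for this illustration.

```python
import pandas as pd

ages = pd.Series([18.0, 20.0, 19.5, 22.0, 19.0])

truncated = ages.astype(int)          # fractional part is dropped
rounded = ages.round().astype(int)    # round first, then convert

# Nullable integer dtype: integers plus a missing-value marker
nullable = pd.Series([18, None, 22], dtype='Int64')

print(truncated.tolist())
print(rounded.tolist())
print(nullable)
```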

Step 7: Data Summary

Get a summary of the cleaned data.

# Display summary statistics of the DataFrame

print(df.describe())

Output:

Age

count 5.000000

mean 19.600000

std 1.516575

min 18.000000

25% 19.000000

50% 19.000000

75% 20.000000

max 22.000000
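By default, describe() summarizes only numeric columns, which is why only Age appears. A possible extension, sketched with the cleaned values entered by hand, is to include the text columns and to count how often each grade occurs.

```python
import pandas as pd

# Cleaned student data from the steps above, entered by hand for this sketch
df = pd.DataFrame({
    'Name': ['John', 'Sara', 'Tom', 'Alice', 'Bob'],
    'Age': [18, 20, 19, 22, 19],
    'Grade': ['A', 'B', 'C', 'D', 'F'],
})

# Summary of every column, not only the numeric ones
print(df.describe(include='all'))

# Frequency of each grade
grade_counts = df['Grade'].value_counts()
print(grade_counts)

mean_age = df['Age'].mean()
print(mean_age)
```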
