Pandas:
Pandas is a powerful open-source Python library for data manipulation and analysis. The name
"Pandas" is derived from "panel data." It provides two primary data structures: Series and DataFrame.
Pandas also includes functions to read data from various sources, such as CSV, TXT, and Excel files
and SQL databases. It simplifies data cleaning tasks such as handling missing data, removing
duplicates, and transforming data.
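As a rough illustration of the three cleaning tasks just mentioned, the sketch below removes a duplicate row, fills a missing value, and transforms a column. The sample data and the column names name and score are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical sample data: one duplicated row and one missing score
raw = pd.DataFrame({
    "name": ["Ann", "Ben", "Ben", "Cid"],
    "score": [90.0, 85.0, 85.0, None],
})

deduped = raw.drop_duplicates()            # drop the repeated "Ben" row
filled = deduped.fillna({"score": 0.0})    # replace the missing score with 0.0
cleaned = filled.assign(name=filled["name"].str.upper())  # transform a column

print(cleaned)
```

Each step returns a new DataFrame, so the original raw data stays unchanged.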
Series:
A Pandas Series is a one-dimensional labeled array that can hold any data type. It is similar to a
column in a table or a list in Python. If no index is supplied, Pandas assigns a default integer index
starting at 0.
Example:
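import pandas as pd
import numpy as np

# Creating a Series from a Python list
data_list = [10, 20, 30, 40, 50]
series_from_list = pd.Series(data_list)

# Creating a Series from a NumPy array
data_array = np.array([1, 2, 3, 4, 5])
series_from_array = pd.Series(data_array)

# Creating a Series from a dictionary; the keys become the index
data_dict = {'a': 100, 'b': 200, 'c': 300, 'd': 400, 'e': 500}
series_custom_index = pd.Series(data_dict)

print(series_from_list)
print(series_from_array)
print(series_custom_index)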
Output:
0    10
1    20
2    30
3    40
4    50

0    1
1    2
2    3
3    4
4    5

a    100
b    200
c    300
d    400
e    500
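To make the role of the index more concrete, here is a small sketch contrasting the default integer index with labels passed through the index parameter of pd.Series. The labels "x", "y", and "z" are illustrative only.

```python
import pandas as pd

# Without an index argument, the labels default to 0, 1, 2, ...
default_idx = pd.Series([10, 20, 30])

# Passing index= supplies custom labels (illustrative labels)
custom_idx = pd.Series([10, 20, 30], index=["x", "y", "z"])

print(list(default_idx.index))  # [0, 1, 2]
print(custom_idx["y"])          # 20
```

Values can then be looked up by label rather than by position.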
DataFrame:
A Pandas DataFrame is a two-dimensional labeled data structure with columns that can be of different
data types. It is similar to a spreadsheet or SQL table. You can create a DataFrame from dictionaries,
lists, or NumPy arrays.
Example:
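import pandas as pd
import numpy as np

# Creating a DataFrame from a dictionary of columns
# (illustrative values; the original definition of data is missing)
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df_from_dict = pd.DataFrame(data)

# Creating a DataFrame from a NumPy array with custom index and columns
# (the row and column labels are illustrative)
data_array = np.random.randn(3, 3)
df_from_array = pd.DataFrame(data_array, index=['a', 'b', 'c'], columns=['X', 'Y', 'Z'])

print(df_from_dict)
print(df_from_array)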
Because np.random.randn() draws random values, the contents of the second DataFrame differ on every
run. The index parameter specifies custom row labels, and the columns parameter specifies custom
column names.
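The effect of the index and columns parameters is easier to see with fixed values. The following sketch uses a small hand-written array; the row and column labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# A fixed 2x3 array so the result is reproducible
values = np.array([[1, 2, 3], [4, 5, 6]])

# Illustrative row and column labels
df = pd.DataFrame(values, index=["row1", "row2"], columns=["A", "B", "C"])

print(df)
print(df.loc["row2", "B"])  # 5
```

With labels in place, df.loc selects a cell by row label and column name.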
Reading Data:
Pandas provides reader functions for common sources: pd.read_csv() for CSV files, pd.read_excel() for
Excel files, pd.read_json() for JSON files, and pd.read_sql() for SQL databases.
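The reader functions share a similar calling pattern. As a minimal sketch that avoids touching the file system, the example below feeds in-memory text to pd.read_csv() and pd.read_json() through io.StringIO; the sample text is hypothetical.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk
csv_text = "Name,Age\nAlice,25\nBob,30\n"
json_text = '[{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 30}]'

df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Both calls return a DataFrame with the same two columns, so later processing does not depend on the source format.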
Example 1:
Let's say we have a CSV file named "example.csv" with the following content:
Name,Age,City
Alice,25,New York
Bob,30,San Francisco
Charlie,35,Los Angeles
import pandas as pd
df = pd.read_csv('example.csv')
print(df)
Output:
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
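Example 2:
Suppose you have a tab-separated values (TSV) file named "example.txt". Pass sep='\t' to read_csv()
so that tabs are treated as the column delimiter:
import pandas as pd
df = pd.read_csv('example.txt', sep='\t')
print(df)
The result is a DataFrame with one column per tab-separated field.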
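To show why the separator matters for tab-separated data, the sketch below reads the same text twice, once with sep="\t" and once with the default comma separator. The tab-separated content is hypothetical and simply mirrors the columns of the CSV example.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for a TSV file
tsv_text = "Name\tAge\tCity\nAlice\t25\tNew York\nBob\t30\tSan Francisco\n"

# With sep="\t", each tab-separated field becomes its own column
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With the default comma separator, each line is read as a single field
df_wrong = pd.read_csv(io.StringIO(tsv_text))

print(df_tsv.shape)    # (2, 3)
print(df_wrong.shape)  # (2, 1)
```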
Consider a dataset of students that records their names, ages, and grades, with some values missing.
Assume you have a CSV file named student_data.csv with the following content:
Name,Age,Grade
John,18,A
Sara,20,B
Tom,,C
Alice,22,D
Bob,19,
Here's a step-by-step guide to analyzing and cleaning this dataset using Pandas:
Step 1: Import Pandas
import pandas as pd
Step 2: Load the Dataset
df = pd.read_csv('student_data.csv')
Step 3: Explore the Dataset
print(df.head())
Output:
    Name   Age  Grade
0   John  18.0      A
1   Sara  20.0      B
2    Tom   NaN      C
3  Alice  22.0      D
4    Bob  19.0    NaN
Step 4: Check for Missing Values
print(df.isnull().sum())
Output:
Name     0
Age      1
Grade    1
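Beyond counting missing values per column, it can help to see which rows are affected. The sketch below rebuilds the student data from an in-memory string (an assumption made so the example runs without student_data.csv on disk) and selects the rows that contain at least one missing value.

```python
import io
import pandas as pd

# Same content as student_data.csv, held in memory for this sketch
csv_text = "Name,Age,Grade\nJohn,18,A\nSara,20,B\nTom,,C\nAlice,22,D\nBob,19,\n"
df = pd.read_csv(io.StringIO(csv_text))

# Count of missing values in each column
missing_per_column = df.isnull().sum()

# Rows where any column is missing
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column.to_dict())        # {'Name': 0, 'Age': 1, 'Grade': 1}
print(rows_with_missing["Name"].tolist())  # ['Tom', 'Bob']
```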
Step 5: Handle Missing Values
In this case, we'll fill missing ages with the median age and missing grades with a default value, say
'F'.
# Fill missing values in the 'Age' column with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())
# Fill missing values in the 'Grade' column with the default grade 'F'
df['Grade'] = df['Grade'].fillna('F')
Step 6: Convert Data Types
# astype(int) truncates toward zero, so the filled median of 19.5 becomes 19
df['Age'] = df['Age'].astype(int)
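The order of the steps matters here: converting to int before the missing age is filled fails, and converting afterwards drops the fractional part of the median. The following sketch shows both effects on the Age values from the dataset; the variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Age column as loaded from the CSV: one missing value, stored as float
ages = pd.Series([18.0, 20.0, np.nan, 22.0, 19.0])

# Converting while NaN is still present raises an error
try:
    ages.astype(int)
    failed = False
except ValueError:
    failed = True

# The median ignores NaN: the median of 18, 19, 20, 22 is 19.5
filled = ages.fillna(ages.median())

# astype(int) truncates 19.5 to 19
as_int = filled.astype(int)

print(failed)           # True
print(as_int.tolist())  # [18, 20, 19, 22, 19]
```

If the fractional part should be rounded instead of truncated, calling round() before astype(int) is one option.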
Step 7: Summarize the Cleaned Data
print(df.describe())
Output:
             Age
count   5.000000
mean   19.600000
std     1.516575
min    18.000000
25%    19.000000
50%    19.000000
75%    20.000000
max    22.000000
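The whole walkthrough can be checked end to end with a short script. This is a minimal sketch, assuming the CSV content shown earlier and reading it from an in-memory string instead of student_data.csv so that it runs on its own.

```python
import io
import pandas as pd

# Same content as student_data.csv, held in memory for this sketch
csv_text = "Name,Age,Grade\nJohn,18,A\nSara,20,B\nTom,,C\nAlice,22,D\nBob,19,\n"
df = pd.read_csv(io.StringIO(csv_text))

# Fill missing values, then convert Age to integers
df["Age"] = df["Age"].fillna(df["Age"].median()).astype(int)
df["Grade"] = df["Grade"].fillna("F")

# Summary statistics for the cleaned Age column
summary = df["Age"].describe()

print(df["Age"].tolist())        # [18, 20, 19, 22, 19]
print(summary["mean"])           # 19.6
print(round(summary["std"], 6))  # 1.516575
```

The ages sum to 98 over five rows, which gives the mean of 19.6, and describe() reports the sample standard deviation (dividing by n - 1).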