AI Workflow
Data Preparation with
NumPy and Pandas
Mr Hew Ka Kian
hew_ka_kian@rp.edu.sg
OFFICIAL (CLOSED) \ NON-SENSITIVE
Series vs DataFrame
Pandas Series
• A Series is a one-dimensional data structure. It can hold any data type, such as integer,
float, or string. It is useful when you want to perform a computation on, or return, a
one-dimensional array. A Series, by definition, cannot have multiple columns. For multiple
columns, use the DataFrame structure.
Pandas DataFrame
• A Pandas DataFrame is a two-dimensional, labelled data structure whose columns can have
different types.
• A DataFrame is a standard way to store data in a tabular format, with rows to store the
information and columns to name the information.
• For instance, "price" can be the name of a column and 2, 3, 4 can be the price values.
Source: https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-for-data-science-in-python
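The difference between the two structures can be seen in a minimal sketch like the one below. It reuses the price values 2, 3, 4 from the slide; the "item" column and its values are made up purely for illustration.

```python
import pandas as pd

# A Series: one-dimensional, a single column of values plus an index.
prices = pd.Series([2, 3, 4], name="price")

# A DataFrame: two-dimensional, and each column may have its own dtype.
# The "item" column is a hypothetical addition for this illustration.
items = pd.DataFrame({"item": ["pen", "ruler", "notebook"], "price": [2, 3, 4]})

print(prices.ndim)   # number of dimensions of the Series
print(items.ndim)    # number of dimensions of the DataFrame
print(items.dtypes)  # one dtype per column
```

Selecting a single column, for example items["price"], gives back a Series.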
Import Pandas
• Just like NumPy, we will only learn the basics of Pandas. To use it, we import
the NumPy and Pandas libraries as shown in the code below. We import NumPy as
well because Pandas is often used together with it.
import numpy as np
import pandas as pd
Broadcast
• Like NumPy, Pandas can perform an operation between a Series and a scalar value via
broadcasting
• The example below increases everyone's age by 5
ages = {'Sammy':5,'Frank':10,'Spike':7}
ppl = pd.Series(ages)
ppl = ppl + 5
Source: https://learncsc.udemy.com/course/python-for-machine-learning-data-science-masterclass/learn/lecture/17770126#overview
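As a runnable sketch of the broadcast example, the snippet below keeps the original Series unchanged and stores the results under new names. The names older and doubled are introduced here only for illustration.

```python
import pandas as pd

ages = {'Sammy': 5, 'Frank': 10, 'Spike': 7}
ppl = pd.Series(ages)

# The scalar is applied to every element; the index labels are kept.
older = ppl + 5
# Other arithmetic operators broadcast in the same way.
doubled = ppl * 2

print(older)
# Sammy    10
# Frank    15
# Spike    12
# dtype: int64
```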
sales_Q1 = pd.Series({'Japan':80,'China':450,'India':200,'US':250})
sales_Q2 = pd.Series({'Brazil':180,'China':300,'India':340,'US':390})
sales_half_yr = sales_Q1 + sales_Q2
• Result below (the Brazil and Japan values are NaN because each country is missing from one of the Series):
Brazil NaN
China 750.0
India 540.0
Japan NaN
US 640.0
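If a country missing from one quarter should count as zero sales rather than produce NaN, one option is the Series.add method with its fill_value parameter. This is a sketch of that alternative, not part of the original slide; whether zero is the right substitute depends on the data.

```python
import pandas as pd

sales_Q1 = pd.Series({'Japan': 80, 'China': 450, 'India': 200, 'US': 250})
sales_Q2 = pd.Series({'Brazil': 180, 'China': 300, 'India': 340, 'US': 390})

# Labels are aligned first; a label missing from one Series is treated as 0.
sales_half_yr = sales_Q1.add(sales_Q2, fill_value=0)

print(sales_half_yr)
# Brazil    180.0
# China     750.0
# India     540.0
# Japan      80.0
# US        640.0
# dtype: float64
```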
DataFrames
• Basically, a DataFrame is made up of several Series that share the same index.
[Figure: a DataFrame with its columns and index labelled]
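The idea that a DataFrame is a set of Series sharing one index can be sketched as follows. The ages come from the earlier broadcast example; the weights Series and its values are hypothetical.

```python
import pandas as pd

# Two Series with the same index labels.
ages = pd.Series({'Sammy': 5, 'Frank': 10, 'Spike': 7})
weights = pd.Series({'Sammy': 20, 'Frank': 32, 'Spike': 25})  # hypothetical values

# Each Series becomes one column; the shared labels become the row index.
df = pd.DataFrame({'age': ages, 'weight': weights})

print(df.columns)  # the column names
print(df.index)    # the row labels
print(df['age'])   # selecting one column returns a Series again
```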
Conditional Filtering
To filter a DataFrame:
• First, create the boolean Series that matches the condition
bool_series = df['total_bill'] > 30
• AND logic: both conditions must be True
bool_series = (df['total_bill'] > 30) & (df['gender'] == 'M')
• OR logic: at least one condition must be True
bool_series = (df['total_bill'] > 30) | (df['gender'] == 'M')
• If day is either 'Sat' or 'Sun'
bool_series = df['day'].isin(['Sat','Sun'])
• Finally, index the DataFrame with the boolean Series; every row whose position holds False
is filtered out
df[bool_series]
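The filtering steps above can be tried on a tiny made-up table. The column names follow the slide, but the four rows are invented for this sketch; the ~ operator for negation is an addition not covered on the slide.

```python
import pandas as pd

# Hypothetical data using the slide's column names.
df = pd.DataFrame({
    'total_bill': [35.0, 12.5, 48.2, 31.0],
    'gender': ['M', 'F', 'F', 'M'],
    'day': ['Sat', 'Sun', 'Thur', 'Fri'],
})

big_bill = df['total_bill'] > 30                  # boolean Series, one value per row

both = df[big_bill & (df['gender'] == 'M')]       # AND: rows 0 and 3
either = df[big_bill | (df['gender'] == 'M')]     # OR: rows 0, 2 and 3
weekend = df[df['day'].isin(['Sat', 'Sun'])]      # rows 0 and 1
not_weekend = df[~df['day'].isin(['Sat', 'Sun'])] # ~ negates the mask: rows 2 and 3

print(both)
```

The parentheses around each comparison matter, because & and | bind more tightly than > and == in Python.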
[Figure: Series output with the NaN entry at index 2 dropped (dtype: float64). Source: w3resource.com]
Student Activity
1. How do you access the 2nd item in a Pandas Series?
myser = pd.Series([1965,1965,1957])
myser[1]
2. How do you access the 3rd item using a named index?
myser = pd.Series([1965,1965,1957], index=['Singapore','Maldives','Malaysia'])
myser.Malaysia # or myser['Malaysia']
3. Can you create one more column named 'tip_per_person' that is
actually the tip divided by the size and rounded to 2 decimals?
df['tip_per_person']=np.round(df['tip'] / df['size'],2)
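Activity 3 can be checked on a small made-up frame, as in the sketch below. Only the column names tip and size come from the activity; the values are invented. The sketch also shows the equivalent Series.round method.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names come from the activity.
df = pd.DataFrame({'tip': [3.0, 5.0, 2.0], 'size': [2, 3, 3]})

# Element-wise division, then rounding to 2 decimals.
df['tip_per_person'] = np.round(df['tip'] / df['size'], 2)

# The same result using the Series method instead of NumPy.
same = (df['tip'] / df['size']).round(2)

print(df['tip_per_person'].tolist())  # [1.5, 1.67, 0.67]
```

Note that the column must be selected as df['size'] with brackets: df.size is a built-in attribute that returns the number of elements in the DataFrame.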
Student Activity
4. Can you grab the 200th row and beyond?
df.iloc[199:]
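Because iloc counts positions from 0, the 200th row sits at position 199. A minimal sketch with a hypothetical 250-row frame (the value column is made up) confirms the slice:

```python
import pandas as pd

# Hypothetical frame with 250 rows at positions 0..249.
df = pd.DataFrame({'value': range(1000, 1250)})

# Position 199 is the 200th row; the slice runs to the end of the frame.
tail = df.iloc[199:]

print(len(tail))       # 51 rows: positions 199..249
print(tail.index[0])   # 199
```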
print(diamonds.shape)
3. Write a Pandas program to find the data type of each column of the diamonds
DataFrame.
print(diamonds.dtypes)
diamonds['quality-color'] = diamonds['cut'] + ' ' + diamonds['color']
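The string concatenation above works element by element, just like arithmetic on numeric columns. The sketch below uses a two-row stand-in for the diamonds data, since the real dataset is not loaded here; the rows are made up.

```python
import pandas as pd

# Made-up stand-in for the diamonds dataset.
diamonds = pd.DataFrame({'cut': ['Ideal', 'Premium'], 'color': ['E', 'I']})

# + on two string columns joins the values row by row;
# the scalar ' ' is broadcast to every row.
diamonds['quality-color'] = diamonds['cut'] + ' ' + diamonds['color']

print(diamonds['quality-color'].tolist())  # ['Ideal E', 'Premium I']
```

Bracket access is used for the new column because a name containing a hyphen cannot be reached with dot notation.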
result = diamonds.sort_values('color')
print("Original Dataframe:")
print(diamonds)
print("\nAfter cleaning:")
diamonds = diamonds.dropna()
print(diamonds)
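To see what dropna removes, it can help to count the missing values first. The sketch below uses a small made-up frame rather than the real diamonds data; the subset parameter shown at the end limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column.
frame = pd.DataFrame({
    'carat': [0.23, np.nan, 0.31],
    'cut': ['Ideal', 'Premium', None],
})

print(frame.isna().sum())   # missing values per column

# Drop every row that has at least one missing value.
cleaned = frame.dropna()

# Drop a row only when 'carat' is missing.
partial = frame.dropna(subset=['carat'])

print(cleaned.index.tolist())  # [0]
print(partial.index.tolist())  # [0, 2]
```

dropna returns a new DataFrame, which is why the result has to be assigned to a name.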