
OFFICIAL (CLOSED) \ NON-SENSITIVE

AI Workflow
Data Preparation with
NumPy and Pandas
Mr Hew Ka Kian
hew_ka_kian@rp.edu.sg

Series vs DataFrame
Pandas Series
• A Series is a one-dimensional data structure. It can hold any data type, such as integer,
float, or string. It is useful when you want to perform a computation on, or return, a
one-dimensional array. A Series, by definition, cannot have multiple columns. For multiple
columns, use the DataFrame structure.

Pandas DataFrame
• A Pandas DataFrame is a two-dimensional, labelled data structure whose columns can have
different types.
• A DataFrame is a standard way to store data in a tabular format, with rows to store the
information and columns to name the information.
• For instance, the price can be the name of a column and 2,3,4 can be the price values.
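The difference can be sketched as follows. This is a minimal illustration only; the "item" column and its values are made up for the example.

```python
import pandas as pd

# A Series: a single labelled, one-dimensional column of values.
price = pd.Series([2, 3, 4], name="price")

# A DataFrame: several columns sharing one index.
# The "item" column is hypothetical, added only to show mixed column types.
df = pd.DataFrame({"item": ["pen", "cup", "bag"], "price": [2, 3, 4]})

print(price.ndim)  # number of dimensions of the Series
print(df.ndim)     # number of dimensions of the DataFrame
print(df.dtypes)   # each column has its own data type
```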

Image Source: https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/



Pandas Cheat Sheet



Source: https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-for-data-science-in-python

Import Pandas
• Just like NumPy, we will only learn the basics of Pandas. To use it, we have to
import the NumPy and Pandas libraries like the following code does. We need to
import NumPy as Pandas is often used together with NumPy.

import numpy as np
import pandas as pd

• After running the code, we can refer to pandas as pd.


Creating a Series from List


• Create from a Python List
mydata = [1965,1965,1957]
myser = pd.Series(data=mydata)
• Get value based on number index
myser[0]
• Create from a Python List with named index. Index allows us to grab the item
using the named index
mydata = [1965,1965,1957]
myindex = ['Singapore','Maldives','Malaysia']
myser = pd.Series(data=mydata,index=myindex)
• Get value based on named index
myser['Malaysia']
myser.Malaysia
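As a small sketch of the two ways of reaching an item, the explicit accessors loc (by label) and iloc (by position) can be used on the same Series. The variable names by_label and by_position are chosen here for illustration.

```python
import pandas as pd

mydata = [1965, 1965, 1957]
myindex = ['Singapore', 'Maldives', 'Malaysia']
myser = pd.Series(data=mydata, index=myindex)

# loc looks an item up by its named index (label).
by_label = myser.loc['Malaysia']

# iloc looks an item up by its position, counting from 0.
by_position = myser.iloc[2]

print(by_label, by_position)
```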

Creating a Series from Dictionary


• Dictionary has key and corresponding value
• Create from a Dictionary. Note that a dictionary is written with curly braces, like
{ key1:value1, key2:value2 }, which looks similar to JSON
ages = {'Sammy':5,'Frank':10,'Spike':7}
ppl = pd.Series(ages)
• Get value using the key or the position
ppl['Sammy']
ppl.iloc[0] # by position; plain ppl[0] on a named index is deprecated in recent Pandas versions
• Grab the keys
ppl.keys()

Image Source: https://programmathically.com/dictionaries-tuples-and-sets-in-python/



Broadcast
• As in NumPy, we can perform an operation between a Series and a scalar value via
broadcasting
• Example below increases everyone’s age by 5
ages = {'Sammy':5,'Frank':10,'Spike':7}
ppl = pd.Series(ages)
ppl = ppl + 5
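A runnable version of the example above, as a sketch. The names older and doubled are introduced here so that the original Series is left unchanged for comparison.

```python
import pandas as pd

ages = {'Sammy': 5, 'Frank': 10, 'Spike': 7}
ppl = pd.Series(ages)

# The scalar is applied to every element; the index is preserved.
older = ppl + 5
doubled = ppl * 2

print(older)
print(doubled)
```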

Source: https://learncsc.udemy.com/course/python-for-machine-learning-data-science-masterclass/learn/lecture/17770126#overview

Operations between Series


• Operations between two Series are applied to the values that share the same key

sales_Q1 = pd.Series({'Japan':80,'China':450,'India':200,'US':250})
sales_Q2 = pd.Series({'Brazil':180,'China':300,'India':340,'US':390})
sales_half_yr = sales_Q1 + sales_Q2

• The result is below (Brazil and Japan are NaN because each is missing from one of the Series)

Brazil NaN
China 750.0
India 540.0
Japan NaN
US 640.0
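If the NaN values are not wanted, one option is the add() method with fill_value, which treats a key missing from one Series as the given value before adding. This is a sketch of one possible approach, not necessarily the one intended in the lesson.

```python
import pandas as pd

sales_Q1 = pd.Series({'Japan': 80, 'China': 450, 'India': 200, 'US': 250})
sales_Q2 = pd.Series({'Brazil': 180, 'China': 300, 'India': 340, 'US': 390})

# A key missing from one Series is treated as 0 instead of producing NaN.
sales_half_yr = sales_Q1.add(sales_Q2, fill_value=0)

print(sales_half_yr)
```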


DataFrames
• Basically, a DataFrame is made up of a few Series that share the same index.

[Figure: a DataFrame with its columns and index labelled]


Create DataFrame from Python Objects


mydata = np.random.randint(0,101,(4,3))
myindex = ['CA','NY','AZ','TX']
mycolumns = ['Jan','Feb','Mar']
df = pd.DataFrame(data=mydata,index=myindex,columns=mycolumns)
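A self-contained version of the code above, as a sketch. The random seed is an assumption added here so that the same numbers appear on every run; it is not part of the original example.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # assumed seed, only to make the output repeatable

# 4 rows x 3 columns of random integers from 0 to 100 inclusive.
mydata = np.random.randint(0, 101, (4, 3))
myindex = ['CA', 'NY', 'AZ', 'TX']
mycolumns = ['Jan', 'Feb', 'Mar']

df = pd.DataFrame(data=mydata, index=myindex, columns=mycolumns)
print(df)
```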


Getting the Properties of the DataFrame


• The column names
df.columns
• Index
df.index
• Number of rows
len(df)
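The properties above can be tried on a small DataFrame. The values below are made up for illustration; shape is an additional property that gives the number of rows and columns together.

```python
import pandas as pd

# Hypothetical monthly figures for four states.
df = pd.DataFrame(
    {'Jan': [10, 20, 30, 40], 'Feb': [11, 21, 31, 41], 'Mar': [12, 22, 32, 42]},
    index=['CA', 'NY', 'AZ', 'TX'],
)

column_names = list(df.columns)  # the column labels
row_labels = list(df.index)      # the index labels
n_rows = len(df)                 # number of rows
n_rows_cols = df.shape           # (rows, columns)

print(column_names, row_labels, n_rows, n_rows_cols)
```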


Selection and Indexing


• First n rows or first 5 rows if n is not specified
df.head(n)
df.head()
• Last n rows or last 5 rows if n is not specified
df.tail(n)
• Get specified columns only (note the double square brackets: a list of column names)
df[[column1, column2]]
• By default, a long DataFrame is displayed truncated to 10 rows. To show all rows,
use the following code
pd.set_option('display.max_rows', None)
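A short sketch of these selections on a made-up DataFrame; the column names name, score and group are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'name': list('abcdefgh'),
    'score': range(8),
    'group': ['x', 'y'] * 4,
})

first3 = df.head(3)             # first 3 rows
last2 = df.tail(2)              # last 2 rows
subset = df[['name', 'score']]  # a list of column names selects several columns

print(first3)
print(last2)
print(subset.columns)
```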


Selection and Indexing


• Create a new column or modify an existing column
df[new_column] = value
• Remove a column (drop returns a new DataFrame, so assign the result to keep the change)
df = df.drop(column_name, axis=1)
• We can grab a row by its default index, a running number starting from 0, or we can set a column
as the index column and grab a row by the new named index. Note that we CANNOT access a row with
square brackets alone, as we did with a Series; we need to use iloc[] or loc[]
df = df.set_index('Payment ID')
df.iloc[4] # to access with number index
df.loc['Albert'] # to access with named index and is case-sensitive
• Remove a row
df.drop(0, axis=0) # only if no named index
df.drop('Albert',axis=0)
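The operations on this slide can be sketched on a tiny made-up table. The 'Name', 'Amount' and 'Paid' columns are assumptions for illustration and are not the lesson's dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Albert', 'Betty', 'Carl'],
    'Amount': [12.5, 8.0, 20.0],
})

df['Paid'] = True          # create a new column; every row gets the same value

row_by_pos = df.iloc[1]    # second row, by position

df = df.set_index('Name')  # use the 'Name' column as the named index
albert = df.loc['Albert']  # row by named index (case-sensitive)

# drop returns a new DataFrame; df itself is unchanged unless reassigned.
no_paid = df.drop('Paid', axis=1)      # remove a column
no_albert = df.drop('Albert', axis=0)  # remove a row

print(no_paid)
print(no_albert)
```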


Conditional Filtering
To filter a DataFrame
• 1st create the bool series that matches the condition
bool_series = df['total_bill'] > 30
• AND logic - both conditions must be True
bool_series = (df['total_bill'] > 30) & (df['gender'] == 'M')
• OR logic - at least one condition needs to be True
bool_series = (df['total_bill'] > 30) | (df['gender'] == 'M')
• If day is either 'Sat' or 'Sun'
bool_series = df['day'].isin(['Sat','Sun'])
• Finally, use the bool series to index the DataFrame; rows where the bool series is False
will be filtered out
df[bool_series]
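The filtering steps above can be tried on a small made-up table. The four rows below are invented for illustration and only borrow the column names used on this slide.

```python
import pandas as pd

df = pd.DataFrame({
    'total_bill': [35.0, 12.0, 40.0, 31.0],
    'gender': ['M', 'F', 'F', 'M'],
    'day': ['Sat', 'Sun', 'Thur', 'Fri'],
})

big = df['total_bill'] > 30                        # one True/False per row

big_and_male = df[big & (df['gender'] == 'M')]     # both conditions hold
big_or_male = df[big | (df['gender'] == 'M')]      # at least one holds
weekend = df[df['day'].isin(['Sat', 'Sun'])]       # day is in the given list

print(big_and_male)
print(big_or_male)
print(weekend)
```

The parentheses around each condition are needed because & and | bind more tightly than the comparison operators.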

Image Source: https://medium.datadriveninvestor.com/pandas-data-frame-101-filtering-data-loc-iloc-939301489088



Sorting and Max and Min Index


• To sort by one or more columns, ascending can be True or False
df.sort_values([column1,column2],ascending=False)
• Find max value of a column
df[column_name].max()
• Find index of the row containing the max value of the column
df[column_name].idxmax()
• Find index of the row containing the min value of the column
df[column_name].idxmin()
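A sketch of sorting and of finding the max and min index on invented data. The column names city, sales and cost are hypothetical; ascending can also be a list giving one direction per sort column.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['CA', 'NY', 'AZ', 'TX'],
    'sales': [30, 80, 80, 10],
    'cost': [5, 9, 7, 2],
})

# Sort by sales (largest first), breaking ties by cost (smallest first).
by_sales = df.sort_values(['sales', 'cost'], ascending=[False, True])

top_sales = df['sales'].max()     # the largest value
top_row = df['sales'].idxmax()    # index of the first row holding the largest value
bottom_row = df['sales'].idxmin() # index of the row holding the smallest value

print(by_sales)
print(df.loc[top_row])
```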


Dropping Rows with Missing Data


• Often the dataset we get is not complete as some rows contain missing data.
• In the previous lesson, we removed the rows with missing data by using
dropna()
component in Azure.
• To get Pandas to drop the rows with missing data:
df = df.dropna()
[Figure: a Series with values 2.0, 3.0 and NaN (dtype: float64); the NaN row is dropped]

Image Source: https://www.w3resource.com/pandas/series/series-dropna.php
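A minimal sketch of dropping rows with missing data, using an invented two-column DataFrame. The subset parameter shown at the end is an extra option that limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [2.0, 3.0, np.nan],
    'b': ['x', None, 'z'],
})

missing_per_column = df.isna().sum()  # count the missing values in each column

cleaned = df.dropna()                 # drop a row if ANY value in it is missing
only_a = df.dropna(subset=['a'])      # drop a row only if column 'a' is missing

print(missing_per_column)
print(cleaned)
print(only_a)
```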



Converting Column to Datetime


• Very often, we deal with a Pandas DataFrame that has datetime as one of its columns,
meaning the data is timestamped (each row corresponds to a certain datetime).
• Such data is therefore called a time series
• The datetime column is often stored in string format, so we cannot
perform datetime operations on it.
• Pandas can convert it into datetime format easily
df[column_name] = pd.to_datetime(df[column_name])
• Pandas can even attempt to convert the column to datetime format when reading
from a CSV file
df = pd.read_csv('filename.csv', parse_dates=[datetime_column_index])
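Both conversions can be sketched without a real file by reading CSV text from memory. The column names date and amount and the two rows are made up; io.StringIO stands in for 'filename.csv'.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in a string instead of a file.
csv_text = "date,amount\n2023-01-15,10\n2023-02-20,20\n"

# Option 1: read first, then convert the string column to datetime.
df = pd.read_csv(io.StringIO(csv_text))
df['date'] = pd.to_datetime(df['date'])

# Once converted, datetime operations are available through .dt
months = df['date'].dt.month

# Option 2: ask read_csv to parse the column while reading.
# parse_dates accepts column names as well as column positions.
df2 = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(df.dtypes)
print(months)
```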


Student Activity
1. How to access the 2nd item in a Pandas series
myser = pd.Series([1965,1965,1957])
myser[1]
2. How to access the 3rd item using a named index?
myser = pd.Series([1965,1965,1957], index=['Singapore','Maldives','Malaysia'])
myser.Malaysia # or myser['Malaysia']
3. Can you create one more column named 'tip_per_person' that is
actually the tip divided by the size and rounded to 2 decimals?
df['tip_per_person']=np.round(df['tip'] / df['size'],2)


Student Activity
4. Can you grab the 200th row and beyond?

df.iloc[199:]

Why does it start at 199?


What does empty end index mean?


Putting together what you have learnt


2. Write a Pandas program to find the number of rows and columns of the
diamonds Dataframe.

print(diamonds.shape)

3. Write a Pandas program to find the data type of each column of the diamonds
Dataframe.

print(diamonds.dtypes)

Putting together what you have learnt


4. Write a Pandas program to create a new 'quality-color' column that
concatenates the data from the 'cut' column with the data from the 'color'
column.
E.g., if 'cut' is 'Ideal' and 'color' is 'E', 'quality-color' is 'Ideal E'

diamonds['quality-color'] = diamonds['cut'] + ' ' + diamonds['color']

Putting together what you have learnt


5. Write a Pandas program to remove the 'cut' column of the diamonds
Dataframe.

diamonds = diamonds.drop('cut', axis=1)



Putting together what you have learnt


6. Write a Pandas program to sort by 'color' in ascending order.

result = diamonds.sort_values('color')

Putting together what you have learnt


7. Write a Pandas program to drop a row if any or all values in the row are
missing. Print the DataFrame before and after dropping the rows with missing values and
compare.

print("Original Dataframe:")
print(diamonds)
print("\nAfter cleaning:")
diamonds = diamonds.dropna()
print(diamonds)

Putting together what you have learnt


8. Write the Pandas program to grab the data of the diamonds where length>5,
width>5 and depth>5.

result = diamonds[(diamonds.x > 5) & (diamonds.y > 5) & (diamonds.z > 5)]
print(result)
