AI Workflow Data Preparation with NumPy and Pandas: Mr Hew Ka Kian (hew_ka_kian@rp.edu.sg)

OFFICIAL (CLOSED) \ NON-SENSITIVE
AI Workflow
Data Preparation with
NumPy and Pandas
Mr Hew Ka Kian
hew_ka_kian@rp.edu.sg
Series vs DataFrame
Pandas Series
• A Series is a one-dimensional labelled data structure. It can hold values of any data type,
such as integer, float, or string. It is useful when you want to perform a computation on, or
return, a one-dimensional array. A Series, by definition, cannot have multiple columns; for
that, use the DataFrame structure.
Pandas DataFrame
• A Pandas DataFrame is a two-dimensional labelled data structure whose columns can have
different types.
• A DataFrame is a standard way to store data in a tabular format, with rows to store the
information and columns to name the information.
• For instance, the price can be the name of a column and 2,3,4 can be the price values.
Image Source: https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/

Pandas Cheat Sheet

Source: https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-for-data-science-in-python
Import Pandas
• Just like NumPy, we will only learn the basics of Pandas. To use it, we have to
import the NumPy and Pandas libraries like the following code does. We need to
import NumPy as Pandas is often used together with NumPy.
import numpy as np
import pandas as pd
• After running the code, we can refer to Pandas as pd (and to NumPy as np).
Creating a Series from List

• Create from a Python List
mydata = [1965,1965,1957]
myser = pd.Series(data=mydata)
• Get value based on number index
myser[0]
• Create from a Python list with a named index. The index lets us grab an item
by its label
mydata = [1965,1965,1957]
myindex = ['Singapore','Maldives','Malaysia']
myser = pd.Series(data=mydata,index=myindex)
• Get value based on named index
myser['Malaysia']
myser.Malaysia
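The lookups above can be tried end to end with the sketch below. It reuses the slide's data; the extra iloc[] lookup is an addition for comparison, not part of the original slide, and shows position-based access that does not depend on the labels.

```python
import pandas as pd

# Data and labels taken from the slide.
mydata = [1965, 1965, 1957]
myindex = ['Singapore', 'Maldives', 'Malaysia']
myser = pd.Series(data=mydata, index=myindex)

by_label = myser['Malaysia']    # label-based lookup
by_attr = myser.Malaysia        # attribute access works when the label is a valid Python name
by_position = myser.iloc[2]     # position-based lookup (third item), added for comparison

print(by_label, by_attr, by_position)
```

Attribute access is convenient but fails for labels containing spaces or clashing with Series method names, so the square-bracket form is the safer habit.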
Creating a Series from Dictionary

• A dictionary has keys and corresponding values
• Create from a dictionary. Note that a Python dictionary literal looks like JSON, with
curly braces like { key1:value1, key2:value2 }
ages = {'Sammy':5,'Frank':10,'Spike':7}
ppl = pd.Series(ages)
• Get a value using the key, or by position with iloc[]
ppl['Sammy']
ppl.iloc[0]
• Grab the keys
ppl.keys()
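A minimal sketch of the dictionary case, using the slide's data. It uses iloc[] for the positional lookup, on the understanding that looking up a position with plain square brackets on a labelled Series is deprecated in recent pandas releases.

```python
import pandas as pd

ages = {'Sammy': 5, 'Frank': 10, 'Spike': 7}
ppl = pd.Series(ages)           # dictionary keys become the index labels

first_by_label = ppl['Sammy']   # lookup by key
first_by_position = ppl.iloc[0] # lookup by position
labels = list(ppl.keys())       # keys() returns the index

print(first_by_label, first_by_position, labels)
```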
Image Source: https://programmathically.com/dictionaries-tuples-and-sets-in-python/

Broadcast
• Like NumPy, Pandas can perform an operation between a Series and a scalar value via
broadcasting
• The example below increases everyone’s age by 5
ages = {'Sammy':5,'Frank':10,'Spike':7}
ppl = pd.Series(ages)
ppl = ppl + 5
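As a quick check of what broadcasting does, the sketch below applies a scalar to every element and confirms the labels are untouched. The variable names older and doubled are illustrative only.

```python
import pandas as pd

ages = {'Sammy': 5, 'Frank': 10, 'Spike': 7}
ppl = pd.Series(ages)

older = ppl + 5      # the scalar 5 is applied to every value
doubled = ppl * 2    # other arithmetic operators broadcast the same way

print(older.to_dict())
print(doubled.to_dict())
```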
Source: https://learncsc.udemy.com/course/python-for-machine-learning-data-science-masterclass/learn/lecture/17770126#overview
Operations between Series

• Operations can be done between Series; values are matched by key (index label)
sales_Q1 = pd.Series({'Japan':80,'China':450,'India':200,'US':250})
sales_Q2 = pd.Series({'Brazil':180,'China':300,'India':340,'US':390})
sales_half_yr = sales_Q1 + sales_Q2
• The result is below (the Brazil and Japan values are NaN because each is missing from one of the Series):
Brazil      NaN
China     750.0
India     540.0
Japan       NaN
US        640.0
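If treating a missing quarter as zero sales is acceptable (an assumption about the data, not something the slide states), Series.add() with fill_value=0 avoids the NaN results. A short sketch comparing both approaches:

```python
import pandas as pd

sales_Q1 = pd.Series({'Japan': 80, 'China': 450, 'India': 200, 'US': 250})
sales_Q2 = pd.Series({'Brazil': 180, 'China': 300, 'India': 340, 'US': 390})

plain_sum = sales_Q1 + sales_Q2                     # NaN where a key is missing on one side
filled_sum = sales_Q1.add(sales_Q2, fill_value=0)   # a missing side is treated as 0

print(plain_sum)
print(filled_sum)
```

Whether zero is the right substitute depends on what a missing entry means; if it means "unknown" rather than "no sales", keeping the NaN is the more honest result.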
DataFrames
• Basically, a DataFrame is made up of several Series that share the same index.
(Figure: a DataFrame with its columns and index labelled)
Create DataFrame from Python Objects

mydata = np.random.randint(0,101,(4,3))
myindex = ['CA','NY','AZ','TX']
mycolumns = ['Jan','Feb','Mar']
df = pd.DataFrame(data=mydata,index=myindex,columns=mycolumns)
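The snippet above can be run as the self-contained sketch below. The seed value is an arbitrary addition so the random numbers repeat between runs; the last line illustrates that each column is a Series sharing the DataFrame's index.

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # arbitrary seed, only to make the output repeatable

mydata = np.random.randint(0, 101, (4, 3))   # 4 rows x 3 columns of integers in 0..100
myindex = ['CA', 'NY', 'AZ', 'TX']
mycolumns = ['Jan', 'Feb', 'Mar']
df = pd.DataFrame(data=mydata, index=myindex, columns=mycolumns)

jan = df['Jan']  # a single column is a Series with the same index as df

print(df)
print(type(jan))
```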
Getting the Properties of the DataFrame

• The column names
df.columns
• Index
df.index
• Number of rows
len(df)
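A small sketch of these properties on a hypothetical DataFrame (the values are made up for illustration). It also shows df.shape, which gives the number of rows and columns together.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {'Jan': [10, 20, 30, 40], 'Feb': [5, 6, 7, 8]},
    index=['CA', 'NY', 'AZ', 'TX'],
)

column_names = list(df.columns)   # column labels
row_labels = list(df.index)       # index labels
n_rows = len(df)                  # number of rows
shape = df.shape                  # (rows, columns)

print(column_names, row_labels, n_rows, shape)
```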
Selection and Indexing

• First n rows or first 5 rows if n is not specified
df.head(n)
df.head()
• Last n rows or last 5 rows if n is not specified
df.tail(n)
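• Get specified columns only (note the double square brackets: a list of column names
inside the indexing brackets)
df[[column1, column2]]
• By default, Pandas truncates the display of long DataFrames: once a DataFrame has more
rows than display.max_rows (60 by default), only 10 rows are shown. To show all rows,
use the following code
pd.set_option('display.max_rows', None)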
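The selection calls above can be sketched on a hypothetical 100-row DataFrame as follows. The example uses pd.option_context, which changes the display option only inside the with block; this is a substitute for the permanent pd.set_option call, chosen so the sketch leaves global settings alone.

```python
import pandas as pd

# Hypothetical data: 100 rows, three columns.
df = pd.DataFrame({'a': range(100), 'b': range(100, 200), 'c': range(200, 300)})

first_three = df.head(3)     # first 3 rows
last_five = df.tail()        # last 5 rows (the default)
subset = df[['a', 'c']]      # a list of column names selects several columns

short_text = repr(df)        # truncated display under the default options
with pd.option_context('display.max_rows', None):
    full_text = repr(df)     # every row is rendered inside this block

print(len(short_text.splitlines()), len(full_text.splitlines()))
```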
Selection and Indexing

• Create a new column or modify an existing column
df[new_column] = value
• Remove a column. drop() returns a new DataFrame, so assign the result to keep the change
df = df.drop(column_name, axis=1)
• We can grab a row by its default index, a running number starting from 0, or we can set a column
as the index and grab a row by the new named index. Note that we CANNOT access a row with
square brackets alone as we did with a Series; we need to use iloc[] or loc[]
df = df.set_index('Payment ID')
df.iloc[4] # access by position (number index)
df.loc['Albert'] # access by named index; labels are case-sensitive
• Remove a row (this also returns a new DataFrame)
df.drop(0, axis=0) # only if no named index
df.drop('Albert',axis=0)
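The column and row operations above, sketched on a small hypothetical DataFrame. The column names name, total_bill, tip and tip_pct are made up for illustration and are not from the lesson's dataset.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'name': ['Albert', 'Betty', 'Carl'],
    'total_bill': [20.0, 35.5, 12.0],
    'tip': [3.0, 5.0, 2.0],
})

df['tip_pct'] = df['tip'] / df['total_bill'] * 100   # create a new column
without_tip = df.drop('tip', axis=1)                  # returns a copy; df itself still has 'tip'

df = df.set_index('name')                 # use the 'name' column as the named index
second_row = df.iloc[1]                   # by position
albert = df.loc['Albert']                 # by label (case-sensitive)
no_albert = df.drop('Albert', axis=0)     # remove a row by label; again returns a copy

print(second_row)
print(no_albert)
```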
Conditional Filtering
To filter a DataFrame
• First, create the bool Series that matches the condition
bool_series = df['total_bill'] > 30
• AND logic - both conditions must be True
bool_series = (df['total_bill'] > 30) & (df['gender'] == 'M')
• OR logic - only one needs to be True
bool_series = (df['total_bill'] > 30) | (df['gender'] == 'M')
• If day IS IN 'Sat' or 'Sun'
bool_series = df['day'].isin(['Sat','Sun'])
• Finally, use the bool Series to index the DataFrame; rows at positions where the Series is
False will be filtered out
df[bool_series]
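A runnable sketch of the three filters, using a hypothetical four-row DataFrame with the same column names as the slide. The parentheses around each comparison are required because & and | bind more tightly than > and ==.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'total_bill': [35.0, 12.5, 40.0, 31.0],
    'gender': ['M', 'F', 'F', 'M'],
    'day': ['Sat', 'Sun', 'Thur', 'Fri'],
})

big_and_male = df[(df['total_bill'] > 30) & (df['gender'] == 'M')]   # both conditions
big_or_male = df[(df['total_bill'] > 30) | (df['gender'] == 'M')]    # either condition
weekend = df[df['day'].isin(['Sat', 'Sun'])]                         # membership test

print(big_and_male)
print(big_or_male)
print(weekend)
```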
Image Source: https://medium.datadriveninvestor.com/pandas-data-frame-101-filtering-data-loc-iloc-939301489088

Sorting and Max and Min Index

• Sort by one or more columns; ascending can be True or False
df.sort_values([column1,column2],ascending=False)
• Find max value of a column
df[column_name].max()
• Find index of the row containing the max value of the column
df[column_name].idxmax()
• Find index of the row containing the min value of the column
df[column_name].idxmin()
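A short sketch of sorting and of idxmax()/idxmin() on hypothetical data. Note that idxmax() returns the index label, which can then be passed to loc[] to fetch the whole row.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {'total_bill': [20.0, 35.5, 12.0, 35.5], 'tip': [3.0, 5.0, 2.0, 4.0]},
    index=['a', 'b', 'c', 'd'],
)

by_bill_then_tip = df.sort_values(['total_bill', 'tip'], ascending=False)
highest_tip = df['tip'].max()
label_of_max = df['tip'].idxmax()     # index label of the row with the largest tip
label_of_min = df['tip'].idxmin()     # index label of the row with the smallest tip
row_with_max = df.loc[label_of_max]   # the full row for that label

print(by_bill_then_tip)
print(highest_tip, label_of_max, label_of_min)
```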
Dropping Rows with Missing Data

• Often the dataset we get is not complete as some rows contain missing data.
• In the previous lesson, we removed the rows with missing data by using a
component in Azure.
• To get Pandas to drop the rows with missing data:
df = df.dropna()
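Before dropping rows it can help to count how much data is missing. The sketch below, on hypothetical data, counts missing values per column with isna().sum(), then compares dropna() with the narrower dropna(subset=...), which only checks the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({'x': [2.0, 3.0, np.nan], 'y': ['a', None, 'c']})

missing_per_column = df.isna().sum()   # number of missing values in each column
complete_rows = df.dropna()            # drops any row with at least one missing value
x_present = df.dropna(subset=['x'])    # drops rows only when 'x' is missing

print(missing_per_column)
print(complete_rows)
print(x_present)
```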
(Figure: dropna() applied to a Series with values 2.0, 3.0 and NaN; the NaN row is dropped, dtype float64)
Image Source: https://www.w3resource.com/pandas/series/series-dropna.php

Converting Column to Datetime

• Very often, we deal with a Pandas DataFrame that has datetime as one of its columns;
this means the data is timestamped (each row corresponds to a certain datetime).
• Such data is therefore called a time series
• The datetime column is often stored in string format, so we cannot
perform datetime operations on it.
• Pandas can convert it into datetime format easily
df[column_name] = pd.to_datetime(df[column_name])
• Pandas can even attempt to convert the column to datetime format when reading
from a CSV file
df = pd.read_csv('filename.csv', parse_dates=[datetime_column_index])
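A minimal sketch of the conversion on hypothetical data (the column names date and sales are made up). Once the column is in datetime format, the .dt accessor and date arithmetic become available.

```python
import pandas as pd

# Hypothetical data: dates stored as strings.
df = pd.DataFrame({'date': ['2023-01-15', '2023-02-20'], 'sales': [100, 150]})

df['date'] = pd.to_datetime(df['date'])         # strings become datetime values
months = df['date'].dt.month                    # .dt exposes datetime parts
gap = df['date'].iloc[1] - df['date'].iloc[0]   # subtracting datetimes gives a Timedelta

print(df.dtypes)
print(list(months), gap.days)
```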
Student Activity
1. How do we access the 2nd item in a Pandas Series?
myser = pd.Series([1965,1965,1957])
myser[1]
2. How do we access the 3rd item using a named index?
myser=pd.Series([1965,1965,1957],['Singapore','Maldives','Malaysia'])
myser.Malaysia # or myser['Malaysia']
3. Can you create one more column named 'tip_per_person' that is
actually the tip divided by the size and rounded to 2 decimals?
df['tip_per_person']=np.round(df['tip'] / df['size'],2)
Student Activity
4. Can you grab the 200th row and beyond?
df.iloc[199:]
Why does it start at 199?

What does empty end index mean?
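One way to explore these two questions is the sketch below, on a hypothetical 250-row DataFrame whose value column simply numbers the rows from 1. Positions in iloc[] start at 0, so the 200th row sits at position 199, and a slice with no end runs to the last row.

```python
import pandas as pd

# Hypothetical data: row n (counting from 1) holds the value n.
df = pd.DataFrame({'value': range(1, 251)})

tail_rows = df.iloc[199:]               # position 199 onwards, to the end
first_value = tail_rows['value'].iloc[0]
n_rows = len(tail_rows)

print(first_value, n_rows)
```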
Putting together what you have learnt

2. Write a Pandas program to find the number of rows and columns of the
diamonds Dataframe.
print(diamonds.shape)
3. Write a Pandas program to find the data type of each column of the diamonds
Dataframe.
print(diamonds.dtypes)

4. Write a Pandas program to create a new 'quality-color' column that
concatenates the data from the 'cut' column with the data from the 'color'
column.
E.g., if 'cut' is 'Ideal' and 'color' is 'E', 'quality-color' is 'Ideal E'
diamonds['quality-color'] = diamonds['cut'] + ' ' + diamonds['color']

5. Write a Pandas program to remove the 'cut' column of the diamonds
Dataframe.
diamonds = diamonds.drop('cut', axis=1)


6. Write a Pandas program to sort by 'color' in ascending order.
result = diamonds.sort_values('color')

7. Write a Pandas program to drop a row if any or all values in the row are
missing. Print the DataFrame before and after dropping the rows with missing
values and compare.
print("Original Dataframe:")
print(diamonds)
print("\nAfter cleaning:")
diamonds = diamonds.dropna()
print(diamonds)

8. Write the Pandas program to grab the data of the diamonds where length>5,
width>5 and depth>5.
result = diamonds[(diamonds.x>5) & (diamonds.y>5) & (diamonds.z>5)]
print(result)
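To try exercises 4, 5 and 8 without the real dataset, the sketch below builds a tiny stand-in DataFrame. The three rows are invented for illustration; only the column names cut, color, x, y and z follow the exercises.

```python
import pandas as pd

# Invented stand-in for the diamonds dataset (x = length, y = width, z = depth).
diamonds = pd.DataFrame({
    'cut': ['Ideal', 'Premium', 'Good'],
    'color': ['E', 'I', 'E'],
    'x': [5.2, 4.9, 6.1],
    'y': [5.3, 5.1, 6.0],
    'z': [5.1, 3.0, 5.5],
})

# Exercise 4: concatenate two string columns with a space in between.
diamonds['quality-color'] = diamonds['cut'] + ' ' + diamonds['color']

# Exercise 8: keep rows where all three dimensions exceed 5.
large = diamonds[(diamonds.x > 5) & (diamonds.y > 5) & (diamonds.z > 5)]

# Exercise 5: remove the 'cut' column (drop returns a new DataFrame).
diamonds = diamonds.drop('cut', axis=1)

print(large)
print(diamonds.columns.tolist())
```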
