
"Our life and our wealth is undying Tamil – proclaim it like a conch" … Puratchikkavi (Bharathidasan)

www.DataScienceInTamil.com

Pandas - Latest version: 1.5.3


Day 18 - Mar 03, 2023
Chapter 18: Pandas – DataFrame
(Prerequisites to learn Pandas: Python and NumPy – watch the recorded videos
on our YouTube channel)

Watch this class's video on YouTube:

Official Website
https://DataScienceInTamil.com/
For more important questions and answers (FAQ):
https://www.DatascienceInTamil.com/#faq

To watch recorded Python, NumPy videos in YouTube


https://www.youtube.com/channel/UCTCMjShTpZg96cXloCO9q1w

To join DataScienceInTamil Telegram group:


Link to add your friends to this group:
https://t.me/joinchat/lUZEsr-zidpjZjEx

To join the class, please fill in the form


https://forms.gle/mCtYHUg1QL31wNQW9

Join Zoom Meeting


https://us06web.zoom.us/j/83090030937?pwd=ZDRDMmNFb3MxWjlKL1U4QWV5Q
1Fldz09

Meeting ID: 830 9003 0937


Passcode: DSIT
(Monday through Friday 8 PM to 10 PM IST)
----------------------------
We support open-source products to spread technology to the
masses.
➢This is a completely FREE training course providing an introduction to the
Python language
➢All materials / contents / images/ examples and logo used in this
document are owned by the respective companies / websites. We use
those contents for FREE teaching purposes only.
➢We take utmost care to provide credit whenever we use materials
from external sources. If we have missed acknowledging any content that
we used here, please feel free to inform us at
info@DataScienceInTamil.com.
➢All the programming examples in this document are for FREE
teaching purposes only.

Thanks to the open-source community and to the websites below, from which we take references, content, code
examples and definitions. Please use these websites for further reading:

• Book : Python Notes For Professionals


• https://www.w3schools.com
• https://www.geeksforgeeks.org
• https://www.askpython.com
• https://docs.python.org
• https://www.programiz.com/
• https://www.openriskmanagement.com/
• https://pynative.com/python-sets/
• https://www.alphacodingskills.com/
• https://codedestine.com/
• https://appdividend.com/
• https://freecontent.manning.com/
• https://stackoverflow.com/
• https://datagy.io/python-isdigit
• https://www.datacamp.com/community/tutorials/functions-python-tutorial
• https://data-flair.training/blogs/python-function/
• https://problemsolvingwithpython.com/07-Functions-and-Modules/07.07-Positional-and-Keyword-Arguments/
• https://www.tutorialsteacher.com/python/callable-method
• https://pynative.com/python-function-arguments/
• https://data-flair.training/blogs/python-built-in-functions/
• https://www.geeksforgeeks.org/higher-order-functions-in-python/
• https://pandas.pydata.org/
• https://numpy.org/
• https://docs.python.org/3/tutorial/
• https://docs.python.org/3.9/tutorial/index.html
• https://blog.finxter.com/what-are-advantages-of-numpy-over-regular-python-lists/
• https://towardsdatascience.com/lets-talk-about-numpy-for-datascience-beginners-b8088722309f
• https://data-flair.training/blogs/numpy-applications/
• https://data-flair.training/blogs/numpy-features/amp/
• https://www.tutorialspoint.com/python_pandas/index.htm
• https://www.tutorialspoint.com/python_pandas/python_pandas_discussion.htm
• https://github.com/codebasics/py/blob/master/pandas/7_group_by/split_apply_combine.png
• All the notes below are taken from the above URLs

Topics to be covered today


1. Outer Join - combine all columns from 2 DataFrames (union)
2. Inner Join - intersection (common values) - this is the default join
3. Pandas - Concatenation
4. Concatenating Objects
5. Code with ignore_index=True + NaN
6. Concatenating along axis=1 (the columns of the 2 DataFrames are appended side by side)
7. Time Series
8. Get Current Time
9. Create a Range of Time with custom time interval
10. Change the Frequency of Time
11. Converting to Timestamps (pd.to_datetime)
12. Pandas - Date Functionality
13. Create a Range of Dates
14. Change the Date Frequency
15. bdate_range - business date
16. How to find the number of working days in the year
17. Offset Aliases
18. Pandas - Timedelta - string
Day 18
Outer Join - combine all columns from 2 DataFrames (union)
print(pd.merge(left, right, how='outer', on='subject_id'))  # same code as above

id_x Name_x subject_id id_y Name_y


0 1.0 Alex sub1 NaN NaN
1 2.0 Amy sub2 1.0 Billy
2 3.0 Allen sub4 2.0 Brian
3 4.0 Alice sub6 4.0 Bryce
4 5.0 Ayoung sub5 5.0 Betty
5 NaN NaN sub3 3.0 Bran
----------

Inner Join – intersection (common values) – this is the default join (check with a Venn
diagram)
print(pd.merge(left, right, on = 'subject_id', how = 'inner'))
output

id_x Name_x subject_id marks id_y Name_y


0 2 Amy sub2 30 1 Billy
1 3 Allen sub4 40 2 Amy
2 4 Alice sub6 50 4 Alice
3 5 Ayoung sub5 60 5 Betty
-------
Pandas - Concatenation
Pandas provides various facilities for easily combining Series and DataFrame objects.
(Older documentation also lists Panel objects, but Panel has been removed from current pandas versions.)

pd.concat(objs, axis=0, join='outer', ignore_index=False)

objs − This is a sequence or mapping of Series or DataFrame objects.

axis − {0, 1, ...}, default 0. This is the axis to concatenate along.
join − {'inner', 'outer'}, default 'outer'. How to handle indexes on the other axis(es): outer for
union and inner for intersection.
ignore_index − boolean, default False. If True, do not use the index values on the
concatenation axis. The resulting axis will be labeled 0, ..., n - 1.
join_axes − (older pandas only) a list of Index objects: specific indexes to use for the other (n-1) axes
instead of performing inner/outer set logic. This argument was removed in pandas 1.0, so it is left out of the signature above.
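
The effect of the join parameter is easiest to see along axis=1. A minimal sketch (the tiny DataFrames a and b below are made up for illustration and are not part of the class example):

import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'y': [3, 4]}, index=[1, 2])

# outer (default): union of the row labels, missing cells become NaN
print(pd.concat([a, b], axis=1, join='outer'))
# inner: only the row labels present in both objects (here, index 1)
print(pd.concat([a, b], axis=1, join='inner'))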

Concatenating Objects
The concat function does all of the heavy lifting of performing concatenation operations
along an axis. Let us create different objects and do concatenation.

#import the pandas library

import pandas as pd
import numpy as np

one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])

two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one, two]))
print("=============")
print(pd.concat([one, two], ignore_index=True))

Output (of the first print; the second print, with ignore_index=True, renumbers the index 0 to 9)

Name subject_id Marks_scored


1 Alex sub1 98
2 Amy sub2 90
3 Allen sub4 87
4 Alice sub6 69
5 Ayoung sub5 78
1 Billy sub2 89
2 Brian sub4 80
3 Bran sub3 79
4 Bryce sub6 97
5 Betty sub5 88
-------
Code with ignore_index=True + NaN
Note the column names 'Marks_scored1' and 'Marks_scored2' in the two DataFrames; because they differ, we get NaN values in the rows coming from the other DataFrame.

import pandas as pd
import numpy as np

one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored1':[98,90,87,69,78]},
index=[1,2,3,4,5])

two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored2':[89,80,79,97,88]},
index=[1,2,3,4,5])

print(pd.concat([one, two]))
print("=============")
print(pd.concat([one, two], ignore_index=True))

output (of the second print, with ignore_index=True; the index runs 0 to 9)

Name subject_id Marks_scored1 Marks_scored2


0 Alex sub1 98.0 NaN
1 Amy sub2 90.0 NaN
2 Allen sub4 87.0 NaN
3 Alice sub6 69.0 NaN
4 Ayoung sub5 78.0 NaN
5 Billy sub2 NaN 89.0
6 Brian sub4 NaN 80.0
7 Bran sub3 NaN 79.0
8 Bryce sub6 NaN 97.0
9 Betty sub5 NaN 88.0
-------------
Note: try with print(pd.concat([one, two], ignore_index=False))
--------------
Suppose we wanted to associate specific keys with each of the pieces of the concatenated
DataFrame. We can do this by using the keys argument.

import pandas as pd
import numpy as np

one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])

print(pd.concat([one, two],keys=['DF1', 'DF2']))

output

Name subject_id Marks_scored


DF1 1 Alex sub1 98
2 Amy sub2 90
3 Allen sub4 87
4 Alice sub6 69
5 Ayoung sub5 78
DF2 1 Billy sub2 89
2 Brian sub4 80
3 Bran sub3 79
4 Bryce sub6 97
5 Betty sub5 88
-----------
With keys, the result has a hierarchical index (MultiIndex): the outer level is the key and the inner level keeps each DataFrame's original index, so the index values 1 to 5 appear twice.
If the resultant object has to follow its own indexing, set ignore_index to True.

print(pd.concat([one, two],keys=['x','y'],ignore_index=True))
output
Name subject_id Marks_scored
0 Alex sub1 98
1 Amy sub2 90
2 Allen sub4 87
3 Alice sub6 69
4 Ayoung sub5 78
5 Billy sub2 89
6 Brian sub4 80
7 Bran sub3 79
8 Bryce sub6 97
9 Betty sub5 88
Note: observe that the index is renumbered completely (0 to 9) and the keys are discarded; ignore_index=True overrides keys.
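When the keys are kept (i.e. ignore_index is left as False), they become the outer level of a MultiIndex, so each original piece can be pulled back out with .loc. A small sketch with shortened DataFrames (same idea as above, fewer rows):

import pandas as pd

one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks_scored': [89, 80]}, index=[1, 2])

c = pd.concat([one, two], keys=['DF1', 'DF2'])   # keys kept, hierarchical index
print(c.loc['DF1'])        # rows that came from the first DataFrame
print(c.loc[('DF2', 1)])   # outer key + inner index selects a single row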
---------
If two objects are concatenated along axis=1, the columns of the 2
DataFrames are appended / concatenated side by side.
import pandas as pd
import numpy as np

one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])

two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,4,5,7])

print(pd.concat([one, two], axis=1))

output

Based on the row index labels, it concatenates all the columns of the two DataFrames;
labels that exist in only one of them (3 and 7 here) get NaN in the other DataFrame's columns.

     Name subject_id  Marks_scored   Name subject_id  Marks_scored
1    Alex       sub1          98.0  Billy       sub2          89.0
2     Amy       sub2          90.0  Brian       sub4          80.0
3   Allen       sub4          87.0    NaN        NaN           NaN
4   Alice       sub6          69.0   Bran       sub3          79.0
5  Ayoung       sub5          78.0  Bryce       sub6          97.0
7     NaN        NaN           NaN  Betty       sub5          88.0
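A related variant (a sketch; subject_id is dropped just to keep it short): with join='inner' on axis=1, only the row labels present in both DataFrames are kept, so the NaN rows above disappear.

import pandas as pd

one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
                    'Marks_scored': [98, 90, 87, 69, 78]}, index=[1, 2, 3, 4, 5])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
                    'Marks_scored': [89, 80, 79, 97, 88]}, index=[1, 2, 4, 5, 7])

# join='inner' keeps only the row labels present in both: 1, 2, 4 and 5
print(pd.concat([one, two], axis=1, join='inner'))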

------
Concatenating Using append()
append() is deprecated (and has been removed in pandas 2.0); prefer pd.concat.
A useful shortcut to concat used to be the append instance methods on Series and DataFrame.
These methods actually predate concat. They concatenate along axis=0, namely the
index.

print(one.append(two))

Running this in pandas 1.5.x shows:
FutureWarning: The frame.append method is deprecated and will be removed from
pandas in a future version. Use pandas.concat instead.
output

Name subject_id Marks_scored


1 Alex sub1 98
2 Amy sub2 90
3 Allen sub4 87
4 Alice sub6 69
5 Ayoung sub5 78
1 Billy sub2 89
2 Brian sub4 80
3 Bran sub3 79
4 Bryce sub6 97
5 Betty sub5 88
---------

The append method can also take a list of DataFrames, e.g. one.append([two, one, two]).
(The code below simply prints a plain Python list of the three DataFrames; note how such a list is displayed.)

print([two, one, two])

[ Name subject_id Marks_scored


1 Billy sub2 89
2 Brian sub4 80
3 Bran sub3 79
4 Bryce sub6 97
5 Betty sub5 88, Name subject_id Marks_scored
1 Alex sub1 98
2 Amy sub2 90
3 Allen sub4 87
4 Alice sub6 69
5 Ayoung sub5 78, Name subject_id Marks_scored
1 Billy sub2 89
2 Brian sub4 80
3 Bran sub3 79
4 Bryce sub6 97
5 Betty sub5 88]
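
Since append() is deprecated, the same multi-object stacking is written with pd.concat by passing the whole list. A small sketch with shortened DataFrames (made up for illustration):

import pandas as pd

one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks_scored': [89, 80]}, index=[1, 2])

# equivalent of one.append([two, one, two]) without the deprecation warning
print(pd.concat([one, two, one, two]))
print(pd.concat([one, two, one, two], ignore_index=True))  # renumber 0..n-1 if preferred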

-------

Time Series
Pandas provides a robust set of tools for working with time series data, especially in the
financial sector. While working with time series data, we frequently come across the
following −

• Generating sequence of time


• Convert the time series to different frequencies

Pandas provides a relatively compact and self-contained set of tools for performing the
above tasks.

Get Current Time


datetime.now() gives you the current date and time.

import pandas as pd
import datetime
print(pd.datetime.now())

FutureWarning: The pandas.datetime class is deprecated and will be removed from


pandas in a future version. Import from datetime module instead.
print(pd.datetime.now())
2020-08-30 19:59:06.805605

--------
Using the datetime module
import pandas as pd
import datetime as dt
print(dt.datetime.now()) # print(datetime.datetime.now())

output
2020-08-30 20:01:31.065810
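Another option that stays inside pandas (a small sketch; the printed value is simply whatever the current time is when you run it):

import pandas as pd

# pd.Timestamp.now() is the non-deprecated pandas way to get the current date and time
print(pd.Timestamp.now())
print(pd.Timestamp.today())   # similar; both return a pandas Timestamp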
----
Create a Timestamp

Time-stamped data is the most basic type of time series data: it associates values with points in time.
For pandas objects, this means representing those points in time with pd.Timestamp. Let's take an example −

import pandas as pd
print(pd.Timestamp('2023-03-14'))

output
2023-03-14 00:00:00
---------------
What happens if we don't give any parameter to the function?

import pandas as pd
print(pd.Timestamp())
output
TypeError: function missing required argument 'year' (pos 1)
--------------
If we give only one parameter, i.e. the year (as a string)

import pandas as pd
print(pd.Timestamp('2022'))
output
2022-01-01 00:00:00
-----------
If we give the year as an integer / number, it is taken as nanoseconds since the epoch (see the note below).

import pandas as pd
print(pd.Timestamp(2022))

Note: If you pass a single integer or float value to the Timestamp constructor, it returns a timestamp
equivalent to the number of nanoseconds after the Unix epoch (Jan 1, 1970):

output
1970-01-01 00:00:00.000002022
---------
Another example with nanoseconds (converted to a time of day)

import pandas as pd
print(pd.Timestamp(2022022222222))

output
1970-01-01 00:33:42.022222222
----------------

import pandas as pd
print(pd.Timestamp(year=1982, month=9, day=4, hour=1, minute=35,
second=10))

output
1982-09-04 01:35:10
-----------------

It is also possible to convert integer or float epoch times. The default unit for these is
nanoseconds (since that is how Timestamps are stored). However, epochs are often
stored in another unit, which can be specified. Let's take another example:

import pandas as pd
print(pd.Timestamp(1_000_000_000, unit='ns'))  # one billion nanoseconds is equal to 1 second
print(pd.Timestamp(3661, unit='s'))            # one hour, one minute and one second = 3661 seconds
print(pd.Timestamp(62, unit='m'))
print(pd.Timestamp(25, unit='h'))

output
1970-01-01 00:00:01
1970-01-01 01:01:01
1970-01-01 01:02:00
1970-01-02 01:00:00
-------

Create a Range of Time with custom time interval


import pandas as pd
print(pd.date_range("11:00", "13:30", freq='20min'))
print(type(pd.date_range("11:00", "13:30", freq='20min')))
print("===============")
print(pd.date_range("11:00", "13:30", freq='20min').time)
print(type(pd.date_range("11:00", "13:30", freq='20min').time))

output
DatetimeIndex(['2022-03-14 11:00:00', '2022-03-14 11:20:00',
'2022-03-14 11:40:00', '2022-03-14 12:00:00',
'2022-03-14 12:20:00', '2022-03-14 12:40:00',
'2022-03-14 13:00:00', '2022-03-14 13:20:00'],
dtype='datetime64[ns]', freq='20T')
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
===============
[datetime.time(11, 0) datetime.time(11, 20) datetime.time(11, 40)
datetime.time(12, 0) datetime.time(12, 20) datetime.time(12, 40)
datetime.time(13, 0) datetime.time(13, 20)]
<class 'numpy.ndarray'>
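The same range can also be built by giving the number of periods instead of an end time (a quick sketch of the alternative call):

import pandas as pd

# 8 steps of 20 minutes starting at 11:00 gives the same times as above
a = pd.date_range("11:00", periods=8, freq='20min')
print(a.time)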
--------

Change the Frequency of Time

(i.e. change the frequency used to generate the range)
import pandas as pd
a = (pd.date_range("11:00", "13:30", freq='H').time)  # returns the times in hourly steps
print(a)
print(type(a))
output
[datetime.time(11, 0) datetime.time(12, 0) datetime.time(13, 0)]
<class 'numpy.ndarray'>
Note: a = (pd.date_range("11:00", "13:30", freq='2H').time) also possible

----------

Converting to Timestamps (pd.to_datetime)


To convert a Series or a list-like object of date-like objects, for example strings, epochs, or a
mixture, you can use the to_datetime function. When passed a Series, this returns a Series (with
the same index), while a list-like is converted to a DatetimeIndex. (In the second example below,
the DatetimeIndex is then used to create a DataFrame.)

import pandas as pd
a = pd.to_datetime(pd.Series(data=['Feb 28 2023', '2023-01-03', '2023-11-03', "", None, '2023-5', '']))
print(a)
print(type(a))
output

0   2023-02-28
1   2023-01-03
2   2023-11-03
3          NaT
4          NaT
5   2023-05-01
6          NaT
dtype: datetime64[ns]
<class 'pandas.core.series.Series'>
Note:
NaT means Not a Time (equivalent to NaN)
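
One related option worth knowing (a small sketch, not part of the class code): by default an unparseable string makes to_datetime raise an error; errors='coerce' turns it into NaT instead.

import pandas as pd

# 'not a date' cannot be parsed; errors='coerce' converts it to NaT rather than failing
print(pd.to_datetime(['2023-01-03', 'not a date'], errors='coerce'))
# DatetimeIndex(['2023-01-03', 'NaT'], dtype='datetime64[ns]', freq=None)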

-------

Let’s take another example.


import pandas as pd
a = pd.to_datetime(['2023/11/23', '2010.12.31', None, '2023/11/23',])
print(a)
print(type(a))
b = pd.DataFrame(data=a)
print(b)

output
DatetimeIndex(['2023-11-23', '2010-12-31', 'NaT', '2023-11-23'], dtype='datetime64[ns]',
freq=None)
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
0
0 2023-11-23
1 2010-12-31
2 NaT
3 2023-11-23
-------

Pandas - Date Functionality

(Some of the code below is older style; use the Offset Aliases listed later together with these functions.)
Extending the time series, date functionality plays a major role in financial data analysis.
While working with date data, we will frequently come across the following −

• Generating sequence of dates


• Convert the date series to different frequencies

Create a Range of Dates

Using the date_range() function and specifying the periods and the frequency, we can
create a date series. By default, the frequency of the range is calendar days ('D').
import pandas as pd
import numpy as np
a = pd.date_range('1/1/2020', periods=5)
print(a)
print(type(a))
output

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',


'2020-01-05'],
dtype='datetime64[ns]', freq='D')
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
Note: a = pd.date_range('2021-01-11', periods=5, freq='1H') is also possible.

------

Change the Date Frequency (useful functions; make use of these in real-world projects)
import pandas as pd
import numpy as np
a = pd.date_range('1/1/2020', freq='M', periods=5)
print(a)
print(type(a))
output
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
'2020-05-31'],
dtype='datetime64[ns]', freq='M')
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
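A few more frequencies with the same call pattern (a quick sketch; the full list of aliases appears in the Offset Aliases table later):

import pandas as pd

print(pd.date_range('1/1/2020', freq='MS', periods=3))  # month start dates
print(pd.date_range('1/1/2020', freq='W', periods=3))   # weekly (Sundays by default)
print(pd.date_range('1/1/2020', freq='Q', periods=3))   # quarter end dates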
-----------

bdate_range - business date


bdate_range() stands for business date ranges. Unlike date_range(), it excludes Saturday
and Sunday.

import pandas as pd
import numpy as np
a = pd.bdate_range('8/29/2020', periods=5)
print(a)
print(type(a))
output
DatetimeIndex(['2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03',
'2020-09-04'],
dtype='datetime64[ns]', freq='B')
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

Observe that it skips 08/29 (a Saturday) and 08/30 (a Sunday); the range starts at 08/31
(Monday). Just check your calendar for the days.
=========

Convenience functions like date_range and bdate_range utilize a variety of frequency


aliases. The default frequency for date_range is a calendar day while the default for
bdate_range is a business day.

import pandas as pd
import numpy as np
start = pd.datetime(2023, 8, 1)
end = pd.datetime(2023, 8, 5)

print(pd.date_range(start, end))
output
DatetimeIndex(['2023-08-01', '2023-08-02', '2023-08-03', '2023-08-04',
               '2023-08-05'],
              dtype='datetime64[ns]', freq='D')

Note: The pandas.datetime class is deprecated and will be removed from pandas in a
future version. Import from datetime module instead.
-----------
How to find number of working days for the year

import pandas as pd
a = pd.bdate_range(start='2023 01 01', end='2023 12 31')
print(a)
print(type(a))
print(len(a))

output
DatetimeIndex(['2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
'2023-01-06', '2023-01-09', '2023-01-10', '2023-01-11',
'2023-01-12', '2023-01-13',
...
'2023-12-18', '2023-12-19', '2023-12-20', '2023-12-21',
'2023-12-22', '2023-12-25', '2023-12-26', '2023-12-27',
'2023-12-28', '2023-12-29'],
dtype='datetime64[ns]', length=260, freq='B')
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
260
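If specific holidays should be excluded as well (not just weekends), bdate_range accepts a custom business-day frequency. A hedged sketch; the holiday dates below are made-up examples, not an official holiday calendar:

import pandas as pd

# freq='C' (custom business day) lets us drop listed holidays in addition to weekends
holidays = ['2023-01-26', '2023-08-15', '2023-10-02']
a = pd.bdate_range(start='2023-01-01', end='2023-12-31', freq='C', holidays=holidays)
print(len(a))   # 260 weekdays minus however many listed holidays fall on weekdays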
----------------
Another way is below

import pandas as pd
import numpy as np
start = pd.datetime(2023, 1, 1)
end = pd.datetime(2023, 12, 31)
a = pd.bdate_range(start, end)
for i in a:
    # print("Calendar day: >> ", i)
    pass

print("Working days in the year:", len(a))
print("Number of weekend days:", 365 - len(a), "Days")
output
Working days in the year: 260
Number of weekend days: 105 Days

=========

If bdate_range() is used instead (with the earlier start = Aug 1 and end = Aug 5, 2023), Saturday and Sunday are excluded:
print(pd.bdate_range(start, end))
output
DatetimeIndex(['2023-08-01', '2023-08-02', '2023-08-03', '2023-08-04'], dtype='datetime64[ns]', freq='B')
----------

Offset Aliases
A number of string aliases are given to useful common time series frequencies. We will refer to these aliases as offset
aliases.

Alias    Description                          Alias    Description
B        business day frequency               BQS      business quarter start frequency
D        calendar day frequency               A        annual (year) end frequency
W        weekly frequency                     BA       business year end frequency
M        month end frequency                  BAS      business year start frequency
SM       semi-month end frequency             BH       business hour frequency
BM       business month end frequency         H        hourly frequency
MS       month start frequency                T, min   minutely frequency
SMS      semi-month start frequency           S        secondly frequency
BMS      business month start frequency       L, ms    milliseconds
Q        quarter end frequency                U, us    microseconds
BQ       business quarter end frequency       N        nanoseconds
QS       quarter start frequency
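
A few aliases from the table in action (a quick sketch; run it to see the generated dates):

import pandas as pd

print(pd.date_range('2023-01-01', periods=4, freq='B'))   # business days
print(pd.date_range('2023-01-01', periods=4, freq='SM'))  # semi-month end: the 15th and month end
print(pd.date_range('2023-01-01', periods=4, freq='A'))   # year end dates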

Pandas – Timedelta
Timedeltas are differences in times, expressed in different units, for example days,
hours, minutes, seconds. They can be both positive and negative.
We can create Timedelta objects using various arguments, as shown below.
String

By passing a string literal, we can create a timedelta object.

import pandas as pd
import numpy as np

a = pd.to_timedelta('2 days 2 hours 15 minutes 30 seconds')


print(a)
print(type(a))

output
2 days 02:15:30
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
-------------------
The formats below are also acceptable. Execute line by line, predict the result, then see the
output, and then discuss.
import pandas as pd
import numpy as np

# a = pd.to_timedelta('2 days 2 hours 15 minutes 30 seconds')


# a = pd.to_timedelta('2 days 2 hours 15 minutes 30 secondS')
# a = pd.to_timedelta('2 days 2 hours 15 minutes 30 sec')
# a = pd.to_timedelta('2 days 2 hours 15 min 30 sec')
# a = pd.to_timedelta('2 days 2 hr 15 min 30 sec')
# a = pd.to_timedelta('2 days 2 h 15 m 30 s')
a = pd.to_timedelta('2 d 2 h 15 m 30 s')

print(a)
print(type(a))

output
2 days 02:15:30
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
-------------
Try below

import pandas as pd
import numpy as np

a = pd.to_timedelta('2 days 26 hours 15 minutes 30 seconds')


print(a)
print("++++++++")

a = pd.to_timedelta('256 hours')
print(a)
print("++++++++")

a = pd.to_timedelta('256 minutes')
print(a)
print("++++++++")

a = pd.to_timedelta('256 seconds')
print(a)
print(type(a))

output

3 days 02:15:30
++++++++
10 days 16:00:00
++++++++
0 days 04:16:00
++++++++
0 days 00:04:16
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
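
Since timedeltas are differences, they combine naturally with Timestamps, and subtracting two Timestamps yields a (possibly negative) Timedelta. A small sketch with arbitrarily chosen dates:

import pandas as pd

start = pd.Timestamp('2023-03-03 20:00:00')
delta = pd.to_timedelta('2 days 2 hours')

print(start + delta)   # 2023-03-05 22:00:00
print(start - delta)   # 2023-03-01 18:00:00
print(pd.Timestamp('2023-03-01') - pd.Timestamp('2023-03-03'))   # -2 days +00:00:00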
