www.DataScienceInTamil.com
Official Website
https://DataScienceInTamil.com/
For more important questions and answers:
https://www.DatascienceInTamil.com/#faq
Thanks to the open-source community and to the websites below, from which we take references, content, code
examples, and definitions. Please use these websites for further reading:
Inner Join: intersection (common values). This is the default join for pd.merge (check with a Venn
diagram).
print(pd.merge(left, right, on = 'subject_id', how = 'inner'))
output
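The frames `left` and `right` used above are not shown in this part of the notes, so the following is only a minimal sketch with hypothetical data. It illustrates that an inner join keeps just the `subject_id` values found in both frames.

```python
import pandas as pd

# Hypothetical frames; the original `left` and `right` are not shown in the text.
left = pd.DataFrame({
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6'],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice']})
right = pd.DataFrame({
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6'],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce']})

# how='inner' keeps only the keys present in both frames.
# The overlapping 'Name' column gets the default suffixes _x and _y.
inner = pd.merge(left, right, on='subject_id', how='inner')
print(inner)
```

Here sub1 (only in `left`) and sub3 (only in `right`) are dropped from the result.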
pd.concat(objs, axis=0, join='outer', ignore_index=False)
Concatenating Objects
The concat function does all of the heavy lifting of performing concatenation operations
along an axis. Let us create different objects and do concatenation.
import pandas as pd
import numpy as np
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one, two]))
print("=============")
print(pd.concat([one, two], ignore_index=True))
Output
import pandas as pd
import numpy as np
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored1':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored2':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one, two]))
print("=============")
print(pd.concat([one, two], ignore_index=True))
output
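Since the output is not reproduced here, the following sketch (with shortened, hypothetical frames) suggests what to expect when the two frames do not share all columns. With the default join='outer' the result has the union of the columns and the gaps are filled with NaN; join='inner' keeps only the shared columns.

```python
import pandas as pd

# Shortened, hypothetical versions of `one` and `two` with different mark columns.
one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored1': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks_scored2': [89, 80]}, index=[1, 2])

# Default join='outer': union of columns, missing cells become NaN.
outer = pd.concat([one, two])
print(outer)

# join='inner': only the columns common to both frames survive.
inner = pd.concat([one, two], join='inner')
print(inner)
```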
import pandas as pd
import numpy as np
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one, two],keys=['x','y'],ignore_index=True))
output
Name subject_id Marks_scored
0 Alex sub1 98
1 Amy sub2 90
2 Allen sub4 87
3 Alice sub6 69
4 Ayoung sub5 78
5 Billy sub2 89
6 Brian sub4 80
7 Bran sub3 79
8 Bryce sub6 97
9 Betty sub5 88
Note: Observe that the index is renumbered from 0 and the keys are discarded, because ignore_index=True overrides keys.
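For contrast, a small sketch (hypothetical, shortened frames) of what keys does when ignore_index is left at its default: the keys become the outer level of a MultiIndex, so each source frame can be selected again by its key.

```python
import pandas as pd

# Shortened, hypothetical frames for illustration.
one = pd.DataFrame({'Name': ['Alex', 'Amy']}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian']}, index=[1, 2])

# Without ignore_index=True the keys are kept as the outer index level.
labelled = pd.concat([one, two], keys=['x', 'y'])
print(labelled)

# Select the rows that came from the second frame.
print(labelled.loc['y'])
```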
---------
If two objects need to be added along axis=1, the columns of the two
DataFrames are placed side by side, aligned on the index.
import pandas as pd
import numpy as np
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,4,5,7])
print(pd.concat([one, two], axis=1))
output
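The output is not reproduced here. As a rough sketch with shortened, hypothetical frames: along axis=1 the rows are aligned on the index, the result index is the union of both indexes, and any row missing from one frame is filled with NaN on that side.

```python
import pandas as pd

# Shortened, hypothetical frames whose indexes only partly overlap.
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'],
                    'Marks_scored': [98, 90, 87]}, index=[1, 2, 3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'],
                    'Marks_scored': [89, 80, 79]}, index=[1, 2, 4])

# Columns are placed side by side; rows are matched by index label.
side_by_side = pd.concat([one, two], axis=1)
print(side_by_side)
```

Index 3 exists only in `one` and index 4 only in `two`, so each of those rows has two NaN cells.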
------
Concatenating Using append()
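Note: DataFrame.append and Series.append were deprecated in pandas 1.4 and removed in pandas 2.0. Use pd.concat instead.
The append instance methods on Series and DataFrame were a useful shortcut to concat.
These methods actually predated concat. They concatenate along axis=0, namely the
index.
print(one.append(two)) # works only on pandas versions older than 2.0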
output
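On current pandas versions the same result can be obtained with pd.concat. The sketch below uses shortened, hypothetical frames; `new_row` is an illustrative name, not something from the original notes.

```python
import pandas as pd

# Shortened, hypothetical frames for illustration.
one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks_scored': [89, 80]}, index=[1, 2])

# Equivalent of the old one.append(two)
combined = pd.concat([one, two])
print(combined)

# Equivalent of appending a single row given as a dict:
# wrap it in a one-row DataFrame first.
new_row = {'Name': 'Ann', 'Marks_scored': 75}
with_row = pd.concat([one, pd.DataFrame([new_row])], ignore_index=True)
print(with_row)
```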
-------
Time Series
Pandas provides robust tools for working with time series data, especially in the
financial sector. While working with time series data, we frequently come across the
following −
Pandas provides a relatively compact and self-contained set of tools for performing the
above tasks.
import pandas as pd
print(pd.Timestamp.now()) # pd.datetime.now() no longer works: pd.datetime was removed in pandas 2.0
--------
Using the datetime package
import pandas as pd
import datetime as dt
print(dt.datetime.now()) # with a plain "import datetime", use datetime.datetime.now()
output
2020-08-30 20:01:31.065810
----
Create a TimeStamp
Time-stamped data is the most basic type of time series data; it associates values with points in time. For pandas
objects, it means using the points in time. Let’s take an example −
import pandas as pd
print(pd.Timestamp('2023-03-14'))
output
2023-03-14 00:00:00
---------------
What happens if we don’t pass any argument to the function?
import pandas as pd
print(pd.Timestamp())
output
TypeError: function missing required argument 'year' (pos 1)
--------------
If we give only one argument, i.e. the year as a string
import pandas as pd
print(pd.Timestamp('2022'))
output
2022-01-01 00:00:00
-----------
If we give the year as an integer, it is not read as a year. The value is taken as a number of nanoseconds.
import pandas as pd
print(pd.Timestamp(2022))
Note: If you pass a single integer or float value to the Timestamp constructor, it returns a timestamp
equivalent to that number of nanoseconds after the Unix epoch (Jan 1, 1970):
output
1970-01-01 00:00:00.000002022
---------
Another example with nanoseconds (the integer is converted to a time)
import pandas as pd
print(pd.Timestamp(2022022222222))
output
1970-01-01 00:33:42.022222222
----------------
import pandas as pd
print(pd.Timestamp(year=1982, month=9, day=4, hour=1, minute=35,
second=10))
output
1982-09-04 01:35:10
-----------------
It is also possible to convert integer or float epoch times. The default unit for these is
nanoseconds (since that is how Timestamps are stored). However, epochs are often
stored in another unit, which can be specified. Let’s take another example
import pandas as pd
print(pd.Timestamp(1_000_000_000, unit='ns')) # one billion nanoseconds is equal to 1 second
print(pd.Timestamp(3661, unit='s')) # one hour, one minute and one second = 3661 seconds
print(pd.Timestamp(62, unit='m')) # 62 minutes = 1 hour 2 minutes
print(pd.Timestamp(25, unit='h')) # 25 hours = 1 day 1 hour
output
1970-01-01 00:00:01
1970-01-01 01:01:01
1970-01-01 01:02:00
1970-01-02 01:00:00
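The same unit argument is accepted by pd.to_datetime, which is handy when a whole column of epoch values has to be converted. A minimal sketch with made-up epoch values:

```python
import pandas as pd

# Made-up epoch values, in seconds since 1970-01-01.
epochs = pd.Series([0, 3661, 86400])

# unit='s' tells pandas the integers are seconds, not nanoseconds.
stamps = pd.to_datetime(epochs, unit='s')
print(stamps)
```

Here 86400 seconds is exactly one day, so the last value lands on 1970-01-02 00:00:00.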
-------
output
DatetimeIndex(['2022-03-14 11:00:00', '2022-03-14 11:20:00',
'2022-03-14 11:40:00', '2022-03-14 12:00:00',
'2022-03-14 12:20:00', '2022-03-14 12:40:00',
'2022-03-14 13:00:00', '2022-03-14 13:20:00'],
dtype='datetime64[ns]', freq='20T')
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
===============
[datetime.time(11, 0) datetime.time(11, 20) datetime.time(11, 40)
datetime.time(12, 0) datetime.time(12, 20) datetime.time(12, 40)
datetime.time(13, 0) datetime.time(13, 20)]
<class 'numpy.ndarray'>
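The code for the output above is missing from these notes. One way to get an index like this, offered only as a sketch and not necessarily the original code, is pd.date_range with a 20-minute frequency, followed by the .time attribute:

```python
import pandas as pd

# Assumed start and end, read off the output above.
idx = pd.date_range('2022-03-14 11:00', '2022-03-14 13:20', freq='20min')
print(idx)
print(type(idx))
print("===============")

# .time drops the date part and returns a NumPy array of datetime.time objects.
print(idx.time)
print(type(idx.time))
```

Depending on the pandas version, the frequency is displayed as '20T' (older releases) or '20min' (newer releases).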
--------
----------
import pandas as pd
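# The strings use different formats; on pandas 2.x, pass format='mixed' to pd.to_datetime
a = pd.to_datetime(pd.Series(data=(['Feb 28 2023', '2023-01-03', '2023-11-03', "", None, '2023-5', ''])))
print(a)
print(type(a))
output
0 2023-02-28
1 2023-01-03
2 2023-11-03
3 NaT
4 NaT
5 2023-05-01
6 NaT
dtype: datetime64[ns]
<class 'pandas.core.series.Series'>
Note:
NaT means Not a Time (the datetime equivalent of NaN). Empty strings and None are converted to NaT.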
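NaT also appears when unparseable values are coerced instead of raising an error. A short sketch with made-up input, using the errors='coerce' option of pd.to_datetime:

```python
import pandas as pd

# Made-up input: one valid date, one junk string, one missing value.
raw = pd.Series(['2023-01-03', 'not a date', None])

# errors='coerce' turns anything that cannot be parsed into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, errors='coerce')
print(parsed)

# NaT is detected by isna(), just like NaN.
print(parsed.isna())
```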
-------
output
DatetimeIndex(['2023-11-23', '2010-12-31', 'NaT', '2023-11-23'], dtype='datetime64[ns]',
freq=None)
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
0
0 2023-11-23
1 2010-12-31
2 NaT
3 2023-11-23
-------
------
import pandas as pd
import numpy as np
a = pd.bdate_range('8/29/2020', periods=5)
print(a)
print(type(a))
output
DatetimeIndex(['2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03',
'2020-09-04'],
dtype='datetime64[ns]', freq='B')
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
Observe that it skips 08/29 (a Saturday) and 08/30 (a Sunday), so the range starts on 08/31 (a Monday). Just
check your calendar for the days.
=========
import pandas as pd
from datetime import datetime
start = datetime(2023, 8, 1)
end = datetime(2023, 8, 5)
print(pd.date_range(start, end))
output
DatetimeIndex(['2023-08-01', '2023-08-02', '2023-08-03', '2023-08-04',
'2023-08-05'],
dtype='datetime64[ns]', freq='D')
Note: The pandas.datetime class was deprecated in pandas 1.0 and removed in pandas 2.0.
Import from the datetime module instead.
-----------
How to find the number of working days in a year
import pandas as pd
a = pd.bdate_range(start='2023 01 01', end='2023 12 31')
print(a)
print(type(a))
print(len(a))
output
DatetimeIndex(['2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
'2023-01-06', '2023-01-09', '2023-01-10', '2023-01-11',
'2023-01-12', '2023-01-13',
...
'2023-12-18', '2023-12-19', '2023-12-20', '2023-12-21',
'2023-12-22', '2023-12-25', '2023-12-26', '2023-12-27',
'2023-12-28', '2023-12-29'],
dtype='datetime64[ns]', length=260, freq='B')
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
260
----------------
Another way is below
import pandas as pd
from datetime import datetime
start = datetime(2023, 1, 1)
end = datetime(2023, 12, 31)
a = pd.bdate_range(start, end)
for i in a:
    # print("Business day : >> ", i)
    pass
print("Working days in 2023:", len(a))
print("Number of holidays (weekend days):", 365 - len(a), "days")
output
Working days in 2023: 260
Number of holidays (weekend days): 105 days
=========
For the first five days of August 2023, bdate_range() gives the result below (Saturday and Sunday are excluded)
print(pd.bdate_range('2023-08-01', '2023-08-05'))
output
DatetimeIndex(['2023-08-01', '2023-08-02', '2023-08-03', '2023-08-04'], dtype='datetime64[ns]', freq='B')
----------
Offset Aliases
A number of string aliases are given to useful common time series frequencies. We will refer to these aliases as offset
aliases.
Alias    Description                        Alias    Description
B        business day frequency             BQS      business quarter start frequency
D        calendar day frequency             A        annual (year) end frequency
W        weekly frequency                   BA       business year end frequency
M        month end frequency                BAS      business year start frequency
SM       semi-month end frequency           BH       business hour frequency
BM       business month end frequency       H        hourly frequency
MS       month start frequency              T, min   minutely frequency
SMS      semi-month start frequency         S        secondly frequency
BMS      business month start frequency     L, ms    milliseconds
Q        quarter end frequency              U, us    microseconds
BQ       business quarter end frequency     N        nanoseconds
QS       quarter start frequency
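An offset alias is passed as the freq argument of functions such as pd.date_range. The sketch below uses four aliases from the table (D, W, MS, B). The start date and variable names are chosen only for illustration.

```python
import pandas as pd

# 2023-01-01 is a Sunday, which makes the differences easy to see.
days = pd.date_range('2023-01-01', periods=3, freq='D')           # every calendar day
weeks = pd.date_range('2023-01-01', periods=3, freq='W')          # weekly, on Sundays
month_starts = pd.date_range('2023-01-01', periods=3, freq='MS')  # first day of each month
bdays = pd.date_range('2023-01-01', periods=3, freq='B')          # business days only

print(days)
print(weeks)
print(month_starts)
print(bdays)
```

Because the start date is a Sunday, the business-day range begins on Monday 2023-01-02.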
Pandas – Timedelta
Timedeltas are differences in times, expressed in different units, for example, days,
hours, minutes, seconds. They can be both positive and negative.
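A Timedelta most often arises as the difference between two Timestamps. A minimal sketch with made-up timestamps, showing a positive and a negative difference:

```python
import pandas as pd

# Made-up timestamps for illustration.
start = pd.Timestamp('2023-03-14 08:00:00')
end = pd.Timestamp('2023-03-16 10:15:30')

elapsed = end - start      # later minus earlier: positive Timedelta
backwards = start - end    # earlier minus later: negative Timedelta

print(elapsed)
print(backwards)
print(elapsed.total_seconds())

# Adding the Timedelta back to the start returns the end.
print(start + elapsed == end)
```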
We can create Timedelta objects using various arguments as shown below
String
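import pandas as pd
# Assumed input: the original statement is missing from the source; this string reproduces the output below
a = pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
print(a)
print(type(a))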
output
2 days 02:15:30
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
-------------------
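The formats below are also acceptable. Execute them line by line, predict the result, then check the
output and discuss.
import pandas as pd
# Assumed inputs: the original assignments are missing from the source
a = pd.Timedelta('2 days 02:15:30')
# a = pd.Timedelta(days=2, hours=2, minutes=15, seconds=30) # keyword form, same value
print(a)
print(type(a))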
output
2 days 02:15:30
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
-------------
Try below
import pandas as pd
import numpy as np
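# Assumed input: the statement for the first output line is missing from the source
a = pd.to_timedelta('3 days 2 hours 15 minutes 30 seconds')
print(a)
print("++++++++")
a = pd.to_timedelta('256 hours')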
print(a)
print("++++++++")
a = pd.to_timedelta('256 minutes')
print(a)
print("++++++++")
a = pd.to_timedelta('256 seconds')
print(a)
print(type(a))
output
3 days 02:15:30
++++++++
10 days 16:00:00
++++++++
0 days 04:16:00
++++++++
0 days 00:04:16
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
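pd.to_timedelta also accepts plain numbers together with a unit, and it works on a whole list at once. The following is a small sketch with made-up values. It also shows how a large hour count is broken down into days and leftover seconds.

```python
import pandas as pd

# A list of numbers plus a unit gives a TimedeltaIndex.
deltas = pd.to_timedelta([256, 3600], unit='s')
print(deltas)

# 256 hours = 10 days and 16 hours; .days and .seconds expose the two parts.
big = pd.to_timedelta('256 hours')
print(big.days)     # whole days
print(big.seconds)  # leftover seconds within the last day (16 hours)
```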