Daily Task 6 & 7 - Explore Merge Function & Perform Data Cleaning - Jupyter Notebook

Daily Task 6 - Explore Merge Function

Example - 1

In [1]:

import pandas as pd

In [11]:

temp = pd.DataFrame({"City": ['Mumbai','Chennai','Nashik','Pune','Delhi','Banglore'],
                     "Temp": [25,23,22,21,20,26]})
temp

Out[11]:

City Temp

0 Mumbai 25

1 Chennai 23

2 Nashik 22

3 Pune 21

4 Delhi 20

5 Banglore 26

In [12]:

humidity = pd.DataFrame({"City": ['Pune','Mumbai','Chennai','Nashik','Delhi','Tamilnadu'],
                         "Humidity": [75,83,85,78,53,69]})
humidity

Out[12]:

City Humidity

0 Pune 75

1 Mumbai 83

2 Chennai 85

3 Nashik 78

4 Delhi 53

5 Tamilnadu 69


In [13]:

weather = pd.merge(temp,humidity)   ## It will merge only for same values in both dataframes
weather                             ## By default how = "inner", i.e. intersection of both dataframes

Out[13]:

City Temp Humidity

0 Mumbai 25 83

1 Chennai 23 85

2 Nashik 22 78

3 Pune 21 75

4 Delhi 20 53
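As a side note (a minimal sketch, not run in this notebook): pd.merge also accepts a validate argument, which raises a MergeError if the join key is duplicated on either side. Assuming it is available in this pandas version, it is a cheap check that an inner merge has not silently multiplied rows.

checked = pd.merge(temp, humidity, on="City", how="inner", validate="one_to_one")
checked   ## same result as the inner merge above, but it would fail loudly on duplicate City values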

In [17]:

weather = pd.merge(temp,humidity,on = "City",how="outer")   ## It will merge both datasets as a union of keys;
                                                            ## unmatched rows get NaN in the missing columns
weather

Out[17]:

City Temp Humidity

0 Mumbai 25.0 83.0

1 Chennai 23.0 85.0

2 Nashik 22.0 78.0

3 Pune 21.0 75.0

4 Delhi 20.0 53.0

5 Banglore 26.0 NaN

6 Tamilnadu NaN 69.0
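A possible follow-up sketch (not part of the original cells): after an outer merge the unmatched rows carry NaN, which can be filled before further analysis.

weather_filled = weather.copy()
weather_filled["Temp"] = weather_filled["Temp"].fillna(weather_filled["Temp"].mean())   ## fill missing Temp with the column mean
weather_filled["Humidity"] = weather_filled["Humidity"].fillna(0)                       ## placeholder value for missing Humidity
weather_filled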

In [21]:

weather = pd.merge(temp,humidity,on = "City",how="left",indicator = True)   ## It will merge keeping every row of the left
                                                                            ## dataframe; indicator adds a _merge column
weather

Out[21]:

City Temp Humidity _merge

0 Mumbai 25 83.0 both

1 Chennai 23 85.0 both

2 Nashik 22 78.0 both

3 Pune 21 75.0 both

4 Delhi 20 53.0 both

5 Banglore 26 NaN left_only
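The _merge indicator column makes it easy to isolate the unmatched rows. A short sketch using the left-merged frame above:

left_only = weather[weather["_merge"] == "left_only"]   ## cities present only in temp (here: Banglore)
matched   = weather[weather["_merge"] == "both"]        ## cities present in both dataframes
left_only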


In [20]:

weather = pd.merge(temp,humidity,on = "City",how="right",indicator = True)   ## It will merge keeping every row of the right
                                                                             ## dataframe; indicator adds a _merge column
weather

Out[20]:

City Temp Humidity _merge

0 Pune 21.0 75 both

1 Mumbai 25.0 83 both

2 Chennai 23.0 85 both

3 Nashik 22.0 78 both

4 Delhi 20.0 53 both

5 Tamilnadu NaN 69 right_only

Example - 2

In [4]:

electronics = pd.DataFrame({"Brands"    :["HP","LG","Panasonic","Sony"],
                            "Devices"   :["Laptop","Washing Machine","TV","Keyborad"],
                            "Department":["Purchase","HR","Quality","Design"]
                           })
electronics

Out[4]:

Brands Devices Department

0 HP Laptop Purchase

1 LG Washing Machine HR

2 Panasonic TV Quality

3 Sony Keyborad Design


In [5]:

electronics_new = pd.DataFrame({"Brands"    :["Intel","LG","Panasonic","Sony","Haier"],
                                "Devices"   :["Computer","Fridge","TV","AC","Oven"],
                                "Department":["Production","HR","Quality","Design","Purchase"]
                               })
electronics_new

Out[5]:

Brands Devices Department

0 Intel Computer Production

1 LG Fridge HR

2 Panasonic TV Quality

3 Sony AC Design

4 Haier Oven Purchase

In [8]:

accessories = pd.merge(electronics,electronics_new, on = "Brands")


accessories

Out[8]:

Brands Devices_x Department_x Devices_y Department_y

0 LG Washing Machine HR Fridge HR

1 Panasonic TV Quality TV Quality

2 Sony Keyborad Design AC Design

In [10]:

accessories = pd.merge(electronics,electronics_new, on = "Brands",suffixes = ('_left', '_right'))
accessories

Out[10]:

Brands Devices_left Department_left Devices_right Department_right

0 LG Washing Machine HR Fridge HR

1 Panasonic TV Quality TV Quality

2 Sony Keyborad Design AC Design
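pd.merge can also join on several key columns at once. A hypothetical extension of the example above (not in the original notebook): passing a list to on requires rows to agree on every listed column.

multi_key = pd.merge(electronics, electronics_new,
                     on=["Brands", "Department"],        ## rows must match on both keys
                     suffixes=("_old", "_new"))
multi_key   ## LG, Panasonic and Sony agree on Brands and Department, so only Devices gets suffixed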

===============================================

Daily Task 7 - Perform Data Cleaning


In [2]:

sales_data_2017 = pd.read_csv(r'E:\Data Science by John\pandas\Sales Transactions-2017.csv')
sales_data_2017

Out[2]:

Date Voucher Party Product Qty Rate Gross
0 1/4/2017 Sal:1 SOLANKI PLASTICS DONA-VAI-9100 2 1,690.00 3,380
1 1/4/2017 Sal:1 SOLANKI PLASTICS LITE FOAM(1200) 6 1,620.00 9,720
2 1/4/2017 Sal:2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23 11,500
3 1/4/2017 Sal:2 SARNESWARA TRADERS LITE FOAM(1200) 6 1,620.00 9,720
4 1/4/2017 Sal:2 SARNESWARA TRADERS DONA-VAI-9100 5 1,690.00 8,450
... ... ... ... ... ... ...
47285 31/03/2018 Sal:10042 Vkp 10*10 SHEET 25 137 3,425
47286 NaN NaN NaN NaN NaN NaN NaN
47287 NaN NaN NaN NaN NaN NaN NaN
47288 NaN Total NaN NaN 607,734.60 669,300.49 9,953,816
47289 NaN Total NaN NaN 7,593,062.00 8,309,116.00 115,778,725

47290 rows × 9 columns
(the Gross, Disc and Voucher Amount columns are cut off at the right edge of the page)


In [3]:

sales_data_2018 = pd.read_csv(r'E:\Data Science by John\pandas\Sales Transactions-2018.csv')
sales_data_2018

Out[3]:

Date Voucher Party Product Qty Rate Gross
0 1/4/2018 Sal:146 TP13 SILVER POUCH 9*12 50 85 4,250.00
1 1/4/2018 Sal:146 TP13 RUBBER 5 290 1,450.00
2 1/4/2018 Sal:146 TP13 DURGA 10*12 Blue 1,600.00 5.5 8,800.00
3 1/4/2018 Sal:146 TP13 DURGA 13*16 BLUE 400 11 4,400.00
4 1/4/2018 Sal:146 TP13 10*12 SARAS-NAT 600 8.1 4,860.00
... ... ... ... ... ... ...
44735 31/03/2019 Sal:9610 HAMPI FOODS SPOON SOOFY 200 40 8,000.00
44736 NaN NaN NaN NaN NaN NaN NaN
44737 NaN NaN NaN NaN NaN NaN NaN
44738 NaN Total NaN NaN 666,056.00 1,067,808.80 10,796,991.30 29,9
44739 NaN Total NaN NaN 7,097,803.00 10,024,197.00 117,897,671.80 720,2

44740 rows × 9 columns
(the Disc and Voucher Amount columns are cut off at the right edge of the page)


In [4]:

sales_data_2019 = pd.read_csv(r'E:\Data Science by John\pandas\Sales Transactions-2019.csv')
sales_data_2019

Out[4]:

Date Voucher Party Product Qty Rate Gross
0 1/4/2019 Sal:687 BALAJI PLASTICS DONA-VAI-9100 1 1,730.00 1,730.00
1 1/4/2019 Sal:687 BALAJI PLASTICS SMART BOUL(48) 1 1,730.00 1,730.00
2 1/4/2019 Sal:688 BALAJI PLASTICS Vishnu Ice 110 18.5 2,035.00
3 28/3 0 0
4 1/4/2019 Sal:689 BALAJI PLASTICS 100LEAF-SP 3 585 1,755.00
... ... ... ... ... ... ...
19171 10/10/2019 Sal:4935 K.SRIHARI 13*16 WHITE RK 400 16 6,400.00
19172 NaN NaN NaN NaN NaN NaN NaN
19173 NaN NaN NaN NaN NaN NaN NaN
19174 NaN Total NaN NaN 99,284.90 175,381.65 2,203,649.50 20
19175 NaN Total NaN NaN 2,710,193.00 5,519,888.40 53,360,791.40 672

19176 rows × 9 columns
(the Disc and Voucher Amount columns are cut off at the right edge of the page)

In [5]:

import warnings
warnings.filterwarnings(action = 'ignore')


In [6]:

sales_complete_data = sales_data_2017.append([sales_data_2018,sales_data_2019])
sales_complete_data

Out[6]:

Date Voucher Party Product Qty Rate Gross
0 1/4/2017 Sal:1 SOLANKI PLASTICS DONA-VAI-9100 2 1,690.00 3,380.0
1 1/4/2017 Sal:1 SOLANKI PLASTICS LITE FOAM(1200) 6 1,620.00 9,720.0
2 1/4/2017 Sal:2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23 11,500.0
3 1/4/2017 Sal:2 SARNESWARA TRADERS LITE FOAM(1200) 6 1,620.00 9,720.0
4 1/4/2017 Sal:2 SARNESWARA TRADERS DONA-VAI-9100 5 1,690.00 8,450.0
... ... ... ... ... ... ...
19171 10/10/2019 Sal:4935 K.SRIHARI 13*16 WHITE RK 400 16 6,400.0
19172 NaN NaN NaN NaN NaN NaN NaN
19173 NaN NaN NaN NaN NaN NaN NaN
19174 NaN Total NaN NaN 99,284.90 175,381.65 2,203,649.5
19175 NaN Total NaN NaN 2,710,193.00 5,519,888.40 53,360,791.4

111206 rows × 9 columns
(the Gross, Disc and Voucher Amount columns are cut off at the right edge of the page)

In [7]:

sales_complete_data.shape

Out[7]:

(111206, 9)
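DataFrame.append is deprecated in newer pandas releases; pd.concat produces the same stacked result. A short sketch assuming the same three yearly frames:

sales_complete_data = pd.concat([sales_data_2017, sales_data_2018, sales_data_2019])   ## keeps each file's original index, like append did
sales_complete_data.shape   ## (111206, 9), matching the append result above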


In [8]:

sales_complete_data.head(20)

Out[8]:

Date Voucher Party Product Qty Rate Gross Disc Voucher Amount
0 1/4/2017 Sal:1 SOLANKI PLASTICS DONA-VAI-9100 2 1,690.00 3,380.00 NaN 13,100.00
1 1/4/2017 Sal:1 SOLANKI PLASTICS LITE FOAM(1200) 6 1,620.00 9,720.00 NaN NaN
2 1/4/2017 Sal:2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23 11,500.00 NaN 30,990.00
3 1/4/2017 Sal:2 SARNESWARA TRADERS LITE FOAM(1200) 6 1,620.00 9,720.00 NaN NaN
4 1/4/2017 Sal:2 SARNESWARA TRADERS DONA-VAI-9100 5 1,690.00 8,450.00 NaN NaN
5 1/4/2017 Sal:2 SARNESWARA TRADERS CLASSIC ENJOY(750) 1 1,320.00 1,320.00 NaN NaN
6 1/4/2017 Sal:898 Lock Vishnu 250ml 100 30 3,000.00 100 5,400.00
7 1/4/2017 Sal:898 Lock BLACK DOG-350ML 100 26 2,600.00 100 NaN
8 khader vali late en
9 try
10 1/4/2017 Sal:2497 VAMSI KRISHNA FANCY Loose Items 1 800 800 NaN 800
11 NaN NaN #NAME? NaN NaN NaN NaN NaN NaN
12 DUMMY ENTRY
13 1/4/2017 Sal:9263 VAMSI KRISHNA FANCY Loose Items 1 280 280 NaN 280
14 NaN NaN #NAME? NaN NaN NaN NaN NaN NaN
15 dummy entry
16 1/4/2017 Sal:9545 Vkp Loose Items 1 695 695 NaN 695
17 dummy entry
18 2/4/2017 Sal:16 KPR LITE FOAM(1200) 1 1,620.00 1,620.00 NaN 1,620.00
19 3/4/2017 Sal:3 BALAJI PLASTICS 90ML RANGEELA 150 14.5 2,175.00 NaN 2,175.00

(rows 8-9, 12, 15 and 17 are malformed source rows; their remaining cells are blank or cut off in the export)

In [9]:

sales_cleaned_data = pd.read_csv(r'E:\Data Science by John\pandas\Sales-Transactions-Edited.csv')
sales_cleaned_data

Out[9]:

Date Voucher Party Product Qty Rate

0 1/4/2017 1 SOLANKI PLASTICS DONA-VAI-9100 2 1690.0

1 1/4/2017 1 SOLANKI PLASTICS LITE FOAM(1200) 6 1620.0

2 1/4/2017 2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23.0

3 1/4/2017 2 SARNESWARA TRADERS LITE FOAM(1200) 6 1620.0

4 1/4/2017 2 SARNESWARA TRADERS DONA-VAI-9100 5 1690.0

... ... ... ... ... ... ...

95557 12/9/2019 4265 TP13 SPOON MED M.W 20 11.0

95558 12/9/2019 4266 K.SRIHARI SMART BOUL(48) 1 1830.0

95559 12/9/2019 4267 SMS SMARTBOUL GLA(4000) 1 1520.0

95560 12/9/2019 4268 ANILFANCY RR WINEGLASS 100 20.0

95561 12/9/2019 4268 ANILFANCY RR WATER GLASS 100 20.0

95562 rows × 6 columns

In [10]:

sales_cleaned_data.shape

Out[10]:

(95562, 6)

In [11]:

sales_complete_data.dtypes

Out[11]:

Date object

Voucher object

Party object

Product object

Qty object

Rate object

Gross object

Disc object

Voucher Amount object

dtype: object


In [12]:

sales_cleaned_data.dtypes

Out[12]:

Date object

Voucher int64

Party object

Product object

Qty int64

Rate float64

dtype: object

In [15]:

sales_complete_data.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 111206 entries, 0 to 19175

Data columns (total 9 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Date 98615 non-null object

1 Voucher 98649 non-null object

2 Party 111166 non-null object

3 Product 98615 non-null object

4 Qty 98649 non-null object

5 Rate 98648 non-null object

6 Gross 98648 non-null object

7 Disc 5597 non-null object

8 Voucher Amount 27560 non-null object

dtypes: object(9)

memory usage: 8.5+ MB

Step 1 - Detecting NaN Values

In [17]:

sales_complete_data.isna().sum()

Out[17]:

Date 12591

Voucher 12557

Party 40

Product 12591

Qty 12557

Rate 12558

Gross 12558

Disc 105609

Voucher Amount 83646

dtype: int64


In [18]:

sales_cleaned_data.isna().sum()

Out[18]:

Date 0

Voucher 0

Party 0

Product 0

Qty 0

Rate 1

dtype: int64
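Raw counts are easier to judge as percentages. A quick sketch on the uncombined frame above: Disc is roughly 95% empty and Voucher Amount roughly 75% empty, which is what justifies dropping those columns in the next step.

missing_pct = (sales_complete_data.isna().mean() * 100).round(2).sort_values(ascending=False)
missing_pct   ## share of missing values per column, largest first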

Step 2 - Remove all NaN values

In [19]:

sales_complete_data.drop(labels=["Gross","Disc","Voucher Amount"],axis = 1, inplace=True)

In [20]:

sales_complete_data

Out[20]:

Date Voucher Party Product Qty Rate
0 1/4/2017 Sal:1 SOLANKI PLASTICS DONA-VAI-9100 2 1,690.00
1 1/4/2017 Sal:1 SOLANKI PLASTICS LITE FOAM(1200) 6 1,620.00
2 1/4/2017 Sal:2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23
3 1/4/2017 Sal:2 SARNESWARA TRADERS LITE FOAM(1200) 6 1,620.00
4 1/4/2017 Sal:2 SARNESWARA TRADERS DONA-VAI-9100 5 1,690.00
... ... ... ... ... ...
19171 10/10/2019 Sal:4935 K.SRIHARI 13*16 WHITE RK 400 16
19172 NaN NaN NaN NaN NaN NaN
19173 NaN NaN NaN NaN NaN NaN
19174 NaN Total NaN NaN 99,284.90 175,381.65
19175 NaN Total NaN NaN 2,710,193.00 5,519,888.40

111206 rows × 6 columns


In [21]:

sales_complete_data.dropna(inplace = True)
sales_complete_data

Out[21]:

Date Voucher Party Product Qty Rate

0 1/4/2017 Sal:1 SOLANKI PLASTICS DONA-VAI-9100 2 1,690.00

1 1/4/2017 Sal:1 SOLANKI PLASTICS LITE FOAM(1200) 6 1,620.00

2 1/4/2017 Sal:2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23

3 1/4/2017 Sal:2 SARNESWARA TRADERS LITE FOAM(1200) 6 1,620.00

4 1/4/2017 Sal:2 SARNESWARA TRADERS DONA-VAI-9100 5 1,690.00

... ... ... ... ... ... ...

19167 10/10/2019 Sal:4935 K.SRIHARI 16*20(100-W) 140 26

19168 10/10/2019 Sal:4935 K.SRIHARI 10*12 KRISHNA-BK(10 600 8.4

19169 10/10/2019 Sal:4935 K.SRIHARI 13*16 Bk(100)KRISHN 320 16

19170 10/10/2019 Sal:4935 K.SRIHARI 10*12 RK 800 8.5

19171 10/10/2019 Sal:4935 K.SRIHARI 13*16 WHITE RK 400 16

98614 rows × 6 columns

In [22]:

sales_complete_data.shape

Out[22]:

(98614, 6)
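Dropping every row with any NaN is the bluntest option. Had that been too aggressive, dropna also takes subset and thresh arguments; a hedged sketch of the alternatives, written as they would be applied to the frame before the dropna above:

## keep rows only if the key business fields are present, tolerating a missing Rate
partial_clean = sales_complete_data.dropna(subset=["Date", "Voucher", "Party", "Product"])
## or keep any row that still has at least 4 non-null values out of the 6 columns
thresh_clean = sales_complete_data.dropna(thresh=4)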

In [23]:

print(sales_complete_data["Party"].unique())

['SOLANKI PLASTICS' 'SARNESWARA TRADERS' 'Lock' ... 'markfed -adurupalli'

'8/10 late entry' '10-Jul']


In [24]:

sales_complete_data.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 98614 entries, 0 to 19171

Data columns (total 6 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Date 98614 non-null object

1 Voucher 98614 non-null object

2 Party 98614 non-null object

3 Product 98614 non-null object

4 Qty 98614 non-null object

5 Rate 98614 non-null object

dtypes: object(6)

memory usage: 5.3+ MB

In [25]:

sales_complete_data.describe(include = "all" )

Out[25]:

Date Voucher Party Product Qty Rate

count 98614 98614 98614 98614 98614 98614

unique 836 10043 1835 867 512 1075

top TP13 100

freq 3053 3053 13056 3053 12528 3051

In [26]:

sales_complete_data["Party"].value_counts()

Out[26]:

TP13 13056

K.SRIHARI 2537

KPR 2354

SVP-BUCHHI 1620

HAMPI FOODS 1419

...

g.subharao 1

svr brandi 1

VS 1

SK.BABU 1

10-Jul 1

Name: Party, Length: 1835, dtype: int64


In [27]:

sales_complete_data["Voucher"]

Out[27]:

0 Sal:1

1 Sal:1

2 Sal:2

3 Sal:2

4 Sal:2

...

19167 Sal:4935

19168 Sal:4935

19169 Sal:4935

19170 Sal:4935

19171 Sal:4935

Name: Voucher, Length: 98614, dtype: object

In [28]:

sales_complete_data["Voucher"] = sales_complete_data["Voucher"].str.replace("Sal:","")
sales_complete_data

Out[28]:

Date Voucher Party Product Qty Rate

0 1/4/2017 1 SOLANKI PLASTICS DONA-VAI-9100 2 1,690.00

1 1/4/2017 1 SOLANKI PLASTICS LITE FOAM(1200) 6 1,620.00

2 1/4/2017 2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23

3 1/4/2017 2 SARNESWARA TRADERS LITE FOAM(1200) 6 1,620.00

4 1/4/2017 2 SARNESWARA TRADERS DONA-VAI-9100 5 1,690.00

... ... ... ... ... ... ...

19167 10/10/2019 4935 K.SRIHARI 16*20(100-W) 140 26

19168 10/10/2019 4935 K.SRIHARI 10*12 KRISHNA-BK(10 600 8.4

19169 10/10/2019 4935 K.SRIHARI 13*16 Bk(100)KRISHN 320 16

19170 10/10/2019 4935 K.SRIHARI 10*12 RK 800 8.5

19171 10/10/2019 4935 K.SRIHARI 13*16 WHITE RK 400 16

98614 rows × 6 columns

In [29]:

sales_complete_data.dtypes

Out[29]:

Date object

Voucher object

Party object

Product object

Qty object

Rate object

dtype: object


In [30]:

sales_cleaned_data

Out[30]:

Date Voucher Party Product Qty Rate

0 1/4/2017 1 SOLANKI PLASTICS DONA-VAI-9100 2 1690.0

1 1/4/2017 1 SOLANKI PLASTICS LITE FOAM(1200) 6 1620.0

2 1/4/2017 2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23.0

3 1/4/2017 2 SARNESWARA TRADERS LITE FOAM(1200) 6 1620.0

4 1/4/2017 2 SARNESWARA TRADERS DONA-VAI-9100 5 1690.0

... ... ... ... ... ... ...

95557 12/9/2019 4265 TP13 SPOON MED M.W 20 11.0

95558 12/9/2019 4266 K.SRIHARI SMART BOUL(48) 1 1830.0

95559 12/9/2019 4267 SMS SMARTBOUL GLA(4000) 1 1520.0

95560 12/9/2019 4268 ANILFANCY RR WINEGLASS 100 20.0

95561 12/9/2019 4268 ANILFANCY RR WATER GLASS 100 20.0

95562 rows × 6 columns

In [31]:

sales_complete_data.groupby(by="Party")
sales_complete_data

Out[31]:

Date Voucher Party Product Qty Rate

0 1/4/2017 1 SOLANKI PLASTICS DONA-VAI-9100 2 1,690.00

1 1/4/2017 1 SOLANKI PLASTICS LITE FOAM(1200) 6 1,620.00

2 1/4/2017 2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23

3 1/4/2017 2 SARNESWARA TRADERS LITE FOAM(1200) 6 1,620.00

4 1/4/2017 2 SARNESWARA TRADERS DONA-VAI-9100 5 1,690.00

... ... ... ... ... ... ...

19167 10/10/2019 4935 K.SRIHARI 16*20(100-W) 140 26

19168 10/10/2019 4935 K.SRIHARI 10*12 KRISHNA-BK(10 600 8.4

19169 10/10/2019 4935 K.SRIHARI 13*16 Bk(100)KRISHN 320 16

19170 10/10/2019 4935 K.SRIHARI 10*12 RK 800 8.5

19171 10/10/2019 4935 K.SRIHARI 13*16 WHITE RK 400 16

98614 rows × 6 columns
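The groupby call above only builds a GroupBy object; nothing is aggregated, so the displayed frame is unchanged. A sketch of an actual aggregation at this stage (Qty and Rate are still strings, so a row count per Party is the safest choice):

rows_per_party = sales_complete_data.groupby("Party").size().sort_values(ascending=False)
rows_per_party.head()   ## TP13 should top the list, as the value_counts above showed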


In [63]:

sales_complete_data.head()

Out[63]:

Date Voucher Party Product Qty Rate

0 1/4/2017 1 SOLANKI PLASTICS DONA-VAI-9100 2 1,690.00

1 1/4/2017 1 SOLANKI PLASTICS LITE FOAM(1200) 6 1,620.00

2 1/4/2017 2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23

3 1/4/2017 2 SARNESWARA TRADERS LITE FOAM(1200) 6 1,620.00

4 1/4/2017 2 SARNESWARA TRADERS DONA-VAI-9100 5 1,690.00

In [64]:

sales_complete_data.dtypes

Out[64]:

Date object

Voucher object

Party object

Product object

Qty object

Rate object

dtype: object

In [65]:

sales_cleaned_data.dtypes

Out[65]:

Date object

Voucher int64

Party object

Product object

Qty int64

Rate float64

dtype: object


In [66]:

sales_complete_data["Voucher"] = sales_complete_data["Voucher"].astype("int")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [66], in <cell line: 1>()
----> 1 sales_complete_data["Voucher"] = sales_complete_data["Voucher"].astype("int")

File ~\anaconda3\lib\site-packages\pandas\core\generic.py:5912, in NDFrame.astype(self, dtype, copy, errors)
-> 5912 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)

File ~\anaconda3\lib\site-packages\pandas\core\internals\managers.py:419, in BaseBlockManager.astype(self, dtype, copy, errors)
--> 419 return self.apply("astype", dtype=dtype, copy=copy, errors=errors)

File ~\anaconda3\lib\site-packages\pandas\core\internals\managers.py:304, in BaseBlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
--> 304 applied = getattr(b, f)(**kwargs)

File ~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py:580, in Block.astype(self, dtype, copy, errors)
--> 580 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)

File ~\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py:1292, in astype_array_safe(values, dtype, copy, errors)
-> 1292 new_values = astype_array(values, dtype, copy=copy)

File ~\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py:1237, in astype_array(values, dtype, copy)
-> 1237 values = astype_nansafe(values, dtype, copy=copy)

File ~\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py:1154, in astype_nansafe(arr, dtype, copy, skipna)
-> 1154 return lib.astype_intsafe(arr, dtype)

File ~\anaconda3\lib\site-packages\pandas\_libs\lib.pyx:668, in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: ' '
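The astype("int") call fails because some Voucher values are blank or whitespace-only strings, as the message "invalid literal for int() with base 10: ' '" shows. A hedged workaround sketch: pd.to_numeric with errors="coerce" turns unconvertible entries into NaN so they can be inspected and dropped instead of raising. Qty and Rate carry thousands separators such as "1,690.00", so they would additionally need the commas stripped before a similar conversion.

sales_complete_data["Voucher"] = pd.to_numeric(sales_complete_data["Voucher"].str.strip(), errors="coerce")
sales_complete_data[sales_complete_data["Voucher"].isna()]         ## inspect the rows that refused to convert
sales_complete_data = sales_complete_data.dropna(subset=["Voucher"]).copy()
sales_complete_data["Voucher"] = sales_complete_data["Voucher"].astype("int")

## Rate (and similarly Qty) needs its thousands separators removed first
sales_complete_data["Rate"] = pd.to_numeric(
    sales_complete_data["Rate"].str.replace(",", "", regex=False), errors="coerce")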

In [60]:

sales_complete_data.sort_values(by= 'Date',ascending= False)   ## Date is still a string here, so this sort is lexical, not chronological

Out[60]:

Date Voucher Party Product Qty Rate

16219 9/9/2019 4132 TP13 PP 16 153

16221 9/9/2019 4133 ATS TAMBULAM 7*10 90 12

16223 9/9/2019 4135 SEKHAR MARKET RR WATER GLASS 500 20

16224 9/9/2019 4136 SALESMAN SURESH BLACK DOG-350ML 80 22

16226 9/9/2019 4137 KARIMULLA-VGIRI CYCLE-BK-10*12 1,600.00 6.6

... ... ... ... ... ... ...

45206 credit bill

45203 my swami devastanam

45202 sri kodanda ramaswa

45142 credit bill

5947 gsr mallam

98614 rows × 6 columns


In [54]:

sales_complete_data.sort_values(by = "Voucher")

Out[54]:

Date Voucher Party Product Qty Rate

19145 INV 19

26053 khadervali

26065 credit bill

26142 directh autolo

26220 AUTULO

... ... ... ... ... ... ...

47108 31/03/2018 9998 KRISHNAPATNAM PORT GST AMOUNT 1 324

47110 31/03/2018 9999 PAPARAO GARUDA 13*16 BLUE 550 13

47111 31/03/2018 9999 PAPARAO 10*10 TEJA 25 128

47112 31/03/2018 9999 PAPARAO GARUDA 16*20 BLUE 60 24.5

47113 31/03/2018 9999 PAPARAO HIGH COUNT 16*20 BL 5 160

98614 rows × 6 columns


In [62]:

pd.to_datetime(sales_complete_data["Date"])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\pandas\core\arrays\datetimes.py:2211, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2210 try:
-> 2211     values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))

File ~\anaconda3\lib\site-packages\pandas\_libs\tslibs\conversion.pyx:360, in pandas._libs.tslibs.conversion.datetime_to_datetime64()

TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

ParserError                               Traceback (most recent call last)
Input In [62], in <cell line: 1>()
----> 1 pd.to_datetime(sales_complete_data["Date"])

File ~\anaconda3\lib\site-packages\pandas\core\tools\datetimes.py:1047, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
-> 1047 cache_array = _maybe_cache(arg, format, cache, convert_listlike)

File ~\anaconda3\lib\site-packages\pandas\core\tools\datetimes.py:197, in _maybe_cache(arg, format, cache, convert_listlike)
--> 197 cache_dates = convert_listlike(unique_dates, format)

File ~\anaconda3\lib\site-packages\pandas\core\tools\datetimes.py:402, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
--> 402 result, tz_parsed = objects_to_datetime64ns(
    403     arg,
    404     dayfirst=dayfirst,
    405     yearfirst=yearfirst,
    406     utc=utc,
    407     errors=errors,
    408     require_iso8601=require_iso8601,
    409     allow_object=True,
    410 )

File ~\anaconda3\lib\site-packages\pandas\core\arrays\datetimes.py:2217, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
-> 2217 raise err

File ~\anaconda3\lib\site-packages\pandas\core\arrays\datetimes.py:2199, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
-> 2199 result, tz_parsed = tslib.array_to_datetime(
    2200     data.ravel("K"),
    2201     errors=errors,
    2202     utc=utc,
    2203     dayfirst=dayfirst,
    2204     yearfirst=yearfirst,
    2205     require_iso8601=require_iso8601,
    2206     allow_mixed=allow_mixed,
    2207 )

File ~\anaconda3\lib\site-packages\pandas\_libs\tslib.pyx:381, in pandas._libs.tslib.array_to_datetime()
File ~\anaconda3\lib\site-packages\pandas\_libs\tslib.pyx:613, in pandas._libs.tslib.array_to_datetime()
File ~\anaconda3\lib\site-packages\pandas\_libs\tslib.pyx:751, in pandas._libs.tslib._array_to_datetime_object()
File ~\anaconda3\lib\site-packages\pandas\_libs\tslib.pyx:742, in pandas._libs.tslib._array_to_datetime_object()
File ~\anaconda3\lib\site-packages\pandas\_libs\tslibs\parsing.pyx:281, in pandas._libs.tslibs.parsing.parse_datetime_string()

File ~\anaconda3\lib\site-packages\dateutil\parser\_parser.py:1368, in parse(timestr, parserinfo, **kwargs)
-> 1368 return DEFAULTPARSER.parse(timestr, **kwargs)

File ~\anaconda3\lib\site-packages\dateutil\parser\_parser.py:646, in parser.parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
--> 646 raise ParserError("String does not contain a date: %s", timestr)

ParserError: String does not contain a date:
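pd.to_datetime raises here because the Date column still contains junk strings such as "credit bill" (visible in the sort_values output above). A hedged sketch of a tolerant conversion: errors="coerce" turns unparseable strings into NaT, and dayfirst=True is assumed because the dates are written day-first (1/4/2017, 31/03/2018).

dates = pd.to_datetime(sales_complete_data["Date"], dayfirst=True, errors="coerce")
sales_complete_data.loc[dates.isna(), "Date"].unique()     ## the junk date strings that failed to parse
sales_complete_data["Date"] = dates
sales_complete_data = sales_complete_data.dropna(subset=["Date"])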

In [ ]:

