Brief Summary
Delhivery is the largest and fastest-growing fully integrated logistics player in India by revenue in
Fiscal 2021. The company aims to build the operating system for commerce through a
combination of world-class infrastructure, high-quality logistics operations, and
cutting-edge engineering and technology capabilities.
The Data team builds intelligence and capabilities from this data that help
widen the gap in quality, efficiency, and profitability between the business and
its competitors.
Problem Statement
The company wants to understand and process the data coming out of data engineering
pipelines:
• Clean, sanitize and manipulate data to get useful features out of raw fields
• Make sense out of the raw data and help the data science team to build forecasting
models on it
Import Libraries
In [1]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import warnings
import random
from scipy import stats
from scipy.stats import levene
from scipy.stats import shapiro
from scipy.stats import anderson
from scipy.stats import pearsonr
from sklearn import preprocessing
import statsmodels.api as sm
warnings.filterwarnings("ignore")
%matplotlib inline
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
about:srcdoc Page 1 of 79
Delhivery (6) 19/08/22, 10:36 PM
Global Variables
In [2]:
DEFAULT_OPTIONS = { "mean": True, "mode": True, "title": True, "median": True
In [3]:
TITLE_FONT_WGT = "bold"
COLOR_PALETTE = sns.color_palette("ch:s=.25, rot=-.25")
FONTSIZE, FONTFAMILY, FONTWEIGHT = 12, "Comic Sans MS", "bold"
SMALL_FIGSIZE, SMALL_WIDE_FIGSIZE, MEDIUM_FIGSIZE, MEDIUM_TALL_FIGSIZE, LARGE_FI
Common Utilities
In [4]:
fontsize, fontfamily, fontweight = 12, "Comic Sans MS", "bold"
palette_color = sns.color_palette("ch:s=.25, rot=-.25")
In [5]:
def sort_values(df, ascending = False):
    return df.sort_values(ascending = ascending)
In [6]:
def missing_values(df):
    total_null_cnt = df.isnull().count()
    null_in_col = sort_values(df.isnull().sum())
    percent = sort_values(null_in_col / total_null_cnt * 100)
    print("Total records = ", df.shape[0])
    tab = pd.concat([null_in_col, percent.round(2)], axis = 1, keys = ['# of Missi
    return tab
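The missing-value summary can also be sketched in a self-contained form. The helper name `missing_summary` and the toy values below are illustrative assumptions, not part of the notebook; the idea is the same: count nulls per column and express them as a percentage of all rows.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = df.isna().sum()                       # nulls per column
    percent = (df.isna().mean() * 100).round(2)    # share of rows that are null
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

# Toy frame with hypothetical values (not taken from the dataset)
toy = pd.DataFrame({
    "source_name": ["A", None, "C", None],
    "destination_name": ["X", "Y", None, "Z"],
    "actual_time": [10.0, 20.0, 30.0, 40.0],
})
summary = missing_summary(toy)
print(summary)
```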
In [7]:
def univariate_analysis(df, col):
    fig, ax = plt.subplots(1, 3, figsize = LARGE_FIGSIZE)
    plt.suptitle(col, fontsize = FONTSIZE, fontweight = FONTWEIGHT)
    uni_histplot(df, col, ax[0], "transits_count", { "nolabel": True })
    bi_boxplot(df, "transits_count", col, ax[1], None, { "nolabel": True })
    bi_scatterplot(df, "transits_count", col, ax[2], None, "transits_count",
In [8]:
def get_outliers_range(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    outlier_left = q1 - 1.5 * iqr
    outlier_right = q3 + 1.5 * iqr
    return outlier_left, outlier_right
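A small worked example may help make the IQR rule concrete. This is a minimal sketch on a made-up series; the function name `iqr_fences` and the sample values are assumptions for illustration only.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with one obviously extreme value
sample = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])
low, high = iqr_fences(sample)

# Values outside the fences are treated as outliers
outliers = sample[(sample < low) | (sample > high)]
print(low, high, outliers.tolist())
```

Here q1 is 12.75 and q3 is 16.5, so the fences are 7.125 and 22.125 and only the value 95 falls outside them.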
In [9]:
def detect_outliers_group(df, col):
    copied_index = df.index
    base_parameters = ["route_type", "source_center", "destination_center", "trans
    col_data = df[base_parameters + [col]]
    outlier_left, outlier_right = get_outliers_range(col_data.groupby(base_paramet
    outliers_range = pd.merge(outlier_left, outlier_right, how = "left", left_on
    outliers_range.columns = ["outlier_left", "outlier_right"]
    outliers_range = outliers_range.reset_index()
    col_data = pd.merge(col_data, outliers_range, how = "left", left_on = base_pa
    return df[(df[col] < col_data["outlier_left"]) | (df[col] > col_data["outlier_
In [10]:
def remove_outliers_group(df, col):
    copied_index = df.index
    base_parameters = ["route_type", "source_center", "destination_center", "trans
    col_data = df[base_parameters + [col]]
    outlier_left, outlier_right = get_outliers_range(col_data.groupby(base_paramet
    outliers_range = pd.merge(outlier_left, outlier_right, how = "left", left_on
    outliers_range.columns = ["outlier_left", "outlier_right"]
    outliers_range = outliers_range.reset_index()
    col_data = pd.merge(col_data, outliers_range, how = "left", left_on = base_pa
    return df[(df[col] >= col_data["outlier_left"]) & (df[col] <= col_data["outlie
In [11]:
def handle_outliers_group(df, col):
    copied_index = df.index
    base_parameters = ["route_type", "source_center", "destination_center", "trans
    col_data = df[base_parameters + [col]]
    outlier_left, outlier_right = get_outliers_range(col_data.groupby(base_paramet
    outliers_range = pd.merge(outlier_left, outlier_right, how = "left", left_on
    outliers_range.columns = ["outlier_left", "outlier_right"]
    outliers_range = outliers_range.reset_index()
    for i in outliers_range.index:
        outliers_range.loc[i, "impute_data"] = random.randint(int(outliers_range
    col_data = pd.merge(col_data, outliers_range, how = "left", left_on = base_pa
    bool_mask = (col_data[col] < col_data["outlier_left"]) | (col_data[col] >
    col_data.loc[bool_mask, col] = col_data.loc[bool_mask, "impute_data"]
    df[col] = col_data[col].abs()
    df[col] = df[col].replace(0,0.1)
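The group-wise idea behind these helpers (fences computed per group rather than over the whole column) can be sketched with `groupby(...).transform`. This is an alternative formulation under stated assumptions, not the notebook's merge-based code: the function name `group_iqr_mask`, the single grouping key and the toy values are all hypothetical.

```python
import pandas as pd

def group_iqr_mask(df, group_cols, col, k=1.5):
    """Boolean mask marking rows outside their own group's IQR fences."""
    grouped = df.groupby(group_cols)[col]
    # transform broadcasts each group's quantile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)

# Hypothetical toy data: 300 is normal for FTL but extreme for Carting
toy = pd.DataFrame({
    "route_type": ["FTL"] * 5 + ["Carting"] * 5,
    "actual_time": [300, 310, 320, 330, 2000, 40, 42, 44, 46, 300],
})
mask = group_iqr_mask(toy, ["route_type"], "actual_time")
print(toy[mask])
```

Because the mask is aligned to the original index, it can be used directly to inspect, drop or overwrite the flagged rows.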
In [12]:
def detect_outliers(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    outlier_left = q1 - 1.5 * iqr
    outlier_right = q3 + 1.5 * iqr
    return df[ (df[col] <= outlier_left) | (df[col] >= outlier_right) ]
In [13]:
def remove_outliers(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    outlier_left = q1 - 1.5 * iqr
    outlier_right = q3 + 1.5 * iqr
    return df[ (df[col] > outlier_left) & (df[col] < outlier_right) ]
In [14]:
def set_title(axis, label):
    if (not label): return
    axis.set_title(label, fontweight = TITLE_FONT_WGT)
In [15]:
def set_legend(axis, options):
    if (not options): return
    axis.legend(options)
In [16]:
def axv_line(df, axis, label, options):
    if (not label): return
    value = options.get("value")
    color = options.get("color")
    line_style = options.get("line_style")
    axis.axvline(value, color = color, linestyle = line_style, label = label)
In [17]:
def add_meta_data(df, col, axis, title, options):
    set_title(axis, title)
    if (not options): return
    if (options.get("mean")): axv_line(df, axis, { "label": "Mean", "value":
    if (options.get("mode")): axv_line(df, axis, { "label": "Mode", "value":
    if (options.get("median")): axv_line(df, axis, { "label": "Median", "value"
    if (options.get("legend")): set_legend(axis, { "Mean": df[col].mean(), "Mode"
    if (options.get("xlabel")): axis.set_xlabel(options.get("xlabel"))
    if (options.get("ylabel")): axis.set_ylabel(options.get("ylabel"))
    if (options.get("rotate")):
        axis.set_xticklabels(axis.get_xticklabels(), rotation = 90)
    if (options.get("nolabel")):
        axis.set_xlabel(None)
        axis.set_ylabel(None)
In [18]:
def uni_distplot(df, col, axis, title, options = DEFAULT_OPTIONS):
    sns.distplot(df[col], ax = axis)
    add_meta_data(df, col, axis, title, options)
In [19]:
def uni_boxplot(df, col, axis, title):
    sns.boxplot(y = df[col], ax = axis)
    add_meta_data(df, col, axis, title, { "ylabel": col })
In [20]:
def uni_barplot(df, col, axis, options):
    df_count = df[col].value_counts()
    df_count.plot.bar(color = COLOR_PALETTE, ax = axis)
    add_meta_data(df, col, axis, None, options)
In [21]:
def uni_countplot(df, col, hue, axis):
    sns.countplot(data = df, x = col, hue = hue, palette = "Set2", ax = axis)
    add_meta_data(df, col, axis, col + " - " + hue + " based distribution", {
In [22]:
def uni_pieplot(df, col, axis, options):
    df_count = df[col].value_counts()
    df_count.plot.pie(colors = COLOR_PALETTE, autopct = '%.0f%%', ax = axis)
    add_meta_data(df, col, axis, None, options)
In [23]:
def uni_histplot(df, col, axis, title, options):
    data = df[col]
    if (options.get("log")): data = np.log(data)
    sns.histplot(data, bins = 50, kde = True, ax = axis)
    add_meta_data(df, None, axis, title, None)
In [24]:
def uni_qqplot(df, col, title, axis, options):
    data = df[col]
    if (options.get("log")): data = np.log(data)
    sm.qqplot(data, line = 's', ax = axis)
    add_meta_data(df, None, axis, title, None)
In [25]:
def bi_pointplot(df, xcol, ycol, hue, axis, title):
    sns.pointplot(x = df[xcol], y = df[ycol], hue = df[hue], ax = axis)
    add_meta_data(df, None, axis, title, { "xlabel": xcol, "ylabel": ycol })
In [26]:
def uni_scatterplot(df, xcol, title, axis):
    g = sns.scatterplot(data = df[xcol], ax = axis)
    add_meta_data(df, None, axis, title, None)
    g.set(xticklabels = [])
    g.set(xlabel = None)
In [27]:
def bi_boxplot(df, xcol, ycol, axis, hue, options):
    sns.boxplot(data = df, x = xcol, y = ycol, ax = axis, palette = "Paired",
    add_meta_data(df, None, axis, None, options)
In [28]:
def bi_scatterplot(df, xcol, ycol, axis, title, hue, options):
    sns.scatterplot(data = df, x = xcol, y = ycol, ax = axis, hue = hue)
    add_meta_data(df, None, axis, title, options)
In [29]:
def bi_pointplot(df, xcol, ycol, hue, axis, title):
    sns.pointplot(x = df[xcol], y = df[ycol], hue = df[hue], ax = axis)
    add_meta_data(df, None, axis, title, { "xlabel": xcol, "ylabel": ycol })
In [30]:
def bi_lineplot(df, xcol, ycol, hue, axis, title):
    if (hue): g = sns.lineplot(data = df, x = xcol, y = ycol, hue = hue, ax =
    else: g = sns.lineplot(data = df, x = xcol, y = ycol, ax = axis, markers=
    if (hue): title = title + " | " + hue
    add_meta_data(df, None, axis, title, None)
    g.set(xlabel = None)
    g.set(ylabel = None)
In [31]:
def heatmap(df, title, options):
    sns.color_palette("pastel")
    # sns.heatmap returns the Axes it drew on; pass it on so the title is set on a real axis
    axis = sns.heatmap(df, annot = True, vmin = -1, vmax = 1, cmap = "PiYG")
    add_meta_data(df, None, axis, title, options)
    plt.show()
In [32]:
def pairplot(df):
    sns.color_palette("pastel")
    sns.pairplot(df)
    plt.show()
In [33]:
def kdeplot(df, colname, title, axis, test = False):
    g = sns.kdeplot(df[colname], ax = axis)
    axis.axvline(df[colname].mean(), color = "r", linestyle = "--", label =
    if (title):
        axis.set_title(title, fontweight = fontweight)
    g.set(yticklabels=[])
    g.set(ylabel=None)
    if (not test):
        axis.axvline(df[colname].median(), color = "g", linestyle = "-", label
    else:
        axis.legend({ "Mean" : df[colname].mean(), "Median" : df[colname].median
In [34]:
def boxplot(df, colname, title, axis):
    sns.boxplot(y = df[colname], ax = axis)
    if (title):
        axis.set_title(title, fontweight = fontweight)
    axis.set_ylabel(colname, fontsize = fontsize, family = fontfamily)
In [35]:
def scatterplot(df, xcolname, ycolname, title, axis):
    sns.scatterplot(data = df, x = xcolname, y = ycolname, ax = axis)
    axis.set_title(title, fontweight = fontweight)
    axis.set_xlabel(None)
    axis.set_ylabel(None)
In [36]:
def scatterplotonecol(df, xcolname, title, axis):
    g = sns.scatterplot(data = df[xcolname], ax = axis)
    if (title):
        axis.set_title(title, fontweight = fontweight)
    g.set(xticklabels=[])
    g.set(xlabel=None)
In [37]:
def boxplot_bicol(df,colname1, colname2,axis):
    sns.boxplot(x = colname1,y = colname2, data = df,ax=axis,palette="Paired"
    axis.set_xlabel(colname1, fontweight="bold",fontsize=14,family = "Comic Sans
    axis.set_ylabel(colname2, fontweight="bold", fontsize=14,family = "Comic San
In [38]:
def pointplot(df,colname1,colname2,colname3,title,axis):
    sns.pointplot(x=df[colname1],y=df[colname2],hue=df[colname3],ax=axis)
    axis.set_xlabel(colname1)
    axis.set_ylabel(colname2)
    axis.set_title(title,fontweight="bold")
In [39]:
def countplot(df, xcolname, hcolname, axis):
    title = xcolname + " - " + hcolname + " based distribution"
    sns.countplot(data=df,x=xcolname, hue=hcolname, palette="Set2",ax=axis)
    axis.set_title(title,fontweight="bold")
    axis.set_xlabel(xcolname)
    axis.set_ylabel('count')
In [40]:
def histplot(df,title,axis):
    sns.histplot(df, bins = 50, kde = True, ax = axis)
    axis.set_title(title,fontweight="bold")
In [41]:
def qqplot(df,title,axis):
    sm.qqplot(df, line = 's', ax = axis)
    axis.set_title(title)
Column Profiling
Load data
In [42]:
delhivery_data = pd.read_csv("https://d2beiqkhq929f0.cloudfront.net/public_asset
delhivery_data.head()
   data      trip_creation_time          route_schedule_uuid                                route_type  trip_uuid                source_center
0  training  2018-09-20 02:35:36.476840  thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...  Carting     trip-153741093647649320  IND38812
1  training  2018-09-20 02:35:36.476840  thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...  Carting     trip-153741093647649320  IND38812
2  training  2018-09-20 02:35:36.476840  thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...  Carting     trip-153741093647649320  IND38812
3  training  2018-09-20 02:35:36.476840  thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...  Carting     trip-153741093647649320  IND38812
4  training  2018-09-20 02:35:36.476840  thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...  Carting     trip-153741093647649320  IND38812
5 rows × 24 columns
Out[44]:
(144867, 19)
In [45]:
delhivery_data.columns
In [46]:
delhivery_data.dtypes
Out[46]:
data object
trip_creation_time object
route_schedule_uuid object
route_type object
trip_uuid object
source_center object
source_name object
destination_center object
destination_name object
od_start_time object
od_end_time object
start_scan_to_end_scan float64
actual_distance_to_destination float64
actual_time float64
osrm_time float64
osrm_distance float64
segment_actual_time float64
segment_osrm_time float64
segment_osrm_distance float64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144867 entries, 0 to 144866
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 data 144867 non-null object
1 trip_creation_time 144867 non-null object
2 route_schedule_uuid 144867 non-null object
3 route_type 144867 non-null object
4 trip_uuid 144867 non-null object
5 source_center 144867 non-null object
6 source_name 144574 non-null object
7 destination_center 144867 non-null object
8 destination_name 144606 non-null object
9 od_start_time 144867 non-null object
10 od_end_time 144867 non-null object
11 start_scan_to_end_scan 144867 non-null float64
12 actual_distance_to_destination 144867 non-null float64
13 actual_time 144867 non-null float64
14 osrm_time 144867 non-null float64
15 osrm_distance 144867 non-null float64
16 segment_actual_time 144867 non-null float64
17 segment_osrm_time 144867 non-null float64
18 segment_osrm_distance 144867 non-null float64
dtypes: float64(8), object(11)
memory usage: 21.0+ MB
In [48]:
cols = delhivery_data.columns
num_cols = ["start_scan_to_end_scan", "actual_distance_to_destination", "actual_
cat_cols = ["data","route_type"]
dt_cols = ["trip_creation_time","od_start_time","od_end_time"]
#delhivery_data[cat_cols] = delhivery_data[cat_cols].astype("category")
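# pd.to_datetime parses each column; recent pandas versions reject the unit-less 'datetime64' dtype
delhivery_data[dt_cols] = delhivery_data[dt_cols].apply(pd.to_datetime)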
In [50]:
delhivery_data.describe(include='object').T
                     count   unique  top                                                freq
route_schedule_uuid  144867  1504    thanos::sroute:4029a8a2-6c74-4b7e-a6d8-f9e069f...  1812
In [51]:
delhivery_data.describe(include='all').T
                     count   unique  top                                                freq   first
trip_creation_time   144867  14817   2018-09-28 05:23:15.359220                         101    2018-09-12 00:00:16.535741
route_schedule_uuid  144867  1504    thanos::sroute:4029a8a2-6c74-4b7e-a6d8-f9e069f...  1812
trip_uuid            144867  14817   trip-153811219535896559                            101
source_name          144574  1498    Gurgaon_Bilaspur_HB (Haryana)                      23347
destination_name     144606  1468    Gurgaon_Bilaspur_HB (Haryana)                      15192
od_start_time        144867  26369   2018-09-21 18:37:09.322207                         81     2018-09-12 00:00:16.535741
od_end_time          144867  26369   2018-09-24 09:59:15.691618                         81     2018-09-12 00:50:10.814399
In [52]:
for col in delhivery_data.columns:
    l = len(col)
    if l < 7: print(col, "\t\t\t\t:", delhivery_data[col].nunique())
    elif l == 16: print(col, "\t\t:", delhivery_data[col].nunique())
    elif l < 16: print(col, "\t\t\t:", delhivery_data[col].nunique())
    elif l<30: print(col, "\t\t:", delhivery_data[col].nunique())
    else: print(col, "\t:", delhivery_data[col].nunique())
data : 2
trip_creation_time : 14817
route_schedule_uuid : 1504
route_type : 2
trip_uuid : 14817
source_center : 1508
source_name : 1498
destination_center : 1481
destination_name : 1468
od_start_time : 26369
od_end_time : 26369
start_scan_to_end_scan : 1915
actual_distance_to_destination : 144515
actual_time : 3182
osrm_time : 1531
osrm_distance : 138046
segment_actual_time : 747
segment_osrm_time : 214
segment_osrm_distance : 113799
Inference
No abnormalities found.
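The per-column unique counts above can also be produced without manual tab alignment. The following is a minimal sketch on a hypothetical toy frame, using `DataFrame.nunique`, which returns one count per column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the full dataset
toy = pd.DataFrame({
    "data": ["training", "test", "training"],
    "route_type": ["FTL", "FTL", "Carting"],
    "trip_uuid": ["t1", "t2", "t3"],
})

# One distinct-value count per column; to_string() prints it aligned
unique_counts = toy.nunique()
print(unique_counts.to_string())
```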
Out[54]:
training 104858
test 40009
Name: data, dtype: int64
In [55]:
delhivery_data["route_type"].value_counts().sort_values(ascending=False)
Out[55]:
FTL 99660
Carting 45207
Name: route_type, dtype: int64
Inference
• Around 50% of the trips cover a distance of more than 50 km. The maximum distance travelled
is 2186 km.
• The median actual delivery time is 149 mins, and the longest delivery took 6265 mins.
• The median delivery time calculated by the OSRM engine is 60 mins, and the longest
is 2032 mins.
• The median delivery distance calculated by the OSRM engine is 65 km, and the longest
is 2840 km.
• The median segment actual time is 147 mins, and the longest is 6230 mins.
• The median segment OSRM time is 65 mins, and the longest is 2564 mins.
• The median segment OSRM distance is 70 km, and the longest is 3523 km.
• The median trip time between the start and end scans is 280 mins, and the maximum trip time is
7898 mins.
• Carting is the most common route type, with around 8908 trips.
• Most orders originate in the state of Maharashtra.
• Most orders are delivered to the state of Maharashtra.
• Most orders are dispatched from the city of Mumbai in Maharashtra.
• Most orders are delivered to the city of Mumbai in Maharashtra. Hence the busiest
corridor within the busiest state is Mumbai to Mumbai (round trip).
• The average distance on the busiest corridor, Mumbai to Mumbai, is 14.62 km.
• The average time on the busiest corridor, Mumbai to Mumbai, is 55.29 minutes.
data 0 0.00
start_scan_to_end_scan 0 0.00
segment_osrm_time 0 0.00
segment_actual_time 0 0.00
osrm_distance 0 0.00
osrm_time 0 0.00
actual_time 0 0.00
actual_distance_to_destination 0 0.00
od_start_time 0 0.00
od_end_time 0 0.00
trip_creation_time 0 0.00
destination_center 0 0.00
source_center 0 0.00
trip_uuid 0 0.00
route_type 0 0.00
route_schedule_uuid 0 0.00
segment_osrm_distance 0 0.00
In [57]:
mask = False
mask = mask | delhivery_data["source_name"].isnull()
source_name_null_data = delhivery_data[mask]
source_name_null_data
        data      trip_creation_time          route_schedule_uuid                                route_type  trip_uuid
112     training  2018-09-25 08:53:04.377810  thanos::sroute:4460a38d-ab9b-484e-bd4e-f4201d0...  FTL         trip-153786558437756691
113     training  2018-09-25 08:53:04.377810  thanos::sroute:4460a38d-ab9b-484e-bd4e-f4201d0...  FTL         trip-153786558437756691
114     training  2018-09-25 08:53:04.377810  thanos::sroute:4460a38d-ab9b-484e-bd4e-f4201d0...  FTL         trip-153786558437756691
115     training  2018-09-25 08:53:04.377810  thanos::sroute:4460a38d-ab9b-484e-bd4e-f4201d0...  FTL         trip-153786558437756691
116     training  2018-09-25 08:53:04.377810  thanos::sroute:4460a38d-ab9b-484e-bd4e-f4201d0...  FTL         trip-153786558437756691
144484  test      2018-10-03 09:06:06.690094  thanos::sroute:cbef3b6a-79ea-4d5e-a215-b558a70...  FTL         trip-153855756668984584
144485  test      2018-10-03 09:06:06.690094  thanos::sroute:cbef3b6a-79ea-4d5e-a215-b558a70...  FTL         trip-153855756668984584
144486  test      2018-10-03 09:06:06.690094  thanos::sroute:cbef3b6a-79ea-4d5e-a215-b558a70...  FTL         trip-153855756668984584
144487  test      2018-10-03 09:06:06.690094  thanos::sroute:cbef3b6a-79ea-4d5e-a215-b558a70...  FTL         trip-153855756668984584
144488  test      2018-10-03 09:06:06.690094  thanos::sroute:cbef3b6a-79ea-4d5e-a215-b558a70...  FTL         trip-153855756668984584
In [58]:
mask = False
mask = mask | delhivery_data["destination_name"].isnull()
dest_name_null_data = delhivery_data[mask]
dest_name_null_data
        data      trip_creation_time          route_schedule_uuid                                route_type  trip_uuid
110     training  2018-09-25 08:53:04.377810  thanos::sroute:4460a38d-ab9b-484e-bd4e-f4201d0...  FTL         trip-153786558437756691
111     training  2018-09-25 08:53:04.377810  thanos::sroute:4460a38d-ab9b-484e-bd4e-f4201d0...  FTL         trip-153786558437756691
982     test      2018-10-01 20:56:18.155260  thanos::sroute:d0ebdacd-e09b-47d3-be77-c9c4a05...  FTL         trip-153842737815495661
983     test      2018-10-01 20:56:18.155260  thanos::sroute:d0ebdacd-e09b-47d3-be77-c9c4a05...  FTL         trip-153842737815495661
4882    training  2018-09-24 07:18:06.087341  thanos::sroute:2f43f11e-d3ba-4590-9355-82928e1...  FTL         trip-153777348608709328
144478  test      2018-10-03 09:06:06.690094  thanos::sroute:cbef3b6a-79ea-4d5e-a215-b558a70...  FTL         trip-153855756668984584
144479  test      2018-10-03 09:06:06.690094  thanos::sroute:cbef3b6a-79ea-4d5e-a215-b558a70...  FTL         trip-153855756668984584
144480  test      2018-10-03 09:06:06.690094  thanos::sroute:cbef3b6a-79ea-4d5e-a215-b558a70...  FTL         trip-153855756668984584
144481  test      2018-10-03 09:06:06.690094  thanos::sroute:cbef3b6a-79ea-4d5e-a215-b558a70...  FTL         trip-153855756668984584
144482  test      2018-10-03 09:06:06.690094  thanos::sroute:cbef3b6a-79ea-4d5e-a215-b558a70...  FTL         trip-153855756668984584
In [59]:
delhivery_data['source_name'].fillna(delhivery_data['source_center'],inplace
delhivery_data['destination_name'].fillna(delhivery_data['destination_center'
Inference
The source and destination name attributes have a small number of missing values.
Labels are also inconsistent in the source and destination name attributes.
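The fallback used for the missing names (fill a missing name with the corresponding center code) can be sketched as follows. The center codes and names below are made-up placeholders; the assignment form is used here so the result does not rely on in-place modification of a column selection.

```python
import pandas as pd

# Hypothetical placeholder codes and names, not taken from the dataset
toy = pd.DataFrame({
    "source_center": ["IND000000AAA", "IND000000AAB"],
    "source_name": ["City_Place_H (State)", None],
})

# Where the name is missing, fall back to the center code from the same row
toy["source_name"] = toy["source_name"].fillna(toy["source_center"])
print(toy)
```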
In [61]:
grp = df1.groupby(['data','trip_uuid', 'trip_creation_time','route_type','source
df2 = grp.agg({'od_start_time' : 'min', 'od_end_time' : 'max','actual_distance_t
'segment_actual_time':'sum','segment_osrm_time':'sum','segment_osrm_dis
df2.sort_values(by=['trip_uuid', 'trip_creation_time','od_start_time'],inplace
df2 = df2.reset_index()
df2
       data      trip_uuid                trip_creation_time          route_type  source_center  source_name
0      training  trip-153671041653548748  2018-09-12 00:00:16.535741  FTL         IND462022AAA
2      training  trip-153671042288605164  2018-09-12 00:00:22.886430  Carting     IND572101AAA
26367  test      trip-153861118270144424  2018-10-03 23:59:42.701692  FTL         IND583201AAA   Hospet
In [62]:
grp = df2.groupby(['data','trip_uuid', 'trip_creation_time','route_type'])
delhivery_data_v2 = grp.agg({'source_center':'first','source_name':'first',
'segment_actual_time':'sum','segment_osrm_time':'sum','segment_osrm_dis
delhivery_data_v2.columns = ["data","trip_uuid", "trip_creation_time", "route_ty
delhivery_data_v2
       data      trip_uuid                trip_creation_time          route_type  source_center  source_name
0      test      trip-153800653897073708  2018-09-27 00:02:18.970980  Carting     IND424006AAA
1      test      trip-153800654935210748  2018-09-27 00:02:29.352390  Carting     IND400072AAB
2      test      trip-153800658820968126  2018-09-27 00:03:08.209931  FTL         IND302014AAA   Jaipur_Hub
3      test      trip-153800659468028518  2018-09-27 00:03:14.680535  Carting     IND421302AAF
4      test      trip-153800661729668086  2018-09-27 00:03:37.296972  Carting     IND395009AAA
14814  training  trip-153800603160412602  2018-09-26 23:53:51.604388  FTL         IND424006AAA
Inference
We have reduced the number of rows from 144316 to just 14817. We now have 21
columns.
The delivery details of a package that were spread over multiple rows are now combined into a single row.
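The roll-up from segment rows to one row per trip can be illustrated on a tiny example. This is a minimal sketch with made-up values and only a few of the columns; it uses named aggregation, which is one of several equivalent ways to write the `agg` step.

```python
import pandas as pd

# Hypothetical segment-level rows: trip-a has three segments, trip-b has one
segments = pd.DataFrame({
    "trip_uuid": ["trip-a", "trip-a", "trip-a", "trip-b"],
    "route_type": ["FTL", "FTL", "FTL", "Carting"],
    "od_start_time": pd.to_datetime([
        "2018-09-12 00:00", "2018-09-12 02:00", "2018-09-12 05:00", "2018-09-12 01:00",
    ]),
    "od_end_time": pd.to_datetime([
        "2018-09-12 01:30", "2018-09-12 04:30", "2018-09-12 08:00", "2018-09-12 01:45",
    ]),
    "segment_actual_time": [10.0, 20.0, 15.0, 30.0],
})

# Earliest start, latest end and summed segment time for each trip
trips = segments.groupby(["trip_uuid", "route_type"], as_index=False).agg(
    od_start_time=("od_start_time", "min"),
    od_end_time=("od_end_time", "max"),
    segment_actual_time=("segment_actual_time", "sum"),
)
print(trips)
```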
Inference
No duplicates found.
Feature Engineering
Split and extract features out of Source Name
In [64]:
delhivery_data_v2['source_city'] = delhivery_data_v2['source_name'].str.split
delhivery_data_v2['source_place'] = delhivery_data_v2['source_name'].str.split
delhivery_data_v2['source_code'] = delhivery_data_v2['source_name'].str.split
delhivery_data_v2['source_state']= delhivery_data_v2['source_name'].str.split
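Since the split expressions above are cut off in the export, here is one possible way to derive the four features, assuming names follow the pattern `City_Place_Code (State)` as in `Gurgaon_Bilaspur_HB (Haryana)`. The helper name `split_center_name` and the second and third sample values are hypothetical.

```python
import pandas as pd

def split_center_name(names: pd.Series) -> pd.DataFrame:
    """Split 'City_Place_Code (State)' strings into four columns."""
    # Separate the part before the brackets from the state inside them
    extracted = names.str.extract(r"^(?P<name>.*?)\s*\((?P<state>[^)]*)\)\s*$")
    # Split the first part on "_" into at most three pieces; pad missing pieces with NaN
    parts = extracted["name"].str.split("_", n=2, expand=True).reindex(columns=range(3))
    parts.columns = ["city", "place", "code"]
    parts["state"] = extracted["state"]
    return parts

names = pd.Series([
    "Gurgaon_Bilaspur_HB (Haryana)",  # value that appears in the data
    "Cityname_Hub (Statename)",       # hypothetical name without a code part
    "IND000000AAA",                   # hypothetical center code used as a fallback name
])
features = split_center_name(names)
print(features)
```

Names that do not match the pattern, such as a bare center code, come out as missing in all four columns and would need separate handling.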
Extract features like month, year and day from Trip Creation
Time
In [66]:
delhivery_data_v2['trip_creation_year'] = delhivery_data_v2['trip_creation_time'
delhivery_data_v2['trip_creation_month'] = delhivery_data_v2['trip_creation_time
delhivery_data_v2['trip_creation_day']= delhivery_data_v2['trip_creation_time'
delhivery_data_v2["trip_creation_hour"] = delhivery_data_v2["trip_creation_time"
delhivery_data_v2["trip_creation_day"] = delhivery_data_v2["trip_creation_time"
delhivery_data_v2["trip_creation_week"] = delhivery_data_v2["trip_creation_time"
delhivery_data_v2["trip_creation_dayofweek"] = delhivery_data_v2["trip_creation_
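The right-hand sides above are truncated in the export. A minimal sketch of how such calendar features are commonly derived with the pandas `.dt` accessor is given below; the two timestamps are trip creation times that appear in the outputs above, and the use of `isocalendar()` for the week number is an assumption about the intended definition.

```python
import pandas as pd

trips = pd.DataFrame({
    "trip_creation_time": pd.to_datetime([
        "2018-09-12 00:00:16.535741",
        "2018-10-03 23:59:42.701692",
    ])
})

ts = trips["trip_creation_time"]
trips["trip_creation_year"] = ts.dt.year
trips["trip_creation_month"] = ts.dt.month
trips["trip_creation_day"] = ts.dt.day
trips["trip_creation_hour"] = ts.dt.hour
trips["trip_creation_week"] = ts.dt.isocalendar().week   # ISO week number
trips["trip_creation_dayofweek"] = ts.dt.dayofweek       # Monday = 0
print(trips.T)
```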
In [68]:
dummies = pd.get_dummies(delhivery_data_v2.route_type,drop_first = True)
In [69]:
delhivery_data_v2 = pd.concat([delhivery_data_v2,dummies],axis=1)
In [71]:
delhivery_data_v2 = pd.concat([delhivery_data_v2,dummies],axis=1)
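The encoding step can be tried on a toy column to see what `drop_first = True` leaves behind. This is a minimal sketch with made-up rows: with the two route types Carting and FTL, the first category (Carting) is dropped and a single `FTL` indicator column remains.

```python
import pandas as pd

# Hypothetical toy rows
trips = pd.DataFrame({"route_type": ["FTL", "Carting", "FTL"]})

# Categories are ordered alphabetically, so "Carting" is the dropped baseline
dummies = pd.get_dummies(trips["route_type"], drop_first=True)
encoded = pd.concat([trips, dummies], axis=1)
print(encoded)
```

Concatenating once is enough; repeating the concatenation would append a second column with the same name.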
Out[72]:
['start_scan_to_end_scan',
'actual_distance_to_destination',
'actual_time',
'osrm_time',
'osrm_distance',
'segment_actual_time',
'segment_osrm_time',
'segment_osrm_distance']
In [74]:
delhivery_data_v2[delhivery_data_v2['start_scan_to_end_scan']<0]
0 rows × 36 columns
In [75]:
detect_outliers_group(delhivery_data_v2, col)
       data      trip_uuid                trip_creation_time          route_type  source_center  source_name
69     test      trip-153801189681302558  2018-09-27 01:31:36.813300  Carting     IND362001AAA
101    test      trip-153801517906891689  2018-09-27 02:26:19.069179  Carting     IND110037AAM   Delhi_Airport
14638  training  trip-153799260763324327  2018-09-26 20:10:07.633497  Carting     IND403726AAB   Goa_ZuariNg
14687  training  trip-153799639329945683  2018-09-26 21:13:13.299701  Carting     IND110044AAB   Del_Okhla_P
In [76]:
delhivery_data_v2[delhivery_data_v2['start_scan_to_end_scan']<0]
0 rows × 36 columns
In [77]:
handle_outliers_group(delhivery_data_v2, col)
In [78]:
delhivery_data_v2[delhivery_data_v2['start_scan_to_end_scan']<0]
0 rows × 36 columns
In [79]:
univariate_analysis(delhivery_data_v2, col)
Inference
• The average time taken to deliver from source to destination is relatively higher where
the number of transits is high.
• Most one-way parcels are likely to have the fewest transits.
• The start scan to end scan attribute is right-skewed, particularly where the
number of transits is less than three.
In [81]:
detect_outliers_group(delhivery_data_v2, col)
       data      trip_uuid                trip_creation_time          route_type  source_center  source_name
2      test      trip-153800658820968126  2018-09-27 00:03:08.209931  FTL         IND302014AAA   Jaipur_Hub
4      test      trip-153800661729668086  2018-09-27 00:03:37.296972  Carting     IND395009AAA
27     test      trip-153800774414397735  2018-09-27 00:22:24.144231  Carting     IND000000AFT
14751  training  trip-153800194935773399  2018-09-26 22:45:49.357982  Carting     IND110037AAK   Delhi_Kapshe
14781  training  trip-153800419141788066  2018-09-26 23:23:11.418113  Carting     IND400037AAA
In [82]:
handle_outliers_group(delhivery_data_v2, col)
In [83]:
univariate_analysis(delhivery_data_v2, col)
Inference
In [85]:
detect_outliers_group(delhivery_data_v2, col)
       data      trip_uuid                trip_creation_time          route_type  source_center
31     test      trip-153800813417374126  2018-09-27 00:28:54.174126  Carting     IND712310AAE
33     test      trip-153800830425161914  2018-09-27 00:31:44.251864  Carting     IND395023AAD
14751  training  trip-153800194935773399  2018-09-26 22:45:49.357982  Carting     IND110037AAK
In [86]:
handle_outliers_group(delhivery_data_v2, col)
In [87]:
univariate_analysis(delhivery_data_v2, col)
Inference
In [89]:
detect_outliers_group(delhivery_data_v2, col)
       data      trip_uuid                trip_creation_time          route_type  source_center  source_name
14754  training  trip-153800218813854926  2018-09-26 22:49:48.138801  FTL         IND361001AAA   Jamnagar_
14781  training  trip-153800419141788066  2018-09-26 23:23:11.418113  Carting     IND400037AAA
In [90]:
handle_outliers_group(delhivery_data_v2, col)
In [91]:
univariate_analysis(delhivery_data_v2, col)
Inference
In [93]:
detect_outliers_group(delhivery_data_v2, col)
       data      trip_uuid                trip_creation_time          route_type  source_center  source_name
21     test      trip-153800750141776702  2018-09-27 00:18:21.418023  Carting     IND395023AAD
57     test      trip-153801047876051786  2018-09-27 01:07:58.760784  FTL         IND282002AAD
14754  training  trip-153800218813854926  2018-09-26 22:49:48.138801  FTL         IND361001AAA   Jamnagar_Dc
14779  training  trip-153800389194808105  2018-09-26 23:18:11.948320  Carting     IND395023AAD
In [94]:
handle_outliers_group(delhivery_data_v2, col)
In [95]:
univariate_analysis(delhivery_data_v2, col)
Inference
In [97]:
detect_outliers_group(delhivery_data_v2, col)
       data      trip_uuid                trip_creation_time          route_type  source_center  source_name
33     test      trip-153800830425161914  2018-09-27 00:31:44.251864  Carting     IND395023AAD
67     test      trip-153801182428197910  2018-09-27 01:30:24.282255  Carting     IND396191AAC   Vapi_IndEsta
14751  training  trip-153800194935773399  2018-09-26 22:45:49.357982  Carting     IND110037AAK   Delhi_Kapshe
14811  training  trip-153800571655403292  2018-09-26 23:48:36.554285  Carting     IND201307AAA
In [98]:
handle_outliers_group(delhivery_data_v2, col)
In [99]:
univariate_analysis(delhivery_data_v2, col)
Inference
In [101…
detect_outliers_group(delhivery_data_v2, col)
In [102…
handle_outliers_group(delhivery_data_v2, col)
In [103…
univariate_analysis(delhivery_data_v2, col)
Inference
In [105…
detect_outliers_group(delhivery_data_v2, col)
       data      trip_uuid                trip_creation_time          route_type  source_center  source_name
210    test      trip-153802898973535963  2018-09-27 06:16:29.735723  Carting     IND500008AAC
473    test      trip-153807813360229446  2018-09-27 19:55:33.602541  Carting     IND400072AAB   Mumbai Hub
14306  training  trip-153793840860911258  2018-09-26 05:06:48.609360  Carting     IND411033AAA
14338  training  trip-153794472414216798  2018-09-26 06:52:04.142401  Carting     IND500008AAC
14468  training  trip-153797075209653066  2018-09-26 14:05:52.096792  FTL         IND411033AAA
In [106…
handle_outliers_group(delhivery_data_v2, col)
In [107…
univariate_analysis(delhivery_data_v2, col)
Inference
Inference
In [109…
sourcestate10 = delhivery_data_v2["source_state"].value_counts()[0:10]
destinationstate10 = delhivery_data_v2["destination_state"].value_counts()[
fig, ax = plt.subplots(1,2,figsize=(16,5))
Inference
Haryana, Maharashtra and Karnataka are the most popular source and destination states.
Inference
The trips are recorded only for the months of September and October. The recording
perhaps stopped after that. So we do not analyse further on the basis of month.
Inference
So we see that the maximum number of trips happen on Wednesday and the minimum on
Sunday.
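A day-of-week tally of this kind can be sketched as follows. The dates are a made-up toy sample, so the counts only illustrate the mechanics and say nothing about the real distribution.

```python
import pandas as pd

# Hypothetical creation dates: three Wednesdays and one Sunday
created = pd.Series(pd.to_datetime([
    "2018-09-12", "2018-09-12", "2018-09-16", "2018-09-19",
]))

# Map each timestamp to its weekday name and count occurrences
day_counts = created.dt.day_name().value_counts()
print(day_counts)
print("busiest:", day_counts.idxmax(), "| quietest:", day_counts.idxmin())
```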
In [112…
sns.distplot(delhivery_data_v2["trip_creation_hour"])
plt.title("Distribution of Trip Hour")
plt.show()
Inference
So, we observe a roughly bimodal distribution, with the fewest trips occurring during the day
hours (8 AM to 1 PM) and the most occurring during late night or early morning hours (8
PM to 2 AM).
In [114…
for col in num_cols:
    fig, ax = plt.subplots(1, 2, figsize = (18, 4))
    plt.suptitle(col, fontsize = fontsize, fontweight = fontweight)
    pointplot(delhivery_data_v2,"data",col,"route_type",'',ax[0])
    pointplot(delhivery_data_v2,"route_type",col,"data",'',ax[1])
Inference
The time taken by full truck load (FTL) deliveries is, on average, much higher
(>300 hours) than that of carting deliveries (<100 hours).
FTL deliveries also cover much longer distances on average (>150 km)
than carting deliveries (~25 km).
Time and distance follow similar trends against the hour of the day. The longest
deliveries, in both time and distance, are likely to be made during the peak morning
hours of 10 AM to 12 PM as well as at 5 PM, 7 PM and 1 AM.
In [115…
plt.figure(figsize = (8, 5))
sns.color_palette("pastel")
sns.heatmap(delhivery_data_v2[num_cols].corr(), annot=True, vmin=-1, vmax =
plt.show()
Inference
Certain fields are highly correlated:
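One way to enumerate highly correlated pairs from a correlation matrix is sketched here on toy data. The helper name `highly_correlated_pairs`, the 0.9 threshold and the columns `a`, `b`, `c` are all assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],   # roughly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # no clear linear relation to a or b
})
strong = highly_correlated_pairs(toy)
print(strong)
```

Applied to `delhivery_data_v2[num_cols]`, a helper like this would list the same pairs that stand out in the heatmap.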
In [116…
sns.color_palette("pastel")
sns.pairplot(delhivery_data_v2[num_cols])
plt.show()
Inference
All the numerical attributes appear to be linearly related to each other.
countplot(delhivery_data_v2,"data","route_type",ax[0])
countplot(delhivery_data_v2,"route_type","data", ax[1])
plt.show()
for e in st:
    percx.append(top3s[(top3s['source_state']==e)&(top3s["route_type"]=="Carting"
for e in st:
    percx.append(top3s[(top3s['source_state']==e)&(top3s["route_type"]=="FTL"
i=0
for p in g.patches:
    txt = str((round(percx[i]*100))) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    g.text(txt_x+0.1,txt_y,txt)
    i+=1
plt.show()
Inference
So we see that for top 3 source states,
for e in st:
    percx.append(top3s[(top3s['destination_state']==e)&(top3s["route_type"]==
for e in st:
    percx.append(top3s[(top3s['destination_state']==e)&(top3s["route_type"]==
i=0
for p in g.patches:
    txt = str((round(percx[i]*100))) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    g.text(txt_x+0.1,txt_y,txt)
    i+=1
plt.show()
Inference
So we see that for top 3 destination states,
In [120…
fig , ax = plt.subplots(2,2,figsize=(20,12))
In [121…
fig , ax = plt.subplots(2,2,figsize=(20,12))
In [122…
fitted_od_time_taken,lmbda = stats.boxcox(delhivery_data_v2['od_time_taken'
fitted_start_scan_to_end_scan,lmbda = stats.boxcox(abs(delhivery_data_v2['start_
fig , ax = plt.subplots(2,2,figsize=(20,12))
In [123…
anderson(fitted_od_time_taken)
Out[123…
AndersonResult(statistic=28.82281265821257, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
In [124…
anderson(fitted_start_scan_to_end_scan)
Out[124…
AndersonResult(statistic=23.295076091240844, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
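`scipy.stats.anderson` reports a test statistic and a table of critical values rather than a p-value: normality is rejected at a given significance level when the statistic exceeds the matching critical value. A small sketch of that comparison, using the numbers from the two results above; the helper name `rejected_levels` is hypothetical.

```python
# Values copied from the two AndersonResult outputs.
critical_values = [0.576, 0.656, 0.787, 0.918, 1.092]
significance_levels = [15.0, 10.0, 5.0, 2.5, 1.0]  # percent

def rejected_levels(statistic, critical_values, significance_levels):
    """Significance levels (in %) at which the normality hypothesis is rejected."""
    return [
        level
        for crit, level in zip(critical_values, significance_levels)
        if statistic > crit
    ]

for name, stat in [
    ("od_time_taken (Box-Cox)", 28.82281265821257),
    ("start_scan_to_end_scan (Box-Cox)", 23.295076091240844),
]:
    levels = rejected_levels(stat, critical_values, significance_levels)
    print(name, "-> normality rejected at (%):", levels)
```

Both statistics are far above the largest critical value (1.092), so even after the Box-Cox transform neither attribute passes as normally distributed at any of the listed levels.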
Inference
Since the p-value of this test is below 0.05, we reject the null
hypothesis and conclude that od_start_time / od_end_time and start_scan_to_end_scan
are dependent on each other.
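The cell that produced this p-value is not shown here. As an illustration only, a rank-based test such as Spearman's correlation is one option when normality has been rejected; its null hypothesis is that the two variables have no monotonic association. The sketch below runs on synthetic data, and the choice of `spearmanr` is an assumption, not necessarily the test the notebook ran.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-ins for the two duration columns (not real data).
rng = np.random.default_rng(0)
od_time_taken = rng.lognormal(mean=6.0, sigma=0.8, size=500)
start_scan_to_end_scan = od_time_taken * rng.uniform(0.9, 1.0, size=500)

# Spearman's rho works on ranks, so it does not assume normality.
rho, p_value = stats.spearmanr(od_time_taken, start_scan_to_end_scan)

alpha = 0.05
if p_value < alpha:
    print(f"rho={rho:.3f}, p={p_value:.3g}: reject H0 (no monotonic association)")
else:
    print(f"rho={rho:.3f}, p={p_value:.3g}: fail to reject H0")
```

With as many rows as this dataset has, even weak associations give very small p-values, so the size of the coefficient is worth reporting alongside the test decision.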
Visual Analysis
In [127…
sns.distplot(delhivery_data_v2["start_scan_to_end_scan"], label="start_scan_to_e
sns.distplot(delhivery_data_v2["od_time_taken"], label="od_time_taken")
plt.legend()
plt.show()
In [128…
sns.scatterplot(data = delhivery_data_v2, x = 'od_time_taken', y = 'start_scan_t
plt.show()
In [129…
sns.pointplot(data = delhivery_data_v2, x = 'od_time_taken', y = 'start_scan_to_
Out[129…
<matplotlib.axes._subplots.AxesSubplot at 0x7f83db6eaad0>
In [130…
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(delhivery_data_v2['actual_time'],"Actual Time",ax[0][0])
qqplot(delhivery_data_v2['actual_time'], "qqplot for Actual Time", ax[0][1])
histplot(delhivery_data_v2['osrm_time'],"OSRM Time",ax[1][0])
qqplot(delhivery_data_v2['osrm_time'], "qqplot for OSRM Time", ax[1][1])
In [131…
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(np.log(delhivery_data_v2['actual_time']),"Actual Time",ax[0][0])
qqplot(np.log(delhivery_data_v2['actual_time']), "qqplot for Actual Time",
histplot(np.log(delhivery_data_v2['osrm_time']),"OSRM time",ax[1][0])
qqplot(np.log(delhivery_data_v2['osrm_time']), "qqplot for OSRM Time", ax[1
In [132…
fitted_actual_time,lmbda = stats.boxcox(delhivery_data_v2['actual_time'])
fitted_osrm_time,lmbda = stats.boxcox(delhivery_data_v2['osrm_time'])
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(fitted_actual_time,"Actual Time",ax[0][0])
qqplot(fitted_actual_time, "qqplot for Actual Time", ax[0][1])
histplot(fitted_osrm_time,"OSRM time",ax[1][0])
qqplot(fitted_osrm_time, "qqplot for OSRM time", ax[1][1])
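Two details of `stats.boxcox` matter when reading cells like this one: it only accepts strictly positive data, and it returns the fitted lambda alongside the transformed values, so reusing one variable name for both calls keeps only the last lambda. A sketch on synthetic data, with hypothetical variable names:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Synthetic right-skewed, strictly positive sample (not real trip data).
rng = np.random.default_rng(42)
actual_time = rng.lognormal(mean=5.0, sigma=1.0, size=2000)

# Box-Cox is undefined for zero or negative values, so check first.
assert (actual_time > 0).all()

# With lmbda omitted, boxcox returns the transformed data and the fitted lambda.
fitted_actual_time, lmbda_actual_time = stats.boxcox(actual_time)

print("lambda:", round(lmbda_actual_time, 3))
print("skew before:", round(stats.skew(actual_time), 3))
print("skew after: ", round(stats.skew(fitted_actual_time), 3))

# Keeping the lambda allows mapping transformed values back to the original scale.
restored = inv_boxcox(fitted_actual_time, lmbda_actual_time)
```

Storing each lambda under its own name (for example one per column) keeps the transform reversible for every attribute.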
In [133…
anderson(fitted_actual_time)
Out[133…
AndersonResult(statistic=29.222329256357625, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
In [134…
anderson(fitted_osrm_time)
Out[134…
AndersonResult(statistic=12.061248831811099, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
Inference
Since the p-value of this test is below 0.05, we reject the null
hypothesis and conclude that actual time and OSRM time are dependent on
each other.
Visual Analysis
In [137…
sns.distplot(delhivery_data_v2["actual_time"], label="actual_time")
sns.distplot(delhivery_data_v2["osrm_time"], label="osrm_time")
plt.legend()
plt.show()
In [138…
sns.scatterplot(data = delhivery_data_v2, x = 'actual_time', y = 'osrm_time'
plt.show()
In [139…
sns.pointplot(data = delhivery_data_v2, x = 'actual_time', y = 'osrm_time')
Out[139…
<matplotlib.axes._subplots.AxesSubplot at 0x7f83bf405d10>
In [140…
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(delhivery_data_v2['actual_time'],"Actual Time",ax[0][0])
qqplot(delhivery_data_v2['actual_time'], "qqplot for Actual Time", ax[0][1])
In [141…
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(np.log(abs(delhivery_data_v2['actual_time'])),"Actual Time",ax[0][
qqplot(np.log(abs(delhivery_data_v2['actual_time'])), "qqplot for Actual Time"
histplot(np.log(abs(delhivery_data_v2['segment_actual_time'])),"Segment Actual T
qqplot(np.log(abs(delhivery_data_v2['segment_actual_time'])), "qqplot for Segmen
In [142…
fitted_actual_time,lmbda = stats.boxcox(delhivery_data_v2['actual_time'])
fitted_segment_actual_time,lmbda = stats.boxcox(delhivery_data_v2['segment_actua
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(fitted_actual_time,"Actual Time",ax[0][0])
qqplot(fitted_actual_time, "qqplot for Actual Time", ax[0][1])
In [143…
anderson(fitted_actual_time)
Out[143…
AndersonResult(statistic=29.222329256357625, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
In [144…
anderson(fitted_segment_actual_time)
Out[144…
AndersonResult(statistic=39.9044356195227, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
Inference
Since the p-value of this test is below 0.05, we reject the null
hypothesis and conclude that actual time and segment actual time are
dependent on each other.
Visual Analysis
In [147…
sns.distplot(delhivery_data_v2["actual_time"], label="actual_time")
sns.distplot(delhivery_data_v2["segment_actual_time"], label="segment_actual_tim
plt.legend()
plt.show()
In [148…
sns.scatterplot(data = delhivery_data_v2, x = 'actual_time', y = 'segment_actual
plt.show()
In [149…
sns.pointplot(data = delhivery_data_v2, x = 'actual_time', y = 'segment_actual_t
Out[149…
<matplotlib.axes._subplots.AxesSubplot at 0x7f83bc801b50>
In [150…
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(delhivery_data_v2['osrm_distance'],"OSRM Distance",ax[0][0])
qqplot(delhivery_data_v2['osrm_distance'], "qqplot for OSRM Distance", ax[0
In [151…
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(np.log(delhivery_data_v2['osrm_distance']),"OSRM Distance",ax[0][0
qqplot(np.log(delhivery_data_v2['osrm_distance']), "qqplot for OSRM Distance"
In [152…
fitted_osrm_distance,lmbda = stats.boxcox(delhivery_data_v2['osrm_distance'
fitted_segment_osrm_distance,lmbda = stats.boxcox(delhivery_data_v2['segment_osr
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(fitted_osrm_distance,"OSRM Distance",ax[0][0])
qqplot(fitted_osrm_distance, "qqplot for OSRM Distance", ax[0][1])
In [153…
anderson(fitted_osrm_distance)
Out[153…
AndersonResult(statistic=19.994770089560916, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
In [154…
anderson(fitted_segment_osrm_distance)
Out[154…
AndersonResult(statistic=70.3080027928263, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
Inference
Since the p-value of this test is below 0.05, we reject the null
hypothesis and conclude that OSRM distance and segment OSRM distance are
dependent on each other.
Visual Analysis
In [157…
sns.distplot(delhivery_data_v2["osrm_distance"], label="osrm_distance")
sns.distplot(delhivery_data_v2["segment_osrm_distance"], label="segment_osrm_dis
plt.legend()
plt.show()
In [158…
sns.scatterplot(data = delhivery_data_v2, x = 'osrm_distance', y = 'segment_osrm
plt.show()
In [159…
sns.pointplot(data = delhivery_data_v2, x = 'osrm_distance', y = 'segment_osrm_d
Out[159…
<matplotlib.axes._subplots.AxesSubplot at 0x7f83b8761250>
In [160…
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(delhivery_data_v2['osrm_time'],"OSRM Time",ax[0][0])
qqplot(delhivery_data_v2['osrm_time'], "qqplot for OSRM Time", ax[0][1])
In [161…
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(np.log(delhivery_data_v2['osrm_time']),"OSRM Time",ax[0][0])
qqplot(np.log(delhivery_data_v2['osrm_time']), "qqplot for OSRM Time", ax[0
In [162…
fitted_osrm_time,lmbda = stats.boxcox(delhivery_data_v2['osrm_time'])
fitted_segment_osrm_time,lmbda = stats.boxcox(delhivery_data_v2['segment_osrm_ti
fig , ax = plt.subplots(2,2,figsize=(20,12))
histplot(fitted_osrm_time,"OSRM Time",ax[0][0])
qqplot(fitted_osrm_time, "qqplot for OSRM Time", ax[0][1])
In [163…
anderson(fitted_osrm_time)
Out[163…
AndersonResult(statistic=12.061248831811099, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
In [164…
anderson(fitted_segment_osrm_time)
Out[164…
AndersonResult(statistic=55.687229280645624, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
Inference
Since the p-value of this test is below 0.05, we reject the null
hypothesis and conclude that OSRM time and segment OSRM time are
dependent on each other.
Visual Analysis
In [167…
sns.distplot(delhivery_data_v2["osrm_time"], label="osrm_time")
sns.distplot(delhivery_data_v2["segment_osrm_time"], label="segment_osrm_time"
plt.legend()
plt.show()
In [168…
sns.scatterplot(data = delhivery_data_v2, x = 'osrm_time', y = 'segment_osrm_tim
plt.show()
In [169…
sns.pointplot(data = delhivery_data_v2, x = 'osrm_time', y = 'segment_osrm_time'
Out[169…
<matplotlib.axes._subplots.AxesSubplot at 0x7f83a6036290>
Standard Scaling
In [171…
scaler = preprocessing.StandardScaler()
standard_df = scaler.fit_transform(delhivery_data_v2[num_cols])
delhivery_data_v4 = pd.DataFrame(standard_df, columns=num_cols)
Min-Max Scaling
In [172…
scaler = preprocessing.MinMaxScaler()
minmax_df = scaler.fit_transform(delhivery_data_v2[num_cols])
delhivery_data_v5 = pd.DataFrame(minmax_df, columns=num_cols)
In [173…
num_cols
Out[173…
['od_time_taken',
'start_scan_to_end_scan',
'actual_distance_to_destination',
'actual_time',
'osrm_time',
'osrm_distance',
'segment_actual_time',
'segment_osrm_time',
'segment_osrm_distance']
In [174…
fig, (ax1, ax2, ax3) = plt.subplots(ncols = 3, figsize =(20, 5))
ax1.set_title('Before Scaling')
In [175…
for col in num_cols:
    fig, ax = plt.subplots(1, 3, figsize = (18, 4))
    plt.suptitle(col, fontsize = fontsize, fontweight = fontweight)
    ax[0].set_title('Before Scaling')
    sns.kdeplot(delhivery_data_v2[col], ax = ax[0], color ='red')
    ax[1].set_title('After Standard Scaling')
    sns.kdeplot(delhivery_data_v4[col], ax = ax[1], color ='blue')
    ax[2].set_title('After Min-Max Scaling')
    sns.kdeplot(delhivery_data_v5[col], ax = ax[2], color ='green')
    plt.show()
Inference
After normalization (min-max scaling), all the numerical attributes are on a similar
scale, ranging from 0 to 1.
Standardization translates the data so that the mean vector of the original data moves
to the origin, and squishes or expands each attribute to unit variance.
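The two transforms can be written out directly. Standard scaling computes z = (x - mean) / std and min-max scaling computes (x - min) / (max - min), each per column. The sketch below reproduces both with NumPy on a toy column; `StandardScaler` uses the population standard deviation (`ddof=0`), which is what `np.std` computes by default. The values are illustrative only.

```python
import numpy as np

# Toy column of durations (illustrative values only).
x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

# Standard scaling: shift the mean to 0 and rescale to unit variance.
standardized = (x - x.mean()) / x.std()

# Min-max scaling: map the smallest value to 0 and the largest to 1.
normalized = (x - x.min()) / (x.max() - x.min())

print("standardized:", standardized.round(3))
print("normalized:  ", normalized.round(3))
```

Both transforms are linear, so neither changes the shape of a distribution; a right-skewed attribute stays right-skewed after scaling.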
Business Insights
The dataset contains 1,44,867 records and 17 attributes.
The source and destination name attributes have a small number of missing values.
The mean and median of every numerical attribute are not close to each other,
which indicates that the data is not normally distributed.
Also the range of numerical attributes are widely distributed which shows there
FTL route types are more numerous in the raw data, but we cannot draw a conclusion
before aggregating the rows.
The average time taken to deliver from source to destination is relatively higher where
the number of transits is high.
Most one-way parcels are likely to have the fewest transits.
All the numerical attributes are right-skewed and require some treatment.
New features for trip creation year, month, day, date and time are extracted
from the trip creation time attribute.
City, place and area information is extracted from both the source and destination
name attributes.
The new feature OD time taken is calculated as the difference between OD start
time and OD end time.
Almost all the numerical attributes are strongly linearly correlated with each other.
In all months, carting is used more than full truck loads.
Karnataka, Maharashtra, Tamil Nadu and Haryana are the top states from which
parcels originate.
From Karnataka, Maharashtra, Haryana and Tamil Nadu, more parcels are
sent by carting than by full truck load.
Karnataka, Maharashtra, Tamil Nadu, Haryana and Telangana are the top states to
which parcels are delivered.
Karnataka, Maharashtra, Tamil Nadu, Haryana and Telangana are the top states
by number of trips.
In Karnataka, Maharashtra and Haryana, one-way parcels are preferred.
In Tamil Nadu and West Bengal, one-way and round-trip parcels are about equally
common.
OD_start_time / OD_end_time and start scan to end scan are closely related
to each other.
Actual time and OSRM time are closely related to each other.
Actual time and segment actual time are closely related to each other.
OSRM distance and segment OSRM distance are closely related to each
other.
OSRM time and segment OSRM time are closely related to each other.
Since the p-value of the test is below 0.05, we reject the null
hypothesis and conclude that OSRM time and segment OSRM time are
dependent on each other.
Since the p-value of the test is below 0.05, we reject the null
hypothesis and conclude that OSRM distance and segment OSRM distance are
dependent on each other.
Since the p-value of the test is below 0.05, we reject the null
hypothesis and conclude that actual time and segment actual time are
dependent on each other.
Since the p-value of the test is below 0.05, we reject the null
hypothesis and conclude that actual time and OSRM time are dependent on
each other.
Since the p-value of the test is below 0.05, we reject the null
hypothesis and conclude that od_start_time / od_end_time and start scan to end scan
are dependent on each other.
Recommendations
1. Delhivery can increase its business by offering discounts on the busiest
corridors in the busiest states.
2. Delhivery can increase its business by offering discounts on the FTL route
type.
3. Delhivery should focus more on the southern states, as more parcels
originate from and are delivered to them.
4. Delhivery should plan to use the shortest path from the source to the destination
center.