You are on page 1of 79

Delhivery (6) 19/08/22, 10:36 PM

Brief Summary
Delhivery is the largest and fastest-growing fully integrated player in India by revenue in
Fiscal 2021. They aim to build the operating system for commerce, through a
combination of world-class infrastructure, logistics operations of the highest quality, and
cutting-edge engineering and technology capabilities.

The Data team builds intelligence and capabilities using this data that helps them to
widen the gap between the quality, efficiency, and profitability of their business versus
their competitors.

Problem Statement
The company wants to understand and process the data coming out of data engineering
pipelines:

• Clean, sanitize and manipulate data to get useful features out of raw fields

• Make sense out of the raw data and help the data science team to build forecasting
models on it

Import Libraries
In [1]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import warnings
import random
from scipy import stats
from scipy.stats import levene
from scipy.stats import shapiro
from scipy.stats import anderson
from scipy.stats import pearsonr
from sklearn import preprocessing
import statsmodels.api as sm
warnings.filterwarnings("ignore")
%matplotlib inline

/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: Fu
tureWarning: pandas.util.testing is deprecated. Use the functions in the pu
blic API at pandas.testing instead.
import pandas.util.testing as tm

about:srcdoc Page 1 of 79
Delhivery (6) 19/08/22, 10:36 PM

Global Variables
In [2]:
DEFAULT_OPTIONS = { "mean": True, "mode": True, "title": True, "median": True

In [3]:
TITLE_FONT_WGT = "bold"
COLOR_PALETTE = sns.color_palette("ch:s=.25, rot=-.25")
FONTSIZE, FONTFAMILY, FONTWEIGHT = 12, "Comic Sans MS", "bold"
SMALL_FIGSIZE, SMALL_WIDE_FIGSIZE, MEDIUM_FIGSIZE, MEDIUM_TALL_FIGSIZE, LARGE_FI

Common Utilities
In [4]:
fontsize, fontfamily, fontweight = 12, "Comic Sans MS", "bold"
palette_color = sns.color_palette("ch:s=.25, rot=-.25")

In [5]:
def sort_values(df, ascending = False):
return df.sort_values(ascending = ascending)

In [6]:
def missing_values(df):
total_null_cnt = df.isnull().count()
null_in_col = sort_values(df.isnull().sum())
percent = sort_values(null_in_col / total_null_cnt * 100)
print("Total records = ", df.shape[0])
tab = pd.concat([null_in_col, percent.round(2)], axis = 1, keys = ['# of Missi
return tab

In [7]:
def univariate_analysis(df, col):
fig, ax = plt.subplots(1, 3, figsize = LARGE_FIGSIZE)
plt.suptitle(col, fontsize = FONTSIZE, fontweight = FONTWEIGHT)
uni_histplot(df, col, ax[0], "transits_count", { "nolabel": True })
bi_boxplot(df, "transits_count", col, ax[1], None, { "nolabel": True })
bi_scatterplot(df, "transits_count", col, ax[2], None, "transits_count",

In [8]:
def get_outliers_range(df, col):
q1 = df[col].quantile(0.25)
q3 = df[col].quantile(0.75)
iqr = q3 - q1
outlier_left = q1 - 1.5 * iqr
outlier_right = q3 + 1.5 * iqr
return outlier_left, outlier_right

about:srcdoc Page 2 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [9]:
def detect_outliers_group(df, col):
copied_index = df.index
base_parameters = ["route_type", "source_center", "destination_center", "trans
col_data = df[base_parameters + [col]]
outlier_left, outlier_right = get_outliers_range(col_data.groupby(base_paramet
outliers_range = pd.merge(outlier_left, outlier_right, how = "left", left_on
outliers_range.columns = ["outlier_left", "outlier_right"]
outliers_range = outliers_range.reset_index()
col_data = pd.merge(col_data, outliers_range, how = "left", left_on = base_pa
return df[(df[col] < col_data["outlier_left"]) | (df[col] > col_data["outlier_

In [10]:
def remove_outliers_group(df, col):
copied_index = df.index
base_parameters = ["route_type", "source_center", "destination_center", "trans
col_data = df[base_parameters + [col]]
outlier_left, outlier_right = get_outliers_range(col_data.groupby(base_paramet
outliers_range = pd.merge(outlier_left, outlier_right, how = "left", left_on
outliers_range.columns = ["outlier_left", "outlier_right"]
outliers_range = outliers_range.reset_index()
col_data = pd.merge(col_data, outliers_range, how = "left", left_on = base_pa
return df[(df[col] >= col_data["outlier_left"]) & (df[col] <= col_data["outlie

In [11]:
def handle_outliers_group(df, col):
copied_index = df.index
base_parameters = ["route_type", "source_center", "destination_center", "trans
col_data = df[base_parameters + [col]]
outlier_left, outlier_right = get_outliers_range(col_data.groupby(base_paramet
outliers_range = pd.merge(outlier_left, outlier_right, how = "left", left_on
outliers_range.columns = ["outlier_left", "outlier_right"]
outliers_range = outliers_range.reset_index()
for i in outliers_range.index:
outliers_range.loc[i, "impute_data"] = random.randint(int(outliers_range
col_data = pd.merge(col_data, outliers_range, how = "left", left_on = base_pa
bool_mask = (col_data[col] < col_data["outlier_left"]) | (col_data[col] >
col_data.loc[bool_mask, col] = col_data.loc[bool_mask, "impute_data"]
df[col] = col_data[col].abs()
df[col] = df[col].replace(0,0.1)

In [12]:
def detect_outliers(df, col):
q1 = df[col].quantile(0.25)
q3 = df[col].quantile(0.75)
iqr = q3 - q1
outlier_left = q1 - 1.5 * iqr
outlier_right = q3 + 1.5 * iqr
return df[ (df[col] <= outlier_left) | (df[col] >= outlier_right) ]

about:srcdoc Page 3 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [13]:
def remove_outliers(df, col):
q1 = df[col].quantile(0.25)
q3 = df[col].quantile(0.75)
iqr = q3 - q1
outlier_left = q1 - 1.5 * iqr
outlier_right = q3 + 1.5 * iqr
return df[ (df[col] > outlier_left) & (df[col] < outlier_right) ]

In [14]:
def set_title(axis, label):
if (not label): return
axis.set_title(label, fontweight = TITLE_FONT_WGT)

In [15]:
def set_legend(axis, options):
if (not options): return
axis.legend(options)

In [16]:
def axv_line(df, axis, label, options):
if (not label): return
value = options.get("value")
color = options.get("color")
line_style = options.get("line_style")
axis.axvline(value, color = color, linestyle = line_style, label = label)

In [17]:
def add_meta_data(df, col, axis, title, options):
set_title(axis, title)
if (not options): return
if (options.get("mean")): axv_line(df, axis, { "label": "Mean", "value":
if (options.get("mode")): axv_line(df, axis, { "label": "Mode", "value":
if (options.get("median")): axv_line(df, axis, { "label": "Median", "value"
if (options.get("legend")): set_legend(axis, { "Mean": df[col].mean(), "Mode"
if (options.get("xlabel")): axis.set_xlabel(options.get("xlabel"))
if (options.get("ylabel")): axis.set_ylabel(options.get("ylabel"))
if (options.get("rotate")):
axis.set_xticklabels(axis.get_xticklabels(), rotation = 90)
if (options.get("nolabel")):
axis.set_xlabel(None)
axis.set_ylabel(None)

In [18]:
def uni_distplot(df, col, axis, title, options = DEFAULT_OPTIONS):
sns.distplot(df[col], ax = axis)
add_meta_data(df, col, axis, title, options)

In [19]:
def uni_boxplot(df, col, axis, title):
sns.boxplot(y = df[col], ax = axis)
add_meta_data(df, col, axis, title, { "ylabel": col })

about:srcdoc Page 4 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [20]:
def uni_barplot(df, col, axis, options):
df_count = df[col].value_counts()
df_count.plot.bar(color = COLOR_PALETTE, ax = axis)
add_meta_data(df, col, axis, None, options)

In [21]:
def uni_countplot(df, col, hue, axis):
sns.countplot(data = df, x = col, hue = hue, palette = "Set2", ax = axis)
add_meta_data(df, col, axis, col + " - " + hue + " based distribution", {

In [22]:
def uni_pieplot(df, col, axis, options):
df_count = df[col].value_counts()
df_count.plot.pie(colors = COLOR_PALETTE, autopct = '%.0f%%', ax = axis)
add_meta_data(df, col, axis, None, options)

In [23]:
def uni_histplot(df, col, axis, title, options):
data = df[col]
if (options.get("log")): data = np.log(data)
sns.histplot(data, bins = 50, kde = True, ax = axis)
add_meta_data(df, None, axis, title, None)

In [24]:
def uni_qqplot(df, col, title, axis, options):
data = df[col]
if (options.get("log")): data = np.log(data)
sm.qqplot(data, line = 's', ax = axis)
add_meta_data(df, None, axis, title, None)

In [25]:
def bi_pointplot(df, xcol, ycol, hue, axis, title):
sns.pointplot(x = df[xcol], y = df[ycol], hue = df[hue], ax = axis)
add_meta_data(df, None, axis, title, { "xlabel": xcol, "ylabel": ycol })

In [26]:
def uni_scatterplot(df, xcol, title, axis):
g = sns.scatterplot(data = df[xcol], ax = axis)
add_meta_data(df, None, axis, title, None)
g.set(xticklabels = [])
g.set(xlabel = None)

In [27]:
def bi_boxplot(df, xcol, ycol, axis, hue, options):
sns.boxplot(data = df, x = xcol, y = ycol, ax = axis, palette = "Paired",
add_meta_data(df, None, axis, None, options)

In [28]:
def bi_scatterplot(df, xcol, ycol, axis, title, hue, options):
sns.scatterplot(data = df, x = xcol, y = ycol, ax = axis, hue = hue)
add_meta_data(df, None, axis, title, options)

about:srcdoc Page 5 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [29]:
def bi_pointplot(df, xcol, ycol, hue, axis, title):
sns.pointplot(x = df[xcol], y = df[ycol], hue = df[hue], ax = axis)
add_meta_data(df, None, axis, title, { "xlabel": xcol, "ylabel": ycol })

In [30]:
def bi_lineplot(df, xcol, ycol, hue, axis, title):
if (hue): g = sns.lineplot(data = df, x = xcol, y = ycol, hue = hue, ax =
else: g = sns.lineplot(data = df, x = xcol, y = ycol, ax = axis, markers=
if (hue): title = title + " | " + hue
add_meta_data(df, None, axis, title, None)
g.set(xlabel = None)
g.set(ylabel = None)

In [31]:
def heatmap(df, title, options):
sns.color_palette("pastel")
sns.heatmap(df, annot = True, vmin = -1, vmax = 1, cmap = "PiYG")
add_meta_data(df, None, None, title, options)
plt.show()

In [32]:
def pairplot(df):
sns.color_palette("pastel")
sns.pairplot(df)
plt.show()

In [33]:
def kdeplot(df, colname, title, axis, test = False):
g = sns.kdeplot(df[colname], ax = axis)
axis.axvline(df[colname].mean(), color = "r", linestyle = "--", label =
if (title):
axis.set_title(title, fontweight = fontweight)
g.set(yticklabels=[])
g.set(ylabel=None)
if (not test):
axis.axvline(df[colname].median(), color = "g", linestyle = "-", label
else:
axis.legend({ "Mean" : df[colname].mean(), "Median" : df[colname].median

In [34]:
def boxplot(df, colname, title, axis):
sns.boxplot(y = df[colname], ax = axis)
if (title):
axis.set_title(title, fontweight = fontweight)
axis.set_ylabel(colname, fontsize = fontsize, family = fontfamily)

In [35]:
def scatterplot(df, xcolname, ycolname, title, axis):
sns.scatterplot(data = df, x = xcolname, y = ycolname, ax = axis)
axis.set_title(title, fontweight = fontweight)
axis.set_xlabel(None)
axis.set_ylabel(None)

about:srcdoc Page 6 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [36]:
def scatterplotonecol(df, xcolname, title, axis):
g = sns.scatterplot(data = df[xcolname], ax = axis)
if (title):
axis.set_title(title, fontweight = fontweight)
g.set(xticklabels=[])
g.set(xlabel=None)

In [37]:
def boxplot_bicol(df,colname1, colname2,axis):
sns.boxplot(x = colname1,y = colname2, data = df,ax=axis,palette="Paired"
axis.set_xlabel(colname1, fontweight="bold",fontsize=14,family = "Comic Sans
axis.set_ylabel(colname2, fontweight="bold", fontsize=14,family = "Comic San

In [38]:
def pointplot(df,colname1,colname2,colname3,title,axis):
sns.pointplot(x=df[colname1],y=df[colname2],hue=df[colname3],ax=axis)
axis.set_xlabel(colname1)
axis.set_ylabel(colname2)
axis.set_title(title,fontweight="bold")

In [39]:
def countplot(df, xcolname, hcolname, axis):
title = xcolname + " - " + hcolname + " based distribution"
sns.countplot(data=df,x=xcolname, hue=hcolname, palette="Set2",ax=axis)
axis.set_title(title,fontweight="bold")
axis.set_xlabel(xcolname)
axis.set_ylabel('count')

In [40]:
def histplot(df,title,axis):
sns.histplot(df, bins = 50, kde = True, ax = axis)
axis.set_title(title,fontweight="bold")

In [41]:
def qqplot(df,title,axis):
sm.qqplot(df, line = 's', ax = axis)
axis.set_title(title)

Column Profiling

about:srcdoc Page 7 of 79
Delhivery (6) 19/08/22, 10:36 PM

data - tells whether the data is testing or training data


trip_creation_time – Timestamp of trip creation
route_schedule_uuid – Unique Id for a particular route schedule
route_type – Transportation type
FTL – Full Truck Load: FTL shipments get to the destination sooner, as the truck
is making no other pickups or drop-offs along the way
Carting: Handling system consisting of small vehicles (carts)
trip_uuid - Unique ID given to a particular trip (A trip may include different source
and destination centers)
source_center - Source ID of trip origin
source_name - Source Name of trip origin
destination_cente – Destination ID
destination_name – Destination Name
od_start_time – Trip start time
od_end_time – Trip end time
start_scan_to_end_scan – Time taken to deliver from source to destination
is_cutoff – Unknown field
cutoff_factor – Unknown field
cutoff_timestamp – Unknown field
actual_distance_to_destination – Distance in Kms between source and destination
warehouse
actual_time – Actual time taken to complete the delivery (Cumulative)
osrm_time – An open-source routing engine time calculator which computes the
shortest path between points in a given map (Includes usual traffic, distance through
major and minor roads) and gives the time (Cumulative)
osrm_distance – An open-source routing engine which computes the shortest path
between points in a given map (Includes usual traffic, distance through major and
minor roads) (Cumulative)
factor – Unknown field
segment_actual_time – This is a segment time. Time taken by the subset of the
package delivery
segment_osrm_time – This is the OSRM segment time. Time taken by the subset of
the package delivery
segment_osrm_distance – This is the OSRM distance. Distance covered by subset of
the package delivery
segment_factor – Unknown field

Load data

about:srcdoc Page 8 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [42]:
delhivery_data = pd.read_csv("https://d2beiqkhq929f0.cloudfront.net/public_asset
delhivery_data.head()

Out[42]: data trip_creation_time route_schedule_uuid route_type trip_uuid source_c

thanos::sroute:eb7bfc78-
2018-09-20 trip-
0 training b351-4c0e-a951- Carting IND38812
02:35:36.476840 153741093647649320
fa3d5c3...

thanos::sroute:eb7bfc78-
2018-09-20 trip-
1 training b351-4c0e-a951- Carting IND38812
02:35:36.476840 153741093647649320
fa3d5c3...

thanos::sroute:eb7bfc78-
2018-09-20 trip-
2 training b351-4c0e-a951- Carting IND38812
02:35:36.476840 153741093647649320
fa3d5c3...

thanos::sroute:eb7bfc78-
2018-09-20 trip-
3 training b351-4c0e-a951- Carting IND38812
02:35:36.476840 153741093647649320
fa3d5c3...

thanos::sroute:eb7bfc78-
2018-09-20 trip-
4 training b351-4c0e-a951- Carting IND38812
02:35:36.476840 153741093647649320
fa3d5c3...

5 rows × 24 columns

Drop Unknown Fields


In [43]:
unknwn_cols = ["is_cutoff","cutoff_factor","cutoff_timestamp","factor","segment_
delhivery_data.drop(unknwn_cols, axis=1, inplace=True)

Observations on shape & data types of all


attributes
In [44]:
delhivery_data.shape

(144867, 19)
Out[44]:

In [45]:
delhivery_data.columns

Index(['data', 'trip_creation_time', 'route_schedule_uuid', 'route_type',


Out[45]:
'trip_uuid', 'source_center', 'source_name', 'destination_center',
'destination_name', 'od_start_time', 'od_end_time',
'start_scan_to_end_scan', 'actual_distance_to_destination',
'actual_time', 'osrm_time', 'osrm_distance', 'segment_actual_time',
'segment_osrm_time', 'segment_osrm_distance'],
dtype='object')

about:srcdoc Page 9 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [46]:
delhivery_data.dtypes

data object
Out[46]:
trip_creation_time object
route_schedule_uuid object
route_type object
trip_uuid object
source_center object
source_name object
destination_center object
destination_name object
od_start_time object
od_end_time object
start_scan_to_end_scan float64
actual_distance_to_destination float64
actual_time float64
osrm_time float64
osrm_distance float64
segment_actual_time float64
segment_osrm_time float64
segment_osrm_distance float64
dtype: object

Conversion of Categorical attributes


In [47]:
delhivery_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144867 entries, 0 to 144866
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 data 144867 non-null object
1 trip_creation_time 144867 non-null object
2 route_schedule_uuid 144867 non-null object
3 route_type 144867 non-null object
4 trip_uuid 144867 non-null object
5 source_center 144867 non-null object
6 source_name 144574 non-null object
7 destination_center 144867 non-null object
8 destination_name 144606 non-null object
9 od_start_time 144867 non-null object
10 od_end_time 144867 non-null object
11 start_scan_to_end_scan 144867 non-null float64
12 actual_distance_to_destination 144867 non-null float64
13 actual_time 144867 non-null float64
14 osrm_time 144867 non-null float64
15 osrm_distance 144867 non-null float64
16 segment_actual_time 144867 non-null float64
17 segment_osrm_time 144867 non-null float64
18 segment_osrm_distance 144867 non-null float64
dtypes: float64(8), object(11)
memory usage: 21.0+ MB

about:srcdoc Page 10 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [48]:
cols = delhivery_data.columns
num_cols = ["start_scan_to_end_scan", "actual_distance_to_destination", "actual_
cat_cols = ["data","route_type"]
dt_cols = ["trip_creation_time","od_start_time","od_end_time"]

#delhivery_data[cat_cols] = delhivery_data[cat_cols].astype("category")
delhivery_data[dt_cols] = delhivery_data[dt_cols].astype('datetime64')

Analyzing basic statistics about each


feature, such as count, min, max, and mean
In [49]:
delhivery_data.describe().T

Out[49]: count mean std min 25%

start_scan_to_end_scan 144867.0 961.262986 1037.012769 20.000000 161.000000

actual_distance_to_destination 144867.0 234.073372 344.990009 9.000045 23.355874

actual_time 144867.0 416.927527 598.103621 9.000000 51.000000

osrm_time 144867.0 213.868272 308.011085 6.000000 27.000000

osrm_distance 144867.0 284.771297 421.119294 9.008200 29.914700

segment_actual_time 144867.0 36.196111 53.571158 -244.000000 20.000000

segment_osrm_time 144867.0 18.507548 14.775960 0.000000 11.000000

segment_osrm_distance 144867.0 22.829020 17.860660 0.000000 12.070100

In [50]:
delhivery_data.describe(include='object').T

Out[50]: count unique top freq

data 144867 2 training 104858

thanos::sroute:4029a8a2-6c74-4b7e-a6d8-
route_schedule_uuid 144867 1504 1812
f9e069f...

route_type 144867 2 FTL 99660

trip_uuid 144867 14817 trip-153811219535896559 101

source_center 144867 1508 IND000000ACB 23347

source_name 144574 1498 Gurgaon_Bilaspur_HB (Haryana) 23347

destination_center 144867 1481 IND000000ACB 15192

destination_name 144606 1468 Gurgaon_Bilaspur_HB (Haryana) 15192

about:srcdoc Page 11 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [51]:
delhivery_data.describe(include='all').T

Out[51]: count unique top freq

data 144867 2 training 104858

2018-09-28 2018-09-12
trip_creation_time 144867 14817 101
05:23:15.359220 00:00:16.535741

thanos::sroute:4029a8a2-
route_schedule_uuid 144867 1504 6c74-4b7e-a6d8- 1812
f9e069f...

route_type 144867 2 FTL 99660

trip-
trip_uuid 144867 14817 101
153811219535896559

source_center 144867 1508 IND000000ACB 23347

Gurgaon_Bilaspur_HB
source_name 144574 1498 23347
(Haryana)

destination_center 144867 1481 IND000000ACB 15192

Gurgaon_Bilaspur_HB
destination_name 144606 1468 15192
(Haryana)

2018-09-21 2018-09-12
od_start_time 144867 26369 81
18:37:09.322207 00:00:16.535741

2018-09-24 2018-09-12
od_end_time 144867 26369 81
09:59:15.691618 00:50:10.814399

start_scan_to_end_scan 144867.0 NaN NaN NaN

actual_distance_to_destination 144867.0 NaN NaN NaN

actual_time 144867.0 NaN NaN NaN

osrm_time 144867.0 NaN NaN NaN

osrm_distance 144867.0 NaN NaN NaN

segment_actual_time 144867.0 NaN NaN NaN

segment_osrm_time 144867.0 NaN NaN NaN

segment_osrm_distance 144867.0 NaN NaN NaN

Non-Graphical Analysis: Value counts and


unique attributes
Unique values (counts) for each Feature

about:srcdoc Page 12 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [52]:
for col in delhivery_data.columns:
l = len(col)
if l < 7: print(col, "\t\t\t\t:", delhivery_data[col].nunique())
elif l == 16: print(col, "\t\t:", delhivery_data[col].nunique())
elif l < 16: print(col, "\t\t\t:", delhivery_data[col].nunique())
elif l<30: print(col, "\t\t:", delhivery_data[col].nunique())
else: print(col, "\t:", delhivery_data[col].nunique())

data : 2
trip_creation_time : 14817
route_schedule_uuid : 1504
route_type : 2
trip_uuid : 14817
source_center : 1508
source_name : 1498
destination_center : 1481
destination_name : 1468
od_start_time : 26369
od_end_time : 26369
start_scan_to_end_scan : 1915
actual_distance_to_destination : 144515
actual_time : 3182
osrm_time : 1531
osrm_distance : 138046
segment_actual_time : 747
segment_osrm_time : 214
segment_osrm_distance : 113799

Unique values (names) are checked for each


Features
In [53]:
for colname in cat_cols:
print("\nUnique values of ",colname," are : ",list(delhivery_data[colname

Unique values of data are : ['training', 'test']

Unique values of route_type are : ['Carting', 'FTL']

Inference

No abnormalities found.

Unique values (counts) are checked for each


Features unique values
In [54]:
delhivery_data["data"].value_counts().sort_values(ascending=False)

training 104858
Out[54]:
test 40009
Name: data, dtype: int64

about:srcdoc Page 13 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [55]:
delhivery_data["route_type"].value_counts().sort_values(ascending=False)

FTL 99660
Out[55]:
Carting 45207
Name: route_type, dtype: int64

Inference

Around 50% of the trips has distance more than 50 kms. Maximum distance travelled
is 2186 kms
50% of delivery has the actual time of 149 mins and maximum time taken for longest
delivery is 6265 mins
50% of delivery time calculated in OSRM engine is 60 mins and maximum time taken
for longest delivery is 2032 mins
50% of delivery in OSRM engine is 65 kms and maximum time taken for longest
delivery is 2840 kms
50% of delivery has the segment actual time 147 mins and maximum time taken for
longest delivery is 6230 mins
50% of delivery has the segment OSRM time of 65 mins and maximum time taken for
longest delivery is 2564 mins
50% of delivery has the segment OSRM distance of 70 kms and maximum distance
taken for longest delivery is 3523 kms
50% of trip time difference between the start and end 280 mins and max trip time is
7898 mins
Most of the route type is Carting and it is around 8908
Most of the orders are coming from state Maharashtra
Most of the orders are delivered to state Maharashtra
Most of the orders are delivered from city Mumbai in Maharashtra state
Most of the orders are delivered to city Mumbai in Maharashtra state, Hence busiest
corridor under busiest state is Mumbai to Mumbai (round trip)
Average distance between the busiest corridor Mumbai to Mumbai is 14.62 km
Average time between the busiest corridor Mumbai to Mumbai is 55.29 minutes

Missing value detection


In [56]:
missing_values(delhivery_data)

about:srcdoc Page 14 of 79
Delhivery (6) 19/08/22, 10:36 PM

Total records = 144867


Out[56]: # of Missing % of Missing

source_name 293 0.20

destination_name 261 0.18

data 0 0.00

start_scan_to_end_scan 0 0.00

segment_osrm_time 0 0.00

segment_actual_time 0 0.00

osrm_distance 0 0.00

osrm_time 0 0.00

actual_time 0 0.00

actual_distance_to_destination 0 0.00

od_start_time 0 0.00

od_end_time 0 0.00

trip_creation_time 0 0.00

destination_center 0 0.00

source_center 0 0.00

trip_uuid 0 0.00

route_type 0 0.00

route_schedule_uuid 0 0.00

segment_osrm_distance 0 0.00

In [57]:
mask = False
mask = mask | delhivery_data["source_name"].isnull()
source_name_null_data = delhivery_data[mask]
source_name_null_data

about:srcdoc Page 15 of 79
Delhivery (6) 19/08/22, 10:36 PM

Out[57]: data trip_creation_time route_schedule_uuid route_type trip_uuid

thanos::sroute:4460a38d-
2018-09-25 trip-
112 training ab9b-484e-bd4e- FTL
08:53:04.377810 153786558437756691
f4201d0...

thanos::sroute:4460a38d-
2018-09-25 trip-
113 training ab9b-484e-bd4e- FTL
08:53:04.377810 153786558437756691
f4201d0...

thanos::sroute:4460a38d-
2018-09-25 trip-
114 training ab9b-484e-bd4e- FTL
08:53:04.377810 153786558437756691
f4201d0...

thanos::sroute:4460a38d-
2018-09-25 trip-
115 training ab9b-484e-bd4e- FTL
08:53:04.377810 153786558437756691
f4201d0...

thanos::sroute:4460a38d-
2018-09-25 trip-
116 training ab9b-484e-bd4e- FTL
08:53:04.377810 153786558437756691
f4201d0...

... ... ... ... ...

thanos::sroute:cbef3b6a-
2018-10-03 trip-
144484 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...

thanos::sroute:cbef3b6a-
2018-10-03 trip-
144485 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...

thanos::sroute:cbef3b6a-
2018-10-03 trip-
144486 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...

thanos::sroute:cbef3b6a-
2018-10-03 trip-
144487 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...

thanos::sroute:cbef3b6a-
2018-10-03 trip-
144488 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...

293 rows × 19 columns

In [58]:
mask = False
mask = mask | delhivery_data["destination_name"].isnull()
dest_name_null_data = delhivery_data[mask]
dest_name_null_data

about:srcdoc Page 16 of 79
Delhivery (6) 19/08/22, 10:36 PM

Out[58]: data trip_creation_time route_schedule_uuid route_type trip_uuid

thanos::sroute:4460a38d-
2018-09-25 trip-
110 training ab9b-484e-bd4e- FTL
08:53:04.377810 153786558437756691
f4201d0...

thanos::sroute:4460a38d-
2018-09-25 trip-
111 training ab9b-484e-bd4e- FTL
08:53:04.377810 153786558437756691
f4201d0...

thanos::sroute:d0ebdacd-
2018-10-01 trip-
982 test e09b-47d3-be77- FTL
20:56:18.155260 153842737815495661
c9c4a05...

thanos::sroute:d0ebdacd-
2018-10-01 trip-
983 test e09b-47d3-be77- FTL
20:56:18.155260 153842737815495661
c9c4a05...

thanos::sroute:2f43f11e-
2018-09-24 trip-
4882 training d3ba-4590-9355- FTL
07:18:06.087341 153777348608709328
82928e1...

... ... ... ... ...

thanos::sroute:cbef3b6a-
2018-10-03 trip-
144478 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...

thanos::sroute:cbef3b6a-
2018-10-03 trip-
144479 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...

thanos::sroute:cbef3b6a-
2018-10-03 trip-
144480 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...

thanos::sroute:cbef3b6a-
2018-10-03 trip-
144481 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...

thanos::sroute:cbef3b6a-
2018-10-03 trip-
144482 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...

261 rows × 19 columns

In [59]:
delhivery_data['source_name'].fillna(delhivery_data['source_center'],inplace
delhivery_data['destination_name'].fillna(delhivery_data['destination_center'

Inference
Inference

Source and destination name attributes are having small number of missing values.
Labels are also inconsitent in source and destination name attributes

about:srcdoc Page 17 of 79
Delhivery (6) 19/08/22, 10:36 PM

Merging of rows and aggregation of fields


In [60]:
df1= delhivery_data
#df1 = df1[df1['trip_uuid']=='trip-153741093647649320']

In [61]:
grp = df1.groupby(['data','trip_uuid', 'trip_creation_time','route_type','source
df2 = grp.agg({'od_start_time' : 'min', 'od_end_time' : 'max','actual_distance_t
'segment_actual_time':'sum','segment_osrm_time':'sum','segment_osrm_dis
df2.sort_values(by=['trip_uuid', 'trip_creation_time','od_start_time'],inplace
df2 = df2.reset_index()
df2

Out[61]: data trip_uuid trip_creation_time route_type source_center

trip- 2018-09-12
0 training FTL IND462022AAA
153671041653548748 00:00:16.535741

trip- 2018-09-12 Kanpur_


1 training FTL IND209304AAA
153671041653548748 00:00:16.535741

trip- 2018-09-12
2 training Carting IND572101AAA
153671042288605164 00:00:22.886430

trip- 2018-09-12 Doddablpur_


3 training Carting IND561203AAB
153671042288605164 00:00:22.886430

trip- 2018-09-12 Bangalore_


4 training FTL IND562132AAA
153671043369099517 00:00:33.691250

... ... ... ... ... ...

trip- 2018-10-03 Tirchchndr_S


26364 test Carting IND628204AAA
153861115439069069 23:59:14.390954

trip- 2018-10-03 Thisayanvilai_


26365 test Carting IND627657AAA
153861115439069069 23:59:14.390954

trip- 2018-10-03 Peikulam_S


26366 test Carting IND628613AAA
153861115439069069 23:59:14.390954

trip- 2018-10-03
26367 test FTL IND583201AAA Hospet
153861118270144424 23:59:42.701692

trip- 2018-10-03 Sandur_W


26368 test FTL IND583119AAA
153861118270144424 23:59:42.701692

26369 rows × 18 columns

In [62]:
grp = df2.groupby(['data','trip_uuid', 'trip_creation_time','route_type'])
delhivery_data_v2 = grp.agg({'source_center':'first','source_name':'first',
'segment_actual_time':'sum','segment_osrm_time':'sum','segment_osrm_dis
delhivery_data_v2.columns = ["data","trip_uuid", "trip_creation_time", "route_ty
delhivery_data_v2

about:srcdoc Page 18 of 79
Delhivery (6) 19/08/22, 10:36 PM

Out[62]: data trip_uuid trip_creation_time route_type source_center

trip- 2018-09-27
0 test Carting IND424006AAA
153800653897073708 00:02:18.970980

trip- 2018-09-27
1 test Carting IND400072AAB
153800654935210748 00:02:29.352390

trip- 2018-09-27
2 test FTL IND302014AAA Jaipur_Hub
153800658820968126 00:03:08.209931

trip- 2018-09-27
3 test Carting IND421302AAF
153800659468028518 00:03:14.680535

trip- 2018-09-27
4 test Carting IND395009AAA
153800661729668086 00:03:37.296972

... ... ... ... ... ...

trip- 2018-09-26 Vadodara_Kar


14812 training Carting IND390022AAA
153800579708680929 23:49:57.087036

trip- 2018-09-26 Sivaganga


14813 training Carting IND630561AAA
153800585467019097 23:50:54.670423

trip- 2018-09-26
14814 training FTL IND424006AAA
153800603160412602 23:53:51.604388

trip- 2018-09-26 Mainpu


14815 training FTL IND205001AAB
153800605670819251 23:54:16.708455

trip- 2018-09-26 GZB_Mohan


14816 training Carting IND201007AAA
153800606794535545 23:54:27.945614

14817 rows × 19 columns

Inference

So we have reduced the number of rows from 144316 to just 14817. We now have 21
columns.

We combined delivery details of a package with multiple rows into single row.

Validate Duplicate Records


In [63]:
delhivery_data_v2[delhivery_data_v2.duplicated()]

Out[63]: data trip_uuid trip_creation_time route_type source_center source_name destination_center

Inference

about:srcdoc Page 19 of 79
Delhivery (6) 19/08/22, 10:36 PM

No duplicates found.

Feature Engineering
Split and extract features out of Source Name
In [64]:
delhivery_data_v2['source_city'] = delhivery_data_v2['source_name'].str.split
delhivery_data_v2['source_place'] = delhivery_data_v2['source_name'].str.split
delhivery_data_v2['source_code'] = delhivery_data_v2['source_name'].str.split
delhivery_data_v2['source_state']= delhivery_data_v2['source_name'].str.split

Split and extract features out of Destination Name


In [65]:
delhivery_data_v2['destination_city'] = delhivery_data_v2['destination_name'
delhivery_data_v2['destination_place'] = delhivery_data_v2['destination_name'
delhivery_data_v2['destination_code'] = delhivery_data_v2['destination_name'
delhivery_data_v2['destination_state']= delhivery_data_v2['destination_name'

Extract features like month, year and day from Trip Creation
Time
In [66]:
delhivery_data_v2['trip_creation_year'] = delhivery_data_v2['trip_creation_time'
delhivery_data_v2['trip_creation_month'] = delhivery_data_v2['trip_creation_time
delhivery_data_v2['trip_creation_day']= delhivery_data_v2['trip_creation_time'

delhivery_data_v2["trip_creation_hour"] = delhivery_data_v2["trip_creation_time"
delhivery_data_v2["trip_creation_day"] = delhivery_data_v2["trip_creation_time"
delhivery_data_v2["trip_creation_week"] = delhivery_data_v2["trip_creation_time"
delhivery_data_v2["trip_creation_dayofweek"] = delhivery_data_v2["trip_creation_

Calculate the time taken between od_start_time and


od_end_time
In [67]:
delhivery_data_v2['od_time_taken'] = (delhivery_data_v2['od_end_time'] - delhive
#(df.from_date - df.to_date) / pd.Timedelta(minutes=1)

Handling categorical values


Do one-hot encoding of categorical variable -
route_type

about:srcdoc Page 20 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [68]:
dummies = pd.get_dummies(delhivery_data_v2.route_type,drop_first = True)

In [69]:
delhivery_data_v2 = pd.concat([delhivery_data_v2,dummies],axis=1)

Do one-hot encoding of categorical variable - data


In [70]:
dummies = pd.get_dummies(delhivery_data_v2.data,drop_first = True)

In [71]:
delhivery_data_v2 = pd.concat([delhivery_data_v2,dummies],axis=1)

Visual Univariate Analysis - Numerical


Variables
In [72]:
num_cols

['start_scan_to_end_scan',
Out[72]:
'actual_distance_to_destination',
'actual_time',
'osrm_time',
'osrm_distance',
'segment_actual_time',
'segment_osrm_time',
'segment_osrm_distance']

Outliers Detection and Handling -


start_scan_to_end_scan
In [73]:
col = num_cols[0]
univariate_analysis(delhivery_data_v2, col)

about:srcdoc Page 21 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [74]:
delhivery_data_v2[delhivery_data_v2['start_scan_to_end_scan']<0]

Out[74]: data trip_uuid trip_creation_time route_type source_center source_name destination_center

0 rows × 36 columns

In [75]:
detect_outliers_group(delhivery_data_v2, col)

Out[75]: data trip_uuid trip_creation_time route_type source_center

trip- 2018-09-27 Surat_C


21 test Carting IND395023AAD
153800750141776702 00:18:21.418023

trip- 2018-09-27
69 test Carting IND362001AAA
153801189681302558 01:31:36.813300

trip- 2018-09-27 Bhiwandi_M


72 test FTL IND421302AAG
153801202166779166 01:33:41.668026

trip- 2018-09-27 Lucknow_H


85 test Carting IND226004AAA
153801386437307256 02:04:24.373330

trip- 2018-09-27
101 test Carting IND110037AAM Delhi_Airport
153801517906891689 02:26:19.069179

... ... ... ... ... ...

trip- 2018-09-26
14638 training Carting IND403726AAB Goa_ZuariNg
153799260763324327 20:10:07.633497

trip- 2018-09-26 Bangalore_Ne


14646 training FTL IND562132AAA
153799315216859426 20:19:12.168865

trip- 2018-09-26
14687 training Carting IND110044AAB Del_Okhla_P
153799639329945683 21:13:13.299701

trip- 2018-09-26 Bengaluru_Ka


14789 training Carting IND560067AAC
153800448393383196 23:28:03.934090

trip- 2018-09-26 Sivaganga_W


14813 training Carting IND630561AAA
153800585467019097 23:50:54.670423

604 rows × 36 columns

In [76]:
delhivery_data_v2[delhivery_data_v2['start_scan_to_end_scan']<0]

Out[76]: data trip_uuid trip_creation_time route_type source_center source_name destination_center

0 rows × 36 columns

about:srcdoc Page 22 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [77]:
handle_outliers_group(delhivery_data_v2, col)

In [78]:
delhivery_data_v2[delhivery_data_v2['start_scan_to_end_scan']<0]

Out[78]: data trip_uuid trip_creation_time route_type source_center source_name destination_center

0 rows × 36 columns

In [79]:
univariate_analysis(delhivery_data_v2, col)

Inference

Average time taken to deliver from source to destination are relatively higher where
the number of transits are high.
Most of the one way parcels are likely to have least number of transits.
More data are right skewed in start scan to end scan attribute and also where the
number of transits are less than three.

Outliers Detection and Handling -


actual_distance_to_destination
In [80]:
col = num_cols[1]
univariate_analysis(delhivery_data_v2, col)

about:srcdoc Page 23 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [81]:
detect_outliers_group(delhivery_data_v2, col)

Out[81]: data trip_uuid trip_creation_time route_type source_center

trip- 2018-09-27
2 test FTL IND302014AAA Jaipur_Hub
153800658820968126 00:03:08.209931

trip- 2018-09-27
4 test Carting IND395009AAA
153800661729668086 00:03:37.296972

trip- 2018-09-27 Mumbai_Kaly


5 test Carting IND421301AAA
153800662027930085 00:03:40.279575

trip- 2018-09-27
27 test Carting IND000000AFT
153800774414397735 00:22:24.144231

trip- 2018-09-27 Chandigarh_M


39 test Carting IND160002AAC
153800890376792315 00:41:43.768170

... ... ... ... ... ...

trip- 2018-09-26 Kanpur_


14736 training Carting IND209304AAA
153800093896706050 22:28:58.967321

trip- 2018-09-26
14751 training Carting IND110037AAK Delhi_Kapshe
153800194935773399 22:45:49.357982

trip- 2018-09-26 Muzaffrp


14775 training Carting IND842001AAA
153800360857000808 23:13:28.570233

trip- 2018-09-26
14781 training Carting IND400037AAA
153800419141788066 23:23:11.418113

trip- 2018-09-26 Tirupur_Kol


14797 training Carting IND641607AAA
153800495416332487 23:35:54.163552

1210 rows × 36 columns

In [82]:
handle_outliers_group(delhivery_data_v2, col)

In [83]:
univariate_analysis(delhivery_data_v2, col)

about:srcdoc Page 24 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference

More data are right skewed in actual distance to destination attribute.

Outliers Detection and Handling - actual_time


In [84]:
col = num_cols[2]
univariate_analysis(delhivery_data_v2, col)

In [85]:
detect_outliers_group(delhivery_data_v2, col)

about:srcdoc Page 25 of 79
Delhivery (6) 19/08/22, 10:36 PM

Out[85]: data trip_uuid trip_creation_time route_type source_center

trip- 2018-09-27 Murshidabad


20 test Carting IND742149AAA
153800744556994688 00:17:25.570190

trip- 2018-09-27
31 test Carting IND712310AAE
153800813417374126 00:28:54.174126

trip- 2018-09-27 PNQ Vad


32 test Carting IND411014AAA
153800827701849043 00:31:17.018741 DPC (Ma

trip- 2018-09-27
33 test Carting IND395023AAD
153800830425161914 00:31:44.251864

trip- 2018-09-27 Gulbarga_


48 test Carting IND585104AAA
153800950402244509 00:51:44.022677

... ... ... ... ... ...

trip- 2018-09-26 Bangalore_N


14685 training Carting IND562132AAA
153799615285931023 21:09:12.859557

trip- 2018-09-26 Kollam_C


14710 training Carting IND691001AAB
153799857684519878 21:49:36.845451

trip- 2018-09-26
14751 training Carting IND110037AAK
153800194935773399 22:45:49.357982

trip- 2018-09-26 Muzaffrpu


14775 training Carting IND842001AAA
153800360857000808 23:13:28.570233

trip- 2018-09-26 Noida_C


14811 training Carting IND201307AAA
153800571655403292 23:48:36.554285

821 rows × 36 columns

In [86]:
handle_outliers_group(delhivery_data_v2, col)

In [87]:
univariate_analysis(delhivery_data_v2, col)

Inference

990 outliers are detected in actual time attribute.

about:srcdoc Page 26 of 79
Delhivery (6) 19/08/22, 10:36 PM

Outliers Detection and Handling - osrm_time


In [88]:
col = num_cols[3]
univariate_analysis(delhivery_data_v2, col)

In [89]:
detect_outliers_group(delhivery_data_v2, col)

Out[89]: data trip_uuid trip_creation_time route_type source_center

trip- 2018-09-27 Mumbai_Kaly


5 test Carting IND421301AAA
153800662027930085 00:03:40.279575

trip- 2018-09-27 CCU_Beli


49 test Carting IND700065AAA
153800970294643704 00:55:02.946700

trip- 2018-09-27 Jamshedpu


51 test FTL IND832109AAB
153800984084960070 00:57:20.849835

trip- 2018-09-27 Anand_V


98 test Carting IND388121AAA
153801468900715290 02:18:09.007402

trip- 2018-09-27 MAA_Poon


110 test Carting IND600056AAB
153801610726869838 02:41:47.269073

... ... ... ... ... ...

trip- 2018-09-26 Chandigarh_M


14697 training FTL IND160002AAC
153799728722932020 21:28:07.229557

trip- 2018-09-26
14754 training FTL IND361001AAA Jamnagar_
153800218813854926 22:49:48.138801

trip- 2018-09-26
14781 training Carting IND400037AAA
153800419141788066 23:23:11.418113

trip- 2018-09-26 Bengaluru_


14789 training Carting IND560067AAC
153800448393383196 23:28:03.934090

trip- 2018-09-26 Kolkata_


14807 training Carting IND712311AAA
153800550884418718 23:45:08.844409

871 rows × 36 columns

about:srcdoc Page 27 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [90]:
handle_outliers_group(delhivery_data_v2, col)

In [91]:
univariate_analysis(delhivery_data_v2, col)

Inference

More data are right skewed in OSRM attribute.


More number of round trips takes more than 2 transits to deliver.
1070 outliers are detected in OSRM time attribute.

Outliers Detection and Handling - osrm_distance


In [92]:
col = num_cols[4]
univariate_analysis(delhivery_data_v2, col)

In [93]:
detect_outliers_group(delhivery_data_v2, col)

about:srcdoc Page 28 of 79
Delhivery (6) 19/08/22, 10:36 PM

Out[93]: data trip_uuid trip_creation_time route_type source_center

trip- 2018-09-27 Mumbai_Ka


5 test Carting IND421301AAA
153800662027930085 00:03:40.279575 _Dc (Ma

trip- 2018-09-27 Brahmapuri_D


9 test Carting IND441206AAB
153800693388046603 00:08:53.880714

trip- 2018-09-27
21 test Carting IND395023AAD
153800750141776702 00:18:21.418023

trip- 2018-09-27
57 test FTL IND282002AAD
153801047876051786 01:07:58.760784

trip- 2018-09-27 Anand_VU


98 test Carting IND388121AAA
153801468900715290 02:18:09.007402

... ... ... ... ... ...

trip- 2018-09-26
14754 training FTL IND361001AAA Jamnagar_Dc
153800218813854926 22:49:48.138801

trip- 2018-09-26 Chotila_C


14755 training FTL IND363520AAC
153800223678375371 22:50:36.783999

trip- 2018-09-26
14779 training Carting IND395023AAD
153800389194808105 23:18:11.948320

trip- 2018-09-26 Mumbai


14781 training Carting IND400037AAA
153800419141788066 23:23:11.418113

trip- 2018-09-26 Rampur_R


14786 training FTL IND244901AAB
153800436074116276 23:26:00.741419

886 rows × 36 columns

In [94]:
handle_outliers_group(delhivery_data_v2, col)

In [95]:
univariate_analysis(delhivery_data_v2, col)

Inference

about:srcdoc Page 29 of 79
Delhivery (6) 19/08/22, 10:36 PM

More data are right skewed in OSRM distance attribute.


1003 outliers are detected in OSRM distance attribute.

Outliers Detection and Handling -


segment_actual_time
In [96]:
col = num_cols[5]
univariate_analysis(delhivery_data_v2, col)

In [97]:
detect_outliers_group(delhivery_data_v2, col)

about:srcdoc Page 30 of 79
Delhivery (6) 19/08/22, 10:36 PM

Out[97]: data trip_uuid trip_creation_time route_type source_center

trip- 2018-09-27
33 test Carting IND395023AAD
153800830425161914 00:31:44.251864

trip- 2018-09-27
67 test Carting IND396191AAC Vapi_IndEsta
153801182428197910 01:30:24.282255

trip- 2018-09-27 Bhiwandi_


72 test FTL IND421302AAG
153801202166779166 01:33:41.668026

trip- 2018-09-27 Bengaluru_Bo


150 test Carting IND560099AAB
153802092485997218 04:02:04.860231

trip- 2018-09-27 Ahmedab


176 test Carting IND382430AAB
153802358988013009 04:46:29.880386

... ... ... ... ... ...

trip- 2018-09-26 Kollam_


14710 training Carting IND691001AAB
153799857684519878 21:49:36.845451

trip- 2018-09-26
14751 training Carting IND110037AAK Delhi_Kapshe
153800194935773399 22:45:49.357982

trip- 2018-09-26 Muzaffrp


14775 training Carting IND842001AAA
153800360857000808 23:13:28.570233

trip- 2018-09-26 Bengaluru_


14789 training Carting IND560067AAC
153800448393383196 23:28:03.934090

trip- 2018-09-26
14811 training Carting IND201307AAA
153800571655403292 23:48:36.554285

785 rows × 36 columns

In [98]:
handle_outliers_group(delhivery_data_v2, col)

In [99]:
univariate_analysis(delhivery_data_v2, col)

Inference

about:srcdoc Page 31 of 79
Delhivery (6) 19/08/22, 10:36 PM

More data are right skewed in segment actual time attribute.


870 outliers are detected in segment actual time attribute.

Outliers Detection and Handling -


segment_osrm_time
In [100…
col = num_cols[6]
univariate_analysis(delhivery_data_v2, col)

In [101…
detect_outliers_group(delhivery_data_v2, col)

about:srcdoc Page 32 of 79
Delhivery (6) 19/08/22, 10:36 PM

Out[101… data trip_uuid trip_creation_time route_type source_center

trip- 2018-09-27 Kanpur_Cent


8 test Carting IND209304AAA
153800688276350851 00:08:02.763752 (Uttar Pr

trip- 2018-09-27 Vapi_Ind


67 test Carting IND396191AAC
153801182428197910 01:30:24.282255

trip- 2018-09-27 Dehradun_Se


126 test FTL IND248197AAA
153801859365729551 03:23:13.657578

trip- 2018-09-27 Pune_Tatha


145 test FTL IND411033AAA
153802053032411379 03:55:30.324382 (Mahar

trip- 2018-09-27 Pune_Tatha


161 test Carting IND411033AAA
153802191796534285 04:18:37.965599 (Mahar

... ... ... ... ... ...

trip- 2018-09-26 Mumbai An


14781 training Carting IND400037AAA
153800419141788066 23:23:11.418113 (Mahar

trip- 2018-09-26 Muzaffrpur_B


14795 training FTL IND842001AAA
153800486451898339 23:34:24.519194

trip- 2018-09-26 Noida_Sec 0


14804 training FTL IND201301AAF
153800536221106758 23:42:42.211304 (Uttar Pr

trip- 2018-09-26 Kolkata_Dank


14807 training Carting IND712311AAA
153800550884418718 23:45:08.844409 (West B

trip- 2018-09-26 Mainpuri_Agr


14815 training FTL IND205001AAB
153800605670819251 23:54:16.708455 (Uttar Pr

855 rows × 36 columns

In [102…
handle_outliers_group(delhivery_data_v2, col)

In [103…
univariate_analysis(delhivery_data_v2, col)

Inference

about:srcdoc Page 33 of 79
Delhivery (6) 19/08/22, 10:36 PM

More data are right skewed in segment OSRm time attribute.

Outliers Detection and Handling -


segment_osrm_distance
In [104…
col = num_cols[6]
univariate_analysis(delhivery_data_v2, col)

In [105…
detect_outliers_group(delhivery_data_v2, col)

about:srcdoc Page 34 of 79
Delhivery (6) 19/08/22, 10:36 PM

Out[105… data trip_uuid trip_creation_time route_type source_center

trip- 2018-09-27
210 test Carting IND500008AAC
153802898973535963 06:16:29.735723

trip- 2018-09-27 Durgapur_C


298 test FTL IND713205AAB
153804913581490167 11:52:15.815278

trip- 2018-09-27 Chandigar


335 test FTL IND160002AAC
153805947380180455 14:44:33.802060

trip- 2018-09-27 Visakhapatna


452 test Carting IND530012AAA
153807672514287673 19:32:05.143140

trip- 2018-09-27
473 test Carting IND400072AAB Mumbai Hub
153807813360229446 19:55:33.602541

... ... ... ... ... ...

trip- 2018-09-26
14306 training Carting IND411033AAA
153793840860911258 05:06:48.609360

trip- 2018-09-26
14338 training Carting IND500008AAC
153794472414216798 06:52:04.142401

trip- 2018-09-26 Manbazar


14416 training FTL IND723131AAA
153795866685428331 10:44:26.854538

trip- 2018-09-26
14468 training FTL IND411033AAA
153797075209653066 14:05:52.096792

trip- 2018-09-26 Chandigar


14697 training FTL IND160002AAC
153799728722932020 21:28:07.229557

216 rows × 36 columns

In [106…
handle_outliers_group(delhivery_data_v2, col)

In [107…
univariate_analysis(delhivery_data_v2, col)

Inference

about:srcdoc Page 35 of 79
Delhivery (6) 19/08/22, 10:36 PM

More data are right skewed in segment OSRm distance attribute.


1017 outliers are detected in segment OSRM distance attribute.

Visual Univariate Analysis - Categorical


Variables
In [108…
for i in range(len(cat_cols)):
fig, ax = plt.subplots(1, 2, figsize = (12, 4))
plt.suptitle(cat_cols[i], fontsize = fontsize, fontweight = fontweight)
uni_barplot(delhivery_data_v2, cat_cols[i], ax[0], { "nolabel": True })
uni_pieplot(delhivery_data_v2, cat_cols[i], ax[1], { "nolabel": True })

Inference

60% of deliveries used carts.

about:srcdoc Page 36 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [109…
sourcestate10 = delhivery_data_v2["source_state"].value_counts()[0:10]
destinationstate10 = delhivery_data_v2["destination_state"].value_counts()[

fig, ax = plt.subplots(1,2,figsize=(16,5))

sns.barplot(x = np.linspace(0,1,10), y = sourcestate10.values, data = sourcestat


ax[0].set_xticklabels(sourcestate10.index,rotation=45)
ax[0].set_title("Source State")

sns.barplot(x = np.linspace(0,1,10), y = destinationstate10.values, data =destin


ax[1].set_xticklabels(destinationstate10.index,rotation=45)
ax[1].set_title("Destination State")

plt.suptitle("The Top 10 Source and Destination States")


plt.show()

Inference
Haryana, Maharastra and Karnataka are the popular source and destination states.

Trip creation month


In [110…
col = 'trip_creation_month'
fig, ax = plt.subplots(1, 2, figsize = (18, 6))
plt.suptitle('Distribution of Trip Month', fontsize = FONTSIZE, fontweight
uni_barplot(delhivery_data_v2, col, ax[0], { "nolabel": True })
uni_pieplot(delhivery_data_v2, col, ax[1], { "nolabel": True })

about:srcdoc Page 37 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference
The trips are recorded only for the months of September and October. The recording
perhaps stopped after that. So we do not analyse further on the basis of month.

Trip creation week


In [111…
delhivery_data_v2["trip_creation_dayofweek"] = delhivery_data_v2["trip_creation_
sns.countplot(x = "trip_creation_dayofweek",data=delhivery_data_v2,order=['Mon'
plt.title("Distribution of trips on each day of week")
plt.show()

Inference
So we see that maximum number of trips are happening on Wednesday and minimum on
Sunday.

Trip creation Hour

about:srcdoc Page 38 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [112…
sns.distplot(delhivery_data_v2["trip_creation_hour"])
plt.title("Distribution of Trip Hour")
plt.show()

Inference
So, we observe a kind of bimodal distribution with minimum trips occuring during the day
hours (8 AM to 1 PM) and maximum occuring during late night or early morning hours (8
PM to 2 AM).

Visual Bivariate Analysis - Numerical


Variables
In [113…
for col in num_cols:
fig, ax = plt.subplots(1, 2, figsize = (18, 4))
plt.suptitle(col, fontsize = fontsize, fontweight = fontweight)
boxplot_bicol(delhivery_data_v2,"data",col,ax[0])
boxplot_bicol(delhivery_data_v2,"route_type",col,ax[1])

WARNING:matplotlib.font_manager:findfont: Font family ['Comic Sans MS'] not


found. Falling back to DejaVu Sans.

about:srcdoc Page 39 of 79
Delhivery (6) 19/08/22, 10:36 PM

about:srcdoc Page 40 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [114…
for col in num_cols:
fig, ax = plt.subplots(1, 2, figsize = (18, 4))
plt.suptitle(col, fontsize = fontsize, fontweight = fontweight)
pointplot(delhivery_data_v2,"data",col,"route_type",'',ax[0])
pointplot(delhivery_data_v2,"route_type",col,"data",'',ax[1])

#pointplot(yulu_data_v1,"weather","count","season",'Count of booking across ea

about:srcdoc Page 41 of 79
Delhivery (6) 19/08/22, 10:36 PM

about:srcdoc Page 42 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference

So we see that the time taken by full truck load deliveries is on average, a lot higher
(>300 hours) than the cart deliveries (<100 hours).

The full truck load deliveries cover much longer distances onaverage (>150 kms)
than carting deliveries (~ 25 kms)

Time and distances follow similar trends against the hour of the day. Maximum time
and distance deliveries are likely to be made during peak morning hours of 10 AM to
12 PM as well as 5 PM, 7 PM and 1 AM.

In [115…
plt.figure(figsize = (8, 5))
sns.color_palette("pastel")
sns.heatmap(delhivery_data_v2[num_cols].corr(), annot=True, vmin=-1, vmax =
plt.show()

about:srcdoc Page 43 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference
So we see that certain fields are highly correlated :

cut-off factor : osrm_time, actual_time, osrm_distance,


actual_distance_to_destination, start_scan_to_end_scan.

start_scan_to_end_scan : osrm_time, actual_time, osrm_distance,


actual_distance_to_destination.

osrm_time, actual_time, osrm_distance, actual_distance_to_destination are all highly


correlated to each other, which is expected because distance will effect time, and
osrm calculation will be somewhat close to actual (even if not perfect).

segment_osrm_time and segment_osrm_distance are also highly correlated as


expected.

we see poor correlation between segment_actual_time and segment_osrm_time


(even though overall actual_time and osrm_time are highly correlated).

In [116…
sns.color_palette("pastel")
sns.pairplot(delhivery_data_v2[num_cols])
plt.show()

about:srcdoc Page 44 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference

All the numerical attributes are linearly related with each other.

Visual Bivariate Analysis - Categorical


Variables
In [117…
fig , ax = plt.subplots(1,2,figsize=(15,5))

countplot(delhivery_data_v2,"data","route_type",ax[0])
countplot(delhivery_data_v2,"route_type","data", ax[1])

plt.show()

about:srcdoc Page 45 of 79
Delhivery (6) 19/08/22, 10:36 PM

Route Type Distributions for Top 3 Source


States
In [118…
top3s = delhivery_data_v2[(delhivery_data_v2["source_state"]=='Maharashtra'
top3s = top3s[['route_type','source_state']]
st = ['Maharashtra','Karnataka','Haryana']
g = sns.countplot(x='source_state',hue='route_type', data=top3s, order = st
percx = []

for e in st:
percx.append(top3s[(top3s['source_state']==e)&(top3s["route_type"]=="Carting"
for e in st:
percx.append(top3s[(top3s['source_state']==e)&(top3s["route_type"]=="FTL"

i=0
for p in g.patches:
txt = str((round(percx[i]*100))) + '%'
txt_x = p.get_x()
txt_y = p.get_height()
g.text(txt_x+0.1,txt_y,txt)
i+=1
plt.show()

about:srcdoc Page 46 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference
So we see that for top 3 source states,

Maharashtra hs 85% Carting and 15% FTL,

Karnataka has 88% Carting and 12% FTL,

Haryana has 75% Carting and 25% FTL.

Route Type Distributions for Top 3


Destination States
In [119…
top3s = delhivery_data_v2[(delhivery_data_v2["destination_state"]=='Maharashtra'
top3s = top3s[['route_type','destination_state']]
st = ['Maharashtra','Karnataka','Haryana']
g = sns.countplot(x='destination_state',hue='route_type', data=top3s, order
percx = []

for e in st:
percx.append(top3s[(top3s['destination_state']==e)&(top3s["route_type"]==
for e in st:
percx.append(top3s[(top3s['destination_state']==e)&(top3s["route_type"]==

i=0
for p in g.patches:
txt = str((round(percx[i]*100))) + '%'
txt_x = p.get_x()
txt_y = p.get_height()
g.text(txt_x+0.1,txt_y,txt)
i+=1
plt.show()

about:srcdoc Page 47 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference
So we see that for top 3 destination states,

Maharashtra has 86% Carting and 14% FTL,

Karnataka has 86% Carting and 14% FTL,

Haryana has 81% Carting and 19% FTL.

Appropriate test to check whether


"Compare the difference between the time
taken between od_start_time/od_end_time
and start_scan_to_end_scan"
Statistical Hypothesis Test - Pearson’s Correlation
Coefficient
Step 1 - Define Null and Alternate Hypothesis

Null Hypothesis (H0) : The two samples are independent.


Alternate Hyphothesis (Ha) : There is a dependency between the samples.
Significance Level (alpha) : 0.05

Step 2 - Validate the assumptions

about:srcdoc Page 48 of 79
Delhivery (6) 19/08/22, 10:36 PM

Observations in each sample are independent and identically distributed (iid).


Observations in each sample are normally distributed.
Observations in each sample have the same variance.

Normality check of the data


Histogram and QQ-Plots

In [120…
fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(delhivery_data_v2['od_time_taken'],"Time taken between od_start_time/od


qqplot(delhivery_data_v2['od_time_taken'], "qqplot for Time taken between od_sta

histplot(delhivery_data_v2['start_scan_to_end_scan'],"Time taken to deliver from


qqplot(delhivery_data_v2['start_scan_to_end_scan'], "qqplot for Time taken to de

Applying log on the data - Log Normal Distribution

In [121…
fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(np.log(delhivery_data_v2['od_time_taken']),"Time taken between od_start


qqplot(np.log(delhivery_data_v2['od_time_taken']), "qqplot for Time taken betwee

histplot(np.log(delhivery_data_v2['start_scan_to_end_scan']),"Time taken to deli


qqplot(np.log(delhivery_data_v2['start_scan_to_end_scan']), "qqplot for Time tak

about:srcdoc Page 49 of 79
Delhivery (6) 19/08/22, 10:36 PM

Applying BoxCox Distribution

In [122…
fitted_od_time_taken,lmbda = stats.boxcox(delhivery_data_v2['od_time_taken'
fitted_start_scan_to_end_scan,lmbda = stats.boxcox(abs(delhivery_data_v2['start_

fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(fitted_od_time_taken,"Time taken between od_start_time/od_end_time"


qqplot(fitted_od_time_taken, "qqplot for Time taken between od_start_time/od_end

histplot(fitted_start_scan_to_end_scan,"Time taken to deliver from source to des


qqplot(fitted_start_scan_to_end_scan, "qqplot for Time taken to deliver from sou

about:srcdoc Page 50 of 79
Delhivery (6) 19/08/22, 10:36 PM

Variance check of the data


Anderson-Darling Test

In [123…
anderson(fitted_od_time_taken)

AndersonResult(statistic=28.82281265821257, critical_values=array([0.576, 0
Out[123…
.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2
.5, 1. ]))

In [124…
anderson(fitted_start_scan_to_end_scan)

AndersonResult(statistic=23.295076091240844, critical_values=array([0.576,
Out[124…
0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. ,
2.5, 1. ]))

Step 3 - Pearson’s Correlation Coefficient


In [125…
stat, p_value = pearsonr(fitted_od_time_taken, fitted_start_scan_to_end_scan

Step 4 - Check p-value with siginificance level


In [126…
if p_value <= 0.05:
print("Reject NULL Hypothesis")
else:
print("Failed to Reject NULL Hypothesis")

Reject NULL Hypothesis

about:srcdoc Page 51 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference

Since P-Value of this test lies below 0.05, Then we can safely reject the null
hypothesis and conclude od_start_time / od_end_time and start_scan_to_end_scan
attribute are dependent on each other

Visual Analysis
In [127…
sns.distplot(delhivery_data_v2["start_scan_to_end_scan"], label="start_scan_to_e
sns.distplot(delhivery_data_v2["od_time_taken"], label="od_time_taken")

plt.legend()
plt.show()

In [128…
sns.scatterplot(data = delhivery_data_v2, x = 'od_time_taken', y = 'start_scan_t
plt.show()

about:srcdoc Page 52 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [129…
sns.pointplot(data = delhivery_data_v2, x = 'od_time_taken', y = 'start_scan_to_

<matplotlib.axes._subplots.AxesSubplot at 0x7f83db6eaad0>
Out[129…

Appropriate test to check whether


"Compare the difference between
actual_time aggregated value and OSRM
time aggregated value"
Statistical Hypothesis Test - Pearson’s Correlation
Coefficient
Step 1 - Define Null and Alternate Hypothesis

Null Hypothesis (H0) : The two samples are independent.


Alternate Hyphothesis (Ha) : There is a dependency between the samples.
Significance Level (alpha) : 0.05

Step 2 - Validate the assumptions

Observations in each sample are independent and identically distributed (iid).


Observations in each sample are normally distributed.
Observations in each sample have the same variance.

Normality check of the data


Histogram and QQ-Plots

about:srcdoc Page 53 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [130…
fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(delhivery_data_v2['actual_time'],"Actual Time",ax[0][0])
qqplot(delhivery_data_v2['actual_time'], "qqplot for Actual Time", ax[0][1])

histplot(delhivery_data_v2['osrm_time'],"OSRM Time",ax[1][0])
qqplot(delhivery_data_v2['osrm_time'], "qqplot for OSRM Time", ax[1][1])

Applying log on the data - Log Normal Distribution

In [131…
fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(np.log(delhivery_data_v2['actual_time']),"Actual Time",ax[0][0])
qqplot(np.log(delhivery_data_v2['actual_time']), "qqplot for Actual Time",

histplot(np.log(delhivery_data_v2['osrm_time']),"OSRM time",ax[1][0])
qqplot(np.log(delhivery_data_v2['osrm_time']), "qqplot for OSRM Time", ax[1

about:srcdoc Page 54 of 79
Delhivery (6) 19/08/22, 10:36 PM

Applying BoxCox Distribution

In [132…
fitted_actual_time,lmbda = stats.boxcox(delhivery_data_v2['actual_time'])
fitted_osrm_time,lmbda = stats.boxcox(delhivery_data_v2['osrm_time'])

fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(fitted_actual_time,"Actual Time",ax[0][0])
qqplot(fitted_actual_time, "qqplot for Actual Time", ax[0][1])

histplot(fitted_osrm_time,"OSRM time",ax[1][0])
qqplot(fitted_osrm_time, "qqplot for OSRM time", ax[1][1])

about:srcdoc Page 55 of 79
Delhivery (6) 19/08/22, 10:36 PM

Variance check of the data


Anderson-Darling Test

In [133…
anderson(fitted_actual_time)

AndersonResult(statistic=29.222329256357625, critical_values=array([0.576,
Out[133…
0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. ,
2.5, 1. ]))

In [134…
anderson(fitted_osrm_time)

AndersonResult(statistic=12.061248831811099, critical_values=array([0.576,
Out[134…
0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. ,
2.5, 1. ]))

Step 3 - Pearson’s Correlation Coefficient


In [135…
stat, p_value = pearsonr(fitted_actual_time, fitted_osrm_time)

Step 4 - Check p-value with siginificance level


In [136…
if p_value <= 0.05:
print("Reject NULL Hypothesis")
else:
print("Failed to Reject NULL Hypothesis")

Reject NULL Hypothesis

about:srcdoc Page 56 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference

Since P-Value of this test lies below 0.05, Then we can safely reject the null
hypothesis and conclude actual time and OSRM time attribute are dependent on
each other.

Visual Analysis
In [137…
sns.distplot(delhivery_data_v2["actual_time"], label="actual_time")
sns.distplot(delhivery_data_v2["osrm_time"], label="osrm_time")

plt.legend()
plt.show()

In [138…
sns.scatterplot(data = delhivery_data_v2, x = 'actual_time', y = 'osrm_time'
plt.show()

about:srcdoc Page 57 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [139…
sns.pointplot(data = delhivery_data_v2, x = 'actual_time', y = 'osrm_time')

<matplotlib.axes._subplots.AxesSubplot at 0x7f83bf405d10>
Out[139…

Appropriate test to check whether


"Compare the difference between
actual_time aggregated value and
segment actual time aggregated value"
Statistical Hypothesis Test - Pearson’s Correlation
Coefficient
Step 1 - Define Null and Alternate Hypothesis

Null Hypothesis (H0) : The two samples are independent.


Alternate Hyphothesis (Ha) : There is a dependency between the samples.
Significance Level (alpha) : 0.05

Step 2 - Validate the assumptions

Observations in each sample are independent and identically distributed (iid).


Observations in each sample are normally distributed.
Observations in each sample have the same variance.

Normality check of the data


Histogram and QQ-Plots

about:srcdoc Page 58 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [140…
fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(delhivery_data_v2['actual_time'],"Actual Time",ax[0][0])
qqplot(delhivery_data_v2['actual_time'], "qqplot for Actual Time", ax[0][1])

histplot(delhivery_data_v2['segment_actual_time'],"Segment Actual Time",ax[


qqplot(delhivery_data_v2['segment_actual_time'], "qqplot for Segment Actual Time

Applying log on the data - Log Normal Distribution

In [141…
fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(np.log(abs(delhivery_data_v2['actual_time'])),"Actual Time",ax[0][
qqplot(np.log(abs(delhivery_data_v2['actual_time'])), "qqplot for Actual Time"

histplot(np.log(abs(delhivery_data_v2['segment_actual_time'])),"Segment Actual T
qqplot(np.log(abs(delhivery_data_v2['segment_actual_time'])), "qqplot for Segmen

about:srcdoc Page 59 of 79
Delhivery (6) 19/08/22, 10:36 PM

Applying BoxCox Distribution

In [142…
fitted_actual_time,lmbda = stats.boxcox(delhivery_data_v2['actual_time'])
fitted_segment_actual_time,lmbda = stats.boxcox(delhivery_data_v2['segment_actua

fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(fitted_actual_time,"Actual Time",ax[0][0])
qqplot(fitted_actual_time, "qqplot for Actual Time", ax[0][1])

histplot(fitted_segment_actual_time,"Segment Actual Time",ax[1][0])


qqplot(fitted_segment_actual_time, "qqplot for Segment Actual Time", ax[1][

about:srcdoc Page 60 of 79
Delhivery (6) 19/08/22, 10:36 PM

Variance check of the data


Anderson-Darling Test

In [143…
anderson(fitted_actual_time)

AndersonResult(statistic=29.222329256357625, critical_values=array([0.576,
Out[143…
0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. ,
2.5, 1. ]))

In [144…
anderson(fitted_segment_actual_time)

AndersonResult(statistic=39.9044356195227, critical_values=array([0.576, 0.
Out[144…
656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.
5, 1. ]))

Step 3 - Pearson’s Correlation Coefficient


In [145…
stat, p_value = pearsonr(fitted_actual_time, fitted_segment_actual_time)

Step 4 - Check p-value with siginificance level


In [146…
if p_value <= 0.05:
print("Reject NULL Hypothesis")
else:
print("Failed to Reject NULL Hypothesis")

Reject NULL Hypothesis

about:srcdoc Page 61 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference

Since P-Value of this test lies below 0.05, Then we can safely reject the null
hypothesis and conclude actual time and Segment actual time attribute are
dependent on each other.

Visual Analysis
In [147…
sns.distplot(delhivery_data_v2["actual_time"], label="actual_time")
sns.distplot(delhivery_data_v2["segment_actual_time"], label="segment_actual_tim

plt.legend()
plt.show()

In [148…
sns.scatterplot(data = delhivery_data_v2, x = 'actual_time', y = 'segment_actual
plt.show()

about:srcdoc Page 62 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [149…
sns.pointplot(data = delhivery_data_v2, x = 'actual_time', y = 'segment_actual_t

<matplotlib.axes._subplots.AxesSubplot at 0x7f83bc801b50>
Out[149…

Appropriate test to check whether


"Compare the difference between osrm
distance aggregated value and segment
osrm distance aggregated value"
Statistical Hypothesis Test - Pearson’s Correlation
Coefficient
Step 1 - Define Null and Alternate Hypothesis

Null Hypothesis (H0) : The two samples are independent.


Alternate Hyphothesis (Ha) : There is a dependency between the samples.
Significance Level (alpha) : 0.05

Step 2 - Validate the assumptions

Observations in each sample are independent and identically distributed (iid).


Observations in each sample are normally distributed.
Observations in each sample have the same variance.

Normality check of the data


Histogram and QQ-Plots

about:srcdoc Page 63 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [150…
fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(delhivery_data_v2['osrm_distance'],"OSRM Distance",ax[0][0])
qqplot(delhivery_data_v2['osrm_distance'], "qqplot for OSRM Distance", ax[0

histplot(delhivery_data_v2['segment_osrm_distance'],"Segment OSRM Distance"


qqplot(delhivery_data_v2['segment_osrm_distance'], "qqplot for Segment OSRM Dist

Applying log on the data - Log Normal Distribution

In [151…
fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(np.log(delhivery_data_v2['osrm_distance']),"OSRM Distance",ax[0][0
qqplot(np.log(delhivery_data_v2['osrm_distance']), "qqplot for OSRM Distance"

histplot(np.log(delhivery_data_v2['segment_osrm_distance']),"Segment OSRM Distan


qqplot(np.log(delhivery_data_v2['segment_osrm_distance']), "qqplot for Segment O

about:srcdoc Page 64 of 79
Delhivery (6) 19/08/22, 10:36 PM

Applying BoxCox Distribution

In [152…
fitted_osrm_distance,lmbda = stats.boxcox(delhivery_data_v2['osrm_distance'
fitted_segment_osrm_distance,lmbda = stats.boxcox(delhivery_data_v2['segment_osr

fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(fitted_osrm_distance,"OSRM Distance",ax[0][0])
qqplot(fitted_osrm_distance, "qqplot for OSRM Distance", ax[0][1])

histplot(fitted_segment_osrm_distance,"Segment OSRM Distance",ax[1][0])


qqplot(fitted_segment_osrm_distance, "qqplot for Segment OSRM Distance", ax

about:srcdoc Page 65 of 79
Delhivery (6) 19/08/22, 10:36 PM

Variance check of the data


Anderson-Darling Test

In [153…
anderson(fitted_osrm_distance)

AndersonResult(statistic=19.994770089560916, critical_values=array([0.576,
Out[153…
0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. ,
2.5, 1. ]))

In [154…
anderson(fitted_segment_osrm_distance)

AndersonResult(statistic=70.3080027928263, critical_values=array([0.576, 0.
Out[154…
656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.
5, 1. ]))

Step 3 - Pearson’s Correlation Coefficient


In [155…
stat, p_value = pearsonr(fitted_osrm_distance, fitted_segment_osrm_distance

Step 4 - Check p-value with siginificance level


In [156…
if p_value <= 0.05:
print("Reject NULL Hypothesis")
else:
print("Failed to Reject NULL Hypothesis")

Reject NULL Hypothesis

about:srcdoc Page 66 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference

Since P-Value of this test lies below 0.05, Then we can safely reject the null
hypothesis and conclude OSRM distance and Segment OSRM distance attribute are
dependent on each other.

Visual Analysis
In [157…
sns.distplot(delhivery_data_v2["osrm_distance"], label="osrm_distance")
sns.distplot(delhivery_data_v2["segment_osrm_distance"], label="segment_osrm_dis

plt.legend()
plt.show()

In [158…
sns.scatterplot(data = delhivery_data_v2, x = 'osrm_distance', y = 'segment_osrm
plt.show()

about:srcdoc Page 67 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [159…
sns.pointplot(data = delhivery_data_v2, x = 'osrm_distance', y = 'segment_osrm_d

<matplotlib.axes._subplots.AxesSubplot at 0x7f83b8761250>
Out[159…

Appropriate test to check whether


"Compare the difference between osrm
time aggregated value and segment osrm
time aggregated value"
Statistical Hypothesis Test - Pearson’s Correlation
Coefficient
Step 1 - Define Null and Alternate Hypothesis

Null Hypothesis (H0) : The two samples are independent.


Alternate Hyphothesis (Ha) : There is a dependency between the samples.
Significance Level (alpha) : 0.05

Step 2 - Validate the assumptions

Observations in each sample are independent and identically distributed (iid).


Observations in each sample are normally distributed.
Observations in each sample have the same variance.

Normality check of the data


Histogram and QQ-Plots

about:srcdoc Page 68 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [160…
fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(delhivery_data_v2['osrm_time'],"OSRM Time",ax[0][0])
qqplot(delhivery_data_v2['osrm_time'], "qqplot for OSRM Time", ax[0][1])

histplot(delhivery_data_v2['segment_osrm_time'],"Segment OSRM Time",ax[1][0


qqplot(delhivery_data_v2['segment_osrm_time'], "qqplot for Segment OSRM Time"

Applying log on the data - Log Normal Distribution

In [161…
fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(np.log(delhivery_data_v2['osrm_time']),"OSRM Time",ax[0][0])
qqplot(np.log(delhivery_data_v2['osrm_time']), "qqplot for OSRM Time", ax[0

histplot(np.log(delhivery_data_v2['segment_osrm_time']),"Segment OSRM Time"


qqplot(np.log(delhivery_data_v2['segment_osrm_time']), "qqplot for Segment OSRM

about:srcdoc Page 69 of 79
Delhivery (6) 19/08/22, 10:36 PM

Applying BoxCox Distribution

In [162…
fitted_osrm_time,lmbda = stats.boxcox(delhivery_data_v2['osrm_time'])
fitted_segment_osrm_time,lmbda = stats.boxcox(delhivery_data_v2['segment_osrm_ti

fig , ax = plt.subplots(2,2,figsize=(20,12))

histplot(fitted_osrm_time,"OSRM Time",ax[0][0])
qqplot(fitted_osrm_time, "qqplot for OSRM Time", ax[0][1])

histplot(fitted_segment_osrm_time,"Segment OSRM Time",ax[1][0])


qqplot(fitted_segment_osrm_time, "qqplot for Segment OSRM Time", ax[1][1])

about:srcdoc Page 70 of 79
Delhivery (6) 19/08/22, 10:36 PM

Variance check of the data


Anderson-Darling Test

In [163…
anderson(fitted_osrm_time)

AndersonResult(statistic=12.061248831811099, critical_values=array([0.576,
Out[163…
0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. ,
2.5, 1. ]))

In [164…
anderson(fitted_segment_osrm_time)

AndersonResult(statistic=55.687229280645624, critical_values=array([0.576,
Out[164…
0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. ,
2.5, 1. ]))

Step 3 - Pearson’s Correlation Coefficient


In [165…
stat, p_value = pearsonr(fitted_osrm_time, fitted_segment_osrm_time)

Step 4 - Check p-value with siginificance level


In [166…
if p_value <= 0.05:
print("Reject NULL Hypothesis")
else:
print("Failed to Reject NULL Hypothesis")

Reject NULL Hypothesis

about:srcdoc Page 71 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference

Since P-Value of this test lies below 0.05, Then we can safely reject the null
hypothesis and conclude OSRM time and Segment OSRM time attribute are
dependent on each other.

Visual Analysis
In [167…
sns.distplot(delhivery_data_v2["osrm_time"], label="osrm_time")
sns.distplot(delhivery_data_v2["segment_osrm_time"], label="segment_osrm_time"

plt.legend()
plt.show()

In [168…
sns.scatterplot(data = delhivery_data_v2, x = 'osrm_time', y = 'segment_osrm_tim
plt.show()

about:srcdoc Page 72 of 79
Delhivery (6) 19/08/22, 10:36 PM

In [169…
sns.pointplot(data = delhivery_data_v2, x = 'osrm_time', y = 'segment_osrm_time'

<matplotlib.axes._subplots.AxesSubplot at 0x7f83a6036290>
Out[169…

Normalize/ Standardize the numerical


features using MinMaxScaler or
StandardScaler
In [170…
num_cols = ["od_time_taken","start_scan_to_end_scan", "actual_distance_to_destin

Standard Scaling
In [171…
scaler = preprocessing.StandardScaler()
standard_df = scaler.fit_transform(delhivery_data_v2[num_cols])
delhivery_data_v4 = pd.DataFrame(standard_df)
delhivery_data_v4.columns = num_cols

Min-Max Scaling
In [172…
scaler = preprocessing.MinMaxScaler()
minmax_df = scaler.fit_transform(delhivery_data_v2[num_cols])
delhivery_data_v5 = pd.DataFrame(minmax_df)
delhivery_data_v5.columns = num_cols

In [173…
num_cols

about:srcdoc Page 73 of 79
Delhivery (6) 19/08/22, 10:36 PM

['od_time_taken',
Out[173…
'start_scan_to_end_scan',
'actual_distance_to_destination',
'actual_time',
'osrm_time',
'osrm_distance',
'segment_actual_time',
'segment_osrm_time',
'segment_osrm_distance']

In [174…
fig, (ax1, ax2, ax3) = plt.subplots(ncols = 3, figsize =(20, 5))
ax1.set_title('Before Scaling')

sns.kdeplot(delhivery_data_v2['od_time_taken'], ax = ax1, color ='r')


sns.kdeplot(delhivery_data_v2['start_scan_to_end_scan'], ax = ax1, color ='b'
ax2.set_title('After Standard Scaling')

sns.kdeplot(delhivery_data_v4['od_time_taken'], ax = ax2, color ='red')


sns.kdeplot(delhivery_data_v4['start_scan_to_end_scan'], ax = ax2, color ='blue'
ax3.set_title('After Min-Max Scaling')

sns.kdeplot(delhivery_data_v5['od_time_taken'], ax = ax3, color ='black')


sns.kdeplot(delhivery_data_v5['start_scan_to_end_scan'], ax = ax3, color ='g'
plt.show()

In [175…
for col in num_cols:
fig, ax = plt.subplots(1, 3, figsize = (18, 4))
plt.suptitle(col, fontsize = fontsize, fontweight = fontweight)
ax[0].set_title('Before Scaling')
sns.kdeplot(delhivery_data_v2[col], ax = ax[0], color ='red')
ax[1].set_title('After Standard Scaling')
sns.kdeplot(delhivery_data_v4[col], ax = ax[1], color ='blue')
ax[2].set_title('After Min-Max Scaling')
sns.kdeplot(delhivery_data_v5[col], ax = ax[2], color ='green')

plt.show()

about:srcdoc Page 74 of 79
Delhivery (6) 19/08/22, 10:36 PM

about:srcdoc Page 75 of 79
Delhivery (6) 19/08/22, 10:36 PM

Inference

After normalization, All the numerical attributes got to a similiar scale ranges from 0
to 1
After standardization, It translates the data to the mean vector of original data to the
origin and squishes or expands.

Business Insights
1,44,867 number of records and 17 attributes are present in this dataset.

Source and destination name attributes are having small number of missing values.

Labels are also inconsitent in source and destination name attributes.

All the numerical attributes mean and median values are not close to each other
which clearly indicates data is not normally distributed.

Also the range of numerical attributes are widely distributed which shows there

about:srcdoc Page 76 of 79
Delhivery (6) 19/08/22, 10:36 PM

might be some outliers present in the data.

Min value of segment actual time is -244.

More number of FTL route types are present in raw data but we cannot conclude
before aggregating the rows.

Gurgaon_Bilaspur_HB (Haryana) seems to be most popular source and destination


center.

Average time taken to deliver from source to destination are relatively higher where
the number of transits are high.

Most of the one way parcels are likely to have least number of transits.

More number of round trips takes more than 2 transits to deliver.

All the numerical attributes are rightly skewed and requires some treatment.

Number of trips are dropping extensively in recent days.

60% of deliveries used carts.

Gurgaon_Bilaspur_HB (Haryana), Bhiwandi_Mankoli_HB (Maharastra) and


Bangalore_Nelmngla_H (Karnataka) are the most popular source centers.

Gurgaon_Bilaspur_HB (Haryana), Bangalore_Nelmngla_H (Karnataka) and


Bhiwandi_Mankoli_HB (Maharastra) are the most popular destination centers.

All these data captured in September and October 2018.

New feature Trip creation year, month, day, date and time information are extracted
from trip creation time attribute.

City, Place and Area information are extracted from both source and destination
name attribute.

New feature OD time taken is calculated based on the difference between OD Start
time and OD end time.

Almost all the numerical attributes are strongly linearly correlated with each other.

In all months, Carts are highly used compared to full truck loads.

More number of parcels are started in september compared to october.

More number of parcels are delivered in september compared to october.

about:srcdoc Page 77 of 79
Delhivery (6) 19/08/22, 10:36 PM

Karnataka, Maharastra, Tamilnadu, Haryana are the top states from where the
parcels are originated.

From Karnataka, Maharastra, Haryana and Tamilnadu, More number of parcels are
sent in carts compared to full truck loads.

Karnataka, Maharastra, Tamilnadu, Haryana and Telangana are the top states to
where the parcels are delivered.

In Karnataka, Maharastra, Haryana and Tamilnadu, More number of parcels are


delivered by carts compared to full truck loads.

Karnataka, Maharastra, Tamilnadu, Haryana and Telangana are the top states
involved in more number of trips.

In Karnataka, Maharastra and Haryana, One way parcels are more preferred.

In Tamilnadu and West Bengal, One way and round way type of parcels are equally
likely used.

OD_start_time / OD_end_time and start scan to end scan attribute are closely related
with each other

Actual time and OSRM time attribute are closely related with each other

Actual time and Segment actual time attribute are closely related with each other

OSRM distance and Segment OSRM distance attribute are closely related with each
other

OSRM time and Segment OSRM time attribute are closely related with each other

Since P-Value of this test lies below 0.05, Then we can safely reject the null
hypothesis and conclude OSRM time and Segment OSRM time attribute are
dependent on each other.

Since P-Value of this test lies below 0.05, Then we can safely reject the null
hypothesis and conclude OSRM distance and Segment OSRM distance attribute are
dependent on each other.

Since P-Value of this test lies below 0.05, Then we can safely reject the null
hypothesis and conclude actual time and Segment actual time attribute are
dependent on each other.

Since P-Value of this test lies below 0.05, Then we can safely reject the null
hypothesis and conclude actual time and OSRM time attribute are dependent on
each other.

about:srcdoc Page 78 of 79
Delhivery (6) 19/08/22, 10:36 PM

Since P-Value of this test lies below 0.05, Then we can safely reject the null
hypothesis and conclude od_start_time / od_end_time and start scan to end scan
attribute are dependent on each other

Recommendations
1. Delhivery company can increase their business by giving offers / discounts to busiest
corridor under busiest state

2. Delhivery company can increase their business by giving offers / discounts to route
type FTL

3. Delhivery company should focus more on Southern states as more parcels are
orginated and delivered

4. Delhivery company should plan to use shortest path from source to destination
center

about:srcdoc Page 79 of 79

You might also like