Delhivery Mani

Delhivery (6) 19/08/22, 10:36 PM
Brief Summary
Delhivery is the largest and fastest-growing fully integrated player in India by revenue in
Fiscal 2021. They aim to build the operating system for commerce, through a
combination of world-class infrastructure, logistics operations of the highest quality, and
cutting-edge engineering and technology capabilities.
The Data team builds intelligence and capabilities using this data that helps them to
widen the gap between the quality, efficiency, and profitability of their business versus
their competitors.
Problem Statement
The company wants to understand and process the data coming out of data engineering
pipelines:
• Clean, sanitize and manipulate data to get useful features out of raw fields
• Make sense out of the raw data and help the data science team to build forecasting
models on it
Import Libraries
In [1]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import warnings
import random
from scipy import stats
from scipy.stats import levene
from scipy.stats import shapiro
from scipy.stats import anderson
from scipy.stats import pearsonr
from sklearn import preprocessing
import statsmodels.api as sm
warnings.filterwarnings("ignore")
%matplotlib inline
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: Fu
tureWarning: pandas.util.testing is deprecated. Use the functions in the pu
blic API at pandas.testing instead.
import pandas.util.testing as tm
about:srcdoc Page 1 of 79
Delhivery (6) 19/08/22, 10:36 PM
Global Variables
In [2]:
DEFAULT_OPTIONS = { "mean": True, "mode": True, "title": True, "median": True
In [3]:
TITLE_FONT_WGT = "bold"
COLOR_PALETTE = sns.color_palette("ch:s=.25, rot=-.25")
FONTSIZE, FONTFAMILY, FONTWEIGHT = 12, "Comic Sans MS", "bold"
SMALL_FIGSIZE, SMALL_WIDE_FIGSIZE, MEDIUM_FIGSIZE, MEDIUM_TALL_FIGSIZE, LARGE_FI
Common Utilities
In [4]:
fontsize, fontfamily, fontweight = 12, "Comic Sans MS", "bold"
palette_color = sns.color_palette("ch:s=.25, rot=-.25")
In [5]:
def sort_values(df, ascending = False):
return df.sort_values(ascending = ascending)
In [6]:
def missing_values(df):
total_null_cnt = df.isnull().count()
null_in_col = sort_values(df.isnull().sum())
percent = sort_values(null_in_col / total_null_cnt * 100)
print("Total records = ", df.shape[0])
tab = pd.concat([null_in_col, percent.round(2)], axis = 1, keys = ['# of Missi
return tab
In [7]:
def univariate_analysis(df, col):
fig, ax = plt.subplots(1, 3, figsize = LARGE_FIGSIZE)
plt.suptitle(col, fontsize = FONTSIZE, fontweight = FONTWEIGHT)
uni_histplot(df, col, ax[0], "transits_count", { "nolabel": True })
bi_boxplot(df, "transits_count", col, ax[1], None, { "nolabel": True })
bi_scatterplot(df, "transits_count", col, ax[2], None, "transits_count",
In [8]:
def get_outliers_range(df, col):
q1 = df[col].quantile(0.25)
iqr = q3 - q1
outlier_left = q1 - 1.5 * iqr
outlier_right = q3 + 1.5 * iqr
return outlier_left, outlier_right
Delhivery (6) 19/08/22, 10:36 PM
In [9]:
def detect_outliers_group(df, col):
copied_index = df.index
base_parameters = ["route_type", "source_center", "destination_center", "trans
col_data = df[base_parameters + [col]]
outlier_left, outlier_right = get_outliers_range(col_data.groupby(base_paramet
outliers_range = pd.merge(outlier_left, outlier_right, how = "left", left_on
outliers_range.columns = ["outlier_left", "outlier_right"]
outliers_range = outliers_range.reset_index()
col_data = pd.merge(col_data, outliers_range, how = "left", left_on = base_pa
return df[(df[col] < col_data["outlier_left"]) | (df[col] > col_data["outlier_
In [10]:
def remove_outliers_group(df, col):
return df[(df[col] >= col_data["outlier_left"]) & (df[col] <= col_data["outlie
In [11]:
def handle_outliers_group(df, col):
for i in outliers_range.index:
outliers_range.loc[i, "impute_data"] = random.randint(int(outliers_range
bool_mask = (col_data[col] < col_data["outlier_left"]) | (col_data[col] >
col_data.loc[bool_mask, col] = col_data.loc[bool_mask, "impute_data"]
df[col] = col_data[col].abs()
df[col] = df[col].replace(0,0.1)
In [12]:
def detect_outliers(df, col):
iqr = q3 - q1
return df[ (df[col] <= outlier_left) | (df[col] >= outlier_right) ]
Delhivery (6) 19/08/22, 10:36 PM
In [13]:
def remove_outliers(df, col):
iqr = q3 - q1
return df[ (df[col] > outlier_left) & (df[col] < outlier_right) ]
In [14]:
def set_title(axis, label):
if (not label): return
axis.set_title(label, fontweight = TITLE_FONT_WGT)
In [15]:
def set_legend(axis, options):
if (not options): return
axis.legend(options)
In [16]:
def axv_line(df, axis, label, options):
if (not label): return
value = options.get("value")
color = options.get("color")
line_style = options.get("line_style")
axis.axvline(value, color = color, linestyle = line_style, label = label)
In [17]:
def add_meta_data(df, col, axis, title, options):
set_title(axis, title)
if (not options): return
if (options.get("mean")): axv_line(df, axis, { "label": "Mean", "value":
if (options.get("mode")): axv_line(df, axis, { "label": "Mode", "value":
if (options.get("median")): axv_line(df, axis, { "label": "Median", "value"
if (options.get("legend")): set_legend(axis, { "Mean": df[col].mean(), "Mode"
if (options.get("xlabel")): axis.set_xlabel(options.get("xlabel"))
if (options.get("ylabel")): axis.set_ylabel(options.get("ylabel"))
if (options.get("rotate")):
axis.set_xticklabels(axis.get_xticklabels(), rotation = 90)
if (options.get("nolabel")):
axis.set_xlabel(None)
axis.set_ylabel(None)
In [18]:
def uni_distplot(df, col, axis, title, options = DEFAULT_OPTIONS):
sns.distplot(df[col], ax = axis)
add_meta_data(df, col, axis, title, options)
In [19]:
def uni_boxplot(df, col, axis, title):
sns.boxplot(y = df[col], ax = axis)
add_meta_data(df, col, axis, title, { "ylabel": col })
Delhivery (6) 19/08/22, 10:36 PM
In [20]:
def uni_barplot(df, col, axis, options):
df_count = df[col].value_counts()
df_count.plot.bar(color = COLOR_PALETTE, ax = axis)
add_meta_data(df, col, axis, None, options)
In [21]:
def uni_countplot(df, col, hue, axis):
sns.countplot(data = df, x = col, hue = hue, palette = "Set2", ax = axis)
add_meta_data(df, col, axis, col + " - " + hue + " based distribution", {
In [22]:
def uni_pieplot(df, col, axis, options):
df_count = df[col].value_counts()
df_count.plot.pie(colors = COLOR_PALETTE, autopct = '%.0f%%', ax = axis)
add_meta_data(df, col, axis, None, options)
In [23]:
def uni_histplot(df, col, axis, title, options):
data = df[col]
if (options.get("log")): data = np.log(data)
sns.histplot(data, bins = 50, kde = True, ax = axis)
add_meta_data(df, None, axis, title, None)
In [24]:
def uni_qqplot(df, col, title, axis, options):
data = df[col]
if (options.get("log")): data = np.log(data)
sm.qqplot(data, line = 's', ax = axis)
In [25]:
def bi_pointplot(df, xcol, ycol, hue, axis, title):
sns.pointplot(x = df[xcol], y = df[ycol], hue = df[hue], ax = axis)
add_meta_data(df, None, axis, title, { "xlabel": xcol, "ylabel": ycol })
In [26]:
def uni_scatterplot(df, xcol, title, axis):
g = sns.scatterplot(data = df[xcol], ax = axis)
g.set(xticklabels = [])
g.set(xlabel = None)
In [27]:
def bi_boxplot(df, xcol, ycol, axis, hue, options):
sns.boxplot(data = df, x = xcol, y = ycol, ax = axis, palette = "Paired",
add_meta_data(df, None, axis, None, options)
In [28]:
def bi_scatterplot(df, xcol, ycol, axis, title, hue, options):
sns.scatterplot(data = df, x = xcol, y = ycol, ax = axis, hue = hue)
add_meta_data(df, None, axis, title, options)
Delhivery (6) 19/08/22, 10:36 PM
In [29]:
def bi_pointplot(df, xcol, ycol, hue, axis, title):
sns.pointplot(x = df[xcol], y = df[ycol], hue = df[hue], ax = axis)
add_meta_data(df, None, axis, title, { "xlabel": xcol, "ylabel": ycol })
In [30]:
def bi_lineplot(df, xcol, ycol, hue, axis, title):
if (hue): g = sns.lineplot(data = df, x = xcol, y = ycol, hue = hue, ax =
else: g = sns.lineplot(data = df, x = xcol, y = ycol, ax = axis, markers=
if (hue): title = title + " | " + hue
g.set(xlabel = None)
g.set(ylabel = None)
In [31]:
def heatmap(df, title, options):
sns.color_palette("pastel")
sns.heatmap(df, annot = True, vmin = -1, vmax = 1, cmap = "PiYG")
add_meta_data(df, None, None, title, options)
plt.show()
In [32]:
def pairplot(df):
sns.pairplot(df)
plt.show()
In [33]:
def kdeplot(df, colname, title, axis, test = False):
g = sns.kdeplot(df[colname], ax = axis)
axis.axvline(df[colname].mean(), color = "r", linestyle = "--", label =
if (title):
axis.set_title(title, fontweight = fontweight)
g.set(yticklabels=[])
g.set(ylabel=None)
if (not test):
axis.axvline(df[colname].median(), color = "g", linestyle = "-", label
else:
axis.legend({ "Mean" : df[colname].mean(), "Median" : df[colname].median
In [34]:
def boxplot(df, colname, title, axis):
sns.boxplot(y = df[colname], ax = axis)
if (title):
axis.set_ylabel(colname, fontsize = fontsize, family = fontfamily)
In [35]:
def scatterplot(df, xcolname, ycolname, title, axis):
sns.scatterplot(data = df, x = xcolname, y = ycolname, ax = axis)
axis.set_xlabel(None)
axis.set_ylabel(None)
Delhivery (6) 19/08/22, 10:36 PM
In [36]:
def scatterplotonecol(df, xcolname, title, axis):
g = sns.scatterplot(data = df[xcolname], ax = axis)
if (title):
g.set(xticklabels=[])
g.set(xlabel=None)
In [37]:
def boxplot_bicol(df,colname1, colname2,axis):
sns.boxplot(x = colname1,y = colname2, data = df,ax=axis,palette="Paired"
axis.set_xlabel(colname1, fontweight="bold",fontsize=14,family = "Comic Sans
axis.set_ylabel(colname2, fontweight="bold", fontsize=14,family = "Comic San
In [38]:
def pointplot(df,colname1,colname2,colname3,title,axis):
sns.pointplot(x=df[colname1],y=df[colname2],hue=df[colname3],ax=axis)
axis.set_xlabel(colname1)
axis.set_ylabel(colname2)
axis.set_title(title,fontweight="bold")
In [39]:
def countplot(df, xcolname, hcolname, axis):
title = xcolname + " - " + hcolname + " based distribution"
sns.countplot(data=df,x=xcolname, hue=hcolname, palette="Set2",ax=axis)
axis.set_xlabel(xcolname)
axis.set_ylabel('count')
In [40]:
def histplot(df,title,axis):
sns.histplot(df, bins = 50, kde = True, ax = axis)
In [41]:
def qqplot(df,title,axis):
sm.qqplot(df, line = 's', ax = axis)
axis.set_title(title)
Column Profiling
Delhivery (6) 19/08/22, 10:36 PM
data - tells whether the data is testing or training data

trip_creation_time – Timestamp of trip creation
route_schedule_uuid – Unique Id for a particular route schedule
route_type – Transportation type
FTL – Full Truck Load: FTL shipments get to the destination sooner, as the truck
is making no other pickups or drop-offs along the way
Carting: Handling system consisting of small vehicles (carts)
trip_uuid - Unique ID given to a particular trip (A trip may include different source
and destination centers)
source_center - Source ID of trip origin
source_name - Source Name of trip origin
destination_cente – Destination ID
destination_name – Destination Name
od_start_time – Trip start time
od_end_time – Trip end time
start_scan_to_end_scan – Time taken to deliver from source to destination
is_cutoff – Unknown field
cutoff_factor – Unknown field
cutoff_timestamp – Unknown field
actual_distance_to_destination – Distance in Kms between source and destination
warehouse
actual_time – Actual time taken to complete the delivery (Cumulative)
osrm_time – An open-source routing engine time calculator which computes the
shortest path between points in a given map (Includes usual traffic, distance through
major and minor roads) and gives the time (Cumulative)
osrm_distance – An open-source routing engine which computes the shortest path
between points in a given map (Includes usual traffic, distance through major and
minor roads) (Cumulative)
factor – Unknown field
segment_actual_time – This is a segment time. Time taken by the subset of the
package delivery
segment_osrm_time – This is the OSRM segment time. Time taken by the subset of
the package delivery
segment_osrm_distance – This is the OSRM distance. Distance covered by subset of
the package delivery
segment_factor – Unknown field
Load data
Delhivery (6) 19/08/22, 10:36 PM
In [42]:
delhivery_data = pd.read_csv("https://d2beiqkhq929f0.cloudfront.net/public_asset
delhivery_data.head()
Out[42]: data trip_creation_time route_schedule_uuid route_type trip_uuid source_c
thanos::sroute:eb7bfc78-
2018-09-20 trip-
0 training b351-4c0e-a951- Carting IND38812
02:35:36.476840 153741093647649320
fa3d5c3...
2018-09-20 trip-
02:35:36.476840 153741093647649320
fa3d5c3...
2018-09-20 trip-
02:35:36.476840 153741093647649320
fa3d5c3...
2018-09-20 trip-
02:35:36.476840 153741093647649320
fa3d5c3...
2018-09-20 trip-
02:35:36.476840 153741093647649320
fa3d5c3...
5 rows × 24 columns
Drop Unknown Fields

In [43]:
unknwn_cols = ["is_cutoff","cutoff_factor","cutoff_timestamp","factor","segment_
delhivery_data.drop(unknwn_cols, axis=1, inplace=True)
Observations on shape & data types of all

attributes
In [44]:
delhivery_data.shape
(144867, 19)
Out[44]:
In [45]:
delhivery_data.columns
Index(['data', 'trip_creation_time', 'route_schedule_uuid', 'route_type',

Out[45]:
'trip_uuid', 'source_center', 'source_name', 'destination_center',
'destination_name', 'od_start_time', 'od_end_time',
'start_scan_to_end_scan', 'actual_distance_to_destination',
'actual_time', 'osrm_time', 'osrm_distance', 'segment_actual_time',
'segment_osrm_time', 'segment_osrm_distance'],
dtype='object')
Delhivery (6) 19/08/22, 10:36 PM
In [46]:
delhivery_data.dtypes
data object
Out[46]:
trip_creation_time object
route_schedule_uuid object
route_type object
trip_uuid object
source_center object
source_name object
destination_center object
destination_name object
od_start_time object
od_end_time object
start_scan_to_end_scan float64
actual_distance_to_destination float64
actual_time float64
osrm_time float64
osrm_distance float64
segment_actual_time float64
segment_osrm_time float64
segment_osrm_distance float64
dtype: object
Conversion of Categorical attributes

In [47]:
delhivery_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144867 entries, 0 to 144866
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 data 144867 non-null object
1 trip_creation_time 144867 non-null object
2 route_schedule_uuid 144867 non-null object
3 route_type 144867 non-null object
4 trip_uuid 144867 non-null object
5 source_center 144867 non-null object
6 source_name 144574 non-null object
7 destination_center 144867 non-null object
8 destination_name 144606 non-null object
9 od_start_time 144867 non-null object
10 od_end_time 144867 non-null object
11 start_scan_to_end_scan 144867 non-null float64
12 actual_distance_to_destination 144867 non-null float64
13 actual_time 144867 non-null float64
14 osrm_time 144867 non-null float64
15 osrm_distance 144867 non-null float64
16 segment_actual_time 144867 non-null float64
17 segment_osrm_time 144867 non-null float64
18 segment_osrm_distance 144867 non-null float64
dtypes: float64(8), object(11)
memory usage: 21.0+ MB
Delhivery (6) 19/08/22, 10:36 PM
In [48]:
cols = delhivery_data.columns
num_cols = ["start_scan_to_end_scan", "actual_distance_to_destination", "actual_
cat_cols = ["data","route_type"]
dt_cols = ["trip_creation_time","od_start_time","od_end_time"]
#delhivery_data[cat_cols] = delhivery_data[cat_cols].astype("category")
delhivery_data[dt_cols] = delhivery_data[dt_cols].astype('datetime64')
Analyzing basic statistics about each

feature, such as count, min, max, and mean
In [49]:
delhivery_data.describe().T
Out[49]: count mean std min 25%
start_scan_to_end_scan 144867.0 961.262986 1037.012769 20.000000 161.000000
actual_distance_to_destination 144867.0 234.073372 344.990009 9.000045 23.355874
actual_time 144867.0 416.927527 598.103621 9.000000 51.000000
osrm_time 144867.0 213.868272 308.011085 6.000000 27.000000
osrm_distance 144867.0 284.771297 421.119294 9.008200 29.914700
segment_actual_time 144867.0 36.196111 53.571158 -244.000000 20.000000
segment_osrm_time 144867.0 18.507548 14.775960 0.000000 11.000000
segment_osrm_distance 144867.0 22.829020 17.860660 0.000000 12.070100
In [50]:
delhivery_data.describe(include='object').T
Out[50]: count unique top freq
data 144867 2 training 104858
thanos::sroute:4029a8a2-6c74-4b7e-a6d8-
route_schedule_uuid 144867 1504 1812
f9e069f...
route_type 144867 2 FTL 99660
trip_uuid 144867 14817 trip-153811219535896559 101
source_center 144867 1508 IND000000ACB 23347
source_name 144574 1498 Gurgaon_Bilaspur_HB (Haryana) 23347
destination_center 144867 1481 IND000000ACB 15192
destination_name 144606 1468 Gurgaon_Bilaspur_HB (Haryana) 15192
Delhivery (6) 19/08/22, 10:36 PM
In [51]:
delhivery_data.describe(include='all').T
Out[51]: count unique top freq
data 144867 2 training 104858
2018-09-28 2018-09-12
trip_creation_time 144867 14817 101
05:23:15.359220 00:00:16.535741
thanos::sroute:4029a8a2-
route_schedule_uuid 144867 1504 6c74-4b7e-a6d8- 1812
f9e069f...
route_type 144867 2 FTL 99660
trip-
trip_uuid 144867 14817 101
153811219535896559
source_center 144867 1508 IND000000ACB 23347
Gurgaon_Bilaspur_HB
source_name 144574 1498 23347
(Haryana)
destination_center 144867 1481 IND000000ACB 15192
Gurgaon_Bilaspur_HB
destination_name 144606 1468 15192
(Haryana)
2018-09-21 2018-09-12
od_start_time 144867 26369 81
18:37:09.322207 00:00:16.535741
2018-09-24 2018-09-12
od_end_time 144867 26369 81
09:59:15.691618 00:50:10.814399
start_scan_to_end_scan 144867.0 NaN NaN NaN
actual_distance_to_destination 144867.0 NaN NaN NaN
actual_time 144867.0 NaN NaN NaN
osrm_time 144867.0 NaN NaN NaN
osrm_distance 144867.0 NaN NaN NaN
segment_actual_time 144867.0 NaN NaN NaN
segment_osrm_time 144867.0 NaN NaN NaN
segment_osrm_distance 144867.0 NaN NaN NaN
Non-Graphical Analysis: Value counts and

unique attributes
Unique values (counts) for each Feature
Delhivery (6) 19/08/22, 10:36 PM
In [52]:
for col in delhivery_data.columns:
l = len(col)
if l < 7: print(col, "\t\t\t\t:", delhivery_data[col].nunique())
elif l == 16: print(col, "\t\t:", delhivery_data[col].nunique())
elif l < 16: print(col, "\t\t\t:", delhivery_data[col].nunique())
elif l<30: print(col, "\t\t:", delhivery_data[col].nunique())
else: print(col, "\t:", delhivery_data[col].nunique())
data : 2
trip_creation_time : 14817
route_schedule_uuid : 1504
route_type : 2
trip_uuid : 14817
source_center : 1508
source_name : 1498
destination_center : 1481
destination_name : 1468
od_start_time : 26369
od_end_time : 26369
start_scan_to_end_scan : 1915
actual_distance_to_destination : 144515
actual_time : 3182
osrm_time : 1531
osrm_distance : 138046
segment_actual_time : 747
segment_osrm_time : 214
segment_osrm_distance : 113799
Unique values (names) are checked for each

Features
In [53]:
for colname in cat_cols:
print("\nUnique values of ",colname," are : ",list(delhivery_data[colname
Unique values of data are : ['training', 'test']
Unique values of route_type are : ['Carting', 'FTL']
Inference
No abnormalities found.
Unique values (counts) are checked for each

Features unique values
In [54]:
delhivery_data["data"].value_counts().sort_values(ascending=False)
training 104858
Out[54]:
test 40009
Name: data, dtype: int64
Delhivery (6) 19/08/22, 10:36 PM
In [55]:
delhivery_data["route_type"].value_counts().sort_values(ascending=False)
FTL 99660
Out[55]:
Carting 45207
Name: route_type, dtype: int64
Inference
Around 50% of the trips has distance more than 50 kms. Maximum distance travelled
is 2186 kms
50% of delivery has the actual time of 149 mins and maximum time taken for longest
delivery is 6265 mins
50% of delivery time calculated in OSRM engine is 60 mins and maximum time taken
for longest delivery is 2032 mins
50% of delivery in OSRM engine is 65 kms and maximum time taken for longest
delivery is 2840 kms
50% of delivery has the segment actual time 147 mins and maximum time taken for
longest delivery is 6230 mins
50% of delivery has the segment OSRM time of 65 mins and maximum time taken for
longest delivery is 2564 mins
50% of delivery has the segment OSRM distance of 70 kms and maximum distance
taken for longest delivery is 3523 kms
50% of trip time difference between the start and end 280 mins and max trip time is
7898 mins
Most of the route type is Carting and it is around 8908
Most of the orders are coming from state Maharashtra
Most of the orders are delivered to state Maharashtra
Most of the orders are delivered from city Mumbai in Maharashtra state
Most of the orders are delivered to city Mumbai in Maharashtra state, Hence busiest
corridor under busiest state is Mumbai to Mumbai (round trip)
Average distance between the busiest corridor Mumbai to Mumbai is 14.62 km
Average time between the busiest corridor Mumbai to Mumbai is 55.29 minutes
Missing value detection

In [56]:
missing_values(delhivery_data)
Delhivery (6) 19/08/22, 10:36 PM
Total records = 144867

Out[56]: # of Missing % of Missing
source_name 293 0.20
destination_name 261 0.18
data 0 0.00
start_scan_to_end_scan 0 0.00
segment_osrm_time 0 0.00
segment_actual_time 0 0.00
osrm_distance 0 0.00
osrm_time 0 0.00
actual_time 0 0.00
actual_distance_to_destination 0 0.00
od_start_time 0 0.00
od_end_time 0 0.00
trip_creation_time 0 0.00
destination_center 0 0.00
source_center 0 0.00
trip_uuid 0 0.00
route_type 0 0.00
route_schedule_uuid 0 0.00
segment_osrm_distance 0 0.00
In [57]:
mask = False
mask = mask | delhivery_data["source_name"].isnull()
source_name_null_data = delhivery_data[mask]
source_name_null_data
Delhivery (6) 19/08/22, 10:36 PM
Out[57]: data trip_creation_time route_schedule_uuid route_type trip_uuid
thanos::sroute:4460a38d-
2018-09-25 trip-
112 training ab9b-484e-bd4e- FTL
08:53:04.377810 153786558437756691
f4201d0...
2018-09-25 trip-
08:53:04.377810 153786558437756691
f4201d0...
2018-09-25 trip-
08:53:04.377810 153786558437756691
f4201d0...
2018-09-25 trip-
08:53:04.377810 153786558437756691
f4201d0...
2018-09-25 trip-
08:53:04.377810 153786558437756691
f4201d0...
... ... ... ... ...
thanos::sroute:cbef3b6a-
2018-10-03 trip-
144484 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...
2018-10-03 trip-
144485 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...
2018-10-03 trip-
144486 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...
2018-10-03 trip-
144487 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...
2018-10-03 trip-
144488 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...
In [58]:
mask = False
mask = mask | delhivery_data["destination_name"].isnull()
dest_name_null_data = delhivery_data[mask]
dest_name_null_data
Delhivery (6) 19/08/22, 10:36 PM
Out[58]: data trip_creation_time route_schedule_uuid route_type trip_uuid
2018-09-25 trip-
08:53:04.377810 153786558437756691
f4201d0...
2018-09-25 trip-
08:53:04.377810 153786558437756691
f4201d0...
thanos::sroute:d0ebdacd-
2018-10-01 trip-
982 test e09b-47d3-be77- FTL
20:56:18.155260 153842737815495661
c9c4a05...
thanos::sroute:d0ebdacd-
2018-10-01 trip-
983 test e09b-47d3-be77- FTL
20:56:18.155260 153842737815495661
c9c4a05...
thanos::sroute:2f43f11e-
2018-09-24 trip-
4882 training d3ba-4590-9355- FTL
07:18:06.087341 153777348608709328
82928e1...
... ... ... ... ...
2018-10-03 trip-
144478 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...
2018-10-03 trip-
144479 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...
2018-10-03 trip-
144480 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...
2018-10-03 trip-
144481 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...
2018-10-03 trip-
144482 test 79ea-4d5e-a215- FTL
09:06:06.690094 153855756668984584
b558a70...
In [59]:
delhivery_data['source_name'].fillna(delhivery_data['source_center'],inplace
delhivery_data['destination_name'].fillna(delhivery_data['destination_center'
Inference
Inference
Source and destination name attributes are having small number of missing values.
Labels are also inconsitent in source and destination name attributes
Delhivery (6) 19/08/22, 10:36 PM
Merging of rows and aggregation of fields

In [60]:
df1= delhivery_data
#df1 = df1[df1['trip_uuid']=='trip-153741093647649320']
In [61]:
grp = df1.groupby(['data','trip_uuid', 'trip_creation_time','route_type','source
df2 = grp.agg({'od_start_time' : 'min', 'od_end_time' : 'max','actual_distance_t
'segment_actual_time':'sum','segment_osrm_time':'sum','segment_osrm_dis
df2.sort_values(by=['trip_uuid', 'trip_creation_time','od_start_time'],inplace
df2 = df2.reset_index()
df2
Out[61]: data trip_uuid trip_creation_time route_type source_center
trip- 2018-09-12
0 training FTL IND462022AAA
153671041653548748 00:00:16.535741
trip- 2018-09-12 Kanpur_

153671041653548748 00:00:16.535741
trip- 2018-09-12
2 training Carting IND572101AAA
153671042288605164 00:00:22.886430
trip- 2018-09-12 Doddablpur_

3 training Carting IND561203AAB
153671042288605164 00:00:22.886430
trip- 2018-09-12 Bangalore_

153671043369099517 00:00:33.691250
... ... ... ... ... ...
trip- 2018-10-03 Tirchchndr_S

26364 test Carting IND628204AAA
153861115439069069 23:59:14.390954
trip- 2018-10-03 Thisayanvilai_

153861115439069069 23:59:14.390954
trip- 2018-10-03 Peikulam_S

153861115439069069 23:59:14.390954
trip- 2018-10-03
26367 test FTL IND583201AAA Hospet
153861118270144424 23:59:42.701692
trip- 2018-10-03 Sandur_W

26368 test FTL IND583119AAA
153861118270144424 23:59:42.701692
In [62]:
grp = df2.groupby(['data','trip_uuid', 'trip_creation_time','route_type'])
delhivery_data_v2 = grp.agg({'source_center':'first','source_name':'first',
'segment_actual_time':'sum','segment_osrm_time':'sum','segment_osrm_dis
delhivery_data_v2.columns = ["data","trip_uuid", "trip_creation_time", "route_ty
delhivery_data_v2
Delhivery (6) 19/08/22, 10:36 PM
trip- 2018-09-27
153800653897073708 00:02:18.970980
trip- 2018-09-27
1 test Carting IND400072AAB
153800654935210748 00:02:29.352390
trip- 2018-09-27
2 test FTL IND302014AAA Jaipur_Hub
153800658820968126 00:03:08.209931
trip- 2018-09-27
3 test Carting IND421302AAF
153800659468028518 00:03:14.680535
trip- 2018-09-27
153800661729668086 00:03:37.296972
... ... ... ... ... ...
trip- 2018-09-26 Vadodara_Kar

153800579708680929 23:49:57.087036
trip- 2018-09-26 Sivaganga

153800585467019097 23:50:54.670423
trip- 2018-09-26
153800603160412602 23:53:51.604388
trip- 2018-09-26 Mainpu

14815 training FTL IND205001AAB
153800605670819251 23:54:16.708455
trip- 2018-09-26 GZB_Mohan

153800606794535545 23:54:27.945614
Inference
So we have reduced the number of rows from 144316 to just 14817. We now have 21
columns.
We combined delivery details of a package with multiple rows into single row.
Validate Duplicate Records

In [63]:
delhivery_data_v2[delhivery_data_v2.duplicated()]
Out[63]: data trip_uuid trip_creation_time route_type source_center source_name destination_center
Inference
Delhivery (6) 19/08/22, 10:36 PM
No duplicates found.
Feature Engineering
Split and extract features out of Source Name
In [64]:
delhivery_data_v2['source_city'] = delhivery_data_v2['source_name'].str.split
delhivery_data_v2['source_place'] = delhivery_data_v2['source_name'].str.split
delhivery_data_v2['source_code'] = delhivery_data_v2['source_name'].str.split
delhivery_data_v2['source_state']= delhivery_data_v2['source_name'].str.split
Split and extract features out of Destination Name

In [65]:
delhivery_data_v2['destination_city'] = delhivery_data_v2['destination_name'
delhivery_data_v2['destination_place'] = delhivery_data_v2['destination_name'
delhivery_data_v2['destination_code'] = delhivery_data_v2['destination_name'
delhivery_data_v2['destination_state']= delhivery_data_v2['destination_name'
Extract features like month, year and day from Trip Creation
Time
In [66]:
delhivery_data_v2['trip_creation_year'] = delhivery_data_v2['trip_creation_time'
delhivery_data_v2['trip_creation_month'] = delhivery_data_v2['trip_creation_time
delhivery_data_v2['trip_creation_day']= delhivery_data_v2['trip_creation_time'
delhivery_data_v2["trip_creation_hour"] = delhivery_data_v2["trip_creation_time"
delhivery_data_v2["trip_creation_day"] = delhivery_data_v2["trip_creation_time"
delhivery_data_v2["trip_creation_week"] = delhivery_data_v2["trip_creation_time"
delhivery_data_v2["trip_creation_dayofweek"] = delhivery_data_v2["trip_creation_
Calculate the time taken between od_start_time and

od_end_time
In [67]:
delhivery_data_v2['od_time_taken'] = (delhivery_data_v2['od_end_time'] - delhive
#(df.from_date - df.to_date) / pd.Timedelta(minutes=1)
Handling categorical values

Do one-hot encoding of categorical variable -
route_type
Delhivery (6) 19/08/22, 10:36 PM
In [68]:
dummies = pd.get_dummies(delhivery_data_v2.route_type,drop_first = True)
In [69]:
delhivery_data_v2 = pd.concat([delhivery_data_v2,dummies],axis=1)
Do one-hot encoding of categorical variable - data

In [70]:
dummies = pd.get_dummies(delhivery_data_v2.data,drop_first = True)
In [71]:
delhivery_data_v2 = pd.concat([delhivery_data_v2,dummies],axis=1)
Visual Univariate Analysis - Numerical

Variables
In [72]:
num_cols
['start_scan_to_end_scan',
Out[72]:
'actual_distance_to_destination',
'actual_time',
'osrm_time',
'osrm_distance',
'segment_actual_time',
'segment_osrm_time',
'segment_osrm_distance']
Outliers Detection and Handling -

start_scan_to_end_scan
In [73]:
col = num_cols[0]
univariate_analysis(delhivery_data_v2, col)
Delhivery (6) 19/08/22, 10:36 PM
In [74]:
delhivery_data_v2[delhivery_data_v2['start_scan_to_end_scan']<0]
In [75]:
detect_outliers_group(delhivery_data_v2, col)
trip- 2018-09-27 Surat_C

21 test Carting IND395023AAD
153800750141776702 00:18:21.418023
trip- 2018-09-27
153801189681302558 01:31:36.813300
trip- 2018-09-27 Bhiwandi_M

72 test FTL IND421302AAG
153801202166779166 01:33:41.668026
trip- 2018-09-27 Lucknow_H

153801386437307256 02:04:24.373330
trip- 2018-09-27
101 test Carting IND110037AAM Delhi_Airport
153801517906891689 02:26:19.069179
... ... ... ... ... ...
trip- 2018-09-26
14638 training Carting IND403726AAB Goa_ZuariNg
153799260763324327 20:10:07.633497
trip- 2018-09-26 Bangalore_Ne

153799315216859426 20:19:12.168865
trip- 2018-09-26
14687 training Carting IND110044AAB Del_Okhla_P
153799639329945683 21:13:13.299701
trip- 2018-09-26 Bengaluru_Ka

14789 training Carting IND560067AAC
153800448393383196 23:28:03.934090
trip- 2018-09-26 Sivaganga_W

153800585467019097 23:50:54.670423
In [76]:
Delhivery (6) 19/08/22, 10:36 PM
In [77]:
handle_outliers_group(delhivery_data_v2, col)
In [78]:
In [79]:
Inference
Average time taken to deliver from source to destination are relatively higher where
the number of transits are high.
Most of the one way parcels are likely to have least number of transits.
More data are right skewed in start scan to end scan attribute and also where the
number of transits are less than three.

actual_distance_to_destination
In [80]:
col = num_cols[1]
Delhivery (6) 19/08/22, 10:36 PM
In [81]:
trip- 2018-09-27
2 test FTL IND302014AAA Jaipur_Hub
153800658820968126 00:03:08.209931
trip- 2018-09-27
153800661729668086 00:03:37.296972
trip- 2018-09-27 Mumbai_Kaly

153800662027930085 00:03:40.279575
trip- 2018-09-27
27 test Carting IND000000AFT
153800774414397735 00:22:24.144231
trip- 2018-09-27 Chandigarh_M

39 test Carting IND160002AAC
153800890376792315 00:41:43.768170
... ... ... ... ... ...
trip- 2018-09-26 Kanpur_

153800093896706050 22:28:58.967321
trip- 2018-09-26
14751 training Carting IND110037AAK Delhi_Kapshe
153800194935773399 22:45:49.357982
trip- 2018-09-26 Muzaffrp

153800360857000808 23:13:28.570233
trip- 2018-09-26
153800419141788066 23:23:11.418113
trip- 2018-09-26 Tirupur_Kol

153800495416332487 23:35:54.163552
In [82]:
In [83]:
Delhivery (6) 19/08/22, 10:36 PM
Inference
More data are right skewed in actual distance to destination attribute.
Outliers Detection and Handling - actual_time

In [84]:
col = num_cols[2]
In [85]:
Delhivery (6) 19/08/22, 10:36 PM
trip- 2018-09-27 Murshidabad

153800744556994688 00:17:25.570190
trip- 2018-09-27
31 test Carting IND712310AAE
153800813417374126 00:28:54.174126
trip- 2018-09-27 PNQ Vad

153800827701849043 00:31:17.018741 DPC (Ma
trip- 2018-09-27
153800830425161914 00:31:44.251864
trip- 2018-09-27 Gulbarga_

153800950402244509 00:51:44.022677
... ... ... ... ... ...
trip- 2018-09-26 Bangalore_N

153799615285931023 21:09:12.859557
trip- 2018-09-26 Kollam_C

153799857684519878 21:49:36.845451
trip- 2018-09-26
14751 training Carting IND110037AAK
153800194935773399 22:45:49.357982
trip- 2018-09-26 Muzaffrpu

153800360857000808 23:13:28.570233
trip- 2018-09-26 Noida_C

153800571655403292 23:48:36.554285
In [86]:
In [87]:
Inference
990 outliers are detected in actual time attribute.
Delhivery (6) 19/08/22, 10:36 PM
Outliers Detection and Handling - osrm_time

In [88]:
col = num_cols[3]
In [89]:
trip- 2018-09-27 Mumbai_Kaly

153800662027930085 00:03:40.279575
trip- 2018-09-27 CCU_Beli

153800970294643704 00:55:02.946700
trip- 2018-09-27 Jamshedpu

51 test FTL IND832109AAB
153800984084960070 00:57:20.849835
trip- 2018-09-27 Anand_V

153801468900715290 02:18:09.007402
trip- 2018-09-27 MAA_Poon

153801610726869838 02:41:47.269073
... ... ... ... ... ...
trip- 2018-09-26 Chandigarh_M

14697 training FTL IND160002AAC
153799728722932020 21:28:07.229557
trip- 2018-09-26
14754 training FTL IND361001AAA Jamnagar_
153800218813854926 22:49:48.138801
trip- 2018-09-26
153800419141788066 23:23:11.418113
trip- 2018-09-26 Bengaluru_

153800448393383196 23:28:03.934090
trip- 2018-09-26 Kolkata_

153800550884418718 23:45:08.844409
Delhivery (6) 19/08/22, 10:36 PM
In [90]:
In [91]:
Inference
More data are right skewed in OSRM attribute.

More number of round trips takes more than 2 transits to deliver.
1070 outliers are detected in OSRM time attribute.
Outliers Detection and Handling - osrm_distance

In [92]:
col = num_cols[4]
In [93]:
Delhivery (6) 19/08/22, 10:36 PM
trip- 2018-09-27 Mumbai_Ka

153800662027930085 00:03:40.279575 _Dc (Ma
trip- 2018-09-27 Brahmapuri_D

153800693388046603 00:08:53.880714
trip- 2018-09-27
153800750141776702 00:18:21.418023
trip- 2018-09-27
57 test FTL IND282002AAD
153801047876051786 01:07:58.760784
trip- 2018-09-27 Anand_VU

153801468900715290 02:18:09.007402
... ... ... ... ... ...
trip- 2018-09-26
14754 training FTL IND361001AAA Jamnagar_Dc
153800218813854926 22:49:48.138801
trip- 2018-09-26 Chotila_C

153800223678375371 22:50:36.783999
trip- 2018-09-26
14779 training Carting IND395023AAD
153800389194808105 23:18:11.948320
trip- 2018-09-26 Mumbai

153800419141788066 23:23:11.418113
trip- 2018-09-26 Rampur_R

153800436074116276 23:26:00.741419
In [94]:
In [95]:
Inference
Delhivery (6) 19/08/22, 10:36 PM
More data are right skewed in OSRM distance attribute.

1003 outliers are detected in OSRM distance attribute.

segment_actual_time
In [96]:
col = num_cols[5]
In [97]:
Delhivery (6) 19/08/22, 10:36 PM
trip- 2018-09-27
153800830425161914 00:31:44.251864
trip- 2018-09-27
67 test Carting IND396191AAC Vapi_IndEsta
153801182428197910 01:30:24.282255
trip- 2018-09-27 Bhiwandi_

72 test FTL IND421302AAG
153801202166779166 01:33:41.668026
trip- 2018-09-27 Bengaluru_Bo

153802092485997218 04:02:04.860231
trip- 2018-09-27 Ahmedab

153802358988013009 04:46:29.880386
... ... ... ... ... ...
trip- 2018-09-26 Kollam_

153799857684519878 21:49:36.845451
trip- 2018-09-26
14751 training Carting IND110037AAK Delhi_Kapshe
153800194935773399 22:45:49.357982
trip- 2018-09-26 Muzaffrp

153800360857000808 23:13:28.570233
trip- 2018-09-26 Bengaluru_

153800448393383196 23:28:03.934090
trip- 2018-09-26
153800571655403292 23:48:36.554285
In [98]:
In [99]:
Inference
Delhivery (6) 19/08/22, 10:36 PM
More data are right skewed in segment actual time attribute.

870 outliers are detected in segment actual time attribute.

segment_osrm_time
In [100…
col = num_cols[6]
In [101…
Delhivery (6) 19/08/22, 10:36 PM
Out[101… data trip_uuid trip_creation_time route_type source_center
trip- 2018-09-27 Kanpur_Cent

153800688276350851 00:08:02.763752 (Uttar Pr
trip- 2018-09-27 Vapi_Ind

153801182428197910 01:30:24.282255
trip- 2018-09-27 Dehradun_Se

153801859365729551 03:23:13.657578
trip- 2018-09-27 Pune_Tatha

153802053032411379 03:55:30.324382 (Mahar
trip- 2018-09-27 Pune_Tatha

153802191796534285 04:18:37.965599 (Mahar
... ... ... ... ... ...
trip- 2018-09-26 Mumbai An

153800419141788066 23:23:11.418113 (Mahar
trip- 2018-09-26 Muzaffrpur_B

153800486451898339 23:34:24.519194
trip- 2018-09-26 Noida_Sec 0

14804 training FTL IND201301AAF
153800536221106758 23:42:42.211304 (Uttar Pr
trip- 2018-09-26 Kolkata_Dank

153800550884418718 23:45:08.844409 (West B
trip- 2018-09-26 Mainpuri_Agr

153800605670819251 23:54:16.708455 (Uttar Pr
In [102…
In [103…
Inference
Delhivery (6) 19/08/22, 10:36 PM
More data are right skewed in segment OSRm time attribute.

segment_osrm_distance
In [104…
col = num_cols[6]
In [105…
Delhivery (6) 19/08/22, 10:36 PM
Out[105… data trip_uuid trip_creation_time route_type source_center
trip- 2018-09-27
153802898973535963 06:16:29.735723
trip- 2018-09-27 Durgapur_C

298 test FTL IND713205AAB
153804913581490167 11:52:15.815278
trip- 2018-09-27 Chandigar

335 test FTL IND160002AAC
153805947380180455 14:44:33.802060
trip- 2018-09-27 Visakhapatna

153807672514287673 19:32:05.143140
trip- 2018-09-27
473 test Carting IND400072AAB Mumbai Hub
153807813360229446 19:55:33.602541
... ... ... ... ... ...
trip- 2018-09-26
153793840860911258 05:06:48.609360
trip- 2018-09-26
153794472414216798 06:52:04.142401
trip- 2018-09-26 Manbazar

153795866685428331 10:44:26.854538
trip- 2018-09-26
153797075209653066 14:05:52.096792
trip- 2018-09-26 Chandigar

153799728722932020 21:28:07.229557
In [106…
In [107…
Inference
Delhivery (6) 19/08/22, 10:36 PM
More data are right skewed in segment OSRm distance attribute.

1017 outliers are detected in segment OSRM distance attribute.
Visual Univariate Analysis - Categorical

Variables
In [108…
for i in range(len(cat_cols)):
fig, ax = plt.subplots(1, 2, figsize = (12, 4))
plt.suptitle(cat_cols[i], fontsize = fontsize, fontweight = fontweight)
uni_barplot(delhivery_data_v2, cat_cols[i], ax[0], { "nolabel": True })
uni_pieplot(delhivery_data_v2, cat_cols[i], ax[1], { "nolabel": True })
Inference
60% of deliveries used carts.
Delhivery (6) 19/08/22, 10:36 PM
In [109…
sourcestate10 = delhivery_data_v2["source_state"].value_counts()[0:10]
destinationstate10 = delhivery_data_v2["destination_state"].value_counts()[
fig, ax = plt.subplots(1,2,figsize=(16,5))
sns.barplot(x = np.linspace(0,1,10), y = sourcestate10.values, data = sourcestat

ax[0].set_xticklabels(sourcestate10.index,rotation=45)
ax[0].set_title("Source State")
sns.barplot(x = np.linspace(0,1,10), y = destinationstate10.values, data =destin

ax[1].set_xticklabels(destinationstate10.index,rotation=45)
ax[1].set_title("Destination State")
plt.suptitle("The Top 10 Source and Destination States")

plt.show()
Inference
Haryana, Maharastra and Karnataka are the popular source and destination states.
Trip creation month

In [110…
col = 'trip_creation_month'
plt.suptitle('Distribution of Trip Month', fontsize = FONTSIZE, fontweight
uni_barplot(delhivery_data_v2, col, ax[0], { "nolabel": True })
uni_pieplot(delhivery_data_v2, col, ax[1], { "nolabel": True })
Delhivery (6) 19/08/22, 10:36 PM
Inference
The trips are recorded only for the months of September and October. The recording
perhaps stopped after that. So we do not analyse further on the basis of month.
Trip creation week

In [111…
delhivery_data_v2["trip_creation_dayofweek"] = delhivery_data_v2["trip_creation_
sns.countplot(x = "trip_creation_dayofweek",data=delhivery_data_v2,order=['Mon'
plt.title("Distribution of trips on each day of week")
plt.show()
Inference
So we see that maximum number of trips are happening on Wednesday and minimum on
Sunday.
Trip creation Hour
Delhivery (6) 19/08/22, 10:36 PM
In [112…
sns.distplot(delhivery_data_v2["trip_creation_hour"])
plt.title("Distribution of Trip Hour")
plt.show()
Inference
So, we observe a kind of bimodal distribution with minimum trips occuring during the day
hours (8 AM to 1 PM) and maximum occuring during late night or early morning hours (8
PM to 2 AM).
Visual Bivariate Analysis - Numerical

Variables
In [113…
for col in num_cols:
plt.suptitle(col, fontsize = fontsize, fontweight = fontweight)
boxplot_bicol(delhivery_data_v2,"data",col,ax[0])
boxplot_bicol(delhivery_data_v2,"route_type",col,ax[1])
WARNING:matplotlib.font_manager:findfont: Font family ['Comic Sans MS'] not

found. Falling back to DejaVu Sans.
Delhivery (6) 19/08/22, 10:36 PM
Delhivery (6) 19/08/22, 10:36 PM
In [114…
pointplot(delhivery_data_v2,"data",col,"route_type",'',ax[0])
pointplot(delhivery_data_v2,"route_type",col,"data",'',ax[1])
#pointplot(yulu_data_v1,"weather","count","season",'Count of booking across ea
Delhivery (6) 19/08/22, 10:36 PM
Delhivery (6) 19/08/22, 10:36 PM
Inference
So we see that the time taken by full truck load deliveries is on average, a lot higher
(>300 hours) than the cart deliveries (<100 hours).
The full truck load deliveries cover much longer distances onaverage (>150 kms)
than carting deliveries (~ 25 kms)
Time and distances follow similar trends against the hour of the day. Maximum time
and distance deliveries are likely to be made during peak morning hours of 10 AM to
12 PM as well as 5 PM, 7 PM and 1 AM.
In [115…
plt.figure(figsize = (8, 5))
sns.heatmap(delhivery_data_v2[num_cols].corr(), annot=True, vmin=-1, vmax =
plt.show()
Delhivery (6) 19/08/22, 10:36 PM
Inference
So we see that certain fields are highly correlated :
cut-off factor : osrm_time, actual_time, osrm_distance,

actual_distance_to_destination, start_scan_to_end_scan.
start_scan_to_end_scan : osrm_time, actual_time, osrm_distance,

actual_distance_to_destination.
osrm_time, actual_time, osrm_distance, actual_distance_to_destination are all highly

correlated to each other, which is expected because distance will effect time, and
osrm calculation will be somewhat close to actual (even if not perfect).
segment_osrm_time and segment_osrm_distance are also highly correlated as

expected.
we see poor correlation between segment_actual_time and segment_osrm_time

(even though overall actual_time and osrm_time are highly correlated).
In [116…
sns.pairplot(delhivery_data_v2[num_cols])
plt.show()
Delhivery (6) 19/08/22, 10:36 PM
Inference
All the numerical attributes are linearly related with each other.
Visual Bivariate Analysis - Categorical

Variables
In [117…
fig , ax = plt.subplots(1,2,figsize=(15,5))
countplot(delhivery_data_v2,"data","route_type",ax[0])
countplot(delhivery_data_v2,"route_type","data", ax[1])
plt.show()
Delhivery (6) 19/08/22, 10:36 PM
Route Type Distributions for Top 3 Source

States
In [118…
top3s = delhivery_data_v2[(delhivery_data_v2["source_state"]=='Maharashtra'
top3s = top3s[['route_type','source_state']]
st = ['Maharashtra','Karnataka','Haryana']
g = sns.countplot(x='source_state',hue='route_type', data=top3s, order = st
percx = []
for e in st:
percx.append(top3s[(top3s['source_state']==e)&(top3s["route_type"]=="Carting"
for e in st:
percx.append(top3s[(top3s['source_state']==e)&(top3s["route_type"]=="FTL"
i=0
for p in g.patches:
txt = str((round(percx[i]*100))) + '%'
txt_x = p.get_x()
txt_y = p.get_height()
g.text(txt_x+0.1,txt_y,txt)
i+=1
plt.show()
Delhivery (6) 19/08/22, 10:36 PM
Inference
So we see that for top 3 source states,
Maharashtra hs 85% Carting and 15% FTL,
Karnataka has 88% Carting and 12% FTL,
Haryana has 75% Carting and 25% FTL.
Route Type Distributions for Top 3

Destination States
In [119…
top3s = delhivery_data_v2[(delhivery_data_v2["destination_state"]=='Maharashtra'
top3s = top3s[['route_type','destination_state']]
st = ['Maharashtra','Karnataka','Haryana']
g = sns.countplot(x='destination_state',hue='route_type', data=top3s, order
percx = []
for e in st:
percx.append(top3s[(top3s['destination_state']==e)&(top3s["route_type"]==
for e in st:
percx.append(top3s[(top3s['destination_state']==e)&(top3s["route_type"]==
i=0
for p in g.patches:
txt = str((round(percx[i]*100))) + '%'
txt_x = p.get_x()
txt_y = p.get_height()
g.text(txt_x+0.1,txt_y,txt)
i+=1
plt.show()
Delhivery (6) 19/08/22, 10:36 PM
Inference
So we see that for top 3 destination states,
Maharashtra has 86% Carting and 14% FTL,
Karnataka has 86% Carting and 14% FTL,
Haryana has 81% Carting and 19% FTL.
Appropriate test to check whether

"Compare the difference between the time
taken between od_start_time/od_end_time
and start_scan_to_end_scan"
Statistical Hypothesis Test - Pearson’s Correlation
Coefficient
Step 1 - Define Null and Alternate Hypothesis
Null Hypothesis (H0) : The two samples are independent.

Alternate Hyphothesis (Ha) : There is a dependency between the samples.
Significance Level (alpha) : 0.05
Step 2 - Validate the assumptions
Delhivery (6) 19/08/22, 10:36 PM
Observations in each sample are independent and identically distributed (iid).

Observations in each sample are normally distributed.
Observations in each sample have the same variance.
Normality check of the data

Histogram and QQ-Plots
In [120…
histplot(delhivery_data_v2['od_time_taken'],"Time taken between od_start_time/od

qqplot(delhivery_data_v2['od_time_taken'], "qqplot for Time taken between od_sta
histplot(delhivery_data_v2['start_scan_to_end_scan'],"Time taken to deliver from

qqplot(delhivery_data_v2['start_scan_to_end_scan'], "qqplot for Time taken to de
Applying log on the data - Log Normal Distribution
In [121…
histplot(np.log(delhivery_data_v2['od_time_taken']),"Time taken between od_start

qqplot(np.log(delhivery_data_v2['od_time_taken']), "qqplot for Time taken betwee
histplot(np.log(delhivery_data_v2['start_scan_to_end_scan']),"Time taken to deli

qqplot(np.log(delhivery_data_v2['start_scan_to_end_scan']), "qqplot for Time tak
Delhivery (6) 19/08/22, 10:36 PM
Applying BoxCox Distribution
In [122…
fitted_od_time_taken,lmbda = stats.boxcox(delhivery_data_v2['od_time_taken'
fitted_start_scan_to_end_scan,lmbda = stats.boxcox(abs(delhivery_data_v2['start_
histplot(fitted_od_time_taken,"Time taken between od_start_time/od_end_time"

qqplot(fitted_od_time_taken, "qqplot for Time taken between od_start_time/od_end
histplot(fitted_start_scan_to_end_scan,"Time taken to deliver from source to des

qqplot(fitted_start_scan_to_end_scan, "qqplot for Time taken to deliver from sou
Delhivery (6) 19/08/22, 10:36 PM
Variance check of the data

Anderson-Darling Test
In [123…
anderson(fitted_od_time_taken)
AndersonResult(statistic=28.82281265821257, critical_values=array([0.576, 0
Out[123…
.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2
.5, 1. ]))
In [124…
anderson(fitted_start_scan_to_end_scan)
AndersonResult(statistic=23.295076091240844, critical_values=array([0.576,
Out[124…
0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. ,
2.5, 1. ]))
Step 3 - Pearson’s Correlation Coefficient

In [125…
stat, p_value = pearsonr(fitted_od_time_taken, fitted_start_scan_to_end_scan
Step 4 - Check p-value with siginificance level

In [126…
if p_value <= 0.05:
print("Reject NULL Hypothesis")
else:
print("Failed to Reject NULL Hypothesis")
Reject NULL Hypothesis
Delhivery (6) 19/08/22, 10:36 PM
Inference
Since P-Value of this test lies below 0.05, Then we can safely reject the null
hypothesis and conclude od_start_time / od_end_time and start_scan_to_end_scan
attribute are dependent on each other
Visual Analysis
In [127…
sns.distplot(delhivery_data_v2["start_scan_to_end_scan"], label="start_scan_to_e
sns.distplot(delhivery_data_v2["od_time_taken"], label="od_time_taken")
plt.legend()
plt.show()
In [128…
sns.scatterplot(data = delhivery_data_v2, x = 'od_time_taken', y = 'start_scan_t
plt.show()
Delhivery (6) 19/08/22, 10:36 PM
In [129…
sns.pointplot(data = delhivery_data_v2, x = 'od_time_taken', y = 'start_scan_to_
<matplotlib.axes._subplots.AxesSubplot at 0x7f83db6eaad0>
Out[129…

"Compare the difference between
actual_time aggregated value and OSRM
time aggregated value"
Coefficient



Delhivery (6) 19/08/22, 10:36 PM
In [130…
histplot(delhivery_data_v2['actual_time'],"Actual Time",ax[0][0])
qqplot(delhivery_data_v2['actual_time'], "qqplot for Actual Time", ax[0][1])
histplot(delhivery_data_v2['osrm_time'],"OSRM Time",ax[1][0])
qqplot(delhivery_data_v2['osrm_time'], "qqplot for OSRM Time", ax[1][1])
In [131…
histplot(np.log(delhivery_data_v2['actual_time']),"Actual Time",ax[0][0])
qqplot(np.log(delhivery_data_v2['actual_time']), "qqplot for Actual Time",
histplot(np.log(delhivery_data_v2['osrm_time']),"OSRM time",ax[1][0])
qqplot(np.log(delhivery_data_v2['osrm_time']), "qqplot for OSRM Time", ax[1
Delhivery (6) 19/08/22, 10:36 PM
In [132…
fitted_actual_time,lmbda = stats.boxcox(delhivery_data_v2['actual_time'])
fitted_osrm_time,lmbda = stats.boxcox(delhivery_data_v2['osrm_time'])
histplot(fitted_actual_time,"Actual Time",ax[0][0])
qqplot(fitted_actual_time, "qqplot for Actual Time", ax[0][1])
histplot(fitted_osrm_time,"OSRM time",ax[1][0])
qqplot(fitted_osrm_time, "qqplot for OSRM time", ax[1][1])
Delhivery (6) 19/08/22, 10:36 PM

In [133…
anderson(fitted_actual_time)
Out[133…
2.5, 1. ]))
In [134…
anderson(fitted_osrm_time)
Out[134…
2.5, 1. ]))

In [135…
stat, p_value = pearsonr(fitted_actual_time, fitted_osrm_time)

In [136…
if p_value <= 0.05:
else:
Delhivery (6) 19/08/22, 10:36 PM
Inference
hypothesis and conclude actual time and OSRM time attribute are dependent on
each other.
Visual Analysis
In [137…
sns.distplot(delhivery_data_v2["actual_time"], label="actual_time")
sns.distplot(delhivery_data_v2["osrm_time"], label="osrm_time")
plt.legend()
plt.show()
In [138…
sns.scatterplot(data = delhivery_data_v2, x = 'actual_time', y = 'osrm_time'
plt.show()
Delhivery (6) 19/08/22, 10:36 PM
In [139…
sns.pointplot(data = delhivery_data_v2, x = 'actual_time', y = 'osrm_time')
<matplotlib.axes._subplots.AxesSubplot at 0x7f83bf405d10>
Out[139…

"Compare the difference between
actual_time aggregated value and
segment actual time aggregated value"
Coefficient



Delhivery (6) 19/08/22, 10:36 PM
In [140…
histplot(delhivery_data_v2['actual_time'],"Actual Time",ax[0][0])
qqplot(delhivery_data_v2['actual_time'], "qqplot for Actual Time", ax[0][1])
histplot(delhivery_data_v2['segment_actual_time'],"Segment Actual Time",ax[

qqplot(delhivery_data_v2['segment_actual_time'], "qqplot for Segment Actual Time
In [141…
histplot(np.log(abs(delhivery_data_v2['actual_time'])),"Actual Time",ax[0][
qqplot(np.log(abs(delhivery_data_v2['actual_time'])), "qqplot for Actual Time"
histplot(np.log(abs(delhivery_data_v2['segment_actual_time'])),"Segment Actual T
qqplot(np.log(abs(delhivery_data_v2['segment_actual_time'])), "qqplot for Segmen
Delhivery (6) 19/08/22, 10:36 PM
In [142…
fitted_actual_time,lmbda = stats.boxcox(delhivery_data_v2['actual_time'])
fitted_segment_actual_time,lmbda = stats.boxcox(delhivery_data_v2['segment_actua
histplot(fitted_actual_time,"Actual Time",ax[0][0])
qqplot(fitted_actual_time, "qqplot for Actual Time", ax[0][1])
histplot(fitted_segment_actual_time,"Segment Actual Time",ax[1][0])

qqplot(fitted_segment_actual_time, "qqplot for Segment Actual Time", ax[1][
Delhivery (6) 19/08/22, 10:36 PM

In [143…
anderson(fitted_actual_time)
Out[143…
2.5, 1. ]))
In [144…
anderson(fitted_segment_actual_time)
AndersonResult(statistic=39.9044356195227, critical_values=array([0.576, 0.
Out[144…
656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.
5, 1. ]))

In [145…
stat, p_value = pearsonr(fitted_actual_time, fitted_segment_actual_time)

In [146…
if p_value <= 0.05:
else:
Delhivery (6) 19/08/22, 10:36 PM
Inference
hypothesis and conclude actual time and Segment actual time attribute are
dependent on each other.
Visual Analysis
In [147…
sns.distplot(delhivery_data_v2["actual_time"], label="actual_time")
sns.distplot(delhivery_data_v2["segment_actual_time"], label="segment_actual_tim
plt.legend()
plt.show()
In [148…
sns.scatterplot(data = delhivery_data_v2, x = 'actual_time', y = 'segment_actual
plt.show()
Delhivery (6) 19/08/22, 10:36 PM
In [149…
sns.pointplot(data = delhivery_data_v2, x = 'actual_time', y = 'segment_actual_t
<matplotlib.axes._subplots.AxesSubplot at 0x7f83bc801b50>
Out[149…

"Compare the difference between osrm
distance aggregated value and segment
osrm distance aggregated value"
Coefficient



Delhivery (6) 19/08/22, 10:36 PM
In [150…
histplot(delhivery_data_v2['osrm_distance'],"OSRM Distance",ax[0][0])
qqplot(delhivery_data_v2['osrm_distance'], "qqplot for OSRM Distance", ax[0
histplot(delhivery_data_v2['segment_osrm_distance'],"Segment OSRM Distance"

qqplot(delhivery_data_v2['segment_osrm_distance'], "qqplot for Segment OSRM Dist
In [151…
histplot(np.log(delhivery_data_v2['osrm_distance']),"OSRM Distance",ax[0][0
qqplot(np.log(delhivery_data_v2['osrm_distance']), "qqplot for OSRM Distance"
histplot(np.log(delhivery_data_v2['segment_osrm_distance']),"Segment OSRM Distan

qqplot(np.log(delhivery_data_v2['segment_osrm_distance']), "qqplot for Segment O
Delhivery (6) 19/08/22, 10:36 PM
In [152…
fitted_osrm_distance,lmbda = stats.boxcox(delhivery_data_v2['osrm_distance'
fitted_segment_osrm_distance,lmbda = stats.boxcox(delhivery_data_v2['segment_osr
histplot(fitted_osrm_distance,"OSRM Distance",ax[0][0])
qqplot(fitted_osrm_distance, "qqplot for OSRM Distance", ax[0][1])
histplot(fitted_segment_osrm_distance,"Segment OSRM Distance",ax[1][0])

qqplot(fitted_segment_osrm_distance, "qqplot for Segment OSRM Distance", ax
Delhivery (6) 19/08/22, 10:36 PM

In [153…
anderson(fitted_osrm_distance)
Out[153…
2.5, 1. ]))
In [154…
anderson(fitted_segment_osrm_distance)
AndersonResult(statistic=70.3080027928263, critical_values=array([0.576, 0.
Out[154…
656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. , 5. , 2.
5, 1. ]))

In [155…
stat, p_value = pearsonr(fitted_osrm_distance, fitted_segment_osrm_distance

In [156…
if p_value <= 0.05:
else:
Delhivery (6) 19/08/22, 10:36 PM
Inference
hypothesis and conclude OSRM distance and Segment OSRM distance attribute are
Visual Analysis
In [157…
sns.distplot(delhivery_data_v2["osrm_distance"], label="osrm_distance")
sns.distplot(delhivery_data_v2["segment_osrm_distance"], label="segment_osrm_dis
plt.legend()
plt.show()
In [158…
sns.scatterplot(data = delhivery_data_v2, x = 'osrm_distance', y = 'segment_osrm
plt.show()
Delhivery (6) 19/08/22, 10:36 PM
In [159…
sns.pointplot(data = delhivery_data_v2, x = 'osrm_distance', y = 'segment_osrm_d
<matplotlib.axes._subplots.AxesSubplot at 0x7f83b8761250>
Out[159…

"Compare the difference between osrm
time aggregated value and segment osrm
time aggregated value"
Coefficient



Delhivery (6) 19/08/22, 10:36 PM
In [160…
histplot(delhivery_data_v2['osrm_time'],"OSRM Time",ax[0][0])
qqplot(delhivery_data_v2['osrm_time'], "qqplot for OSRM Time", ax[0][1])
histplot(delhivery_data_v2['segment_osrm_time'],"Segment OSRM Time",ax[1][0

qqplot(delhivery_data_v2['segment_osrm_time'], "qqplot for Segment OSRM Time"
In [161…
histplot(np.log(delhivery_data_v2['osrm_time']),"OSRM Time",ax[0][0])
qqplot(np.log(delhivery_data_v2['osrm_time']), "qqplot for OSRM Time", ax[0
histplot(np.log(delhivery_data_v2['segment_osrm_time']),"Segment OSRM Time"

qqplot(np.log(delhivery_data_v2['segment_osrm_time']), "qqplot for Segment OSRM
Delhivery (6) 19/08/22, 10:36 PM
In [162…
fitted_osrm_time,lmbda = stats.boxcox(delhivery_data_v2['osrm_time'])
fitted_segment_osrm_time,lmbda = stats.boxcox(delhivery_data_v2['segment_osrm_ti
histplot(fitted_osrm_time,"OSRM Time",ax[0][0])
qqplot(fitted_osrm_time, "qqplot for OSRM Time", ax[0][1])
histplot(fitted_segment_osrm_time,"Segment OSRM Time",ax[1][0])

qqplot(fitted_segment_osrm_time, "qqplot for Segment OSRM Time", ax[1][1])
Delhivery (6) 19/08/22, 10:36 PM

In [163…
anderson(fitted_osrm_time)
Out[163…
2.5, 1. ]))
In [164…
anderson(fitted_segment_osrm_time)
Out[164…
2.5, 1. ]))

In [165…
stat, p_value = pearsonr(fitted_osrm_time, fitted_segment_osrm_time)

In [166…
if p_value <= 0.05:
else:
Delhivery (6) 19/08/22, 10:36 PM
Inference
hypothesis and conclude OSRM time and Segment OSRM time attribute are
Visual Analysis
In [167…
sns.distplot(delhivery_data_v2["osrm_time"], label="osrm_time")
sns.distplot(delhivery_data_v2["segment_osrm_time"], label="segment_osrm_time"
plt.legend()
plt.show()
In [168…
sns.scatterplot(data = delhivery_data_v2, x = 'osrm_time', y = 'segment_osrm_tim
plt.show()
Delhivery (6) 19/08/22, 10:36 PM
In [169…
sns.pointplot(data = delhivery_data_v2, x = 'osrm_time', y = 'segment_osrm_time'
<matplotlib.axes._subplots.AxesSubplot at 0x7f83a6036290>
Out[169…
Normalize/ Standardize the numerical

features using MinMaxScaler or
StandardScaler
In [170…
num_cols = ["od_time_taken","start_scan_to_end_scan", "actual_distance_to_destin
Standard Scaling
In [171…
scaler = preprocessing.StandardScaler()
standard_df = scaler.fit_transform(delhivery_data_v2[num_cols])
delhivery_data_v4 = pd.DataFrame(standard_df)
delhivery_data_v4.columns = num_cols
Min-Max Scaling
In [172…
scaler = preprocessing.MinMaxScaler()
minmax_df = scaler.fit_transform(delhivery_data_v2[num_cols])
delhivery_data_v5 = pd.DataFrame(minmax_df)
delhivery_data_v5.columns = num_cols
In [173…
num_cols
Delhivery (6) 19/08/22, 10:36 PM
['od_time_taken',
Out[173…
'start_scan_to_end_scan',
'actual_distance_to_destination',
'actual_time',
'osrm_time',
'osrm_distance',
'segment_actual_time',
'segment_osrm_time',
'segment_osrm_distance']
In [174…
fig, (ax1, ax2, ax3) = plt.subplots(ncols = 3, figsize =(20, 5))
ax1.set_title('Before Scaling')
sns.kdeplot(delhivery_data_v2['od_time_taken'], ax = ax1, color ='r')

sns.kdeplot(delhivery_data_v2['start_scan_to_end_scan'], ax = ax1, color ='b'
ax2.set_title('After Standard Scaling')
sns.kdeplot(delhivery_data_v4['od_time_taken'], ax = ax2, color ='red')

sns.kdeplot(delhivery_data_v4['start_scan_to_end_scan'], ax = ax2, color ='blue'
ax3.set_title('After Min-Max Scaling')
sns.kdeplot(delhivery_data_v5['od_time_taken'], ax = ax3, color ='black')

sns.kdeplot(delhivery_data_v5['start_scan_to_end_scan'], ax = ax3, color ='g'
plt.show()
In [175…
ax[0].set_title('Before Scaling')
sns.kdeplot(delhivery_data_v2[col], ax = ax[0], color ='red')
ax[1].set_title('After Standard Scaling')
sns.kdeplot(delhivery_data_v4[col], ax = ax[1], color ='blue')
ax[2].set_title('After Min-Max Scaling')
sns.kdeplot(delhivery_data_v5[col], ax = ax[2], color ='green')
plt.show()
Delhivery (6) 19/08/22, 10:36 PM
Delhivery (6) 19/08/22, 10:36 PM
Inference
After normalization, All the numerical attributes got to a similiar scale ranges from 0
to 1
After standardization, It translates the data to the mean vector of original data to the
origin and squishes or expands.
Business Insights
1,44,867 number of records and 17 attributes are present in this dataset.
Source and destination name attributes are having small number of missing values.
Labels are also inconsitent in source and destination name attributes.
All the numerical attributes mean and median values are not close to each other
which clearly indicates data is not normally distributed.
Also the range of numerical attributes are widely distributed which shows there
Delhivery (6) 19/08/22, 10:36 PM
might be some outliers present in the data.
Min value of segment actual time is -244.
More number of FTL route types are present in raw data but we cannot conclude
before aggregating the rows.
Gurgaon_Bilaspur_HB (Haryana) seems to be most popular source and destination

center.
Average time taken to deliver from source to destination are relatively higher where
the number of transits are high.
Most of the one way parcels are likely to have least number of transits.
More number of round trips takes more than 2 transits to deliver.
All the numerical attributes are rightly skewed and requires some treatment.
Number of trips are dropping extensively in recent days.
60% of deliveries used carts.
Gurgaon_Bilaspur_HB (Haryana), Bhiwandi_Mankoli_HB (Maharastra) and

Bangalore_Nelmngla_H (Karnataka) are the most popular source centers.
Gurgaon_Bilaspur_HB (Haryana), Bangalore_Nelmngla_H (Karnataka) and

Bhiwandi_Mankoli_HB (Maharastra) are the most popular destination centers.
All these data captured in September and October 2018.
New feature Trip creation year, month, day, date and time information are extracted
from trip creation time attribute.
City, Place and Area information are extracted from both source and destination
name attribute.
New feature OD time taken is calculated based on the difference between OD Start
time and OD end time.
Almost all the numerical attributes are strongly linearly correlated with each other.
In all months, Carts are highly used compared to full truck loads.
More number of parcels are started in september compared to october.
More number of parcels are delivered in september compared to october.
Delhivery (6) 19/08/22, 10:36 PM
Karnataka, Maharastra, Tamilnadu, Haryana are the top states from where the
parcels are originated.
From Karnataka, Maharastra, Haryana and Tamilnadu, More number of parcels are
sent in carts compared to full truck loads.
Karnataka, Maharastra, Tamilnadu, Haryana and Telangana are the top states to
where the parcels are delivered.
In Karnataka, Maharastra, Haryana and Tamilnadu, More number of parcels are

delivered by carts compared to full truck loads.
Karnataka, Maharastra, Tamilnadu, Haryana and Telangana are the top states
involved in more number of trips.
In Karnataka, Maharastra and Haryana, One way parcels are more preferred.
In Tamilnadu and West Bengal, One way and round way type of parcels are equally
likely used.
OD_start_time / OD_end_time and start scan to end scan attribute are closely related
with each other
Actual time and OSRM time attribute are closely related with each other
Actual time and Segment actual time attribute are closely related with each other
OSRM distance and Segment OSRM distance attribute are closely related with each
other
OSRM time and Segment OSRM time attribute are closely related with each other
hypothesis and conclude OSRM time and Segment OSRM time attribute are
hypothesis and conclude OSRM distance and Segment OSRM distance attribute are
hypothesis and conclude actual time and Segment actual time attribute are
hypothesis and conclude actual time and OSRM time attribute are dependent on
each other.
Delhivery (6) 19/08/22, 10:36 PM
hypothesis and conclude od_start_time / od_end_time and start scan to end scan
attribute are dependent on each other
Recommendations
1. Delhivery company can increase their business by giving offers / discounts to busiest
corridor under busiest state
2. Delhivery company can increase their business by giving offers / discounts to route
type FTL
3. Delhivery company should focus more on Southern states as more parcels are
orginated and delivered
4. Delhivery company should plan to use shortest path from source to destination
center

Delhivery Mani

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Delhivery Mani

Uploaded by

Copyright:

Available Formats

Delhivery (6) 19/08/22, 10:36 PM

data - tells whether the data is testing or training data

Out[42]: data trip_creation_time route_schedule_uuid route_type trip_uuid source_c

Drop Unknown Fields

Observations on shape & data types of all

Index(['data', 'trip_creation_time', 'route_schedule_uuid', 'route_type',

Conversion of Categorical attributes

Analyzing basic statistics about each

Out[49]: count mean std min 25%

start_scan_to_end_scan 144867.0 961.262986 1037.012769 20.000000 161.000000

actual_distance_to_destination 144867.0 234.073372 344.990009 9.000045 23.355874

actual_time 144867.0 416.927527 598.103621 9.000000 51.000000

osrm_time 144867.0 213.868272 308.011085 6.000000 27.000000

osrm_distance 144867.0 284.771297 421.119294 9.008200 29.914700

segment_actual_time 144867.0 36.196111 53.571158 -244.000000 20.000000

segment_osrm_time 144867.0 18.507548 14.775960 0.000000 11.000000

segment_osrm_distance 144867.0 22.829020 17.860660 0.000000 12.070100

Out[50]: count unique top freq

data 144867 2 training 104858

route_type 144867 2 FTL 99660

trip_uuid 144867 14817 trip-153811219535896559 101

source_center 144867 1508 IND000000ACB 23347

source_name 144574 1498 Gurgaon_Bilaspur_HB (Haryana) 23347

destination_center 144867 1481 IND000000ACB 15192

destination_name 144606 1468 Gurgaon_Bilaspur_HB (Haryana) 15192

Out[51]: count unique top freq

data 144867 2 training 104858

route_type 144867 2 FTL 99660

source_center 144867 1508 IND000000ACB 23347

destination_center 144867 1481 IND000000ACB 15192

start_scan_to_end_scan 144867.0 NaN NaN NaN

actual_distance_to_destination 144867.0 NaN NaN NaN

actual_time 144867.0 NaN NaN NaN

osrm_time 144867.0 NaN NaN NaN

osrm_distance 144867.0 NaN NaN NaN

segment_actual_time 144867.0 NaN NaN NaN

segment_osrm_time 144867.0 NaN NaN NaN

segment_osrm_distance 144867.0 NaN NaN NaN

Non-Graphical Analysis: Value counts and

Unique values (names) are checked for each

Unique values of data are : ['training', 'test']

Unique values of route_type are : ['Carting', 'FTL']

Unique values (counts) are checked for each

Missing value detection

Total records = 144867

source_name 293 0.20

destination_name 261 0.18

Out[57]: data trip_creation_time route_schedule_uuid route_type trip_uuid

... ... ... ... ...

293 rows × 19 columns

Out[58]: data trip_creation_time route_schedule_uuid route_type trip_uuid

... ... ... ... ...

261 rows × 19 columns

Merging of rows and aggregation of fields

Out[61]: data trip_uuid trip_creation_time route_type source_center

trip- 2018-09-12 Kanpur_

trip- 2018-09-12 Doddablpur_

trip- 2018-09-12 Bangalore_

... ... ... ... ... ...

trip- 2018-10-03 Tirchchndr_S

trip- 2018-10-03 Thisayanvilai_

trip- 2018-10-03 Peikulam_S

trip- 2018-10-03 Sandur_W

26369 rows × 18 columns