
Anticipation vs. Reactive Reoptimization for Dynamic Vehicle Routing with Stochastic Requests

Marlin W. Ulmer

Abstract

Due to new business models and technological advances, dynamic vehicle routing is gaining
increasing interest. In particular, solving dynamic vehicle routing problems with stochastic
customer requests is becoming increasingly important, for example, in e-commerce and same-
day delivery. Solving these problems is challenging because it requires optimization along
two dimensions. First, as a reaction to new customer requests, current routing plans need to
be reoptimized. Second, potential future requests need to be anticipated in current decision
making. Decisions need to be derived in real-time. The limited time often prohibits extensive
optimization in both dimensions, and the question arises how to utilize the limited calculation
time effectively. In this paper, we analyze the merits of reactive route reoptimization and
anticipation for a dynamic vehicle routing problem with stochastic requests. To this end, we
compare an existing method from each dimension as well as a policy allowing for a tunable
combination of the two approaches. We show that the appropriate optimization focus is
strongly connected to the degree of dynamism, the percentage of unknown requests. We also
show that our combination does not provide a significant benefit compared to the best
individual optimization dimension.
Keywords: Dynamic Vehicle Routing, Stochastic Requests, Degree of Dynamism, Reoptimiza-
tion, Anticipation, Mixed-Integer Programming, Approximate Dynamic Programming

1 Introduction
In a dynamic vehicle routing problem (DVRP), information is uncertain and revealed over time,
forcing repeated adaptations of routing plans. Dynamic routing problems have become a center of
attention in the research community, driven by technological advances and new business models
[13, 48]. The recent review by Psaraftis et al. [36] shows a substantial increase in research on
DVRPs in recent years. DVRPs are expected to be a main future research field in city logistics
[41]. The literature review by Psaraftis et al. [36] identifies stochastic customer requests as the main
driver of uncertainty in practice, e.g., for emergency vehicles, technical, healthcare, and courier
services as well as parcel and passenger transportation in the growing fields of same-day delivery,
shared mobility, or demand responsive transportation. Other examples of uncertainty are stochastic
service times [54], demands [16], and travel times [42].
Even though routing applications with uncertain requests differ in their objectives and constraints,
they all share the requirement of assigning new requests to vehicles and of determining efficient
routing plans for the fleet. On the operational level, they are often limited in their accessible
resources such as number of vehicles or working hours of drivers. Most of the approaches aim at
utilizing these limited resources to maximize revenue or the number of served customers. Because
drivers and customers often expect fast responses, decision making needs to be conducted in
real-time and the time available for calculations is highly limited [44].

1.1 The Degree of Dynamism

DVRPs with stochastic requests differ in their level of uncertainty, in particular, they differ in the
question of how many customers are initially known and how many customers may stochastically
request over the time horizon. We refer to the initially known customers as early request customers
(ERC) and the stochastically requesting customers as late request customers (LRC). Larsen et al.
[24] denote the expected percentage of uncertain requests as the degree of dynamism (DOD). The
DOD is defined as the ratio between the expected number of late request customers #LRC and the
overall number of customers #ERC + #LRC:

$$\text{DOD} = \frac{\#\text{LRC}}{\#\text{ERC} + \#\text{LRC}}.$$

The DOD is one main dimension to classify DVRPs with stochastic requests. A moderate DOD
may be experienced for applications such as oil distribution, patient transports, or grocery delivery
[38, 9]. The range of applications with high DOD comprises emergency vehicles or courier services
[27, 47]. High DOD can especially be found in emerging applications such as same-day delivery,
demand-responsive transportation, or shared mobility [53, 17, 6]. For a detailed classification of
DVRP-applications, the interested reader is referred to [48, pp. 32-37].

1.2 Reactive Reoptimization vs. Anticipation

In DVRPs, a tentative route plan is repeatedly updated. Because decisions are made in real-
time, the calculation time for updates is usually limited and heuristic methods are applied. As a
review of the literature reveals, these heuristics usually focus on one of two dimensions: reactive
reoptimization and anticipation. The first dimension spends the majority of calculation time to
reactively reoptimize routing plans based on the information currently accessible. The second
dimension invests the calculation time available to anticipate potential future developments such as
new customer requests or future routing updates in their decision making.
Reactive reoptimization (from now on called “reoptimization”) has a long history in the dynamic
vehicle routing literature due to the comprehensive body of knowledge on static and deterministic
routing. These heuristics solve static mixed-integer routing models in every state of the problem.
They ignore that the information in dynamic routing problems may change and that routes can be
altered in the future.
With advances in information and communication technology, the anticipation of stochastic
information about potential future developments becomes relevant. Gendreau et al. [13] see the
informational process as “an important dimension to consider” in the optimization decisions. As
Speranza [44] and Savelsbergh and Van Woensel [41] state, anticipation is a major challenge in
DVRP research. Anticipatory solution approaches often draw on simulation of future information
and routing developments to evaluate the decisions in a current state. However, due to the limited
calculation time, they usually draw on simple routing heuristics and typically neglect routing
reoptimization.
Therefore, both dimensions have advantages and disadvantages. Reoptimization may free
valuable resources by creating more efficient routing plans, while anticipation may incorporate

important future developments in current decision making allowing for more flexible and effective
routing and assignment decisions.

1.3 Purpose of this Paper

The main purpose of this paper is to analyze from which dimension DVRPs with stochastic requests
should be approached with respect to their DOD. To this end, we compare and combine two existing
solution approaches, one from each dimension: reactive reoptimization by means of mixed-integer
programming and anticipation based on approximate dynamic programming. We also introduce
a straightforward combination of both dimensions. The combined approach is tunable to allow
shifting the focus between the two directions. To analyze the merits of both optimization dimensions,
we run a computational evaluation on the vehicle routing problem with stochastic requests (VRPSR,
[47]) and a variety of instance settings mimicking the conditions of different practical applications
with varying DOD. As we show, there is no dominating optimization dimension. However, the
optimization focus should depend on the degree of uncertainty. A low DOD demands an
emphasis on reoptimization while moderate and high DODs require anticipation. We also show that
the combination does not provide significantly better results than the best individual optimization
approach regardless of the instance setting. We further provide an analysis of how instance specifics
and runtime availability affect the two optimization dimensions.
Our contributions are as follows. This paper is the first to explicitly compare and quantify
merits and shortcomings of reoptimization and anticipation. The experiments in this paper identify
the DOD as a strong indicator for the suitability of the two optimization dimensions. The paper
also presents insight into how the suitability of an optimization dimension depends on the instance
characteristics (for example, the request distribution) and on the runtime available in a decision point.
This paper further indicates that combining these two dimensions is not trivial and discusses potential
future directions of combinations. In summary, the paper provides a quantitative foundation to
support the emerging conversation in the community about how dynamic vehicle routing problems should
be approached. To foster this discussion, a comprehensive outlook on future challenges for the
dynamic vehicle routing community is given.
The paper is organized as follows. In §2, we define and model the VRPSR. We further analyze
how work on the VRPSR (or related problems) addresses the two optimization components. We

present our approach in §3 and conduct the computational study in §4. The paper concludes with a
summary and an outlook in §5.

2 Problem Definition: The VRPSR


To analyze the performance of anticipatory approaches and reoptimization, we draw on the vehicle
routing problem with stochastic requests (VRPSR) containing the two major decision components
of DVRPs: the assignment and the routing decision. However, we reduce the complexity of the
problem by considering one vehicle only. The assignment decision is therefore condensed to the
question whether to accept a request for service by the vehicle or to reject the request. In the
following, we first present the problem statement. We then model the problem as a Markov decision
process. Finally, we present literature related to the VRPSR and analyze the solution methods presented therein.

2.1 Problem Statement

In the VRPSR, a vehicle serves customers in a service area within a time horizon. The tour starts and
ends in a depot. A set of early request customers (ERC) is known in advance. These customers are
required to be served. Over the course of the day, new late request customers (LRC) stochastically
request service. These late request customers are unknown until they request. Their request times
and locations follow a known probability distribution. Whenever the vehicle serves a customer, the
dispatcher decides which subset of the requests that have occurred to assign to the vehicle and how to
update the planned tour.¹ If an LRC is assigned, the vehicle is required to serve this customer in the
remainder of the time horizon. The dispatcher aims at maximizing the number of assigned LRC.

¹The time of decision making is connected to the business model at hand. Some business models may require
decision making whenever a customer requests. However, [50] show that the tendencies of the solution methods hold
for both modeling choices.

2.2 Markov Decision Process

In the following, we model the problem as a Markov decision process (MDP, [37]). The notation
is listed in Table 1.

Table 1: Notation of the Markov Decision Process

Notation      Description
Sk            Decision State
tk            Point of Time
Pk            Position of the Vehicle
Ck            Set of Customers to Serve
θk            Planned Tour
Cknew         New Requests
x             Decision
Ckx,assign    Set of Newly Assigned Customers
Ckx           Updated Set of Customers to Serve
θkx           Updated Planned Tour
R(Sk, x)      Reward of Decision x in State Sk
Skx           Post-Decision State
ωk            Transition

A decision epoch k occurs whenever the vehicle serves a customer. A state Sk = (tk, Pk, Ck, θk, Cknew)
consists of the point of time tk, the vehicle’s position Pk, the set Ck of not-yet-served ERC and
assigned LRC as well as their planned sequence, the tour θk =
(Pk, Ck1, Ck2, . . . , Ckn, D). Further, Sk contains the set of l new requests Cknew = {Cj, . . . , Cj+l},
where l is a realization of a random variable representing the number of new LRC requests.
A decision x selects the subset Ckx,assign of Cknew of new LRCs to assign, leading
to an updated set of customers to serve Ckx = Ck ∪ Ckx,assign. A decision further updates the tour
θk to θkx that sequences Ckx and determines the next customer to visit. A decision is feasible if the
planned tour allows serving all customers and returning to the depot within the time limit. The
immediate reward R(Sk, x) is the number of newly assigned LRC. We split the transition
between states into a decision transition to a post-decision state and a stochastic transition from
a post-decision state to a new decision state. A post-decision state Skx = (tk, Ckx, θkx) consists of
the point of time, the not yet served customers Ckx , and the planned tour θkx including the vehicle’s
position. The stochastic transition ωk results from the vehicle’s travel, service of the next customer,
and a new (stochastic) set of LRCs. The transition updates the point of time, the vehicle position,
the set of customers yet-to-serve, and the set of newly arrived LRCs. The combination of decision
and transition leads to a new decision state Sk+1 . The process terminates in state SK when the time
limit is reached and the vehicle has returned to the depot.
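To make the state and transition structure concrete, the following is a minimal sketch of these objects in Java (the implementation language used in §3.4); the class and field names are illustrative and not taken from the original implementation.

```java
import java.util.List;

// Minimal sketch of the MDP objects; names are illustrative only.
final class Customer {
    final double x, y;          // location in the service area (km)
    Customer(double x, double y) { this.x = x; this.y = y; }
}

final class State {             // decision state S_k
    double time;                // t_k, point of time (minutes)
    Customer position;          // P_k, current position of the vehicle
    List<Customer> toServe;     // C_k, not-yet-served ERC and assigned LRC
    List<Customer> plannedTour; // theta_k, planned sequence ending at the depot
    List<Customer> newRequests; // C_k^new, LRC that just requested
}

final class Decision {          // decision x
    List<Customer> assigned;    // C_k^{x,assign}, subset of the new requests to accept
    List<Customer> updatedTour; // theta_k^x, updated planned tour
    int reward() { return assigned.size(); } // R(S_k, x)
}

final class PostDecisionState { // post-decision state S_k^x
    double time;                // t_k
    List<Customer> toServe;     // C_k^x
    List<Customer> plannedTour; // theta_k^x, includes the vehicle's position
}
```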

Figure 1: State, Decision, and Post-Decision State

2.3 Example

Figure 1 presents an example of a decision state, a decision, and the resulting post-decision state.
State Sk = (60, Pk, {C1, C2, C5}, (Pk, C1, C2, C5, D), {C6, C7}) is depicted on the left side at
time tk = 60. The vehicle just served a customer and the current position of the vehicle is in the
upper right portion of the service area. Three customers C1, C2, C5 are already assigned. The
planned tour θk is indicated by the dashed line and starts in Pk, serves customers in the order
C1, C2, C5, and returns to the depot. Two new customers Cknew = {C6, C7} have requested service.
Decision x = (C6, (Pk, C6, C5, C2, C1, D)) determines the assignment Ckx,assign =
{C6} and updates the tour to θkx = (Pk, C6, C5, C2, C1, D), changing the previous sequence of
customers. The immediate reward is R(Sk, x) = 1. The application of x leads to post-decision
state Skx = (60, {C1, C2, C5, C6}, (Pk, C6, C5, C2, C1, D)). The next location to travel to is C6,
indicated by the solid line. When arriving at C6, a set of new requests is revealed, leading to the
next state Sk+1.

2.4 Solutions and Optimal Policies

The objective for the VRPSR is to identify an optimal decision policy π∗ ∈ Π, with Π the overall set
of policies. A policy π is a sequence of decision rules (X0π, . . . , XKπ) assigning every state Sk to a
decision Xkπ(Sk). Decision rule Xkπ(Sk) is the decision rule for state Sk induced by π in
decision epoch k. An optimal policy π∗ maximizes the expected sum of rewards over all decision epochs
k = 0, . . . , K when subsequently applying π∗, as depicted in Equation (1):

$$\pi^* = \arg\max_{\pi \in \Pi} \, \mathbb{E}\left[\left.\sum_{k=0}^{K} R\big(S_k, X_k^{\pi}(S_k)\big) \,\right|\, S_0\right]. \qquad (1)$$

For every state Sk , π ∗ satisfies the Bellman Equation as depicted in Equation (2):
$$X_k^{\pi^*}(S_k) = \arg\max_{x \in X(S_k)} \left\{ R(S_k, x) + \mathbb{E}\left[\left.\sum_{j=k+1}^{K} R\big(S_j, X_j^{\pi^*}(S_j)\big) \,\right|\, S_k\right] \right\}. \qquad (2)$$

In every decision epoch k = 0, . . . , K, π ∗ selects the decision from the overall set of decisions
X(Sk ) given Sk maximizing the sum of immediate reward and the expected future rewards. The
second term of Equation (2) is the expected sum of rewards in decision epochs k + 1, . . . , K given
post-decision state Skx. This term is also called the value V(Skx) of the post-decision state Skx.
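Using this value, Equation (2) can be rewritten compactly over post-decision states; this is a standard equivalent formulation:

$$V(S_k^x) = \mathbb{E}\left[\left.\sum_{j=k+1}^{K} R\big(S_j, X_j^{\pi^*}(S_j)\big) \,\right|\, S_k^x\right],
\qquad
X_k^{\pi^*}(S_k) = \arg\max_{x \in X(S_k)} \big\{ R(S_k, x) + V(S_k^x) \big\}.$$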

2.5 Literature Review

In the following, we analyze the related literature. Because the number of recent reviews is vast
[32, 36, 39], we solely classify the approaches with respect to their optimization focus. Table 2
shows a selection of work on DVRPs with stochastic requests similar to the VRPSR. In all of the
considered problems, routing and assignment decisions are made dynamically over the time horizon.
Approaches are classified under the term reoptimization if the majority of calculation time is spent
to solve the routing problem based on current information. Approaches focusing on estimation of
the impact of current decisions on future rewards (or costs) are classified as anticipatory. There is
a minority of papers balancing the calculation time between both research directions. We denote these
approaches as combination. Notably, as we show in our computational evaluation, a combination is
not necessarily advantageous.
Early works analyze the impact of routing heuristics or waiting strategies on the number of
served requests and travel duration [35, 3, 46, 24, 30, 47]. They neither reoptimize their routes nor
anticipate potential future developments.
Reoptimization is mainly achieved via mixed-integer programming (MIP) often combined
with metaheuristics applied on a rolling horizon [12, 45, 19, 11, 5, 25, 31, 43, 1]. In some cases,
reoptimization approaches are combined with a mild anticipation. These methods do not rely
on the Bellman Equation or simulation of the MDP but acknowledge potential future requests
implicitly, for example by means of dummy customers [20, 10]. Thus, these methods do not

Table 2: Literature Classification

Reoptimization Combination Anticipation


Psaraftis (1980) [35]
Bertsimas and Van Ryzin (1991) [3]
Tassiulas (1996) [46]
Gendreau et al. (1999) [12] X
Swihart and Papastavrou (1999) [45] X
Ichoua et al. (2000) [19] X
Larsen et al. (2002) [24]
Mitrovic-Minic and Laporte (2004) [30]
Bent and Van Hentenryck (2004) [2] X
Ichoua et al. (2006) [20] X
Gendreau et al. (2006) [11] X
Chen et al. (2006) [8] X
Hvattum et al. (2006) [18] X
Thomas (2007) [47] X
Branchini et al. (2009) [5] X
Ghiani et al. (2009) [14] X
Meisel (2011) [28] X
Ferrucci et al. (2013) [10] X
Lin et al. (2014) [25] X
Ninikas and Minis (2014) [31] X
Schyns (2015) [43] X
Arslan et al. (2018) [1] X
Klapp et al. (2018) [23] X
Ulmer et al. (2018) [51] X
Voccia et al. (2018) [53] X
Ulmer et al. (2018) [49] X

explicitly incorporate potential future developments in their current routing updates. To analyze the
impact of reoptimization in our case study, we solve the static MIP based on current information by
means of rolling horizon reoptimization (RHR). We define the approach in §3.1.
Anticipation is mainly achieved by means of approximate dynamic programming [28, 51, 49, 23]
estimating the reward-to-go for states and decisions. Due to computational necessity, routing is
often conducted via basic routing heuristics like cheapest insertion. In the case of Klapp et al. [23],
routing is even simplified to the real line. Ulmer et al. [51] present an offline method to decide
about customer assignments based on the expected value of a decision taken. Offline methods
are highly valuable, because their online application usually requires only a fraction of a second.
However, preliminary tests on combining this method with route reoptimization led to inferior

results because the reoptimization could not be integrated in the extensive offline simulation runs.
The offline value evaluation could, therefore, not be transferred to the reoptimized online routes.
The current state-of-the-art heuristic for the VRPSR is presented by Ulmer et al. [49] combining
online and offline simulations within a rollout algorithm (RA). We select this approach for our
computational evaluation and describe the procedure in §3.2.
There exists a small fraction of works combining reoptimization and anticipation. These
approaches evaluate a limited set of routing plans by means of sampling, in particular, by strictly
limiting the sampling horizon and neglecting any future adaptations of routing plans [14]. Most
prominent is the multiple-scenario approach (MSA) by Bent and Van Hentenryck [2], also applied
in [18] and [53]. For the MSA, a set of scenarios is generated. For each scenario, a routing plan is
derived. The plan “most similar” to the other plans is then implemented. However, the MSA is not
able to reflect future dynamic decision making. Further, Ghiani et al. [15] and Ulmer et al. [51]
show that for the VRPSR, the MSA is inferior to methods of approximate dynamic programming.

3 Approaches
Solving the Bellman Equation to optimality is challenging due to the curses of dimensionality in
state, decision, and transition space [33]. Thus, we can only apply heuristics. In this section, we
present and combine one heuristic approach from each dimension, reoptimization and anticipation.
The first approach focuses on current routing reoptimization via mixed-integer programming on a
rolling horizon (RHR), freeing resources to serve future requests. This approach is presented in §3.1.
The second one, the rollout algorithm (RA), solely focuses on anticipatory optimization, estimating
the reward-to-go by means of simulation while drawing on a cheapest insertion routing heuristic
(CI, [40]). We then propose a scheme to combine these two approaches in §3.3. Finally, we present
tuning details.
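Since CI is the routing workhorse of the policies considered below (the myopic benchmark, the subset selection of RHR, and the base policy of the RA), the following Java sketch illustrates a single insertion step; the coordinate-based tour representation and the simplified feasibility check are our own and do not reproduce the original implementation.

```java
import java.util.List;

// Minimal cheapest-insertion sketch; names and the feasibility check are simplified.
final class CheapestInsertion {

    // Euclidean distance scaled by 1.5 to mimic a road network, converted to minutes at 20 km/h.
    static double travelTime(double[] a, double[] b) {
        return 1.5 * Math.hypot(a[0] - b[0], a[1] - b[1]) / 20.0 * 60.0;
    }

    // Inserts the request at the position causing the smallest detour. Returns the added duration,
    // or Double.POSITIVE_INFINITY if even the cheapest insertion exceeds the remaining slack
    // (a simplified stand-in for the full feasibility check against the time limit).
    static double insert(List<double[]> tour, double[] request, double remainingSlack) {
        int bestPos = -1;
        double bestDetour = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < tour.size(); i++) {
            double[] from = tour.get(i), to = tour.get(i + 1);
            double detour = travelTime(from, request) + travelTime(request, to) - travelTime(from, to);
            if (detour < bestDetour) { bestDetour = detour; bestPos = i + 1; }
        }
        if (bestPos < 0 || bestDetour > remainingSlack) return Double.POSITIVE_INFINITY;
        tour.add(bestPos, request);
        return bestDetour;
    }
}
```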

3.1 Rolling Horizon Reoptimization

A direct optimization of the reward function of the VRPSR would require solving a traveling
salesman problem (TSP) for every one of the 2^{|Cknew|} subsets of new requests. Then, the TSP maximizing
the number of newly integrated requests would be selected. Due to the exponential increase of

TSPs to solve per new request, this cannot be achieved in real-time. Hence, the RHR draws on
an insertion heuristic to select the requests to assign. Then, RHR minimizes the duration of the
resulting tour. This procedure is common and, for example, applied in Gendreau et al. [12] or Bent
and Van Hentenryck [2]. More precisely, RHR assigns the maximal number of requests that can
be feasibly inserted in the current tour θk via CI. The insertion leads to a new planned tour θ̄k. Then,
routing plan θ̄k is optimized with respect to the overall travel duration. To this end, we draw on the
solution of a mixed-integer program (MIP) of an open TSP as defined in the following.
Let a planned tour θ̄k with n − 1 customers plus the current location and the depot be given.
We denote the current location as 0 and the depot as n. Let further cij be the travel time between
locations i and j. As decision variables, we define yij , ∀i, j ∈ {0, . . . , n}:

$$y_{ij} = \begin{cases} 1 & \text{if travel from } i \text{ to } j \text{ is planned,} \\ 0 & \text{else.} \end{cases}$$

For subtour elimination, we use Miller-Tucker-Zemlin subtour elimination constraints [29] and
the corresponding decision variables ui . This leads to the following adaption of the well-known
TSP-model:

\begin{align*}
\min \ & \sum_{i=0}^{n-1} \sum_{j=0}^{n} c_{ij} \cdot y_{ij} \tag{0} \\
\text{s.\,t.} \ & y_{n0} = 1 \tag{1} \\
& \sum_{j=0}^{n} y_{ij} = 1, \quad \forall i \in \{0, \dots, n\} \tag{2} \\
& \sum_{i=0}^{n} y_{ij} = 1, \quad \forall j \in \{0, \dots, n\} \tag{3} \\
& u_i - u_j + (n+1) \cdot y_{ij} \le n, \quad \forall i, j \in \{1, \dots, n\} \tag{4} \\
& y_{ij} \in \{0, 1\}, \ u_i \in \mathbb{R}, \quad \forall i, j \in \{0, \dots, n\} \tag{5}
\end{align*}

The objective is to minimize the travel time. Constraints (2) and (3) ensure the service of every
customer. Constraint (4) represents the Miller-Tucker-Zemlin subtour elimination constraints. We

apply two adaptations. First, we enforce a connection between the depot and the current location
of the vehicle in condition (1). Second, the objective function in (0) ignores the corresponding
travel duration between depot and current location. The solution of the MIP then induces the next
customer to visit. The resulting decision policy RHR applies the combination of greedy subset
selection and solving the MIP on a rolling horizon.
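To illustrate, the following is a minimal sketch of this open-TSP model using the CPLEX Concert Java interface (the solver used in §3.4); the surrounding class, the method structure, and details such as warm-starting with the current tour are our own simplifications, and the MTZ variables are bounded in [0, n] rather than left free, which does not cut off any tour.

```java
import ilog.concert.IloException;
import ilog.concert.IloIntVar;
import ilog.concert.IloLinearNumExpr;
import ilog.concert.IloNumVar;
import ilog.cplex.IloCplex;

// Minimal sketch of the open-TSP reoptimization model (0)-(5); illustrative only.
final class OpenTspMip {

    // c is the (n+1)x(n+1) travel-time matrix; index 0 is the current location, index n the depot.
    static int[] solve(double[][] c, double timeLimitSeconds) throws IloException {
        int n = c.length - 1;
        IloCplex cplex = new IloCplex();

        IloIntVar[][] y = new IloIntVar[n + 1][n + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= n; j++)
                y[i][j] = cplex.boolVar("y_" + i + "_" + j);
        IloNumVar[] u = cplex.numVarArray(n + 1, 0, n); // MTZ variables

        // Objective (0): travel time, ignoring arcs leaving the depot n.
        IloLinearNumExpr obj = cplex.linearNumExpr();
        for (int i = 0; i < n; i++)
            for (int j = 0; j <= n; j++)
                obj.addTerm(c[i][j], y[i][j]);
        cplex.addMinimize(obj);

        cplex.addEq(y[n][0], 1); // (1): artificial arc from the depot back to the current location

        for (int i = 0; i <= n; i++) {       // (2) and (3): each location left and entered exactly once
            IloLinearNumExpr out = cplex.linearNumExpr();
            IloLinearNumExpr in = cplex.linearNumExpr();
            for (int j = 0; j <= n; j++) { out.addTerm(1, y[i][j]); in.addTerm(1, y[j][i]); }
            cplex.addEq(out, 1);
            cplex.addEq(in, 1);
        }

        for (int i = 1; i <= n; i++)         // (4): Miller-Tucker-Zemlin subtour elimination
            for (int j = 1; j <= n; j++) {
                IloLinearNumExpr mtz = cplex.linearNumExpr();
                mtz.addTerm(1, u[i]);
                mtz.addTerm(-1, u[j]);
                mtz.addTerm(n + 1, y[i][j]);
                cplex.addLe(mtz, n);
            }

        cplex.setParam(IloCplex.DoubleParam.TiLim, timeLimitSeconds);
        cplex.solve();

        // Read the successor of every location to recover the reoptimized tour.
        int[] next = new int[n + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= n; j++)
                if (cplex.getValue(y[i][j]) > 0.5) next[i] = j;
        cplex.end();
        return next;
    }
}
```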

3.2 Rollout Algorithm

For anticipation, we draw on the current state-of-the-art approach, the online-offline rollout algo-
rithm by Ulmer et al. [49]. The RA focuses on the assignment decisions and applies CI to determine
routing. The idea of the RA is to estimate the second term of the Bellman Equation for a state by
simulating a number of trajectories into the future. Within these simulation runs, decision making is
conducted by a base policy, in our case an offline value function approximation (VFA) by Ulmer et
al. [51].
Figure 2 sketches the RA-procedure [48]. For the detailed algorithm, the interested reader is
referred to the Appendix. Let state Sk , a set of i = 1, . . . , n decisions, and post-decision states Skxi
be given. For every post-decision state, the RA simulates m trajectories, i.e., the routing and request
developments based on a particular sampled problem realization ωj . Within the simulation runs,
routing and assignment decisions need to be taken at each decision point in an efficient
manner because runtime is highly limited. To this end, the RA draws on the VFA by Ulmer et al.
[51] as base policy. The VFA estimates the value of a post-decision state with respect to the current
point of time and the free time budget left. For routing, the VFA draws on CI. Every simulated
trajectory returns a realized reward. The approximation of the second term of Equation (2) is then
the average over all trajectories’ rewards. Based on these approximated values, the RA selects the
subset of customers to assign. Notably, within the simulations, no reoptimization of the routing is
possible due to computational intractability.
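The following Java sketch summarizes this value estimation, building on the state classes sketched in §2.2; the simulator, the VFA base policy behind it, and all names are illustrative placeholders and do not reproduce the implementation of Ulmer et al. [49].

```java
import java.util.List;
import java.util.Random;

// Illustrative rollout sketch: estimates the value of each candidate decision by simulation
// and selects the decision with the highest estimated total reward.
final class RolloutSketch {

    interface Simulator {
        // Simulates one trajectory from a post-decision state to the end of the horizon under
        // the base policy (offline VFA with CI routing) and returns the realized reward.
        double simulate(PostDecisionState pds, long seed);
    }

    static int select(List<Decision> decisions, List<PostDecisionState> pdStates,
                      Simulator sim, int m, Random rng) {
        int best = -1;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < decisions.size(); i++) {
            double sum = 0.0;
            for (int run = 0; run < m; run++)                    // m sampled trajectories per decision
                sum += sim.simulate(pdStates.get(i), rng.nextLong());
            double value = decisions.get(i).reward() + sum / m;  // immediate + estimated future reward
            if (value > bestValue) { bestValue = value; best = i; }
        }
        return best;
    }
}
```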

3.3 Combining RHR and RA

In this section, we describe how we combine the two heuristics. The idea is that we anticipate future
development and reoptimize the routing as well. To this end, we apply both RA and RHR in every

Figure 2: The Online-Offline Post-Decision Rollout

decision epoch and split the accessible calculation time. The combination is decomposed into two
steps in each decision epoch. The subset selection is conducted via RA. Subsequently, the planned
tour is reoptimized via RHR. To shift the focus between the two methods, we introduce a weighting
parameter γ ∈ [0, 1] indicating the percentage of runtime dedicated to the RA. Assuming an overall
allowed runtime of T seconds, the runtime for the RA is T × γ seconds and the time for the RHR is
T × (1 − γ) seconds. For example, given γ = 0.5, the policy first spends half the time to decide about the subset
selection via RA. Then, the policy uses the other half to reoptimize the resulting planned tour. We
denote these policies πγc with c for “combined”. Consequently, π0c reflects RHR and π1c reflects the RA.
To determine a suitable γ for a specific instance setting, we test policies πγc with γ = 1/6, 2/6, . . . , 5/6
for 100 tuning runs. For each instance setting, we sample 100 realizations and apply policies πγc to
these realizations. We then select the parameter γ leading to the highest average solution quality.
We denote the respective best policy as “Combination”.
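A minimal sketch of one combined decision epoch, assuming the rollout and MIP sketches above; the time accounting and the relation m = 60 × γ (cf. §3.4) are simplified for illustration.

```java
import java.util.List;
import java.util.Random;

// Illustrative combined policy: anticipatory subset selection via the RA, then reactive
// tour reoptimization via the MIP, splitting the per-epoch time budget T by gamma.
final class CombinedPolicySketch {

    static Decision decide(List<Decision> candidates, List<PostDecisionState> pdStates,
                           RolloutSketch.Simulator sim, double totalSeconds, double gamma) {
        // Step 1: spend gamma * T on rollout-based subset selection;
        // with roughly 60 trajectories per 30 seconds and core, m = 60 * gamma (cf. Section 3.4).
        int m = Math.max(1, (int) Math.round(60 * gamma));
        int choice = RolloutSketch.select(candidates, pdStates, sim, m, new Random());
        Decision selected = candidates.get(choice);

        // Step 2: spend the remaining (1 - gamma) * T seconds reoptimizing the selected tour,
        // e.g., by passing selected.updatedTour as a warm start to the open-TSP MIP of Section 3.1.
        double mipSeconds = totalSeconds * (1.0 - gamma);
        // selected.updatedTour = reoptimize(selected.updatedTour, mipSeconds);  // not shown here
        return selected;
    }
}
```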

3.4 Implementation Details

The algorithms are implemented in Java. We run the tests on Windows Server 2008 R2, 64 bit,
with Intel-Xeon E7-4830@2.13GHz, 64 cores, and 128GB RAM. We draw on CPLEX Concert
Version 12.5.1 based on 32 cores to solve the model presented in §3.1. We use the current tour θk as
initial solution. For our main study, we assume an overall given calculation time of T = 30 seconds
per decision epoch. As the computational study shows, this results in gaps of 0.2% up to 3.0% on
average per decision epoch. Generally, one core allows for around 60 sample runs in 30 seconds for

the RA. Hence, we set m = 60 × γ. As the foremost goal of the computational study is to provide
insight on the balance between routing reoptimization and anticipation, we do not pursue possible
enhancements such as parallelizing the RA or additional features in the CPLEX solver for the MIP
in the RHR.
An interesting aspect is the runtime of the experiments. Even though we only allow up to
30 seconds per decision point, we usually experience 40-60 decision points per run. This leads to
calculation times of up to 30 minutes for a single realization. To achieve statistical significance, a
sufficient number of realizations per instance setting is required. Thus, we run 100 tuning and 100
evaluation runs per instance setting. We run the tuning for 5 different γ and the evaluation for RA,
RHR, and the best combination. Thus, we run 500 + 300 = 800 runs per instance setting, each run
taking up to 30 minutes. Given the 12 different instance settings, the main experiments therefore
take up to 4800 hours of calculation, or 200 days. Thus, even though the real-time application of all
methods is feasible, the computational evaluation of the policies becomes very time consuming.
This computational challenge is often observed in research on dynamic vehicle routing.

4 Computational Evaluation
In this section, we analyze the approaches. We first define the test instances. Then, we compare the
approaches with respect to solution quality and analyze the impact of the instances’ attributes on
the approaches’ outcomes.

4.1 Instances

Customers request service over a time horizon of 360 minutes via a uniform Poisson process. The
expected number of customers is set to 60. Customer locations are distributed in a 10km × 10km
service area. This reflects a medium-sized city. The depot is located in the center of the service area.
The vehicle travels at a speed of 20 km per hour. The distance between two customers is measured
by the Euclidean norm multiplied by a factor of 1.5 to reflect a road network [4]. We generate
instances to represent different applications. To this end, we vary the DOD. We define instances
with a DOD of 0.25 for applications where the number of stochastic requests is small, a DOD of
0.50 for applications with moderate DOD, 0.75 for applications with high DOD, and a DOD of 1.00
with stochastic requests only.

Table 3: Average Number of Served LRC and Standard Error of the Policies

DOD Distribution RA RHR Myopic Combination γ


25 U 8.08 (± 0.44) 8.70 (± 0.45) 7.47 (± 0.43) 8.48 (± 0.43) 1/6
25 2C 10.58 (± 0.39) 12.87 (± 0.42) 12.89 (± 0.41) 12.74 (± 0.41) 1/6
25 3C 10.86 (± 0.39) 12.79 (± 0.40) 12.39 (± 0.40) 12.83 (± 0.40) 5/6
50 U 20.16 (± 0.52) 18.87 (± 0.41) 17.18 (± 0.48) 19.75 (± 0.40) 5/6
50 2C 22.21 (± 0.58) 24.68 (± 0.52) 24.23 (± 0.58) 24.57 (± 0.53) 4/6
50 3C 21.48 (± 0.47) 23.56 (± 0.49) 22.72 (± 0.48) 23.41 (± 0.48) 3/6
75 U 31.71 (± 0.64) 26.79 (± 0.47) 25.44 (± 0.51) 28.49 (± 0.51) 4/6
75 2C 36.33 (± 0.88) 36.51 (± 0.64) 35.63 (± 0.63) 36.80 (± 0.65) 3/6
75 3C 35.04 (± 0.70) 33.82 (± 0.63) 33.19 (± 0.56) 34.66 (± 0.62) 5/6
100 U 42.10 (± 0.73) 35.01 (± 0.55) 33.07 (± 0.52) 36.75 (± 0.56) 5/6
100 2C 48.16 (± 0.81) 44.47 (± 0.73) 43.90 (± 0.72) 45.37 (± 0.70) 5/6
100 3C 46.07 (± 0.80) 41.73 (± 0.68) 41.35 (± 0.66) 43.32 (± 0.70) 5/6

To analyze the importance of anticipation in subset selection
and reoptimization in routing decisions, we define three different spatial distributions of requests.
We consider uniformly distributed customers (U ), customers spread in three clusters (3C), and
customers spread in two clusters (2C). For U , a spatial realization (x, y) is defined as (x, y) ∼
U [0, 10] × U [0, 10]. For 2C, the customers are equally distributed to each cluster. The cluster
centers are located at µ1 = (2.5, 2.5), µ2 = (7.5, 7.5). The standard deviation within the clusters is
σ = 0.5. For 3C, the cluster centers are located at µ1 = (2.5, 2.5), µ2 = (2.5, 7.5), µ3 = (7.5, 5).
50% of the requests are assigned to cluster two, 25% to each other cluster. The standard deviations
are set to σ = 0.5.² As we show in the evaluation, the distributions provide different reoptimization
and anticipation potentials. All realizations start with an optimal TSP-tour through the early request
customers C0.

²Both the tuning as well as the test instances are made available on request for every instance setting.
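For concreteness, the following Java sketch illustrates how such a request stream can be sampled; the class and parameter names are our own and only mirror the instance logic described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sampler for the request stream: Poisson arrivals over 360 minutes, locations
// either uniform in the 10 km x 10 km area (U) or normal around cluster centers (2C, 3C).
final class InstanceSampler {

    static final double HORIZON = 360.0; // minutes
    static final double AREA = 10.0;     // km

    // Returns the late request customers as (time, x, y) triples.
    static List<double[]> sampleLrc(int expectedLrc, double[][] centers, double[] probs, Random rng) {
        List<double[]> requests = new ArrayList<>();
        double rate = expectedLrc / HORIZON;                    // expected arrivals per minute
        double t = -Math.log(1.0 - rng.nextDouble()) / rate;    // exponential inter-arrival times
        while (t < HORIZON) {
            double x, y;
            if (centers == null) {                              // distribution U
                x = AREA * rng.nextDouble();
                y = AREA * rng.nextDouble();
            } else {                                            // distributions 2C and 3C
                int c = 0;
                double u = rng.nextDouble();
                double acc = probs[0];
                while (u > acc && c + 1 < probs.length) { c++; acc += probs[c]; }
                x = centers[c][0] + 0.5 * rng.nextGaussian();   // sigma = 0.5
                y = centers[c][1] + 0.5 * rng.nextGaussian();
            }
            requests.add(new double[] { t, x, y });
            t += -Math.log(1.0 - rng.nextDouble()) / rate;
        }
        return requests;
    }
}
```

For the 2C setting with a DOD of 0.75, for example, one would call sampleLrc(45, new double[][]{{2.5, 2.5}, {7.5, 7.5}}, new double[]{0.5, 0.5}, new Random()).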

4.2 Solution Quality

To evaluate the overall performance of the approaches, we additionally apply a myopic policy
(“Myopic”), always accepting the largest feasible subset and updating the planned tour via CI. The
individual results are shown in Table 3. We additionally depict the standard errors as an established
measure of significance. The standard errors are calculated as the standard deviation divided by
the square root of the number of runs. We further depict the best γ-parameter for each instance setting. Because
of the statistical noise, the differences between the policies are not always significant. To give an
impression of the “best” and “worst” policies, we use the concept of policy dominance. A policy
is dominant compared to another policy if its results are significantly better. That means that the
mean value minus standard error of the dominant policy is larger than the mean value plus the
standard error of the dominated policy. We highlight dominance by different font types. For each
instance setting, we depict policies that are not dominated by any other policy in bold. These
policies are part of the set of best policies for the instance setting. We depict policies that are not
dominant compared to any other policy in italics. These policies are part of the set of worst policies
for each instance setting.
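For example, for DOD 100 and distribution U in Table 3, the RA accepts 42.10 ± 0.73 LRC and the RHR 35.01 ± 0.55; since 42.10 − 0.73 = 41.37 exceeds 35.01 + 0.55 = 35.56, the RA dominates the RHR in this setting.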
We observe that no policy is dominant for all instances but that the performances of RA, RHR,
Myopic, and Combination vary with respect to the instances. We observe a shift between RA and
RHR. For small DOD-values, RHR provides the best results. With increasing degree of dynamism,
RA becomes dominant. In most of the cases, the Combination is part of the set of non-dominated
policies. However, for every instance setting, either RA or RHR is part of the set as well. There is
no instance setting where the Combination is the only non-dominated policy.
To give a better impression of the results, we compare the approaches’ improvement over the
myopic policy. To this end, we calculate the average solution quality Q(π, i) for every approach π
and every individual instance setting i. The improvement is then defined as

$$\frac{Q(\pi, i) - Q(\text{Myopic}, i)}{Q(\text{Myopic}, i)}. \qquad (3)$$
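For example, for DOD 25 and distribution U, the RHR serves 8.70 LRC on average and the myopic policy 7.47, an improvement of (8.70 − 7.47)/7.47 ≈ 16.5%.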
All policies show significant improvement compared to the myopic policy. On average, the
RA achieves an improvement of 5.2%, the RHR achieves 4.4%, and the Combination a 6.2%
improvement. The Combination as a compromise between reoptimization and anticipation achieves
the best results on average. That confirms that neither reoptimization nor anticipation is dominant
for every instance setting and that the selection of a suitable optimization focus depends on the
instance setting. However, looking at the individual results in Table 3, the Combination achieves the
best results for only two of the 12 instance settings. This indicates that the individual instance settings
may have a strong tendency for either reoptimization or anticipation and that our straightforward
combination may not be effective for the problem at hand. We analyze the individual results in the

Figure 3: Analysis for Varying DOD and Distribution. (a) Improvement of the Policies for Varying DOD; (b) Improvement of the Policies for Varying Distributions.

following.
Figure 3 depicts the average improvement of the policies Combination, RA, RHR with respect
to the DOD and the customer distribution. Figure 3a shows the aggregated results for DODs of
0.25, 0.50, 0.75, and 1.00. Figure 3b shows the aggregated results for the three different customer
distributions: uniformly distributed in the service area U , distributed in three clusters 3C, and
distributed in two clusters 2C.
We first analyze the impact of the DOD shown in Figure 3a. For a low DOD of 0.25, the RHR
in isolation achieves the highest improvement with 6.51%. These instance settings comprise about

45 ERCs and only about 15 LRCs. The routes are relatively long even in the beginning of the
horizon and the potential for reoptimization is high. For a low DOD of 0.25, the RA performs
significantly worse than the myopic policy. With only a few LRC, an anticipation of future requests
is challenging and the RA gives away current rewards drawing on a potentially weak anticipation.
The combination achieves an improvement of 5.30% but is not able to reach the results of RHR.
Hence, for applications with low DOD, a sole focus on reoptimization may be beneficial.
For an increasing DOD, we observe that first Combination and then RA provide the best results.
These values also reflect the development of the significantly dominant policies in Table 3. The
importance of anticipation increases with the DOD for several reasons. First, for a high DOD, the
assignment decisions via subset selection become more important, because more requests need to
be assigned or rejected, respectively. Further, the stream of new requests and, therefore, anticipation
becomes more reliable. In cases with a DOD of 0.75 and 1.00, the impact of decisions on future
rewards becomes significant and the RA in isolation outperforms every other policy substantially.
Furthermore, the improvement of RHR decreases. For increasing DOD, the potential for routing
reoptimization decreases because the number of customers in the route per decision epoch is
small. This indicates that reoptimization does not provide significant benefit in highly uncertain
environments and anticipation becomes advantageous.
A similar behavior can be observed with respect to the spread in customer distributions shown
in Figure 3b. In case of uniformly distributed customers, customer requests occur across the entire
service area and the distance between requests is relatively large. All policies show significant
improvements. Because the large average distance between requests results in long tour durations,
a reoptimization is able to save a large amount of time leading to a substantial improvement of
RHR. However, because individual requests may be far away from current routes, the assignment
decisions are even more important. As a result, the RA outperforms all other policies significantly.
The importance of assignments decreases when the geographical spread is reduced as for 3C and
2C. For 3C, the customers generally occur in one of three distinct regions. For 2C, customers
occur only in two small regions of the service area. When the customers accumulate in clusters,
assignment decisions become less important because there are no expensive “outliers” anymore and
a greedy acceptance policy provides reasonable results. Hence, the improvement of RA compared to
the myopic policy decreases. However, because customers are generally close, the insertion of new

Table 4: Solution Quality and Standard Error of the Policies for Distribution 3C and Varying
Runtimes

DOD Runtime (in s) RA RHR Combination γ


0.25 5 10.63 (± 0.38) 12.65 (± 0.37) 12.55 (± 0.38) 4
0.25 10 10.66 (± 0.39) 12.63 (± 0.37) 12.70 (± 0.36) 3
0.25 15 10.65 (± 0.38) 12.73 (± 0.37) 12.83 (± 0.37) 5
0.25 30 10.86 (± 0.39) 12.79 (± 0.40) 12.83 (± 0.40) 5
0.25 60 10.67 (± 0.38) 12.70 (± 0.37) 12.76 (± 0.38) 3
0.5 5 21.51 (± 0.47) 23.43 (± 0.46) 22.90 (± 0.47) 4
0.5 10 21.46 (± 0.49) 23.49 (± 0.46) 23.62 (± 0.46) 4
0.5 15 21.69 (± 0.46) 23.55 (± 0.46) 23.38 (± 0.47) 4
0.5 30 21.48 (± 0.47) 23.56 (± 0.49) 23.41 (± 0.48) 3
0.5 60 21.61 (± 0.47) 23.51 (± 0.45) 23.60 (± 0.46) 3
0.75 5 34.55 (± 0.69) 33.90 (± 0.60) 33.58 (± 0.62) 4
0.75 10 34.46 (± 0.68) 33.78 (± 0.60) 34.06 (± 0.62) 5
0.75 15 34.24 (± 0.74) 33.72 (± 0.64) 34.23 (± 0.64) 3
0.75 30 35.04 (± 0.70) 33.82 (± 0.63) 34.66 (± 0.62) 5
0.75 60 34.95 (± 0.74) 33.91 (± 0.60) 34.63 (± 0.63) 3
1.00 5 44.84 (± 0.78) 41.94 (± 0.65) 44.14 (± 0.77) 5
1.00 10 45.58 (± 0.77) 41.86 (± 0.64) 42.57 (± 0.66) 5
1.00 15 45.94 (± 0.74) 41.86 (± 0.64) 42.71 (± 0.64) 5
1.00 30 46.07 (± 0.80) 41.73 (± 0.68) 43.32 (± 0.70) 5
1.00 60 46.25 (± 0.77) 41.94 (± 0.65) 42.99 (± 0.65) 3

requests is cheap and a few minutes freed by RHR may allow additional services. In essence, when
customer locations are generally close, reoptimization becomes beneficial compared to anticipation.

4.3 The Impact of Runtime

In our main computational study, we assume an accessible runtime of 30 seconds per decision
epoch. However, the amount of runtime depends on the business model. To analyze how the runtime
impacts our choice of solution method, we vary the runtime in this section. We select the instances
with distribution 3C because, for this distribution, neither the RA nor the RHR is dominant. Again,
we run 100 tuning and 100 test runs for varying runtimes of 5, 10, 15, 30, and 60 seconds per
decision epoch.
The individual results are depicted in Table 4. Again, we depict the dominant strategies in

Figure 4: Improvement of the Policies Compared to Myopic with Respect to Runtime

bold and the dominated policies in italics. Notably, the relative performance of RHR and
RA for a given DOD is independent of the runtime. For a DOD of 0.25 and 0.50, RHR outperforms RA
significantly. For a DOD of 0.75 and 1.00, RA outperforms RHR. Again, the Combination is never
dominant over both RHR and RA at the same time. It is noteworthy that the Combination is often
not able to outperform RA or RHR even with substantially larger runtimes available. For example,
for a DOD of 1.0 and 60 seconds runtime available, Combination accepts fewer customers than
RA with only 5 seconds runtime available. This indicates that the reoptimization component of the
Combination may mislead the anticipation of the rollout component, for example, by reordering the
sequence of cluster visits.
Following the procedure of Equation (3), we calculate the average improvement over the myopic
policy for the four different DODs for every runtime duration. In Figure 4, the x-axis shows the
runtime in seconds and the y-axis shows the improvement compared to the myopic policy. We
observe that the RA and the Combination show a significant increase with respect to increasing
runtime. With more runtime and more simulation runs for the RA, the simulation results
become more reliable. Conversely, when the runtime available is reduced, the number of simulation runs
decreases and decisions are made based on a poor approximation. In essence, runtime plays an
essential role for the RA. The results also highlight the importance of offline anticipation methods

Figure 5: Average MIP Gap Per Decision Epoch

such as value function approximation, especially for problems where runtime in decision epochs is
limited. For RHR, the improvement with respect to runtime is marginal. A runtime of 5 seconds
already provides reasonable results but an increase in runtime does not add much benefit.
One reason that the increase is marginal may be that the MIP’s optimization process of RHR
generally terminates within the first 5 seconds. This is not the case as we show by analyzing the
average MIP-gaps per decision point. The gaps indicate the percentage difference between the
currently found solution and a lower bound. The gaps are depicted in Figure 5 with respect to
runtime and DOD. First, we observe that the gaps decrease with increasing DOD. For a high DOD,
the number of customers in the tour to reoptimize is relatively small and the gaps are small as
well. For a low DOD, the number of customers in the tour is larger and the reoptimization requires
more time. Thus, the gaps are larger as well. We observe that the gaps decrease significantly with
increasing runtime. The MIP’s optimization process of RHR generally does not terminate early.
However, the MIP gap of the delivered solution seems to play a minor role in the RHR’s solution
quality. This indicates that spending a significant effort to solve a static subproblem to optimality in
a stochastic dynamic decision context may not provide any significant benefit. This observation is
in alignment with observations from the stochastic optimization literature [26].

5 Conclusion and Future Research
In this final section, we summarize our paper and present a comprehensive outlook on future
research on dynamic vehicle routing.

5.1 Conclusion

In this paper, we have analyzed two different methods of approaching the dynamic vehicle routing
problem with stochastic requests. One approach aims at efficient routing plans by means of
reactive reoptimization based on current information via a static mixed-integer program. The other
approach focuses on anticipation of future requests by means of a rollout algorithm. We also
presented a combination of the two methods splitting the calculation time between reoptimization
and anticipation. In a computational study, we have analyzed the performance of the methods with
respect to the instances’ degree of dynamism, the customer distribution, and the runtime accessible
for calculations. We have derived three major statements:

1. The focus of the approach should depend on the degree of dynamism, the percentage of
uncertain requests. For low DOD, approaches should focus on reoptimization. For moderate
DOD, both methods are beneficial. For high DOD, reoptimization may not provide any
significant benefit and anticipation becomes mandatory.

2. For the VRPSR, the distribution of requests is another important dimension to decide whether
to apply reoptimization or anticipation. The wider the geographical dispersion of customers,
the more important anticipatory assignment decisions become. For clustered customers, the
integration of new requests becomes less expensive and the anticipation has less impact, but
even small route improvements by reoptimization may allow service of additional customers.

3. Varying the calculation time accessible per decision epoch does not change the order of the
policies’ solution quality. However, reoptimization is less affected by calculation time changes.
For anticipation, less calculation time results in poorer solution quality. For reoptimization,
the solution quality does not change significantly, even though the MIP gap closes with more
calculation time available.

5.2 Future Research

The work compares two different method dimensions, reoptimization and anticipation, for a vehicle
routing problem with stochastic requests. It shows how the method should be selected based on the
level of uncertainty. It further shows that a straightforward combination of both method dimensions
is challenging. It finally indicates that experiments for dynamic vehicle routing methodology are
often computationally challenging. Based on these insights, future work should focus on extending
the work to different business models, improving the methodology, and deriving strategies for
computational evaluation. In the following, we will outline potential future research in these three
areas.

Business Models

In this paper, we considered uncertainty in customer requests, as observed, for example, in business
models such as courier services, dial-a-ride, or same-day delivery. However, there are business
models where the uncertainty may originate from other dimensions, for example, stochastic demand,
service times, or travel times. While for stochastic requests, the DOD is a well-known measure of
uncertainty, such measures are not yet established for other sources of uncertainty. An investigation
may be worthwhile to develop such a measure, for example based on the coefficient of variation.
This measure can then be used to study whether the results of this paper transfer to other sources of
uncertainty.
Another challenge in dynamic vehicle routing is scalability in both problem size and complexity.
Many problems may consider not only one vehicle but a large fleet. Other business models may
require the consideration of additional constraints such as delivery deadlines, time-windows (TWs),
or pickup and delivery. This poses challenges for both anticipation and reactive reoptimization. For
anticipation, rollout algorithms are required to simulate the realistic development of a fleet given a
limited runtime within the simulations. For reactive reoptimization, instead of a traveling salesman
problem, a comprehensive vehicle routing problem needs to be solved within the limited runtime.
Furthermore, for both methods, the existence of fleets adds significantly more complexity to the
assignment decisions.
Constraints pose additional challenges and may also impact the suitability of anticipation

and reoptimization approaches. For example, TWs add another dimension of uncertainty while
simultaneously restricting the flexibility of decision making. Uncertainty not only manifests in
where and when customers request but also in the TWs the customers demand. This additional
uncertainty complicates anticipation. Simultaneously, TWs restrict the set of potential decisions
and flexibility. Thus, route-reoptimization decisions may stay unaltered for a longer period of time.

Methodology

In this paper, we presented two methods, one for each optimization dimension. We further presented
a straightforward combination where we combine anticipatory assignment decisions with route
reoptimization. This combination did not add significant benefit. Future research may focus on
better ways of combining the two approaches. Instead of decomposition, a generic approach
integrating both simultaneously may lead to significant improvements. A suitable combination
may also be dependent on state parameters like the point of time. Hence, future research may focus on
smoothing the combination by state-dependent calculation time allocation.
Another aspect of suitable methodology is the runtime available for decision making. In the
problem at hand, we assumed half a minute in calculation time available per decision points. Other
problems may allow even less time for decision making, for example in dial-a-ride or e-commerce
applications. Future research may therefore develop runtime-efficient methodology. For example,
routing decisions could provide “robust” solutions that may be suitable for a sequence of decision
points with slight adaptions. Another potential possibility is to (continuously) reoptimize the routes
between customer requests. For anticipation, simulation may be guided in the “right” direction.
This can be enforced by reducing simulations for ineffective decisions, for example by means of
indifference zone selection (IZS, [22]) or optimal learning [34]. Another promising approach is to
design a selected set of scenarios capturing prototypical aspects of potential realizations [21]. It
may also be beneficial to shift parts of the calculations to an offline phase, for example, by means of
value function approximations.

Evaluation

In our computational evaluation, we analyzed 12 different instance settings. To capture the uncer-
tainty in our problem, we ran a number of training and test runs for each instance setting. These

experiments were very time-consuming and, even though we often observed a clear tendency in the
quality of our policies, statistical significance could not be reached for every case.
Future research may focus on runtime-efficient evaluation techniques that allow statistically
significant statements for a variety of instance settings. To compare policies and select tuning
parameters, one way might be to increase tuning or evaluation runs until significant statistical
differences can be observed. This significance can be measured either by standard errors as
confidence intervals around the mean values, by means of IZS or by analysis of variance. Further
research should also focus on how insights can be derived for variations in the problem dimensions
when not all combinations can be tested.
Another interesting research direction is the determination of meaningful benchmark policies.
Beside comparing different policies, generic benchmarks are not established in dynamic vehicle
routing yet. Some papers use myopic benchmark policies as a lower bound. However, upper bounds
are difficult to achieve. In deterministic optimization, upper bounds can be derived by means of
problem relaxations or dual solutions. That is generally not possible in dynamic vehicle routing.
Solving the dual programs is very challenging [7]. One way to achieve an upper bound is by solving
the deterministic perfect information problem. However, even solving the deterministic problem
is generally very challenging [52]. Further, the achieved upper bound is not necessarily very tight
[26].

Acknowledgment
This paper is the product of many discussions over the years with my colleagues Dirk C. Mattfeld,
Jan F. Ehmke, Justin C. Goodson, Stefan Voß, and Barrett W. Thomas but also with many interested
researchers at conferences or workshops, with editors and anonymous referees in several review
processes. All these constructive discussions revealed that there might be a benefit analyzing merits
and shortcomings of both reoptimization and anticipation in dynamic vehicle routing. I thank all of
my colleagues but especially the editors and anonymous reviewers for their helpful and constructive
input.

References
[1] A. M. Arslan, N. Agatz, L. Kroon, and R. Zuidwijk. Crowdsourced delivery: A dynamic pickup
and delivery problem with ad hoc drivers. Transportation Science, 2018.

[2] R. W. Bent and P. Van Hentenryck. Scenario-based planning for partially dynamic vehicle
routing with stochastic customers. Operations Research, 52(6):977–987, 2004.

[3] D. J. Bertsimas and G. Van Ryzin. A stochastic and dynamic vehicle routing problem in the
Euclidean plane. Operations Research, 39(4):601–615, 1991.

[4] F. P. Boscoe, K. A. Henry, and M. S. Zdeb. Nationwide comparison of driving distance versus
straight-line distance to hospitals. The Professional Geographer, 64(2):188–196, 2012.

[5] R. Branchini, A. Armentano, and A. Løkketangen. Adaptive granular local search heuristic
for a dynamic vehicle routing problem. Computers and Operations Research, 36:2955–2968,
2009.

[6] J. Brinkmann, M. W. Ulmer, and D. C. Mattfeld. Short-term strategies for stochastic inventory
routing in bike sharing systems. Transportation Research Procedia, 10:364–373, 2015.

[7] D. B. Brown, J. E. Smith, and P. Sun. Information relaxations and duality in stochastic dynamic
programs. Operations Research, 58(4-part-1):785–801, 2010.

[8] H.-K. Chen, C.-F. Hsueh, and M.-S. Chang. The real-time time-dependent vehicle routing
problem. Transportation Research Part E: Logistics and Transportation Review, 42(5):383–
408, 2006.

[9] J. F. Ehmke and A. M. Campbell. Customer acceptance mechanisms for home deliveries in
metropolitan areas. European Journal of Operational Research, 233(1):193–207, 2014.

[10] F. Ferrucci, S. Bock, and M. Gendreau. A pro-active real-time control approach for dynamic
vehicle routing problems dealing with the delivery of urgent goods. European Journal of
Operational Research, 225(1):130–141, 2013.

[11] M. Gendreau, F. Guertin, J.-Y. Potvin, and R. Séguin. Neighborhood search heuristics for a
dynamic vehicle dispatching problem with pick-ups and deliveries. Transportation Research
Part C: Emerging Technologies, 14(3):157–174, 2006.

[12] M. Gendreau, F. Guertin, J.-Y. Potvin, and E. Taillard. Parallel tabu search for real-time
vehicle routing and dispatching. Transportation Science, 33(4):381–390, 1999.

[13] M. Gendreau, O. Jabali, and W. Rei. 50th anniversary invited article: Future research directions
in stochastic vehicle routing. Transportation Science, 2016.

[14] G. Ghiani, E. Manni, A. Quaranta, and C. Triki. Anticipatory algorithms for same-day
courier dispatching. Transportation Research Part E: Logistics and Transportation Review,
45(1):96–106, 2009.

[15] G. Ghiani, E. Manni, and B. W. Thomas. A comparison of anticipatory algorithms for the
dynamic and stochastic traveling salesman problem. Transportation Science, 46(3):374–387,
2012.

[16] J. C. Goodson, B. W. Thomas, and J. W. Ohlmann. Restocking-based rollout policies for the
vehicle routing problem with stochastic demand and duration limits. Transportation Science,
50(2):591–607, 2015.

[17] C. H. Häll, J. T. Lundgren, and S. Voß. Evaluating the performance of a dial-a-ride service
using simulation. Public Transport, pages 1–19, 2012.

[18] L. M. Hvattum, A. Løkketangen, and G. Laporte. Solving a dynamic and stochastic vehicle
routing problem with a sample scenario hedging heuristic. Transportation Science, 40(4):421–
438, 2006.

[19] S. Ichoua, M. Gendreau, and J.-Y. Potvin. Diversion issues in real-time vehicle dispatching.
Transportation Science, 34(4):426–438, 2000.

[20] S. Ichoua, M. Gendreau, and J.-Y. Potvin. Exploiting knowledge about future demands for
real-time vehicle dispatching. Transportation Science, 40(2):211–225, 2006.

[21] M. Kaut and S. W. Wallace. Evaluation of scenario-generation methods for stochastic pro-
gramming. Pacific Journal of Optimization, 3:257–271, 2003.

[22] S.-H. Kim and B. L. Nelson. A fully sequential procedure for indifference-zone selection in
simulation. ACM Transactions on Modeling and Computer Simulation (TOMACS), 11(3):251–
273, 2001.

[23] M. A. Klapp, A. L. Erera, and A. Toriello. The one-dimensional dynamic dispatch waves
problem. Transportation Science, 52(2):402–415, 2018.

[24] A. Larsen, O. B. G. Madsen, and M. M. Solomon. Partially dynamic vehicle routing-models
and algorithms. Journal of the Operational Research Society, 53(6):637–646, 2002.

[25] C. Lin, K. L. Choy, G. T. Ho, H. Lam, G. K. Pang, and K. Chin. A decision support
system for optimizing dynamic courier routing operations. Expert Systems with Applications,
41(15):6917–6933, 2014.

[26] F. Maggioni and S. W. Wallace. Analyzing the quality of the expected value solution in
stochastic programming. Annals of Operations Research, 200(1):37–54, 2012.

[27] M. S. Maxwell, M. Restrepo, S. G. Henderson, and H. Topaloglu. Approximate dynamic
programming for ambulance redeployment. INFORMS Journal on Computing, 22(2):266–281,
2010.

[28] S. Meisel. Anticipatory Optimization for Dynamic Decision Making, volume 51 of Operations
Research/Computer Science Interfaces Series. Springer, 2011.

[29] C. E. Miller, A. W. Tucker, and R. A. Zemlin. Integer programming formulation of traveling
salesman problems. Journal of the ACM (JACM), 7(4):326–329, 1960.

[30] S. Mitrović-Minić and G. Laporte. Waiting strategies for the dynamic pickup and delivery
problem with time windows. Transportation Research Part B: Methodological, 38(7):635–655,
2004.

[31] G. Ninikas and I. Minis. Reoptimization strategies for a dynamic vehicle routing problem
with mixed backhauls. Networks, 64(3):214–231, 2014.

[32] V. Pillac, M. Gendreau, C. Guéret, and A. L. Medaglia. A review of dynamic vehicle routing
problems. European Journal of Operational Research, 225(1):1–11, 2013.

[33] W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality,
volume 842 of Wiley Series in Probability and Statistics. John Wiley & Sons, New York, 2011.

[34] W. B. Powell and I. O. Ryzhov. Optimal Learning, volume 841 of Wiley Series in Probability
and Statistics. John Wiley & Sons, New York, 2012.

[35] H. N. Psaraftis. A dynamic programming solution to the single vehicle many-to-many
immediate request dial-a-ride problem. Transportation Science, 14(2):130–154, 1980.

[36] H. N. Psaraftis, M. Wen, and C. A. Kontovas. Dynamic vehicle routing problems: Three
decades and counting. Networks, 67(1):3–31, 2016.

[37] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, New York, 2014.

[38] U. Ritzinger, J. Puchinger, and R. F. Hartl. Dynamic programming based metaheuristics for
the dial-a-ride problem. Annals of Operations Research, pages 1–18, 2014.

[39] U. Ritzinger, J. Puchinger, and R. F. Hartl. A survey on dynamic and stochastic vehicle routing
problems. International Journal of Production Research, pages 1–17, 2015.

[40] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis. Approximate algorithms for the traveling
salesperson problem. In IEEE Conference Record of the 15th Annual Symposium on Switching
and Automata Theory, pages 33–42. IEEE, 1974.

[41] M. Savelsbergh and T. Van Woensel. City logistics: Challenges and opportunities. Transporta-
tion Science, 50(2):579–590, 2016.

[42] M. Schilde, K. F. Doerner, and R. F. Hartl. Integrating stochastic time-dependent travel speed
in solution methods for the dynamic dial-a-ride problem. European Journal of Operational
Research, 238(1):18–30, 2014.

[43] M. Schyns. An ant colony system for responsive dynamic vehicle routing. European Journal
of Operational Research, 245(3):704–718, 2015.

[44] M. G. Speranza. Trends in transportation and logistics. European Journal of Operational
Research, 264(3):830–836, 2018.

[45] M. R. Swihart and J. D. Papastavrou. A stochastic and dynamic model for the single-vehicle
pick-up and delivery problem. European Journal of Operational Research, 114(3):447–464,
1999.

[46] L. Tassiulas. Adaptive routing on the plane. Operations Research, 44(5):823–832, 1996.

[47] B. W. Thomas. Waiting strategies for anticipating service requests from known customer
locations. Transportation Science, 41(3):319–331, 2007.

[48] M. W. Ulmer. Approximate Dynamic Programming for Dynamic Vehicle Routing. Operations
Research/Computer Science Interfaces Series. Springer, 2017.

[49] M. W. Ulmer, J. C. Goodson, D. C. Mattfeld, and M. Hennig. Offline-online approximate
dynamic programming for dynamic vehicle routing with stochastic requests. Transportation
Science, 2018.

[50] M. W. Ulmer, L. Heilig, and S. Voß. On the value and challenge of real-time information
in dynamic dispatching of service vehicles. Business & Information Systems Engineering,
59(3):161–171, 2017.

[51] M. W. Ulmer, D. C. Mattfeld, and F. Köster. Budgeting time for dynamic vehicle routing with
stochastic customer requests. Transportation Science, 52(1):20–37, 2018.

[52] M. W. Ulmer, N. Soeffker, and D. C. Mattfeld. Value function approximation for dynamic
multi-period vehicle routing. European Journal of Operational Research, 2018.

[53] S. A. Voccia, A. M. Campbell, and B. W. Thomas. The same-day delivery problem for online
purchases. Transportation Science, 2018.

[54] S. Zhang, J. W. Ohlmann, and B. W. Thomas. Dynamic orienteering on a network of queues.
Transportation Science, 52(3):691–706, 2018.

Appendix
In the Appendix, we describe the rollout algorithm in detail. Algorithm 1 shows the decision-making
procedure of a post-decision RA. The input consists of a state Sk, the decisions x1, . . . , xn, the
reward function R, the post-decision states (PDSs) Pk = (Sk^x1, . . . , Sk^xn), a set of sampled
realizations {ω1, . . . , ωm}, and the base policy πb. For every PDS Sk^x, the RA simulates the
remaining horizon for each realization ωi. At every decision epoch k + j > k, the decision
X^πb(Sk+j) induced by πb is applied. For our RA, the base policy consists of CI-routing and subset
selection by means of value function approximation. The observed rewards R(Sk+j, X^πb(Sk+j))
are accumulated for j = 1, . . . , K − k. The overall reward-to-go of a PDS, V̂(Sk^x), is the average
of the accumulated rewards over the m realizations. The RA then selects the decision x∗ leading to
the maximum sum of immediate reward R(Sk, x∗) and estimated future rewards V̂(Sk^x∗).

Algorithm 1: Post-Decision Rollout Algorithm
Input : State Sk, Decisions {x1, . . . , xn}, Reward Function R(S, x) → N0, Post-Decision
        States Pk = (Sk^x1, . . . , Sk^xn), Realizations {ω1, . . . , ωm}, Base Policy πb
Output : Decision x∗
 1 for all Sk^x ∈ Pk do
 2     i ← 1
 3     V̂(Sk^x) ← 0
 4     // Simulation
 5     while (i ≤ m) do
 6         S ← Sk
 7         S^x ← Sk^x
 8         while (S^x ≠ SK) do
 9             S ← (S^x, ωi)
10             x ← X^πb(S)
11             V̂(Sk^x) ← V̂(Sk^x) + (1/m) · R(S, x)
12             S^x ← (S, x)
13         end
14         i ← i + 1
15     end
16 end
17 // Selection
18 R∗ ← 0
19 for all Sk^x ∈ Pk do
20     if (R(Sk, x) + V̂(Sk^x) ≥ R∗) then x∗ ← x, R∗ ← R(Sk, x) + V̂(Sk^x)
21 end
22 return x∗
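
In essence, the estimated reward-to-go is V̂(Sk^x) = (1/m) Σ_{i=1}^{m} Σ_{j=1}^{K−k} R(Sk+j, X^πb(Sk+j)),
the rewards collected by the base policy along each sampled realization, averaged over the m
realizations. The following Python sketch mirrors Algorithm 1. It is a minimal illustration rather
than the implementation used in this paper: all problem-specific components (state transition,
reward, base policy, post-decision mapping, and termination test) are abstracted behind hypothetical
callables, and each realization ω is assumed to be a full sample path of future requests from which
the transition function extracts the information relevant to the current epoch.

from typing import Any, Callable, Sequence

def rollout_decision(
    state: Any,
    decisions: Sequence[Any],
    realizations: Sequence[Any],
    transition: Callable[[Any, Any], Any],     # (post-decision state, realization) -> next state
    reward: Callable[[Any, Any], float],       # (state, decision) -> immediate reward
    base_policy: Callable[[Any], Any],         # state -> decision induced by the base policy
    is_terminal: Callable[[Any], bool],        # post-decision state -> True once the horizon ends
    post_decision: Callable[[Any, Any], Any],  # (state, decision) -> post-decision state
) -> Any:
    """Return the decision maximizing immediate reward plus estimated reward-to-go."""
    m = len(realizations)
    best_decision, best_value = None, float("-inf")

    for decision in decisions:
        pds = post_decision(state, decision)   # candidate post-decision state Sk^x
        value_estimate = 0.0                   # running estimate of V̂(Sk^x)

        # Simulation: roll out every sampled realization with the base policy.
        for omega in realizations:
            current_pds = pds
            while not is_terminal(current_pds):
                next_state = transition(current_pds, omega)   # reveal the requests of omega
                next_decision = base_policy(next_state)       # decision induced by the base policy
                value_estimate += reward(next_state, next_decision) / m
                current_pds = post_decision(next_state, next_decision)

        # Selection: keep the decision with the best immediate plus estimated future reward.
        total = reward(state, decision) + value_estimate
        if total > best_value:
            best_decision, best_value = decision, total

    return best_decision

In our setting, base_policy would correspond to CI-routing combined with VFA-based subset
selection, and transition would reveal the requests of ωi arriving between the current and the next
decision epoch; these names are placeholders chosen for the sketch, not identifiers from the paper.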
