European Journal of Operational Research 258 (2017) 456-466
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/ejor

Discrete Optimization

Route relaxations on GPU for vehicle routing problems

Marco Antonio Boschetti, Vittorio Maniezzo, Francesco Strappaveccia
Department of Mathematics, University of Bologna, Via Sacchi 3, Cesena, Italy
University of Bologna, DISI, Via Sacchi, Cesena, Italy
University of Bologna, DIN, Viale Risorgimento 2, Bologna, Italy

ABSTRACT

State-Space Relaxation (SSR) is an approach often used to compute by dynamic programming (DP) effective bounds for many combinatorial optimization problems. Currently, the most effective exact approaches for solving many Vehicle Routing Problems (VRPs) are DP algorithms making use of SSR for computing their bounding components. In particular, most of these make use of the q-route and ng-route relaxations. The bounding procedures based on these relaxations provide good quality bounds but they are often time consuming to compute, even for moderate size instances. In this paper we investigate the use of GPU computing for solving q-route and ng-route relaxations. The results prove that the proposed GPU implementations are able to achieve remarkable computing time reductions, up to 40 times with respect to the sequential versions.

Keywords: Dynamic programming

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

The Vehicle Routing Problem (VRP) is among the most studied problems in combinatorial optimization, and retains unabated interest both because, though simple to state, it enjoys intriguing mathematical properties and because it can be quickly specified into problems of primary economic interest. The literature on the core problem variants and on the possible real-world variations grew huge after the seminal paper which introduced it (Dantzig & Ramser, 1959), and includes dedicated books (Golden, Raghavan, & Wasil, 2008; Toth & Vigo, 2001) and,
more recently, also dedicated working groups of research associations (Verolog Web Site, 2014).

The core problem can be quickly stated as finding a least cost set of routes to service a number of customers from a central depot, given a cost matrix specifying the travel cost between all customer/depot locations. The problem can be further complicated by adding real world constraints. A commonly included constraint assumes that the routes are to be travelled by vehicles in order to deliver or to collect goods from customers, thus the total amount of goods loaded on each truck cannot exceed its capacity, in weight or in volume. This gives rise to the Capacitated VRP variant (CVRP). Alternatively, each customer can ask either for a delivery or for a collection of goods (or both), yielding the Pickup and Delivery variants (VRPPD), including the case where there are coupled pairs of pickups and deliveries. In case all deliveries of each route are to be made first, then all collections, we have the CVRP with backhauls (VRPB). A further quite common constraint considers feasible time windows for the visits at the customers (VRPTW). Moreover, in small area settings each truck could go back to the depot to reload (multi-trip VRP, MTVRP), while in bigger areas the use of more depots by the vehicles of the fleet is common (multi-depot VRP, MDVRP). The vehicles of the fleet can be equal or different (heterogeneous fleet CVRP, HVRP), they could be requested to return to the depot they started from, or not (open CVRP, OVRP), they could be requested to repeat the same routes with a given periodicity over the planning horizon (periodic VRP, PVRP), etc. Furthermore, all listed constraints, and many more coming from operational practice, can be freely combined to model actual use cases.
For example, a recent work on city logistics operational optimization (Boschetti & Maniezzo, 2015) models its problem as a CVRP with time windows, multi-trip, heterogeneous fleet and pickup and delivery.

Given its theoretical and practical relevance, the VRP has witnessed a wealth of diverse approaches for its solution and still fosters a lively research community studying either exact or heuristic methods or a combination of them, as recently done by matheuristic approaches (Boschetti, Maniezzo, Roffilli, & Bolufé Röhler, 2009). A detailed survey is clearly out of scope for this paper; in the following we will recall just a few contributions.

Exact approaches are of more direct interest for this paper. Again, different approaches have been used, ranging from dynamic programming (0 & Sun, 2010) to branch and bound (Christofides, Mingozzi, & Toth, 1981a), from branch and cut (Lysgaard, Letchford, & Eglese, 2004) to column generation (Desrochers, Desrosiers, & Solomon, 1992). In all cases, a central feature is the ability to compute tight lower bounds. Again, different approaches for computing bounds are proposed in the literature and, recently, bounding procedures based on non-elementary paths, such as q-routes (Christofides et al., 1981a) and most notably ng-routes (Baldacci, Mingozzi, & Roberti, 2011; Martinelli, Pecin, & Poggi, 2014), appear to be particularly effective for the CVRP.

GPU (Graphics Processing Unit) computing is achieving increasing interest among the optimization community, given the possibility to significantly speed up tasks at the core of many approaches of interest, thus to ultimately achieve substantial efficiency improvements (Brodtkorb, Hagen, Schulz, & Hasle, 2012). The number of contributions in the field is in fact increasing
at a fast rate, with actual applications especially in the area of bioinformatics.

One optimization approach particularly prone to be implemented using GPU computing is dynamic programming, because both its basic data structures (essentially an n-dimensional array) and its staged computations can take advantage of the GPU architecture. However, dynamic programming is not the only approach which can benefit from GPU computing; in fact Schulz, Hasle, Brodtkorb, and Hagen (2012), in their extensive survey, report results obtained also by different metaheuristics (ACO, PSO, genetic algorithms, differential evolution) and devote a section to local search and related hybrids. Notwithstanding, the majority of the applications so far are based on dynamic programming and are usually in the area of bioinformatics. An interesting survey of GPU-accelerated bioinformatic applications can be found in Stone, Hardy, Ufimtsev, and Schulten (2010), where several widely different areas of application are overviewed, usually reporting implementations which achieve speedups of 20x or more, with some cases where speedups of two orders of magnitude are possible.

Applications to combinatorial optimization problems have already been reported for a number of problems, mainly for the core version of basic problems. Thus, Boyer, El Baz and Moussa (2012) proved that on a consumer GPU it is possible to obtain a speed-up of 20 on the Knapsack Problem, and Harish, Vineet, and Narayanan (2009) and Buluç, Gilbert, and Budak (2010) provided an extensive spectrum of graph problems (Depth First Search, etc.)
which are substantially accelerated by using a GPU. Ortega-Arranz, Torres, Llanos, and Gonzalez-Escribano (2013) and Kumar, Misra, and Tomar (2011) used GPU acceleration on algorithms like Bellman-Ford and Dijkstra for solving the Single Source Shortest Path Problem (SSSPP). The GPU has given remarkable results also on the All Pairs Shortest Path Problem solved with the Floyd-Warshall algorithm (Katz & Kider, 2008; Lund & Smith, 2010). Actual industrial applications of manycore accelerated codes are still scarce; one can be found for the Two-Dimensional Guillotine Cutting Problem (Boschetti, Maniezzo, & Strappaveccia, 2016).

In this paper we propose the first work where GPU computing is used for implementing vehicle routing optimization components. Specifically, we propose a GPU implementation of the dynamic programming procedures for computing q-route and ng-route bounds. The implementation on GPU of an optimization algorithm is a complex task that involves the study of tailored data structures and corresponding routines. The paper reports in detail the choices we made to achieve the most efficient parallel implementation of the q-route and ng-route bound computation routines and substantiates this with computational results on standard problem benchmarks from the literature.

The paper is structured as follows. In Section 2 we describe the CVRP and we report the set partitioning model used in the literature for it. In Section 3 bounding procedures based on q-route and ng-route relaxations are described, whereas their GPU implementation is described in Section 4. Computational results are reported and discussed in Section 5 and conclusions are drawn in Section 6.

2.
Problem description and mathematical formulations

The CVRP consists in finding the least-cost set of routes to be serviced by a homogeneous fleet of $m$ vehicles, each with capacity $Q$, in order to service each of $n$ customers, whose index set is $V'$. All routes start and return to a common depot, conventionally indexed by 0. Let $V = V' \cup \{0\}$. Input data consist of the requests $q_i$ of each customer $i = 1, \ldots, n$ and of the traveling costs $c_{ij}$, $i = 0, \ldots, n$, $j = 0, \ldots, n$, between each pair of customers and between each customer and the depot. The problem can thus be defined on a complete weighted graph $G = (V, A, C)$, where $A = \{(i, j) : i, j \in V\}$ and $C = [c_{ij} : i, j \in V]$ is the corresponding, possibly asymmetric, cost matrix. In real-world applications $G$ is typically an overlay graph superimposed on an actual road network, and vertices in $V$ correspond to geocoded facilities, while arcs in $A$ correspond to least-cost paths, computed according to the metric to minimize (e.g., distance, time, etc.).

The problem can be formulated in different ways; we refer the reader to Toth and Vigo (2001) for a thorough overview. In this paper we just consider the set partitioning model, which is the most common one making use of q-route and ng-route relaxations.

The set partitioning formulation, originally proposed by Balinski and Quandt (1964), associates a decision variable $x_\ell$ to each feasible vehicle route, that is, to each route that can be travelled by a vehicle leaving the depot, servicing a subset of customers that collectively do not exceed the vehicle capacity, and finally returning to the depot. Let $\mathcal{R}$ be the index set of all feasible routes, let $c_\ell$ be the cost of route $\ell \in \mathcal{R}$ and let $a_{i\ell}$ be a binary coefficient, which is equal to 1 if and only if customer $i \in V'$ belongs to route $\ell \in \mathcal{R}$. The formulation is as follows:

\[ z = \min \sum_{\ell \in \mathcal{R}} c_\ell x_\ell \qquad (1) \]
\[ \text{s.t.} \quad \sum_{\ell \in \mathcal{R}} a_{i\ell} x_\ell = 1, \quad i \in V' \qquad (2) \]
\[ \sum_{\ell \in \mathcal{R}} x_\ell = m \qquad (3) \]
\[ x_\ell \in \{0, 1\}, \quad \ell \in \mathcal{R} \qquad (4) \]

Constraints (2) ensure that each customer is serviced by exactly one feasible route and constraint (3) imposes the fleet cardinality.
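As a hedged illustration of the set partitioning model, the following Python sketch enumerates every capacity-feasible route of a tiny instance and picks, by brute force, the cheapest selection of exactly m routes that covers each customer once. The function names and the toy instance are assumptions made for this example only; the approach is exponential and is not the method used in the paper.

```python
from itertools import combinations, permutations


def route_cost(order, d):
    """Cost of the route depot -> customers in `order` -> depot."""
    stops = (0, *order, 0)
    return sum(d[a][b] for a, b in zip(stops, stops[1:]))


def feasible_routes(n, Q, q, d):
    """Map each capacity-feasible customer subset to its cheapest route cost."""
    routes = {}
    for size in range(1, n + 1):
        for subset in combinations(range(1, n + 1), size):
            if sum(q[i] for i in subset) <= Q:
                routes[frozenset(subset)] = min(
                    route_cost(order, d) for order in permutations(subset)
                )
    return routes


def solve_set_partitioning(n, m, Q, q, d):
    """Brute-force the model: exactly m routes, each customer covered once."""
    routes = feasible_routes(n, Q, q, d)
    customers = frozenset(range(1, n + 1))
    best_cost, best_routes = float("inf"), None
    for chosen in combinations(routes, m):
        covered = frozenset().union(*chosen)
        # Covering all customers with sizes summing to n means a partition.
        if covered == customers and sum(len(r) for r in chosen) == n:
            cost = sum(routes[r] for r in chosen)
            if cost < best_cost:
                best_cost, best_routes = cost, chosen
    return best_cost, best_routes


# Toy instance (hypothetical): depot 0 and customers 1..3 on a line.
d = [[abs(i - j) for j in range(4)] for i in range(4)]
q = [0, 1, 1, 1]
cost, chosen = solve_set_partitioning(n=3, m=2, Q=2, q=q, d=d)
print(cost, [sorted(r) for r in chosen])
```

With unit demands and capacity 2, no single route can serve all three customers, so the model has to split them over the two vehicles.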
It is noteworthy that, in case the cost matrix satisfies the triangle inequality, equalities (2) could be turned into greater than or equal to inequalities, thus turning the problem into an extended set covering problem, which is computationally easier to deal with.

3. Relaxations based on dynamic programming

Among the exact methods for solving variants of VRP (e.g., CVRP) the most effective ones are often based on a set partitioning model and a column generation approach. These methods start with a limited set of columns and they iteratively add new columns, which have the potential to enter the optimal solution. The new columns are generated by solving the so-called pricing problem.

The pricing problem corresponds to the Elementary Shortest Path Problem with Resource Constraints (ESPPRC), also known as Resource Constrained Elementary Shortest Path Problem (RCESPP), which is strongly NP-hard as shown by Dror (1994). This problem is defined on a set $R = (R^1, \ldots, R^m)$ of available resources and on a graph $(V, A, C)$, where $V$ is the set of vertices containing $n$ customers and two vertices $s$ and $t$ representing the source and the sink of the path, which represent the depot of the corresponding VRP, $A$ is the set of arcs, and $C$ is the cost matrix. Each arc $(i, j) \in A$ has an associated cost $c_{ij}$, possibly negative in case of pricing problems because it corresponds to the original arc cost penalized by dual variables, and a resource vector $r_{ij} = (r_{ij}^1, \ldots, r_{ij}^m)$, which specifies the amount of each resource that is needed to make use of that arc. For any path $P = (s = i_0, i_1, \ldots, i_p = t)$, the cost is given by the sum of the costs of the arcs belonging to $P$, i.e., $c(P) = \sum_{h=0}^{p-1} c_{i_h i_{h+1}}$. The path $P$ is feasible if the resource consumption of the path does not exceed the available resources, i.e., $\sum_{h=0}^{p-1} r^k_{i_h i_{h+1}} \le R^k$ for every resource $k = 1, \ldots, m$.
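The cost and feasibility definitions above can be checked mechanically. The Python sketch below is an illustration under assumed data structures (a nested list for the cost matrix and a dictionary of per-arc resource vectors, both hypothetical choices): it evaluates the cost of a path, tests it against the available resources, and tests elementarity.

```python
def path_cost(path, c):
    """Sum of the arc costs along the path (costs may be negative)."""
    return sum(c[a][b] for a, b in zip(path, path[1:]))


def is_resource_feasible(path, r, R):
    """True if, for every resource k, consumption along the path is <= R[k]."""
    used = [0] * len(R)
    for arc in zip(path, path[1:]):
        for k, amount in enumerate(r[arc]):
            used[k] += amount
    return all(u <= available for u, available in zip(used, R))


def is_elementary(path):
    """True if no vertex is visited more than once."""
    return len(set(path)) == len(path)


# Toy data (hypothetical): source 0, sink 2, two resources.
c = [[0, 2, 5],
     [2, 0, -1],
     [5, -1, 0]]
r = {(0, 1): (2, 1), (1, 2): (3, 1), (0, 2): (1, 1)}

path = (0, 1, 2)
print(path_cost(path, c))                      # 2 + (-1)
print(is_resource_feasible(path, r, (5, 2)))   # consumption (5, 2) fits
print(is_resource_feasible(path, r, (4, 2)))   # first resource exceeded
```

A path that is cheap but breaks a resource limit, such as the last call above, is exactly what the resource constraints exclude from the pricing problem.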
The problem consists in finding the least cost feasible elementary path from vertex $s$ to vertex $t$.

The ESPPRC can be solved with Dynamic Programming (DP) as, for example, proposed by Feillet, Dejax, Gendreau, and Gueguen (2004) and Righini and Salani (2008). The state-space size of the exact DPs proposed in the literature so far increases too quickly when the size of instances increases. Even for moderate size instances the state-space size could be too large for solving the problem in realistic time on a powerful workstation. Therefore, Christofides, Mingozzi, and Toth (1981) proposed the State-Space Relaxation (SSR), where the state-space of the DP is relaxed to compute lower (resp. upper) bounds to the original minimization (resp. maximization) problem. The main idea behind SSR is well summarized by Righini and Salani (2008), who observe that with SSR the search space explored by DP is projected onto a lower dimensional space so that only the minimum cost state is retained among all the corresponding states in the higher dimensional space. In the literature some interesting state-space relaxations for the ESPPRC are proposed by Christofides et al. (1981a), Righini and Salani (2006, 2008, 2008), and Baldacci et al. (2011).

These algorithms, as all dynamic programming algorithms, trade space for time, enumerating all the interesting solutions for the relaxed problem. In these cases, the elementarity of the path is relaxed or partially relaxed to find interesting paths, where cycling is not avoided but it is reduced by limiting the amount of available resources (e.g., weight, time) or by applying specific rules (e.g., avoid cycles of cardinality two or involving a given subset of vertices). These almost elementary paths can be the base for the construction of feasible solutions or they can be a good starting point for the computation of tight bounds.
In the next sections we describe three of these methods (q-route, through-q-route and ng-route) and we analyze the characteristics of each of them.

3.1. q-Route relaxation

Let $f(q, i)$ be the cost of the least cost path $P = (0, i_1, \ldots, i_k = i)$ (not necessarily elementary) from the depot to customer $i$ with total load $q = \sum_{h=1}^{k} q_{i_h}$. Such a path is called q-path. A q-path with the additional arc $(i, 0)$ is called q-route and has cost $f(q, i) + d_{i0}$.

Christofides et al. (1981a) extend the q-path definition to impose that each path should not contain loops of two arcs. Let $\pi(q, i)$ be the vertex just prior to $i$ on the path corresponding to $f(q, i)$ and let $g(q, i)$ be the cost of the least cost path ending at vertex $i$ with the constraint that the vertex preceding $i$ is not equal to $\pi(q, i)$. The cost of the least cost path from the depot to $i$ with load $q$ and without loops of two arcs is computed from the following function:

\[ h(q, j, i) = \begin{cases} f(q - q_i, j) + d_{ji}, & \text{if } \pi(q - q_i, j) \ne i \\ g(q - q_i, j) + d_{ji}, & \text{otherwise} \end{cases} \qquad (5) \]

Given the function $h(q, j, i)$, the functions $f(q, i)$ and $g(q, i)$ can be computed as follows:

\[ f(q, i) = \min_{j \ne i} \{ h(q, j, i) \}, \qquad \pi(q, i) = \arg\min_{j \ne i} \{ h(q, j, i) \} \qquad (6) \]

and

\[ g(q, i) = \min_{j \ne i,\ j \ne \pi(q, i)} \{ h(q, j, i) \} \qquad (7) \]

The recursion is initialized by setting $f(q, i) = d_{0i}$ and $\pi(q, i) = 0$ for $q = q_i$, $f(q, i) = \infty$ and $\pi(q, i) = -1$ for every $q \ne q_i$, and $g(q, i) = \infty$ for every $q$. The pseudocode of the algorithm is as follows:

ALGORITHM Q-PATHS(n, Q, d, q)
// Initialization
for q = 0, ..., Q do
    for i = 1, ..., n do
        g(q, i) = ∞
        if q = q_i
            then f(q, i) = d(0, i), π(q, i) = 0
            else f(q, i) = ∞, π(q, i) = -1
// Recursion
for q = 0, ..., Q do
    for i = 1, ..., n do
        for j = 1, ..., n do
            if π(q - q_i, j) ≠ i
                then h(q, j, i) = f(q - q_i, j) + d(j, i)
                else h(q, j, i) = g(q - q_i, j) + d(j, i)
        f(q, i) = min_{j ≠ i} {h(q, j, i)}
        π(q, i) = argmin_{j ≠ i} {h(q, j, i)}
        g(q, i) = min_{j ≠ i, j ≠ π(q, i)} {h(q, j, i)}
return f, π

Using this dynamic programming procedure we can find forward shortest paths from the depot to each vertex $i$ using a vehicle of capacity $Q$ avoiding loops involving two arcs.

Considering the example in Fig. 1, we set $\pi(q, 8) = 6$ for the state $(q, i)$ that is generated for vertex $i = 8$ from a state of vertex $j = 6$. Hence, it will not be possible to generate a state $(q, 6)$ from state $(q, 8)$, avoiding loops involving two arcs.
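The recursion can be sketched in a few lines of Python. This is a minimal serial illustration, not the authors' implementation: it assumes strictly positive integer demands, stores `f`, `g` and `pi` as (Q+1) x (n+1) tables, and keeps the best and second-best value of h in a single pass over j, which is equivalent to taking the two minima separately. The toy instance is hypothetical.

```python
import math


def q_paths(n, Q, d, q):
    """Forward q-path values f, second-best values g and predecessors pi,
    avoiding loops of two arcs. Assumes q[i] >= 1 for every customer i."""
    INF = math.inf
    f = [[INF] * (n + 1) for _ in range(Q + 1)]
    g = [[INF] * (n + 1) for _ in range(Q + 1)]
    pi = [[-1] * (n + 1) for _ in range(Q + 1)]

    # Initialization: the path depot -> i carries exactly load q[i].
    for i in range(1, n + 1):
        if q[i] <= Q:
            f[q[i]][i] = d[0][i]
            pi[q[i]][i] = 0

    for load in range(1, Q + 1):
        for i in range(1, n + 1):
            prev = load - q[i]
            if prev < 1:
                continue  # no customer state to extend; keep initialization
            best, best_j, second = INF, -1, INF
            for j in range(1, n + 1):
                if j == i:
                    continue
                # Use g when extending f would create the loop i -> j -> i.
                base = f[prev][j] if pi[prev][j] != i else g[prev][j]
                h = base + d[j][i]
                if h < best:
                    second = best
                    best, best_j = h, j
                elif h < second:
                    second = h
            f[load][i], pi[load][i], g[load][i] = best, best_j, second
    return f, pi, g


# Toy instance (hypothetical): depot 0 and customers 1..3 on a line.
d = [[abs(a - b) for b in range(4)] for a in range(4)]
q = [0, 1, 1, 1]
f, pi, g = q_paths(n=3, Q=3, d=d, q=q)
print(f[3][1], pi[3][1])   # load 3 ending at customer 1
print(f[3][3], pi[3][3])   # load 3 ending at customer 3
```

In this instance the cheapest load-3 walk ending at customer 1 would be 0-1-2-1 with cost 3 if loops of two arcs were allowed; the recursion discards it and returns the cost of 0-3-2-1 instead.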
However, we cannot yet guarantee the path elementarity.

In some situations, as in Section 3.2, we could need to compute also the backward q-paths, which are the shortest paths from every vertex $i$ to the depot using a vehicle of capacity $Q$. When the graph is symmetric (i.e., $d_{ij} = d_{ji}$ for every $i, j \in V$), the values $f^{-1}$ of backward q-paths are equivalent to the values of the forward q-paths (i.e., $f(q, i) = f^{-1}(q, i)$). On the other hand, in the case of asymmetric graphs, where $d_{ij} \ne d_{ji}$ for some $i, j \in V$, the backward q-paths can be computed by the same recursion as in the forward case, but applied to the transpose $D^T$ of the cost matrix $D = (d_{ij})$.

The q-route relaxation can be computed in pseudo-polynomial time with a complexity of $O(n^2 Q)$. In Fig. 2 we present graphically the computation of a single $f(q, i)$ value. Notice that for evaluating each state value $f(q, i)$, given $q - q_i$, we have to iterate over all the $n$ vertices, which are saved in contiguous locations in the computer memory, allowing a better computational performance.

Fig. 2. f(q, i) computation.

3.2. Through-q-route relaxation

The through-q-route relaxation (see Christofides et al., 1981a) is a better relaxation obtained by evaluating the cost $\psi(q, i)$ of the least cost route, without loops of two arcs, starting from the depot, passing through $i$ and going back to the depot, with a total load $q$. The $\psi(q, i)$ values are computed by joining two q-paths by using values $f(q, i)$ and $g(q, i)$ as follows:

\[ \psi(q, i) = \min_{q_i \le q' \le q} \begin{cases} f(q', i) + f(q + q_i - q', i), & \text{if } \pi(q', i) \ne \pi(q + q_i - q', i) \\ \min \{ f(q', i) + g(q + q_i - q', i),\ g(q', i) + f(q + q_i - q', i) \}, & \text{otherwise} \end{cases} \qquad (8) \]

The algorithm to compute through- ...

... h(thdidx + Nthds)
4 then // Swap values
5 f_swap = h(thdidx)
6 h(thdidx) = h(thdidx + Nthds)
7 h(thdidx + Nthds) = f_swap
8 // Swap predecessors
9 π_swap = π(thdidx)
10 π(thdidx) = π(thdidx + Nthds)
11 π(thdidx + Nthds) = π_swap
12 syncthreads()
13 // Reduction
14 for s = 2^k, k = Nthds/4, ..., 0 do
15 if thdidx
+ 94-40 6 then h(thdide) = /@.0+6a~4,~ 3.0 0 it hGhdide) = G10.) + JG FG —4D 8 then hihdide) = 814.04 F+0=4.0) 19. syncthreads) 20 j/ Compute the minimum by a parallel reduction 21 for s=2+k k= Nthds/4,...,0 do 2 ifthdidx <= 23 then if h(chdidx + 5) < h(chaid) 24 then h(thdide) = h(thdidx +s) 25. syncthreads() 26 || Data Update 27 ifthdiax 0 28 then (9.1) =h(0) Procedure GPU-TsRoucH-Q-RourEs initializes at lines 1-8 the local variables for the assigned block g. At lines 10-19 the NThds available threads compute the Ag partial values, according (0 £4, (6), and store them in the shares memory for computing the mini- ‘umm at ines 21-28 by means of the algorithm described in Haris (2009). 43, GPU ng-route ‘The ng-route relaxation is computationally mote expensive than the qeroute and the through-q-oute relaxations and the main problem to face for its implementation isthe efficient management ff the ng-sets generated by expression (9) ‘The ng-sets associated to each stage (Qi) ate dynamically gen- ‘erated, and have a variable size. Dynamic data structures on a GPU feavironment are not desirable, because their management could be a performance killer In this section we describe our strategies for addressing this problem and we present the resulting parallel algorithm for the ng-route relaxation on GPU, Notice from the expression (3) that all the NG sets generated for a given stage (qi) are completely contained inset Nand aso contain the veftex | Therefore, the complete enumeration for all the possible NG sets at stage (g, iis given by all the subsets of N, containing the vertex i which are at most 280-1 (remember that, A(N)) is the maximum size of N}). 
For reasonable $\Delta(N)$ values, we may easily enumerate a priori all the possible sets NG in a static data structure. The figure shows how all the states associated to sets NG generated by recursion (9) may be stored in a three-dimensional data structure, where the sizes correspond to the possible loads $q$, the vertex indexes $i$, and the indexes of sets NG. For doing that we need to define the mapping between indexes and sets NG and to reformulate recursion (10) in its forward form, for each state $(q + q_k, k, NG)$:

\[ f(q + q_k, k, NG) = \min_{(q, i, NG') \in \Psi^{-1}(q + q_k, k, NG)} \{ f(q, i, NG') + d_{ik} \} \qquad (12) \]

where

\[ \Psi^{-1}(q, k, NG) = \{ (q', i, NG') : (q, k, NG) \in \Psi(q', i, NG') \} \qquad (13) \]

Using expression (12) we can compute independently all the values $f(q + q_k, k, NG)$ from every state $(q, i, NG')$ in the set of stages having load $q$ (see Fig. 4).

4.3.1. Transition mapping

To easily retrieve the index of each set NG, we decided to precompute each transition from a set NG associated to vertex $j$ to a set NG' associated to vertex $i$. Given the starting vertex $j$ and the index of the set NG associated to it, the transition map is able to provide the index of the resulting set NG' associated with the ending vertex $i$.

Let $S_i$ be the sequence obtained from $N_i$ as follows: $S_i = (i_0, i_1, \ldots, i_\ell)$, where $\ell = |N_i| - 1$, $i_k \in N_i$ for every $k = 0, \ldots, \ell$ and $i_k < i_{k+1}$ for every $k = 0, \ldots, \ell - 1$. For each vertex $i$ all the possible sets NG are subsets of $N_i$. Therefore, given a vertex $i$ and a set $NG \subseteq N_i$, we may define a bit mask with respect to the sequence $S_i$ as follows: $Mask_i(NG) = (b_0, b_1, \ldots, b_\ell)$, where $b_k = 1$ if $i_k \in NG$ and $b_k = 0$ otherwise.

Based on this bit mask, we may define a unique index for each $NG \subseteq N_i$:

\[ Index_i(NG) = \sum_{k=0}^{\ell} b_k \cdot 2^k \qquad (14) \]

Example 1. Given a vertex $i = 7$ and the set $NG = \{7, 2, 4, 6\} \subseteq N_7$, where $N_7 = \{2, 4, 6, 7, 8, 10\}$ and $\Delta(N) = 6$, we have $Mask_7(NG) = (1, 1, 1, 1, 0, 0)$ and $Index_7(NG) = 1 \cdot 2^0 + 1 \cdot 2^1 + 1 \cdot 2^2 + 1 \cdot 2^3 + 0 \cdot 2^4 + 0 \cdot 2^5 = 1 + 2 + 4 + 8 + 0 + 0 = 15$.

The index $Index_i(NG)$ gives the possibility to retrieve the set NG on the fly. Moreover, we may define in advance which will be the index for the new set NG' created by expanding another state.
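The bit-mask indexing can be sketched in Python as follows. The helper names `ng_index` and `ng_set` are hypothetical, and the sketch assumes that the sequence built from the neighbourhood is simply its vertices in increasing order; it encodes a set NG as an integer and decodes it back, reproducing the numbers of Example 1.

```python
def ng_index(neighbourhood, ng):
    """Integer index of the set `ng` with respect to the sorted neighbourhood.

    Bit k of the index is 1 exactly when the k-th smallest vertex of the
    neighbourhood belongs to `ng` (assumed ordering, see the lead-in).
    """
    sequence = sorted(neighbourhood)
    return sum(1 << k for k, vertex in enumerate(sequence) if vertex in ng)


def ng_set(neighbourhood, index):
    """Inverse of ng_index: rebuild the set from its integer index."""
    sequence = sorted(neighbourhood)
    return {vertex for k, vertex in enumerate(sequence) if (index >> k) & 1}


# Data of Example 1.
N7 = {2, 4, 6, 7, 8, 10}
NG = {7, 2, 4, 6}
print(ng_index(N7, NG))        # 1 + 2 + 4 + 8
print(sorted(ng_set(N7, 15)))  # back to the original set
```

Because every set is reduced to an integer smaller than 2 to the power of the neighbourhood size, the states can live in a preallocated array instead of a dynamic structure, which is the property the GPU implementation relies on.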
Given the starting vertex $j$, the index $Index_j(NG)$ of set NG, and the destination vertex $i$, the mapping function $Map(j, i, Index_j(NG))$ returns the index $Index_i(NG')$ of the set NG'.

4.3.2. Active sets

The states generated at each stage by expression (9) do not include all the possible subsets of $N_i$ but only a subset of them, which we call active sets.

We can generate the active set for each stage in advance, before starting the full recursion, by simply running the recursion without computing the values $f$ but using the transition map defined previously. This approach allows us to apply a modified version of the so-called threads compaction, described in Harish et al. (2008) and analyzed in Schulz (2013). This method consists in creating a mask for allowing the GPU to spawn only the threads useful for the computation. Using this technique and exploiting the property described before, we can take into consideration only the states useful for the relaxation and calibrate the computational resources on these, avoiding the overhead induced by non-working threads.

For each stage $(q, i)$ the cardinality of the corresponding active set is given by the function DimActiveSet(q, i) and the index of the k-th set belonging to the active set is given by function ActiveSet(q, i, k).

4.3.3. Enhanced serial version

The use of active sets at each stage, together with the indexing for the sets NG, improves also the performance of the serial version of the method, as shown in the following.

ALGORITHM NG-PATHS ENHANCED(n, Q, d, q)
1 // Initialization
2 for q = 0, ..., Q do
3 for i = 1, ..., n do
4 if q = q_i
5 then f(q, i, 0) = d(0, i), π(q, i, 0) = 0
6 else f(q, i, 0) = ∞, π(q, i, 0) = -1
7 // ng-Paths
8 for q = 0, ..., Q do
9 for i = 1, ..., n do
10 for j = 1, ..., n do
11 for h = 0, ..., DimActiveSet(q - q_i, j) do
12 NG_index = ActiveSet(q - q_i, j, h)
13 NG'_index = Map(j, i, NG_index)
14 if f(q, i, NG'_index) > f(q - q_i, j, NG_index) + d(j, i)
15 then f(q, i, NG'_index) = f(q - q_i,
j, NG_index) + d(j, i)
16 π(q, i, NG'_index) = j
17 return f, π

In this improved version of the serial algorithm NG-PATHS, functions $f$ and $\pi$ make use of an index, instead of using the whole set NG. The functions ActiveSet and Map at lines 12 and 13 are nothing more than a direct access to data structures saved in two arrays.

4.3.4. Parallel GPU version

The parallel version for GPU of the algorithm NG-PATHS ENHANCED is given in the following.

GPU-NG-PATHS(n, Q, d, q, ActiveSet, Map)
1 Variables f and π are initialized as in the serial algorithm
2 // Number of active threads for each set of stages q
3 for q = 0, ..., Q do
4 for i = 1, ..., n do
5 ActThds(q) = ActThds(q) + DimActiveSet(q, i)
6 // Initialize StartLabels
7 for q = 0, ..., Q do
8 for i = 1, ..., n do
9 for h = 0, ..., DimActiveSet(q, i) do
10 NG_index = ActiveSet(q, i, h)
11 SLabels_V(q).Add(i)
12 SLabels_NG(q).Add(NG_index)
13 // Kernel setup
14 BLOCKS B = (ActThds(q)/T, n - 1), THREADS T
15 // Main loop
16 for q' = 0, ..., Q do
17 NG-PATHS-KERNEL<<<B, T>>>
18 (q', d, q, f, π, ActiveSet, Map, SLabels_V, SLabels_NG)
19 return f, π

Inside the main procedure, GPU-NG-PATHS, at lines 3-5, we count the number of active labels for each stage $q$. Using this value inside the main loop of the procedure, we spawn the necessary number of threads for each iteration of the loop (line 14). At lines 7-12 we initialize the SLabels_V and SLabels_NG structures, containing, for each $q$, the indices of the active NG for each vertex. As for the serial version of the algorithm, we return the array $f$ with the function values, the array of the predecessor vertex, $\pi_V$, and the array of the predecessor path, $\pi_{NG}$ (line 19).

NG-PATHS-KERNEL(q', d, q, f, π, ActiveSet, Map, SLabels_V, SLabels_NG)
1 idx = blockIdx.x * blockDim.x + threadIdx.x
2 i = blockIdx.y, j = SLabels_V(q', idx), NG_index = SLabels_NG(q', idx)
3 NG'_index = Map(j, i, NG_index)
4 if f(q' + q_i, i, NG'_index) > f(q', j, NG_index) + d(j, i)
5 then f(q' + q_i, i, NG'_index) = f(q', j, NG_index) + d(j, i)
6 π(q' + q_i, i, NG'_index) = j

Inside the GPU kernel, NG-PATHS-KERNEL, at lines 1-3, we define the indices (i, j, NG_index) of the label.
To retrieve $j$ and NG_index we use the idx index of each thread, which enumerates a specific location of the SLabels_V and SLabels_NG lists. In this case, as shown at line 14 of the main algorithm GPU-NG-PATHS, we use bi-dimensional blocks: the x dimension for managing the indices of SLabels_V and SLabels_NG, and the y dimension to enumerate the nodes. At line 3, using the transition map, we find the index of the new NG set for the expanded label, and at lines 4-5 we update, if necessary, the new label. The operation in these lines is implemented using the atomicMin() CUDA primitive to manage the concurrent update of a single variable by several threads simultaneously, in order to avoid race conditions and inconsistent results.

4.4. GPU algorithms for the asymmetric case

In real-world VRPs the cost matrix is usually asymmetric (i.e., $d_{ij} \ne d_{ji}$ for some $i, j \in V$). In this situation, when we also need to compute the backward q-paths or ng-paths from a vertex $i$ to the depot, we can use the forward recursions replacing the cost matrix $D$ with its transpose $D^T$. Moreover, when we need to compute both forward and backward recursions (e.g., in the through-q-route relaxation), we may further improve the performance by exploiting all the parallel features of the GPU environment. In particular, we may compute concurrently the two recursions on the same GPU, introducing another level of parallelism among kernels.

In our case we do not use streams to hide the memory transactions between the CPU and the GPU, but we execute the same kernel with different data on the same GPU, in a typical SIMD approach. The kernels are essentially those described in the previous paragraph. The main difference is the use of the cudaMemcpyAsync() primitive, which works on page-locked memory, also known as pinned memory.

5. Computational results

In this section we report the experimental results, where we compare the serial and the GPU parallel versions of the algorithms described in this paper.
The computational experiments are performed on a workstation equipped with an Intel i7 4820K @ 3.9 gigahertz with 32 gigabytes of RAM and 2 NVIDIA GTX TITAN with 2688 CUDA cores @ 837 megahertz with 6 gigabytes of GDDR5 RAM.

The computational experiments are carried out on symmetric and asymmetric well-known instances from the literature. In particular, we use some symmetric instances of the CVRPLIB (CVRPLIB, 2015): twelve large instances proposed by Li, Golden, and Wasil (2005); nine instances from series A, B, and P proposed by Augerat et al. (1995); twelve instances from series E, F and M proposed by Christofides and Eilon (1969), Fisher (1994), and Christofides, Mingozzi, and Toth (1981b), respectively. The asymmetric dataset contains eight instances of the VRPLIB (VRPLIB, 2015) and two new instances of large size, called Balman859-1000 and Balman859-2000, corresponding to real-world VRPs (downloadable from: http://astarte.csr.unibo.it/data/#ACVRP).

In our computational tests, we compare the efficiency of the implementations of the route relaxation algorithms using a serial CPU approach and using GPU computing. The serial and GPU algorithms are functionally equivalent; the only differences concern the implementation on different architectures, as described in Section 4. The implementations of serial and GPU versions are straightforward and without any peculiar enhancement not described in
