
Attend2Pack:

Bin Packing through Deep Reinforcement Learning with Attention

Jingwei Zhang*1, Bin Zi*1, Xiaoyu Ge1,2

Abstract

This paper seeks to tackle the bin packing problem (BPP) through a learning perspective. Building on self-attention-based encoding and deep reinforcement learning algorithms, we propose a new end-to-end learning model for this task of interest. By decomposing the combinatorial action space, as well as utilizing a new training technique denoted as prioritized oversampling, which is a general scheme to speed up on-policy learning, we achieve state-of-the-art performance in a range of experimental settings. Moreover, although the proposed approach attend2pack targets offline-BPP, we strip our method down to the strict online-BPP setting where it is also able to achieve state-of-the-art performance. With a set of ablation studies as well as comparisons against a range of previous works, we hope to offer a valid baseline approach to this field of study.

1. Introduction

1.1. Background

At loading docks, the critical problem faced by the workers everyday is how to minimize the number of containers needed to pack all cargos in order to reduce transportation expenses; at factories and warehouses, palletizing robots are programmed to stack parcels onto pallets for efficient shipping. The central problem in these real-life scenes is the bin packing problem (BPP), which is having an ever-growing impact on logistics, trading and economy under globalization and the boosting of e-commerce.

Given a set of items and a number of bins, BPP seeks the packing configuration under which the number of bins needed is minimized while utilization (Sec.2) is maximized.

There are in general two types of bin packing problems, online-BPP and offline-BPP, depending on whether the upcoming loading items are known in advance. Specifically, in offline-BPP the information of all items to be packed is known beforehand, while in online-BPP items arrive one by one and the agent is only aware of the current item at hand. The typical application of online-BPP is palletizing, e.g., instantaneously packing items transported by a conveyor belt onto a pallet with a fixed bottom size (Zhao et al., 2021), where the number of items to be packed onto one pallet is usually in the order of tens; while the practical use cases of offline-BPP include loading cargos into containers so as to be transported by trucks or freighters, where the number of cargos per container could be in the order of hundreds.

In this paper, we mainly focus on offline-BPP, especially the single container loading problem (SCLP) (Gehring & Bortfeldt, 2002; Huang & He, 2009; Zhu et al., 2012), in which the goal is to pack all items into a bin in such a way that utilization is maximized, while our proposed method can be ablated to strict online-BPP settings (Sec.3.3).

1.2. Traditional Algorithms

As a family of NP-hard problems (Lewis, 1983), the optimal solution of BPP cannot be found in polynomial time (Sweeney & Paternoster, 1992; Martello et al., 2000; Crainic et al., 2008). Approximate methods and domain-specific heuristics have thus been proposed to find near-optimal solutions. Such methods include: branch and bound (Martello, 1990; Lysgaard et al., 2004; Lewis, 2008); integer linear programming (Dyckhoff, 1981; Cambazard & O'Sullivan, 2010; Salem & Kieffer, 2020); extreme-point-based constructive heuristics (Crainic et al., 2008), where the corner of a new item must be placed in contact with one of the corner points inside the container; block-building-based heuristics (Fanslau & Bortfeldt, 2010) that bring substantial performance gains, in which items are first constructed into common building blocks and then stacked into containers; as well as meta-heuristic algorithms (Abdel-Basset et al., 2018; Loh et al., 2008). However, the general applicability of these algorithms has been limited either by high computational costs or by the requirement for domain experts to adjust the algorithm specifications for each newly encountered scenario.

* Equal contribution. 1 Dorabot Inc, Shenzhen, Guangdong, China. 2 Australian National University, Canberra, Australia. Correspondence to: <jingwei.zhang, bin.zi@dorabot.com>.

Reinforcement Learning for Real Life (RL4RealLife) Workshop in the 38th International Conference on Machine Learning, 2021. Copyright 2021 by the author(s).
1.3. Learning-based Algorithms

Inspired by the success brought by powerful deep function approximators (Mnih et al., 2015; He et al., 2016a; Vaswani et al., 2017), BPP research has witnessed a surge of algorithms that approach this problem from a learning perspective. As a combinatorial optimization problem, although it requires exponential time to find the optimal solution in the combinatorial solution space, it is relatively cheap to evaluate the quality of a given solution for BPP. This makes the family of policy gradient algorithms (Williams, 1992; Sutton et al., 1999) a good fit for such problems: given an evaluation signal that is not necessarily differentiable, policy gradient offers a principled way to guide search with gradients, in contrast to gradient-free methods. Another advantage of learning-based approaches is that they are usually able to scale linearly with the input size, and have satisfactory generalization capabilities.

Such attempts at utilizing deep reinforcement learning techniques for combinatorial optimization start from (Bello et al., 2016), which tackles the traveling salesman problem (TSP) with pointer networks (Vinyals et al., 2015). (Kool et al., 2018) proposes an attention-based (Vaswani et al., 2017) model that brings substantial improvements in the vehicle routing problem (VRP) domain.

As for BPP, (Zhao et al., 2021; Young-Dae et al., 2020; Tanaka et al., 2020) focus on online-BPP and learn how to place each given box, while (Hu et al., 2017; Duan et al., 2019) learn to generate the sequence order for boxes while placing them using a fixed heuristic; therefore these works do not target the full problem of offline-BPP. Ranked reward (RR) (Laterre et al., 2019) learns on the full problem by conducting self-play under rewards evaluated by ranking. Although RR outperforms MCTS (Browne et al., 2012) and the GUROBI solver (Gurobi Optimization, 2018), especially with large problem sizes, it directly learns on the full combinatorial action space, which could pose major challenges for efficient learning. (Zhao et al., 2021; Jiang et al., 2021) also propose end-to-end approaches for offline-BPP; however, their state representations for encoding might not be optimized for BPP. In this work, we propose a new attention-based model specifically designed for BPP. To facilitate efficient learning, we decompose the combinatorial action space and propose prioritized oversampling to boost on-policy learning. We conduct a set of ablation studies and show that our method achieves state-of-the-art results in a range of comparisons.

2. Methods

We present our approach in the context of 3D bin packing with a single bin, from which its application to the 2D situation should be straightforward.

We formulate the bin packing problem (BPP) as a Markov decision process. At the beginning of each episode, the agent is given a set of N boxes, each with dimensions (l^n, w^n, h^n), as well as a bin of width W and height H with a length of L = \sum_{n=1}^{N} \max(l^n, w^n, h^n), such that it is long enough to contain all these N boxes under all variations of packing configurations, as long as each box is in contact with at least one other box. These together make up the input set of a 3D bin packing problem I = {(l^1, w^1, h^1), (l^2, w^2, h^2), ..., (l^N, w^N, h^N), (L, W, H)}. The goal for the agent is to pack all N boxes into the bin in such a way that the utility r_u ∈ [0, 1] is maximized, under the geometry constraint that no overlap between items is allowed. The utility is calculated as

    r_u(C_\pi) = \frac{\sum_{n=1}^{N} l^n w^n h^n}{L_{C_\pi} W H},    (1)

where C_π is a packing configuration (discussed in more detail in Sec.2.1) generated by following packing policy π, and L_{C_π} represents, for this packing configuration C_π, the length of the minimum bounding box containing all the N packed boxes inside of the bin. The utility is given to the agent as a terminal reward r_u only when the last box has been packed into the bin, meaning that the agent receives no step or intermediate reward during an episode. We choose this terminal reward setup since it is non-trivial to design local step reward functions that would perfectly align with the global final evaluation signal. We also designed and experimented with other terminal reward formulations but found utility to be a simple yet effective solution, which in addition has a fixed range that makes it easier to search for hyperparameters that can work robustly across various datasets.

When placing a box into the bin, the box can be rotated 90° along each of its three axes, leading to a total number of O = 6 possible orientations. When it comes to the coordinate for placement inside a bin of L × W × H, there is also in total a maximum number of L × W × H possible locations to place a box (we denote the rear-left-bottom corner of the bin as the origin (0, 0, 0) and define the location of a box n inside the bin also by the coordinate (x^n, y^n, z^n) of its rear-left-bottom corner). This leads to a full action space of size N × O × L × W × H for the full bin packing problem.
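To make the terminal reward concrete, below is a minimal sketch (not from the paper; the data class and helper names are ours) of how the utility of Eq. 1 can be computed from a finished packing, assuming axis-aligned placements with rear-left-bottom coordinates as defined above.

```python
from dataclasses import dataclass

@dataclass
class PlacedBox:
    # rear-left-bottom corner (x, y, z) and dimensions after rotation (l, w, h)
    x: int; y: int; z: int
    l: int; w: int; h: int

def terminal_utility(placed: list, bin_w: int, bin_h: int) -> float:
    """Eq. (1): occupied volume divided by (L_C * W * H), where L_C is the
    extent of the packing along the bin's length axis."""
    assert placed, "utility is only defined once all boxes are packed"
    occupied = sum(b.l * b.w * b.h for b in placed)
    length_used = max(b.x + b.l for b in placed)  # L_{C_pi}
    return occupied / (length_used * bin_w * bin_h)
```

The corresponding training cost used later in Sec. 2.5 is then simply 1 minus this value.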
2.1. Decomposing the Action Space

To deal with the potentially very large combinatorial action space, we decompose the bin packing policy π of the agent into two: a sequence policy π^s that selects at each time step t the next box to pack, s_t ∈ {1, ..., N} \ s_{1:t-1} (with action space S of size |S| = N), and a placement policy π^p that picks the orientation and the location for the currently selected box, p_t = (o^{s_t}, x^{s_t}, y^{s_t}, z^{s_t}) (with action space P of size |P| = O × L × W × H). We note that once a location (x^{s_t}, y^{s_t}) is selected for the current box, its coordinate along the height dimension z^{s_t} is also fixed, because all items must have support along the height dimension; also, since we expect the packing to be as compact as possible, such that each box must be in contact with at least one other box, once its coordinate y^{s_t} along the width dimension is selected, its coordinate x^{s_t} along the length dimension can be automatically located as the rearest feasible point where the current box would be in contact with at least one other box or the rear wall of the bin, and at the same time does not overlap with any other boxes. Therefore, the actual working action space for the placement policy π^p is of size |P| = O × W: by selecting a placing coordinate y^{s_t} along the width dimension, the other two coordinates for s_t are located as the rearest-lowest feasible point.

This decomposition can be regarded as casting the full joint combinatorial action space among two agents, formulating BPP as a fully-cooperative multi-agent reinforcement learning task (Panait & Luke, 2005; Busoniu et al., 2008). We note that the learning system could now be more properly modeled as a sequential Markov game (Littman, 1994; Lanctot et al., 2017; Yang & Liu, 2020) with two agents: a sequence agent with action space S and a placement agent with action space P sharing the same reward function.

Following this decomposition, the joint policy π given input I is factorized as

    \pi(s_1, p_1, s_2, p_2, \ldots, s_N, p_N \mid I) = \prod_{t=1}^{N} \pi^s\!\big(s_t \mid f_I^s(s_{1:t-1}, p_{1:t-1})\big) \, \pi^p\!\big(p_t \mid f_I^p(s_{1:t}, p_{1:t-1})\big),    (2)

in which the sequence policy π^s and the placement policy π^p together generate the full solution configuration C_{π^s,π^p} = {s_1, p_1, s_2, p_2, ..., s_N, p_N} in an interleaving manner; f_I is the encoding function that maps (partially completed) configurations into state representations for a particular input set I, which will be discussed below.

We will denote the deep network parameters for encoding as θ^e (Sec.2.2), the parameters for decoding the sequence policy as θ^s (Sec.2.3), and those for the placement policy as θ^p (Sec.2.4); their aggregation is denoted as θ. Dependencies on these parameters are sometimes omitted in notation.

2.2. Encoding

2.2.1. Box Embeddings

At the beginning of each episode, the dimensions of each box in the input are first encoded through a linear layer

    \bar{b}^n = \mathrm{Linear}(l^n, w^n, h^n).    (3)

The set of N such embeddings b̄ (|b̄| = d) is then passed through several multi-head (with M heads) self-attention layers (Vaswani et al., 2017; Radford et al., 2019), with each layer containing the following operations:

    \tilde{b}^n = \bar{b}^n + \mathrm{MHA}\big(\mathrm{LN}(\bar{b}^1, \bar{b}^2, \ldots, \bar{b}^N)\big),    (4)
    b^n = \tilde{b}^n + \mathrm{MLP}\big(\mathrm{LN}(\tilde{b}^n)\big),    (5)

where MHA denotes the multi-head attention layer (Vaswani et al., 2017), LN denotes layer normalization (Ba et al., 2016) and MLP denotes a feed-forward fully-connected network with ReLU activations. We organize the skip connections following (Radford et al., 2019) to allow identity mappings (He et al., 2016b). This produces a set of box embeddings B = {b^1, b^2, ..., b^N} (each of dimension |b| = d), which are kept fixed during the entire episode. We note that this is in contrast to several previous works (Li et al., 2020; Jiang et al., 2021) that need to re-encode the box embeddings at every time step during an episode.

2.2.2. Frontier Embedding

Besides the box embeddings B, in order to provide the agent with a description of the situation inside the bin, we additionally feed the agent with a frontier embedding f (also of size |f| = d) at each time step. More specifically, we define the frontier F of a bin as the front-view of the loading structure. For a bin of width W and height H, its frontier is of size W × H, with each entry representing, for that particular location, the maximum x coordinate of the already packed boxes. To compensate for possible occlusions, at each time step t we stack the last two frontiers {F_{t-1}, F_t} (F_t denotes the frontier before the box (l^{s_t}, w^{s_t}, h^{s_t}) has been packed into the bin) and pass them through a convolutional network to obtain the frontier embedding f_t.

2.3. Decoding Sequence

We build on the pointer network (Vinyals et al., 2015; Bello et al., 2016; Kool et al., 2018) to model the sequence policy π^s in order to allow for variable-length inputs.

At time step t, the task for the sequence policy π^s is to select the index s_t of the next box to be packed out of all unpacked boxes: s_t ∈ {1, ..., N} \ s_{1:t-1}. The embeddings for these "leftover" boxes constitute the set B \ b_{s_{1:t-1}}. Besides the leftover embeddings, the already packed boxes along with the situation inside the bin can be described by the frontier and subsequently the frontier embedding f_t. Therefore we instantiate the encoding function f_I^s (Eq.2) for the sequence policy as

    \bar{q}_t^s = f_I^s(s_{1:t-1}, p_{1:t-1}; \theta^e) = \big[\langle B \setminus b_{s_{1:t-1}} \rangle, f_t\big],    (6)

with ⟨·⟩ representing the operation that takes in a set of d-dimensional vectors and returns their mean vector, which is also d-dimensional. This encoding function generates the query vector q̄_t^s for sequence decoding.
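For illustration, the following is a minimal PyTorch-style sketch of the encoder described in Secs. 2.2.1-2.2.2 (Eqs. 3-5 as pre-LN self-attention blocks, run once per episode, plus a small CNN over the stacked frontiers). The module names, the 4d feed-forward width and the final projection are our choices; the frontier-CNN layout (8 channels, kernel size 3, stride 1, padding 1, LayerNorm + ReLU) follows a truncated fragment of the paper's appendix and should be treated as approximate.

```python
import torch
import torch.nn as nn

class BoxEncoder(nn.Module):
    """Pre-LN self-attention encoder for Eqs. (3)-(5); run once per episode."""
    def __init__(self, d=128, heads=8, layers=3):
        super().__init__()
        self.embed = nn.Linear(3, d)                                   # Eq. (3)
        self.ln_attn = nn.ModuleList(nn.LayerNorm(d) for _ in range(layers))
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d, heads, batch_first=True) for _ in range(layers))
        self.ln_mlp = nn.ModuleList(nn.LayerNorm(d) for _ in range(layers))
        self.mlp = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
            for _ in range(layers))

    def forward(self, dims):                 # dims: (batch, N, 3), normalized l/w/h
        b = self.embed(dims)
        for ln_a, attn, ln_m, mlp in zip(self.ln_attn, self.attn, self.ln_mlp, self.mlp):
            x = ln_a(b)
            b = b + attn(x, x, x, need_weights=False)[0]               # Eq. (4)
            b = b + mlp(ln_m(b))                                       # Eq. (5)
        return b                             # box embeddings B, fixed for the episode

class FrontierEncoder(nn.Module):
    """CNN over the stacked frontiers {F_{t-1}, F_t} -> frontier embedding f_t."""
    def __init__(self, bin_w, bin_h, d=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 8, kernel_size=3, stride=1, padding=1),
            nn.LayerNorm([8, bin_w, bin_h]), nn.ReLU(),
            nn.Conv2d(8, 8, kernel_size=3, stride=1, padding=1), nn.ReLU())
        self.proj = nn.Linear(8 * bin_w * bin_h, d)

    def forward(self, frontiers):            # frontiers: (batch, 2, W, H)
        return self.proj(self.conv(frontiers).flatten(1))              # f_t, size d
```

The mean of the leftover box embeddings concatenated with this f_t then gives the sequence query of Eq. 6.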
Following (Bello et al., 2016; Kool et al., 2018), we add a glimpse operation via multi-head (M heads) attention (Vaswani et al., 2017) before calculating the policy probabilities, which has been shown to bring performance gains. At the beginning of each episode, from each of the box embeddings b^n, M sets of fixed context vectors are obtained via learned linear projections. With h = d/M, the context vectors k̄^{n,m} (glimpse key of size h) and v̄^{n,m} (glimpse value of size h) for the m-th head, as well as k^n (logit key of size d) shared across all M heads, are obtained through

    \bar{k}^{n,m} = W^{\bar{k},m} b^n, \quad \bar{v}^{n,m} = W^{\bar{v},m} b^n, \quad k^n = W^{k} b^n,    (7)

with W^{k̄,m}, W^{v̄,m} ∈ R^{h×d} and W^{k} ∈ R^{d×d} (in general the key and value vectors are not required to be of the same dimensionality; here W^{k̄,m} and W^{v̄,m} are of the same size only for notational convenience). The sequence query vector q̄_t^s is split into M glimpse query vectors, each of dimension |q̄_t^{s,m}| = h. The compatibility vector c̄_t^m (|c̄_t^m| = N) is then calculated with its entries

    \bar{c}_{t,n}^m = \begin{cases} \dfrac{(\bar{q}_t^{s,m})^{\top} \bar{k}^{n,m}}{\sqrt{h}} & \text{if } n \notin \{s_{1:t-1}\}, \\ -\infty & \text{otherwise.} \end{cases}    (8)

Then we can obtain the updated sequence query for the m-th head as the weighted sum of the glimpse value vectors

    q_t^{s,m} = \sum_{n=1}^{N} \mathrm{softmax}(\bar{c}_t^m)_n \cdot \bar{v}^{n,m}.    (9)

Concatenating the updated sequence query vectors from all M heads we obtain q_t^s (|q_t^s| = d). Passing it through another linear projection W^q ∈ R^{d×d}, the updated compatibility c_t (|c_t| = N) is calculated by

    c_{t,n} = \begin{cases} C \cdot \tanh\!\left( \dfrac{(W^q q_t^s)^{\top} k^n}{\sqrt{d}} \right) & \text{if } n \notin \{s_{1:t-1}\}, \\ -\infty & \text{otherwise,} \end{cases}    (10)

where the logits are clamped within [−C, C] following (Bello et al., 2016; Kool et al., 2018), as this could help exploration and bring performance gains (Bello et al., 2016). Finally the sequence policy can be obtained via

    \pi^s\big( \cdot \mid f_I^s(s_{1:t-1}, p_{1:t-1}; \theta^e); \theta^s \big) = \mathrm{softmax}(c_t).    (11)

During training, the box index s_t is selected by sampling from the distribution defined by the sequence policy, s_t ∼ π^s( · | f_I^s(s_{1:t-1}, p_{1:t-1}) ), while during evaluation the greedy action is selected, s_t = arg max_a π^s( a | f_I^s(s_{1:t-1}, p_{1:t-1}) ).

2.4. Decoding Placement

Given the box index s_t selected by the sequence policy π^s, the placement policy π^p will select a placement p_t = (o^{s_t}, y^{s_t}) that decides for this box its orientation o^{s_t} and its coordinate y^{s_t} along the width dimension within the bin. As discussed in Sec.2.1, once y^{s_t} is selected, the other two coordinates of the box, x^{s_t} and z^{s_t}, can be directly located.

As for the encoding function f_I^p for the placement policy, it contains the processing of three kinds of embeddings: the current box embedding b^{s_t}, the leftover embeddings B \ b_{s_{1:t}} and the frontier embedding f_t (which is the same as for f_I^s since the situation inside the bin has not changed after s_t is selected). Therefore f_I^p takes the form of

    q_t^p = f_I^p(s_{1:t}, p_{1:t-1}; \theta^e) = \big[ b^{s_t}, \langle B \setminus b_{s_{1:t}} \rangle, f_t \big].    (12)

This yields the query vector q_t^p for placement decoding.

Unlike the sequence policy π^s, which selects over a fixed set of candidate boxes at each time step during an episode, the candidate pool for the placement policy π^p changes from time step to time step, which makes it not quite suitable to model π^p using another pointer network (Vinyals et al., 2015) with nonparametric softmax.

Therefore, we deploy a placement network (parameterized by θ^p) that takes in the query vector q_t^p and outputs action probabilities with a standard parametric softmax. More specifically, the placement network θ^p outputs logits l̄_t^p = θ^p(q_t^p) of size O × W. If a placement (o_i, y_j) is infeasible, in that after being placed the current box would cut through the walls of the container, its corresponding entry in the logit vector l̄_{t,i×j}^p will be masked as −∞, while all other feasible entries will be processed as C · tanh(l̄_{t,·}^p). This processed logit vector l_t^p is then used to give out the action probabilities for placement selection:

    \pi^p\big( \cdot \mid f_I^p(s_{1:t}, p_{1:t-1}; \theta^e); \theta^p \big) = \mathrm{softmax}(l_t^p).    (13)

As in sequence selection, the placement p_t is sampled during training while chosen greedily during evaluation.
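The two decoding heads of Secs. 2.3-2.4 can be sketched as follows (a hedged, non-authoritative sketch: tensor layouts, the placement MLP architecture and the per-step recomputation of the glimpse projections are our simplifications; only the masking, the glimpse of Eqs. 7-9 and the C·tanh clipping of Eqs. 10 and 13 follow the paper).

```python
import math
import torch
import torch.nn as nn

class SequenceDecoder(nn.Module):
    """Glimpse attention + clipped, masked pointer logits, Eqs. (7)-(11)."""
    def __init__(self, d=128, heads=8, clip=10.0):
        super().__init__()
        self.heads, self.h, self.clip = heads, d // heads, clip
        self.Wk_glimpse = nn.Linear(d, d, bias=False)   # all M heads of W^{k-bar,m}
        self.Wv_glimpse = nn.Linear(d, d, bias=False)   # all M heads of W^{v-bar,m}
        self.Wk_logit = nn.Linear(d, d, bias=False)     # W^k
        self.Wq = nn.Linear(d, d, bias=False)           # W^q

    def forward(self, box_emb, query, packed_mask):
        # box_emb: (B, N, d); query q-bar_t^s: (B, d); packed_mask: (B, N) True if already packed
        Bsz, N, d = box_emb.shape
        k = self.Wk_glimpse(box_emb).view(Bsz, N, self.heads, self.h)
        v = self.Wv_glimpse(box_emb).view(Bsz, N, self.heads, self.h)
        q = query.view(Bsz, self.heads, self.h)
        compat = torch.einsum('bmh,bnmh->bmn', q, k) / math.sqrt(self.h)   # Eq. (8)
        compat = compat.masked_fill(packed_mask[:, None, :], float('-inf'))
        glimpse = torch.einsum('bmn,bnmh->bmh', compat.softmax(-1), v)     # Eq. (9)
        q_upd = self.Wq(glimpse.reshape(Bsz, d))                           # W^q q_t^s
        logits = torch.einsum('bd,bnd->bn', q_upd, self.Wk_logit(box_emb)) / math.sqrt(d)
        logits = self.clip * torch.tanh(logits)                            # Eq. (10)
        logits = logits.masked_fill(packed_mask, float('-inf'))
        return logits.softmax(-1)                                          # Eq. (11)

class PlacementDecoder(nn.Module):
    """Parametric softmax head over the O x W placements, Eq. (13)."""
    def __init__(self, d=128, num_orient=6, bin_w=10, clip=10.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(),
                                 nn.Linear(d, num_orient * bin_w))
        self.clip = clip

    def forward(self, query, infeasible_mask):
        # query q_t^p: (B, 3d); infeasible_mask: (B, O*W) True where the box would cut walls
        logits = self.clip * torch.tanh(self.net(query))
        logits = logits.masked_fill(infeasible_mask, float('-inf'))
        return logits.softmax(-1)
```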
2.5. Policy Gradient

So far we have discussed how, given an input set I, the sequence agent π^s and the placement agent π^p cooperatively yield a solution configuration C_π = {s_1, p_1, s_2, p_2, ..., s_N, p_N} in an interleaving manner over N time steps (π denotes the aggregation of π^s and π^p). In order to train this system such that π^s and π^p cooperatively maximize the final utility r_u ∈ [0, 1], which is equivalent to minimizing the cost c(C_π) = 1 − r_u(C_π), we define the overall loss function as the expected cost of the configurations generated by following π:

    J(\theta \mid I) = \mathbb{E}_{C_\pi \sim \pi_\theta}\big[ c(C_\pi \mid I) \big],    (14)

which is optimized by following the REINFORCE gradient estimator (Williams, 1992)

    \nabla_\theta J(\theta \mid I) = \mathbb{E}_{C_\pi \sim \pi_\theta}\!\left[ \big( c(C_\pi \mid I) - b(I) \big) \cdot \sum_{t=1}^{N} \big( \nabla_{\theta^{e,s}} \log \pi^s(s_t; \theta^{e,s}) + \nabla_{\theta^{e,p}} \log \pi^p(p_t; \theta^{e,p}) \big) \right].    (15)

We note that, to avoid cluttering, we consume the dependency of π^s on f_I^s(s_{1:t-1}, p_{1:t-1}; θ^e) into θ^e and use θ^{e,s} to denote the aggregation of θ^e and θ^s; the same holds for π^p. For the baseline function b(I) in the above equation, we use the greedy rollout baseline proposed by (Kool et al., 2018), as this has been shown to give superior performance. More specifically, the parameters of the best model θ^{BL} encountered during training are used to roll out greedily on the training input I to give out b(I); at the end of every epoch i, the model θ^{BL} is compared against the current model θ_i and is replaced if the improvement of θ_i over θ^{BL} is significant according to a paired t-test (α = 5%). More details can be found in (Kool et al., 2018).

2.6. Prioritized Oversampling

Based on observations from preliminary experiments, we suspect that the placement policy could have a more local effect than the sequence policy on the overall quality of the resulting configuration: we found that in hard testing scenarios, a sub-optimal early sequence selection made by π^s would make it challenging for the overall system to yield a satisfactory solution configuration, while, given a box to place, the effect of a sub-optimal decision of π^p usually does not propagate far. Another observation that leads to this suspicion is that those testing samples on which the trained system performs poorly are usually those that have a more strict requirement on the sequential order of the input boxes.

However, in our problem setup, the agent only receives an evaluation signal upon the completion of a full episode, which involves subsequently rolling out π^s and π^p over multiple time steps, with sequence selection and placement selection interleavingly conditioned and dependent on each other. This makes it not quite suitable to use schemes such as a counterfactual baseline (Foerster et al., 2018) to address multi-agent credit assignment, especially since in our case marginalizing out the effect of a sequence action s_t is not straightforward and could be expensive.

So instead, inspired by prioritized experience replay (Schaul et al., 2016) for off-policy learning, we propose prioritized oversampling (PO) as a general technique to speed up on-policy training. The general idea of prioritized oversampling is to give the agent a second chance at learning those hard examples on which it had performed poorly during regular training. Following the above discussions, we focus on finding better sequence selections for those hard samples during the second round of learning. To achieve this we first follow a beam-search-based scheme; but we later found that a much simpler re-learning procedure could bring performance gains on par with the rather complicated beam-search-based training, therefore we choose the simple alternative as our final design choice.

Prioritized oversampling is carried out in two stages: collect and re-learn. First, during training, when learning on a batch B of batch size B, the advantages of this batch are calculated as c(C_π | I) − b(I) (Eq.15). We note that the lower the advantage, the better the configuration C_π, since here we use the notion of costs instead of rewards. From those training samples, a PO batch B^PO of size B^PO (usually B^PO ≪ B) is then collected that contains those with the highest advantage values, hence the worst solution configurations. After a total number of B^PO samples have been collected (e.g. after B/B^PO steps of normal training), the second step of PO re-learning can be carried out. We note that it might not be necessary to conduct normal training and PO re-learning using the same batch size, but we found empirically that this leads to more stable training dynamics, especially when learning with optimizers with adaptive learning rates (Kingma & Ba, 2015). We also deploy a separate optimizer for PO re-learning and found this to be crucial to stabilize learning, as the statistics of the PO batches could differ dramatically from the normal training batches.
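As a rough sketch of one regular training step with the greedy-rollout baseline of Eq. 15 plus the PO "collect" stage (the `policy.rollout` helper is hypothetical, the η-based loss re-weighting described next is omitted here, and the batch is assumed to be an indexable list of instances):

```python
import torch

def reinforce_po_step(policy, baseline_policy, batch, po_buffer, po_size=16):
    """REINFORCE with a greedy-rollout baseline, plus PO sample collection.
    `policy.rollout(batch, greedy)` is assumed to return per-sample costs
    (1 - utility) and the summed log-probabilities of pi^s and pi^p."""
    cost, logp = policy.rollout(batch, greedy=False)           # sampled rollouts
    with torch.no_grad():
        baseline_cost, _ = baseline_policy.rollout(batch, greedy=True)
    advantage = cost - baseline_cost                            # lower is better
    loss = (advantage.detach() * logp).mean()                   # REINFORCE surrogate
    loss.backward()

    # PO collect stage: keep the samples with the highest advantages (worst packings)
    worst = torch.topk(advantage, k=po_size).indices
    po_buffer.extend(batch[i] for i in worst.tolist())
    return loss.item()
```

Once `po_buffer` has accumulated a full batch, a PO re-learning step is run on it with its own optimizer, as described below.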
As for the PO re-learning, we experiment with two types of learning strategies: beam-search-based training and normal training. In beam-search-based training with a beam size of K, for each PO sample, K instead of 1 solution configurations will be generated by the end of an episode. More specifically, at t = 1, π^s will sample K^2 box indices {s_{1,1}, s_{1,2}, ..., s_{1,K^2}}. These will all be given to π^p, which will select a placement for each s_{1,k}, out of which the K candidates with the highest π^s(s_{1,k})π^p(p_{1,k}) will be selected to be passed as the starting configurations to the next time step. At all time steps t > 1, π^s will receive K candidate starting configurations, and it will sample K next box indices for each of these K configurations. These K^2 box indices will be passed to π^p to select one placement for each. Pruning will again be conducted according to \prod_{\tau=1}^{t} \pi^s(s_\tau)\pi^p(p_\tau), from which K candidates are left. From preliminary experiments with this setup, we found that those K final trajectories picked out by beam-search do not necessarily yield better performance; we suspect that the reason is that the pruning criterion used here does not necessarily translate to good utility. We then adjusted the algorithm such that, after the K configurations have been generated at the end of an episode, we only choose the one with the best advantage to train the model. We found that this gives better performance. However, we also found that a much simpler scheme, which instead of conducting training-time beam-search simply conducts regular training on the PO samples during the re-learning phase, could already bring comparable performance gains, while being much simpler in concept and in implementation. Therefore we choose the latter strategy as our final design choice.

We note that in order to compensate for oversampling, during the normal training of batch B, we scale down the losses of each of those samples that are collected into B^PO by η; while during PO re-learning, the coefficient for those PO samples is set to 1 − η. η is calculated by tracking a set of running statistics. We denote the samples in B with positive advantages, hence relatively inferior performance, as B^+, and denote its size as B^+. Then we keep track of the running statistics (with momentum β) of the sample size B^+ and of the average advantage of B^+, denoted as B_β^+ and a_β^+ respectively, which are updated by

    a_\beta^+ = \frac{\beta \cdot B_\beta^+ \cdot a_\beta^+ + (1-\beta) \cdot B^+ \cdot a^+}{\beta \cdot B_\beta^+ + (1-\beta) \cdot B^+}, \qquad B_\beta^+ = \beta \cdot B_\beta^+ + (1-\beta) \cdot B^+.

We conduct the same running-statistics tracking for the PO batches B^PO, from which we get a_β^PO. Then we get the coefficient η = max(min(1 − a_β^+ / a_β^PO, 1.0), 0.5).
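A minimal sketch of this loss-weight tracker follows; the class and attribute names are ours, the PO-batch statistic is simplified to a plain exponential moving average, and the fallback value of 0.5 before any PO batch has been seen is our own choice.

```python
class POWeightTracker:
    """Running statistics for the PO loss coefficient eta (Sec. 2.6)."""
    def __init__(self, beta=0.95):
        self.beta = beta
        self.size_pos = 0.0   # B_beta^+ : running size of positive-advantage samples
        self.adv_pos = 0.0    # a_beta^+ : their running mean advantage
        self.adv_po = 0.0     # a_beta^PO: running mean advantage of PO batches

    def update_normal(self, advantages):
        pos = advantages[advantages > 0]
        if len(pos) == 0:
            return
        b_plus, a_plus = float(len(pos)), float(pos.mean())
        denom = self.beta * self.size_pos + (1 - self.beta) * b_plus
        self.adv_pos = (self.beta * self.size_pos * self.adv_pos
                        + (1 - self.beta) * b_plus * a_plus) / denom
        self.size_pos = denom

    def update_po(self, po_advantages):
        self.adv_po = (self.beta * self.adv_po
                       + (1 - self.beta) * float(po_advantages.mean()))

    @property
    def eta(self):
        if self.adv_po == 0.0:
            return 0.5
        return max(min(1.0 - self.adv_pos / self.adv_po, 1.0), 0.5)
```

During normal training the losses of samples copied into the PO batch are scaled by `eta`, and during PO re-learning by `1 - eta`, matching the weighting described above.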
3. Experiments

3.1. Experimental Setup

When conducting experiments to validate our proposed method, we do not find it straightforward to compare against previous works, as the datasets used in previous evaluations vary from one to another. The main varying factors are the size of the bin, the number of boxes to pack, the range of the size of the boxes, as well as how the box dimensions are sampled. In order to offer a valid benchmark, in this paper we conduct ablations as well as careful studies against a variety of previous state-of-the-art methods on BPP, comparing to each according to its specific setup. Unless otherwise mentioned, we run 2 trainings under 2 non-tuned random seeds for each training configuration.

Across all experiments, we use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e−4 and a batch size of B = 1024, and each epoch contains 1024 such minibatches. After every epoch, evaluation is conducted on a fixed set of validation data of size 1e4. The plots we present in this section all show the performance of these evaluations during training, while the final testing results presented in tables are on testing data of size 1e4 that are drawn independently from the training/validation data, and the testing results of the best agent out of the 2 runs are presented in the tables. A fixed training budget of 128 epochs is set for our full model. For those experiments with prioritized oversampling (PO), a PO batch of size B^PO = 16 is collected from each training batch, which means that a full batch ready for PO re-learning would be gathered every B/B^PO = 64 training steps. This means that after an epoch of 1024 training steps, 1024/64 = 16 more PO training steps would have been conducted than in setups without PO. Over a total number of 128 epochs this sums up to 128 · 16 = 2 · 1024, which equals 2 full epochs of additional training steps. Thus, to ensure a fair comparison, we set the total number of epochs for those setups without PO to 130.

The raw box dimensions are normalized by the maximum size of the box edges before being fed into the deep network. As for the network, the initial linear layer contains 128 hidden units, which is followed by 3 multi-head (with M = 8 heads) self-attention layers with an embedding size of d = 128. The costs c are scaled up by 3.0 before being used to calculate losses. Other important hyperparameters are C = 10 and β = 0.95.

3.2. Ablation Study on Action Space

We first conduct an ablation study to evaluate the effectiveness of the action space decomposition, as well as the effect of utilizing PO. For this experiment, we use the data setup as in (Laterre et al., 2019), where boxes are generated by cutting a bin of size 10 × 10 × 10, and the resulting box edges can take any integer in the range [1 .. 10]. (Laterre et al., 2019) conducted experiments by cutting the aforementioned bin into 10, 20, 30, 50 boxes. Here we conduct ablations on the 10-box dataset, and a full comparison to this state-of-the-art algorithm (Laterre et al., 2019) will be presented in Sec.3.4.

Our full method is denoted attend2pack, which we compare against a ×po agent trained without utilizing PO, and a ×po-joint agent directly trained on the joint action space, with one softmax policy network directly outputting action probabilities over the full action space. The plots of evaluation during training are shown in Fig.1a, which shows that decomposing the joint combinatorial action space does boost learning, and PO helps to further improve training efficiency. Packing results are also visualized in Fig.1b-1d.

[Figure 1. Plots (Fig.1a: average utility obtained in evaluation during training, each plot showing mean ± 0.1 · standard deviation) and packing visualizations (Fig.1b: attend2pack, Fig.1c: ×po, Fig.1d: ×po-joint) for Sec.3.2, on the dataset generated by cutting a 10 × 10 × 10 bin into 10 boxes.]
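The cutting procedure itself is only given as pseudo-code in the paper's Appendix I (not reproduced here); the following is merely one plausible guillotine-style stand-in that we wrote for illustration of "cut a 10 × 10 × 10 bin into N boxes with integer edges", and should not be taken as the authors' exact data-generation scheme.

```python
import random

def cut_bin_into_boxes(num_boxes, bin_dims=(10, 10, 10), seed=None):
    """Illustrative guillotine cutting: repeatedly split the largest piece with
    an axis-aligned integer cut until num_boxes pieces remain."""
    rng = random.Random(seed)
    pieces = [tuple(bin_dims)]
    while len(pieces) < num_boxes:
        pieces.sort(key=lambda d: d[0] * d[1] * d[2], reverse=True)
        piece = pieces.pop(0)
        axes = [a for a in range(3) if piece[a] >= 2]   # need room for an integer cut
        if not axes:                                     # cannot cut any further
            pieces.append(piece)
            break
        axis = rng.choice(axes)
        cut = rng.randint(1, piece[axis] - 1)
        for part in (cut, piece[axis] - cut):
            child = list(piece)
            child[axis] = part
            pieces.append(tuple(child))
    return pieces  # list of (l, w, h); volumes sum to the bin volume

# Example: boxes = cut_bin_into_boxes(10, seed=0)
```

By construction such a dataset always admits a packing with utility 1, which is why utility is a meaningful yardstick in this setting.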
3.3. Ablation Study in the Context of Online BPP

We conduct another set of ablation studies in the context of online-BPP against the state-of-the-art learning-based solution in this domain (Zhao et al., 2021). As discussed in the introduction, in online-BPP the agent is only aware of the dimensions of the box at hand. Since our algorithm targets the offline-BPP problem, we strip our method down to the strict online setting and along the way also evaluate the effects of several components of our proposed method.

(Zhao et al., 2021) conducted experiments with boxes generated by cut and by random sampling. As comparing to their cut experiments is not easily feasible, we conduct experiments using their random data-generation scheme: each box edge is sampled from the integers in the range [2 .. 5], while the bin is of size 10 × 10 × 10. Their evaluation criterion is to count the average number of boxes that can be placed inside the bin, which is a bit different from our training scheme. Thus we compare the final results against theirs in the following way: based on the total volume of the bin and the expected volume of each box, we can calculate that the maximum expected number of boxes that can be packed is 24. We thus set N = 24 when generating datasets. After training, we remove all the boxes that are at least partially out of the 10 × 10 × 10 bin and count the number of boxes still left inside of the bin to compare to (Zhao et al., 2021).

[Figure 2. Results for Sec.3.3, on the dataset generated by randomly sampling the edges of 24 boxes from [2 .. 5], with a bin of W × H = 10 × 10. Fig.2a: average utility obtained in evaluation during training, each plot showing mean ± 0.1 · standard deviation; Fig.2b: attend2pack; Fig.2c: ×po-×seq-×att-×leftover; Fig.2d: the packing configuration of Fig.2c after being processed to compare with palletizing in online-BPP, where the items at least partially outside of the 10 × 10 × 10 bin are removed, after which the packing is rotated such that the rear wall of the bin is now at the bottom of the pallet.]

[Figure 3. Visualizations for Sec.3.4, on datasets generated by cutting a square of 10 × 10 (2D) or a bin of 10 × 10 × 10 (3D) into 10, 20, 30, 50 items, with the box edges ranging within [1 .. 10]. Panels (a)-(d): 2D with 10, 20, 30, 50 items; panels (e)-(h): 3D with 10, 20, 30, 50 items.]

Table 1. Performance comparison on the cut dataset with ranked reward (RR) (Laterre et al., 2019). The first two columns show results in r_RR, while the last column gives r_u of our method to ease comparisons for future works.^2

          RR (r_RR)         ATTEND2PACK (r_RR)   ATTEND2PACK (r_u)
  2D-10   0.953 ± 0.027     0.995 ± 0.015        0.991 ± 0.028
  2D-20   0.948 ± 0.020     0.997 ± 0.012        0.994 ± 0.022
  2D-30   0.954 ± 0.015     0.999 ± 0.006        0.999 ± 0.011
  2D-50   0.960 ± 0.016     1.000 ± 0.000        1.000 ± 0.000
  3D-10   0.903 ± 0.060     0.953 ± 0.042        0.932 ± 0.060
  3D-20   0.850 ± 0.032     0.911 ± 0.034        0.872 ± 0.046
  3D-30   0.844 ± 0.030     0.904 ± 0.033        0.863 ± 0.044
  3D-50   0.838 ± 0.034     0.926 ± 0.024        0.893 ± 0.032

^2 We note here that the performance of RR is reported without MCTS simulations, as these results are presented numerically in (Laterre et al., 2019), while the results with MCTS simulations are only presented in box plots. In the box plots, the RR method with MCTS simulation is compared against plain MCTS (Browne et al., 2012), the Lego heuristic (Hu et al., 2017), and the GUROBI solver (Gurobi Optimization, 2018), where RR shows superior performance against all of these methods. By roughly inferring the numerical results from the box plots we can conclude that our method still outperforms RR with MCTS simulations except for the scenario in 3D with 10 boxes.

The training configurations under this study are:

• attend2pack: our full method;
• ×po: our full method but without PO;
• ×po-×seq: no PO training and no sequence policy learning. While this is very similar to online-BPP, since self-attention is used to calculate the box embeddings B, each individual box embedding b^n still
carries information about other boxes;
• ×po-×seq-×att: on top of the former setup, we replace the attention module in θ^e with a residual network with approximately the same number of parameters. But still this setup cannot be regarded as a strict online setting, since the encoding function for placement f_I^p (Eq.12) contains the leftover embeddings, which still give information about the upcoming boxes;
• ×po-×seq-×att-×leftover: a strict online-BPP setup where, on top of the former setup, the leftover embeddings are removed from f_I^p.

The results of this ablation are plotted in Fig.2a. We can see that both the leftover embeddings and the attention module bring performance gains. It can also be concluded from the big performance gap between ×po and ×po-×seq that the sequence policy is able to produce reasonable sequence orders. We observe that PO is again able to improve performance.

After processing the packing results of the ×po-×seq-×att-×leftover agent with the aforementioned procedure, we compare against the performance of (Zhao et al., 2021), which packs on average 12.2 items with a utility of 54.7% (with a lookahead number of 5, the utility achieved by their method still does not surpass 60%); while the strictly online-BPP version of our method is on average able to pack 15.6 items with a utility of 67.0%, achieving state-of-the-art performance in the online-BPP domain. Visualizations of the packing results are shown in Fig.2b-2d.

3.4. 2D/3D BPP on Boxes Generated by Cut

As introduced in Sec.3.2, (Laterre et al., 2019) (RR) presented the state-of-the-art performance on datasets generated by cut in both the 2D and the 3D domain. More specifically, the boxes are generated by cutting a 10 × 10 × 10 bin (or a 10 × 10 square for 2D) into 10, 20, 30 or 50 boxes (pseudo-code in Appendix I). They use a different reward than the utility r_u, defined as

    r_{RR} = 2 \cdot \Big(\sum_{n=1}^{N} l^n w^n\Big)^{1/2} \Big/ (L_\pi + W_\pi) \quad \text{for 2D},
    r_{RR} = 3 \cdot \Big(\sum_{n=1}^{N} l^n w^n h^n\Big)^{1/3} \Big/ (L_\pi W_\pi + W_\pi H_\pi + H_\pi L_\pi) \quad \text{for 3D}.

For the following experiments, we still train our method with the cost of 1 − r_u, but we compare the results in r_RR. We present the visualizations of the packing results in Fig.3, and the final testing results in Table 1. We can observe that our method achieves state-of-the-art performance in this data domain.
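For reference, the r_RR metric restated above can be computed as follows (a small sketch; the function and argument names are ours, not from (Laterre et al., 2019)).

```python
def ranked_reward_metric(boxes, bounding, dim=3):
    """r_RR of Sec. 3.4.
    boxes: item dimensions; bounding: (L, W) or (L, W, H) of the packed bounding box."""
    if dim == 2:
        area = sum(l * w for l, w in boxes)
        L, W = bounding
        return 2.0 * area ** 0.5 / (L + W)
    volume = sum(l * w * h for l, w, h in boxes)
    L, W, H = bounding
    return 3.0 * volume ** (1.0 / 3.0) / (L * W + W * H + H * L)
```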
[Figure 4. Plots (Fig.4a: average utility obtained in evaluation during training, plotting each individual run for each configuration) and packing visualizations (Fig.4b: 30 items, Fig.4c: 50 items, Fig.4d: 100 items) for Sec.3.5, on datasets generated by randomly sampling the edges of 20, 30, 50, 100 boxes from [20 .. 80], with a bin of W × H = 100 × 100.]

Table 2. Comparisons in Sec.3.5 on r_u against GA (Wu et al., 2010), EP (Crainic et al., 2008), MTSL (Duan et al., 2019), CQL (Li et al., 2020) and MM (Jiang et al., 2021).

  N      GA      EP      MTSL    CQL     MM      ATTEND2PACK
  20     0.683   0.627   0.624   0.670   0.718   0.767 ± 0.036
  30     0.662   0.638   0.601   0.693   0.755   0.797 ± 0.032
  50     0.659   0.663   0.553   0.736   0.813   0.831 ± 0.023
  100    0.624   0.675   −       −       0.844   0.870 ± 0.016

3.5. 3D BPP on Randomly Generated Boxes

We continue to conduct comparisons against the state-of-the-art algorithm (Jiang et al., 2021) on randomly generated boxes, in which W × H = 100 × 100 for the bin, while the edges of the boxes are sampled from integers in the range [20 .. 80]. (Jiang et al., 2021) present experimental results with box numbers of 20, 30, 50 and 100, on which the learning curves of our full method attend2pack are shown in Fig.4a. We can see that the performance improves stably during training, except for one of the trainings for 100 boxes, which encounters catastrophic forgetting. We suspect that with more boxes each episode involves more rollout steps, thus an undesirable interaction between π^s and π^p is more likely to propagate to unexpected gradient updates. Since we observe larger variations for the training with 100 boxes, here we add one more run to give a more accurate presentation of the performance of our algorithm on this task. We present the visualizations of the packing results in Fig.4b-4d, and the final testing results on 1e4 testing samples in Table 2, where we can see that our method achieves state-of-the-art performance in this data domain.

3.6. Conclusions and Future Work

To conclude, this paper proposes a new model, attend2pack, that achieves state-of-the-art performance on BPP. Possible future directions include extending this framework to multi-bin settings, evaluating it under more packing constraints, and more sophisticated solutions for credit assignment.
References

Abdel-Basset, M., Manogaran, G., Abdel-Fatah, L., and Mirjalili, S. An improved nature inspired meta-heuristic algorithm for 1-d bin packing problems. Personal and Ubiquitous Computing, 22(5):1117–1132, 2018.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. NIPS 2016 Deep Learning Symposium, arXiv preprint arXiv:1607.06450, 2016.

Bello, I., Pham, H., Le, Q. V., Norouzi, M., and Bengio, S. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

Busoniu, L., Babuska, R., and De Schutter, B. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.

Cambazard, H. and O'Sullivan, B. Propagating the bin packing constraint using linear programming. In International Conference on Principles and Practice of Constraint Programming, pp. 129–136. Springer, 2010.

Crainic, T. G., Perboli, G., and Tadei, R. Extreme point-based heuristics for three-dimensional bin packing. INFORMS Journal on Computing, 20(3):368–384, 2008.

Duan, L., Hu, H., Qian, Y., Gong, Y., Zhang, X., Wei, J., and Xu, Y. A multi-task selected learning approach for solving 3d flexible bin packing problem. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1386–1394, 2019.

Dyckhoff, H. A new linear programming approach to the cutting stock problem. Operations Research, 29(6):1092–1104, 1981.

Fanslau, T. and Bortfeldt, A. A tree search algorithm for solving the container loading problem. INFORMS Journal on Computing, 22(2):222–235, 2010.

Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Gehring, H. and Bortfeldt, A. A parallel genetic algorithm for solving the container loading problem. International Transactions in Operational Research, 9(4):497–511, 2002.

Gurobi Optimization, I. Gurobi optimizer reference manual, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.

Hu, H., Zhang, X., Yan, X., Wang, L., and Xu, Y. Solving a new 3d bin packing problem with deep reinforcement learning method. Workshop on AI Application in E-commerce, Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, arXiv preprint arXiv:1708.05930, 2017.

Huang, W. and He, K. A caving degree approach for the single container loading problem. European Journal of Operational Research, 196(1):93–101, 2009.

Jiang, Y., Cao, Z., and Zhang, J. Solving 3d bin packing problem via multimodal deep reinforcement learning. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1548–1550, 2021.

Kingma, D. P. and Ba, J. L. Adam: A method for stochastic optimization. In ICLR: International Conference on Learning Representations, pp. 1–15, 2015.

Kool, W., van Hoof, H., and Welling, M. Attention, learn to solve routing problems! In International Conference on Learning Representations, 2018.

Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Pérolat, J., Silver, D., and Graepel, T. A unified game-theoretic approach to multiagent reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4193–4206, 2017.

Laterre, A., Fu, Y., Jabri, M. K., Cohen, A.-S., Kas, D., Hajjar, K., Dahl, T. S., Kerkeni, A., and Beguir, K. Ranked reward: Enabling self-play reinforcement learning for combinatorial optimization. Workshop on Reinforcement Learning in Games, Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI-19, arXiv preprint arXiv:1807.01672, 2019.

Lewis, C. Linear programming: Theory and applications. Whitman College Mathematics Department, 2008.

Lewis, H. R. Computers and intractability: A guide to the theory of NP-completeness, 1983.

Li, D., Ren, C., Gu, Z., Wang, Y., and Lau, F. Solving packing problems by conditional query learning, 2020. URL https://openreview.net/forum?id=BkgTwRNtPB.

Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163. Elsevier, 1994.

Loh, K.-H., Golden, B., and Wasil, E. Solving the one-dimensional bin packing problem with a weight annealing heuristic. Computers & Operations Research, 35(7):2283–2291, 2008.

Lysgaard, J., Letchford, A. N., and Eglese, R. W. A new branch-and-cut algorithm for the capacitated vehicle routing problem. Mathematical Programming, 100(2):423–445, 2004.

Martello, S. Knapsack problems: algorithms and computer implementations. Wiley-Interscience Series in Discrete Mathematics and Optimization, 1990.

Martello, S., Pisinger, D., and Vigo, D. The three-dimensional bin packing problem. Operations Research, 48(2):256–267, 2000.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Panait, L. and Luke, S. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, 2005.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

Salem, K. H. and Kieffer, Y. New symmetry-less ILP formulation for the classical one dimensional bin-packing problem. In International Computing and Combinatorics Conference, pp. 423–434. Springer, 2020.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In Proceedings of the 4th International Conference on Learning Representations, 2016.

Sutton, R. S., McAllester, D. A., Singh, S. P., Mansour, Y., et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pp. 1057–1063. Citeseer, 1999.

Sweeney, P. E. and Paternoster, E. R. Cutting and packing problems: a categorized, application-orientated research bibliography. Journal of the Operational Research Society, 43(7):691–706, 1992.

Tanaka, T., Kaneko, T., Sekine, M., Tangkaratt, V., and Sugiyama, M. Simultaneous planning for item picking and placing by deep reinforcement learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 9705–9711. IEEE, 2020.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010, 2017.

Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, pp. 2692–2700, 2015.

Wang, X., Thomas, J. D., Piechocki, R. J., Kapoor, S., Santos-Rodriguez, R., and Parekh, A. Self-play learning strategies for resource assignment in open-ran networks. arXiv preprint arXiv:2103.02649, 2021.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Wu, Y., Li, W., Goh, M., and de Souza, R. Three-dimensional bin packing problem with variable bin height. European Journal of Operational Research, 202(2):347–355, 2010.

Yang, N. and Liu, J. J. Sequential markov games with ordered agents: A bellman-like approach. IEEE Control Systems Letters, 4(4):898–903, 2020.

Young-Dae, H., Young-Joo, K., and Ki-Baek, L. Smart pack: Online autonomous object-packing system using rgb-d sensor data. Sensors, 20(16):4448, 2020.

Zhao, H., She, Q., Zhu, C., Yang, Y., and Xu, K. Online 3d bin packing with constrained deep reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(1):741–749, May 2021.

Zhu, W., Oon, W.-C., Lim, A., and Weng, Y. The six elements to block-building approaches for the single container loading problem. Applied Intelligence, 37(3):431–445, 2012.

Zhu, W., Zhuang, Y., and Zhang, L. A three-dimensional virtual resource scheduling method for energy saving in cloud computing. Future Generation Computer Systems, 69:66–74, 2017.
[Appendix figure: architecture diagram of attend2pack, showing the box embeddings b^1, b^2, b^3, ... produced by ×3 stacked blocks of multi-head attention (MHA), MLP and LayerNorm, the frontier embeddings f_1, f_2, f_3, and the sequence policy π^s and placement policy π^p emitting s_1, p_1, s_2, p_2, s_3, p_3, ... The surrounding appendix text on the frontier convolutional network survives only in fragments: a convolutional layer with 8 output channels, kernel size 3, stride 1 and padding 1, followed by LayerNorm and ReLU activation, then another convolutional layer with 8 input channels (remainder truncated).]

⇡s
<latexit sha1_base64="UWuIvBH+JkgAWDIITSyOz+uIiOE=">AAACL3icbVDNS8MwHE3nx+b86vToJTgET6Md4sdt4MXjBPcBax1pmm5hSVuSdDBK/xOvevevES/i1f/CdOvBdT4IvLz3+5GX58WMSmVZn0Zla3tnt1rbq+8fHB4dm42TvowSgUkPRywSQw9JwmhIeooqRoaxIIh7jAy82X3uD+ZESBqFT2oRE5ejSUgDipHS0tg0nZg+pw5Haip4KrNsbDatlrUE3CR2QZqgQHfcMKqOH+GEk1BhhqQc2Vas3BQJRTEjWd1JJIkRnqEJGWkaIk6kmy6jZ/BCKz4MIqFPqOBS/buRIi7lgnt6Ms8oy14u/ut5PFu/lwVf5i+W4qng1k1pGCeKhHiVLkgYVBHMy4M+FQQrttAEYUH1ByGeIoGw0hXXdXV2uahN0m+37OtW+/Gq2bkrSqyBM3AOLoENbkAHPIAu6AEM5uAFvII34934ML6M79VoxSh2TsEajJ9fR7GpXA==</latexit>

⇡s
<latexit sha1_base64="UWuIvBH+JkgAWDIITSyOz+uIiOE=">AAACL3icbVDNS8MwHE3nx+b86vToJTgET6Md4sdt4MXjBPcBax1pmm5hSVuSdDBK/xOvevevES/i1f/CdOvBdT4IvLz3+5GX58WMSmVZn0Zla3tnt1rbq+8fHB4dm42TvowSgUkPRywSQw9JwmhIeooqRoaxIIh7jAy82X3uD+ZESBqFT2oRE5ejSUgDipHS0tg0nZg+pw5Haip4KrNsbDatlrUE3CR2QZqgQHfcMKqOH+GEk1BhhqQc2Vas3BQJRTEjWd1JJIkRnqEJGWkaIk6kmy6jZ/BCKz4MIqFPqOBS/buRIi7lgnt6Ms8oy14u/ut5PFu/lwVf5i+W4qng1k1pGCeKhHiVLkgYVBHMy4M+FQQrttAEYUH1ByGeIoGw0hXXdXV2uahN0m+37OtW+/Gq2bkrSqyBM3AOLoENbkAHPIAu6AEM5uAFvII34934ML6M79VoxSh2TsEajJ9fR7GpXA==</latexit>

number of output channels of 8, and with kernel size, stride


LayerNorm
<latexit sha1_base64="EoNWsbXD95+uV5vA2gJ2hrCiPRY=">AAACMHicbVA7T8MwGLR5tZRXCyNLRIXEVCUVAsZKLIxFog+pDZXtOK1VO4lsB1RF+SmssPNrYEKs/AqcNANNOcnS+e775PPhiDOlbfsTbmxube9Uqru1vf2Dw6N647ivwlgS2iMhD+UQI0U5C2hPM83pMJIUCczpAM9vM3/wRKViYfCgFxF1BZoGzGcEaSNN6o0xFslYID1TfoLT9NGZ1Jt2y85hrROnIE1QoDtpwMrYC0ksaKAJR0qNHDvSboKkZoTTtDaOFY0QmaMpHRkaIEGVm+TZU+vcKJ7lh9KcQFu5+ncjQUKphcBmMk9Z9jLxXw+LdPVeFjyVvViKp/0bN2FBFGsakGU6P+aWDq2sPctjkhLNF4YgIpn5oEVmSCKiTcc1U51TLmqd9Nst56rVvr9sduyixCo4BWfgAjjgGnTAHeiCHiDgGbyAV/AG3+EH/ILfy9ENWOycgBXAn1+Mralt</latexit>

b b b 1
<latexit sha1_base64="QhL1yD0scJkJigeFsfemgQIToSY=">AAACMHicbVA7T8MwGLR5tZRXCyNLRIXEVCUVAsZKLIxFog+pDZXtOK1VO4lsB1RF+SmssPNrYEKs/AqcNANNOcnS+e775PPhiDOlbfsTbmxube9Uqru1vf2Dw6N647ivwlgS2iMhD+UQI0U5C2hPM83pMJIUCczpAM9vM3/wRKViYfCgFxF1BZoGzGcEaSNN6o0xFslYID1TfoLT9LE9qTftlp3DWidOQZqgQHfSgJWxF5JY0EATjpQaOXak3QRJzQinaW0cKxohMkdTOjI0QIIqN8mzp9a5UTzLD6U5gbZy9e9GgoRSC4HNZJ6y7GXivx4W6eq9LHgqe7EUT/s3bsKCKNY0IMt0fswtHVpZe5bHJCWaLwxBRDLzQYvMkEREm45rpjqnXNQ66bdbzlWrfX/Z7NhFiVVwCs7ABXDANeiAO9AFPUDAM3gBr+ANvsMP+AW/l6MbsNg5ASuAP7+OZ6lu</latexit>

2
<latexit sha1_base64="WKP82GdyMobpor0NoLvdzYwCNJg=">AAACMHicbVA7T8MwGHTKo6W8WhhZIiokpiopCBgrsTAWiT6kJlS247RW7SSyHVAV5aewws6vgQmx8itw0gw05SRL57vvk8+HIkalsqxPo7KxubVdre3Ud/f2Dw4bzaOBDGOBSR+HLBQjBCVhNCB9RRUjo0gQyBEjQzS/zfzhExGShsGDWkTE5XAaUJ9iqLQ0aTQdxBOHQzWTfoLS9PFi0mhZbSuHuU7sgrRAgd6kaVQdL8QxJ4HCDEo5tq1IuQkUimJG0roTSxJBPIdTMtY0gJxIN8mzp+aZVjzTD4U+gTJz9e9GArmUC470ZJ6y7GXivx7i6eq9LHgye7EUT/k3bkKDKFYkwMt0fsxMFZpZe6ZHBcGKLTSBWFD9QRPPoIBY6Y7rujq7XNQ6GXTa9lW7c3/Z6lpFiTVwAk7BObDBNeiCO9ADfYDBM3gBr+DNeDc+jC/jezlaMYqdY7AC4+cXkCGpbw==</latexit>

3
<latexit sha1_base64="ZBCy0b+PEk83IdMpiK0HOdUPGw8=">AAACMnicbVDNS8MwHE3nx+b82tzRS3AInkY7RD0OvHic4D5gLSVN0y0saUuSCqX0b/Gqd/8ZvYlX/wjTbgfX+SDw8t7vR16eFzMqlWl+GLWd3b39euOgeXh0fHLaap+NZZQITEY4YpGYekgSRkMyUlQxMo0FQdxjZOIt7wt/8kyEpFH4pNKYOBzNQxpQjJSW3FbH9nhmc6QWMsiCPHczK3dbXbNnloDbxFqTLlhj6LaNuu1HOOEkVJghKWeWGSsnQ0JRzEjetBNJYoSXaE5mmoaIE+lkZfocXmrFh0Ek9AkVLNW/GxniUqbc05NlzqpXiP96Hs8371XBl8WLlXgquHMyGsaJIiFepQsSBlUEi/6gTwXBiqWaICyo/iDECyQQVrrlpq7Oqha1Tcb9nnXT6z9edwfmusQGOAcX4ApY4BYMwAMYghHAIAUv4BW8Ge/Gp/FlfK9Ga8Z6pwM2YPz8ApsHqn4=</latexit>

f1
<latexit sha1_base64="EoNWsbXD95+uV5vA2gJ2hrCiPRY=">AAACMHicbVA7T8MwGLR5tZRXCyNLRIXEVCUVAsZKLIxFog+pDZXtOK1VO4lsB1RF+SmssPNrYEKs/AqcNANNOcnS+e775PPhiDOlbfsTbmxube9Uqru1vf2Dw6N647ivwlgS2iMhD+UQI0U5C2hPM83pMJIUCczpAM9vM3/wRKViYfCgFxF1BZoGzGcEaSNN6o0xFslYID1TfoLT9NGZ1Jt2y85hrROnIE1QoDtpwMrYC0ksaKAJR0qNHDvSboKkZoTTtDaOFY0QmaMpHRkaIEGVm+TZU+vcKJ7lh9KcQFu5+ncjQUKphcBmMk9Z9jLxXw+LdPVeFjyVvViKp/0bN2FBFGsakGU6P+aWDq2sPctjkhLNF4YgIpn5oEVmSCKiTcc1U51TLmqd9Nst56rVvr9sduyixCo4BWfgAjjgGnTAHeiCHiDgGbyAV/AG3+EH/ILfy9ENWOycgBXAn1+Mralt</latexit>

b b 1
<latexit sha1_base64="WKP82GdyMobpor0NoLvdzYwCNJg=">AAACMHicbVA7T8MwGHTKo6W8WhhZIiokpiopCBgrsTAWiT6kJlS247RW7SSyHVAV5aewws6vgQmx8itw0gw05SRL57vvk8+HIkalsqxPo7KxubVdre3Ud/f2Dw4bzaOBDGOBSR+HLBQjBCVhNCB9RRUjo0gQyBEjQzS/zfzhExGShsGDWkTE5XAaUJ9iqLQ0aTQdxBOHQzWTfoLS9PFi0mhZbSuHuU7sgrRAgd6kaVQdL8QxJ4HCDEo5tq1IuQkUimJG0roTSxJBPIdTMtY0gJxIN8mzp+aZVjzTD4U+gTJz9e9GArmUC470ZJ6y7GXivx7i6eq9LHgye7EUT/k3bkKDKFYkwMt0fsxMFZpZe6ZHBcGKLTSBWFD9QRPPoIBY6Y7rujq7XNQ6GXTa9lW7c3/Z6lpFiTVwAk7BObDBNeiCO9ADfYDBM3gBr+DNeDc+jC/jezlaMYqdY7AC4+cXkCGpbw==</latexit>

3
<latexit sha1_base64="FA5GvESfbY4yPRdrpOgJVXcJEIA=">AAACMnicbVDNS8MwHE3nx+b82tzRS3AInkY7RD0OvHic4D5gLSVN0y0saUuSCqX0b/Gqd/8ZvYlX/wjTbgfX+SDw8t7vR16eFzMqlWl+GLWd3b39euOgeXh0fHLaap+NZZQITEY4YpGYekgSRkMyUlQxMo0FQdxjZOIt7wt/8kyEpFH4pNKYOBzNQxpQjJSW3FbH9nhmc6QWMsiCPHezfu62umbPLAG3ibUmXbDG0G0bdduPcMJJqDBDUs4sM1ZOhoSimJG8aSeSxAgv0ZzMNA0RJ9LJyvQ5vNSKD4NI6BMqWKp/NzLEpUy5pyfLnFWvEP/1PJ5v3quCL4sXK/FUcOdkNIwTRUK8ShckDKoIFv1BnwqCFUs1QVhQ/UGIF0ggrHTLTV2dVS1qm4z7Peum13+87g7MdYkNcA4uwBWwwC0YgAcwBCOAQQpewCt4M96NT+PL+F6N1oz1TgdswPj5BZzCqn8=</latexit>

f2 b1
<latexit sha1_base64="EoNWsbXD95+uV5vA2gJ2hrCiPRY=">AAACMHicbVA7T8MwGLR5tZRXCyNLRIXEVCUVAsZKLIxFog+pDZXtOK1VO4lsB1RF+SmssPNrYEKs/AqcNANNOcnS+e775PPhiDOlbfsTbmxube9Uqru1vf2Dw6N647ivwlgS2iMhD+UQI0U5C2hPM83pMJIUCczpAM9vM3/wRKViYfCgFxF1BZoGzGcEaSNN6o0xFslYID1TfoLT9NGZ1Jt2y85hrROnIE1QoDtpwMrYC0ksaKAJR0qNHDvSboKkZoTTtDaOFY0QmaMpHRkaIEGVm+TZU+vcKJ7lh9KcQFu5+ncjQUKphcBmMk9Z9jLxXw+LdPVeFjyVvViKp/0bN2FBFGsakGU6P+aWDq2sPctjkhLNF4YgIpn5oEVmSCKiTcc1U51TLmqd9Nst56rVvr9sduyixCo4BWfgAjjgGnTAHeiCHiDgGbyAV/AG3+EH/ILfy9ENWOycgBXAn1+Mralt</latexit>

<latexit sha1_base64="hlVA/gflNZXYbi/TfGZg6ec+/8k=">AAACMnicbVDNS8MwHE382pxfmzt6KQ7B02inqMeBF48T3AespaRZuoUlbUlSoZT+LV717j+jN/HqH2Ha9eA6HwRe3vv9yMvzIkalMs0PuLW9s7tXq+83Dg6Pjk+ardORDGOByRCHLBQTD0nCaECGiipGJpEgiHuMjL3lfe6Pn4mQNAyeVBIRh6N5QH2KkdKS22zbHk9tjtRC+qmfZW56lbnNjtk1CxibxCpJB5QYuC1Ys2chjjkJFGZIyqllRspJkVAUM5I17FiSCOElmpOppgHiRDppkT4zLrQyM/xQ6BMoo1D/bqSIS5lwT08WOateLv7reTxbv1eFmcxfrMRT/p2T0iCKFQnwKp0fM0OFRt6fMaOCYMUSTRAWVH/QwAskEFa65YauzqoWtUlGva510+09Xnf6ZlliHZyBc3AJLHAL+uABDMAQYJCAF/AK3uA7/IRf8Hs1ugXLnTZYA/z5BZ59qoA=</latexit>

f3
ru
<latexit sha1_base64="75VSJDpC8evS17CYe4jaSZA5tcE=">AAACK3icbVDNTgIxGGzxB8Q/0KOXRmLiiewSoxxJvHjERMAEVtLtFmhou5u2a0I2+x5e9e7TeNJ49T3swh5kcZIm05nvS6fjR5xp4zifsLS1vbNbruxV9w8Oj45r9ZO+DmNFaI+EPFSPPtaUM0l7hhlOHyNFsfA5Hfjz28wfPFOlWSgfzCKinsBTySaMYGOlJzVORgKbmRJJnKbjWsNpOkugTeLmpAFydMd1WB4FIYkFlYZwrPXQdSLjJVgZRjhNq6NY0wiTOZ7SoaUSC6q9ZBk7RRdWCdAkVPZIg5bq340EC60XwreTWUZd9DLxX88X6fq9KAQ6e7EQz0zaXsJkFBsqySrdJObIhCgrDgVMUWL4whJMFLMfRGSGFSbG1lu11bnFojZJv9V0r5ut+6tGp52XWAFn4BxcAhfcgA64A13QAwQo8AJewRt8hx/wC36vRksw3zkFa4A/vyf/qFY=</latexit>
and padding of 5, 1 and 1 respectively. This is again fol-
lowed by LayerNorm and ReLU activation. The output is
LinearEncoder FrontierEncoder FrontierEncoder FrontierEncoder
then flattened into a vector of size 512, which is then passed
1
into a fully-connected layer with 128 hidden units, which is
2
again followed by LayerNorm and ReLU activation.
3
A. Appendix: Schematic Plot of attend2pack

A schematic plot of our proposed attend2pack algorithm is shown in Fig.5.

B. Appendix: Schematic Plot of Glimpse

A schematic plot of the glimpse operation (Eq.7-11) during sequence decoding is shown in Fig.6.

Figure 6. Schematic plot of the glimpse operation during sequence decoding (Sec.2.3: Eq.7-11).

C. Appendix: Network Architecture

In this section we present the details of the network architecture used in our proposed attend2pack model.

The embedding network θe firstly encodes an N × 3 input into an N × 128 embedding for each data point using an initial linear encoder with 128 hidden units (in preliminary experiments, we found that applying random permutation to the input list of boxes gave at least marginal performance gains over sorting the boxes, so we adhere to random permutation in all our experiments; we also ensure that all boxes are rotated such that ln ≤ wn ≤ hn before they are fed into the network). This is then followed by a multi-head self-attention module that contains 3 self-attention layers with 8 heads, with each layer containing a self-attention sub-layer (Eq.4) and an MLP sub-layer (Eq.5). The self-attention sub-layer contains a LayerNorm followed by a standard MHA module (Vaswani et al., 2017) with an embedding dimension of 128 and 8 heads, whose output is added to the input of the self-attention sub-layer with a skip connection. The MLP sub-layer contains a fully-connected layer with 512 hidden nodes followed by ReLU activation, which is then followed by another fully-connected layer with 128 hidden nodes; the output of the last fully-connected layer is added to the input of the MLP sub-layer with a skip connection.
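For concreteness, a minimal PyTorch sketch of this embedding network, following the layer sizes above, could look as follows; the class and argument names are ours, and any hyper-parameter not stated in the text (dropout, initialization, etc.) is left at the library default.

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    """One self-attention layer: a pre-LayerNorm MHA sub-layer and an MLP sub-layer,
    each wrapped with a skip connection, as described above."""
    def __init__(self, dim=128, heads=8, mlp_hidden=512):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_hidden), nn.ReLU(),
                                 nn.Linear(mlp_hidden, dim))

    def forward(self, x):                     # x: (batch, N, 128)
        h = self.norm(x)
        x = x + self.mha(h, h, h)[0]          # self-attention sub-layer + skip connection
        x = x + self.mlp(x)                   # MLP sub-layer + skip connection
        return x

class BoxEmbedder(nn.Module):
    """Initial linear encoder (3 -> 128) followed by 3 self-attention layers."""
    def __init__(self, dim=128, layers=3):
        super().__init__()
        self.enc = nn.Linear(3, dim)          # each box is described by (l, w, h)
        self.blocks = nn.ModuleList([AttentionBlock(dim) for _ in range(layers)])

    def forward(self, boxes):                 # boxes: (batch, N, 3)
        x = self.enc(boxes)
        for blk in self.blocks:
            x = blk(x)
        return x                              # (batch, N, 128) box embeddings
```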
As for the frontier encoding network, for a bin with W × H = 10 × 10 in the 3D situation (Sec.3.2, 3.3, 3.4), the stacked two latest frontiers (Sec.2.2.2) of size 2 × 10 × 10 are firstly fed into a 2D convolutional layer with 2 input channels and 8 output channels, followed by LayerNorm and ReLU activation. Then there is another convolutional layer with number of input channels of 8 and number of output channels of 8, and with kernel size, stride and padding of 5, 1 and 1 respectively. This is again followed by LayerNorm and ReLU activation. The output is then flattened into a vector of size 512, which is then passed into a fully-connected layer with 128 hidden units, which is again followed by LayerNorm and ReLU activation.

For the 2D situation with a 2D bin with W = 10 (Sec.3.4), the stacked last two frontiers are of size 2 × 10, and are firstly fed into a 1D convolutional layer with number of input channels 2 and number of output channels 8, with kernel size, stride and padding of 3, 1 and 1 respectively. This is followed by LayerNorm and ReLU activation. The output is then flattened into a vector of size 80, then passed into a fully-connected layer with 128 hidden units, which is again followed by LayerNorm and ReLU activation.

For a bin with W × H = 100 × 100 in the 3D situation (Sec.3.5), the stacked last two frontiers of size 2 × 100 × 100 are firstly fed into a 2D convolutional layer with number of input channels 2 and number of output channels 4, with kernel size, stride and padding of 3, 2 and 1 respectively. Then there are another two convolutional layers, both with number of input channels of 4 and number of output channels of 4, and with kernel size, stride and padding of 3, 2 and 1 respectively. This is again followed by LayerNorm and ReLU activation. The output is then flattened into a vector of size 676, which is then passed into a fully-connected layer with 128 hidden units, which is again followed by LayerNorm and ReLU activation. We note that in the experiments in Sec.3.5 with 100 boxes and a bin of W × H = 100 × 100, instead of using the stacked last 2 frontiers as the input to the frontier encoding network as in all our other experiments, we only use the most recent frontier due to limited memory.
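As a reference point, a minimal PyTorch sketch of the frontier encoder for the W × H = 100 × 100 setting (the variant whose layer sizes are fully specified above) might look as follows; the text does not spell out over which dimensions LayerNorm is applied to the intermediate feature maps, nor whether LayerNorm and ReLU follow every convolution or only the last one, so normalizing over the full channel and spatial shape after each layer is our assumption.

```python
import torch.nn as nn

class FrontierEncoder3D(nn.Module):
    """Frontier encoder for a 100 x 100 bin: three strided conv layers, flatten,
    then a 128-unit fully-connected layer, each followed by LayerNorm and ReLU."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 4, kernel_size=3, stride=2, padding=1),  # 2x100x100 -> 4x50x50
            nn.LayerNorm([4, 50, 50]), nn.ReLU(),                 # LayerNorm shape assumed
            nn.Conv2d(4, 4, kernel_size=3, stride=2, padding=1),  # -> 4x25x25
            nn.LayerNorm([4, 25, 25]), nn.ReLU(),
            nn.Conv2d(4, 4, kernel_size=3, stride=2, padding=1),  # -> 4x13x13
            nn.LayerNorm([4, 13, 13]), nn.ReLU(),
            nn.Flatten(),                                         # -> vector of size 676
            nn.Linear(676, 128),
            nn.LayerNorm(128), nn.ReLU(),
        )

    def forward(self, frontiers):             # frontiers: (batch, 2, 100, 100)
        return self.net(frontiers)            # (batch, 128) frontier embedding
```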
The sequence decoding parameters θs are presented in detail in Sec.2.3. We note an implementation detail that is omitted in Sec.2.3 and 2.4: for both of the encoding functions fIs (Eq.6) and fIp (Eq.12), each of their input components goes through a linear projection before passing through the final ⟨·⟩ operation

fIs(s1:t−1, p1:t−1; θe) = ⟨ W^s-leftover ⟨B \ b_{s1:t−1}⟩, W^frontier f_t ⟩,   (16)

fIp(s1:t, p1:t−1; θe) = ⟨ W^selected b_{st}, W^p-leftover ⟨B \ b_{s1:t}⟩, W^frontier f_t ⟩,   (17)

where all the weight matrices W are of size 128 × 128.
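To make the note above concrete, the extra parameters introduced by Eq.16-17 amount to a few 128 × 128 projection matrices applied to the leftover-box summary, the selected-box embedding and the frontier embedding before they enter the ⟨·⟩ operation; a minimal PyTorch sketch is given below, where the ⟨·⟩ combination itself (defined in Sec.2.3) is not reproduced, the variable names are ours, and sharing W^frontier between the two equations as well as omitting bias terms are our assumptions.

```python
import torch.nn as nn

dim = 128
# the projection matrices of Eq.16-17, each of size 128 x 128 (bias-free here)
W_s_leftover = nn.Linear(dim, dim, bias=False)
W_p_leftover = nn.Linear(dim, dim, bias=False)
W_selected = nn.Linear(dim, dim, bias=False)
W_frontier = nn.Linear(dim, dim, bias=False)   # assumed shared between Eq.16 and Eq.17

def project_for_sequence_step(leftover_summary, frontier_emb):
    # projected inputs to the <.> operation of Eq.16
    return W_s_leftover(leftover_summary), W_frontier(frontier_emb)

def project_for_placement_step(selected_box_emb, leftover_summary, frontier_emb):
    # projected inputs to the <.> operation of Eq.17
    return (W_selected(selected_box_emb),
            W_p_leftover(leftover_summary),
            W_frontier(frontier_emb))
```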
As for the placement decoding network θp, it firstly contains a fully-connected layer with 128 hidden units followed by LayerNorm and ReLU activation. This is followed by another fully-connected layer with 128 hidden units, also with LayerNorm and ReLU activation. Then the final output fully-connected layer maps the 128-dimensional embeddings to placement logits. The logit vector is of size 6 × 10 for 3D packing with a bin of W × H = 10 × 10 (6 possible orientations) (Sec.3.2, 3.3, 3.4), 2 × 10 for 2D packing with W = 10 (2 possible orientations) (Sec.3.4), and 6 × 100 for 3D packing with a bin of W × H = 100 × 100 (6 possible orientations) (Sec.3.5).
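A minimal PyTorch sketch of this placement head for the 10 × 10 3D setting (6 orientations × 10 positions = 60 logits) could look as follows; the input is assumed to be the 128-dimensional placement-decoding embedding described above, and the class and argument names are ours.

```python
import torch.nn as nn

class PlacementHead(nn.Module):
    """Two 128-unit fully-connected layers, each followed by LayerNorm and ReLU,
    then a final linear layer producing one logit per (orientation, position) pair."""
    def __init__(self, dim=128, n_orientations=6, n_positions=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 128), nn.LayerNorm(128), nn.ReLU(),
            nn.Linear(128, 128), nn.LayerNorm(128), nn.ReLU(),
            nn.Linear(128, n_orientations * n_positions),
        )

    def forward(self, h):         # h: (batch, 128) placement-decoding embedding
        return self.net(h)        # (batch, 60) placement logits for the 10 x 10 3D bin

# A softmax over these logits gives the placement distribution; for the 2D W = 10 setting
# one would use n_orientations=2, and for the 100 x 100 bin n_positions=100.
```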
Figure 7. Results for App.D, where we plot the average utility obtained in evaluation during training; each plot shows mean ± 0.1 · standard deviation. (a) Additional experiments on the dataset used in Sec.3.2. (b) Additional experiments on the dataset used in Sec.3.3 (we note that here ×po-×seq-random corresponds to ×po-×seq in Fig.2).

D. Appendix: Additional Ablation Study on the Sequence Policy

We conduct an additional set of comparisons on top of the ablation studies in Sec.3.2 and 3.3. Specifically, we additionally compare against training setups where the sequential order of the boxes is not learned, but is instead fixed by a random permutation or sorted by a simple heuristic: sorted from large to small by volume ln · wn · hn, with ties broken by wn/hn and then by ln/wn (we note that ln ≤ wn ≤ hn (App.C)). The setups without learning the sequence order are denoted as ×po-×seq-random (fixed random order) and ×po-×seq-sorted (fixed sorted order) respectively. The experimental results are shown in Fig.7, from which we can observe that for the random dataset (Fig.7b), the simple heuristic of following a sorted sequential order can lead to satisfactory performance together with a learned placement policy, while this sorted heuristic does not yield strong performance on the cut dataset (Fig.7a). Therefore, learning the sequence policy should be considered when dealing with an arbitrary dataset configuration.
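For clarity, a small Python sketch of the fixed sorted order used by ×po-×seq-sorted is given below; representing each box as an (l, w, h) tuple is our choice, and since the text does not state the direction of the tie-breakers, we take them in the same descending direction as the volume.

```python
def sorted_box_order(boxes):
    """Return box indices sorted from large to small by volume l*w*h,
    breaking ties by w/h and then by l/w; boxes are (l, w, h) with l <= w <= h."""
    def key(n):
        l, w, h = boxes[n]
        return (l * w * h, w / h, l / w)
    return sorted(range(len(boxes)), key=key, reverse=True)

# Example: volumes 6, 8 and 5 give the order [1, 0, 2].
# sorted_box_order([(1, 2, 3), (2, 2, 2), (1, 1, 5)]) == [1, 0, 2]
```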

Figure 8. Visualizations for App.E, on dataset generated by cutting a rectangle into 10 boxes, with W = 15 and L uniformly sampled from [2 . . 15], where the box edges range within [1 . . 15]. Panels: (a) L = 3; (b) L = 9; (c) L = 12.

E. Appendix: Additional Experiments on 2D BPP on Boxes Generated by Cut

We conduct an additional set of experiments comparing against (Wang et al., 2021), which adopts the RR algorithm on a 2D BPP dataset generated by cutting 2D rectangles into 10 boxes, with W = 15 and L uniformly sampled from [2 . . 15], where the box edges are guaranteed to be within [1 . . 15]. The reward measure used in (Wang et al., 2021) is rL = (Σ_{n=1}^N ln · wn / W) / Lπ, where Lπ denotes the length of the minimum bounding box that encompasses all packed boxes at the end of an episode; as in Sec.3.4, we train our agent attend2pack with the utility reward ru but compare to the results of (Wang et al., 2021) using their measure rL. The packing visualizations and the final testing results are shown in Fig.8 and Table 3.

Table 3. Comparisons in App.E on rL against HVRAA (Zhu et al., 2017), Lego heuristic (Hu et al., 2017), MCTS (Browne et al., 2012) and RR (Laterre et al., 2019). The ru obtained by our method is presented in the last row.

HVRAA             0.896 ± 0.239
LEGO HEURISTIC    0.737 ± 0.349
MCTS              0.936 ± 0.097
RR                0.964 ± 0.160
ATTEND2PACK       0.971 ± 0.051
ATTEND2PACK (ru)  0.971 ± 0.051
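As a small worked sketch (Python) of the measure as reconstructed above, rL can be computed from a finished 2D packing as follows; the (x, l, w) placement representation, with x the coordinate of a box along the length axis, is our own choice for illustration.

```python
def r_L(placements, W):
    """placements: list of (x, l, w) tuples, where x is the box position along the
    length axis and (l, w) its footprint; W is the fixed bin width.
    Returns sum(l * w) / (W * L_pi), with L_pi the length of the minimum bounding box."""
    L_pi = max(x + l for (x, l, w) in placements)
    packed_area = sum(l * w for (_, l, w) in placements)
    return packed_area / (W * L_pi)

# A packing that exactly re-assembles the original W x L rectangle gives r_L == 1.0, e.g.
# r_L([(0, 2, 15), (2, 1, 15)], W=15) == 1.0
```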
F. Appendix: More Visualizations for Sec.3.4, 3.5

Additional results for Sec.3.4 and 3.5 are shown in Fig.9 and Fig.10.

Figure 9. Visualizations for Sec.3.4, on dataset generated by cutting a square of 10 × 10 (2D) or a bin of 10 × 10 × 10 (3D) into 10, 20, 30, 50 items, with the box edges ranging within [1 . . 10]. Panels: (a) average utility obtained in evaluation during training, where each plot shows mean ± 0.1 · standard deviation; (b) 10 items; (c) 20 items; (d) 30 items; (e) 50 items; (f) average utility obtained in evaluation during training, where each plot shows mean ± 0.1 · standard deviation; (g) 10 items; (h) 20 items; (i) 30 items; (j) 50 items.

Figure 10. Visualizations for Sec.3.5, on dataset generated by randomly sampling the edges of 20, 30, 50, 100 boxes from [20 . . 80], with a bin of W × H = 100 × 100. Panels: (a) 20 items; (b) 30 items; (c) 50 items; (d) 100 items.
G. Appendix: Dataset Specifications

We list the specifications of the datasets used in all our presented experiments in Table 4.

Table 4. Specifications of datasets in our presented experiments.

DATASET            TYPE        L × W (×H)        N                li, wi, (hi)
SEC.3.2, APP.D     CUT-3D      10 × 10 × 10      10               [1 . . 10]
SEC.3.3, APP.D     RANDOM-3D   · × 10 × 10       24               [2 . . 5]
SEC.3.4            CUT-2D      10 × 10           10, 20, 30, 50   [1 . . 10]
SEC.3.4            CUT-3D      10 × 10 × 10      10, 20, 30, 50   [1 . . 10]
SEC.3.5            RANDOM-3D   · × 100 × 100     20, 30, 50, 100  [20 . . 80]
APP.E              CUT-2D      [2 . . 15] × 15   10               [1 . . 15]

H. Appendix: Inference Time

We list the inference time of our presented experiments in Table 5; the CPU configuration of our computing resource is an Intel(R) Core(TM) i9-7940X CPU 3.10GHz with 128GB memory, and the GPU configuration is an Nvidia TITAN X (Pascal) GPU with 12G memory.

Table 5. Inference time for a batch of size 1024 on GPU (Intel(R) Core(TM) i9-7940X CPU 3.10GHz 128GB with a Nvidia TITAN X (Pascal) GPU with 12G memory) of our presented experiments (we note that some operations for 3D are written in CUDA, while the rest of the code is in python/pytorch).

                          L × W (×H)        N     TIME (s)
2D  SEC.3.4               10 × 10           10    0.108
2D  SEC.3.4               10 × 10           20    0.250
2D  SEC.3.4               10 × 10           30    0.448
2D  SEC.3.4               10 × 10           50    1.015
2D  APP.E                 [2 . . 15] × 15   10    0.145
3D  SEC.3.3, APP.D        · × 10 × 10       24    0.224
3D  SEC.3.2, 3.4, APP.D   10 × 10 × 10      10    0.097
3D  SEC.3.4               10 × 10 × 10      20    0.185
3D  SEC.3.4               10 × 10 × 10      30    0.269
3D  SEC.3.4               10 × 10 × 10      50    0.504
3D  SEC.3.5               · × 100 × 100     20    0.535
3D  SEC.3.5               · × 100 × 100     30    0.805
3D  SEC.3.5               · × 100 × 100     50    1.379
3D  SEC.3.5               · × 100 × 100     100   2.821

I. Appendix: Cut Dataset Generation

The algorithm from (Laterre et al., 2019) that cuts a fixed-sized bin into a designated number of cubic items is presented in Alg.1. This algorithm is used to generate the datasets used in the experiments in Sec.3.2, 3.4 and App.E.

Algorithm 1 Cutting Bin from (Laterre et al., 2019)

Input: Bin dimensions: length L, width W, height H; minimum length for box edges M; number of boxes N to cut the bin into.
Output: A set of N boxes S.
Initialize box list S = {(L, W, H)}.
while |S| < N do
  Sample a box i (li, wi, hi) from S with probability proportional to the volumes of the boxes; pop box i from S.
  Sample a dimension (e.g. wi) from (li, wi, hi) with probability proportional to the lengths of the three dimensions.
  Sample a cutting point j from the selected dimension wi with probability proportional to the distances from all points on wi to the center of wi. Check that j ≥ M and wi − j ≥ M; otherwise push the selected box (li, wi, hi) back into S and continue to the next iteration.
  Cut box (li, wi, hi) along wi at the chosen point j to get two boxes (li, j, hi) and (li, wi − j, hi), and push those two boxes into S.
end while
return S
