
OPTIMAL TRANSPORT: DISCRETIZATION AND ALGORITHMS

QUENTIN MÉRIGOT AND BORIS THIBERT

Abstract. This chapter describes techniques for the numerical resolution of optimal transport problems. We consider several discretizations of these problems, with a strong focus on the mathematical analysis of the algorithms that solve the discretized problems. We describe in detail the following discretizations and corresponding algorithms: the assignment problem and Bertsekas' auction algorithm; entropic regularization and the Sinkhorn-Knopp algorithm; semi-discrete optimal transport and the Oliker-Prussner and damped Newton algorithms; and finally semi-discrete entropic regularization. Our presentation highlights the similarity between these algorithms and their connection with the theory of Kantorovich duality.

Contents
1. Introduction
2. Optimal transport theory
2.1. The problems of Monge and Kantorovich
2.2. Kantorovich duality
2.3. Kantorovich's functional
3. Discrete optimal transport
3.1. Formulation of discrete optimal transport
3.2. Linear assignment via coordinate ascent
3.3. Discrete optimal transport via entropic regularization
4. Semi-discrete optimal transport
4.1. Formulation of semi-discrete optimal transport
4.2. Semi-discrete optimal transport via coordinate decrements
4.3. Semi-discrete optimal transport via Newton's method
4.4. Semi-discrete entropic transport
5. Appendix
5.1. Convex analysis
5.2. Coarea formula
References


1. Introduction
The problem of optimal transport, introduced by Gaspard Monge in 1781
[76], was motivated by military applications. The goal was to find the most
economical way to transport a certain amount of sand from a quarry to
a construction site. The source and target distributions of sand are seen
as probability measures, denoted µ and ν, and c(x, y) denotes the cost of
transporting a grain of sand from the position x to the position y. The goal
is then to solve the non-convex optimization problem

(MP) = min_{T_#µ=ν} ∫ c(x, T(x)) dµ, (1.1)

where T_#µ = ν means that ν is the push-forward of µ under the transport
map T. The modern theory of optimal transport was initiated by Leonid
Kantorovich in the 1940s, via a convex relaxation of Monge's problem. Given
two probability measures µ and ν, it consists in minimizing

(KP) = min_{γ∈Γ(µ,ν)} ∫ c(x, y) dγ(x, y), (1.2)

over the set Γ(µ, ν) of transport plans¹ between µ and ν. Kantorovich's
theory has been used and revisited by many authors since the 1980s, allowing
in particular a complete solution to Monge's problem for c(x, y) = ‖x − y‖^p.
Since then, optimal transport has been connected to various domains of
mathematics (geometry, probability, partial differential equations) but also
to more applied domains. Current applications of optimal transport include
machine learning [83], computer graphics [87], quantum chemistry [22, 32],
fluid dynamics [20, 39, 73], optics [80, 26, 99, 24], economics [49] and statistics
[27, 30, 61]. The selection of citations above is certainly quite arbitrary,
as optimal transport is now more than ever a vivid topic, with several
hundreds (perhaps even thousands) of articles published every year
containing the words «optimal transport».
There exist many books on the theory of optimal transport, e.g. by
Rachev-Rüschendorf [84, 85], by Villani [97, 98] and by Santambrogio [88].
However, fewer books deal with the numerical aspects: those by Galichon
[49] and by Cuturi-Peyré [83], and one chapter of Santambrogio [88]. The
books by Galichon and Cuturi-Peyré are targeted toward applications (in
economics and machine learning, respectively) and do not deal in full detail
with the mathematical analysis of algorithms for optimal transport. In this
chapter, we concentrate on numerical methods for optimal transport relying
on Kantorovich duality. Our aim in particular is to provide a self-contained
mathematical analysis of several popular algorithms to solve the discretized
optimal transport problems.

Kantorovich duality. In the 2000s, the theory of optimal transport was
already mature and was used within mathematics, but also in theoretical
physics or in economics. However, numerical applications were essentially
limited to one-dimensional problems, because of the prohibitive cost of
existing algorithms for higher-dimensional problems, whose complexity was
in general more than quadratic in the size of the data. Numerous numerical
methods have been introduced since then. Most of them rely on the dual
problem associated to Kantorovich's problem (1.2), namely

(DP) = max_{ϕ⊖ψ≤c} ∫ ϕ dµ − ∫ ψ dν, (1.3)

where the maximum is taken over pairs (ϕ, ψ) of functions satisfying ϕ ⊖ ψ ≤
c, meaning that ϕ(x) − ψ(y) ≤ c(x, y) for all x, y. Equivalently, the dual
problem can be written as the unconstrained maximization problem

(DP) = max_ψ K(ψ) where K(ψ) = ∫ ψ^c dµ − ∫ ψ dν (1.4)

and ψ^c(x) := min_y c(x, y) + ψ(y) is the c-transform of ψ, a notion closely
related to the Legendre-Fenchel transform in convex analysis. The function K
is called the Kantorovich functional. Kantorovich's duality theorem asserts
that the values of (1.2) and (1.3) (or (1.4)) agree under mild assumptions.

¹A probability measure γ is a transport plan between µ and ν if its marginals are µ
and ν.

Overview of numerical methods. We now briefly review the most used
numerical methods for optimal transport. Note that there is no «free lunch»,
in the sense that there exists no method able to deal efficiently with arbitrary
cost functions; the computational complexity of most methods depends on the
complexity of computing c-transforms or smoothed c-transforms. In this
overview, we skip linear programming methods such as the network simplex,
for which we refer to [83].

A. Assignment problem. When the two measures are uniformly supported
on two finite sets with cardinal N, the optimal transport problem
coincides with the assignment problem, described in [21]. The assignment
problem can be solved using various techniques, but in this chapter we will
concentrate on a dual ascent method called Bertsekas' auction algorithm
[16], whose complexity is O(N³), but which is very simple to implement
and analyze. In §3.2 we note that the complexity can be improved when it is
possible to compute discrete c-transforms efficiently, namely

ψ^c(x_i) = min_j c(x_i, y_j) + ψ(y_j). (1.5)
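To fix ideas, here is a minimal NumPy sketch of the discrete c-transform (1.5); the names (a cost matrix C and a price vector psi) are our own conventions, not notation from the references.

```python
import numpy as np

def c_transform(C, psi):
    """Discrete c-transform (1.5): psi^c(x_i) = min_j c(x_i, y_j) + psi(y_j).

    C   : (N, M) array with C[i, j] = c(x_i, y_j)
    psi : (M,) array of prices on Y
    Returns the (N,) array of the values of psi^c on X.
    """
    return (C + psi[None, :]).min(axis=1)
```

Evaluated densely as above, the transform costs O(NM) operations; faster evaluations are only possible for structured costs.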

B. Entropic regularization. In this approach, one does not solve the
original optimal transport problem (1.2) exactly, but instead replaces it with
a regularized problem involving the entropy of the transport plan. In the
discrete case, it consists in minimizing

min_γ Σ_{i,j} γ_{i,j} c(x_i, y_j) + η Σ_{i,j} h(γ_{i,j}), (1.6)

where h(r) = r(log r − 1) and η > 0 is a small parameter, under the constraints

∀i, Σ_j γ_{i,j} = µ_i,  ∀j, Σ_i γ_{i,j} = ν_j. (1.7)

This idea has been introduced in the field of optimal transport by Galichon
and Salanié [50] and by Cuturi [35]; see [83, Remark 4.5] for a brief historical
account. Adding the entropy of the transport plan makes the problem (1.6)
strongly convex and smooth. The dual problem can be solved efficiently
using Sinkhorn-Knopp's algorithm, which involves computing repeatedly the
smoothed c-transform

ψ^{c,η}(x_i) = η log(µ_i) − η log( Σ_j e^{(−c(x_i,y_j)−ψ(y_j))/η} ). (1.8)

Sinkhorn-Knopp's algorithm can be very efficient, provided that the smoothed
c-transform can be computed efficiently (e.g. in near-linear time).
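The following sketch is one standard log-domain implementation of Sinkhorn-Knopp for (1.6)-(1.7). We use dual variables (f, g) with sign conventions chosen for numerical convenience, which differ from (1.8) by signs and by the placement of the factors µ_i, ν_j; it is a toy implementation with a fixed number of iterations, not the authors' code.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn(C, mu, nu, eta, n_iter=1000):
    """Log-domain Sinkhorn-Knopp for the entropic problem (1.6)-(1.7).

    C : (N, M) cost matrix; mu : (N,), nu : (M,) positive, summing to one.
    Each half-iteration below is a smoothed c-transform in the spirit of (1.8).
    Returns the (N, M) entropic transport plan gamma.
    """
    f, g = np.zeros(len(mu)), np.zeros(len(nu))
    log_mu, log_nu = np.log(mu), np.log(nu)
    for _ in range(n_iter):
        # enforce the row constraint sum_j gamma_{ij} = mu_i
        f = -eta * logsumexp((g[None, :] - C) / eta + log_nu[None, :], axis=1)
        # enforce the column constraint sum_i gamma_{ij} = nu_j
        g = -eta * logsumexp((f[:, None] - C) / eta + log_mu[:, None], axis=0)
    return np.exp((f[:, None] + g[None, :] - C) / eta) * mu[:, None] * nu[None, :]
```

In practice one stops when the marginals of the current plan match µ and ν up to a prescribed tolerance, rather than after a fixed number of iterations.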

C. Distance costs. When the cost c satisfies the triangle inequality, the
dual problem (1.3) can be further simplified:

max_{Lip_c(ψ)≤1} ∫ ψ dµ − ∫ ψ dν, (1.9)

where the maximum is taken over functions satisfying |ψ(x) − ψ(y)| ≤ c(x, y)
for all x, y. The equality between the values of (1.2) and (1.9) is called
Kantorovich-Rubinstein's theorem. This leads to very efficient algorithms
when the 1-Lipschitz constraint can be enforced using only local information,
thus reducing the number of constraints. This is possible when the
space is discrete and the distance is induced by a graph, or when c is the
Euclidean norm or more generally a Riemannian metric. In the latter case,
the maximum in (1.9) can be replaced by a supremum over C^1 functions
ψ satisfying ‖∇ψ‖_∞ ≤ 1 [94, 9]. Note that the case of distance costs is
particularly easy because the c-transform of a 1-Lipschitz function is trivial:
ψ^c = ψ.

D. Monge-Ampère equation. When the cost is the negative of the Euclidean
scalar product, c(x, y) = −⟨x|y⟩, the dual problem (1.3) can be reformulated as

max_ψ − ∫ ψ* dµ − ∫ ψ dν, (1.10)

where ψ*(x) = max_y ⟨x|y⟩ − ψ(y) is the Legendre-Fenchel transform of ψ.
If the maximizer ψ is smooth and strongly convex and µ, ν are probability
densities, the optimality condition associated to the dual problem is the
Monge-Ampère equation,

µ(∇ψ(y)) det(D²ψ(y)) = ν(y),
∇ψ(spt(ν)) ⊆ spt(µ). (1.11)

Note the non-standard boundary conditions appearing on the second line of
the equation. The first methods able to deal with these boundary conditions
use a “wide-stencil” finite difference discretization [46, 13, 12]. These methods
are able to solve optimal transport problems provided that the maximizer of
(1.10) is a viscosity solution to the Monge-Ampère equation (1.11), imposing
restrictions on its regularity. For the Monge-Ampère equation with Dirichlet
conditions, we refer to the recent survey by Neilan, Salgado and Zhang [77].

E. Semi-discrete formulation. The semi-discrete formulation of optimal
transport involves a source measure that is a probability density µ and a
target measure ν which is finitely supported, i.e. ν = Σ_i ν_i δ_{y_i}. It was
introduced by Cullen in 1984 [34], without reference to optimal transport,
and much refined since then [6, 70, 38, 55, 64, 68, 65]. In this setting, the
dual problem (1.4) amounts to maximizing the Kantorovich functional given by

K(ψ) = ∫ ψ^c dµ − ∫ ψ dν = Σ_i ∫_{Lag_{y_i}(ψ)} (c(x, y_i) + ψ(y_i)) dµ(x) − ∫ ψ dν, (1.12)

where the Laguerre cells are defined by

Lag_{y_i}(ψ) = {x | ∀j, c(x, y_i) + ψ(y_i) ≤ c(x, y_j) + ψ(y_j)}. (1.13)

The optimality condition for (1.12) is the following non-linear system of
equations:

∀i, µ(Lag_{y_i}(ψ)) = ν_i. (1.14)

In the case c(x, y) = −⟨x|y⟩, this system of equations can be seen as a weak
formulation (in the sense of Alexandrov, see [57, Chapter 1]) of the Monge-
Ampère equation (1.11). This “semi-discrete” approach can also be used to
solve Monge-Ampère equations with Dirichlet boundary conditions, and has
originally been introduced for this purpose [79, 75]. Again, the possibility
to solve (1.14) efficiently requires one to be able to compute the Laguerre
tessellation (1.13), and thus the c-transform, efficiently.
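Solving (1.14) exactly requires computing the Laguerre tessellation, which relies on computational-geometry tools. As a simple hedged illustration, the cell masses µ(Lag_{y_i}(ψ)) can instead be estimated by Monte-Carlo sampling of µ; the helper below and its argument names are ours, and the cost is fixed to the quadratic one.

```python
import numpy as np

def laguerre_cell_masses(sample, points, psi):
    """Monte-Carlo estimate of the masses mu(Lag_{y_i}(psi)) in (1.14).

    sample : (n, d) points drawn from the density mu
    points : (N, d) support points y_i of nu
    psi    : (N,) weights
    Each sample point is assigned to the Laguerre cell (1.13) containing it,
    here for the quadratic cost c(x, y) = |x - y|^2.
    """
    cost = ((sample[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    cells = (cost + psi[None, :]).argmin(axis=1)
    return np.bincount(cells, minlength=len(points)) / len(sample)
```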

F. Dynamic formulation. This formulation relies on the dynamic formulation
of optimal transport, which holds when the cost is c(x, y) = ‖x − y‖²
on R^d (or more generally induced by a Riemannian metric), and is known as
the Benamou-Brenier formulation:

min_{(ρ,v)} ∫_0^1 ∫_{R^d} ρ_t ‖v_t‖² dx dt  with  ∂_t ρ_t + div(ρ_t v_t) = 0,  ρ_0 = µ, ρ_1 = ν.

Introducing the momentum m_t = ρ_t v_t, the problem can be rewritten as

min_{(ρ_t,m_t)} ∫_0^1 ∫_{R^d} ‖m_t‖²/ρ_t dx dt  with  ∂_t ρ_t + div(m_t) = 0,  ρ_0 = µ, ρ_1 = ν.

This optimization problem can be discretized using finite elements [7], finite
differences [81] or finite volumes [44], and the discrete problem is then
usually solved using a primal-dual augmented Lagrangian method [7] (see
also [81, 60]). In practice, the convergence is very costly in terms of number
of iterations; note also that each iteration requires the resolution of
a (d + 1)-dimensional Poisson problem to project on the admissible set
{(ρ, m) | ∂_t ρ + div(m) = 0}. Another possibility is to use divergence-free
wavelets [59]. One advantage of the Benamou-Brenier approach is that it
is very flexible, easily allowing (Riemannian) cost functions [81], additional
quadratic terms [8], penalization of congestion [23], partial transport [69, 31],
etc. Finally, the convergence of the discretized problem to the continuous
one is subtle and depends on the choice of the discretization, see [67, 28].

In this chapter, we will describe in detail the following discretizations
for optimal transport and the corresponding algorithms to solve the discretized
problems: the assignment problem (A.) through Bertsekas' auction algorithm,
the entropic regularization (B.) through Sinkhorn-Knopp's algorithm,
and semi-discrete optimal transport (E.) through Oliker-Prussner or Newton's
methods. These algorithms share a common feature, in that they are all
derived from Kantorovich duality. Some of them have been adapted to variants
of optimal transport problems, such as multi-marginal optimal transport
problems [82], barycenters with respect to optimal transport
metrics [2], partial [25] and unbalanced optimal transport [31, 66], gradient
flows in the Wasserstein space [62, 4], and generated Jacobian equations [56, 95].
However, we consider these extensions to be out of the scope of this chapter.

2. Optimal transport theory


This part contains a self-contained introduction to the theory of optimal
transport, putting a strong emphasis on Kantorovich duality.
Kantorovich duality is at the heart of the most important theorems of
optimal transport, such as Brenier's and Gangbo-McCann's theorems on the
existence and uniqueness of solutions to Monge's problem, and the stability
of optimal transport plans and optimal transport maps. Kantorovich's
duality is also used in all the numerical methods presented in this chapter.

Background on measure theory. In the following, we assume that X is a
compact metric space, and we denote C^0(X) the space of continuous functions
over X. We denote M(X) the space of finite (Radon) measures over
X, identified with the set of continuous linear forms over C^0(X). Given
ϕ ∈ C^0(X) and µ ∈ M(X), we will often denote ⟨ϕ|µ⟩ = ∫_X ϕ dµ. The
spaces of non-negative measures and probability measures are defined by

M_+(X) := {µ ∈ M(X) | µ ≥ 0},
P(X) := {µ ∈ M_+(X) | µ(X) = 1},

where µ ≥ 0 means ⟨µ|ϕ⟩ ≥ 0 for all ϕ ∈ C^0(X, R_+). The three spaces
M(X), M_+(X) and P(X) are endowed with the weak topology induced by
duality with C^0(X), namely µ_n → µ weakly if

∀ϕ ∈ C^0(X), ⟨ϕ|µ_n⟩ → ⟨ϕ|µ⟩ as n → ∞.

A point x belongs to the support of a non-negative measure µ iff for every r >
0 one has µ(B(x, r)) > 0. The support of µ is denoted spt(µ). We recall that
by the Banach-Alaoglu theorem, the set of probability measures P(X) is weakly
compact, a fact which will be useful to prove existence and convergence
results in optimal transport.

Notation. Given two functions ϕ ∈ C^0(X) and ψ ∈ C^0(Y), we define
ϕ ⊕ ψ ∈ C^0(X × Y) by ϕ ⊕ ψ(x, y) = ϕ(x) + ψ(y). We define ϕ ⊖ ψ and
ϕ ⊗ ψ similarly, i.e. ϕ ⊖ ψ(x, y) = ϕ(x) − ψ(y) and ϕ ⊗ ψ(x, y) = ϕ(x)ψ(y).

2.1. The problems of Monge and Kantorovich.



Monge’s problem. Before introducing Monge’s problem, we recall the defini-


tion of push-forward or image measure.
Definition 1 (Push-forward and transport map). Let X, Y be compact met-
ric spaces, µ ∈ M(X) and T : X → Y be a measurable map. The push-
forward of µ by T is the measure T# µ on Y defined by
∀ϕ ∈ C 0 (Y ), hϕ|T# µi := hϕ ◦ T |µi,
or equivalently if for every Borel subset B ⊆ Y, T# µ(B) = µ(T −1 (B)). A
measurable map T : X → Y such that T# µ = ν is also called a transport
map between µ and ν.
Example 1. If Y = {y1 , . . . , yn }, then T# µ = 16i6n µ(T −1 ({yi }))δyi .
P
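For a finitely supported target as in Example 1, the push-forward is a simple aggregation of masses; here is a small NumPy sketch (the names are ours):

```python
import numpy as np

def push_forward(mu, T_idx, n):
    """Push-forward of a discrete measure by T (Example 1): the mass of y_i
    is the total mu-mass of the preimage T^{-1}({y_i}).

    mu    : (m,) masses of the source points
    T_idx : (m,) integers, T_idx[k] = i if the k-th source point maps to y_i
    n     : number of target points y_1, ..., y_n
    """
    return np.bincount(T_idx, weights=mu, minlength=n)
```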

Example 2. Assume that T is a C^1 diffeomorphism between compact domains
X, Y of R^d, and assume also that the probability measures µ, ν have
continuous densities ρ, σ with respect to the Lebesgue measure. Then, by the
change of variables formula,

∫_Y ϕ(y)σ(y) dy = ∫_X ϕ(T(x)) σ(T(x)) |det(DT(x))| dx.

Hence, T is a transport map between µ and ν iff

∀ϕ ∈ C^0(Y), ∫_X ϕ(T(x)) σ(T(x)) |det(DT(x))| dx = ∫_X ϕ(T(x)) ρ(x) dx,

or equivalently if the (non-linear) Jacobian equation holds:

ρ(x) = σ(T(x)) |det(DT(x))|.
Definition 2 (Monge’s problem). Consider two compact metric spaces X, Y ,
two probability measures µ ∈ P(X), ν ∈ P(Y ) and a cost function c ∈
C 0 (X × Y ). Monge’s problem is the following optimization problem
Z 
(MP) := inf c(x, T (x))dµ(x) | T : X → Y and T# µ = ν (2.15)
X

Monge’s problem exhibits several difficulties, one of which is that both the
transport constraint (T# µ = ν) and the functional are non-convex. Note also
that there might exist no transport map between µ and ν. For instance, if
µ = δx for some x ∈ X, then, T# µ(B) = µ(T −1 (B)) = δT (x) . In particular,
if card(spt(ν)) > 1, there exists no transport map between µ and ν.
Kantorovich’s problem.
Definition 3 (Marginals). The marginals of a measure γ on a product space
X × Y are the measures Π_{X#}γ and Π_{Y#}γ, where Π_X : X × Y → X and
Π_Y : X × Y → Y are the projection maps.

Definition 4 (Transport plan). A transport plan between two probability
measures µ, ν on two metric spaces X and Y is a probability measure γ
on the product space X × Y whose marginals are µ and ν. The space of
transport plans is denoted Γ(µ, ν), i.e.

Γ(µ, ν) = {γ ∈ P(X × Y) | Π_{X#}γ = µ, Π_{Y#}γ = ν}.

Note that Γ(µ, ν) is a convex set.

Example 3 (Product measure). Note that the set of transport plans Γ(µ, ν)
is never empty, as it contains the product measure µ ⊗ ν.

Definition 5 (Kantorovich's problem). Consider two compact metric spaces
X, Y, two probability measures µ ∈ P(X), ν ∈ P(Y) and a cost function
c ∈ C^0(X × Y). Kantorovich's problem is the following optimization problem:

(KP) := inf { ∫_{X×Y} c(x, y) dγ(x, y) | γ ∈ Γ(µ, ν) }. (2.16)
Remark 1. The infimum in Kantorovich's problem is at most the infimum
in Monge's problem. Indeed, to any transport map T between µ and ν one
can associate a transport plan, by letting γ_T = (id, T)_#µ. One can easily
check that Π_{X#}γ_T = µ and Π_{Y#}γ_T = ν, so that γ_T ∈ Γ(µ, ν) is a transport
plan between µ and ν. Moreover, by the definition of push-forward,

⟨c|γ_T⟩ = ⟨c|(id, T)_#µ⟩ = ⟨c ◦ (id, T)|µ⟩ = ∫_X c(x, T(x)) dµ,

thus showing that (KP) ≤ (MP).
Proposition 1. Kantorovich's problem (KP) admits a minimizer.

Proof. The definition of Π_{X#}γ = µ can be expanded into

∀ϕ ∈ C^0(X), ⟨ϕ ⊗ 1|γ⟩ = ⟨ϕ|µ⟩,

and similarly for Π_{Y#}γ = ν, from which it is easy to see that the set Γ(µ, ν)
is weakly closed, and therefore weakly compact as a subset of P(X × Y), which
is weakly compact by Banach-Alaoglu's theorem. We conclude the existence
proof by remarking that the functional minimized in (KP), namely γ ↦ ⟨c|γ⟩,
is weakly continuous by definition. □
2.2. Kantorovich duality.

Derivation of the dual problem. The primal Kantorovich problem (KP) can
be reformulated by introducing Lagrange multipliers for the constraints.
Namely, we use that for any γ ∈ M_+(X × Y),

sup_{ϕ∈C^0(X)} −⟨ϕ ⊗ 1|γ⟩ + ⟨ϕ|µ⟩ = 0 if Π_{X#}γ = µ, and +∞ otherwise,
sup_{ψ∈C^0(Y)} ⟨1 ⊗ ψ|γ⟩ − ⟨ψ|ν⟩ = 0 if Π_{Y#}γ = ν, and +∞ otherwise,

to deduce

sup_{ϕ∈C^0(X), ψ∈C^0(Y)} ⟨ϕ|µ⟩ − ⟨ψ|ν⟩ − ⟨ϕ ⊖ ψ|γ⟩ = 0 if γ ∈ Γ(µ, ν), and +∞ otherwise.

This leads to the following formulation of the Kantorovich problem:

(KP) = inf_{γ∈M_+(X×Y)} sup_{(ϕ,ψ)∈C^0(X)×C^0(Y)} ⟨c − (ϕ ⊖ ψ)|γ⟩ + ⟨ϕ|µ⟩ − ⟨ψ|ν⟩.

Kantorovich's dual problem is simply obtained by inverting the infimum and
the supremum:

(DP) := sup_{ϕ,ψ} inf_{γ≥0} ⟨c − (ϕ ⊖ ψ)|γ⟩ + ⟨ϕ|µ⟩ − ⟨ψ|ν⟩.

Note that we will often omit the assumptions that γ ∈ M(X × Y) and that
ϕ, ψ are continuous, when the context is clear. The dual problem can further
be simplified by remarking that

inf_{γ≥0} ⟨c − ϕ ⊖ ψ|γ⟩ = 0 if ϕ ⊖ ψ ≤ c, and −∞ otherwise.

Definition 6 (Kantorovich's dual problem). Given µ ∈ P(X) and ν ∈ P(Y)
with X, Y compact metric spaces and c ∈ C^0(X × Y), we define Kantorovich's
dual problem by

(DP) = sup { ∫_X ϕ dµ − ∫_Y ψ dν | (ϕ, ψ) ∈ C^0(X) × C^0(Y), ϕ ⊖ ψ ≤ c }. (2.17)

Proposition 2. Weak duality holds, i.e. (KP) ≥ (DP).

Proof. Given (ϕ, ψ, γ) ∈ C^0(X) × C^0(Y) × Γ(µ, ν) satisfying the constraint
ϕ ⊖ ψ ≤ c, one has

⟨ϕ|µ⟩ − ⟨ψ|ν⟩ = ⟨ϕ ⊖ ψ|γ⟩ ≤ ⟨c|γ⟩,

where we used γ ∈ Γ(µ, ν) to get the equality and ϕ ⊖ ψ ≤ c to get the
inequality. As a conclusion,

(DP) = sup_{ϕ⊖ψ≤c} ⟨ϕ|µ⟩ − ⟨ψ|ν⟩ ≤ min_{γ∈Γ(µ,ν)} ⟨c|γ⟩ = (KP). □

Existence of solutions for the dual problem. Kantorovich's dual problem (DP)
consists in maximizing a concave (actually linear) functional under linear
inequality constraints. It can also easily be turned into an unconstrained
maximization problem. The idea is quite simple: given a certain ψ ∈ C^0(Y),
one wishes to select ϕ on X which is as large as possible (to maximize the
term ⟨ϕ|µ⟩ in (DP)) while satisfying the constraint ϕ ⊖ ψ ≤ c. This constraint
can be rewritten as

∀x ∈ X, ϕ(x) ≤ min_{y∈Y} c(x, y) + ψ(y).

The largest function ϕ satisfying it is ϕ(x) = min_{y∈Y} c(x, y) + ψ(y). Thus,

(DP) = sup_{ϕ⊖ψ≤c} ⟨ϕ|µ⟩ − ⟨ψ|ν⟩
     = sup_{ψ∈C^0(Y)} ∫_X ( min_{y∈Y} c(x, y) + ψ(y) ) dµ(x) − ∫_Y ψ(y) dν(y).

This idea is at the basis of many algorithms to solve discrete instances of
optimal transport, but is also useful in theory. It also suggests introducing
the notion of c-transform.

Definition 7 (c-transform). The c-transform (resp. c̄-transform) of a function
ψ : Y → R ∪ {+∞} (resp. ϕ : X → R ∪ {+∞}) is defined as

ψ^c : x ∈ X ↦ inf_{y∈Y} c(x, y) + ψ(y), (2.18)
ϕ^c̄ : y ∈ Y ↦ sup_{x∈X} −c(x, y) + ϕ(x). (2.19)

Thanks to this notion of c-transform, one can reformulate the dual problem
(DP) as an unconstrained maximization problem:

(DP) = sup_{ψ∈C^0(Y)} ∫_X ψ^c dµ − ∫_Y ψ dν. (2.20)

Remark 2 (c-concavity, c-convexity and c-subdifferential). One calls a
function ϕ on X c-concave if ϕ = ψ^c for some ψ : Y → R ∪ {+∞}. Note
that we use the word concave because ψ^c is defined through an
infimum. Conversely, a function ψ on Y is called c-convex if ψ = ϕ^c̄ for
some ϕ : X → R ∪ {+∞}. Note the asymmetry between the two notions,
which is due to the choice of the sign in the constraint in Kantorovich's
problem: of the two equivalent formulations

(DP) = sup_{ϕ⊕ψ≤c} ⟨ϕ|µ⟩ + ⟨ψ|ν⟩ = sup_{ϕ⊖ψ≤c} ⟨ϕ|µ⟩ − ⟨ψ|ν⟩,

we chose the second one, involving two minus signs. This choice will make it
easier to explain some of the algorithms presented later in the chapter.
The c-subdifferential of a function ψ on Y is the subset of X × Y defined by

∂^c ψ := {(x, y) ∈ X × Y | ψ^c(x) − ψ(y) = c(x, y)}, (2.21)

while the c-subdifferential at a point y in Y is given by

∂^c ψ(y) := {x ∈ X | (x, y) ∈ ∂^c ψ}. (2.22)
Remark 3 (Bilinear cost). When c(x, y) = −⟨x|y⟩, a function is c-convex if
and only if it is convex, and ϕ^c̄ is the Legendre-Fenchel transform of −ϕ.

Proposition 3 (Existence of dual potentials). (DP) admits a maximizer,
which one can assume to be of the form (ϕ, ψ) with ϕ = ψ^c and ψ = ϕ^c̄.

The existence of maximizers follows from the fact that a c-concave or c-convex
function has the same modulus of continuity as c. (Recall that ω : R_+ → R_+
is a modulus of continuity of a function f : Z → R on a metric space (Z, d_Z)
if it satisfies lim_{t→0} ω(t) = 0 and, for every z, z' ∈ Z,
|f(z) − f(z')| ≤ ω(d_Z(z, z')).)
Lemma 4 (Properties of c-transforms). Let ω : R_+ → R_+ be a modulus of
continuity for c ∈ C^0(X × Y) for the distance

d_{X×Y}((x, y), (x', y')) = d_X(x, x') + d_Y(y, y').

Then for every ϕ ∈ C^0(X) and every ψ ∈ C^0(Y),
• ϕ^c̄ and ψ^c also admit ω as modulus of continuity;
• ψ^{cc̄} ≤ ψ and ψ^{cc̄c} = ψ^c;
• ϕ^{c̄c} ≥ ϕ and ϕ^{c̄cc̄} = ϕ^c̄.

Proof. Let us first prove the first point. Let ψ ∈ C^0(Y) and, for x ∈ X, let
y_x ∈ Y be a point realizing the minimum in the definition of ψ^c. Then,

ψ^c(x') ≤ c(x', y_x) + ψ(y_x) = ψ^c(x) + c(x', y_x) − c(x, y_x) ≤ ψ^c(x) + ω(d_X(x, x')).

Exchanging the roles of x and x', we get |ψ^c(x') − ψ^c(x)| ≤ ω(d_X(x, x')) as
desired. The proof that ϕ^c̄ has ω as modulus of continuity is similar. We
now prove the second point. By definition, one has

ψ^{cc̄}(y) = max_{x∈X} ( −c(x, y) + min_{ỹ∈Y} c(x, ỹ) + ψ(ỹ) ).

By taking ỹ = y, one gets ψ^{cc̄}(y) ≤ ψ(y). Again, by definition, we have

ψ^{cc̄c}(x) = min_{y∈Y} ( c(x, y) + max_{x̃∈X} ( −c(x̃, y) + min_{ỹ∈Y} c(x̃, ỹ) + ψ(ỹ) ) ).

By taking x̃ = x, one gets ψ^{cc̄c}(x) ≥ ψ^c(x), while taking ỹ = y gives us
ψ^{cc̄c}(x) ≤ ψ^c(x). The last point is obtained similarly. □
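On finite sets, the identities of Lemma 4 are easy to check numerically; here is a self-contained sketch on random toy data (our own construction, not an example from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.random((5, 7))                       # c(x_i, y_j) on finite sets X, Y
psi = rng.random(7)

psi_c = (C + psi[None, :]).min(axis=1)       # psi^c on X, cf. (2.18)
psi_cc = (psi_c[:, None] - C).max(axis=0)    # (psi^c)^cbar on Y, cf. (2.19)
psi_ccc = (C + psi_cc[None, :]).min(axis=1)  # one more c-transform

assert np.all(psi_cc <= psi + 1e-12)         # psi^{c cbar} <= psi
assert np.allclose(psi_ccc, psi_c)           # psi^{c cbar c} = psi^c
```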
Proof of Proposition 3. Let (ϕ_n, ψ_n) be a maximizing sequence for (DP), i.e.
ϕ_n ⊖ ψ_n ≤ c and lim_{n→+∞} ⟨ϕ_n|µ⟩ − ⟨ψ_n|ν⟩ = (DP). Define ϕ̂_n = ψ_n^c and
ψ̂_n = ϕ̂_n^c̄. Then ϕ̂_n ⊖ ψ̂_n ≤ c, ϕ_n ≤ ϕ̂_n and ψ_n ≥ ψ̂_n, which implies

⟨ϕ_n|µ⟩ − ⟨ψ_n|ν⟩ ≤ ⟨ϕ̂_n|µ⟩ − ⟨ψ_n|ν⟩ ≤ ⟨ϕ̂_n|µ⟩ − ⟨ψ̂_n|ν⟩,

implying that (ϕ̂_n, ψ̂_n) is also a maximizing sequence. Our goal is now to
show that this sequence admits a converging subsequence. We first note that
we can assume that ϕ̂_n(x_0) = 0 for all n, where x_0 is a given point in X: if
this is not the case, we replace (ϕ̂_n, ψ̂_n) by (ϕ̂_n − ϕ̂_n(x_0), ψ̂_n − ϕ̂_n(x_0)), which
is also admissible and has the same dual value. In addition, by Lemma 4, the
sequences (ϕ̂_n)_n and (ψ̂_n)_n are equicontinuous, and they are uniformly bounded.
By Arzelà-Ascoli's theorem, we deduce that they admit subsequences converging
respectively to ϕ ∈ C^0(X) and ψ ∈ C^0(Y), which are then maximizers for (DP). □
Strong duality and stability of optimal transport plans. We will prove strong
duality first in the case where µ, ν are finitely supported, and will then use a
density argument to deduce the general case. As a byproduct of this theorem,
we get a stability result for optimal transport plans (i.e. a limit of optimal
transport plans is also optimal).

Theorem 5 (Strong duality). Let X, Y be compact metric spaces and c ∈
C^0(X × Y). Then the maximum is attained in (DP) and (KP) = (DP).

Corollary 6 (Support of OT plans). Let ψ be a maximizer of (2.20) and
γ ∈ Γ(µ, ν) a transport plan. Then the two following assertions are equivalent:
• γ is an optimal transport plan;
• spt(γ) ⊂ ∂^c ψ = {(x, y) ∈ X × Y | ψ^c(x) − ψ(y) = c(x, y)}.
As a consequence of Kantorovich duality, we can prove stability of optimal
transport plans and optimal transport maps.
Theorem 7 (Stability of OT plans). Let X, Y be compact metric spaces
and let c ∈ C^0(X × Y). Consider (µ_k)_{k∈N} and (ν_k)_{k∈N} in P(X) and P(Y)
converging weakly to µ and ν respectively.
• If γ_k ∈ Γ(µ_k, ν_k) is optimal then, up to subsequences, (γ_k) converges
weakly to an optimal transport plan γ ∈ Γ(µ, ν).
• Let (ϕ_k, ψ_k) be optimal Kantorovich potentials in the dual problem
between µ_k and ν_k, satisfying ψ_k = ϕ_k^c̄ and ϕ_k = ψ_k^c. Given a point
x_0 ∈ X, define ϕ̃_k = ϕ_k − ϕ_k(x_0) and ψ̃_k = ψ_k − ϕ_k(x_0). Then, up
to subsequences, (ϕ̃_k, ψ̃_k) converges uniformly to a maximizing pair
(ϕ, ψ) for (DP) satisfying ϕ = ψ^c and ψ = ϕ^c̄.
The proof of Theorem 5 relies on a simple reformulation of strong duality
– similar to the Karush-Kuhn-Tucker optimality conditions for optimization
problems with inequality constraints:

Proposition 8. Let γ ∈ Γ(µ, ν) and let (ϕ, ψ) ∈ C^0(X) × C^0(Y) be such that
ϕ ⊖ ψ ≤ c. Then, the following statements are equivalent:
• ϕ ⊖ ψ = c γ-a.e.;
• γ minimizes (KP), (ϕ, ψ) maximizes (DP) and (KP) = (DP).

Proof. Assume that ϕ ⊖ ψ = c γ-a.e. Then,

(KP) ≤ ⟨c|γ⟩ = ⟨ϕ ⊖ ψ|γ⟩ = ⟨ϕ|µ⟩ − ⟨ψ|ν⟩ ≤ (DP).

Since in addition (KP) ≥ (DP), all inequalities are equalities, which implies
that (KP) = (DP), γ minimizes (KP) and (ϕ, ψ) maximizes (DP). Conversely,
if (KP) = (DP), γ minimizes (KP) and (ϕ, ψ) maximizes (DP), then

⟨ϕ|µ⟩ − ⟨ψ|ν⟩ = (DP) = (KP) = ⟨c|γ⟩ ≥ ⟨ϕ ⊖ ψ|γ⟩ = ⟨ϕ|µ⟩ − ⟨ψ|ν⟩,

implying that ϕ ⊖ ψ = c γ-a.e. □
The proof of Theorem 5 also relies on a few elementary lemmas from
measure theory.
Lemma 9. If µ_N converges weakly to µ, then for any point x ∈ spt(µ) there
exists a sequence x_N ∈ spt(µ_N) converging to x.

Proof. Consider x ∈ spt(µ). For any k ∈ N, consider the function ϕ_k(z) =
max(1 − k·d(x, z), 0), which belongs to C^0(X). Then,

lim_{N→∞} ⟨ϕ_k|µ_N⟩ = ⟨ϕ_k|µ⟩ > 0,

where the last inequality holds because x belongs to the support of µ. Then,
there exists N_k such that for any N ≥ N_k, ⟨ϕ_k|µ_N⟩ > 0, implying the
existence of x_N ∈ X such that x_N ∈ spt(µ_N) and d(x_N, x) ≤ 1/k. By a
diagonal argument, this allows us to construct a sequence of points (x_N)_{N∈N}
such that x_N ∈ spt(µ_N) and lim_{N→+∞} x_N = x. □
Lemma 10. Let X be a compact space and µ ∈ P(X). Then, there exists a
sequence of finitely supported probability measures weakly converging to µ.

Proof. For any ε > 0, by compactness there exist N points x_1, . . . , x_N such
that X ⊆ ∪_i B(x_i, ε). We define a partition K_1, . . . , K_N of X recursively by
K_i = B(x_i, ε) \ (K_1 ∪ ... ∪ K_{i−1}), and we introduce

µ_ε := Σ_{1≤i≤N} µ(K_i) δ_{x_i}.

To prove weak convergence of µ_ε to µ as ε → 0, take ϕ ∈ C^0(X). By
compactness of X, ϕ admits a modulus of continuity ω, i.e. lim_{t→0} ω(t) = 0
and |ϕ(x) − ϕ(y)| ≤ ω(d(x, y)). Using that K_i ⊆ B(x_i, ε), we get

|∫ ϕ dµ − ∫ ϕ dµ_ε| = |Σ_{1≤i≤N} ∫_{K_i} (ϕ(x) − ϕ(x_i)) dµ(x)| ≤ ω(ε).

We deduce lim_{ε→0} ⟨ϕ|µ_ε⟩ = ⟨ϕ|µ⟩, so that µ_ε weakly converges to µ. □


Lemma 11. If (µ, ν) ∈ P(X) × P(Y) are finitely supported, then (KP) = (DP).
Proof. Assume that µ = Σ_{1≤i≤N} µ_i δ_{x_i} and ν = Σ_{1≤j≤M} ν_j δ_{y_j}, where all
the µ_i and ν_j are strictly positive, and consider the linear programming problem

(KP)' = min { Σ_{i,j} γ_{ij} c(x_i, y_j) | γ_{ij} ≥ 0, Σ_j γ_{ij} = µ_i, Σ_i γ_{ij} = ν_j },

which admits a solution which we denote γ. By the Karush-Kuhn-Tucker
theorem, there exist Lagrange multipliers (ϕ_i)_{1≤i≤N}, (ψ_j)_{1≤j≤M} and
(π_{ij})_{1≤i≤N, 1≤j≤M} such that

ϕ_i − ψ_j − c(x_i, y_j) = π_{ij},
γ_{ij} π_{ij} = 0,
π_{ij} ≤ 0.

In particular, ϕ_i − ψ_j ≤ c(x_i, y_j), with equality if γ_{ij} > 0. To prove strong
duality between the original problems (KP) and (DP), we construct two
functions ϕ̂, ψ̂ such that ϕ̂ ⊖ ψ̂ ≤ c with equality on the set {(x_i, y_j) | γ_{ij} > 0}.
For this purpose, we first introduce

ψ(y) = ψ_j if y = y_j, and +∞ if not,

and let ϕ̂ = ψ^c, ψ̂ = ϕ̂^c̄. Let i ∈ {1, . . . , N}. Since µ_i = Σ_j γ_{ij} ≠ 0, there
exists j ∈ {1, . . . , M} such that γ_{ij} > 0. Using γ_{ij} π_{ij} = 0, we deduce
that ϕ_i − ψ_j = c(x_i, y_j), giving

ϕ̂(x_i) = min_{k∈{1,...,M}} c(x_i, y_k) + ψ_k = c(x_i, y_j) + ψ_j = ϕ_i.

Similarly, one can show that ψ̂(y_j) = ψ_j for all j ∈ {1, . . . , M}. Finally,
define γ = Σ_{i,j} γ_{ij} δ_{(x_i,y_j)} ∈ Γ(µ, ν). Then one can check that ϕ̂ ⊖ ψ̂ ≤ c with
equality γ-a.e., so that (KP) = (DP) by Proposition 8. □
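In the finitely supported case of Lemma 11, (KP) is a plain linear program and can be solved with an off-the-shelf LP solver; here is a sketch using scipy, where the flattening conventions are ours. With recent scipy versions and the "highs" method, the dual variables of the equality constraints (res.eqlin.marginals) recover, up to sign conventions, the Kantorovich potentials.

```python
import numpy as np
from scipy.optimize import linprog

def solve_discrete_ot(C, mu, nu):
    """Solve the finite Kantorovich problem of Lemma 11 as a linear program.

    C : (N, M) cost matrix; mu : (N,), nu : (M,) with equal total mass.
    Variables are the N*M entries of gamma, flattened row by row.
    """
    N, M = C.shape
    A_eq = np.zeros((N + M, N * M))
    for i in range(N):                  # row marginals: sum_j gamma_ij = mu_i
        A_eq[i, i * M:(i + 1) * M] = 1.0
    for j in range(M):                  # column marginals: sum_i gamma_ij = nu_j
        A_eq[N + j, j::M] = 1.0
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(N, M), res.fun
```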
Proof of Theorem 5. By Lemma 10, there exists a sequence µ_k ∈ P(X) (resp.
ν_k ∈ P(Y)) of finitely supported measures which converge weakly to µ (resp.
ν). We denote (KP)_k and (DP)_k the primal and dual Kantorovich problems
between µ_k and ν_k. By Proposition 3, there exists a solution (ϕ_k, ψ_k) of
(DP)_k such that ϕ_k = ψ_k^c and ψ_k = ϕ_k^c̄; let also γ_k be an optimal
transport plan for (KP)_k. Since strong duality holds for finitely supported
measures (Lemma 11), we see (Proposition 8) that γ_k is supported on the set

S_k = {(x, y) ∈ X × Y | ϕ_k(x) − ψ_k(y) = c(x, y)}.

Adding a constant if necessary, we can also assume that ϕ_k(x_0) = 0 for some
point x_0 ∈ X. As c-concave functions, the ϕ_k and ψ_k have the same modulus
of continuity as the cost function c (see Lemma 4), and they are uniformly
bounded (using ϕ_k(x_0) = 0). Using the Arzelà-Ascoli theorem, we can therefore
assume that, up to subsequences, (ϕ_k) (resp. (ψ_k)) converges uniformly to some
ϕ (resp. ψ). Then, one easily sees that ϕ ⊖ ψ ≤ c, so that (ϕ, ψ) is
admissible for the dual problem (DP).

By compactness of P(X × Y), we can assume that the sequence γ_k ∈
Γ(µ_k, ν_k) converges weakly to some γ ∈ Γ(µ, ν). Moreover, by Lemma 9, every
pair (x, y) ∈ spt(γ) can be approximated by a sequence of pairs
(x_k, y_k) ∈ spt(γ_k), i.e. lim_{k→∞}(x_k, y_k) = (x, y). Since γ_k is supported on
S_k, one has c(x_k, y_k) = ϕ_k(x_k) − ψ_k(y_k), which gives at the limit
c(x, y) = ϕ(x) − ψ(y). We have just shown that for every pair (x, y) in spt(γ),
c(x, y) = ϕ(x) − ψ(y), where (ϕ, ψ) is admissible. By Proposition 8, this shows
that γ and (ϕ, ψ) are optimal for their respective problems and that
(KP) = (DP). □
Corollary 6 is a direct consequence of Proposition 8 and of the strong
duality (KP) = (DP).

Solution of Monge’s problem for Twisted costs. We now show how to use
Kantorovich duality to prove the existence of optimal transport maps when
the source measure is absolutely continuous on a compact subset of Rd and
when the cost function satisfies the following condition:
Definition 8 (Twisted cost). Let ΩX , ΩY ⊆ Rd be open subsets, and c ∈
C 1 (ΩX × ΩY ). The cost function satisfies the twist condition if
∀x0 ∈ ΩX , the map y ∈ ΩY 7→ v := ∇x c(x0 , y) ∈ Rd is injective, (2.23)
where ∇x c(x0 , y) denotes the gradient of x 7→ c(·, y) at x = x0 . Given
x0 ∈ ΩX and v ∈ Rd , we denote yc (x0 , v) the unique point (if it exists)
such that ∇x c(x0 , yc (x0 , v)) = v. The map v 7→ yc (x0 , v) is often called the
c-exponential map at x0 .
Example 4 (Quadratic cost). Let c(x, y) = ‖x − y‖². Then, for any x_0 ∈ X,
the map y ↦ ∇_x c(x_0, y) = 2(x_0 − y) is injective, so that c satisfies the twist
condition. Moreover, given v ∈ R^d, the unique y such that ∇_x c(x_0, y) =
2(x_0 − y) = v is y = x_0 − v/2, implying that y_c(x_0, v) = x_0 − v/2.
The following theorem is due to Brenier [19] in the case of the quadratic
cost (i.e. c(x, y) = ‖x − y‖²) and to Gangbo-McCann in the general case of
twisted costs [51].
Given X ⊆ Ω_X ⊂ R^d, we define P^{ac}(X) as the set of probability measures
on Ω_X that are absolutely continuous with respect to the Lebesgue measure
and with support included in X.

Theorem 12 (Brenier [19], Gangbo-McCann [51]). Let c ∈ C^1(Ω_X × Ω_Y) be a
twisted cost, let X ⊆ Ω_X, Y ⊆ Ω_Y be compact sets, and let (µ, ν) ∈ P^{ac}(X) ×
P(Y). Then, there exists a c-concave function ϕ ∈ Lip(X) such that ν =
T_#µ where T(x) = y_c(x, ∇ϕ(x)). Moreover, the only optimal transport plan
between µ and ν is γ_T.

Example 5. If h ∈ C^1(R^d) is strictly convex, in particular if h(x) = ‖x‖^p
with p > 1, then the map x ↦ ∇h(x) is injective. Take c(x, y) = h(x − y), so
that y ↦ ∇_x c(x, y) = ∇h(x − y) is also injective. Moreover, given x_0 ∈ R^d
and v ∈ R^d, the unique solution y to v = ∇h(x_0 − y) is
y = y_c(x_0, v) := x_0 − (∇h)^{-1}(v). As a consequence, under the hypotheses of
the theorem above, the transport map is of the form

T(x) = x − (∇h)^{-1}(∇ϕ(x)),

where ϕ is a c-concave function.
The following lemma shows that a transport plan is induced by a transport
map if it is concentrated on the graph of a map.

Lemma 13. Let γ ∈ Γ(µ, ν) and let T : X → Y be measurable and such that
γ({(x, y) ∈ X × Y | T(x) ≠ y}) = 0. Then, γ = γ_T.

Proof. By definition of γ_T, one has γ_T(A × B) = µ(T^{-1}(B) ∩ A) for all Borel
sets A ⊆ X and B ⊆ Y. On the other hand,

γ(A × B) = γ({(x, y) | x ∈ A and y ∈ B})
= γ({(x, y) | x ∈ A, y ∈ B and y = T(x)})
= γ({(x, y) | x ∈ A ∩ T^{-1}(B), y = T(x)})
= µ(A ∩ T^{-1}(B)),

thus proving the claim. □
Proof of Theorem 12. Enlarging X if necessary (while keeping it compact
and inside Ω_X), we may assume that spt(µ) is contained in the interior of
X. First note that by compactness of X × Y and since c is C^1, the cost
c is Lipschitz on X × Y. Take (ϕ, ϕ^c̄) a maximizing pair for (DP) with ϕ
c-concave. By the formula ϕ(x) = min_{y∈Y} c(x, y) + ϕ^c̄(y), one can see that
ϕ is Lipschitz. By Rademacher's theorem, ϕ is differentiable Lebesgue-almost
everywhere, and by the hypothesis µ ∈ P^{ac}(X), it is therefore differentiable
on a set B ⊆ spt(µ) with µ(B) = 1. Consider an optimal transport plan
γ ∈ Γ(µ, ν). For every pair of points (x_0, y_0) ∈ spt(γ) ∩ (B × Y), we have

∀x ∈ X, ϕ(x) − c(x, y_0) ≤ ϕ^c̄(y_0),

with equality at x = x_0, so that x_0 maximizes the function ϕ − c(·, y_0).
Since x_0 ∈ spt(µ), x_0 belongs to the interior of X, and one necessarily has
∇ϕ(x_0) = ∇_x c(x_0, y_0). Then, by the twist condition, one necessarily has
y_0 = y_c(x_0, ∇ϕ(x_0)). This shows that any optimal transport plan γ is
supported on the graph of the map T : x ∈ B ↦ y_c(x, ∇ϕ(x)), and γ = γ_T by
the previous lemma. □
We finish this section with a stability result for optimal transport maps
(a more general result can be found in [98, Chapter 5]).
Proposition 14 (Stability of OT maps). Let X ⊆ Ω_X and Y ⊆ Ω_Y be
compact subsets of open sets Ω_X, Ω_Y ⊆ R^d, and let c ∈ C^1(Ω_X × Ω_Y) be
a twisted cost. Let ρ ∈ P^{ac}(X), and let (µ_k) be a sequence in P(Y) of
measures converging weakly to µ ∈ P(Y). Define T_k (resp. T) as the
unique optimal transport map between ρ and µ_k (resp. ρ and µ). Then,
lim_{k→+∞} ‖T_k − T‖_{L¹(ρ)} = 0.

Remark 4. Note that unlike the stability theorem for optimal transport plans
(Theorem 7), the convergence in Proposition 14 is for the whole sequence and
not up to a subsequence. This theorem is not quantitative, and there exist
very few quantitative variants of it. We are aware of two such
results. To state them, given a fixed ρ ∈ P(X) and µ ∈ P(Y), we denote T_µ
the unique optimal transport map between ρ and µ.
• A first result of Ambrosio, reported in an article of Gigli [53, Proposition
3.3 and Corollary 3.4], shows that if µ_0 is such that the optimal
transport map T_{µ_0} is Lipschitz, then

‖T_µ − T_{µ_0}‖²_{L²(ρ)} ≤ C(µ_0) W₂(µ, µ_0).

The theorem in [53] holds for the quadratic cost on Rd . It was re-
cently generalized to other cost functions [5].
• Berman [15] proves a global estimate, not assuming the regularity of
Tµ0 but with a worse Hölder exponent, of the form
d−1
kTµ − Tµ0 k2L2 (ρ) 6 C W1 (µ, µ0 )1/2 ,
assuming that ρ is bounded from below on a compact convex do-
main of Rd , when the cost is quadratic. The constant then C only
depends on X, Y and ρ. Recently a similar bound with an exponent
independent on the dimension was obtained by Mérigot, Delalande
and Chazal [71]:
kTµ − Tµ0 k2L2 (ρ) 6 C W1 (µ, µ0 )1/15 .
Proof. As before, without loss of generality, we assume that spt(ρ) lies in
the interior of X. Let (ϕ_k, ψ_k) be solutions to (DP)_k which are c-conjugate
to each other and such that ϕ_k(x_0) = 0 for some x_0 ∈ X. Then, by stability
of Kantorovich potentials, there exists a subsequence (ϕ_k, ψ_k) (which we
do not relabel) which converges uniformly to (ϕ, ψ). Moreover, (ϕ, ψ) are
Kantorovich potentials for (DP), and are also c-conjugate to each other.
Since ϕ, ϕ_k ∈ Lip(X) are differentiable almost everywhere, there exists a
subset Z ⊆ spt(ρ) with ρ(Z) = 1 and such that for all x ∈ Z, ∇ϕ_k exists
for all k and ∇ϕ exists. Let x ∈ Z. Using

ϕ_k(x) − ψ_k(T_k(x)) = c(x, T_k(x)),

we get that for any cluster point y of the sequence (T_k(x))_k,

ϕ(x) − ψ(y) = c(x, y),
ϕ(x') − ψ(y) ≤ c(x', y) ∀x' ∈ X,

where the second inequality is obtained using ϕ_k ⊖ ψ_k ≤ c. Thus, as in the
proof of the Brenier-Gangbo-McCann theorem, x is a minimizer of c(·, y) −
ϕ, i.e. ∇_x c(x, y) = ∇ϕ(x), implying that y = y_c(x, ∇ϕ(x)) = T(x). By
compactness, this shows that the whole sequence (T_k(x))_k converges to T(x).
Therefore, T_k converges ρ-almost everywhere to T, and L¹(ρ) convergence
follows easily. □
2.3. Kantorovich’s functional. As already mentioned in Equation (2.20),
the Kantorovich’s dual problem (DP) can be expressed as an unconstrained
maximization problem:
Z Z
c
(DP) = max ψ dµ − ψdν.
ψ∈C 0 (Y ) X Y
This motivates the definition of Kantorovich’s functional as follows
Definition 9. The Kantorovitch functional is defined on C 0 (Y ) by
Z Z
c
K(ψ) = ψ dµ − ψdν. (2.24)
X Y
The Kantorovitch dual problem therefore amounts to maximizing the Kan-
torovitch functional:
(DP) = max K(ψ).
ψ∈C 0 (Y )

This subsection is devoted to the computation of the superdifferential
of Kantorovich's functional when Y is finite. This computation will
be used to construct and study algorithms for discretized optimal transport
problems. The definition, as well as basic properties, of the superdifferential
∂^+F of a function F are recalled in Appendix 5.1.

Proposition 15. Let X be a compact space, Y be finite, c ∈ C^0(X × Y),
µ ∈ P(X) and ν ∈ P(Y). Then, for all ψ_0 ∈ R^Y,

∂^+K(ψ_0) = {Π_{Y#}γ − ν | γ ∈ Γ_{ψ_0}(µ)}, (2.25)

where Γ_{ψ_0}(µ) is the set of probability measures on X × Y with first marginal
µ and supported on the c-subdifferential ∂^c ψ_0 (defined in Eq. (2.21)), i.e.

Γ_{ψ_0}(µ) = {γ ∈ P(X × Y) | Π_{X#}γ = µ and spt(γ) ⊆ ∂^c ψ_0}. (2.26)
Proof. Let γ ∈ Γ_{ψ_0}(µ). Then, for all ψ ∈ R^Y,

K(ψ) = ∫ ψ^c dµ − ∫ ψ dν
= ∫ ψ^c(x) dγ(x, y) − ∫ ψ dν
≤ ∫ (c(x, y) + ψ(y)) dγ(x, y) − ∫ ψ dν,

where we used Π_{X#}γ = µ to get the second equality and ψ^c(x) ≤ c(x, y) +
ψ(y) to get the inequality. Note also that equality holds if ψ = ψ_0, by
assumption on the support of γ. Hence,

K(ψ) ≤ K(ψ_0) + ∫ (ψ(y) − ψ_0(y)) dγ(x, y) − ∫ (ψ − ψ_0) dν
= K(ψ_0) + ⟨Π_{Y#}γ − ν|ψ − ψ_0⟩.

This implies by definition that Π_{Y#}γ − ν lies in the superdifferential ∂^+K(ψ_0),
giving us the inclusion

D(ψ_0) := {Π_{Y#}γ − ν | γ ∈ Γ_{ψ_0}(µ)} ⊆ ∂^+K(ψ_0).

Note also that the superdifferential of K is non-empty at any ψ_0 ∈ R^Y, so
that K is concave. As a concave function on the finite-dimensional space
R^Y, K is differentiable almost everywhere and one has ∂^+K(ψ) = {∇K(ψ)}
at differentiability points.
We now show that ∂^+K(ψ_0) ⊆ D(ψ_0), using the characterization of the
superdifferential recalled in the Appendix:

∂^+K(ψ_0) = conv{ lim_{n→∞} ∇K(ψ^n) | (ψ^n)_{n∈N} ∈ S },

where S is the set of sequences (ψ^n)_{n∈N} that converge to ψ_0 and such that
∇K(ψ^n) exists and admits a limit as n → +∞. Let v = lim_{n→∞} ∇K(ψ^n),
where (ψ^n)_{n∈N} belongs to the set S. For every n, there exists γ^n ∈ Γ_{ψ^n}(µ)
such that ∇K(ψ^n) = v^n := Π_{Y#}γ^n − ν. By compactness of P(X × Y), one
can assume (taking a subsequence if necessary) that γ^n weakly converges to
some γ, and it is not difficult to check that γ ∈ Γ_{ψ_0}(µ), ensuring that the
limit v = Π_{Y#}γ − ν belongs to D(ψ_0). Thus,

{ lim_{n→∞} ∇K(ψ^n) | (ψ^n)_{n∈N} ∈ S } ⊆ D(ψ_0).

Taking the convex hull and using the convexity of D(ψ_0), we get ∂^+K(ψ_0) ⊆
D(ψ_0) as desired. □
As a corollary of this proposition, we obtain an explicit expression for the
left and right partial derivatives of K, and a characterization of its
differentiability. In this corollary, we use the terminology of semi-discrete
optimal transport (Section 4.1): we will refer to the c-subdifferential at y ∈ Y
as the Laguerre cell associated to y and denote it by Lag_y(ψ):

Lag_y(ψ) := {x ∈ X | ∀z ∈ Y, c(x, y) + ψ(y) ≤ c(x, z) + ψ(z)}. (2.27)

We also need to introduce the strict Laguerre cell SLag_y(ψ):

SLag_y(ψ) := {x ∈ X | ∀z ∈ Y \ {y}, c(x, y) + ψ(y) < c(x, z) + ψ(z)}. (2.28)

Corollary 16 (Directional derivatives of K). Let ψ ∈ R^Y, y ∈ Y, and define
κ(t) = K(ψ^t) where ψ^t = ψ + t·1_y. Then κ is concave and

∂^+κ(t) = [µ(SLag_y(ψ^t)) − ν({y}), µ(Lag_y(ψ^t)) − ν({y})].

In particular, K is differentiable at ψ ∈ R^Y iff µ(Lag_y(ψ) \ SLag_y(ψ)) = 0
for all y ∈ Y, and in this case

∇K(ψ) = ( µ(Lag_y(ψ)) − ν({y}) )_{y∈Y}.

Proof. Using Hahn-Banach's extension theorem, one can easily see that the
superdifferential of κ at t is the projection of the superdifferential of K at ψ^t:

∂^+κ(t) = { ⟨π|1_y⟩ | π ∈ ∂^+K(ψ^t) }.

Combining with the previous proposition, we get

∂^+κ(t) = { ⟨Π_{Y#}γ − ν|1_y⟩ | γ ∈ Γ_{ψ^t}(µ) }
= { γ(X × {y}) − ν({y}) | γ ∈ Γ_{ψ^t}(µ) }.

To obtain the desired formula for ∂^+κ(t), it remains to prove that

max{ γ(X × {y}) | γ ∈ Γ_{ψ^t}(µ) } = µ(Lag_y(ψ^t)),
min{ γ(X × {y}) | γ ∈ Γ_{ψ^t}(µ) } = µ(SLag_y(ψ^t)).

We only prove the first equality, the second one being similar. Denote Z =
X \ Lag_y(ψ^t), so that for any γ ∈ Γ_{ψ^t}(µ),

γ(X × {y}) = γ(Lag_y(ψ^t) × {y}) + γ(Z × {y}).

Moreover, by definition of Lag_y(ψ^t), (Z × {y}) ∩ ∂^c ψ^t = ∅. Since γ belongs to
Γ_{ψ^t}(µ), we have spt(γ) ⊆ ∂^c ψ^t, so that γ(Z × {y}) = 0. This gives us

γ(X × {y}) = γ(Lag_y(ψ^t) × {y}) ≤ γ(Lag_y(ψ^t) × Y) = µ(Lag_y(ψ^t)),

where we used Π_{X#}γ = µ to get the last equality. This proves that

sup{ γ(X × {y}) | γ ∈ Γ_{ψ^t}(µ) } ≤ µ(Lag_y(ψ^t)).

To show equality, we consider an explicit γ ∈ Γ_{ψ^t}(µ), induced by a map
T : X → Y defined as follows: for x ∈ Lag_y(ψ^t), we set T(x) = y, and for
points x ∉ Lag_y(ψ^t), we define T(x) to be an arbitrary z ∈ Y such that
x ∈ Lag_z(ψ^t). Then, one can readily check that γ = (id, T)_#µ belongs to
Γ_{ψ^t}(µ) and that, by construction, γ(X × {y}) = µ(Lag_y(ψ^t)). □

3. Discrete optimal transport

In this part we present two algorithms for solving discrete optimal transport
problems, which can both be interpreted using Kantorovich's duality:
• The first one is Bertsekas' auction algorithm, which allows one to solve
optimal transport problems where the source and target measures
are uniform over two sets with the same cardinality, a case known
as the assignment problem in combinatorial optimization. Bertsekas'
algorithm is a coordinate-ascent method that iteratively modifies the
coordinates of the dual variable ψ so as to reach a maximizer of the
Kantorovich functional K(ψ).
• The second algorithm is the Sinkhorn-Knopp algorithm, which allows
one to solve the entropic regularization of (discrete) optimal transport
problems. This algorithm can be seen as a block-coordinate ascent
method, since it amounts to maximizing the dual of the regularized
optimal transport problem, denoted by K_η(ϕ, ψ), by alternately
optimizing with respect to the two dual variables ϕ and ψ.

3.1. Formulation of discrete optimal transport.

Primal and dual problems. We consider in this section that the two sets
X and Y are finite, and we consider two discrete probability measures
µ = Σ_{x∈X} µ_x δ_x and ν = Σ_{y∈Y} ν_y δ_y. This setting occurs frequently in
applications. The set of transport plans is then given by

Γ(µ, ν) = { γ = Σ_{x,y} γ_{x,y} δ_{(x,y)} | γ_{x,y} ≥ 0, Σ_{y∈Y} γ_{x,y} = µ_x, Σ_{x∈X} γ_{x,y} = ν_y },

and is often referred to as the transportation polytope. In this discrete setting,
we will conflate a transport plan γ ∈ Γ(µ, ν) with the matrix (γ_{x,y})_{(x,y)∈X×Y},
which formally is the density of γ with respect to the counting measure. The
constraint Σ_y γ_{x,y} = µ_x encodes the fact that all the mass from x is transported
somewhere in Y, while the constraint Σ_x γ_{x,y} = ν_y tells us that all the mass
at y is transported from somewhere in X. The Kantorovich problem for a
cost function c : X × Y → R then reads

(KP) = min_{γ∈Γ(µ,ν)} Σ_{x∈X, y∈Y} c(x, y) γ_{x,y}. (3.29)

As seen in Section 2.2, the dual (DP) of this linear programming problem
amounts to maximizing the Kantorovich functional K (2.24), which in this
setting can be expressed as

K(ψ) = Σ_{x∈X} min_{y∈Y} (c(x, y) + ψ(y)) µ_x − Σ_{y∈Y} ψ(y) ν_y, (3.30)

where ψ ∈ R^Y is a function over the finite set Y. Since strong duality holds
(Theorem 5), one has

(KP) = (DP) = max_{ψ∈R^Y} K(ψ).
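In this finite setting, K(ψ) and an element of its superdifferential (cf. Proposition 15 and Corollary 16) can be evaluated directly; here is a small NumPy sketch with our own naming:

```python
import numpy as np

def kantorovich_functional(C, mu, nu, psi):
    """Evaluate K(psi) from (3.30) and one supergradient of K at psi.

    C : (N, M) cost matrix; mu : (N,), nu : (M,), psi : (M,).
    The supergradient corresponds to sending each x to a single Laguerre
    cell chosen by argmin (ties broken arbitrarily), cf. Corollary 16.
    """
    scores = C + psi[None, :]                  # c(x, y) + psi(y)
    j_star = scores.argmin(axis=1)             # a Laguerre cell containing x
    K = scores[np.arange(len(mu)), j_star] @ mu - psi @ nu
    grad = np.bincount(j_star, weights=mu, minlength=len(nu)) - nu
    return K, grad
```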

Remark 5. Knowing a maximizer ψ ∈ R^Y of K does not directly allow one to
recover an optimal transport plan γ. However, by Corollary 6, we know that
a transport plan γ ∈ Γ(µ, ν) is optimal if and only if its support is included
in the c-subdifferential of ψ:

spt(γ) ⊂ ∂^c ψ = { (x, y) ∈ X × Y | ψ^c(x) = c(x, y) + ψ(y) }.

3.2. Linear assignment via coordinate ascent.

Assignment problem. When the two sets X and Y have the same cardinal N
and when µ and ν are the uniform probability measures over these sets, namely

µ = (1/N) Σ_{x∈X} δ_x,  ν = (1/N) Σ_{y∈Y} δ_y, (3.31)

then Monge's problem corresponds to the (linear) assignment problem (AP),
which is one of the most famous combinatorial optimization problems:

(AP) = min { (1/N) Σ_{x∈X} c(x, σ(x)) | σ : X → Y is a bijection }. (3.32)

This problem and its variants have generated a very important amount
of research, as demonstrated by the bibliography of the book by Burkard,
Dell'Amico and Martello on this topic [21].
Note that the set of bijections from X to Y has cardinal N!, making
it practically impossible to solve (AP) through direct enumeration. Using
Birkhoff's theorem on bistochastic matrices, we will show that the assignment
problem coincides with the Kantorovich problem.
Definition 10 (Bistochastic matrices). An N-by-N bistochastic matrix is a
square matrix M ∈ M_N(R) with non-negative coefficients such that the sum
of any row and any column equals one:

∀i ∈ {1, . . . , N}, Σ_j M_{ij} = 1,  ∀j ∈ {1, . . . , N}, Σ_i M_{ij} = 1.

We denote the set of N-by-N bistochastic matrices by B_N ⊆ M_N(R).


Definition 11 (Permutation matrix). The set of permutations (bijections)
from {1, . . . , N} to itself is denoted S_N. Given a permutation σ ∈ S_N, we
associate to it the permutation matrix

M[σ]_{ij} = 1 if σ(i) = j, and 0 if not.

One can easily check that if σ is a permutation, then M[σ] belongs to B_N.
Birkhoff's theorem on the other hand asserts that the extremal points of the
polyhedron B_N are the permutation matrices, implying thanks to the Krein-Milman
theorem that every bistochastic matrix can be obtained as a (finite) convex
combination of permutation matrices.

Theorem 17 (Birkhoff). The extremal points of B_N are the permutation
matrices. In particular, B_N = conv{M[σ] | σ ∈ S_N}.

Kantorovitch’s problem (KP) amounts to minimizing a linear function


over the set of bistochastic matrices which is convex. Birkoff’s theorem
implies that there exists a bijection that solves this problem (KP), hence the
following theorem:
Theorem 18. Let µ and ν be as in (3.31). Then, (AP) = (KP).

Proof. Take an arbitrary ordering of the points in X and Y , i.e. X =


{x1 , . . . , xN } and Y = {y1 , . . . , yN }. Then γ ∈ RX×Y ' MN (R) is a trans-
port plan between µ and ν iff N γ ∈ BN . Since bistochastic matrices include
permutation matrices, we have (KP) 6 (AP), and the converse follows from
the fact that the minimum in (KP) is attained at an extreme point of BN ,
i.e. a permutation matrix. 

Dual coordinate ascent methods. We follow Bertsekas [16] by trying to solve
the assignment problem (AP) through the unconstrained dual problem
max_{ψ∈R^Y} K(ψ). Combining Theorem 18 and Corollary 6, we have the
following proposition.

Proposition 19. The following statements are equivalent:
• ψ is a global maximizer of the Kantorovich functional K;
• there exists a bijection σ : X → Y that satisfies

∀x ∈ X, c(x, σ(x)) + ψ(σ(x)) = min_{y∈Y} c(x, y) + ψ(y).

A bijection σ satisfying this last equation is a solution to the linear
assignment problem.
The idea of Bertsekas [16] is to iteratively modify the weights ψ ∈ R^Y so as
to reach a maximizer of K. By Corollary 16, the gradient of the Kantorovich
functional, when it exists, is given by

∇K(ψ) = (1/N) ( card(Lag_y(ψ)) − 1 )_{y∈Y}.

In addition, recalling the definition of a Laguerre cell,

Lag_y(ψ) = {x ∈ X | ∀y' ∈ Y, c(x, y) + ψ(y) ≤ c(x, y') + ψ(y')},

one can see that card(Lag_y(ψ)) decreases when ψ(y) increases.
Therefore, in order to maximize the concave function K, it is natural to
increase the weight ψ(y) of any Laguerre cell that satisfies card(Lag_y(ψ)) > 1.
In the following lemma, we calculate the optimal increment, which is known
as the bid.
Lemma 20 (Bidding increment). Let ψ ∈ R^Y and y_0 ∈ Y be such that
Lag_{y_0}(ψ) ≠ ∅. Then the maximum of the function t ↦ K(ψ + t·1_{y_0}) is
reached at

bid_{y_0}(ψ) = max{ bid_{y_0}(ψ, x) | x ∈ Lag_{y_0}(ψ) },

where

bid_{y_0}(ψ, x) := min_{y∈Y\{y_0}} ( c(x, y) + ψ(y) ) − ( c(x, y_0) + ψ(y_0) ).

Proof. Denote ψ^t = ψ + t·1_{y_0}. For t ≥ 0 one has Lag_{y_0}(ψ^t) ⊆ Lag_{y_0}(ψ).
Remark also that for every x ∈ X, one has

x ∈ Lag_{y_0}(ψ^t) ⇔ ∀z ≠ y_0, c(x, y_0) + ψ(y_0) + t ≤ c(x, z) + ψ(z)
⇔ t ≤ min_{z∈Y\{y_0}} c(x, z) + ψ(z) − (c(x, y_0) + ψ(y_0))
⇔ t ≤ bid_{y_0}(ψ, x).

This implies that Lag_{y_0}(ψ^t) ≠ ∅ if and only if t ≤ bid_{y_0}(ψ). By Corollary 16,
the upper bound of the superdifferential ∂^+κ(t) of the function κ(t) = K(ψ +
t·1_{y_0}) is µ(Lag_{y_0}(ψ^t)) − 1/N. It is non-negative for t ∈ [0, bid_{y_0}(ψ)] and
strictly negative for t > bid_{y_0}(ψ). This directly implies (for instance by
(5.76)) that 0 ∈ ∂^+κ(bid_{y_0}(ψ)), so that the largest maximizer of κ is
bid_{y_0}(ψ). □
Remark 6 (Economic interpretation of the bidding increment). Assume that
Y is a set of houses owned by one seller and X is a set of customers who
want to buy a house. Given a set of prices ψ : Y → R, each customer x ∈ X
will make a compromise between the location of a house y ∈ Y (measured by
c(x, y)) and its price (measured by ψ(y)) by choosing a house among those
minimizing c(x, y) + ψ(y). In other words, x chooses y iff x ∈ Lag_y(ψ). Let
y be a given house. The seller of y wants to maximize his profit, hence to
increase ψ(y) as much as possible while keeping (at least) one customer. Let
x ∈ X be a customer interested in the house y (i.e. x ∈ Lag_y(ψ)). Then,
bid_y(ψ, x) tells us by how much it is possible to increase the price of y while
keeping it interesting for x. The best choice for the seller is to increase the
price by the maximum bid, which is the maximum raise such that there remains
at least one customer, giving the definition of bid_y(ψ).
Remark 7 (Naive coordinate ascent). A naive algorithm would be to choose
at each step a coordinate y ∈ Y such that Lag_y(ψ) ≠ ∅ and to increase
ψ(y) by the bidding increment bid_y(ψ). In practice, such an algorithm might
get stuck at a point which is not a global maximizer, a phenomenon which is
referred to as jamming in [17, §2]. This can happen when some
bidding increments bid_y(ψ) vanish, see Remark 8 below. Note that this is a
particular case of the well-known fact that coordinate ascent algorithms may
converge to points that are not maximizers when the maximized functional
is nonsmooth.
In order to tackle the problem of non-convergence of coordinate ascent,
Bertsekas and Eckstein changed the naive algorithm outlined above to impose
that the bids are at least ε > 0. To analyze their algorithm, we introduce
the notion of ε-complementary slackness, where ε can be seen as a tolerance.

Definition 12 (ε-complementary slackness). A partial assignment is a couple
(σ, S) where S ⊆ X and σ : S → Y is an injective map. A partial assignment
(σ, S) and a price function ψ ∈ R^Y satisfy ε-complementary slackness
if for every x in S the following inequality holds:

c(x, σ(x)) + ψ(σ(x)) ≤ min_{y∈Y} [c(x, y) + ψ(y)] + ε. (CS_ε)

In the economic interpretation, a partial assignment σ : S ⊆ X → Y satisfies
(CS_ε) with respect to prices ψ ∈ R^Y if every customer x ∈ S is assigned
to a house σ(x) which is “nearly optimal”, i.e. is within ε of minimizing
c(x, ·) + ψ(·) over Y.

Lemma 21. If σ : X → Y is a bijection which satisfies (CS_ε) together with
some ψ ∈ R^Y, then

(KP) ≤ (1/N) Σ_{x∈X} c(x, σ(x)) ≤ (KP) + ε. (3.33)

Proof. The first inequality just comes from the fact that σ induces a particular
transport plan. For the second inequality, summing the (CS_ε) condition over
x ∈ X, one gets

(1/N) Σ_{x∈X} ( c(x, σ(x)) + ψ(σ(x)) ) ≤ (1/N) Σ_{x∈X} min_{y∈Y} ( c(x, y) + ψ(y) ) + ε.

Since σ is a bijection, (1/N) Σ_{x∈X} ψ(σ(x)) = ⟨ψ|ν⟩, so this leads to

(1/N) Σ_{x∈X} c(x, σ(x)) ≤ K(ψ) + ε ≤ (DP) + ε = (KP) + ε. □

Bertsekas’ auction algorithm. Bertsekas’ auction algorithm maintains a par-


tial matching (σ, S) and prices ψ ∈ RY that together satisfy ε-complementary
slackness. At the end of the execution, σ is a bijection, and (σ, ψ) satisfy
the ε-CS condition.

Algorithm 1 Bertsekas' auction algorithm

function Auction(c, ε, ψ = 0)
    S ← ∅                                        ▷ All points are unassigned
    while ∃x ∈ X \ S do
        y_0 ← arg min_{y∈Y} c(x, y) + ψ(y)
        y_1 ← arg min_{y∈Y\{y_0}} c(x, y) + ψ(y)
        ψ(y_0) ← ψ(y_0) + (c(x, y_1) + ψ(y_1)) − (c(x, y_0) + ψ(y_0)) + ε
        if ∃x' ∈ X s.t. σ(x') = y_0 then         ▷ y_0 is “stolen” from x'
            S ← S \ {x'}
        S ← S ∪ {x}, σ(x) ← y_0
    return σ, ψ
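A direct Python transcription of Algorithm 1, to make the bookkeeping explicit (a sketch assuming N ≥ 2 and a dense cost matrix; the data structures are our own choices):

```python
import numpy as np

def auction(C, eps, psi=None):
    """Bertsekas' auction algorithm (Algorithm 1) for an N-by-N cost matrix C.

    Returns sigma (sigma[i] = index of the house assigned to customer i)
    and the final prices psi; (sigma, psi) satisfy (CS_eps).
    """
    N = C.shape[0]
    psi = np.zeros(N) if psi is None else psi.copy()
    sigma = np.full(N, -1, dtype=int)      # -1 means unassigned
    owner = np.full(N, -1, dtype=int)      # owner[j] = current buyer of y_j
    unassigned = list(range(N))
    while unassigned:
        i = unassigned.pop()
        scores = C[i] + psi
        j0 = scores.argmin()                    # best house for x_i
        second = np.partition(scores, 1)[1]     # second-smallest score,
                                                # i.e. min over Y \ {y_0} up to ties
        psi[j0] += second - scores[j0] + eps    # bid of Lemma 20, plus eps
        if owner[j0] >= 0:                      # y_0 is "stolen" from its owner
            sigma[owner[j0]] = -1
            unassigned.append(owner[j0])
        owner[j0], sigma[i] = i, j0
    return sigma, psi
```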

Remark 8 (Non-convergence when ε = 0). Consider for instance X = {x_1, x_2, x_3},
Y = {y_1, y_2, y_3}, c(x, y) = ‖x − y‖ and

y_1 = (0, 1), y_2 = (0, −1), y_3 = (10, 0),
x_1 = (−1, 0), x_2 = (−2, 0), x_3 = (−3, 0).

The points x_1, x_2 and x_3 are equidistant to y_1 and y_2 and “far” from y_3.
Implementing the auction algorithm with ε = 0 then leads to an infinite loop.
Indeed, at every step, the customers x_1, x_2, x_3 pick one of the houses y_1 or
y_2, but do not raise the prices, as the second-best house is equally interesting.
This “bidding war” goes on forever.
Remark 9 (Lower bound on the number of steps). Consider the same setting
as before, but with ε > 0. At the beginning of the algorithm, the customers
x_1, x_2 and x_3 alternately pick y_1 or y_2. As long as y_3 has never been
selected, the difference of prices between y_1 and y_2 is either 0 or ε, so that
the bid is always ε or 2ε. After n iterations, the price of the houses y_1, y_2
is at most equal to 2nε. This means that the third house y_3 will never be
chosen until 2nε ≥ min_i ‖x_i − y_3‖ =: C. As a consequence, the number of
iterations is at least C/(2ε).
The lower bound in the previous remark has a matching upper bound.
Theorem 22. If one starts the auction algorithm with ψ = 0, then
• the number of steps in the auction algorithm is at most N(C/ε + 1),
where C := max_{X×Y} c(x, y);
• the number of operations is at most N²(C/ε + 1).
Moreover, the bijection σ and the prices ψ returned by the algorithm satisfy
(CS_ε), so that in particular σ is ε-optimal in the sense of (3.33).

Remark 10. Note that the computational complexity of this algorithm is quite bad: if ε = 10^{−k} and C = 1, the worst-case number of operations is of order 10^k N². It would be highly desirable to replace the factor 1/ε by log(1/ε). In the next paragraph, we see how this can be achieved using a scaling technique.
The proof of Theorem 22 relies on the following lemma, whose proof is
straightforward.
Lemma 23. Over the course of the auction algorithm,
(i) the set of selected “houses” σ(S) is increasing w.r.t. inclusion;
(ii) (ψ, σ) always satisfies the ε-complementary slackness condition;
(iii) each price increment is at least ε.
Proof of Theorem 22. Suppose that after i steps the algorithm has not stopped. Then, there exists a point y0 in Y that does not belong to σ(S), i.e. whose price has not increased since the beginning of the algorithm, i.e. ψ_i(y0) = 0. Suppose now that there exists y1 whose price has been raised n times with n > C/ε + 1. Then, by Lemma 23.(iii), one has for every x ∈ X

ψ_i(y0) + c(x, y0) = c(x, y0) ≤ C < nε − ε ≤ ψ_i(y1) − ε ≤ ψ_i(y1) + c(x, y1) − ε.

This contradicts the fact that y1 was chosen at a former step. From this, we deduce that there is no point in Y whose price has been raised n times with n > C/ε + 1. With at most C/ε + 1 price rises for each of the N houses, and every step costing N operations (finding the minimum among N values), we deduce the desired bound. □
Auction algorithm with ε-scaling. Following [43], Bertsekas and Eckstein [17]
modified Algorithm 1 using a scaling technique which improves dramatically
both the running time and worst-case complexity of the algorithm. Note that
similar scaling techniques have also been applied to improve other algorithms
for the assignment problem, see e.g. [54, 47].
The modified algorithm can be described as follows: define ψ_0 = 0 and, recursively, let ψ_{k+1} be the prices returned by Auction(c, ε_k, ψ_k), where ε_k = C/2^k. One stops when ε_k < ε, so that the number of runs of the unscaled auction algorithm is bounded by log₂(C/ε). Bounding carefully the complexity of each auction run, one gets:
Theorem 24. The auction algorithm with scaling constructs an η-optimal assignment in time O(N³ log(C/η)).

Algorithm 2 Bertsekas-Eckstein auction algorithm with ε-scaling

function AuctionScaling(c, η)
    ε ← C, ψ ← 0
    while ε > η do
        σ, ψ ← Auction(c, ε, ψ)
        ε ← ε/2
    return σ, ψ
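
In code, the scaling loop is a thin wrapper around the previous sketch (again our own illustration; auction refers to the function sketched after Algorithm 1):

    import numpy as np

    def auction_scaling(c, eta):
        # Algorithm 2: run auction() with geometrically decreasing eps.
        C = c.max()
        eps, psi = C, np.zeros(c.shape[0])
        sigma = None
        while eps > eta:
            sigma, psi = auction(c, eps, psi)
            eps /= 2
        return sigma, psi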

Lemma 25. Consider a bijection σ0 : X → Y, an injective map σ : S ⊆ X → Y, and two price vectors ψ0, ψ : Y → R. Assume that (σ0, ψ0) and (σ, ψ) satisfy respectively the λ- and ε-complementary slackness conditions, with ε ≤ λ. Moreover, suppose that S ≠ X and that ψ0 and ψ agree on the set Y \ σ(S). Then,

∀y ∈ Y, ψ(y) ≤ ψ0(y) + N(λ + ε).

Proof. Consider a point y0 in Y, and define yk+1 recursively as follows: (a) if yk ∈ σ(S), let xk := σ^{−1}(yk) and yk+1 := σ0(xk); (b) if yk ∉ σ(S), then stop. The ε-complementary slackness for (ψ, σ) at (xk, yk) implies

c(xk, yk) + ψ(yk) ≤ min_{y∈Y} [c(xk, y) + ψ(y)] + ε ≤ ψ(yk+1) + c(xk, yk+1) + ε. (3.34)

Similarly, λ-CS for (ψ0, σ0) at (xk, yk+1) with y = yk implies

ψ0(yk+1) + c(xk, yk+1) ≤ c(xk, yk) + ψ0(yk) + λ. (3.35)

Summing the inequalities (3.34) and (3.35) for k = 0 to k = K − 1 gives

ψ0(yK) − ψ0(y0) + ψ(y0) − ψ(yK) ≤ K(λ + ε).

By assumption, the final point yK does not belong to σ(S), so that ψ(yK) = ψ0(yK). This gives ψ(y0) ≤ ψ0(y0) + K(λ + ε), and we conclude by remarking that the path (y0, x0, . . . , yK) is simple, i.e. K ≤ N. □

Proof of Theorem 24. Lemma 25 implies that during run k + 1 of the (unscaled) auction algorithm, the price vector never grows larger than ψ0 + (ε + λ)N = ψ0 + 3εN, with λ := ε_k and ε := ε_{k+1} = λ/2. Since at each step the price grows by at least ε, there are at most 3N² steps in run k + 1. Taking into account the cost of finding min_{y∈Y} c(x, y) + ψ(y) at each step, the computational complexity of each auction run is therefore O(N³). Since the number of runs is O(log(C/η)), we get the claimed estimate. □

Implementations of the auction algorithm. One of the most expensive phases of the auction algorithm is the computation of the bid. Computing the bid for a certain customer x ∈ X requires browsing through all the houses y ∈ Y in order to determine the smallest value of c(x, y) + ψ(y), y ∈ Y. The cost of determining the bid accounts for a factor N = Card(Y) in the computational complexity of the auction algorithm in Theorems 22 and 24. We mention two possible ways to overcome this difficulty.

Exploiting the geometry of the cost. The first idea is to exploit the geometry of the space in order to reduce the cost of finding the minimum of c(x, y) + ψ(y), y ∈ Y, which accounts for a factor N in the complexity analysis of Theorem 24. The computation of this minimum is similar to the nearest-neighbor problem in computational geometry, and nearest neighbors can sometimes be found in log(N) time, after some preprocessing. For instance, in the case of c(x, y) = ‖x − y‖² on R^d, and for ψ ≥ 0, one can rewrite

c(x, y) + ψ(y) = ‖x − y‖² + (√ψ(y) − 0)² = ‖(x, 0) − (y, √ψ(y))‖²,

thus showing that finding the smallest value of c(x, y) + ψ(y) over Y amounts to finding the closest point to (x, 0) in the set {(y, √ψ(y)) | y ∈ Y} ⊆ R^{d+1}. This idea and variants thereof lead to practical and theoretical improvements, both for the auction algorithm and for other algorithms for the assignment problem. We refer to [63, 1] and references therein.
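
To make the lifting concrete, here is a small Python sketch (our own illustration, not code from the references): it answers a single query with a k-d tree on the lifted points, assuming ψ ≥ 0 and the squared Euclidean cost. In an auction loop the tree would have to be rebuilt or updated as prices change, which is where the actual algorithmic work lies.

    import numpy as np
    from scipy.spatial import cKDTree

    def best_house(x, Y, psi):
        # Lift each house y to (y, sqrt(psi(y))); requires psi >= 0.
        lifted = np.hstack([Y, np.sqrt(psi)[:, None]])   # shape (N, d+1)
        tree = cKDTree(lifted)
        # The nearest lifted point to (x, 0) minimizes |x - y|^2 + psi(y).
        _, j = tree.query(np.append(x, 0.0))
        return j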

Exploiting the graph structure of solutions. When the cost satisfies the Twist
condition (2.23) on R^d and the source measure is absolutely continuous,
Theorem 12 guarantees that the solution to the Kantorovich’s problem is
concentrated on a graph, i.e. dim(spt(γ)) = d while a priori, the dimension
of spt(γ) ⊆ R2d could be as high as 2d. It is natural, in view of the stability of
the optimal transport plans (Theorem 7), to hope that this feature remains
true at the discrete level, meaning that one expects that the support of the
discrete solution concentrates on a lower-dimensional graph G. One could then try to use this phenomenon to prune the search space, i.e. to take the minimum of c(x, y) + ψ(y) not over the whole space but over points y such that (x, y) lies “close” to G. In practice, G is unknown but can be estimated in a coarse-to-fine way. This idea, or variants thereof, has been used as a heuristic in several works [70, 78, 10], and has been analyzed more precisely
by Bernhard Schmitzer [89, 90].

3.3. Discrete optimal transport via entropic regularization. We now turn to another method to construct approximate solutions to optimal transport problems between probability measures on two finite sets X and Y. Here, the measures are not supposed uniform any more, and we set

µ = Σ_{x∈X} µ_x δ_x,  ν = Σ_{y∈Y} ν_y δ_y.

For simplicity, we assume throughout that all the points in X and Y carry some mass, that is, min(min_{x∈X} µ_x, min_{y∈Y} ν_y) > 0. As before, we conflate a transport plan γ ∈ Γ(µ, ν) with its density (γ_{x,y})_{(x,y)∈X×Y}.

Entropic regularization problem. We start from the primal formulation of the optimal transport problem, but instead of imposing the non-negativity constraints γ_{x,y} ≥ 0, we add a term to the transport cost which penalizes (minus) the entropy of the transport plan and acts as a barrier for the non-negativity constraint:

H(γ) = Σ_{x∈X, y∈Y} h(γ_{x,y}),  where  h(t) = t(log(t) − 1) if t > 0,  h(0) = 0,  h(t) = +∞ if t < 0. (3.36)

The regularized problem is the following minimization problem:

(KPη) := min_{γ∈Γ(µ,ν)} ⟨c|γ⟩ + ηH(γ),  where  Γ(µ, ν) = {γ = (γ_{x,y}) | Σ_{y∈Y} γ_{x,y} = µ_x, Σ_{x∈X} γ_{x,y} = ν_y}. (3.37)

Theorem 26. The problem (KPη) has a unique solution γ, which belongs to Γ(µ, ν). Moreover, if min_{x∈X} µ_x > 0 and min_{y∈Y} ν_y > 0, then

∀(x, y) ∈ X × Y, γ_{x,y} > 0.

Lemma 27. H : γ ∈ (0, 1]^{X×Y} ↦ Σ_{x,y} h(γ_{x,y}) is 1-strongly convex.

Proof. From h″(t) = 1/t, one sees that the Hessian D²H(γ) is diagonal with diagonal coefficients 1/γ_{x,y} ≥ 1, since γ_{x,y} ∈ (0, 1]. □
Proof of Theorem 26. The regularized problem (KPη) amounts to minimizing a continuous and coercive function over a closed convex set, thus showing existence. Let us denote by γ* a solution of (KPη). Then, γ* has finite entropy, so that it satisfies the constraint γ*_{x,y} ≥ 0. This implies that γ* is a transport plan between µ and ν. We now prove by contradiction that the set Z := {(x, y) | γ*_{x,y} = 0} is empty. For this purpose, we define a new transport plan γ^ε ∈ Γ(µ, ν) by γ^ε = (1 − ε)γ* + ε µ⊗ν, and we give an upper bound on the energy of γ^ε. We first observe that, by convexity of h : r ↦ r(log r − 1), one has

h(γ^ε_{x,y}) ≤ (1 − ε)h(γ*_{x,y}) + ε h(µ_x ν_y) ≤ h(γ*_{x,y}) + O(ε).

We now consider some (x, y) ∈ Z. Introducing C = min_{x,y} µ_x ν_y, which is strictly positive by assumption, we have

h(γ^ε_{x,y}) = h(ε µ_x ν_y) = µ_x ν_y ε (log ε + log(µ_x ν_y)) − µ_x ν_y ε ≤ Cε log ε + O(ε).

Summing the two previous estimates over Z and (X × Y) \ Z, and setting n = Card(Z), we get

H(γ^ε) ≤ H(γ*) + Cnε log ε + O(ε).

Since in addition we have by linearity ⟨c|γ^ε⟩ ≤ ⟨c|γ*⟩ + O(ε), we get

⟨c|γ*⟩ + ηH(γ*) ≤ ⟨c|γ^ε⟩ + ηH(γ^ε) ≤ ⟨c|γ*⟩ + ηH(γ*) + ηCnε log ε + O(ε),

where the first inequality comes from the optimality of γ*. Thus ηCnε log ε + O(ε) ≥ 0 as ε → 0, which is possible only if n = Card(Z) vanishes, implying the strict positivity of γ*.

By continuity of the function minimized in (KPη), the set of solutions of (KPη) is closed, hence compact, and by the strict positivity proven above it is included in [δ, +∞)^{X×Y} for some δ > 0. Therefore, by Lemma 27, the regularized problem (KPη) amounts to minimizing a coercive and strictly convex function over a closed convex set, thus showing uniqueness of the solution. □

Dual formulation. We start by deriving (formally) the dual problem, and first introduce the Lagrangian of (KPη):

L(γ, ϕ, ψ) := Σ_{x,y} [γ_{x,y} c(x, y) + η h(γ_{x,y})] + Σ_{x∈X} ϕ(x) (µ_x − Σ_{y∈Y} γ_{x,y}) + Σ_{y∈Y} ψ(y) (Σ_{x∈X} γ_{x,y} − ν_y), (3.38)

where ϕ : X → R and ψ : Y → R are the Lagrange multipliers. Then,

(KPη) = min_γ sup_{ϕ,ψ} L(γ, ϕ, ψ).

As always, the dual problem is obtained by inverting the infimum and the
supremum. We also simplify slightly the expressions:
sup_{ϕ,ψ} min_γ L(γ, ϕ, ψ) = sup_{ϕ,ψ} min_γ Σ_{x,y} γ_{x,y} (c(x, y) + ψ(y) − ϕ(x) + η(log(γ_{x,y}) − 1)) + Σ_{x∈X} ϕ(x)µ_x − Σ_{y∈Y} ψ(y)ν_y. (3.39)

Taking the derivative with respect to γ_{x,y}, we find that for a given (ϕ, ψ), the optimal γ must satisfy

c(x, y) + ψ(y) − ϕ(x) + η log(γ_{x,y}) = 0,  i.e.  γ_{x,y} = e^{(ϕ(x)−ψ(y)−c(x,y))/η}. (3.40)
Putting these values in Equation (3.39) gives the following definition:

Definition 13 (Dual regularized problem). The dual of the regularized optimal transport problem is defined by

(DPη) = sup_{ϕ,ψ} Kη(ϕ, ψ), (3.41)

where

Kη(ϕ, ψ) := − Σ_{(x,y)∈X×Y} η e^{(ϕ(x)−ψ(y)−c(x,y))/η} + Σ_{x∈X} ϕ(x)µ_x − Σ_{y∈Y} ψ(y)ν_y. (3.42)
We can now state the strong duality result.

Theorem 28 (Strong duality). Strong duality holds and the maximum in the dual problem is attained, i.e. there exist ϕ ∈ R^X and ψ ∈ R^Y such that
(KPη) = (DPη) = Kη(ϕ, ψ).

Corollary 29. If (ϕ, ψ) is the solution to the dual problem (DPη), then the solution γ of (KPη) is given by

γ_{x,y} = e^{(ϕ(x)−ψ(y)−c(x,y))/η}.
Corollary 29 is a direct consequence of the relation (3.40). This holds be-
cause, unlike the original linear programming formulation of optimal trans-
port, the regularized problem (KPη ) is smooth and strictly convex.
Proof of Theorem 28. Weak duality (KPη) ≥ (DPη) always holds. To prove strong duality, we denote by γ* the solution to (KPη), and we note that by Theorem 26, γ*_{x,y} > 0 for all (x, y) ∈ X × Y. This implies that the optimized functional γ ↦ ⟨c|γ⟩ + ηH(γ) is C¹ in a neighborhood of γ*. Thus, there exist Lagrange multipliers for the equality-constrained problem, i.e. ϕ̃ ∈ R^X and ψ̃ ∈ R^Y such that

∇_γ L(γ*, ϕ̃, ψ̃) = 0.

Since the function L(·, ϕ̃, ψ̃) is convex, this implies that γ* = argmin_γ L(γ, ϕ̃, ψ̃). Hence

(DPη) = sup_{ϕ,ψ} min_γ L(γ, ϕ, ψ) ≥ min_γ L(γ, ϕ̃, ψ̃) = L(γ*, ϕ̃, ψ̃) = (KPη).

The last equality follows from the fact that γ* satisfies the constraints and is a solution to (KPη). Thus (DPη) = (KPη). □

Regularized c-transform. A natural way to maximize Kη(ϕ, ψ) is to maximize alternately in ϕ and ψ. In the case of entropy-regularized optimal transport, each of the partial maximization problems (max_ϕ Kη(ϕ, ψ) and max_ψ Kη(ϕ, ψ)) has an explicit solution, which is connected to the notion of c-transform in (non-regularized) optimal transport:

Proposition 30. The following holds:
(i) Given ψ ∈ R^Y, the maximum of Kη(·, ψ) is attained at a unique point in R^X, denoted ψ^{c,η}, and defined by

ψ^{c,η}(x) = η log(µ_x) − η log( Σ_{y∈Y} e^{(−c(x,y)−ψ(y))/η} ). (3.43)

(ii) Given ϕ ∈ R^X, the maximum of Kη(ϕ, ·) is attained at a unique point in R^Y, denoted ϕ^{c,η}, and defined by

ϕ^{c,η}(y) = −η log(ν_y) + η log( Σ_{x∈X} e^{(−c(x,y)+ϕ(x))/η} ). (3.44)

Proof. To prove (i), consider ϕ ∈ R^X the maximizer of Kη(·, ψ). Setting the derivative of Kη with respect to the variable ϕ(x) to zero gives us

µ_x = e^{ϕ(x)/η} Σ_{y∈Y} e^{−(ψ(y)+c(x,y))/η},

implying the desired formula. The second formula is proven similarly. □



Definition 14 (Regularized c-transform). Given ψ ∈ R^Y, we will call the function ψ^{c,η} defined by (3.43) its regularized c-transform. Similarly, given ϕ ∈ R^X, we call the function ϕ^{c,η} defined by (3.44) its regularized c-transform.
Remark 11 (Relation to the c-transform). As the notation indicates, ψ^{c,η} is related to the c-transform used in optimal transport (Def. 7). Indeed, when η tends to zero, one has

lim_{η→0} ψ^{c,η}(x) = lim_{η→0} η [ log(µ_x) − log( Σ_{y∈Y} e^{(−c(x,y)−ψ(y))/η} ) ] = min_{y∈Y} [c(x, y) + ψ(y)] = ψ^c(x).

This explains the choice of notation: ψ^{c,η} is a smoothed version of the c-transform introduced in Definition 7.
The following two properties are very similar to properties holding for the standard c-transform. In the following, we denote by ‖·‖_{o,∞} the pseudo-norm of uniform convergence up to addition of a constant:

‖f‖_{o,∞} = inf_{a∈R} ‖f + a‖_∞ = (1/2)(sup f − inf f).

This pseudo-norm will be very useful to state convergence results for the Sinkhorn-Knopp algorithm for solving the regularized optimal transport problem.
Proposition 31. Let ψ, ψ̄ ∈ R^Y. Then,
(i) for a ∈ R, (ψ + a)^{c,η} = ψ^{c,η} + a;
(ii) ‖ψ^{c,η}‖_{o,∞} ≤ η ‖log(µ)‖_{o,∞} + ‖c‖_{o,∞};
(iii) ‖ψ^{c,η} − ψ̄^{c,η}‖_{o,∞} ≤ ‖ψ − ψ̄‖_{o,∞}.
Similar properties hold for the map ϕ ∈ R^X ↦ ϕ^{c,η}.


Proof. (ii) Using the formula (3.43) and the bound c(x, y) − c(x′, y) ≤ sup c − inf c, we get

ψ^{c,η}(x) − ψ^{c,η}(x′) = η (log(µ_x) − log(µ_{x′})) + η [ log( Σ_{y∈Y} e^{(−c(x′,y)−ψ(y))/η} ) − log( Σ_{y∈Y} e^{(−c(x,y)−ψ(y))/η} ) ]
≤ η (sup log(µ) − inf log(µ)) + sup c − inf c,

implying the desired inequality.

(iii) If we show that ‖ψ^{c,η} − ψ̄^{c,η}‖_∞ ≤ ‖ψ − ψ̄‖_∞, the same inequality with ‖·‖_{o,∞} will follow easily using (i). Using ψ(y) ≤ ψ̄(y) + ‖ψ − ψ̄‖_∞, we have

ψ^{c,η}(x) − ψ̄^{c,η}(x) = −η log( Σ_{y∈Y} e^{(−c(x,y)−ψ(y))/η} ) + η log( Σ_{y∈Y} e^{(−c(x,y)−ψ̄(y))/η} ) ≤ ‖ψ − ψ̄‖_∞. □

Regularized Kantorovitch functional. As in standard optimal transport (see §2.3) and following Cuturi and Peyré [36], we can express the regularized dual maximization problem (3.41) using only the variable ψ ∈ R^Y.

Definition 15 (Regularized Kantorovitch functional). The regularized Kantorovitch functional Kη : R^Y → R is defined by

Kη(ψ) = max_{ϕ∈R^X} Kη(ϕ, ψ) = ⟨ψ^{c,η}|µ⟩ − ⟨ψ|ν⟩. (3.45)

Since ψ^{c,η} has a closed-form expression, the functional Kη can be computed explicitly. This explicit expression is a special feature of the choice of the entropy as the regularization. In the next formula, H(µ) = Σ_{x∈X} µ_x log(µ_x):

Kη(ψ) = −η Σ_{x∈X} µ_x log( Σ_{y∈Y} e^{(−c(x,y)−ψ(y))/η} ) + ηH(µ) − Σ_{y∈Y} ψ(y)ν_y.

Remark 12. Note the similarity between the formula for the Kantorovich functional derived from regularized transport (3.45) and the formula for the Kantorovich functional without regularization (2.24). Note also that Kη is invariant by addition of a constant, namely Kη(ψ + λ1_Y) = Kη(ψ) for any λ ∈ R, where 1_Y = Σ_{y∈Y} 1_y is the constant function equal to one.
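
For concreteness, this formula can be evaluated in a numerically stable way with a log-sum-exp; the following sketch is our own illustration (the function name and array conventions are assumptions: c is an n × m matrix, µ, ν, ψ are vectors):

    import numpy as np
    from scipy.special import logsumexp

    def K_eta(psi, c, mu, nu, eta):
        # lse[x] = log sum_y exp((-c(x, y) - psi(y)) / eta)
        lse = logsumexp((-c - psi[None, :]) / eta, axis=1)
        H_mu = np.sum(mu * np.log(mu))
        return -eta * np.sum(mu * lse) + eta * H_mu - np.sum(psi * nu)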

In order to express the gradient and the Hessian of Kη, we introduce the notion of smoothed Laguerre cells.

Definition 16 (Smoothed Laguerre cells). Given ψ ∈ R^Y, we define

RLag^η_y(ψ) = e^{−(c(·,y)+ψ(y))/η} / Σ_{z∈Y} e^{−(c(·,z)+ψ(z))/η}. (3.46)

Unlike the standard Laguerre cell Lag_y(ψ) defined in (2.27), which is a set, RLag^η_y(ψ) is a function. The family (RLag^η_y(ψ))_{y∈Y} is a partition of unity, meaning that the sum over y of RLag^η_y(ψ) equals one. One can loosely think of the regularized Laguerre cells as smoothed indicator functions of the (standard) Laguerre cells. In particular,

lim_{η→0} RLag^η_y(ψ)(x) = 0 if x ∉ Lag_y(ψ), and = 1 if x ∈ SLag_y(ψ),

where SLagy (ψ) is the strict Laguerre cell introduced in (2.28). We also
introduce the two quantities

G^η_y(ψ) = ⟨RLag^η_y(ψ) | µ⟩,

G^η_{yz}(ψ) = (1/η) ⟨RLag^η_y(ψ) RLag^η_z(ψ) | µ⟩ if z ≠ y,  and  G^η_{yy}(ψ) = − Σ_{z≠y} G^η_{yz}(ψ).

Informally, G^η_y(ψ) measures the quantity of mass of µ within the regularized Laguerre cell RLag^η_y(ψ).

Theorem 32.
• The regularized Kantorovitch functional Kη is C^∞ and concave, with first and second-order partial derivatives given by

∀y ∈ Y, ∂Kη/∂1_y(ψ) = G^η_y(ψ) − ν_y,
∀y ≠ z ∈ Y, ∂²Kη/∂1_z∂1_y(ψ) = G^η_{yz}(ψ).

• The function Kη is strictly concave on the orthogonal complement of the constant functions. More precisely, for every ψ ∈ R^Y one has

∀v ∈ R^Y \ {0} s.t. Σ_{y∈Y} v(y) = 0,  D²Kη(ψ)(v, v) < 0.

• If ψ is a maximizer in (DPη), then the solution to (KPη) is given by

γ = Σ_{x,y} γ_{x,y} δ_{(x,y)},  with γ_{x,y} = RLag^η_y(ψ)(x) µ_x.

Proof. For every y ∈ Y, the derivative is given by

∂Kη/∂1_y(ψ) = Σ_{x∈X} µ_x e^{(−ψ(y)−c(x,y))/η} / ( Σ_{z∈Y} e^{(−ψ(z)−c(x,z))/η} ) − ν_y = G^η_y(ψ) − ν_y.

The second-order derivative is given, for z ≠ y, by

∂²Kη/∂1_z∂1_y(ψ) = Σ_{x∈X} µ_x (1/η) e^{(−ψ(y)−c(x,y))/η} e^{(−ψ(z)−c(x,z))/η} / ( Σ_{w∈Y} e^{(−ψ(w)−c(x,w))/η} )² = G^η_{yz}(ψ).

Differentiating the relation Σ_{y∈Y} G^η_y(ψ) = 1 gives the desired formula for the second-order derivatives when z = y. The Hessian of Kη is therefore symmetric and diagonally dominant, with negative diagonal coefficients. This implies that the Hessian is negative semi-definite, hence that Kη is concave. Let us now show that ker H = R1_Y, where H = D²Kη(ψ). Consider v ∈ ker H and let y_0 ∈ Y be a point where v attains its maximum. Then, using Hv = 0, and in particular (Hv)(y_0) = 0, one has

0 = Σ_{y≠y_0} H_{y,y_0} v(y) + H_{y_0,y_0} v(y_0) = Σ_{y≠y_0} H_{y,y_0} (v(y) − v(y_0)),

where the second equality follows from H_{y_0,y_0} = −Σ_{y≠y_0} H_{y,y_0}. Since for every y ≠ y_0 one has H_{y,y_0} > 0 and v(y_0) − v(y) ≥ 0, this implies that v(y) = v(y_0) for all y. Therefore ker H ⊆ R1_Y. The reverse inclusion is obvious, and therefore Kη is strictly concave on the orthogonal complement of the constant functions.

To prove the last claim, we note that if ψ maximizes Kη(·), then (ψ^{c,η}, ψ) maximizes Kη(·, ·). By Corollary 29, with ϕ = ψ^{c,η}, the optimal transport plan γ is

γ_{x,y} = e^{(ϕ(x)−ψ(y)−c(x,y))/η} = e^{−(ψ(y)+c(x,y))/η} µ_x / Σ_{z∈Y} e^{−(ψ(z)+c(x,z))/η} = RLag^η_y(ψ)(x) µ_x. □

Sinkhorn-Knopp as block coordinate ascent. We present here the Sinkhorn-Knopp algorithm, which computes a maximizer of the dual problem (DPη) by optimizing the functional Kη alternately in ϕ and ψ. The iterations are defined by

ϕ^{(k+1)} = (ψ^{(k)})^{c,η},  ψ^{(k+1)} = (ϕ^{(k+1)})^{c,η}, (3.47)

or equivalently ψ^{(k+1)} = S(ψ^{(k)}), where

S(ψ) = (ψ^{c,η})^{c,η}. (3.48)
Remark 13 (Relation to matrix factorization). This algorithm is in fact a re-
formulation, using a logarithmic change of variable, of Sinkhorn-Knopp’s
algorithm [92] for finding a factorization of non-negative matrices intro-
duced by Sinkhorn [91]. We therefore refer to the iterations (3.47)–(3.48)
as Sinkhorn-Knopp’s algorithm.
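
In code, the iterations (3.47) take a few lines. The sketch below is our own illustration (not a reference implementation); it works in the log domain, i.e. directly on the potentials ϕ and ψ, which mitigates the divisions by zero mentioned in Remark 16 below.

    import numpy as np
    from scipy.special import logsumexp

    def sinkhorn(c, mu, nu, eta, iters=1000):
        # Alternate the regularized c-transforms (3.43)-(3.44) in the log domain.
        psi = np.zeros(c.shape[1])
        for _ in range(iters):
            phi = eta * np.log(mu) - eta * logsumexp((-c - psi[None, :]) / eta, axis=1)
            psi = -eta * np.log(nu) + eta * logsumexp((phi[:, None] - c) / eta, axis=0)
        # Recover the primal solution from Corollary 29.
        gamma = np.exp((phi[:, None] - psi[None, :] - c) / eta)
        return gamma, phi, psi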
Correctness. We first show the correctness of Sinkhorn–Knopp’s algorithm,
using a simple expression for S(ψ) which can be found in an article of Robert
Berman [14].
Proposition 33 (Correctness of Sinkhorn-Knopp). Let ψ ∈ R^Y be a potential. The following assertions are equivalent:
(i) ψ is a fixed point of S;
(ii) for every y ∈ Y, ⟨µ|RLag^η_y(ψ)⟩ = ν_y;
(iii) ψ is a maximizer of the regularized Kantorovich functional Kη.

This proposition follows at once from the next lemma and from the computation of ∇Kη in Theorem 32.

Lemma 34. (S(ψ)(y) − ψ(y))/η = − log(ν_y) + log⟨µ|RLag^η_y(ψ)⟩.
Proof. A calculation shows that

S(ψ)(y) = −η [ log(ν_y) − log( Σ_{x∈X} e^{( −c(x,y) + η log(µ_x) − η log( Σ_{z∈Y} e^{(−c(x,z)−ψ(z))/η} ) )/η} ) ]
= −η [ log(ν_y) − log( Σ_{x∈X} µ_x e^{−c(x,y)/η} / Σ_{z∈Y} e^{(−c(x,z)−ψ(z))/η} ) ]
= −η [ log(ν_y) − log( e^{ψ(y)/η} Σ_{x∈X} µ_x RLag^η_y(ψ)(x) ) ]
= −η [ log(ν_y) − log( e^{ψ(y)/η} ⟨µ|RLag^η_y(ψ)⟩ ) ],

which implies the equation. □

Convergence. In order to prove convergence, we need to strengthen the 1-Lipschitz estimate from Proposition 31 into a strict contraction estimate for the Sinkhorn-Knopp iteration (3.48), to which Picard's fixed point theorem can then be applied. The proof we present in this chapter was first introduced in lecture notes of Vialard [96].
Theorem 35 (Convergence of Sinkhorn, [96]). The map S is a contraction for ‖·‖_{o,∞}. More precisely,

‖S(ψ⁰) − S(ψ¹)‖_{o,∞} ≤ (1 − e^{−2‖c‖_{o,∞}/η}) ‖ψ⁰ − ψ¹‖_{o,∞}.

In particular, the iterates (ϕ^{(k)}, ψ^{(k)}) of Sinkhorn-Knopp's algorithm (3.47) converge with linear rate to the unique (up to an additive constant) maximizer of the regularized dual problem (3.41).

Remark 14 (Other convergence proofs). The convergence of Sinkhorn-Knopp's algorithm is usually proven (e.g. in [92]) using a theorem of Birkhoff [18]. We refer to the recent book by Peyré and Cuturi [83] for this point of view. Other convergence proofs exist, see for instance Berman [14] (in the continuous case) and Altschuler, Weed and Rigollet [3].
Remark 15 (Convergence speed). This theorem shows that the Sinkhorn-Knopp algorithm converges with linear speed, but the contraction constant has a bad dependency in η. Denoting C = ‖c‖_{o,∞}, to get an error of ε one needs

(1 − e^{−2C/η})^k ≤ ε,  i.e.  k ≳ e^{2C/η} log(1/ε),

where the second estimate holds for small values of η. This bad dependency in η is a practical obstacle to choosing a very small smoothing parameter. This calls for scaling techniques, as for the auction algorithm, which were considered by Schmitzer [89, 90].
Remark 16 (Implementation). The numerical implementation of Sinkhorn-Knopp's algorithm is more complicated than it seems:
• In a naive implementation, the computation of the smoothed c-transforms (3.43)–(3.44) has a cost proportional to Card(X)Card(Y). This can be alleviated, for instance, when X = Y are grids and the cost is a ‖·‖_p norm, using fast convolution techniques (see e.g. [93] or [83, Remark 4.17]), or when the cost is the squared geodesic distance on a Riemannian manifold [33, 93].
• The convergence speed can be slow when the supports of the data X, Y are “far” from each other and when η is small. This difficulty is circumvented using the η-scaling techniques mentioned above, often combined with multi-scale (coarse-to-fine) strategies, studied in this context by Benamou, Carlier and Nenna [11] and Schmitzer [89].
• Finally, some numerical difficulties (divisions by zero) can occur when η is small and the potential ψ is far from the solution.
The book of Peyré and Cuturi presents these difficulties in more detail and explains how to circumvent them [83]. In addition to the works already cited, we refer to the PhD work of Feydy [29, 45], and especially to the implementation of regularized optimal transport in the library GeomLoss (https://www.kernel-operations.io/geomloss/).
In order to prove this theorem, we will make use of the following elementary lemma, giving an upper bound on the L¹ distance between two Gibbs kernels e^{u_i}/Z_i, i ∈ {0, 1}, as a function of ‖u_1 − u_0‖_{o,∞}.

Lemma 36. Let u_0, u_1 be two functions on Y and denote g_i = e^{u_i}/Z_i where Z_i = Σ_{y∈Y} e^{u_i(y)}. Then,

Σ_{y∈Y} |g_1(y) − g_0(y)| ≤ 2(1 − e^{−2‖u_0−u_1‖_{o,∞}}).

Proof. Note that by definition the Gibbs kernel g_i does not change if a constant is added to u_i, so that we can assume that

ε := ‖u_0 − u_1‖_{o,∞} = ‖u_0 − u_1‖_∞.

Using the inequality u_0 − ε ≤ u_1 ≤ u_0 + ε, one easily shows that

e^{−2ε} e^{u_0}/Z_0 ≤ e^{u_1}/Z_1 ≤ e^{2ε} e^{u_0}/Z_0,

thus implying e^{−2ε} g_0 ≤ g_1 ≤ e^{2ε} g_0. This gives

(e^{−2ε} − 1) g_0 ≤ g_1 − g_0  and  (e^{−2ε} − 1) g_1 ≤ g_0 − g_1,

thus implying

|g_1 − g_0| ≤ (1 − e^{−2ε}) max(g_0, g_1) ≤ (1 − e^{−2ε})(g_0 + g_1).

Summing this inequality over Y and using Σ_Y g_i = 1, we obtain the desired inequality. □
Proof of Theorem 35. Consider ψ_0, ψ_1 ∈ R^Y and set ψ_t = ψ_0 + tv with v = ψ_1 − ψ_0. Without loss of generality, we assume that the functions ψ_0, ψ_1 are translated by a constant so that ‖ψ_0 − ψ_1‖_∞ = ‖ψ_0 − ψ_1‖_{o,∞}. We will first give an upper bound on ‖ψ_1^{c,η} − ψ_0^{c,η}‖_{o,∞}, and to do so we will bound

A(x, x′) = (ψ_1^{c,η}(x) − ψ_0^{c,η}(x)) − (ψ_1^{c,η}(x′) − ψ_0^{c,η}(x′))

uniformly in x, x′ ∈ X. For this purpose, we introduce

B(t, x, x′) = −η log( Σ_{y∈Y} e^{(−c(x,y)−ψ_t(y))/η} ) + η log( Σ_{y∈Y} e^{(−c(x′,y)−ψ_t(y))/η} )

and

g_{x,t}(y) = e^{(−c(x,y)−ψ_t(y))/η} / Σ_{z∈Y} e^{(−c(x,z)−ψ_t(z))/η}.

Then, recalling the definition of ψ_t^{c,η} in Eq. (3.43),

A(x, x′) = B(1, x, x′) − B(0, x, x′) = ∫_0^1 ∂_t B(t, x, x′) dt = ∫_0^1 ⟨v | g_{x,t} − g_{x′,t}⟩_{R^Y} dt
≤ ‖v‖_∞ ∫_0^1 Σ_{y∈Y} |g_{x,t}(y) − g_{x′,t}(y)| dt.

Then, by the previous lemma (Lemma 36), setting u_{x,t}(y) = −(c(x, y) + ψ_t(y))/η, so that g_{x,t} = e^{u_{x,t}}/Z_{x,t} with Z_{x,t} = Σ_{y∈Y} e^{u_{x,t}(y)}, we obtain

A(x, x′) ≤ 2‖v‖_∞ ∫_0^1 (1 − e^{−2‖u_{x,t}−u_{x′,t}‖_{o,∞}}) dt ≤ 2‖ψ_1 − ψ_0‖_{o,∞} (1 − e^{−2‖c‖_{o,∞}/η}),

where, in addition, we used

‖u_{x,t} − u_{x′,t}‖_{o,∞} ≤ ‖c‖_{o,∞}/η.

We therefore obtain

‖ψ_1^{c,η} − ψ_0^{c,η}‖_{o,∞} ≤ (1/2) sup_{x,x′∈X} A(x, x′) ≤ ‖ψ_1 − ψ_0‖_{o,∞} (1 − e^{−2‖c‖_{o,∞}/η}).

We conclude the proof of the contraction inequality by remarking that the map ϕ ↦ ϕ^{c,η} is 1-Lipschitz for ‖·‖_{o,∞}, thanks to Proposition 31.(iii). □

4. Semi-discrete optimal transport

In this part, we consider the semi-discrete optimal transport problem, where the source measure is a probability density and the target is a finitely supported measure. We start by introducing in Section 4.1 the framework of semi-discrete optimal transport, showing its connection with the notion of Laguerre tessellation in discrete geometry. We study in detail the regularity of the Kantorovitch functional K in this setting, in connection with algorithms for solving the semi-discrete optimal transport problem:
• In Section 4.2, we show convergence of the coordinate-wise increment algorithm introduced by Oliker and Prussner, using the Lipschitz continuity of the gradient ∇K.
• In Section 4.3, we introduce a damped Newton method and prove its convergence from a C¹-regularity and monotonicity property of the gradient ∇K.
• Finally, we consider in Section 4.4 the entropic regularization of the semi-discrete optimal transport problem and its relation to unregularized semi-discrete optimal transport.
4.1. Formulation of semi-discrete optimal transport. Our working assumptions for this section are the following:
• Ω_X, Ω_Y are two open subsets of R^d. The cost function c ∈ C¹(Ω_X × Ω_Y) satisfies the twist condition introduced in Definition 8.
• The source measure ρ is absolutely continuous with respect to the Lebesgue measure on Ω_X and its support is contained in a compact subset X of Ω_X. When writing ρ ∈ P^{ac}(X) we always mean that ρ belongs to P^{ac}(Ω_X) with spt(ρ) ⊆ X.
• The target space Y is finite, so that ν ∈ P(Y) can be written under the form ν = Σ_{y∈Y} ν_y δ_y. For simplicity, we assume that min_y ν_y > 0.
Note that, by an abuse of notation, we will often conflate ρ with its density with respect to the Lebesgue measure.

Laguerre tessellation. In the semi-discrete setting, the dual of Kantorovich's relaxation can be conveniently phrased using the notion of Laguerre tessellation, a variant of the Voronoi tessellation. This connection was already known and used in the 1980s and 1990s, see for instance Cullen–Purser [34], Aurenhammer–Hoffman–Aronov [6], Gangbo–McCann [51] or Caffarelli–Kochengin–Oliker [26]. Large-scale numerical implementations are more recent, starting in the 2010s, see e.g. [70, 38, 55, 64, 68, 65, 42, 58, 40, 41]. To explain the connection, we start with an economic metaphor.
Assume that the probability density ρ describes the population distribution
over a large city ΩX , and that the finite set Y describes the location of bak-
eries in the city. Customers living at a location x in ΩX try to minimize
the walking cost c(x, y), resulting in a decomposition of the space called a
Voronoi tessellation. The number of customers received by a bakery y ∈ Y
is equal to the integral of ρ over its Voronoi cell,
Vor_y := {x ∈ Ω_X | ∀z ∈ Y, c(x, y) ≤ c(x, z)}.
If the price of bread is given by a function ψ : Y → R, customers living
at location x in X make a compromise between walking cost and price by
minimizing the sum c(x, y) + ψ(y). This leads to the notion of Laguerre
tessellation.
Definition 17 (Laguerre tessellation). The Laguerre tessellation associated to a set of prices ψ : Y → R is a decomposition of the space into Laguerre cells defined by

Lag_y(ψ) := {x ∈ Ω_X | ∀z ∈ Y, c(x, y) + ψ(y) ≤ c(x, z) + ψ(z)}. (4.49)

More generally, for any distinct y_1, . . . , y_ℓ ∈ Y, we denote the common facet between the Laguerre cells Lag_{y_i}(ψ) by

Lag_{y_1...y_ℓ}(ψ) = ∩_{1≤i≤ℓ} Lag_{y_i}(ψ). (4.50)

We will also frequently consider the following hypersurfaces/halfspaces:

H_{yz}(ψ) = {x ∈ Ω_X | c(x, y) + ψ(y) = c(x, z) + ψ(z)},
H^≤_{yz}(ψ) = {x ∈ Ω_X | c(x, y) + ψ(y) ≤ c(x, z) + ψ(z)}, (4.51)

which are defined so that

∀y ∈ Y, Lag_y(ψ) = ∩_{z∈Y\{y}} H^≤_{yz}(ψ),
∀y, z ∈ Y, Lag_{yz}(ψ) ⊆ H_{yz}(ψ). (4.52)

Figure 1. (Left) The domain X (with boundary in blue) is endowed with a probability density pictured in grayscale, representing the density of population in a city. The set Y (in red) represents the location of bakeries. Here, X, Y ⊆ R² and c(x, y) = |x − y|². (Middle) The Voronoi tessellation induced by the bakeries. (Right) The Laguerre tessellation: the price of bread at the bakery near the center of X is higher than at the other bakeries, effectively shrinking its Laguerre cell.

Remark 17. For the quadratic cost c(x, y) = ‖x − y‖², one has

c(x, y) + ψ(y) ≤ c(x, z) + ψ(z) ⟺ ⟨x|z − y⟩ ≤ (1/2)((ψ(z) + ‖z‖²) − (ψ(y) + ‖y‖²)),

which easily implies that the Laguerre cells are convex polyhedra intersected with the domain Ω_X. Introducing ψ̃(z) = (1/2)(ψ(z) + ‖z‖²), one has

Lag_y(ψ) = {x ∈ Ω_X | ∀z ∈ Y, ⟨x|z − y⟩ ≤ ψ̃(z) − ψ̃(y)}.

As a direct consequence, the intersection of two distinct Laguerre cells is contained in a hyperplane and is therefore Lebesgue-negligible. If in addition ψ ≡ 0, then the Laguerre tessellation coincides with the Voronoi tessellation. The shape of the Voronoi and Laguerre tessellations is depicted in Figure 1.
The shape of the Voronoi and Laguerre tessellations is depicted in Figure 1.
The following proposition shows that Laguerre tessellations can be used
to build optimal transport maps.
Proposition 37. Under the twist condition (Def. 8), the intersection of two
distinct Laguerre cells Lagy (ψ) ∩ Lagz (ψ) (y 6= z) is Lebesgue-negligible, and
the map
Tψ : x ∈ ΩX 7→ arg min c(x, y) + ψ(y)
y∈Y

is well-defined Lebesgue almost-everywhere. In addition for any ψ ∈ RY and


any ρ ∈ P ac (X), Tψ is an optimal transport map for the cost c between ρ
and the measure X
νψ := Tψ# ρ = ρ(Lagy (ψ))δy . (4.53)
y∈Y

Proof. By Equation (4.52), one has

Lag_y(ψ) ∩ Lag_z(ψ) = Lag_{yz}(ψ) ⊆ f^{−1}({0}),

where we have set f(x) = c(x, y) − c(x, z) + ψ(y) − ψ(z). By the twist condition, ∇f(x) ≠ 0 for all x ∈ Ω_X, implying that the set f^{−1}({0}) is a (d − 1)-submanifold and is in particular Lebesgue-negligible. This easily implies that T_ψ is well-defined.
Let us now prove optimality of T_ψ in the optimal transport problem between ρ and T_{ψ#}ρ. By definition of T_ψ, one has

∀(x, y) ∈ X × Y, c(x, T_ψ(x)) + ψ(T_ψ(x)) ≤ c(x, y) + ψ(y).
Let γ be a transport plan between ρ and ν_ψ. Integrating the above inequality with respect to γ gives

∫_X (c(x, T_ψ(x)) + ψ(T_ψ(x))) ρ(x)dx ≤ ∫_{X×Y} (c(x, y) + ψ(y)) dγ(x, y),

where we have used Π_{X#}γ = ρ to simplify the left-hand side. Since ν_ψ = Π_{Y#}γ = T_{ψ#}ρ, applying change of variable formulas we get

∫_{X×Y} ψ(y) dγ(x, y) = ∫_Y ψ(y) dν_ψ = ∫_X ψ(T_ψ(x)) ρ(x)dx.

Subtracting this equality from the inequality above shows that the map T_ψ is optimal in the optimal transport problem:

∫_X c(x, T_ψ(x)) ρ(x)dx ≤ ∫_{X×Y} c(x, y) dγ(x, y). □

Monge-Ampère equation. Proposition 37 implies that any map T_ψ induced by a Laguerre tessellation of the domain solves the optimal transport problem between ρ and the image measure ν_ψ = T_{ψ#}ρ. From now on, we will denote

G_y : R^Y → R, ψ ↦ ρ(Lag_y(ψ)),
G : R^Y → R^Y, ψ ↦ (G_y(ψ))_{y∈Y}. (4.54)

In the bakery analogy, the function G_y(ψ) measures the number of customers of the bakery y given a family of prices ψ ∈ R^Y, and G : R^Y → R^Y maps a family of prices to a distribution of customers among the bakeries. By (4.53), one has

ν_ψ = Σ_{y∈Y} G_y(ψ) δ_y.

For simplicity, we consider P(Y) as a subset of R^Y, conflating a probability measure ν = Σ_{y∈Y} ν_y δ_y with the function ν : y ↦ ν_y. Then, T_ψ is an optimal transport map between ρ and ν iff T_{ψ#}ρ = ν, iff

G(ψ) = ν. (4.55)
In other words, we have transformed the optimal transport problem into a
finite-dimensional non-linear system of equations (4.55).
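
In practice, the masses ρ(Lag_y(ψ)) can be computed exactly with computational geometry tools; the following Monte Carlo sketch (our own illustration, not how the cited solvers work) conveys the idea by assigning samples of ρ to the cell minimizing c(x, y) + ψ(y):

    import numpy as np

    def G_monte_carlo(psi, Y, sample_rho, cost, n=100000):
        # G_y(psi) = rho(Lag_y(psi)), estimated from n i.i.d. samples of rho.
        # sample_rho(n) -> (n, d) array; cost(X, Y) -> (n, N) array of c(x_i, y_j).
        X = sample_rho(n)
        j = np.argmin(cost(X, Y) + psi[None, :], axis=1)
        return np.bincount(j, minlength=len(Y)) / n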
Remark 18 (Relation to subdifferential and Monge-Ampère equation). Assume that X = Ω_X = R^d and that c(x, y) = −⟨x|y⟩. Then,

Lag_y(ψ) = {x ∈ Ω_X | ∀z ∈ Y, −⟨x|y⟩ + ψ(y) ≤ −⟨x|z⟩ + ψ(z)}
= {x ∈ Ω_X | ∀z ∈ Y, ψ(z) ≥ ⟨x|z − y⟩ + ψ(y)}.

Denote ψ̂ the convex envelope of ψ, which can be defined using the double Legendre-Fenchel transform by

ϕ(x) = max_{y∈Y} ⟨x|y⟩ − ψ(y),   ψ̂(z) = max_{x∈X} ⟨x|z⟩ − ϕ(x).

Then the Laguerre cells defined above agree with the subdifferentials of ψ̂, i.e. Lag_y(ψ) = ∂ψ̂(y). Moreover, in the context of Monge-Ampère equations, the (infinite) measure

Σ_{y∈Y} λ(Lag_y(ψ)) δ_y = Σ_{y∈Y} λ(∂ψ̂(y)) δ_y

is called the Monge-Ampère measure of the function ψ̂ [57]. Semi-discrete techniques can also be applied to the numerical resolution of Monge-Ampère equations (with e.g. Dirichlet boundary conditions). We refer the reader to the pioneering work of Oliker-Prussner [79] and to the survey by Neilan, Salgado and Zhang [77].
Remark 19 (Lack of uniqueness). The solution ψ to G(ψ) = ν is never unique, because G is invariant under addition of a constant (see Proposition 38-(iii)). When spt(ρ) is disconnected, there might also exist two solutions ψ⁰, ψ¹ to G(ψ^i) = ν such that ψ⁰ − ψ¹ is not constant. Take X = [−1, 1], Y = {−1, 1}, choose c(x, y) = (x − y)² and

ρ = (1/2) 1_{[−1,−1/2]∪[1/2,1]},   ν = (1/2)(δ_{−1} + δ_1).

A computation shows that if |ψ(1) − ψ(−1)| ≤ 2, then G(ψ) = ν.
The existence of solutions to (4.55) and the algorithms that one can use to solve this system depend crucially on the properties of the function G. In the next proposition, we denote (1_y)_{y∈Y} the canonical basis of R^Y, i.e. 1_y(z) = 1 if y = z and 0 otherwise. We also denote 1_Y the constant function on Y equal to 1. On R^Y we consider two norms:

‖ψ‖ = ( Σ_{y∈Y} |ψ(y)|² )^{1/2}   and   ‖ψ‖_∞ = max_{y∈Y} |ψ(y)|.

We will often use the notation R, which measures the oscillation of the cost function:

R := max_{X×Y} c − min_{X×Y} c. (4.56)

Proposition 38. Assume c is twisted (Def. 8) and ρ ∈ P^{ac}(X). Then,
(i) ∀y ∈ Y, ∀t ≥ 0, G_y(ψ + t1_y) ≤ G_y(ψ),
(ii) ∀y ≠ z ∈ Y, ∀t ≥ 0, G_y(ψ + t1_z) ≥ G_y(ψ),
(iii) ∀ψ ∈ R^Y, ∀t ∈ R, G(ψ + t1_Y) = G(ψ),
(iv) ∀ψ ∈ R^Y, G(ψ) ∈ P(Y),
(v) if ψ ∈ R^Y is such that G_{y_0}(ψ) > 0, then ψ(y_0) ≤ min_Y ψ + R,
(vi) if ψ ∈ R^Y is such that G_y(ψ) > 0 for every y ∈ Y, then max_Y ψ − min_Y ψ ≤ R,
(vii) G is continuous,
where R = max_{X×Y} c − min_{X×Y} c.
Proof. The properties (i), (ii), (iii) are straightforward consequences of the definition of the Laguerre cells. Property (iv) is a consequence of Proposition 37 and of the assumption ρ ∈ P^{ac}(X). To prove (v), take ψ such that G_{y_0}(ψ) > 0, implying in particular that the Laguerre cell Lag_{y_0}(ψ) is non-empty and contains a point x ∈ X. Then, by definition of the cell, one has c(x, y_0) + ψ(y_0) ≤ c(x, y) + ψ(y) for all y ∈ Y \ {y_0}, thus showing that ψ(y_0) ≤ min_Y ψ + R. Point (vi) is a consequence of Point (v).

It remains to establish that each of the maps G_y is continuous. For this purpose, we consider a sequence (ψ_n)_{n∈N} ∈ R^Y converging to some ψ_∞ ∈ R^Y. We first note that, as in the proof of Proposition 37, the set

S = {x ∈ X | ∃y ≠ z ∈ Y s.t. c(x, y) + ψ_∞(y) = c(x, z) + ψ_∞(z)}

is Lebesgue-negligible and therefore also ρ-negligible. Defining χ = 1_{Lag_y(ψ_∞)} and χ_n = 1_{Lag_y(ψ_n)}, one has

G_y(ψ_n) = ∫ χ_n dρ  and  G_y(ψ_∞) = ∫ χ dρ.

To prove that lim_{n→+∞} G_y(ψ_n) = G_y(ψ_∞), it suffices to establish that χ_n converges to χ on X \ S, which is straightforward (because the inequalities defining the set X \ S are strict), and to apply Lebesgue's dominated convergence theorem. □

From these properties of G, we can deduce the existence of a solution to the equation G(ψ) = ν. The strategy used to prove this result is borrowed from [24] and is also reminiscent of Perron's method for proving existence for Monge-Ampère equations, see e.g. [57].

Corollary 39. Let G : R^Y → R^Y satisfy (i)–(vi) in Proposition 38 and let ν ∈ P(Y). Then, there exists ψ ∈ R^Y such that G(ψ) = ν.

Proof. Fix some y_0 ∈ Y such that ν_{y_0} ≠ 0, and consider the set

K = {ψ ∈ R^Y | ψ(y_0) = 0 and ∀y ∈ Y \ {y_0}, G_y(ψ) ≤ ν_y and ψ(y) ≤ R},

where R is defined as in (4.56). Given ψ ∈ K, one has

G_{y_0}(ψ) = 1 − Σ_{y≠y_0} G_y(ψ) ≥ ν_{y_0} > 0,

implying by (v) that min_{y∈Y} ψ ≥ ψ(y_0) − R. The set K is therefore bounded and closed (by continuity of the functions G_y), and therefore compact. We consider a minimizer ψ* over the set K of the function J(ψ) = Σ_{y∈Y} ψ(y). Assume that G_y(ψ*) < ν_y for some y ∈ Y \ {y_0}. Then, by continuity of G_y, there exists some t > 0 such that G_y(ψ* − t1_y) < ν_y. Moreover, by property (ii), we have

∀z ≠ y, G_z(ψ* − t1_y) ≤ G_z(ψ*) ≤ ν_z,

thus showing that ψ* − t1_y ∈ K. Since J(ψ* − t1_y) = J(ψ*) − t < J(ψ*), we get a contradiction. We have thus shown that ∀y ∈ Y \ {y_0}, G_y(ψ*) = ν_y, and using (iv) and ν ∈ P(Y) we obtain

G_{y_0}(ψ*) = 1 − Σ_{y∈Y\{y_0}} G_y(ψ*) = 1 − Σ_{y∈Y\{y_0}} ν_y = ν_{y_0},

so that G(ψ*) = ν and ψ* is a solution to (4.55). □



Kantorovich’s functional. We now show that Equation (4.55) is the optimality condition of the Kantorovitch functional, and can thus be recast as a smooth unconstrained optimization problem. We recall that

(KP) = max_{ψ∈R^Y} K(ψ),

where K is the Kantorovich functional given by

K(ψ) = ∫_X ψ^c dρ − ∫_Y ψ dν = Σ_{y∈Y} ∫_{Lag_y(ψ)} (c(x, y) + ψ(y)) dρ(x) − Σ_{y∈Y} ψ(y) ν_y.

Theorem 40 (Aurenhammer, Hoffman, Aronov). Assume that ρ ∈ P^{ac}(X), that c is twisted (Def. 8), and consider K defined in (2.24). Then:
• K is concave and C¹-smooth, and its gradient is

∇K(ψ) = G(ψ) − ν, (4.57)

where G is defined in (4.54);
• ∀ψ ∈ R^Y, ∀t ∈ R, K(ψ + t1_Y) = K(ψ);
• K attains its maximum over R^Y, and ∇K(ψ) = 0 iff ψ solves (4.55).
Remark 20. This theorem could be deduced from the computation of directional derivatives of K given in Corollary 16; however, we prefer to give a simple and self-contained proof due to Aurenhammer, Hoffman and Aronov [6].
Proof of Theorem 40. We simultaneously show that the functional is concave and compute its gradient. For any function ψ on Y and any measurable map T : X → Y, one has

min_{y∈Y} (c(x, y) + ψ(y)) ≤ c(x, T(x)) + ψ(T(x)),

which by integration against ρ gives

K(ψ) ≤ ∫_X (c(x, T(x)) + ψ(T(x))) ρ(x)dx − Σ_{y∈Y} ψ(y)ν_y. (4.58)

Moreover, equality holds when T = T_ψ. Taking another function ψ′ ∈ R^Y and setting T = T_{ψ′} in Equation (4.58) gives

K(ψ) ≤ ∫_X (c(x, T_{ψ′}(x)) + ψ(T_{ψ′}(x))) ρ(x)dx − Σ_{y∈Y} ψ(y)ν_y
= Σ_{y∈Y} ∫_{Lag_y(ψ′)} (c(x, y) + ψ(y)) ρ(x)dx − Σ_{y∈Y} ψ(y)ν_y
= Σ_{y∈Y} ∫_{Lag_y(ψ′)} (c(x, y) + ψ′(y)) ρ(x)dx + Σ_{y∈Y} ρ(Lag_y(ψ′)) (ψ(y) − ψ′(y)) − Σ_{y∈Y} ψ(y)ν_y
= K(ψ′) + ⟨G(ψ′) − ν | ψ − ψ′⟩.

By definition, this shows that G(ψ′) − ν belongs to the superdifferential ∂⁺K(ψ′) of K at ψ′ (Definition 21), thus proving by Proposition 55 that K is concave.

We now prove that K belongs to C¹(R^Y). Consider ψ ∈ R^Y and let (ψ_n)_{n∈N} be a sequence converging to ψ and such that ∇K(ψ_n) exists for every n ∈ N. Since G(ψ_n) − ν ∈ ∂⁺K(ψ_n) = {∇K(ψ_n)}, we obtain ∇K(ψ_n) = G(ψ_n) − ν. Thus, by the continuity of G (Proposition 38),

lim_{n→+∞} ∇K(ψ_n) = lim_{n→+∞} G(ψ_n) − ν = G(ψ) − ν,

ensuring by (5.76) that ∂⁺K(ψ) = {G(ψ) − ν}, so that ∇K(ψ) = G(ψ) − ν for all ψ ∈ R^Y. By continuity of G we get K ∈ C¹ as announced, and Equation (4.57) holds for all ψ ∈ R^Y, so that one trivially has ∇K(ψ) = 0 iff G(ψ) = ν. Finally, we note that thanks to Corollary 39, there exists ψ ∈ R^Y such that G(ψ) = ν, which automatically is a maximizer of K because K is concave and ∇K(ψ) = 0. □

4.2. Semi-discrete optimal transport via coordinate decrements. As before, we assume that X ⊆ Ω_X is compact, that Y ⊆ Ω_Y is finite, and that Ω_X, Ω_Y ⊆ R^d are open sets. We recall the notation G_y(ψ) := ρ(Lag_y(ψ)). Oliker-Prussner's algorithm for solving G(ψ) = ν is described in Algorithm 3; it bears a strong resemblance to Bertsekas' auction algorithm, in that the “prices” evolve monotonically.

Algorithm 3 Oliker-Prussner algorithm

Input: A tolerance parameter δ > 0.
Initialization: Fix some y_0 ∈ Y once and for all. Set ψ^{(0)}(y) := 0 if y = y_0, and ψ^{(0)}(y) := R if not.
While: ∃y ≠ y_0 such that G_y(ψ^{(k)}) ≤ ν_y − δ/N
    Step 1: Compute

    t_y = min{t ≥ 0 | G_y(ψ^{(k)} − t1_y) ≥ ν_y}. (4.59)

    Step 2: Set ψ^{(k+1)} = ψ^{(k)} − t_y 1_y.
Output: A vector ψ^{(k)} that satisfies ‖G(ψ^{(k)}) − ν‖_∞ ≤ δ.
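
A possible Python rendering of Algorithm 3 (an illustrative sketch of ours, reusing a mass map G such as the Monte Carlo estimate above, and computing t_y in Step 1 by bisection):

    import numpy as np

    def oliker_prussner(G, nu, R, delta, y0=0, bisect_iters=50):
        # G(psi) -> vector of cell masses (G_y(psi))_y; R as in (4.56).
        N = len(nu)
        psi = np.full(N, float(R))
        psi[y0] = 0.0
        while True:
            mass = G(psi)
            bad = [y for y in range(N) if y != y0 and mass[y] <= nu[y] - delta / N]
            if not bad:
                return psi
            y = bad[0]
            # Step 1: t_y by bisection; G_y(psi - t 1_y) is nondecreasing in t.
            lo, hi = 0.0, 2.0 * R
            for _ in range(bisect_iters):
                mid = 0.5 * (lo + hi)
                trial = psi.copy()
                trial[y] -= mid
                if G(trial)[y] >= nu[y]:
                    hi = mid
                else:
                    lo = mid
            psi[y] -= hi   # Step 2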

This algorithm can be described in words using the bakery analogy of Section 4.1. We choose once and for all a bakery y_0 ∈ Y whose price will be set to zero. Initially, the price of bread ψ^{(0)} is zero at this bakery y_0 and set to the prohibitively large value R, defined in Equation (4.56), at any other location. This choice guarantees that the bakery y_0 initially gets all the customers. The prices ψ^{(k)} ∈ R^Y are then constructed iteratively by performing a sort of reverse auction: at step k, start by finding some bakery y = y^{(k)} ∈ Y \ {y_0} which sells less bread than its production capacity, i.e.

G_y(ψ^{(k)}) ≤ ν_y − δ/N.

The price of bread at y is then decreased so that the amount of bread sold equals the production capacity of y, i.e. one finds t_y ≥ 0 such that

G_y(ψ^{(k)} − t_y 1_y) = ν_y,

and then updates ψ^{(k+1)} = ψ^{(k)} − t_y 1_y.
Remark 21 (Origin and extensions). This algorithm was introduced by Oliker
and Prussner, for the purpose of solving Monge-Ampère equations with
Dirichlet boundary conditions in [79]. In the context of optimal transport,
the first use of Algorithm 3 seems to be in an article of Caffarelli, Kochengin
and Oliker [26] (see also [24]), in the setting of the reflector problem, namely
c(x, y) = − log(1 − ⟨x|y⟩) on X = Y = S^{d−1}. Since then, the convergence of this algorithm has been generalized to other costs and/or more general assumptions on the probability density ρ; we refer the reader to [64, 42] and to the references therein.
C^{1,1} estimates for the Kantorovich functional. The proof of convergence of Oliker-Prussner's algorithm relies on the Lipschitz regularity of the map G when ρ is bounded, proven in the next proposition. (Since ∇K = G − ν, this proposition also implies that Kantorovich's functional K has a Lipschitz gradient, improving on the C¹ estimate of Theorem 40.)
Proposition 41. Assume that c ∈ C 2 (ΩX ×ΩY ) satisfies the twist condition,
and assume also that ρ ∈ P ac (X) ∩ L∞ (X). Then for every y ∈ Y , the map
Gy : RY → R defined in (4.54) is globally Lipschitz.
Remark 22. The proof of this proposition comes with an estimate of the Lipschitz constant: namely, it shows |G_y(ψ) − G_y(ϕ)| ≤ L_G ‖ϕ − ψ‖_∞ with

L_G = c(d) N ‖ρ‖_∞ (1/κ) (1 + (M/κ) diam(X)) diam(X)^{d−1},
κ = min_{y≠z∈Y} min_X ‖∇_x c(·, y) − ∇_x c(·, z)‖, (4.60)
M = max_{y≠z∈Y} max_X ‖D²_{xx} c(·, y) − D²_{xx} c(·, z)‖.

In the estimation of the Lipschitz constant LG (4.60), it is possible that the


term in N is not tight, but the other terms cannot be improved without
adding assumptions on the cost.
Example 6. With c(x, y) = (1/2)‖x − y‖², one has ∇_x c(x, y) = x − y and D²_{xx} c(x, y) = id, so that M = 0 and κ is the minimal distance between two distinct points in Y: κ = min_{y≠z∈Y} ‖y − z‖.
The proof relies on the following lemma, which allows us to estimate the variations of G_y in the direction 1_z, z ≠ y.

Lemma 42. Let c ∈ C¹(Ω_X × Ω_Y) be a twisted cost and ρ ∈ P^{ac}(X). For every y ≠ z ∈ Y and ψ ∈ R^Y,

G_y(ψ + t1_z) − G_y(ψ) = ∫_0^t G_{yz}(ψ + s1_z) ds, (4.61)

where

G_{yz}(ψ) = ∫_{Lag_{yz}(ψ)} ρ(x) / ‖∇_x c(x, y) − ∇_x c(x, z)‖ dvol^{d−1}(x).

Proof. This is a consequence of the coarea formula, Equation (5.78). In order to see this, we first note that

Lag_y(ψ + t1_z) = E ∩ H^≤_{yz}(ψ + t1_z), where E = ∩_{w∈Y\{y,z}} H^≤_{yw}(ψ).

In particular, for t ≥ 0, setting c_{yz} = c(·, y) − c(·, z) and a = ψ(z) − ψ(y),

Lag_y(ψ + t1_z) \ Lag_y(ψ) = E ∩ c_{yz}^{−1}((a, a + t]).

Thus, by the coarea formula,

G_y(ψ + t1_z) − G_y(ψ) = ρ(Lag_y(ψ + t1_z) \ Lag_y(ψ)) = ∫_{E ∩ c_{yz}^{−1}((a,a+t])} ρ(x) dvol^d(x) = ∫_0^t ∫_{E ∩ c_{yz}^{−1}(a+s)} ρ(x)/‖∇c_{yz}(x)‖ dvol^{d−1}(x) ds.

One concludes by remarking that

x ∈ Lag_{yz}(ψ + s1_z) ⟺ x ∈ E and c(x, y) + ψ(y) = c(x, z) + ψ(z) + s ⟺ x ∈ E ∩ c_{yz}^{−1}(a + s).

This establishes (4.61) in the case t ≥ 0, and the case t ≤ 0 can be treated similarly. □

The second ingredient to prove Proposition 41 is a uniform upper bound on the (d − 1)-Hausdorff measure of the level sets of a C² function f with non-vanishing gradient.

Lemma 43. Let X ⊆ Ω_X ⊆ R^d with Ω_X open and X compact, and let f ∈ C²(Ω_X) be such that ∀x ∈ Ω_X, ‖∇f(x)‖ > 0. Then,

vol^{d−1}(f^{−1}(0) ∩ X) ≤ c(d) (1 + (M/κ) diam(X)) diam(X)^{d−1},

where κ = min_X ‖∇f‖ and M = max_X ‖D²f‖.

Proof. By compactness, there exist a finite number of unit vectors u_1, . . . , u_n and an open covering V_1, . . . , V_n of the unit sphere S^{d−1} such that if u ∈ V_i, then ⟨u|u_i⟩ ≥ 3/4, implying in particular ‖u − u_i‖² = 2 − 2⟨u|u_i⟩ ≤ 1/2. Moreover, n depends only on the dimension d. Let S = f^{−1}(0) ∩ X. This set S can be covered by patches S_i, i.e. S = ∪_i S_i, where

S_i = {x ∈ S | ∇f(x)/‖∇f(x)‖ ∈ V_i}.

We will now estimate the volume of each patch S_i using the coarea formula recalled in Theorem 56 of the appendix. To apply this formula, we consider Π_i : S_i ⊆ R^d → {u_i}^⊥ the orthogonal projection onto the hyperplane H_i = {u_i}^⊥. We need to estimate the Jacobian JΠ_i(x) (see (5.77)). Since Π_i is linear, we have DΠ_i = Π_i. Moreover, for any tangent vector v at x ∈ S_i, one has ⟨v|∇f(x)⟩ = 0. Setting u = ∇f(x)/‖∇f(x)‖ ∈ V_i, we get

‖Π_i v‖² = ‖v‖² − ⟨v|u_i⟩² = ‖v‖² − ⟨v|u_i − u⟩² ≥ ‖v‖² (1 − ‖u_i − u‖²) ≥ ‖v‖²/2.

This shows that the restriction of DΠ_i(x) to the tangent space T_x S_i is injective and that its inverse is √2-Lipschitz. This implies that

JΠ_i(x) ≥ (1/√2)^{d−1}.

We now apply the coarea formula (5.78) to the manifold M = f^{−1}(0), E = S_i ⊆ M, N = H_i, n = m = d − 1, Φ = Π_i, and u ≡ 1:

vol^{d−1}(S_i) = ∫_{H_i} ∫_{Π_i^{−1}(y)} (1/JΠ_i(x)) dvol⁰(x) dvol^{d−1}(y) ≤ √2^{d−1} ∫_{{u_i}^⊥} Card(S_i ∩ (y + Ru_i)) dvol^{d−1}(y).

We now give an upper bound on Card(S_i ∩ (y + Ru_i)). Let x ∈ S_i and use Taylor's formula to get

f(x + tu_i) ≥ f(x) + t⟨∇f(x)|u_i⟩ − (M/2)t² ≥ (3/4)κt − (M/2)t²,

so that f(x + tu_i) > 0 as long as t ∈ (0, t*) with t* = 3κ/(2M). One has a similar bound for negative t. This directly implies that the number of intersection points between S_i and y + Ru_i is at most 1 + diam(X)/t*. Since the number n of directions u_i only depends on the dimension d, we have

vol^{d−1}(S) ≤ Σ_{1≤i≤n} vol^{d−1}(S_i) ≤ c(d) Σ_{1≤i≤n} vol^{d−1}(Π_i(X)) (1 + (M/κ) diam(X)) ≤ c(d) (1 + (M/κ) diam(X)) diam(X)^{d−1},

where c(d) denotes a constant depending only on the dimension d. □

Proof of Proposition 41. Let y ≠ z ∈ Y. Applying Lemma 42, we have

|G_y(ψ + t1_z) − G_y(ψ)| ≤ ∫_0^{|t|} ∫_{Lag_{yz}(ψ+s1_z)} ρ(x)/‖∇c_{yz}(x)‖ dvol^{d−1}(x) ds ≤ (‖ρ‖_∞/κ) max_{a≤s≤a+t} vol^{d−1}(c_{yz}^{−1}(s) ∩ X) |t|,

where we used the bound ‖∇c_{yz}(x)‖ ≥ κ, which comes from the twist assumption, and the inclusion Lag_{yz}(ψ + s1_z) ⊆ c_{yz}^{−1}(a + s) with a = ψ(z) − ψ(y), as in the proof of the previous lemma. Applying Lemma 43 to the function f = c_{yz} − s, we get a uniform upper bound on the (d − 1)-volume of the level set c_{yz}^{−1}(s):

vol^{d−1}(c_{yz}^{−1}(s) ∩ X) ≤ c(d) (1 + (M/κ) diam(X)) diam(X)^{d−1},

which yields

|G_y(ψ + t1_z) − G_y(ψ)| ≤ L̂_G |t|, with L̂_G = (c(d)/κ) (1 + (M/κ) diam(X)) diam(X)^{d−1} ‖ρ‖_∞. (4.62)

Take ψ, ψ̃ ∈ R^Y. Order the points of Y, i.e. write Y = {y_1, . . . , y_N}, and define recursively

ψ⁰ = ψ,  ψ^k = ψ^{k−1} + (ψ̃(y_k) − ψ(y_k)) 1_{y_k} for 1 ≤ k ≤ N.

Then ψ^N = ψ̃ and, for each k, ψ^k and ψ^{k−1} differ only by their value at y_k. Thus, applying (4.62),

|G_y(ψ̃) − G_y(ψ)| ≤ Σ_{1≤k≤N} |G_y(ψ^k) − G_y(ψ^{k−1})| ≤ L̂_G Σ_{1≤k≤N} |ψ(y_k) − ψ̃(y_k)| ≤ L_G ‖ψ − ψ̃‖_∞, with L_G = N L̂_G. □


Convergence of Oliker-Prussner's algorithm. Now that we have established the Lipschitz continuity of G_y, the convergence of Algorithm 3 follows easily, using arguments similar to those used to establish the convergence of the auction algorithm.

Theorem 44 (Oliker-Prussner). Assume that the cost c ∈ C²(Ω_X × Ω_Y) is twisted (Def. 8) and that ρ ∈ P^{ac}(X) ∩ L^∞(X). Then,
• Oliker-Prussner's algorithm converges in a finite number of steps k ≤ CN³/δ, where C is a constant that depends on X, Y, ρ and c;
• furthermore, at the final step k, one has ∀y ∈ Y, |G_y(ψ^{(k)}) − ν_y| ≤ δ.

Remark 23 (Computational complexity). The computational complexity is actually much higher than the number of steps of the algorithm, since:
• at each iteration, one needs to compute t_y (this could be done using for instance a binary search or more clever techniques);
• each time the map G_y is evaluated, one needs to compute the Laguerre cell Lag_y(ψ), which, if done naively, requires computing the intersection of N − 1 half-spaces H^≤_{yz}(ψ).
Overall, this leads to an upper bound on the computational complexity of at least O((N⁴/δ) log(N)), assuming that one can compute Lag_y(ψ) in time N. To the best of our knowledge, there exists no lower bound on the number of iterations of Algorithm 3, i.e. specific instances of the problem for which one can count the number of iterations.

Remark 24 (δ-scaling). It is tempting to perform δ-scaling, as in the case of the auction algorithm (see Algorithm 2). In practice, one could start with a rather large δ⁽⁰⁾ ∈ (0, 1) to get a first estimate of the prices using Oliker-Prussner's algorithm. Then one would iteratively replace δ⁽ℓ⁾ by δ⁽ℓ⁺¹⁾ = δ⁽ℓ⁾/2 and run the algorithm again, starting from the prices found at the previous iteration. Doing so, one could hope to get rid of the 1/δ term in the number of iterations and to replace it by, e.g., log(1/δ).


Proof of Theorem 44.
Step 1 (Correctness). When Algorithm 3 terminates with ψ := ψ^{(k)}, one has ρ(Lag_y(ψ)) ≤ ν_y for any y ≠ y_0. The stopping criterion also means that ρ(Lag_y(ψ)) ≥ ν_y − δ/N for any y ≠ y_0. Then, as desired, we get

ρ(Lag_{y_0}(ψ)) = 1 − Σ_{y≠y_0} ρ(Lag_y(ψ)) ∈ [ν_{y_0}, ν_{y_0} + δ].

Step 2 (A priori bound on ψ^{(k)}). By construction, one has ρ(Lag_y(ψ^{(k)})) ≤ ν_y for y ≠ y_0, which also implies that

ρ(Lag_{y_0}(ψ^{(k)})) = 1 − Σ_{y∈Y\{y_0}} ρ(Lag_y(ψ^{(k)})) ≥ ν_{y_0} > 0.

By Proposition 38-(v), we get 0 = ψ^{(k)}(y_0) ≤ min_Y ψ^{(k)} + R. Since the price of y_0 is never changed and the other prices start at R and only decrease, we get R ≥ ψ^{(k)}(y) ≥ −R for all y.

Step 3 (Minimum decrease and termination). In the second step of the algorithm, when ψ^{(k)} is updated one has G_y(ψ^{(k)} − t_y 1_y) ≥ G_y(ψ^{(k)}) + δ/N. Since G_y is Lipschitz with some constant L_G, this implies that |t_y| ≥ δ/(N L_G). Then, since ψ^{(0)}(y) = R and ψ^{(k)}(y) ≥ −R for any k, the number of times k_y the price of a point y ∈ Y is updated cannot be too large:

k_y δ/(N L_G) ≤ 2R, i.e. k_y ≤ 2RN L_G/δ.

Since this bound concerns a single point, it needs to be multiplied by N to bound the total number of steps. Using the bound on L_G given in (4.60), we get an upper bound of O(N³/δ) on the number of iterations of the algorithm. □

4.3. Semi-discrete optimal transport via Newton's method. We consider a simple damped Newton algorithm to solve the semi-discrete optimal transport problem, introduced in [65] and adapted from a similar algorithm for solving Monge-Ampère equations with Dirichlet boundary conditions [75].
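
To fix ideas, here is a sketch of such a damped Newton loop (our paraphrase of the strategy of [65], not a verbatim transcription; G and DG stand for the cell-mass map (4.54) and its Jacobian, whose expression is computed below in (4.63)):

    import numpy as np

    def damped_newton(G, DG, nu, psi0, tol=1e-10):
        # Solve G(psi) = nu; the damping keeps all cell masses bounded below,
        # which keeps the Hessian well-conditioned along the iterations.
        psi = psi0.copy()
        eps0 = 0.5 * min(G(psi).min(), nu.min())
        while np.linalg.norm(G(psi) - nu) > tol:
            H = DG(psi)
            # H is singular along constants: pin coordinate 0 of the step.
            d = np.zeros_like(psi)
            d[1:] = np.linalg.solve(H[1:, 1:], (nu - G(psi))[1:])
            tau = 1.0
            # Backtrack until masses stay >= eps0 and the residual decreases.
            while (G(psi + tau * d).min() < eps0 or
                   np.linalg.norm(G(psi + tau * d) - nu)
                       > (1 - tau / 2) * np.linalg.norm(G(psi) - nu)):
                tau /= 2
            psi = psi + tau * d
        return psi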

Hessian of Kantorovich's functional. In order to write Newton's algorithm, we first show that G is C¹ (or, equivalently, that K is C²) and we compute its derivatives under a genericity assumption, which depends on the cost and on the choice of points Y. This condition is a bit technical, but it is for instance satisfied for the quadratic cost on R^d (see Remark 27 below).
Definition 18 (Genericity assumption). Let Ω_X, Ω_Y ⊆ R^d be open, c ∈ C¹(Ω_X × Ω_Y), and X ⊆ Ω_X, Y ⊆ Ω_Y, with X compact and Y finite.
• We call Y generic with respect to c if for all distinct y_0, y_1, y_2 ∈ Y and all t ∈ R², one has

vol^{d−1}({x ∈ Ω_X | (c(x, y_1) − c(x, y_0), c(x, y_2) − c(x, y_0)) = t}) = 0.

• We call Y generic with respect to ∂X if for all distinct y_0, y_1 ∈ Y and all t ∈ R, one has

vol^{d−1}({x ∈ Ω_X | c(x, y_1) − c(x, y_0) = t} ∩ ∂X) = 0.
Remark 25 (d = 1). The genericity assumption is never satisfied in dimension d = 1, because it requires that the intersection of the 0-dimensional sets {c(·, y_i) − c(·, y_0) = t_i}, i = 1, 2, is empty for all t ∈ R². Nonetheless, quasi-Newton methods seem to be quite efficient in this case as well [40].
Remark 26 (Sufficient genericity condition). Assume that for all distinct points y_0, y_1, y_2 ∈ Y and for every x ∈ X, the vectors ∇_x c(x, y_1) − ∇_x c(x, y_0) and ∇_x c(x, y_2) − ∇_x c(x, y_0) are linearly independent. Then, the implicit function theorem guarantees that for every t ∈ R² the set

(c(·, y_1) − c(·, y_0), c(·, y_2) − c(·, y_0))^{−1}(t)

is a (d − 2)-dimensional submanifold, and therefore has zero (d − 1)-volume. In particular, the set Y is generic with respect to c (but not necessarily with respect to ∂X).

Remark 27 (Quadratic cost). For the quadratic cost c(x, y) = (1/2)‖x − y‖², we have ∇_x c(x, y_i) − ∇_x c(x, y_0) = y_0 − y_i. Using the previous remark, we see that the set Y is generic with respect to c if it does not include three aligned points.
Genericity with respect to the boundary ∂X requires more assumptions. For instance, if X is a strictly convex set (or, more generally, if the Gaussian curvature is nonzero at every point of ∂X) and if the cost is quadratic, then Y is automatically generic with respect to ∂X. As a second example, assume that X is a compact convex polyhedron, e.g.

X = {x ∈ R^d | ∀1 ≤ j ≤ M, ⟨x|w_j⟩ ≤ 1},

where w_1, . . . , w_M ∈ R^d. Then Y is generic with respect to ∂X if for all distinct y_0, y_1 ∈ Y and all 1 ≤ j ≤ M, the vectors y_1 − y_0 and w_j are linearly independent.
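
For the quadratic cost, the first condition is easy to test numerically; the sketch below (our own illustration) checks that no three points of Y are aligned:

    import numpy as np
    from itertools import combinations

    def is_generic_quadratic(Y, tol=1e-12):
        # Y: (N, d) array; returns False if some three points are aligned.
        for i, j, k in combinations(range(len(Y)), 3):
            M = np.stack([Y[j] - Y[i], Y[k] - Y[i]])
            if np.linalg.matrix_rank(M, tol=tol) < 2:
                return False
        return True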
Example 7 (Non-differentiability of G). When the set Y is not generic, the map G might be non-differentiable. Consider for instance Y = {y_{−1}, y_0, y_1} ⊆ R² with y_i = (i, 0), X = [−1, 1]², ρ = (1/4) vol²|_X and c(x, y) = ‖x − y‖². Define a one-parameter family of prices ψ_t : Y → R by ψ_t(y_0) = t and ψ_t(y_{±1}) = 0. Then, for t ≥ 0,

Lag_{y_0}(ψ_t) = {x = (x_1, x_2) ∈ X | |x_1| ≤ (1/2)|1 − t|},

so that

G_{y_0}(ψ_t) = ρ(Lag_{y_0}(ψ_t)) = (1/2)|1 − t| if t ≤ 1, and 0 if not,

showing that the function G_{y_0} is non-differentiable at t = 1.

Theorem 45. If c ∈ C²(Ω_X × Ω_Y) satisfies the twist condition (Def. 8), Y is generic with respect to c and ∂X (Def. 18), and the restriction ρ|_X of ρ to X is continuous (ρ|_X ∈ C⁰(X)), then the map G : R^Y → R^Y is C¹, and

∀z ≠ y, ∂G_y/∂1_z(ψ) = G_{yz}(ψ) := ∫_{Lag_{yz}(ψ)} ρ(x) / ‖∇_x c(x, y) − ∇_x c(x, z)‖ dvol^{d−1}(x),
∀y ∈ Y, ∂G_y/∂1_y(ψ) = G_{yy}(ψ) := − Σ_{z∈Y\{y}} G_{yz}(ψ), (4.63)

where we denote Lag_{yz}(ψ) = Lag_y(ψ) ∩ Lag_z(ψ) for y ≠ z.
The formula that one should expect for the partial derivative of Gy with
respect to 1z (z 6= y) is already quite clear from Lemma 42. The main
difficulty in order to establish Theorem 45 is to prove that the function Gyz
defined in (4.63) is continuous.
Lemma 46. Assume that Y is generic with respect to c and ∂X. Then, for any y ≠ z ∈ Y, the function Gyz defined in (4.63) is continuous.
Proof. Let f = cyz = c(·, y) − c(·, z). By the Cauchy–Lipschitz theorem, one can construct a flow Φ : [−ε, ε] × Ω¹X → ΩX, where ε > 0 and Ω¹X ⊂ ΩX is an open set containing X, such that

Φ(0, x) = x,   Φ̇(t, x) = ∇f(Φ(t, x)) / ‖∇f(Φ(t, x))‖²,    (4.64)

where Φ̇ denotes the derivative with respect to t. A simple calculation shows that (d/dt) f(Φ(t, x)) = 1, which implies that f(Φ(t, x)) = f(Φ(0, x)) + t. Moreover, since ∇f/‖∇f‖² is of class C¹ on ΩX, the maps Ft := Φ(t, ·) converge in a C¹ sense to the identity as t → 0.
Let (ψn ) be a sequence in RY converging to some ψ∞ ∈ RY . We put
an = ψn (z) − ψn (y), a = ψ∞ (z) − ψ∞ (y) and tn = an − a and define
Ln = Φ(−tn , Lagyz (ψn )) and L∞ = Lagyz (ψ∞ ).
By definition, one has f (Lagyz (ψn )) = an and f (Lagyz (ψ∞ )) = a. Using
the flow property, one gets that both Ln and L∞ are subsets of the hyper-
surface H = f −1 (a). Denoting Fn the restriction of Φ(tn , ·) to H, one has
Lagyz (ψn ) = Fn (Ln ).
We now need to consider a continuous extension ρ̄ of ρ|X to ΩX, since Ln may not be included in X. By a change of variables (see (5.79) for instance), one gets

gn := Gyz(ψn) = ∫_{Lagyz(ψn)} ρ(y)/‖∇cyz(y)‖ dvol^{d−1}(y)
             = ∫_{Lagyz(ψn)} ρ̄(y) χX(y)/‖∇cyz(y)‖ dvol^{d−1}(y)
             = ∫_H (ρ̄(Fn(x))/‖∇cyz(Fn(x))‖) JFn(x) χ_{Ln}(x) χX(Fn(x)) dvol^{d−1}(x),
where χA denotes the indicator function of a set A. Moreover,

g∞ := Gyz(ψ∞) = ∫_H (ρ̄(x)/‖∇cyz(x)‖) χ_{L∞∩X}(x) dvol^{d−1}(x).
By Lebesgue's dominated convergence theorem, to prove that (gn)_{n≥0} converges to g∞, it suffices to prove that the integrand of gn (seen as a function on H) tends to the integrand of g∞ vol^{d−1}-almost everywhere. Since Fn converges to the identity in a C¹ sense and ρ̄ is continuous, it remains to show that lim_{n→∞} χ_{Ln}(x) χX(Fn(x)) = χ_{L∞∩X}(x) for almost every x ∈ H (for the (d − 1)-dimensional Hausdorff measure).
We first prove that lim sup_{n→∞} χ_{Ln}(x) χX(Fn(x)) ≤ χ_{L∞∩X}(x) for every x ∈ H. The limsup is non-zero if and only if there exists a subsequence σ(n) such that x ∈ L_{σ(n)} and F_{σ(n)}(x) ∈ X. Then, since F_{σ(n)}(L_{σ(n)}) = Lagyz(ψ_{σ(n)}), we get

∀w ∈ Y,  c(F_{σ(n)}(x), y) + ψ_{σ(n)}(y) ≤ c(F_{σ(n)}(x), w) + ψ_{σ(n)}(w),
         c(F_{σ(n)}(x), y) + ψ_{σ(n)}(y) = c(F_{σ(n)}(x), z) + ψ_{σ(n)}(z).

Passing to the limit n → +∞, we see that x belongs to Lagyz(ψ∞) = L∞ and to X, thus ensuring

lim sup_{n→+∞} χ_{Ln}(x) χX(Fn(x)) ≤ χ_{L∞∩X}(x).

We now pass to the liminf inequality. Denote

S = ( ∪_{w∈Y\{y,z}} Hyzw(ψ∞) ) ∪ (Hyz(ψ∞) ∩ ∂X),

where Hyz is defined in Equation (4.51) and Hyzw(ψ∞) := Hyz(ψ∞) ∩ Hzw(ψ∞), which by the genericity assumption has zero (d − 1)-dimensional Hausdorff measure. We now prove that lim inf_{n→∞} χ_{Ln}(x) χX(Fn(x)) ≥ χ_{L∞∩X}(x) on H \ S. If x ∉ L∞ ∩ X, then χ_{L∞∩X}(x) = 0 and there is nothing to prove. We therefore consider x ∈ (L∞ ∩ X) \ S, meaning by definition of S that x belongs to the interior int(X) and that

∀w ∈ Y \ {y, z},  c(x, y) + ψ∞(y) < c(x, w) + ψ∞(w).
Since Fn(x) converges to x, this implies that for n large enough one has Fn(x) ∈ int(X) and, using the convergence of ψn to ψ∞,

∀w ∈ Y \ {y, z},  c(Fn(x), y) + ψn(y) < c(Fn(x), w) + ψn(w).

Moreover, f(Fn(x)) = a + tn = an, i.e. c(Fn(x), y) + ψn(y) = c(Fn(x), z) + ψn(z). This means that Fn(x) belongs to Lagyz(ψn), and therefore x ∈ Ln by definition of Ln. Thus

lim inf_{n→+∞} χ_{Ln}(x) χX(Fn(x)) = 1 ≥ χ_{L∞∩X}(x).  □
Proof of Theorem 45. Lemma 42 shows that for any distinct points y ≠ z ∈ Y and any ψ ∈ R^Y one has

Gy(ψ + t 1z) = Gy(ψ) + ∫_0^t Gyz(ψ + s 1z) ds.

Moreover, by Lemma 46, we know that the function Gyz is continuous. The fundamental theorem of calculus implies that f : t ↦ Gy(ψ + t 1z) is differentiable, with f′(0) = ∂Gy/∂1z(ψ) = Gyz(ψ). To compute the partial derivative of Gy with respect to 1y, we note that by invariance of Gy under addition of a constant,

Gy(ψ + t 1y) = Gy(ψ − t Σ_{z≠y} 1z).

The right-hand side of this expression is differentiable with respect to t, so that the left-hand side is also differentiable, and the chain rule gives

∂Gy/∂1y(ψ) = − Σ_{z∈Y\{y}} Gyz(ψ).

Using again the continuity of Gyz on R^Y, we obtain G ∈ C¹(R^Y). □


Strong concavity of Kantorovich's functional. We show here a strict monotonicity property of G, which corresponds to a concavity property of the Kantorovich functional K, since D²K = DG.
Theorem 47. Assume that c, X, Y, ρ are as in Theorem 45, and in addition that ρ(∂X) = 0 and that the set {ρ > 0} ∩ int(X) is connected. Define

S+ := {ψ ∈ R^Y | ∀y ∈ Y, Gy(ψ) > 0},
Sε := {ψ ∈ R^Y | ∀y ∈ Y, Gy(ψ) ≥ ε}.

• Kantorovich's functional is locally strongly concave on S+ ∩ {1Y}⊥:
  ∀ψ ∈ S+, ∀v ∈ {1Y}⊥ \ {0}, ⟨DG(ψ)v|v⟩ < 0.
• For every ε > 0, the set Sε ∩ {1Y}⊥ is compact.

Definition 19 (Irreducible matrix). A square matrix H is called irreducible if and only if the graph induced by H is connected³, i.e.

∀(a, b) ∈ {1, . . . , N}², ∃ i1 = a, i2, . . . , ik = b such that ∀j ∈ {1, . . . , k − 1}, H_{ij,ij+1} ≠ 0.
Lemma 48. Let H be a symmetric irreducible matrix such that Hij ≥ 0 for i ≠ j and Hii = − Σ_{j≠i} Hij. Then H is non-positive and ker H = R(1, . . . , 1) = R 1Y.
Proof. The non-positivity follows from Gershgorin's circle theorem. The lemma will be established if we prove that any vector in the kernel of H is constant. Consider v ∈ ker H and let i0 be an index where v attains its maximum, i.e. i0 ∈ arg max_{1≤i≤N} vi. Using (Hv)_{i0} = 0 and H_{i0,i0} = − Σ_{i≠i0} H_{i,i0}, one has

0 = Σ_{i≠i0} H_{i,i0} vi + H_{i0,i0} v_{i0} = Σ_{i≠i0} H_{i,i0} vi − Σ_{i≠i0} H_{i,i0} v_{i0} = Σ_{i≠i0} H_{i,i0} (vi − v_{i0}).

Since for every i ≠ i0 one has H_{i,i0} ≥ 0 and v_{i0} − vi ≥ 0, this implies that vi = v_{i0} for every i such that H_{i,i0} ≠ 0. By induction, using the connectedness of the graph induced by H, this shows that v has to be constant. □
³The graph induced by the N × N matrix H is the graph with vertices {1, . . . , N}, where i and j are linked by an edge if Hij ≠ 0.
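The conclusion of Lemma 48 can be checked numerically. The sketch below builds a random matrix with the required structure (the induced graph is complete, hence connected) and verifies with NumPy that its spectrum is non-positive with a one-dimensional kernel spanned by (1, . . . , 1):

import numpy as np

rng = np.random.default_rng(1)
N = 6
H = rng.uniform(0.1, 1.0, size=(N, N))
H = (H + H.T) / 2                        # symmetric, positive off-diagonal
np.fill_diagonal(H, 0.0)
np.fill_diagonal(H, -H.sum(axis=1))      # H_ii = -sum_{j != i} H_ij

eigvals = np.linalg.eigvalsh(H)
print(np.max(eigvals) <= 1e-10)          # non-positive spectrum
print(np.allclose(H @ np.ones(N), 0.0))  # (1, ..., 1) lies in the kernel
print(np.sum(eigvals > -1e-10) == 1)     # ... and the kernel is a line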
Lemma 49. Let U ⊆ R^d be a connected open set, and let S ⊆ R^d be a closed set such that vol^{d−1}(S) = 0. Then U \ S is path-connected.
Proof. It suffices to treat the case where U is an open ball; the general case follows by standard connectedness arguments. Let x, y ∈ U \ S be distinct points. Since U \ S is open, there exists r > 0 such that B(x, r) and B(y, r) are included in U \ S. Consider the hyperplane H orthogonal to the segment [x, y], and let ΠH be the orthogonal projection onto H. Since ΠH is 1-Lipschitz, vol^{d−1}(ΠH(S)) ≤ vol^{d−1}(S) = 0, so that H \ ΠH(S) is dense in the hyperplane H. In particular, there exists a point

z ∈ ΠH(B(x, r)) \ ΠH(S) = ΠH(B(y, r)) \ ΠH(S).

By construction the line z + R(y − x) avoids S and passes through the balls B(x, r) ⊆ U \ S and B(y, r) ⊆ U \ S. By convexity of U, this shows that the points x, y can be connected in U \ S. □
Proof of Theorem 47. Let Y = {y1, . . . , yN}. Fix ψ ∈ S+ and define

Hij := ∂G_{yi}/∂1_{yj}(ψ).
By Lemma 48, the first claim will hold if we prove that the matrix H is
irreducible. We define Z = int(X) ∩ {ρ > 0}, which by assumption is a
connected open set.
Step 1: We show here that for all i ∈ {1, . . . , N}, int(Lag_{yi}(ψ)) ∩ Z contains at least one point, which we denote by xi. Indeed, since ψ ∈ S+, we know that
ρ(Lagyi (ψ)) > 0. In addition, by Proposition 37, ρ(Lagyi (ψ) ∩ Lagyj (ψ)) = 0
for all j 6= i, and ρ(∂X) = 0 by assumption. This implies that ρ(Li ) =
ρ(Lagyi (ψ)) > 0, where
Li = {x ∈ Z | ∀j 6= i, c(x, yi ) + ψ(yi ) < c(x, yj ) + ψ(yj )} ⊆ int(Lagyi (ψ)).
We conclude by remarking that Li is contained in int(Lagyi (ψ)) ∩ Z, which
therefore has to be nonempty.
Step 2: Let S be the union of the facets that are common to at least three distinct Laguerre cells, i.e.

S = ∪_{y1,y2,y3 ∈ Y distinct} Lag_{y1,y2,y3}(ψ).

Then Z \ S is open and path-connected. Indeed, by the genericity assumption (Def. 18) we already know that vol^{d−1}(S) = 0, and Lemma 49 then implies that Z \ S is path-connected.
Step 3: Let x ∈ Z \ S be such that x ∈ Lag_{yi}(ψ) ∩ Lag_{yj}(ψ) for some i ≠ j. Then Hij > 0. To see this, we note that since x belongs to the complement of S,

c(x, yi) + ψ(yi) = c(x, yj) + ψ(yj),
∀k ∉ {i, j},  c(x, yi) + ψ(yi) < c(x, yk) + ψ(yk).

This implies that there exists a ball of radius r > 0 around x such that

∀x′ ∈ B(x, r), ∀k ∉ {i, j},  c(x′, yi) + ψ(yi) < c(x′, yk) + ψ(yk),

directly implying that

H_{yi yj}(ψ) ∩ B(x, r) ⊆ Lag_{yi yj}(ψ).
By the twist hypothesis and the inverse function theorem, H_{yi yj}(ψ) is a (d−1)-dimensional submanifold. In addition, ρ(x) > 0 because x belongs to Z. This implies that

Hij = ∫_{Lag_{yi yj}(ψ)} ρ(x′)/‖∇x c(x′, yi) − ∇x c(x′, yj)‖ dvol^{d−1}(x′)
    ≥ ∫_{H_{yi yj}(ψ) ∩ B(x,r)} ρ(x′)/‖∇x c(x′, yi) − ∇x c(x′, yj)‖ dvol^{d−1}(x′) > 0.

Step 4: We now fix i ≠ j ∈ {1, . . . , N} and the points xi, xj whose existence is established in Step 1:

xi ∈ int(Lag_{yi}(ψ)) ∩ Z,   xj ∈ int(Lag_{yj}(ψ)) ∩ Z,

so that in particular xi, xj belong to Z \ S. By Step 2, we get the existence of a continuous path γ ∈ C⁰([0, 1], Z \ S) such that γ(0) = xi and γ(1) = xj. We define a sequence of indices ik ∈ {1, . . . , N} by induction, starting from i0 = i. For k ≥ 0 we define tk = max{t ∈ [0, 1] | γ(t) ∈ Lag_{y_{ik}}(ψ)}. If tk = 1 we are done. If not, γ(tk) belongs to exactly two distinct Laguerre cells, and we define i_{k+1} ≠ ik so that γ(tk) ∈ Lag_{y_{ik}}(ψ) ∩ Lag_{y_{ik+1}}(ψ). By definition of tk as a maximum, the indices i0, . . . , ik must be pairwise distinct, so that tℓ = 1 after a finite number ℓ of iterations, and then iℓ = j. By Step 3, we get that H_{ik i_{k+1}} > 0 for every k ∈ {0, . . . , ℓ − 1}, proving that the matrix H is irreducible. Thus ker DG(ψ) = R 1Y by Lemma 48, implying the strict concavity property.
Compactness of Sε ∩ {1Y}⊥: By continuity of the function G, this set is closed. By Proposition 38-(vi), maxY ψ − minY ψ is bounded on the set Sε. This implies that Sε ∩ {1Y}⊥ is bounded, since every function in {1Y}⊥ has zero mean; the set is thus compact. □


Damped Newton algorithm and its convergence.


Proposition 50. Let G = (G1, . . . , GN) ∈ C¹(R^N, R^N) be a function satisfying the following properties:
(1) (Invariance and image) G is invariant under the addition of a constant, Gi(ψ) ≥ 0 and Σ_i Gi(ψ) = 1 for all ψ ∈ R^N.
(2) (Compactness) For any ε > 0 the set Sε ∩ {1}⊥ is compact, where
    Sε := {ψ ∈ R^N | ∀i, Gi(ψ) ≥ ε}  and  1 = (1, . . . , 1) ∈ R^N.
(3) (Strict monotonicity) The matrix DG(ψ) is symmetric non-positive, and
    ∀ψ ∈ Sε, ∀v ∈ {1}⊥ \ {0}, ⟨DG(ψ)v|v⟩ < 0.
Then Algorithm 4 terminates in a finite number of steps. More precisely, the iterates (ψ^{(k)}) of Algorithm 4 satisfy, for some τ* > 0,

‖G(ψ^{(k+1)}) − ν‖ ≤ (1 − τ*/2) ‖G(ψ^{(k)}) − ν‖.
Algorithm 4 Damped Newton algorithm

Input: A tolerance η > 0 and an initial ψ^{(0)} ∈ R^Y such that

ε := ½ min( min_{y∈Y} Gy(ψ^{(0)}), min_{y∈Y} νy ) > 0.    (4.65)

While ‖G(ψ^{(k)}) − ν‖_∞ > η:
  Step 1: Compute v^{(k)} satisfying
    DG(ψ^{(k)}) v^{(k)} = ν − G(ψ^{(k)})  and  Σ_{y∈Y} v^{(k)}(y) = 0.
  Step 2: Determine the minimum ℓ ∈ N such that ψ^{(k,ℓ)} := ψ^{(k)} + 2^{−ℓ} v^{(k)} satisfies
    • ∀y ∈ Y, Gy(ψ^{(k,ℓ)}) ≥ ε,
    • ‖G(ψ^{(k,ℓ)}) − ν‖ ≤ (1 − 2^{−(ℓ+1)}) ‖G(ψ^{(k)}) − ν‖.
  Step 3: Set ψ^{(k+1)} = ψ^{(k)} + 2^{−ℓ} v^{(k)} and k ← k + 1.

Output: A vector ψ^{(k)} that satisfies ‖G(ψ^{(k)}) − ν‖_∞ ≤ η.
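For concreteness, here is a minimal NumPy sketch of Algorithm 4. It assumes that callables G and DG evaluating the map and its Jacobian are available (for instance the grid-based approximations sketched after Remark 29 below); all names are illustrative. The Newton system is solved in the least-squares sense: since DG(ψ) is symmetric with kernel R1, the minimal-norm solution is automatically of zero mean.

import numpy as np

def damped_newton(G, DG, psi0, nu, eta=1e-8, max_iter=100):
    # Sketch of Algorithm 4; G(psi) returns the cell masses, DG(psi)
    # their Jacobian, nu is the target measure.
    psi = psi0.copy()
    eps = 0.5 * min(G(psi).min(), nu.min())   # as in (4.65)
    for _ in range(max_iter):
        g = G(psi)
        if np.linalg.norm(g - nu, np.inf) <= eta:
            break
        v = np.linalg.lstsq(DG(psi), nu - g, rcond=None)[0]
        v -= v.mean()                         # enforce sum_y v(y) = 0
        tau = 1.0
        for _ in range(60):                   # Step 2: backtracking
            cand = psi + tau * v
            if (G(cand).min() >= eps and
                    np.linalg.norm(G(cand) - nu)
                    <= (1 - tau / 2) * np.linalg.norm(g - nu)):
                break
            tau /= 2
        psi = cand
    return psi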

Proof.
Estimates. Let ν ∈ R^N be such that Σ_i νi = 1. We assume that ψ^{(0)} ∈ {1}⊥ is chosen so that

ε = ½ min( min_i Gi(ψ^{(0)}), min_i νi ) > 0,

and we let S := Sε ∩ {1}⊥. Let ψ ∈ S. By assumption, the matrix DG(ψ) is symmetric non-positive, and (using the invariance property) its kernel is the one-dimensional space R1. Thus, the equation

DG(ψ) v = ν − G(ψ),   Σ_i vi = 0,

has a unique solution, which we denote v(ψ), and we let ψτ = ψ + τ v(ψ). By continuity of DG over the compact domain S, the non-zero eigenvalues of −DG(ψ) lie in [a, A] for some 0 < a ≤ A < +∞. In particular, there exists a constant M > 0 such that for all ψ ∈ S,

‖G(ψ) − ν‖/A ≤ ‖v(ψ)‖ ≤ ‖G(ψ) − ν‖/a ≤ M.    (4.66)

In particular, the function F : (ψ, τ) ∈ S × [0, 1] ↦ ψτ is continuous. Since S × [0, 1] is compact, K := F(S × [0, 1]) is also compact. Then, by uniform continuity of DG over K, there exists an increasing function ω such that lim_{t→0} ω(t) = 0 and ‖DG(ψ) − DG(ψ′)‖ ≤ ω(‖ψ − ψ′‖) for all ψ, ψ′ ∈ K. Since G is of class C¹, a Taylor expansion in τ gives

G(ψτ) = G(ψ + τ v(ψ)) = (1 − τ) G(ψ) + τ ν + R(τ),    (4.67)
where R(τ) = ∫_0^τ (DG(ψt) − DG(ψ)) v(ψ) dt is the integral remainder. Then, we can bound the norm of R(τ) for τ ∈ [0, 1]:

‖R(τ)‖ = ‖∫_0^τ (DG(ψt) − DG(ψ)) v(ψ) dt‖
       ≤ ‖v(ψ)‖ ∫_0^τ ω(‖ψt − ψ‖) dt
       ≤ ‖v(ψ)‖ τ ω(τ ‖v(ψ)‖).    (4.68)

To establish the first inequality, we used that ψ and ψt belong to the compact set K, and for the second one that ω is increasing and that t ∈ [0, τ].

Linear convergence. We first show the existence of τ1* > 0 such that for all ψ ∈ S and τ ∈ [0, τ1*], one has ψτ ∈ S. By definition of ε, for every i ∈ {1, . . . , N} one has νi ≥ 2ε and Gi(ψ) ≥ ε. Using (4.67) and (4.68), one deduces a lower bound on Gi(ψτ):

Gi(ψτ) ≥ (1 − τ) Gi(ψ) + τ νi − ‖R(τ)‖
       ≥ (1 + τ) ε − ‖R(τ)‖
       ≥ ε + τ (ε − M ω(τ M)).

If we choose τ1* > 0 small enough so that M ω(τ1* M) ≤ ε, this implies that ψτ ∈ S for all ψ ∈ S and τ ∈ [0, τ1*].
We now prove that there exists τ2* > 0 such that for τ ∈ [0, τ2*], one has ‖G(ψτ) − ν‖ ≤ (1 − τ/2) ‖G(ψ) − ν‖. From Equation (4.67), we have G(ψτ) − ν = (1 − τ)(G(ψ) − ν) + R(τ), and it is therefore sufficient to prove that

‖R(τ)‖ ≤ (τ/2) ‖G(ψ) − ν‖.

With the upper bound on R(τ) given in Equation (4.68), combined with the two bounds on ‖v(ψ)‖ of Equation (4.66), this condition holds provided that τ is such that ω(τ M)/a ≤ 1/2.
These two bounds directly imply that the step lengths τ^{(k)} = 2^{−ℓ} chosen in Algorithm 4 always satisfy τ^{(k)} ≥ τ* with τ* = ½ min(τ1*, τ2*), so that

‖G(ψ^{(k+1)}) − ν‖ ≤ (1 − τ*/2) ‖G(ψ^{(k)}) − ν‖.

This establishes the linear convergence of Algorithm 4. □

Application to optimal transport. The damped Newton algorithm can be used to solve the semi-discrete optimal transport problem when applied to the function G given by Gy(ψ) = ρ(Lagy(ψ)). This function satisfies the assumptions of Proposition 50: it is of class C¹ by Theorem 45, it satisfies the compactness and strict monotonicity properties by Theorem 47, and it clearly satisfies Assumption (1). We therefore have the following theorem:
Theorem 51. We make the following assumptions:
• c ∈ C²(ΩX × ΩY) satisfies the twist condition (Def. 8),
• Y is generic with respect to c and ∂X (Def. 18),
• ρ(∂X) = 0 and ρ|X ∈ C⁰(X) is such that the set {ρ > 0} ∩ int(X) is connected.
Then Algorithm 4 terminates in a finite number of steps. More precisely, the iterates (ψ^{(k)}) of Algorithm 4 satisfy, for some τ* > 0,

‖G(ψ^{(k+1)}) − ν‖ ≤ (1 − τ*/2) ‖G(ψ^{(k)}) − ν‖.

Remark 28 (Quadratic convergence). The above theorem shows that the convergence of the damped Newton algorithm is globally linear. When the cost c satisfies the Ma–Trudinger–Wang (MTW) condition that appears in the regularity theory of optimal transport, and when the density function ρ is Lipschitz-continuous, the convergence is even locally quadratic [65].
Remark 29 (Implementation). The most difficult part in the implementation of both Oliker–Prussner's algorithm and the damped Newton algorithm is the computation of the Laguerre tessellation. In several interesting cases, Laguerre cells can be obtained by intersecting power diagrams with surfaces, such as planes, spheres or triangulated surfaces. Recall that the power diagram of a weighted point cloud P = (p1, . . . , pN) ∈ (R^d)^N with weights (ω1, . . . , ωN) ∈ R^N is defined by the cells

Powi := {x ∈ R^d | ∀j, ‖x − pi‖² + ωi ≤ ‖x − pj‖² + ωj}.

This diagram can be computed efficiently using libraries such as Cgal or Geogram. When c(x, y) = ‖x − y‖² is the quadratic cost and X is a triangulated surface in R³, the Laguerre cells can be obtained by intersecting power cells with the triangulated surface X [72]. This approach is also used in several inverse problems arising in nonimaging optics that correspond to optimal transport problems. For instance, when c(x, y) = −log(1 − ⟨x|y⟩) is the reflector cost on the unit sphere, the Laguerre cells are obtained by intersecting power cells with the unit sphere [74, 37].
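When an exact Laguerre-cell library is not at hand, one can already experiment with a brute-force approximation: discretize X on a grid and assign each grid point to the cell minimizing c(x, y) + ψ(y). The sketch below does this for the quadratic cost on X = [0, 1]²; it only illustrates the definition Gy(ψ) = ρ(Lagy(ψ)) and is not a substitute for the exact geometric computations discussed above.

import numpy as np

n = 256
xs = (np.arange(n) + 0.5) / n
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
rho = np.full(len(grid), 1.0 / len(grid))    # uniform density on X

def G(psi, Y):
    # Mass of each (approximate) Laguerre cell for the quadratic cost.
    cost = ((grid[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    owner = (cost + psi[None, :]).argmin(axis=1)
    return np.bincount(owner, weights=rho, minlength=len(Y))

Y = np.array([[0.25, 0.25], [0.75, 0.25], [0.5, 0.75]])
print(G(np.zeros(3), Y))    # three cell masses, summing to 1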
4.4. Semi-discrete entropic transport. The semi-discrete entropic transport problem was introduced by Genevay, Cuturi, Peyré and Bach [52] as a regularization of high-dimensional optimal transport problems, see also [36]. Such high-dimensional problems occur for instance in image generation; we refer to the work of Galerne, Leclaire and Rabin [48]. Our goal here is to investigate briefly the relation between the semi-discrete Kantorovich functional K and its entropically regularized variant Kη.

Let X ⊆ ΩX be compact and Y ⊆ ΩY be finite. We recall that the entropy of a probability measure is

H(ρ) = ∫_{ΩX} ρ log ρ  if ρ ∈ P^{ac}(X),  and  H(ρ) = +∞ otherwise.    (4.69)

We also recall that if γ is a transport plan between a probability density ρ in P^{ac}(X) and a finitely supported measure ν = Σ_{y∈Y} νy δy, then there exist
densities ρy ∈ M⁺(X) ∩ L¹(X) such that γ = Σ_y ρy ⊗ δy, and which satisfy the two marginal conditions

Σ_y ρy = ρ  and  ∫_X ρy = νy.    (4.70)

Then, the entropy of γ with respect to vol^d ⊗ vol⁰ is the sum of the entropies of the ρy. This leads to the following definition.

Definition 20 (Semi-discrete entropic transport). The entropy-regularized semi-discrete optimal transport problem between a density ρ ∈ P^{ac}(X) and a finitely supported measure ν = Σ_{y∈Y} νy δy is defined for any η > 0 by

(KP)η = min { ⟨c|γ⟩ + η Σ_{y∈Y} H(ρy) | ρy ∈ L¹(X), Σ_{y∈Y} ρy = ρ, ∫_X ρy = νy }.

Dual problem. The dual problem is constructed, as always, by introducing Lagrange multipliers ϕ, ψ for the marginal constraints (4.70). We skip the derivation of the dual problem, which is very similar to the one presented in the discrete case (Section 3.3), and we directly state it:

(DP)η = sup_{(ϕ,ψ) ∈ L¹(X) × R^Y} ⟨ϕ|ρ⟩ − ⟨ψ|ν⟩ − η Σ_{y∈Y} ∫_X e^{−(c(x,y)+ψ(y)−ϕ(x))/η} dx.    (4.71)

Maximizing with respect to ϕ for a given ψ ∈ R^Y, we obtain a second formulation as a finite-dimensional optimization problem involving a regularized Kantorovich functional, exactly as in §3.3:

(DP)′η = sup_{ψ∈R^Y} Kη(ψ),    (4.72)

where

Kη(ψ) := −η ∫_X log( Σ_{y∈Y} e^{−(c(x,y)+ψ(y))/η} ) ρ(x) dx − ⟨ψ|ν⟩ + η H(ρ).

In order to express the gradient and the Hessian of Kη, we also need the notion of smoothed Laguerre cells, introduced in the discrete case (see Equation (3.46)) and defined by

RLag^η_y(ψ) = e^{−(c(·,y)+ψ(y))/η} / Σ_{z∈Y} e^{−(c(·,z)+ψ(z))/η}.    (4.73)
Theorem 52. Assume that c ∈ C¹(X × Y) is twisted. Then Kη is a C², strictly concave function over R^Y, with first-order partial derivatives

∀y ∈ Y,  ∂Kη/∂1y(ψ) = G^η_y(ψ) − νy,  with  G^η_y(ψ) := ⟨RLag^η_y(ψ)|ρ⟩,    (4.74)

and second-order partial derivatives

∀y ≠ z ∈ Y,  ∂²Kη/∂1z∂1y(ψ) = G^η_{yz}(ψ) := (1/η) ⟨RLag^η_y(ψ) RLag^η_z(ψ)|ρ⟩,
∀y ∈ Y,  ∂²Kη/∂1y²(ψ) = G^η_{yy}(ψ) := − Σ_{z≠y} G^η_{yz}(ψ).    (4.75)

If ψ is a maximizer in (DP)′η, then the solution to (KP)η is given by

γ = Σ_y ρy ⊗ δy,  with  ρy = RLag^η_y(ψ) ρ.

We skip the proof of this theorem, which follows closely the one of Theorem 32 in the discrete case.
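The quantities appearing in Theorem 52 are straightforward to evaluate numerically. The sketch below computes the smoothed cell masses G^η_y(ψ) = ⟨RLag^η_y(ψ)|ρ⟩ on a grid discretization of X = [0, 1]² for the quadratic cost (the grid, the density and the test points are our own choices), using a numerically stable softmax; letting η decrease illustrates the convergence G^η → G stated in Proposition 53 below.

import numpy as np

n = 256
xs = (np.arange(n) + 0.5) / n
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
rho = np.full(len(grid), 1.0 / len(grid))
Y = np.array([[0.25, 0.25], [0.75, 0.25], [0.5, 0.75]])

def G_eta(psi, eta):
    # Smoothed cell masses <RLag^eta_y(psi) | rho> via a stable softmax.
    cost = ((grid[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    a = -(cost + psi[None, :]) / eta
    a -= a.max(axis=1, keepdims=True)        # log-sum-exp stabilization
    w = np.exp(a)
    rlag = w / w.sum(axis=1, keepdims=True)  # RLag^eta_y(psi) on the grid
    return rho @ rlag

for eta in [1.0, 0.1, 0.01]:
    print(eta, G_eta(np.zeros(3), eta))      # tends to the exact masses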
Strong convergence of Kη to K. The next proposition shows that for twisted costs, G^η converges to G locally uniformly (i.e. Kη converges to K in C¹). Its proof follows closely the proof of the Lipschitz estimate of Proposition 41.
Proposition 53. Assume that c ∈ C²(ΩX × ΩY) is twisted (Def. 8), that X ⊆ ΩX is compact, that Y ⊆ ΩY is finite, and that ρ ∈ P^{ac}(X) ∩ L^∞(X). Then:
(i) G^η converges pointwise to G as η → 0, i.e.
    ∀y ∈ Y, RLag^η_y(ψ) → 1_{Lagy(ψ)} in L¹(X) as η → 0.
(ii) G^η is L-Lipschitz, where L depends on c, X and N only.
(iii) G^η converges locally uniformly to G.

Remark 30. The formula (4.75) implies that G^η is (1/η)-Lipschitz continuous, see [52] or Remark 5.1 in [83]. When the cost is twisted, the previous proposition shows that the family of functions (G^η)_{η>0} is in fact uniformly Lipschitz.
Proof. (i) To prove this statement, it suffices to remark that

lim_{η→0} RLag^η_y(ψ)(x) = 1 if x ∈ SLagy(ψ), and 0 if x ∈ X \ Lagy(ψ).
Thus, RLagηy (ψ) converges to 1Lagy (ψ) pointwise on the complement in X
of Lagy (ψ) \ SLagy (ψ). Since the cost is twisted, Lagy (ψ) \ SLagy (ψ) is
Lebesgue-negligible, therefore proving that RLagηy (ψ) converges almost ev-
erywhere to 1Lagy (ψ) . One concludes by applying Lebesgue’s dominated con-
vergence theorem.
(ii) To prove that G^η is Lipschitz, we compute an upper bound on DG^η = D²Kη, recalling that for z ≠ y,

∂²Kη/∂1z∂1y(ψ) = (1/η) ⟨RLag^η_y(ψ) RLag^η_z(ψ)|ρ⟩.
To get such an upper bound, as in Proposition 41, we will apply the co-area formula using the function

f(x) = c(x, z) + ψ(z) − (c(x, y) + ψ(y)).

We note that

RLag^η_y(ψ)(x) RLag^η_z(ψ)(x) = e^{−(c(x,y)+ψ(y)+c(x,z)+ψ(z))/η} / (Σ_{w∈Y} e^{−(c(x,w)+ψ(w))/η})².

When f(x) ≥ 0, we use the equality c(x, z) + ψ(z) = c(x, y) + ψ(y) + f(x) to obtain the upper bound

e^{−(c(x,y)+ψ(y)+c(x,z)+ψ(z))/η} / (Σ_{w∈Y} e^{−(c(x,w)+ψ(w))/η})²
  = (e^{−(c(x,y)+ψ(y))/η})² e^{−f(x)/η} / (Σ_{w∈Y} e^{−(c(x,w)+ψ(w))/η})²
  ≤ e^{−f(x)/η}.

Reasoning similarly when f(x) ≤ 0, we obtain

RLag^η_y(ψ)(x) RLag^η_z(ψ)(x) ≤ e^{−|f(x)|/η}.
Using the co-area formula and the previous upper bound, we get

⟨RLag^η_y(ψ) RLag^η_z(ψ)|ρ⟩ = ∫_X RLag^η_y(ψ)(x) RLag^η_z(ψ)(x) ρ(x) dx
  = ∫_{−∞}^{+∞} ∫_{f^{−1}(t)} RLag^η_y(ψ)(x) RLag^η_z(ψ)(x) (ρ(x)/‖∇f(x)‖) dvol^{d−1}(x) dt
  ≤ ∫_{−∞}^{+∞} e^{−|t|/η} ∫_{f^{−1}(t)∩X} (ρ(x)/‖∇f(x)‖) dvol^{d−1}(x) dt.
We now apply Lemma 43, which gives an upper bound on vol^{d−1}(f^{−1}(t) ∩ X) in terms of the constants κyz = min_X ‖∇f‖ and Myz = max_X ‖D²f‖:

⟨RLag^η_y(ψ) RLag^η_z(ψ)|ρ⟩ ≤ c(d) (‖ρ‖_∞/κyz) (1 + (Myz/κyz) diam(X))^{d−1} diam(X) ∫_{−∞}^{+∞} e^{−|t|/η} dt
  ≤ C η,

where the constant C depends on the domain, ρ and the cost only. In other words, for z ≠ y,

|∂²Kη/∂1z∂1y(ψ)| = (1/η) ⟨RLag^η_y(ψ) RLag^η_z(ψ)|ρ⟩ ≤ C.

A similar upper bound holds for the diagonal elements, which by (4.75) are sums of off-diagonal ones, thus ensuring that G^η is L-Lipschitz with L independent of η. (iii) follows at once from pointwise convergence and the uniform Lipschitz estimate. □
To finish this section, we show that under the genericity assumption in-
troduced in Section 4.3, the Hessian of Kantorovich’s regularized functional
D2 Kη converges pointwise to D2 K as η converges to 0.
Theorem 54. Assume that c ∈ C¹(X × Y) is twisted (Def. 8), that Y is generic with respect to c and ∂X (Def. 18), and that ρ|X ∈ C⁰(X). Then,

∀ψ ∈ R^Y,  lim_{η→0} DG^η(ψ) = DG(ψ).

Proof. We let f(x) = c(x, z) + ψ(z) − (c(x, y) + ψ(y)) and fix ε > 0. From the proof of Proposition 53, one has for every x ∈ X \ f^{−1}([−ε, ε]) that

RLag^η_y(ψ)(x) RLag^η_z(ψ)(x) ≤ e^{−ε/η}.

Since e^{−ε/η}/η → 0, this implies that

lim_{η→0} (1/η) ∫_{X\f^{−1}([−ε,ε])} RLag^η_y(ψ)(x) RLag^η_z(ψ)(x) ρ(x) dx = 0.

Since the smoothed Laguerre cells are non-negative on X, we get

G^η_{yz}(ψ) = (1/η) ∫_X RLag^η_y(ψ)(x) RLag^η_z(ψ)(x) ρ(x) dx ∼_{η→0} G^{η,ε}_{yz}(ψ),

where

G^{η,ε}_{yz}(ψ) := (1/η) ∫_{X∩f^{−1}([−ε,ε])} RLag^η_y(ψ)(x) RLag^η_z(ψ)(x) ρ(x) dx.

As in the proof of Lemma 46, we construct Φ : [−ε, ε] × M → ΩX (for some positive ε), with M = f^{−1}(0), by solving the Cauchy problem

Φ(0, x) = x,   (d/dt) Φ(t, x) = ∇f(Φ(t, x)) / ‖∇f(Φ(t, x))‖²,

so that f(Φ(t, f^{−1}(0))) = t. Then one has Φ([−ε, ε] × M) = X ∩ f^{−1}([−ε, ε]), and by the change of variables formula, using the definition of the smoothed indicator functions of the Laguerre cells and the definition of f, we get

G^{η,ε}_{yz}(ψ) = (1/η) ∫_M ∫_{−ε}^{ε} RLag^η_y(ψ)(Φ(t, x)) RLag^η_z(ψ)(Φ(t, x)) ρ(Φ(t, x)) JΦ(t, x) dt dvol^{d−1}(x).

Remark that

RLag^η_y(ψ)(Φ(t, x)) RLag^η_z(ψ)(Φ(t, x)) = e^{−(c(Φ(t,x),y)+ψ(y)+c(Φ(t,x),z)+ψ(z))/η} / (Σ_{w∈Y} e^{−(c(Φ(t,x),w)+ψ(w))/η})²
  = χη(t, x) e^{−|f(Φ(t,x))|/η}
  = χη(t, x) e^{−|t|/η},

where we put

χη(t, x) := e^{−(2/η) min(c(Φ(t,x),y)+ψ(y), c(Φ(t,x),z)+ψ(z))} / (Σ_{w∈Y} e^{−(c(Φ(t,x),w)+ψ(w))/η})².
We deduce that

G^{η,ε}_{yz}(ψ) = ∫_M gη(x) dvol^{d−1}(x),

with

gη(x) = ∫_{−ε}^{ε} χη(t, x) (e^{−|t|/η}/η) ρ(Φ(t, x)) JΦ(t, x) dt.
We first note that |χη(t, x)| ≤ 1, so that (gη)η is bounded in L¹(M). We now prove that gη(x) converges to g(x) = ρ(x) JΦ(0, x) 1_{Lagyz(ψ)}(x) vol^{d−1}-almost everywhere. More precisely, we show convergence for any x belonging to the following set E, which has full vol^{d−1} measure in Lagyz(ψ) by the genericity assumption (Def. 18):

E = Lagyz(ψ) \ ( ∂X ∪ ∪_{w∉{y,z}} Lagw(ψ) ).

We split the integral defining gη by distinguishing the cases t ≤ 0 and t ≥ 0. For t ≥ 0,

t = f(Φ(t, x)) = c(Φ(t, x), z) + ψ(z) − (c(Φ(t, x), y) + ψ(y)) ≥ 0,

giving

χη(t, x) = (e^{−(c(Φ(t,x),y)+ψ(y))/η})² / (Σ_{w∈Y} e^{−(c(Φ(t,x),w)+ψ(w))/η})²
  = ( Σ_{w∈Y} e^{−(c(Φ(t,x),w)+ψ(w)−(c(Φ(t,x),y)+ψ(y)))/η} )^{−2}
  = ( 1 + e^{−t/η} + rη(t, x) )^{−2},

with

rη(t, x) = Σ_{w∈Y\{y,z}} e^{−(c(Φ(t,x),w)+ψ(w)−(c(Φ(t,x),y)+ψ(y)))/η}.

Now, by assumption on the point x, for any w ∉ {y, z} one has c(x, y) + ψ(y) < c(x, w) + ψ(w), so that rη(t, x) is negligible in the limit η → 0. A similar computation can be done for t ≤ 0, giving us the estimate
gη(x) ∼_{η→0} ∫_{−ε}^{ε} e^{−|t|/η} / (η (1 + e^{−|t|/η})²) ρ(Φ(t, x)) JΦ(t, x) dt → ρ(x) JΦ(0, x)  as η → 0.
On the other hand, one can show that for almost every x in M not belonging to Lagyz(ψ), |χη(t, x)| tends to zero as η → 0, implying that gη(x) also converges to 0. In other words,

gη(x) → ρ(x) JΦ(0, x) 1_{Lagyz(ψ)}(x) = (ρ(x)/‖∇f(x)‖) 1_{Lagyz(ψ)}(x)  for vol^{d−1}-a.e. x ∈ M.
By Lebesgue's dominated convergence theorem, we get

lim_{η→0} G^η_{yz}(ψ) = lim_{η→0} ∫_M gη(x) dvol^{d−1}(x) = ∫_{Lagyz(ψ)} (ρ(x)/‖∇f(x)‖) dvol^{d−1}(x).

From the relation ‖∇f(x)‖ = ‖∇x c(x, y) − ∇x c(x, z)‖, we get, as desired, lim_{η→0} G^η_{yz}(ψ) = Gyz(ψ). □

5. Appendix
5.1. Convex analysis. We recall a few relevant definitions and facts from
convex analysis (adapted to concave functions).

Definition 21. The superdifferential of a function F : R^N → R ∪ {−∞} at x ∈ R^N is the set of vectors v ∈ R^N such that

∀y ∈ R^N,  F(y) ≤ F(x) + ⟨v|y − x⟩.

This set is denoted ∂⁺F(x).

Proposition 55. The following hold:
• A function F : R^N → R is concave if and only if
  ∀x ∈ R^N, ∂⁺F(x) ≠ ∅.
• For a concave function F, the superdifferential can be characterized by ([86, Theorem 25.6]):
  ∂⁺F(x) = conv{ lim_{n→∞} ∇F(xn) | (xn)_{n∈N} ∈ S },    (5.76)
  where conv(Z) denotes the convex envelope of the set Z and
  S = {(xn)_{n∈N} | lim_{n→∞} xn = x, ∇F(xn) exists for all n ≥ 1, and lim_{n→∞} ∇F(xn) exists}.

5.2. Coarea formula. We consider two Riemannian submanifolds M and N, of dimensions m and n respectively, of two Euclidean spaces, and assume that n ≤ m. Let Φ : M → N be a function of class C¹ between the two manifolds. The Jacobian determinant of Φ at x is defined by

JΦ(x) = √(det(DΦ(x) DΦ(x)^T)).    (5.77)

Note that if M = R^m, N = R^n and Φ = (Φ1, . . . , Φn), one has

DΦ(x) DΦ(x)^T = (⟨∇Φi(x)|∇Φj(x)⟩)_{1≤i,j≤n}.

In particular, for n = 1 one has JΦ(x) = ‖∇Φ1(x)‖, and for n = 2 one gets

JΦ(x)² = ‖∇Φ1(x)‖² ‖∇Φ2(x)‖² − ⟨∇Φ1(x)|∇Φ2(x)⟩²,

which by the Cauchy–Schwarz inequality is always non-negative and vanishes if and only if ∇Φ1(x) and ∇Φ2(x) are collinear.
Theorem 56 (Coarea formula). Let Φ : M → N be a function of class C¹. For every vol^m-measurable function u : M → R, one has

∫_M u(x) JΦ(x) dvol^m(x) = ∫_N ∫_{Φ^{−1}(y)} u(x) dvol^{m−n}(x) dvol^n(y).

If JΦ(x) does not vanish on a measurable subset E ⊆ M, then

∫_E u(x) dvol^m(x) = ∫_N ∫_{Φ^{−1}(y)∩E} (u(x)/JΦ(x)) dvol^{m−n}(x) dvol^n(y).    (5.78)

In particular, if m = n and Φ : M → N is a homeomorphism of class C¹, letting v = u ∘ Φ^{−1} : N → R, one recovers the change-of-variables formula

∫_M v(Φ(x)) JΦ(x) dvol^m(x) = ∫_N v(y) dvol^n(y).    (5.79)
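As a quick sanity check, the change-of-variables formula (5.79) can be verified numerically in dimension m = n = 1, here with Φ(x) = x² on M = [0, 1] (so that JΦ(x) = 2x) and v(y) = cos y, both sides being equal to sin(1):

import numpy as np

N = 1_000_000
x = (np.arange(N) + 0.5) / N             # midpoint rule on [0, 1]
lhs = np.mean(np.cos(x ** 2) * 2 * x)    # int_M v(Phi(x)) J_Phi(x) dx
rhs = np.mean(np.cos(x))                 # int_N v(y) dy
print(lhs, rhs)                          # both ~ sin(1) = 0.84147...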

References
1. Pankaj K Agarwal and R Sharathkumar, Approximation algorithms for bipartite
matching with metric and geometric costs, Proceedings of the forty-sixth annual ACM
symposium on Theory of computing, ACM, 2014, pp. 555–564.
2. Martial Agueh and Guillaume Carlier, Barycenters in the wasserstein space, SIAM
Journal on Mathematical Analysis 43 (2011), no. 2, 904–924.
3. Jason Altschuler, Jonathan Weed, and Philippe Rigollet, Near-linear time approxi-
mation algorithms for optimal transport via sinkhorn iteration, Advances in Neural
Information Processing Systems, 2017, pp. 1964–1974.
4. Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré, Gradient flows: in metric spaces
and in the space of probability measures, Springer Science & Business Media, 2008.
5. Luigi Ambrosio, Federico Glaudo, and Dario Trevisan, On the optimal map in the
2-dimensional random matching problem, arXiv preprint arXiv:1903.12153, 2019.
6. Franz Aurenhammer, Friedrich Hoffmann, and Boris Aronov, Minkowski-type theorems
and least-squares clustering, Algorithmica 20 (1998), no. 1, 61–76.
7. J-D Benamou, Yann Brenier, and Kevin Guittet, The monge–kantorovitch mass trans-
fer and its computational fluid mechanics formulation, International Journal for Nu-
merical methods in fluids 40 (2002), no. 1-2, 21–30.
8. Jean-David Benamou and Yann Brenier, Mixed L2 -Wasserstein optimal mapping be-
tween prescribed density functions, Journal of Optimization Theory and Applications
111 (2001), no. 2, 255–271.
9. Jean-David Benamou and Guillaume Carlier, Augmented lagrangian methods for
transport optimization, mean field games and degenerate elliptic equations, Journal
of Optimization Theory and Applications 167 (2015), no. 1, 1–26.
10. Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel
Peyré, Iterative bregman projections for regularized transportation problems, SIAM
Journal on Scientific Computing 37 (2015), no. 2, A1111–A1138.
11. Jean-David Benamou, Guillaume Carlier, and Luca Nenna, A numerical method to
solve multi-marginal optimal transport problems with coulomb cost, Splitting Methods
in Communication, Imaging, Science, and Engineering, Springer, 2016, pp. 577–601.
12. Jean-David Benamou and Vincent Duval, Minimal convex extensions and finite dif-
ference discretisation of the quadratic monge–kantorovich problem, European Journal
of Applied Mathematics 30 (2019), no. 6, 1041–1078.
13. Jean-David Benamou, Brittany D Froese, and Adam M Oberman, Numerical solution
of the optimal transportation problem using the monge–ampère equation, Journal of
Computational Physics 260 (2014), 107–126.
14. Robert J. Berman, The Sinkhorn algorithm, parabolic optimal transport and geometric
Monge-Ampère equations, arXiv preprint arXiv:1712.03082, 2017.
15. Robert J Berman, Convergence rates for discretized monge-ampére equations and
quantitative stability of optimal transport, arXiv preprint arXiv:1803.00785, 2018.

16. D.P. Bertsekas, A new algorithm for the assignment problem, Mathematical Program-
ming 21 (1981), no. 1, 152–171.
17. D.P. Bertsekas and J. Eckstein, Dual coordinate step methods for linear network flow
problems, Mathematical Programming 42 (1988), no. 1, 203–243.
18. Garrett Birkhoff, Tres observaciones sobre el algebra lineal, Univ. Nac. Tucuman, Ser.
A 5 (1946), 147–154.
19. Yann Brenier, Polar factorization and monotone rearrangement of vector-valued func-
tions, Communications on pure and applied mathematics 44 (1991), no. 4, 375–417.
20. , Minimal geodesics on groups of volume-preserving maps and generalized so-
lutions of the euler equations, Communications on Pure and Applied Mathematics: A
Journal Issued by the Courant Institute of Mathematical Sciences 52 (1999), no. 4,
411–452.
21. R.E. Burkard, M. Dell’Amico, and S. Martello, Assignment problems, Society for
Industrial Mathematics, 2009.
22. Giuseppe Buttazzo, Luigi De Pascale, and Paola Gori-Giorgi, Optimal-transport for-
mulation of electronic density-functional theory, Physical Review A 85 (2012), no. 6,
062502.
23. Giuseppe Buttazzo, Chloé Jimenez, and Edouard Oudet, An optimization problem for
mass transportation with congested dynamics, SIAM Journal on Control and Opti-
mization 48 (2009), no. 3, 1961–1976.
24. LA Caffarelli and VI Oliker, Weak solutions of one inverse problem in geometric optics,
Journal of Mathematical Sciences 154 (2008), no. 1, 39–49.
25. Luis Caffarelli and Robert J McCann, Free boundaries in optimal transport and monge-
ampere obstacle problems, Annals of mathematics 171 (2010), no. 2, 673–730.
26. Luis A Caffarelli, Sergey A Kochengin, and Vladimir I Oliker, Problem of reflector
design with given far-field scattering data, Monge Ampère Equation: Applications to
Geometry and Optimization: NSF-CBMS Conference on the Monge Ampère Equa-
tion, Applications to Geometry and Optimization, July 9-13, 1997, Florida Atlantic
University, vol. 226, American Mathematical Soc., 1999, p. 13.
27. Guillaume Carlier, Victor Chernozhukov, Alfred Galichon, et al., Vector quantile re-
gression: an optimal transport approach, The Annals of Statistics 44 (2016), no. 3,
1165–1192.
28. Jose A Carrillo, Katy Craig, Li Wang, and Chaozhen Wei, Primal dual methods for
wasserstein gradient flows, arXiv preprint arXiv:1901.08081, 2019.
29. Benjamin Charlier, Jean Feydy, Joan Alexis Glaunes, and Alain Trouvé, An efficient
kernel product for automatic differentiation libraries, with applications to measure
transport, Working version, 2017.
30. Victor Chernozhukov, Alfred Galichon, Marc Hallin, Marc Henry, et al., Monge–
kantorovich depth, quantiles, ranks and signs, The Annals of Statistics 45 (2017),
no. 1, 223–256.
31. Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard, An
interpolating distance between optimal transport and fisher–rao metrics, Foundations
of Computational Mathematics 18 (2018), no. 1, 1–44.
32. Codina Cotar, Gero Friesecke, and Claudia Klüppelberg, Density functional theory
and optimal transportation with coulomb cost, Communications on Pure and Applied
Mathematics 66 (2013), no. 4, 548–599.
33. Keenan Crane, Clarisse Weischedel, and Max Wardetzky, Geodesics in heat: A new
approach to computing distance based on heat flow, ACM Transactions on Graphics
(TOG) 32 (2013), no. 5, 152.
34. Michael JP Cullen and R James Purser, An extended lagrangian theory of semi-
geostrophic frontogenesis, Journal of the atmospheric sciences 41 (1984), no. 9, 1477–
1497.
35. Marco Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport, Ad-
vances in neural information processing systems, 2013, pp. 2292–2300.
36. Marco Cuturi and Gabriel Peyré, Semidual regularized optimal transport, SIAM Re-
view 60 (2018), no. 4, 941–965.

37. Pedro Machado Manhães de Castro, Quentin Mérigot, and Boris Thibert, Far-field
reflector problem and intersection of paraboloids, Numerische Mathematik 134 (2016),
no. 2, 389–411.
38. Fernando De Goes, Katherine Breeden, Victor Ostromoukhov, and Mathieu Des-
brun, Blue noise through optimal transport, ACM Transactions on Graphics (TOG)
31 (2012), no. 6, 171.
39. Fernando de Goes, Corentin Wallez, Jin Huang, Dmitry Pavlov, and Mathieu Desbrun,
Power particles: an incompressible fluid solver based on power diagrams., ACM Trans.
Graph. 34 (2015), no. 4, 50–1.
40. Frédéric De Gournay, Jonas Kahn, and Léo Lebrat, 3/4-discrete optimal transport,
arXiv preprint arXiv:1806.09537, 2018.
41. , Differentiation and regularity of semi-discrete optimal transport with respect
to the parameters of the discrete measure, Numerische Mathematik 141 (2019), no. 2,
429–453.
42. Roberto De Leo, Cristian E Gutiérrez, and Henok Mawi, On the numerical solution
of the far field refractor problem, Nonlinear Analysis 157 (2017), 123–145.
43. J. Edmonds and R.M. Karp, Theoretical improvements in algorithmic efficiency for
network flow problems, Journal of the ACM (JACM) 19 (1972), no. 2, 248–264.
44. Matthias Erbar, Martin Rumpf, Bernhard Schmitzer, and Stefan Simon, Com-
putation of optimal transport on discrete metric measure spaces, arXiv preprint
arXiv:1707.06859, 2017.
45. Jean Feydy, Pierre Roussillon, Alain Trouvé, and Pietro Gori, Fast and scalable opti-
mal transport for brain tractograms, International Conference on Medical Image Com-
puting and Computer-Assisted Intervention, Springer, 2019, pp. 636–644.
46. Brittany D Froese, A numerical method for the elliptic Monge–Ampère equation with
transport boundary conditions, SIAM Journal on Scientific Computing 34 (2012),
no. 3, A1432–A1459.
47. H.N. Gabow and R.E. Tarjan, Faster scaling algorithms for network problems, SIAM
Journal on Computing 18 (1989), 1013.
48. Bruno Galerne, Arthur Leclaire, and Julien Rabin, A texture synthesis model based
on semi-discrete optimal transport in patch space, SIAM Journal on Imaging Sciences
11 (2018), no. 4, 2456–2493.
49. Alfred Galichon, Optimal transport methods in economics, Princeton University Press,
2018.
50. Alfred Galichon and Bernard Salanié, Matching with trade-offs: Revealed preferences
over competing characteristics, Tech. report, CEPR Discussion Papers, 2010.
51. Wilfrid Gangbo and Robert J McCann, The geometry of optimal transportation, Acta
Mathematica 177 (1996), no. 2, 113–161.
52. Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach, Stochastic optimiza-
tion for large-scale optimal transport, Advances in neural information processing sys-
tems, 2016, pp. 3440–3448.
53. Nicola Gigli, On hölder continuity-in-time of the optimal transport map towards mea-
sures along a curve, Proceedings of the Edinburgh Mathematical Society 54 (2011),
no. 2, 401–409.
54. A.V. Goldberg, Efficient graph algorithms for sequential and parallel computers, Ph.D.
thesis, Massachussetts Institute of Technology, 1987.
55. Xianfeng Gu, Feng Luo, Jian Sun, and Shing-Tung Yau, Variational principles for
minkowski type problems, discrete optimal transport, and discrete monge–ampère equa-
tions, Asian Journal of Mathematics 20 (2016), no. 2, 383–398.
56. Nestor Guillen, A primer on generated jacobian equations: Geometry, optics, econom-
ics, Notices of the American Mathematical Society 66 (2019), no. 9.
57. Cristian E Gutiérrez and Haim Brezis, The Monge–Ampère equation, vol. 44, Springer,
2001.
58. Valentin Hartmann and Dominic Schuhmacher, Semi-discrete optimal transport – the
case p = 1, arXiv preprint arXiv:1706.07650, 2017.

59. Morgane Henry, Emmanuel Maitre, and Valérie Perrier, Optimal transport using
helmholtz-hodge decomposition and first-order primal-dual algorithms, 2015 IEEE In-
ternational Conference on Image Processing (ICIP), IEEE, 2015, pp. 4748–4752.
60. Romain Hug, Analyse mathématique et convergence d’un algorithme pour le trans-
port optimal dynamique: cas des plans de transports non réguliers, ou soumis à des
contraintes, Thèse de doctorat de l’Université Grenoble-Alpes, 2016.
61. Jan-Christian Hütter and Philippe Rigollet, Minimax rates of estimation for smooth
optimal transport maps, arXiv preprint arXiv:1905.05828, 2019.
62. Richard Jordan, David Kinderlehrer, and Felix Otto, The variational formulation of
the fokker–planck equation, SIAM journal on mathematical analysis 29 (1998), no. 1,
1–17.
63. Michael Kerber, Dmitriy Morozov, and Arnur Nigmetov, Geometry helps to compare
persistence diagrams, Journal of Experimental Algorithmics (JEA) 22 (2017), 1–4.
64. Jun Kitagawa, An iterative scheme for solving the optimal transportation problem,
Calculus of Variations and Partial Differential Equations 51 (2014), no. 1-2, 243–263.
65. Jun Kitagawa, Quentin Mérigot, and Boris Thibert, Convergence of a newton al-
gorithm for semi-discrete optimal transport, Journal of the European Mathematical
Society (2019), OnlineFirst.
66. Stanislav Kondratyev, Léonard Monsaingeon, Dmitry Vorotnikov, et al., A new opti-
mal transport distance on the space of finite radon measures, Advances in Differential
Equations 21 (2016), no. 11/12, 1117–1164.
67. Hugo Lavenant, Unconditional convergence for discretizations of dynamical optimal
transport, arXiv preprint arXiv:1909.08790, 2019.
68. Bruno Lévy, A numerical algorithm for l2 semi-discrete optimal transport in 3d,
ESAIM: Mathematical Modelling and Numerical Analysis 49 (2015), no. 6, 1693–
1715.
69. Damiano Lombardi and Emmanuel Maitre, Eulerian models and algorithms for un-
balanced optimal transport, ESAIM: Mathematical Modelling and Numerical Analysis
49 (2015), no. 6, 1717–1744.
70. Quentin Mérigot, A multiscale approach to optimal transport, Computer Graphics
Forum 30 (2011), no. 5, 1583–1592.
71. Quentin Mérigot, Alex Delalande, and Frédéric Chazal, Quantitative stability of op-
timal transport maps and linearization of the 2-wasserstein space, arXiv preprint
arXiv:1910.05954 (2019).
72. Quentin Mérigot, Jocelyn Meyron, and Boris Thibert, An algorithm for optimal trans-
port between a simplex soup and a point cloud, SIAM Journal on Imaging Sciences 11
(2018), no. 2, 1363–1389.
73. Quentin Mérigot and Jean-Marie Mirebeau, Minimal geodesics along volume-
preserving maps, through semidiscrete optimal transport, SIAM Journal on Numerical
Analysis 54 (2016), no. 6, 3465–3492.
74. Jocelyn Meyron, Quentin Mérigot, and Boris Thibert, Light in power: a general and
parameter-free algorithm for caustic design, ACM Transactions on Graphics (TOG)
37 (2019), no. 6, 224.
75. Jean-Marie Mirebeau, Discretization of the 3d Monge-Ampère operator, between wide
stencils and power diagrams, ESAIM: Mathematical Modelling and Numerical Anal-
ysis 49 (2015), no. 5, 1511–1523.
76. Gaspard Monge, Mémoire sur la théorie des déblais et des remblais, 1781.
77. Michael Neilan, Abner J Salgado, and Wujun Zhang, The Monge–Ampère equation,
arXiv preprint arXiv:1901.05108, 2019.
78. Adam M Oberman and Yuanlong Ruan, An efficient linear programming method for
optimal transportation, arXiv preprint arXiv:1509.03668, 2015.
79. VI Oliker and LD Prussner, On the numerical solution of the equation and its dis-
cretizations, i, Numerische Mathematik 54 (1989), no. 3, 271–293.
80. Vladimir Oliker, Mathematical aspects of design of beam shaping surfaces in geomet-
rical optics, Trends in Nonlinear Analysis, Springer, 2003, pp. 193–224.
81. Nicolas Papadakis, Gabriel Peyré, and Edouard Oudet, Optimal transport with prox-
imal splitting, SIAM Journal on Imaging Sciences 7 (2014), no. 1, 212–238.

82. Brendan Pass, Multi-marginal optimal transport: theory and applications, ESAIM:
Mathematical Modelling and Numerical Analysis 49 (2015), no. 6, 1771–1790.
83. Gabriel Peyré and Marco Cuturi, Computational optimal transport, Foundations and
Trends in Machine Learning 11 (2019), no. 5-6, 355–607.
84. Svetlozar T Rachev and Ludger Rüschendorf, Mass transportation problems: Volume
i: Theory, vol. 1, Springer Science & Business Media, 1998.
85. , Mass transportation problems: Applications, Springer Science & Business
Media, 2006.
86. R Tyrrell Rockafellar, Convex analysis, vol. 28, Princeton university press, 1970.
87. Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas, The earth mover’s distance as a
metric for image retrieval, International journal of computer vision 40 (2000), no. 2,
99–121.
88. Filippo Santambrogio, Optimal transport for applied mathematicians, Springer, 2015.
89. Bernhard Schmitzer, A sparse multiscale algorithm for dense optimal transport, Jour-
nal of Mathematical Imaging and Vision 56 (2016), no. 2, 238–259.
90. , Stabilized sparse scaling algorithms for entropy regularized transport problems,
SIAM Journal on Scientific Computing 41 (2019), no. 3, A1443–A1481.
91. Richard Sinkhorn, A relationship between arbitrary positive matrices and doubly sto-
chastic matrices, The annals of mathematical statistics 35 (1964), no. 2, 876–879.
92. Richard Sinkhorn and Paul Knopp, Concerning nonnegative matrices and doubly sto-
chastic matrices, Pacific Journal of Mathematics 21 (1967), no. 2, 343–348.
93. Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher,
Andy Nguyen, Tao Du, and Leonidas Guibas, Convolutional Wasserstein distances:
Efficient optimal transportation on geometric domains, ACM Transactions on Graph-
ics (TOG) 34 (2015), no. 4, 66.
94. Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher, Earth
mover’s distances on discrete surfaces, ACM Transactions on Graphics (TOG) 33
(2014), no. 4, 67.
95. Neil S Trudinger, On the local theory of prescribed jacobian equations, Discrete &
Continuous Dynamical Systems-A 34 (2014), no. 4, 1663–1681.
96. François-Xavier Vialard, An elementary introduction to entropic regularization and
proximal methods for numerical optimal transport, Lecture, May 2019.
97. Cédric Villani, Topics in optimal transportation, no. 58, American Mathematical Soc.,
2003.
98. , Optimal transport: old and new, vol. 338, Springer Science & Business Media,
2008.
99. Xu-Jia Wang, On the design of a reflector antenna ii, Calculus of Variations and
Partial Differential Equations 20 (2004), no. 3, 329–341.
