
Efficient Optimal Transport Algorithm by Accelerated Gradient Descent

Dongsheng An¹  Na Lei²  Xianfeng Gu¹

¹Department of Computer Science, Stony Brook University, NY, USA. ²DUT-RU ISE, Dalian University of Technology, Liaoning, China. Correspondence to: Dongsheng An <doan@cs.stonybrook.edu>, Na Lei <nalei@dlut.edu.cn>, Xianfeng Gu <gu@cs.stonybrook.edu>.

arXiv:2104.05802v2 [cs.LG] 19 Jul 2021

Abstract

Optimal transport (OT) plays an essential role in various areas like machine learning and deep learning. However, computing a discrete optimal transport plan for large-scale problems with adequate accuracy and efficiency is still highly challenging. Recently, methods based on the Sinkhorn algorithm add an entropy regularizer to the primal problem and obtain a trade-off between efficiency and accuracy. In this paper, we propose a novel algorithm to further improve the efficiency and accuracy based on Nesterov's smoothing technique. Basically, the non-smooth c-transform of the Kantorovich potential is approximated by the smooth Log-Sum-Exp function, which finally smooths the original non-smooth Kantorovich dual functional. The smooth Kantorovich functional can be optimized by the fast proximal gradient algorithm (FISTA) efficiently. Theoretically, the computational complexity of the proposed method is given by O(n^{5/2} √(log n)/ε), which is lower than the current estimation for the Sinkhorn algorithm. Empirically, compared with the Sinkhorn algorithm, our experimental results demonstrate that the proposed method achieves faster convergence and better accuracy with the same parameter.

1. Introduction

Optimal transport (OT) is a powerful tool to compute the Wasserstein distance between probability measures, which are widely used to model various natural and social phenomena, including economics (Galichon, 2016), optics (Glimm & Oliker, 2003), biology (Schiebinger et al., 2019), physics (Jordan et al., 1998) and so on. Recently, optimal transport has been successfully applied in the areas of machine learning and statistics, such as parameter estimation in Bayesian nonparametric models (Nguyen, 2013), computer vision (Arjovsky et al., 2017; Courty et al., 2017; An et al., 2020a;b), natural language processing (Kusner et al., 2015; Yurochkin et al., 2019) and so on. In these areas, the complex probability measures are approximated by summations of Dirac measures supported on the samples. To obtain the Wasserstein distance between the empirical distributions, we have to solve the discrete OT problem.

For the discrete optimal transport problem, where both the source and target measures are discrete, the Kantorovich functional becomes a convex function defined on a convex domain. Due to the lack of smoothness, the conventional gradient descent method cannot be applied directly. Instead, it can be optimized with the sub-differential method (Nesterov, 2005), where the gradient is replaced by the sub-differential. In order to achieve an approximation error less than ε, the sub-differential method requires O(1/ε²) iterations. Recently, several approximation methods have been proposed to improve the computational efficiency. In these methods (Cuturi, 2013; Benamou et al., 2015; Altschuler et al., 2017), a strongly convex entropy function is added to the primal Kantorovich problem, and thus the regularized problem can be efficiently solved by the Sinkhorn algorithm. More detailed analysis shows that the computational complexity of the Sinkhorn algorithm is Õ(n²/ε²) (Dvurechensky et al., 2018) by setting λ = ε/(4 log n). Also, a series of primal-dual algorithms have been proposed, including the APDAGD algorithm (Dvurechensky et al., 2018) with computational complexity Õ(n^{2.5}/ε), the APDAMD algorithm (Lin et al., 2019) with Õ(n² √r/ε), where r is a complex constant of the Bregman divergence, and the APDRCD algorithm (Guo et al., 2020) with Õ(n^{2.5}/ε). But all of the three methods need to build a matrix with space complexity O(n³), which makes them hard to use when n is large.

Our Method. In this work, instead of starting from the primal Kantorovich problem like the Sinkhorn based methods, we directly deal with the dual Kantorovich problem. The key idea is to approximate the original non-smooth c-transform of the Kantorovich potential by Nesterov's smoothing idea. Specifically, we approximate the max function by the Log-Sum-Exp function, which has also been used in (Schmitzer, 2019; Peyré & Cuturi, 2018), such that the original non-smooth Kantorovich functional is converted into an unconstrained (n − 1)-dimensional smooth convex energy.

By using the Fast Proximal Gradient Method (FISTA) (Beck & Teboulle, 2008), we can quickly optimize the smoothed energy to get a precise estimate of the OT cost. In theory, the method can achieve the approximation error ε with space complexity O(n²) and computational complexity O(n^{2.5} √(log n)/ε).

The contributions of the proposed method include: (i) We convert the dual Kantorovich problem into an unconstrained convex optimization problem by approximating the non-smooth c-transform of the Kantorovich potential with Nesterov's smoothing. (ii) The smoothed Kantorovich functional can be efficiently solved by the FISTA algorithm with computational complexity Õ(n^{2.5}/√ε). At the same time, the computational complexity for the Kantorovich functional itself is given by Õ(n^{2.5}/ε). (iii) The experiments demonstrate that, compared with the Sinkhorn algorithm, the proposed method achieves faster convergence, better accuracy and higher stability with the same parameter λ.

2. Optimal Transport Theory

In this section, we introduce some basic concepts and theorems in the classical OT theory, focusing on Kantorovich's approach and its generalization to the discrete setting via the c-transform. The details can be found in Villani's book (Villani, 2008).

Problem 1 (Kantorovich Problem). Suppose X ⊂ R^d, Y ⊂ R^d are two subsets of the Euclidean space R^d, and µ = Σ_{i=1}^m µ_i δ(x − x_i), ν = Σ_{j=1}^n ν_j δ(y − y_j) are two discrete probability measures defined on X and Y with equal total measure, Σ_{i=1}^m µ_i = Σ_{j=1}^n ν_j. Then the Kantorovich problem aims to find the optimal transport plan P = (p_ij) that minimizes the total transport cost:

    (KP)    M_c(µ, ν) = min_{P ∈ π(µ,ν)} Σ_{i=1}^m Σ_{j=1}^n p_ij c_ij,    (1)

where c_ij = c(x_i, y_j) is the cost of transporting one unit of mass from x_i to y_j, and π(µ, ν) = {P | Σ_{i=1}^m p_ij = ν_j, Σ_{j=1}^n p_ij = µ_i, p_ij ≥ 0}.
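Problem 1 is a finite-dimensional linear program, and for moderate m, n it can be solved exactly; the ground-truth OT cost in Section 4 is obtained this way by linear programming. The snippet below is a minimal illustrative sketch of Eqn. (1) in NumPy/SciPy, not the authors' MATLAB code; all names are ours.

import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(mu, nu, C):
    # Solve (KP) in Eqn. (1) exactly: minimize <P, C> subject to P in pi(mu, nu).
    m, n = C.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0      # sum_j p_ij = mu_i
    for j in range(n):
        A_eq[m + j, j::n] = 1.0               # sum_i p_ij = nu_j
    b_eq = np.concatenate([mu, nu])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    P = res.x.reshape(m, n)
    return P, float((P * C).sum())

The dense constraint matrix has m·n columns, so this is only practical for small instances; it serves purely as a reference solution.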
Problem 2 (Dual Kantorovich). Given two discrete probability measures µ = Σ_{i=1}^m µ_i δ(x − x_i), ν = Σ_{j=1}^n ν_j δ(y − y_j) supported on X and Y, and the transport cost function c : X × Y → R, the Kantorovich problem is equivalent to maximizing the following Kantorovich functional:

    (DP1)    M_c(µ, ν) = max_{φ,ψ} { −Σ_{i=1}^m φ_i µ_i + Σ_{j=1}^n ψ_j ν_j },    (2)

where φ ∈ L¹(X, µ) and ψ ∈ L¹(Y, ν) are called the Kantorovich potentials, and −φ_i + ψ_j ≤ c_ij. The above problem can be reformulated as the following minimization form with the same constraints:

    (DP2)    M_c(µ, ν) = − min_{φ,ψ} { Σ_{i=1}^m φ_i µ_i − Σ_{j=1}^n ψ_j ν_j }.    (3)

Definition 3 (c-transform). Let φ ∈ L¹(X, µ) and ψ ∈ L¹(Y, ν); we define

    φ_i = ψ^c(x_i) = max_j {ψ_j − c_ij}.    (4)

With the c-transform, Eqn. (3) is equivalent to solving the following unconstrained convex optimization problem:

    M_c(µ, ν) = − min_ψ E(ψ) = − min_ψ { Σ_{i=1}^m µ_i ψ^c(x_i) − Σ_{j=1}^n ν_j ψ_j }.    (5)

Suppose ψ* is the solution to the problem in Eqn. (5); then ψ* + k1 is also an optimal solution for Eqn. (5). In order to make the solution unique, we add the constraint ψ ∈ H using the indicator function I_H, where H = {ψ | Σ_{j=1}^n ψ_j = 0}, and modify the Kantorovich functional E(ψ) in Eqn. (5):

    Ẽ(ψ) = E(ψ) + I_H(ψ),    I_H(ψ) = { 0 if ψ ∈ H;  ∞ if ψ ∉ H }.    (6)

Then the problem in Eqn. (5) is equivalent to solving

    M_c(µ, ν) = − min_ψ Ẽ(ψ) = − min_ψ { Σ_{i=1}^m µ_i ψ^c(x_i) − Σ_{j=1}^n ν_j ψ_j + I_H(ψ) },    (7)

which is essentially an (n − 1)-dimensional unconstrained convex problem. According to the definition of the c-transform in Eqn. (4), ψ^c is non-smooth w.r.t. ψ.
torovich problem aims to find the optimal transport plan
P = (pij ) that minimizes the total transport cost:
m X
n
3. Nesterov’s Smoothing of Kantorovich
(KP ) Mc (µ, ν) = min
X
pij cij . (1) functional
P ∈π(µ,ν)
i=1 j=1
In this section, we smooth the non-smooth discrete Kan-
where cij = c(xi , yj ) is the cost that transports
Pm one unit torovich functional E(ψ) by approximating ψ c (x) with the
mass from
Pn x i to y j , and π(µ, ν) = {P | i=1 pij = Log-Sum-Exp function to get the smooth Kantorovich func-
νj , j=1 pij = µi , pij ≥ 0}. tional Eλ (ψ), following Nesterov’s original strategy (Nes-
Problem 2 (Dual Kantorovich). Given two discrete proba- terov, 2005), which has also been applied in the OT field
Pm Pn (Peyré & Cuturi, 2018; Schmitzer, 2019). Then through the
bility measures µ = i=1 µi δ(x − xi ), ν = j=1 νj δ(y −
yj ) supported on X and Y , and the transport cost function FISTA algorithm (Beck & Teboulle, 2008), we can easily
c : X × Y → R, the Kantorovich problem is equivalent to induce √that the computation complexity of our algorithm is
maximizing the following Kantorovich functional: O(n2.5 log n/ε), with Ẽ(ψ t ) − Ẽ(ψ ∗ ) ≤ . By abuse of
notation, in the following we call both E(ψ) and Ẽ(ψ) the
  Kantorovich functional, both Eλ (ψ) and Ẽλ (ψ) the smooth
m n
 X X  Kantorovich functional.
(DP 1) Mc (µ, ν) = max − φi µi + ψj νj (2)
φ,ψ 
i=1

j=1
Definition 4 ((α, β)-smoothable). A convex function f is
called (α, β)-smoothable if for any λ > 0, ∃ a convex
function fλ s.t.
where φ ∈ L1 (X, µ) and ψ ∈ L1 (Y, ν) are called the
fλ (x) ≤ f (x) ≤ fλ (x) + βλ
Kantorovich potentials, and −φi + ψj ≤ cij . The above
α
problem can be reformulated as the following minimization fλ (y) ≤ fλ (x) + h∇fλ (x), y − xi + (y − x)T ∇2 fλ (x)(y − x)


Here f_λ is called a 1/λ-smooth approximation of f with parameters (α, β).

In the above definition, the parameter λ defines a tradeoff between the approximation accuracy and the smoothness: the smaller the λ, the better the approximation and the less the smoothness.

Lemma 5 (Nesterov's Smoothing). Given f : R^n → R, f(x) = max{x_j : j = 1, . . . , n}, for any λ > 0 we have its 1/λ-smooth approximation with parameters (1, log n):

    f_λ(x) = λ log( Σ_{j=1}^n e^{x_j/λ} ) − λ log n.    (8)

Recalling the definition of the c-transform of the Kantorovich potential in Eqn. (4), we obtain the Nesterov smoothing of ψ^c by applying Eqn. (8):

    ψ_λ^c(x_i) = λ log( Σ_{j=1}^n e^{(ψ_j − c_ij)/λ} ) − λ log n.    (9)

We use ψ_λ^c to replace ψ^c in Eqn. (7) to approximate the Kantorovich functional. Then the Nesterov smoothing of the Kantorovich functional becomes

    E_λ(ψ) = λ Σ_{i=1}^m µ_i log( Σ_{j=1}^n e^{(ψ_j − c_ij)/λ} ) − Σ_{j=1}^n ν_j ψ_j − λ log n,    (10)

and its gradient is given by

    ∂E_λ(ψ)/∂ψ_j = Σ_{i=1}^m µ_i e^{(ψ_j − c_ij)/λ} / Σ_{k=1}^n e^{(ψ_k − c_ik)/λ} − ν_j,  ∀ j ∈ [n].    (11)
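A short illustrative NumPy sketch (ours) evaluates Eqns. (10) and (11) together; the row-max subtraction is the standard log-sum-exp stabilization, an implementation detail assumed here rather than stated in the paper.

import numpy as np

def smoothed_energy_and_grad(psi, mu, nu, C, lam):
    # E_lambda(psi) of Eqn. (10) and its gradient of Eqn. (11).
    m, n = C.shape
    Z = (psi[None, :] - C) / lam                   # (m, n) matrix of (psi_j - c_ij)/lam
    Zmax = Z.max(axis=1, keepdims=True)
    W = np.exp(Z - Zmax)
    S = W.sum(axis=1, keepdims=True)
    logsumexp = lam * (Zmax[:, 0] + np.log(S[:, 0]))
    energy = float(mu @ logsumexp - nu @ psi - lam * np.log(n))
    grad = (mu[:, None] * W / S).sum(axis=0) - nu  # Eqn. (11)
    return energy, grad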
Furthermore, we can directly compute the Hessian matrix of E_λ(ψ). Let K_ij = e^{−c_ij/λ} and v_j = e^{ψ_j/λ}, and set E_λ^i := λ log Σ_{j=1}^n K_ij v_j, ∀ i ∈ [m]. Direct computation gives the following Hessian matrix:

    ∇²E_λ = (1/λ) Σ_{i=1}^m µ_i [ (1/(1ᵀV_i)) Λ_i − (1/(1ᵀV_i)²) V_i V_iᵀ ],    (12)

where V_i = (K_i1 v_1, K_i2 v_2, . . . , K_in v_n)ᵀ and Λ_i = diag(K_i1 v_1, K_i2 v_2, . . . , K_in v_n). By the Hessian matrix, we can show that E_λ is a smooth approximation of E.

Lemma 6. E_λ(ψ) is a 1/λ-smooth approximation of E(ψ) with parameters (1, log n).
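As an informal numerical sanity check of Eqn. (12) and Lemma 6 (our own sketch, with small illustrative sizes), the assembled Hessian can be compared against centered finite differences of the gradient in Eqn. (11), and its largest eigenvalue against the bound 1/λ.

import numpy as np

def hessian_eq12(psi, mu, C, lam):
    # Assemble the Hessian of E_lambda from Eqn. (12).
    m, n = C.shape
    K = np.exp(-C / lam)                    # K_ij = e^{-c_ij/lam}
    v = np.exp(psi / lam)                   # v_j  = e^{psi_j/lam}
    H = np.zeros((n, n))
    for i in range(m):
        Vi = K[i] * v                       # V_i
        s = Vi.sum()                        # 1^T V_i
        H += mu[i] * (np.diag(Vi) / s - np.outer(Vi, Vi) / s**2)
    return H / lam

def grad_eq11(psi, mu, nu, C, lam):
    Z = (psi[None, :] - C) / lam
    W = np.exp(Z - Z.max(axis=1, keepdims=True))
    P = mu[:, None] * W / W.sum(axis=1, keepdims=True)
    return P.sum(axis=0) - nu

rng = np.random.default_rng(0)
m, n, lam = 6, 4, 0.5                       # illustrative sizes only
C = rng.random((m, n))
mu = np.full(m, 1.0 / m); nu = np.full(n, 1.0 / n)
psi = rng.standard_normal(n); psi -= psi.mean()

eps = 1e-6
H_fd = np.column_stack([
    (grad_eq11(psi + eps * np.eye(n)[j], mu, nu, C, lam)
     - grad_eq11(psi - eps * np.eye(n)[j], mu, nu, C, lam)) / (2 * eps)
    for j in range(n)])
H = hessian_eq12(psi, mu, C, lam)
assert np.allclose(H, H_fd, atol=1e-4)
assert np.linalg.eigvalsh(H).max() <= 1.0 / lam + 1e-8   # consistent with Lemma 6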
Lemma 7. Suppose E_λ(ψ) is the 1/λ-smooth approximation of E(ψ) with parameters (1, log n) and ψ_λ* is the optimizer of E_λ(ψ). Then the approximate OT plan is unique and given by

    (P_λ*)_ij = µ_i e^{((ψ_λ*)_j − c_ij)/λ} / Σ_{k=1}^n e^{((ψ_λ*)_k − c_ik)/λ} = µ_i K_ij v_j* / (K_i v*),    (13)

where K_i is the i-th row of K and v* = e^{ψ_λ*/λ}.
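In code, Eqn. (13) is a row-wise softmax of (ψ_j − c_ij)/λ scaled by µ_i; the sketch below is illustrative (names are ours) and uses the same stabilization as above.

import numpy as np

def transport_plan_eq13(psi, mu, C, lam):
    # (P_lambda)_ij = mu_i * exp((psi_j - c_ij)/lam) / sum_k exp((psi_k - c_ik)/lam)
    Z = (psi[None, :] - C) / lam
    W = np.exp(Z - Z.max(axis=1, keepdims=True))    # stabilized row-wise softmax
    return mu[:, None] * W / W.sum(axis=1, keepdims=True)

# Rows sum to mu by construction; by Lemma 7 the column-sum residual
# ||P^T 1 - nu||_1 vanishes at the optimizer psi_lambda^*, and this residual
# is exactly the second evaluation metric used in Section 4.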

Similar to the discrete Kantorovich functional in Eqn. (5), the optimizer of the smooth Kantorovich functional in Eqn. (10) is also not unique: given an optimizer ψ_λ*, ψ_λ* + k1 with k ∈ R is also an optimizer. We can eliminate the ambiguity by adding the indicator function as in Eqn. (6), Ẽ_λ(ψ) = E_λ(ψ) + I_H(ψ):

    Ẽ_λ(ψ) = λ Σ_{i=1}^m µ_i log( Σ_{j=1}^n e^{(ψ_j − c_ij)/λ} ) − Σ_{j=1}^n ν_j ψ_j − λ log n + I_H(ψ).    (14)

This energy can be optimized effectively through accelerated proximal algorithms (Nesterov, 2005), such as the Fast Proximal Gradient Method (FISTA) (Beck & Teboulle, 2008). The FISTA iterations are as follows (a code sketch of these iterations is given at the end of this section):

    z^{t+1} = Π_{η_t I_H}( ψ^t − η_t ∇E_λ(ψ^t) ),
    ψ^{t+1} = z^{t+1} + ((θ_t − 1)/θ_{t+1}) (z^{t+1} − z^t),    (15)

with initial conditions ψ⁰ = z⁰ = 0, θ₀ = 1, η_t = λ, and θ_{t+1} = (1 + √(1 + 4θ_t²))/2. Here Π_{η_t I_H}(z) = z − (1/n) Σ_{j=1}^n z_j is the projection of z onto H (the proximal operator of I_H) (Parikh & Boyd, 2014). Similar to the Sinkhorn algorithm, this algorithm can be parallelized, since all the operations are row based.

Theorem 8. Given the cost matrix C = (c_ij), the source measure µ and the target measure ν with Σ_{i=1}^m µ_i = Σ_{j=1}^n ν_j = 1, let ψ* be the optimizer of the discrete dual Kantorovich functional Ẽ(ψ) and ψ_λ* the optimizer of the smooth Kantorovich functional Ẽ_λ(ψ). Then the approximation error is

    |Ẽ(ψ*) − Ẽ_λ(ψ_λ*)| ≤ 2λ log n.

Corollary 9. Suppose λ is fixed and ψ⁰ = 0. Then for any t ≥ √(2‖ψ_λ*‖²/(λε)), we have

    Ẽ_λ(ψ^t) − Ẽ_λ(ψ_λ*) ≤ ε,

where ψ_λ* is the optimizer of Ẽ_λ(ψ).

Theorem 10. If λ = ε/(2 log n), then for any t ≥ √(8‖C̄‖² n log n)/ε with C̄ = C_max − λ log ν_min, we have

    Ẽ(ψ^t) − Ẽ(ψ*) < ε,

where ψ^t is the iterate of Ẽ_λ(ψ) after t steps of the iterations in Eqn. (15), and ψ* is the optimizer of Ẽ(ψ). Then the total computational complexity is O(n^{2.5} √(log n)/ε).

Since Ẽ(ψ_λ*) > Ẽ(ψ*), we have

    0 < Ẽ(ψ_λ*) − Ẽ(ψ*) < 2λ log n.    (16)

This shows that Ẽ(ψ_λ*) converges at least linearly to Ẽ(ψ*) with respect to λ.
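The following is a minimal NumPy sketch of the iterations in Eqn. (15) with η_t = λ, combining the gradient of Eqn. (11) with the projection onto H. It is an illustration written for this text, not the authors' GPU/MATLAB implementation, and all names are ours.

import numpy as np

def grad_smooth_dual(psi, mu, nu, C, lam):
    # Gradient of E_lambda, Eqn. (11), with the usual log-sum-exp stabilization.
    Z = (psi[None, :] - C) / lam
    W = np.exp(Z - Z.max(axis=1, keepdims=True))
    P = mu[:, None] * W / W.sum(axis=1, keepdims=True)
    return P.sum(axis=0) - nu

def fista_smoothed_ot(mu, nu, C, lam, iters=500):
    # FISTA iterations of Eqn. (15): eta_t = lam, theta_0 = 1, psi^0 = z^0 = 0.
    n = C.shape[1]
    project_H = lambda z: z - z.mean()       # proximal map of I_H: subtract the mean
    psi = np.zeros(n)
    z = np.zeros(n)
    theta = 1.0
    for _ in range(iters):
        z_next = project_H(psi - lam * grad_smooth_dual(psi, mu, nu, C, lam))
        theta_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2))
        psi = z_next + (theta - 1.0) / theta_next * (z_next - z)
        z, theta = z_next, theta_next
    return z                                  # approximation of psi_lambda^*

The per-iteration cost is dominated by the m × n softmax, which matches the O(n²) operation count used in the complexity analysis.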

Figure 1. Comparison with the Sinkhorn algorithm (Cuturi, 2013) under different costs.
4. Experiments
In this section, we investigate the performance of the proposed algorithm by comparing with the Sinkhorn algorithm (Cuturi, 2013), whose computational complexity is Õ(n²/ε²) (Dvurechensky et al., 2018). All of the codes are written using MATLAB with GPU acceleration. The experiments are conducted on a Windows laptop with an Intel Core i7-7700HQ CPU, 16 GB memory and an NVIDIA GTX 1060Ti GPU.

Cost Matrix. We test the performance of the algorithm with different cost matrices. Specifically, we set µ = Σ_{i=1}^m µ_i δ(x − x_i), ν = Σ_{j=1}^n ν_j δ(y − y_j). After the settings of the µ_i's and ν_j's, they are normalized by µ_i = µ_i / Σ_{k=1}^m µ_k and ν_j = ν_j / Σ_{k=1}^n ν_k. To build the cost matrix, we use the squared Euclidean distance and the spherical distance (a small code sketch follows the list below).

• For the squared Euclidean distance (SED) experiment, like (Altschuler et al., 2017), we randomly choose one pair of images from the MNIST dataset (LeCun & Cortes, 2010), and then add negligible noise 0.01 to each background pixel with intensity 0. µ_i and x_i (ν_j and y_j) are set to be the value and the coordinate of each pixel in the source (target) image. The squared Euclidean distance between x_i and y_j is then given by c(x_i, y_j) = ‖x_i − y_j‖².

• For the spherical distance (SD) experiment, we set m = n = 500. Both the µ_i's and ν_j's are randomly generated from the uniform distribution Uni([0, 1]). The x_i's are randomly sampled from the Gaussian distribution N(3·1_d, I_d) and the y_j's are randomly sampled from the uniform distribution Uni([0, 1]^d). Then we normalize x_i and y_j by x_i = x_i/‖x_i‖₂ and y_j = y_j/‖y_j‖₂. As a result, both the x_i's and y_j's are located on the sphere. The spherical distance is given by c(x_i, y_j) = arccos(⟨x_i, y_j⟩).
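The two cost constructions can be reproduced in a few lines. The sketch below covers the spherical-distance setup with an illustrative dimension d = 3 (the paper does not state d) and synthetic points in place of the MNIST pair; it is our illustration, not the authors' MATLAB code.

import numpy as np

rng = np.random.default_rng(0)

def sed_cost(X, Y):
    # Squared Euclidean cost: c(x_i, y_j) = ||x_i - y_j||^2
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)

def sd_cost(X, Y):
    # Spherical cost for unit vectors: c(x_i, y_j) = arccos(<x_i, y_j>)
    return np.arccos(np.clip(X @ Y.T, -1.0, 1.0))

m = n = 500
d = 3                                            # illustrative choice only
mu = rng.uniform(0.0, 1.0, m); mu /= mu.sum()    # normalized as described above
nu = rng.uniform(0.0, 1.0, n); nu /= nu.sum()
X = rng.normal(3.0, 1.0, (m, d))                 # samples from N(3*1_d, I_d)
Y = rng.uniform(0.0, 1.0, (n, d))                # samples from Uni([0,1]^d)
X /= np.linalg.norm(X, axis=1, keepdims=True)    # project onto the unit sphere
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
C = sd_cost(X, Y)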
Evaluation Metrics. We mainly use two metrics to evaluate the proposed method. The first is the total transport cost, which is defined by Eqn. (5) and given by −E(ψ); the second is the L1 distance from the computed transport plan P_λ to the admissible distribution space π(µ, ν), defined as D(P_λ) = ‖P_λ 1 − µ‖₁ + ‖P_λᵀ 1 − ν‖₁.
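Both metrics are straightforward to compute from a dual potential ψ and a plan P; a short illustrative sketch (ours):

import numpy as np

def transport_cost(psi, mu, nu, C):
    # -E(psi) from Eqn. (5): the dual estimate of the OT cost.
    return float(nu @ psi - mu @ np.max(psi[None, :] - C, axis=1))

def marginal_deviation(P, mu, nu):
    # D(P) = ||P 1 - mu||_1 + ||P^T 1 - nu||_1
    return float(np.abs(P.sum(axis=1) - mu).sum() + np.abs(P.sum(axis=0) - nu).sum())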
Then we compare with the Sinkhorn algorithm (Cuturi, 2013) w.r.t. both the convergence rate and the approximate accuracy. We manually set λ = (max(C) − min(C))/T with T = 500 to get a good estimate of the OT cost, and adjust the step length η to get the best convergence of the proposed method. For a fair comparison, we use the same λ for the Sinkhorn algorithm. We summarize the results in Fig. 1, where the green curves represent the ground truth computed by linear programming, the blue curves are for the Sinkhorn algorithm, and the red curves give the results of our method.

From the figure, we can observe that in both experiments, our method achieves faster convergence than the Sinkhorn algorithm. Also, −E(ψ_λ*) gives a better approximation of the OT cost than ⟨P_λ, C⟩ used in the Sinkhorn algorithm with the same small λ, especially for the squared Euclidean distance cost. Basically, to achieve ε precision, ⟨P_λ*, C⟩ needs λ = ε/(4 log n) (Dvurechensky et al., 2018), which is smaller than our requirement λ = ε/(2 log n).

We also report the iterations and the corresponding running time with T = 700 in Tab. 1. The stop condition is set to be |E(ψ^{t+1}) − E(ψ^t)|/|E(ψ^t)| < 10^{-3}. In each iteration, Sinkhorn requires 2n² operations, and the proposed method requires 3n² operations. From Tab. 1 we can see that the proposed method needs fewer iterations and less time to converge.

Table 1. The running iterations and time comparison between the proposed method and Sinkhorn (Cuturi, 2013) with T = 700.

            Sinkhorn                  Ours
            Iterations   Time(s)      Iterations   Time(s)
    SED     209          0.0431       29           0.0197
    SD      297          0.0564       22           0.0142

5. Conclusion

In this paper, we propose a novel algorithm to improve the accuracy for solving the discrete OT problem based on Nesterov's smoothing technique. The c-transform of the Kantorovich potential is approximated by the smooth Log-Sum-Exp function, and the smoothed Kantorovich functional can be solved efficiently by the fast proximal gradient algorithm (FISTA). Theoretically, the computational complexity of the proposed method is given by O(n^{2.5} √(log n)/ε), which is lower than the current estimation for the Sinkhorn method. Empirically, our experimental results demonstrate that the proposed method achieves faster convergence and better accuracy than the Sinkhorn algorithm.

References

Altschuler, J., Niles-Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 30, 2017.

An, D., Guo, Y., Lei, N., Luo, Z., Yau, S.-T., and Gu, X. AE-OT: A new generative model based on extended semi-discrete optimal transport. In International Conference on Learning Representations, 2020a.

An, D., Guo, Y., Zhang, M., Qi, X., Lei, N., and Gu, X. AE-OT-GAN: Training GANs from data specific latent distribution. In European Conference on Computer Vision (ECCV), pp. 548-564, 2020b.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In ICML, pp. 214-223, 2017.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2008.

Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., and Peyré, G. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 2015.

Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853-1865, 2017.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transportation distances. In International Conference on Neural Information Processing Systems, 2013.

Dvurechensky, P., Gasnikov, A., and Kroshnin, A. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In Proceedings of the 35th International Conference on Machine Learning. PMLR, 2018.

Galichon, A. Optimal Transport Methods in Economics. Princeton University Press, 2016.

Glimm, T. and Oliker, V. Optical design of single reflector systems and the Monge-Kantorovich mass transfer problem. Journal of Mathematical Sciences, 117(3):4096-4108, Sep 2003.

Guo, W., Ho, N., and Jordan, M. Fast algorithms for computational optimal transport and Wasserstein barycenter. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

Horn, R. A. and Johnson, C. R. Topics in Matrix Analysis. Cambridge University Press, 1991.

Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1-17, 1998.

Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, pp. 957-966, 2015.

LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

Lin, T., Ho, N., and Jordan, M. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In International Conference on Machine Learning, 2019.

Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical Programming, 2005.

Nguyen, X. Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41, 2013.

Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in Optimization, 2014.

Peyré, G. and Cuturi, M. Computational Optimal Transport. https://arxiv.org/abs/1803.00567, 2018.

Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Gould, J., Liu, S., Lin, S., Berube, P., Lee, L., Chen, J., Brumbaugh, J., Rigollet, P., Hochedlinger, K., Jaenisch, R., Regev, A., and Lander, E. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 2019.

Schmitzer, B. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing, 2019.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Yurochkin, M., Claici, S., Chien, E., Mirzazadeh, F., and Solomon, J. M. Hierarchical optimal transport for document representation. In Advances in Neural Information Processing Systems 32, 2019.

A. The Proofs

In this section, we give the proofs of the lemmas, theorems and propositions in the main paper.

Proof of Lemma 5. For all x ∈ R^n we have

    f_λ(x) ≤ λ log( n max_j e^{x_j/λ} ) − λ log n = f(x),
    f(x) = λ log max_j e^{x_j/λ} < λ log( Σ_{j=1}^n e^{x_j/λ} ) = f_λ(x) + λ log n.

Furthermore, it is easy to prove that f_λ(x) is 1/λ-smooth. Therefore, f_λ(x) is an approximation of f(x) with parameters (1, log n).

Proof of Lemma 6. From Eqn. (12), we see that ∇²E_λ^i = (1/λ)[ (1/(1ᵀV_i)) Λ_i − (1/(1ᵀV_i)²) V_i V_iᵀ ] has O = {k1 : k ∈ R} as its null space. In the orthogonal complement of O, ∇²E_λ^i is diagonally dominant, therefore strictly positive definite.

Weyl's inequality (Horn & Johnson, 1991) states that the maximal eigenvalue of A = B − C is no greater than the maximal eigenvalue of B minus the minimal eigenvalue of C, where B is an exact matrix and C is a perturbation matrix. Hence the maximal eigenvalue of ∇²E_λ^i, denoted as λ_i, has the upper bound

    0 ≤ λ_i ≤ (1/λ) (1/(1ᵀV_i)) max_j {K_ij v_j} ≤ 1/λ.

Thus the maximal eigenvalue of ∇²E_λ(ψ) is no greater than Σ_{i=1}^m µ_i λ_i ≤ 1/λ. It is easy to find that E_λ(ψ) ≤ E(ψ) ≤ E_λ(ψ) + λ log n. Thus, E_λ(ψ) is a 1/λ-smooth approximation of E(ψ) with parameters (1, log n).

Proof of Lemma 7. By the gradient formula in Eqn. (11) and the optimizer ψ_λ*, we have

    ∂E_λ(ψ_λ*)/∂ψ_j = Σ_{i=1}^m (P_λ*)_ij − ν_j = 0,  ∀ j = 1, · · · , n.

On the other hand, by the definition of P_λ*, we have

    Σ_{j=1}^n (P_λ*)_ij = µ_i,  ∀ i = 1, . . . , m.

Comparing the above two equations, we obtain that P_λ* ∈ π(µ, ν) is the approximate OT plan. The optimal solutions of E_λ(ψ) have the form ψ_λ* + k1, and all of them induce the same approximate transport plan given by Eqn. (13). Thus, P_λ* is unique.

Proof of Theorem 8. Assume ψ* and ψ_λ* are the minimizers of E(ψ) and E_λ(ψ) respectively. Then by the inequality E_λ(ψ) ≤ E(ψ) ≤ E_λ(ψ) + λ log n (cf. the proof of Lemma 6),

    E_λ(ψ*) ≤ E(ψ*) ≤ E(ψ_λ*) ≤ E_λ(ψ_λ*) + λ log n,
    E_λ(ψ_λ*) ≤ E_λ(ψ*) ≤ E(ψ*) ≤ E_λ(ψ*) + λ log n.

This shows |E_λ(ψ*) − E_λ(ψ_λ*)| ≤ λ log n. Removing the indicator functions, we get

    |Ẽ(ψ*) − Ẽ_λ(ψ_λ*)| = |E(ψ*) − E_λ(ψ_λ*)|
        ≤ |E(ψ*) − E_λ(ψ*)| + |E_λ(ψ*) − E_λ(ψ_λ*)|
        ≤ 2λ log n.

Proof of Corollary 9. Under the settings of the smoothed Kantorovich problem in Eqn. (10), E_λ(ψ) is convex and differentiable with ∇²E_λ(ψ) ⪯ (1/λ)I, and I_H(ψ) is convex with proximal operator Π_H(v). Thus, directly applying Thm. 4.4 of (Beck & Teboulle, 2008) with L = 1/λ, we get

    Ẽ_λ(ψ^t) − Ẽ_λ(ψ_λ*) ≤ 2‖ψ_λ* − ψ⁰‖² / (λ(t+1)²).

Setting ψ⁰ = 0 and requiring 2‖ψ_λ*‖²/(λ(t+1)²) ≤ ε, we get that when t ≥ √(2‖ψ_λ*‖²/(λε)), we have Ẽ_λ(ψ^t) − Ẽ_λ(ψ_λ*) ≤ ε.

Proof of Theorem 10. We set the initial condition ψ⁰ = 0. For any given ε > 0, we choose the iteration step t such that 2‖ψ_λ*‖²/(λ(t+1)²) ≤ ε/2, i.e. t ≥ √(8‖ψ_λ*‖² log n)/ε, where ψ_λ* is the optimizer of Ẽ_λ(ψ). By Thm. 4.4 of (Beck & Teboulle, 2008), we have

    Ẽ_λ(ψ^t) − Ẽ(ψ*) ≤ Ẽ_λ(ψ^t) − Ẽ_λ(ψ*)
        ≤ Ẽ_λ(ψ^t) − Ẽ_λ(ψ_λ*)
        ≤ 2‖ψ_λ*‖² / (λ(t+1)²)
        ≤ ε/2.

By Eqn. (14), we have

    Ẽ(ψ^t) − Ẽ(ψ*) = E(ψ^t) + I_H(ψ^t) − E(ψ*) − I_H(ψ*)
        = (E(ψ^t) − E_λ(ψ^t)) + (E_λ(ψ^t) + I_H(ψ^t)) − (E(ψ*) + I_H(ψ*))
        ≤ |E(ψ^t) − E_λ(ψ^t)| + (Ẽ_λ(ψ^t) − Ẽ(ψ*))
        ≤ λ log n + ε/2 = ε/2 + ε/2 = ε.

Next we show that ‖ψ_λ*‖² ≤ n‖C̄‖² by proving |(ψ_λ*)_j| ≤ C̄ for all j ∈ [n].

According to Eqn. (13),

    ν_j = Σ_{i=1}^m µ_i e^{((ψ_λ*)_j − c_ij)/λ} / Σ_{k=1}^n e^{((ψ_λ*)_k − c_ik)/λ}
        = Σ_{i=1}^m µ_i e^{(ψ_λ*)_j/λ} / ( Σ_{k=1}^n e^{(ψ_λ*)_k/λ} e^{(c_ij − c_ik)/λ} ).    (17)

Assume (ψ_λ*)_max is the maximal element of ψ_λ*. We have Σ_{k=1}^n e^{(ψ_λ*)_k/λ} e^{(c_ij − c_ik)/λ} ≥ e^{(ψ_λ*)_max/λ} e^{−C_max/λ}, where C_max is the maximal element of the matrix C. Thus,

    ν_j ≤ Σ_{i=1}^m µ_i e^{(ψ_λ*)_j/λ} / ( e^{(ψ_λ*)_max/λ} e^{−C_max/λ} ) = e^{(ψ_λ*)_j/λ} / ( e^{(ψ_λ*)_max/λ} e^{−C_max/λ} ).

Then (ψ_λ*)_max ≤ (ψ_λ*)_j + C_max − λ log ν_j and

    (ψ_λ*)_max ≤ (1/n) Σ_{j=1}^n { (ψ_λ*)_j + C_max − λ log ν_j } ≤ C_max − λ log ν_min.    (18)

According to the inequality of arithmetic and geometric means, we have Σ_{k=1}^n e^{(ψ_λ*)_k/λ} ≥ n e^{(1/n) Σ_{k=1}^n (ψ_λ*)_k/λ} = n. Thus, ν_j ≤ e^{(ψ_λ*)_j/λ} / (n e^{−C_max/λ}), and

    (ψ_λ*)_j ≥ λ log n − C_max + λ log ν_j ≥ λ log ν_min − C_max.    (19)

Combining Eqn. (18) and Eqn. (19), we have |(ψ_λ*)_j| ≤ C_max − λ log ν_min = C̄. Hence, we obtain that when t ≥ √(8‖C̄‖² n log n)/ε, Ẽ(ψ^t) − Ẽ(ψ*) < ε.

Each iteration in Eqn. (15) needs O(n²) operations; thus the total computational complexity of the proposed method is O(n^{2.5} √(log n)/ε).
