
Approximation of optimization problems with constraints through kernel Sum-Of-Squares

Pierre-Cyril Aubin-Frankowski, Alessandro Rudi∗

January 18, 2023

Abstract
Handling an infinite number of inequality constraints in infinite-dimensional spaces occurs in many
fields, from global optimization to optimal transport. These problems have been tackled individually
in several previous articles through kernel Sum-Of-Squares (kSoS) approximations. We propose here
a unified theorem to prove convergence guarantees for these schemes. Inequalities are turned into
equalities involving a class of nonnegative kSoS functions. This enables the use of scattering inequalities to
mitigate the curse of dimensionality in sampling the constraints, leveraging the assumed smoothness
of the functions appearing in the problem. This approach is illustrated in learning vector fields with
side information, here the invariance of a set.
Keywords: Reproducing kernels, nonconvex optimization, constraints, sum-of-squares.

1 Introduction
In this study, we are interested in solving approximately the following family of convex optimization
problems

f̄ ∈ arg min_{f = [f₁; …; f_P] ∈ C^0(X, R^P)} L(f)    (1a)
s.t. f(x) ∈ F(x), ∀ x ∈ X,    (1b)
     f ∈ E,    (1c)

where X is a complete separable metric space, E is an affine subspace of the space of continuous func-
tions C 0 (X , RP ), and, for every x ∈ X , F(x) is a non-empty closed convex subset of RP . The pointwise
constraint (1b) accounts for convex constraints on the value of f(x), whereas the global constraint (1c)
can for instance force f to be harmonic or group-invariant, and more generally accounts for linear
equality constraints. We refer to Sec. 2 for examples of problem (1). The space C^0(X, R^P) being hard to
manipulate, we approximate it by a vector-valued reproducing kernel Hilbert space (vRKHS). This choice
is motivated by our main constraints of interest, (1b), which are defined pointwise. RKHSs are precisely
the Hilbert spaces of functions for which norm convergence implies pointwise convergence, so that we can
ensure the satisfaction of the constraints when taking limits. This is a key property of such spaces for
learning tasks, as recalled in Canu et al. (2009, Section 3).


∗INRIA and Département d'Informatique, École Normale Supérieure, PSL Research University, pierre-cyril.aubin@inria.fr, alessandro.rudi@inria.fr

For applications, we shall consider problems of the form

min_{f ∈ H_K} L(f)    (2a)
s.t. c_i(x)^⊤ f(x) + d_i(x) ≥ 0, ∀ x ∈ K_i, ∀ i ∈ [I] := {1, …, I},    (2b)
     f ∈ E,    (2c)

where the sets (Ki )i∈[I] are compact subsets of X , ci ∈ C 0 (X , RP ), di ∈ C 0 (X , R), and the search
space C 0 (X , RP ) is restricted to a vRKHS HK ⊂ C sK (X , RP ) (sK ∈ N), denoting by C sK the space
of s_K-smooth functions when X is a Euclidean space. Leveraging the compactness of the K_i, we shall sample
elements within them through finite subsets {x̃_{i,m}}_{m∈[M_i]} ⊂ K_i. Similarly, if the description of E requires
an infinite number of equality constraints, we can either hard encode them in the choice of kernel K
(as done for translation and rotation invariant kernels in Micheli and Glaunès (2014)) or subsample the
collection of equalities to define an approximating affine vector space Ê. Moreover L could be replaced
by a surrogate function L̂, as when moving from population to empirical risk minimization. For the
pointwise constraints, the general approach of kernel Sum-Of-Squares (kSoS) is to replace each inequality
constraint by an equality constraint. This is done by adding an extra degree of freedom based on a second
real-valued RKHS Hφ ⊂ C sφ (X , R) (sφ ∈ N) and its feature map φ : x 7→ φ(x) ∈ Hφ , considering the
problem:
min_{f ∈ H_K, (A_i)_{i∈[I]} ∈ S^+(H_φ)^I}  L(f) + λ_tr Σ_{i∈[I]} Tr(A_i)    (3a)

s.t. c_i(x̃_{i,m})^⊤ f(x̃_{i,m}) + d_i(x̃_{i,m}) = ⟨φ(x̃_{i,m}), A_i φ(x̃_{i,m})⟩_{H_φ}, ∀ m ∈ [M_i], ∀ i ∈ [I],    (3b)
     f ∈ Ê,    (3c)

where λ_tr ≥ 0 and S^+(H_φ) denotes the cone of positive semidefinite operators over H_φ. For
s_φ ≥ 2, the extra smoothness enables the application of scattering inequalities (Wendland, 2005, Chapter
11), which improve the statistical bounds on the approximation of (2) by (3). More precisely, taking I = 1
so that there is a single constraint set K, define the fill distance as h_{X̂} := sup_{x∈K} min_{m∈[M]} ‖x − x_m‖,
describing how tightly K is covered. Our main result can be informally stated as follows:

Theorem (Informal main result). Under generic assumptions on the smoothness s ∈ N of the functions,
non-degeneracy of the constraints, and a well-behaved objective function, the error between the
optimal value L(f̄_{(2)}^{AFF}) of (2) and the optimal value L(f̄_{(3)}^{SDP}) of its approximation (3) is upper bounded:

|L(f̄_{(2)}^{AFF}) − L(f̄_{(3)}^{SDP})| ≤ C min(h_{X̂}, δ_s h_{X̂}^s),

where C > 0 is an explicit constant and δ_s ∈ {1, +∞}; in particular δ_s = 1 if there exists (A_i)_{i∈[I]} ∈
S^+(H_φ)^I such that c_i(x)^⊤ f̄_{(2)}^{AFF}(x) + d_i(x) = ⟨φ(x), A_i φ(x)⟩_{H_φ} for all x ∈ K and i ∈ [I].

The precise statement is to be found in Thm. 3. In other words, our result is that sampling the
constraints through kSoS as in (3) always leads to a consistent approximation of (2), at worst at a slow
rate, since h_{X̂} ∝ M^{−1/d} for uniform sampling. However, if the constraints of the optimal solution f̄_{(2)}^{AFF}
can be represented by a kSoS function, then this enables a fast rate of O(M^{−s/d}), which mitigates the
curse of dimensionality in sampling constraints.
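As an aside, the fill distance is straightforward to estimate numerically. The following small Python sketch (with illustrative names; the supremum over K is replaced by a maximum over a fine grid, which is our simplifying assumption) can be used to check the M^{−1/d} behaviour for uniform sampling:

```python
import numpy as np

def fill_distance(X_hat, K_grid):
    """Approximate h_Xhat = sup_{x in K} min_m ||x - x_m|| by replacing the
    supremum over K with a maximum over a fine finite grid covering K."""
    dists = np.linalg.norm(K_grid[:, None, :] - X_hat[None, :, :], axis=-1)
    return dists.min(axis=1).max()

# e.g. M uniform samples in K = [0,1]^2, checked on a 100 x 100 grid
rng = np.random.default_rng(0)
X_hat = rng.uniform(size=(50, 2))
g = np.linspace(0.0, 1.0, 100)
gx, gy = np.meshgrid(g, g)
K_grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
print(fill_distance(X_hat, K_grid))  # decreases like M^(-1/d) for uniform sampling
```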

Comparison with previous work. The approach and analysis we present generalize the idea and
the method proposed in Rudi et al. (2020) for constrained convex problems resulting from the reformulation
of non-convex optimization, extending them to more general kinds of inequalities. Concerning the hierarchy
of problems,¹ we show below in Lemma 1 that, for strictly positive definite kernels k_φ, (2b) implies the
existence of A satisfying (3b). So problem (3) for λ_tr = 0 is a relaxation of problem (2). An alternative
framework is the tightening proposed by Aubin-Frankowski and Szabó (2020, 2022):

c_i(x̃_{i,m})^⊤ f(x̃_{i,m}) + d_i(x̃_{i,m}) ≥ η_{m,i} ‖f‖_{H_K}, ∀ m ∈ [M_i], ∀ i ∈ [I],    (4)

where the parameters ηm,i > 0 only depend on the covering of Ki by the discrete samples {x̃i,m }m∈[Mi ] , on
the kernel K and on the constraint functions ci and di . Although they are in finite number, the constraints
in (4) are shown to be tighter than (2b). Nevertheless the added term η_{m,i}‖f‖_{H_K}, which guarantees that the
constraints (2b) are satisfied, also hampers the speed of convergence. The kSoS constraints (3b) instead
leverage the smoothness of the functions to achieve an improved asymptotic rate, as presented in Thm. 3.
This is done at the price of not ensuring the satisfaction of the original constraints, since (3) is but an
approximation and not a tightening of (2). Numerically, to gain convergence speed with (3b), or to achieve
guarantees with (4), while only manipulating finitely many points, one has to pay the computational price of
moving to semidefinite programming (SDP) for (3b) or second-order cone programming (SOCP) for (4), instead
of just discretizing the affine constraints (2b).
While in general we would have H_K ≠ H_φ, and kSoS functions are an extra, secondary variable,
as in (3), there are situations where f is only required to be nonnegative, i.e. (2b) corresponds to
f(x) ≥ 0. In this case, which we do not cover, kSoS functions can be used as the primary variable,
f(x) = ⟨φ(x), Aφ(x)⟩_{H_φ} ≥ 0, and there is no need to discretize the constraints since kSoS functions are
nonnegative. This approach was applied to estimating probability densities and then sampling from the
estimates (Marteau-Ferey et al., 2020, 2022).

Extensions. We can similarly consider within our framework constraints over the derivatives of f .
In this case, for the derivatives to be well-defined we assume X to be a subset of Rd that is contained in
the closure of its interior for the Euclidean topology. We would then replace (2b) by

c_i(x)^⊤ ∂^{r_i} f(x) + d_i(x) ≥ 0, ∀ x ∈ K_i, ∀ i ∈ [I],    (5)

for a multi-index ri ∈ Nd , and look for f ∈ C r (X , RP ) with r the largest index appearing in the (ri )i∈[I] .
Derivatives in the constraints for instance appear in the application of the kSoS approach to estimating
the value function in optimal control (Berthier et al., 2022). If one wanted f to be convex, derivatives
and SDP constraints would also appear as extensively considered in Aubin-Frankowski and Szabó (2022);
Muzellec et al. (2021). Our approach could easily be extended to cover these cases.
The article is organized as follows. In Sec. 2, we present some examples to which the presented theory
applies. Sec. 3 contains our main theoretical results. We illustrate these on learning vector fields in
Sec. 4. Further technical results on the closedness of sets and on scattering inequalities are gathered in Sec. 5.
Finally, the appendix (Sec. 6) details more general results on selection theorems for lower
semicontinuous set-valued maps.

¹We say that problem (P₁) is a relaxation (resp. tightening) of problem (P₂) if they have the same objective function and
the search space of (P₁) contains (resp. is contained in) that of (P₂). This dichotomy is paramount to our approach.

2 Examples
After introducing our notation and vRKHSs, we present some interesting problems (global optimization,
estimation of optimal transport cost, learning vector fields) that fit our framework.
Notation: Let N = {0, 1, …}, N* = {1, 2, …} and R₊ denote the set of natural numbers, positive
integers and non-negative reals, respectively. We write [[n₁, n₂]] = {n₁, n₁+1, …, n₂} for the set of
integers between n₁, n₂ ∈ N (not to be confused with the closed interval [a, b]) and use the shorthand
[N] := [[1, N]] with N ∈ N, with the convention that [0] is the empty set. The interior of a set S ⊆ F is
denoted by S̊, its closure by S̄, its boundary by ∂S. The i-th canonical basis vector is e_i; the zero matrix is
0_{d₁×d₂} ∈ R^{d₁×d₂}; the identity matrix is denoted by Id ∈ R^{d×d}. The set of d × d symmetric (resp. positive
semidefinite) matrices is denoted by S_d (resp. S_d^+), and Tr(·) denotes the trace. The closed ball in a normed
space F with center c ∈ F and radius r > 0 is B_F(c, r) = {f ∈ F : ‖c − f‖_F ≤ r}. Given a multi-index
r ∈ N^d, let |r| = Σ_{j∈[d]} r_j be its length, and let the r-th order partial derivative of a function f be denoted
by ∂^r f(x) = ∂^{|r|} f(x) / (∂x₁^{r₁} … ∂x_d^{r_d}). Similarly, for multi-indices r, q ∈ N^d, let
∂^{r,q} f(x, y) = ∂^{|r|+|q|} f(x, y) / (∂x₁^{r₁} … ∂x_d^{r_d} ∂y₁^{q₁} … ∂y_d^{q_d}). For a
fixed s ∈ N, the set of R^d-valued functions on X with continuous derivatives up to order s is denoted
by C^s(X, R^d). The set of R^{d₁×d₂}-valued functions on X × X for which ∂^{r,r} f exists and is continuous
for all |r| ≤ s is denoted by C^{s,s}(X × X, R^{d₁×d₂}). The domain Dom(f) of a function is the set of
points where it takes finite values.
vRKHS: A function K : X × X → R^{Q×Q} is called a positive semidefinite matrix-valued kernel on X
if K(x, x′) = K(x′, x)^⊤ for all x, x′ ∈ X and Σ_{i,j∈[N]} v_i^⊤ K(x_i, x_j) v_j ≥ 0 for all N ∈ N*, {x_n}_{n∈[N]} ⊂ X
and {v_n}_{n∈[N]} ⊂ R^Q. For x ∈ X, let K(·, x) be the mapping x′ ∈ X ↦ K(x′, x) ∈ R^{Q×Q}. Let F_K
denote the vRKHS associated to the kernel K (see Carmeli et al., 2010, and references therein); we use
the shorthands ‖·‖_K := ‖·‖_{F_K} and ⟨·, ·⟩_K := ⟨·, ·⟩_{F_K} for the norm and the inner product on F_K. The
Hilbert space F_K consists of functions X → R^Q for which (i) K(·, x)c ∈ F_K for all x ∈ X and c ∈ R^Q,
and (ii) ⟨f, K(·, x)c⟩_K = ⟨f(x), c⟩ for all f ∈ F_K, x ∈ X and c ∈ R^Q. The first property of vRKHSs
describes the basic elements of F_K; the second one is called the reproducing property, and it
can be extended to function derivatives (Aubin-Frankowski and Szabó, 2022, Lemma 1). Constructively,
F_K is the closure of span{K(·, x)c : x ∈ X, c ∈ R^Q} w.r.t. ‖·‖_K, where span denotes the linear hull of its argument.
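To fix ideas, here is a small Python sketch (with illustrative names) of the diagonal matrix-valued Gaussian kernel used later in Sec. 4, together with the evaluation of a function in its span through the reproducing property:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Scalar Gaussian kernel matrix [k(x_i, y_j)]_{ij}."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def vector_valued_gram(X, Y, sigma, Q):
    """Block Gram matrix of the diagonal matrix-valued kernel K(x,y) = k(x,y) I_Q."""
    return np.kron(gaussian_kernel(X, Y, sigma), np.eye(Q))

# a function f = sum_j K(., z_j) c_j is evaluated via the reproducing property:
# f(x) = sum_j k(x, z_j) c_j, i.e. a matrix-vector product with the kernel matrix
Z = np.random.default_rng(1).uniform(size=(5, 2))   # anchors z_j
C = np.ones((5, 3))                                 # coefficients c_j in R^Q, Q = 3
x = np.array([[0.5, 0.5]])
f_x = gaussian_kernel(x, Z, sigma=1.0) @ C          # f(x), shape (1, Q)
```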

2.1 Finding global minima


We follow here the construction in Rudi et al. (2020) which develops a subset of the framework elaborated
in this paper. The goal in this context is to solve

g* = inf_{x∈Ω} g(x),

where g ∈ C^s(R^d) with s ≥ 2, Ω is an open bounded subset of R^d with d ∈ N, and there exists at least
one minimizer x* ∈ Ω. A simple convex reformulation of the problem above is as follows:

min_{c∈R} −c
s.t. g(x) − c ≥ 0, ∀ x ∈ Ω.

The problem above yields exactly g* as its optimal value: indeed c = g* satisfies the constraints, while any
c > g* would violate them in a neighbourhood of x*, which lies in Ω. The problem is convex, since

both the functional to minimize and the constraints are linear in c; it fits our problem of interest (2)
by setting k(x, y) = 1 (encoding constant functions), I = 1, c(x) = −1 and d(x) = g(x). By applying
the construction we propose in this paper, we obtain the following convex approximation of the original
non-convex minimization problem:

min_{c∈R, A∈S^+(H_φ)} −c + λ_tr Tr(A)

s.t. g(x̃_m) − c = ⟨φ(x̃_m), Aφ(x̃_m)⟩_{H_φ}, ∀ m ∈ [M],

where x̃_1, …, x̃_M is a list of points chosen appropriately (e.g. uniformly at random) to cover Ω, and H_φ
is a rich enough RKHS (such as a Sobolev space). Under some assumptions on g, such as isolated minima with
positive-definite Hessians, and on H_φ, the solution of the problem above converges to g* as M goes to infinity,
with convergence rates that are almost optimal, as discussed in Rudi et al. (2020).
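As an illustration, the problem above reduces to a small semidefinite program once A is restricted to the span of {φ(x̃_m)}_m (the finite-dimensional reduction used in the proof of Lemma 4 below). Here is a minimal sketch in Python with CVXPY, assuming a Gaussian k_φ and the SCS solver; all names, parameter values, and the jitter term are our illustrative choices, not the authors' implementation:

```python
import numpy as np
import cvxpy as cp

def ksos_global_min(g, X_hat, sigma_phi=0.2, lam_tr=1e-4):
    """Sketch of the relaxation above: min -c + lam_tr Tr(B) subject to
    g(x_m) - c = Phi_m^T B Phi_m, where K_phi = R^T R and Phi_m = R e_m."""
    M = X_hat.shape[0]
    d2 = np.sum((X_hat[:, None, :] - X_hat[None, :, :]) ** 2, axis=-1)
    K_phi = np.exp(-d2 / (2 * sigma_phi**2)) + 1e-10 * np.eye(M)  # Gaussian k_phi
    R = np.linalg.cholesky(K_phi).T                 # K_phi = R^T R
    c = cp.Variable()
    B = cp.Variable((M, M), PSD=True)               # finite-dimensional surrogate for A
    constraints = [g(X_hat[m]) - c == R[:, m] @ B @ R[:, m] for m in range(M)]
    cp.Problem(cp.Minimize(-c + lam_tr * cp.trace(B)), constraints).solve(solver=cp.SCS)
    return c.value                                  # estimate of g* (for small lam_tr)

# example: a smooth nonconvex g on [0,1]^2, sampled uniformly at random
rng = np.random.default_rng(0)
X_hat = rng.uniform(size=(60, 2))
print(ksos_global_min(lambda x: x @ x + 0.1 * np.sin(10 * x[0]), X_hat))
```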
Our general analysis covers the algorithm above as a particular case, and allows us to derive slightly
better rates. Indeed, solving the global optimization problem g* = inf_x g(x) up to error ε is very expensive.
Typical algorithms for global optimization have computational complexity of the order O(ε^{−d/2}) (see
Section 1 of Rudi et al., 2020, for a discussion). However, the known computational lower bound is of the
order O(ε^{−d/s}) (Novak, 1988), breaking the curse of dimensionality when g is very regular, i.e. s ≫ 1.
For example, if s ≥ d, it means that an optimal algorithm would take time at most O(ε^{−1}); in other words,
to double the precision it would just need double the time. Rudi et al. (2020) show that, by applying the
technique under discussion, the algorithm has computational complexity O(ε^{−3.5d/(s−d/2−2)}). While this
rate is not optimal (mainly because of the factor 3.5 in the exponent), it again has the same qualitative effect of
breaking the curse of dimensionality; indeed for g very regular, e.g. s ≥ 4d + 2, the algorithm has a rate
of O(ε^{−1}).

2.2 Optimal transport with kSoS


Another problem that fits perfectly within the framework under study is that of estimating the optimal
transport cost W_c(µ, ν) between two smooth probability densities µ, ν defined over a bounded set Ω ⊂ R^d
with d ∈ N. The Kantorovich dual formulation reads as follows (see Chapter 5 of Villani, 2009, for an
introduction):

W_c(µ, ν) = min_{u, v ∈ C(R^d)} −∫ u(x) dµ(x) − ∫ v(y) dν(y)
s.t. u(x) + v(y) ≤ c(x, y), ∀ x, y ∈ Ω,

where c(x, y) is the cost of transporting a unit of mass from x to y and is assumed to be a smooth function,
e.g. c(x, y) = ‖x − y‖² (in general x and y can be taken in different sets with few modifications). In this
case, applying the proposed method to H_K = H_k × H_k, K(x, y) = k(x, y) Id₂, would result in
min_{u,v∈H_k, A∈S^+(H_φ)} −∫ u(x) dµ(x) − ∫ v(y) dν(y) + λ_tr Tr(A)

s.t. c(x̃_m, ỹ_m) − u(x̃_m) − v(ỹ_m) = ⟨φ(x̃_m, ỹ_m), A φ(x̃_m, ỹ_m)⟩_{H_φ}, ∀ m ∈ [M],

where H_φ is an RKHS on Ω × Ω and H_k is an RKHS on Ω. Moreover, if we cannot compute the integrals
against µ and ν, but we have some samples x_1, …, x_n and y_1, …, y_n that are i.i.d. from µ and ν, we can
approximate the integrals as ∫ u(x) dµ(x) ≈ (1/n) Σ_i u(x_i), obtaining the following problem:

min_{u,v∈H_k, A∈S^+(H_φ)} −(1/n) Σ_{i=1}^n u(x_i) − (1/n) Σ_{i=1}^n v(y_i) + λ_tr Tr(A)

s.t. c(x̃_m, ỹ_m) − u(x̃_m) − v(ỹ_m) = ⟨φ(x̃_m, ỹ_m), A φ(x̃_m, ỹ_m)⟩_{H_φ}, ∀ m ∈ [M].

The solution of the problem above converges to W_c(µ, ν) as M and n go to infinity, as discussed in Vacher
et al. (2021), who first presented this application of kSoS techniques. Our general analysis covers the
algorithm above as a particular case, and provides the same optimal rates, but with explicit constants.
Moreover, as in the non-convex optimization case of Sec. 2.1, computing the Wasserstein distance is also
in general a difficult problem suffering from the curse of dimensionality. However, Vacher et al. (2021)
showed that the approach under study achieves an error ε with a computational complexity
that breaks the curse of dimensionality in the exponent, depending on the degree of regularity of the
problem (in this case, µ and ν should have s-times differentiable densities w.r.t. the Lebesgue measure). When
s ≫ d, the resulting complexity is then dimension-independent, even though the constants in the rate may
depend exponentially on d.
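A minimal sketch of the sampled problem above, assuming (for illustration only; this is not the implementation of Vacher et al. (2021)) a representer-style parametrization of u and v on the union of the data and constraint points, a Gaussian kernel, and the squared-distance cost:

```python
import numpy as np
import cvxpy as cp

def gauss(X, Y, s):
    return np.exp(-np.sum((X[:, None, :] - Y[None, :, :]) ** 2, -1) / (2 * s**2))

def ksos_ot(xs, ys, xt, yt, sigma_k=0.5, sigma_phi=0.25, lam_tr=1e-4):
    """xs, ys: n i.i.d. samples from mu and nu; (xt, yt): M constraint pairs."""
    n, M = xs.shape[0], xt.shape[0]
    Zu, Zv = np.vstack([xs, xt]), np.vstack([ys, yt])   # anchors for u and v
    a, b = cp.Variable(n + M), cp.Variable(n + M)       # u = k(., Zu) a, v = k(., Zv) b
    u_dat, v_dat = gauss(xs, Zu, sigma_k) @ a, gauss(ys, Zv, sigma_k) @ b
    u_con, v_con = gauss(xt, Zu, sigma_k) @ a, gauss(yt, Zv, sigma_k) @ b
    pairs = np.hstack([xt, yt])                         # phi lives on Omega x Omega
    K_phi = gauss(pairs, pairs, sigma_phi) + 1e-10 * np.eye(M)
    R = np.linalg.cholesky(K_phi).T                     # K_phi = R^T R
    B = cp.Variable((M, M), PSD=True)
    cost = np.sum((xt - yt) ** 2, axis=1)               # c(x, y) = ||x - y||^2
    cons = [cost[m] - u_con[m] - v_con[m] == R[:, m] @ B @ R[:, m] for m in range(M)]
    obj = -cp.sum(u_dat) / n - cp.sum(v_dat) / n + lam_tr * cp.trace(B)
    prob = cp.Problem(cp.Minimize(obj), cons)
    prob.solve(solver=cp.SCS)
    return prob.value    # penalized estimate, following the display above
```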

2.3 Learning dynamical systems with shape constraints


Here our goal is to learn a continuous-time autonomous dynamical system of the form

ẋ(t) = f (x(t)) (6)

over a set X ⊆ Rd , based on a few noisy observations of some trajectories of the system (6), forming a
training set
D := {(xn , yn ), n ∈ [N ]} ⊂ (X × Rd )N . (7)
If the velocity vectors yn are not directly available from measurements, they can be estimated based on
the observed trajectories. However we expect that the modeler has access to some extra information on
the properties of the desired f . The vector field f may for instance be known to leave invariant some
subsets of Rd , or to be the gradient of some unknown potential function. For controlled systems, f as
in (6) can correspond to a given closed-loop control, i.e. for a given f̃ : X × U → Rd , with a control set
U ⊂ Rm , there exists a function ū : X → U such that

f (x) = f̃ (x, ū(x)), ∀ x ∈ X . (8)

We thus look for an estimation procedure of a vector field f̂ that is both compatible with the observations,
in the sense that yn ≈ f̂ (xn ), and with the qualitative priors imposed on its form, such as (8). These
priors are a form of side information that is incorporated to improve both the estimation procedure of f̂
and its interpretability in the context of dynamical systems (Ahmadi and El Khadir, 2023). In statistics,
they are known as shape constraints, such as assuming non-negativity, monotonicity or convexity of some
components of the estimation (Guntuboyina and Sen, 2018). Many of these constraints can be written as
requiring that f̂(x) belongs to a set F₀(x) ⊆ R^d,²

f̂ (x) ∈ F0 (x), ∀ x ∈ X . (9)


²We then say that f̂ is a selection of the set-valued map F₀ : X ⇝ R^d.

For instance, if we want to interpolate some points, we can take F0 (xi ) = {yi }; if we want the first
component fˆ1 to be nonnegative over a set K ⊂ Rd , then we can pose F0 (x) = {y | y1 ≥ 0} for all x ∈ K
and F0 (x) = Rd elsewhere; for controlled systems as in (8), F0 (x) = f̃ (x, U). Constraints of the form
(9) correspond to local pointwise requirements of f̂ . We could also consider nonlocal constraints such as
parity of f̂, i.e. f̂(−x) = f̂(x), or global requirements such as f̂ = ∇V for some V : X → R. Each of these
pieces of side information corresponds to constraining f̂ to belong to a set of functions E.
Moreover, we require f̂ to belong to a set of functions, such as the space of Lipschitz or of C¹-smooth
functions over X, in order to perform some minimal, Cauchy–Lipschitz style analysis of (6).³
Ahmadi and El Khadir (2023) proposed to use polynomial sums-of-squares to fit the constraints and
polynomial functions for f. However, polynomials of degree two and above can induce finite-time explosion of
trajectories. Moreover, the optimization problem quickly becomes computationally intractable when the
degree of the polynomials increases, due to the combinatorial number of monomials. Kernels, instead, are
much smoother, and often chosen so as to vanish at infinity. Applying matrix-valued kernels to recover
vector fields has already been considered (see e.g. Baldassarre et al., 2012, and references therein), but, to
our knowledge, not with infinitely many constraints, with the exception of Aubin-Frankowski and Szabó
(2022). To assess the performance of the estimator, one can consider the L² or L^∞ error between vector
fields or an angular error, the distance between reconstructed trajectories, the amount of violation of the
constraints, etc. We present a numerical application to estimating vector fields in Sec. 4.

3 Theoretical results
In what follows, to simplify the presentation, we take Ê = E = H_K, L̂ = L, and take a single constraint
set K_i = K (∀ i ∈ [I]), focusing only on the approximation of the constraints. We shall also consider
the following assumptions, which essentially require some smoothness of the functions, non-degenerate
constraints, and a well-behaved objective function:

(A1) The set X is contained in R^d and in the closure of its interior. The kernels satisfy K(·,·) ∈
C^{s_K,s_K}(X × X, R^{P×P}) and k_φ(·,·) ∈ C^{s_φ,s_φ}(X × X, R) for some s_K, s_φ ∈ N.

(A2) The constraint functions satisfy C(·) ∈ C^s(X, R^{I×P}) and d(·) ∈ C^s(X, R^I) for some s ∈ N.

(A3) The set K is a compact set with samples X̂ = {x_m}_{m∈[M]} ⊂ K chosen such that K_{φ,X̂} =
[k_φ(x_i, x_j)]_{i,j∈[M]} is invertible.

(A4) For any x ∈ K there exists z_x ∈ R^P such that C(x)z_x > 0, and the functions of H_K restricted
to K are dense in C^0(K, R^P).

(A5) The objective L is lower bounded over H_K, has full domain, i.e. H_K ⊂ Dom(L), and its sublevel
sets are weakly compact⁴ in H_K.

³Given some trajectories, there does not always exist an autonomous system such as (6) if time is involved. For instance
a railroad switch may change with time, bringing trains to different directions even though they pass through the same point.
Non-autonomous systems can be seen as a succession of autonomous systems on very short time intervals and would require
many observations at every given time to be learnt.

⁴A set is weakly compact if every sequence in it has a subsequence converging weakly to an element of the set (see Attouch
et al., 2014, Section 2.4.4).

Remarks on the assumptions. If no derivatives are considered, i.e. s = 0, then X in Assumption (A1 )
can be any complete separable metric space, and bounds on derivatives can be replaced by Lipschitz
bounds, provided the functions under study are Lipschitz on K. Smoothness as in Assumptions (A1 )
and (A2) is required for lower semicontinuity of the objective and to trigger the fast approximation
behaviour of the kSoS estimate. Compactness in Assumption (A3) allows us to consider the fill distance and
to ensure that sampling eventually covers the whole set. Any strictly positive definite (resp. universal)
kernel satisfies the second part of Assumption (A3) (resp. Assumption (A4)). The Gaussian kernel is
for instance both strictly positive definite and universal. Our assumptions are more general and lighter
than those of Rudi et al. (2020), which essentially restricted the kernel k_φ to be of Sobolev type, as
their proofs require Hφ to be a multiplicative algebra. Finally, Assumption (A5 ) is a classical assumption
in infinite-dimensional optimization to ensure existence of solutions, as it is satisfied by convex lower
semicontinuous coercive functions (Attouch et al., 2014, Definition 3.2.4).

We shall manipulate various forms of inequality constraints. We thus write our generic problem (P∗ ),
with objective L and constraints defined through the inequality constraint set V ∗ , as follows

f̄* ∈ arg min_{f∈H_K} L(f)    (P*)
s.t. f ∈ V*.

We assumed all the equality constraints (1c) on f to be linear, and incorporated them into the definition of H_K. The
inequality constraints (1b) should remain convex even when turned into equalities, so we will stick to the
case of affine constraints (2b). Unlike in (3), we will constrain Tr(A) rather than penalize it. Both problems
are of course equivalent, but constraints are more amenable to theoretical results, while penalization is
more practical computation-wise.
For any ε ∈ R^I, M_f ≥ 0, M_A ≥ 0, and given X̂ ⊂ K ⊂ X as in Assumption (A3), we shall consider the
following constraint sets:

V_0^{AFF} := {f ∈ H_K | C(x)f(x) + d(x) ≥ 0, ∀ x ∈ K};    (10)

V_ε^{SDP} := {f ∈ H_K | ∃ (A_i)_{i∈[I]} ∈ S^+(H_φ)^I,
    C(x_m)f(x_m) + d(x_m) = ε + [⟨φ(x_m), A_i φ(x_m)⟩_{H_φ}]_{i∈[I]}, ∀ m ∈ [M]};    (11)

V_{ε,rest}^{SDP} := {f ∈ V_ε^{SDP} with max_{i∈[I]} Tr(A_i) ≤ M_A and ‖f‖_K ≤ M_f}.    (12)

We also introduce three quantities which will be useful to study the properties of the SDP approximation
(3). They are defined for s ∈ N such that s ≤ s_φ, a collection of points X̂ = {x_m}_{m∈[M]} ⊂ X, a bounded
set Ω, and any function g ∈ C^s(X, R), as follows:

(fill distance)  h_{X̂,Ω} := sup_{x∈Ω} min_{m∈[M]} ‖x − x_m‖;    (13)

(C^s seminorm)  |g|_{Ω,s} := sup_{x∈Ω} max_{α∈N^d, |α|=s} |∂^α g(x)|;    (14)

(bound on kernel derivatives)  D²_{Ω,s} := sup_{x,y∈Ω} max_{α∈N^d, |α|≤s} |∂_x^α ∂_y^α k_φ(x, y)|.    (15)
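These quantities can also be estimated numerically; here is a rough Python sketch for s = 1 on a finite grid approximating Ω, assuming a Gaussian k_φ so that ∂_{x_j}∂_{y_j} k_φ is available in closed form (all names and parameters are illustrative):

```python
import numpy as np

def c1_seminorm(g, grid, h=1e-5):
    """(14) with s = 1: sup over the grid of max_j |dg/dx_j| (central differences)."""
    out = 0.0
    for j in range(grid.shape[1]):
        e = np.zeros(grid.shape[1]); e[j] = h
        dg = (np.apply_along_axis(g, 1, grid + e)
              - np.apply_along_axis(g, 1, grid - e)) / (2 * h)
        out = max(out, np.abs(dg).max())
    return out

def d2_gauss(grid, sigma):
    """(15) with s = 1 for a Gaussian k_phi, using the closed form
    dx_j dy_j k(x,y) = k(x,y) (1/sigma^2 - (x_j - y_j)^2 / sigma^4)."""
    diff = grid[:, None, :] - grid[None, :, :]
    k = np.exp(-np.sum(diff**2, -1) / (2 * sigma**2))
    cross = k[..., None] * (1 / sigma**2 - diff**2 / sigma**4)
    return max(k.max(), np.abs(cross).max())   # alpha = 0 and |alpha| = 1 terms
```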

Roadmap of the results. We first prove in Thm. 2 that our assumptions ensure the well-posedness, i.e.
the existence of solutions, of each of our three problems. We then show in Thm. 3 that, for smooth positive
definite kernels, the SDP approximation (P_0^{SDP}) is a relaxation whereas, for well-chosen (M_f, M_A, ε),
the SDP approximation (P_{ε,rest}^{SDP}) is a tightening. Exhibiting a nested sequence allows us to derive rates of
convergence. The amount of perturbation of the constraints proves to be crucial to easily bound the
approximation error. More precisely, the result concerning the relaxation is based on Lemma 1, which
can be seen as a converse of Rudi et al. (2020, Lemma 3, p15). The weak closedness of the sets, which
is used in proving the well-posedness, is stated in the more technical Lemma 4 in Sec. 5. Sec. 6 contains
two further results on smooth selections in constraint sets, Lemma 8 and Cor. 9, which ensure that the
constraint sets are not empty.⁵
Lemma 1 (kSoS interpolation). Given (x_m, y_m)_{m∈[M]} ∈ (X × R₊)^M such that K_{φ,X̂} = [k_φ(x_i, x_j)]_{i,j∈[M]} is
invertible, there exists A ∈ S^+(H_φ) such that y_m = ⟨φ(x_m), A φ(x_m)⟩_{H_φ} for all m ∈ [M], and
Tr(A) = Tr(diag(y) K_{φ,X̂}^{−1}) ≤ min(‖y‖_∞ Tr(K_{φ,X̂}^{−1}), ‖y‖₁ / λ_min(K_{φ,X̂})), where λ_min(K_{φ,X̂}) is the smallest eigenvalue of K_{φ,X̂}.
), kyk1 λmin (Kφ,X̂ )) where λmin (Kφ,X̂ ) is the smallest eigenvalue of Kφ,X̂ .

Proof. Let Ŝ : H_φ → R^M be the linear operator acting as follows:

Ŝg = (⟨φ(x₁), g⟩_{H_φ}, …, ⟨φ(x_M), g⟩_{H_φ}) ∈ R^M, ∀ g ∈ H_φ.

Consider the adjoint of Ŝ, Ŝ* : R^M → H_φ, defined consequently as Ŝ*β = Σ_{m∈[M]} β_m φ(x_m) for β ∈ R^M.
Note, in particular, that K_{φ,X̂} = Ŝ Ŝ* and that Ŝ* e_m = φ(x_m), where e_m is the m-th element of the
canonical basis of R^M. Fix the continuous linear operator A := Ŝ* K_{φ,X̂}^{−1} diag(y) K_{φ,X̂}^{−1} Ŝ acting from H_φ to
H_φ. The operator A is clearly self-adjoint as the kernel is positive semidefinite, hence K_{φ,X̂} ∈ S^+(R^M).
Moreover, for any g ∈ H_φ, ⟨g, Ag⟩_{H_φ} = ⟨V, diag(y) V⟩_{R^M} with V := K_{φ,X̂}^{−1} Ŝg and y = (y_m)_m ≥ 0. So A
is positive semidefinite. In particular,

⟨φ(x_m), A φ(x_m)⟩_{H_φ} = ⟨Ŝ* e_m, A Ŝ* e_m⟩_{H_φ} = ⟨e_m, Ŝ A Ŝ* e_m⟩_{R^M} = ⟨e_m, diag(y) e_m⟩_{R^M} = y_m.

Using the cyclicity of the trace, we obtain that Tr(A) = Tr(Ŝ Ŝ* K_{φ,X̂}^{−1} diag(y) K_{φ,X̂}^{−1}) = Tr(diag(y) K_{φ,X̂}^{−1}).
To prove that Tr(A) ≤ min(‖y‖_∞ Tr(K_{φ,X̂}^{−1}), ‖y‖₁ / λ_min(K_{φ,X̂})), we just use that, for any matrices B, C ∈
S^+(R^M), Tr(BC) ≤ λ_max(B) Tr(C) (e.g. Wang et al., 1986, Lemma 1).
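In coordinates, the operator of Lemma 1 gives the kSoS function h(x) = ⟨φ(x), Aφ(x)⟩_{H_φ} = k_x^⊤ K_{φ,X̂}^{−1} diag(y) K_{φ,X̂}^{−1} k_x, with k_x := (k_φ(x, x_m))_{m∈[M]}. A short numerical check in Python (the Gaussian kernel and all parameters are chosen for illustration only):

```python
import numpy as np

def gauss(X, Y, s):
    return np.exp(-np.sum((X[:, None, :] - Y[None, :, :]) ** 2, -1) / (2 * s**2))

def ksos_interpolant(X_hat, y, s=0.1):
    """h(x) = k_x^T K^{-1} diag(y) K^{-1} k_x: nonnegative for y >= 0, h(x_m) = y_m."""
    K = gauss(X_hat, X_hat, s)
    def h(x):
        v = np.linalg.solve(K, gauss(np.atleast_2d(x), X_hat, s)[0])  # K^{-1} k_x
        return v @ (y * v)
    return h

X_hat = np.linspace(0.0, 1.0, 10)[:, None]
y = np.random.default_rng(0).uniform(size=10)        # nonnegative values
h = ksos_interpolant(X_hat, y)
assert np.allclose([h(x) for x in X_hat], y)         # interpolation at the x_m
print(np.trace(np.diag(y) @ np.linalg.inv(gauss(X_hat, X_hat, 0.1))))  # Tr(A)
```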

Theorem 2 (Well-posedness of the problems). We have the following:

1) Provided Assumption (A3) is satisfied, we have the inclusion V_0^{AFF} ⊂ V_0^{SDP}, and we can fix M_f(f̄_0^{AFF}) ∈
R₊ and some M_A(f̄_0^{AFF}, X̂) ∈ R₊ such that, if f̄_0^{AFF} exists, then f̄_0^{AFF} ∈ V_{0,rest}^{SDP}.

2) If Assumptions (A1), (A2) and (A4) hold, then we can fix g₁ = arg min_{g∈H₁} ‖g‖_{H_K}, where H₁ := {g ∈
H_K | C(x)g(x) ≥ 1, ∀ x ∈ K}. Moreover the set V_0^{AFF} is then non-empty and, if Assumption (A3) also
holds, so is V_0^{SDP}.

3) Under Assumptions (A2) to (A5), f̄_0^{AFF} exists and, for M_f(f̄_0^{AFF}) and M_A(f̄_0^{AFF}, X̂) as in 1), f̄_{0,rest}^{SDP}
also exists.
Proof. Proof of 1): If K_{φ,X̂} = [k_φ(x_i, x_j)]_{i,j∈[M]} is invertible then, by Lemma 1, for every given f ∈ V_0^{AFF} and
i ∈ [I], there exists an operator A_i ∈ S^+(H_φ) interpolating the values y_{m,i} = c_i(x_m)^⊤ f(x_m) + d_i(x_m) ≥ 0
as a kernel Sum-of-Squares. Thus V_0^{AFF} ⊂ V_0^{SDP}. In particular, if it exists, f̄_0^{AFF} ∈ V_0^{AFF} ⊂ V_0^{SDP}.
Thus we just have to set M_f = ‖f̄_0^{AFF}‖_K and M_A(f̄_0^{AFF}, X̂) = Tr(K_{φ,X̂}^{−1}) max_{m∈[M], i∈[I]} |y_{m,i}|, as discussed in
Lemma 1, in order to have f̄_0^{AFF} ∈ V_{0,rest}^{SDP}.

⁵More generally, returning to the original problem (1), of which (2) is an instance, well-posedness is guaranteed for
E = C^0(X, R^P), as detailed in Sec. 6, provided that i) the set-valued constraint map F : X ⇝ R^P is lower semicontinuous
with closed, convex values, and that ii) the objective L is lower semicontinuous and lower bounded on the considered set of
constraints.

Proof of 2):⁶ Cor. 9 applied with ζ = 1 provides an f₁ ∈ C^0(X, R^P) such that f₁(x) + B_{R^P}(0, 1) ⊂
{y | C(x)y ≥ 1} for all x ∈ K. The latter is equivalent to saying that 1 ≤ inf_{u∈B_{R^P}(0,1)} C(x)(f₁(x) + u).
Since we assumed the functions of H_K restricted to K to be dense in C^0(K, R^P), we can find g ∈ H_K
such that sup_{x∈K} ‖f₁(x) − g(x)‖_∞ ≤ 1/√P. Consequently, for any x ∈ K, (g − f₁)(x) ∈ B_{R^P}(0, 1), i.e. it is
an element of the set over which the infimum is taken, so

C(x)g(x) = C(x)(f₁(x) + (g − f₁)(x)) ≥ inf_{u∈B_{R^P}(0,1)} C(x)(f₁(x) + u) ≥ 1,

hence g ∈ H₁, which is thus non-empty. Since K is a reproducing kernel, for all x ∈ X, H_{1,x} := {g ∈
H_K | C(x)g(x) ≥ 1} is a closed convex subset of H_K. So is H₁ = ∩_{x∈K} H_{1,x}, as an intersection of such
sets. The element g₁ thus exists, as it is the projection of 0 onto H₁.

For any R > 0 define g_R := R g₁ ∈ H_K. For R = sup_{x∈K} ‖d(x)‖_∞, we get, for all x ∈ K,

C(x) g_R(x) ≥ (sup_{x′∈K} ‖d(x′)‖_∞) 1 ≥ −d(x),

so g_R ∈ V_0^{AFF}, which is thus non-empty. If Assumption (A3) also holds, as shown in 1), we obtain that
V_0^{SDP} ⊃ V_0^{AFF} ≠ ∅.

⁶In Aubin-Frankowski (2021, Appendix), the space H_K was given by linearly controlled functions. Hence some more
specific arguments had to be given, i.e. inward-pointing conditions, since one could not leverage generic density properties
of the function space.

Proof of 3): Under Assumptions (A2) to (A5), ∅ ≠ V_0^{AFF} ⊂ V_0^{SDP} and V_{0,rest}^{SDP} ≠ ∅. Moreover
Lemma 4 gives that these sets are all weakly closed in H_K. Denote generically by V* any of these sets,
as in (P*). Since L has full domain and is lower bounded, it has a finite infimum v̄* over V*. By
definition of the infimum, we can fix (f_n)_{n∈N} ∈ (V*)^N such that the sequence L(f_n) ∈ [v̄*, v̄* + 1] converges
to v̄*. By Assumption (A5), the sublevel set F* := {f ∈ H_K | L(f) ≤ v̄* + 1} is weakly compact. As V*
is weakly closed, again by Lemma 4, F* ∩ V* is then also weakly compact, whence (f_n)_{n∈N} ∈ (F* ∩ V*)^N
has a subsequence converging weakly to some f̄* which achieves the infimum, i.e. L(f̄*) = v̄*. So all the
infima are attained as claimed. ∎

For a small fill distance h_{X̂,Ω}, and thus a large number M of sample points, we show in Thm. 3 below
that there is a nested sequence between V_{σ,rest}^{SDP}, V_0^{AFF} and V_0^{SDP}, defined respectively in (12), (10) and
(11), where σ ∈ R₊ is related to the parameters of the problem and decreases with h_{X̂,Ω}. The role
of σ in deriving rates of convergence was partly anticipated in Rudi et al. (2020, Theorem 4), whereas the
idea of using nestedness for studying similar constraint perturbations was advocated in Aubin-Frankowski
(2021, Proposition 1).

Theorem 3 (Main result). Suppose Assumptions (A1) to (A4) hold, with 1 ≤ s ≤ min(s_K, s_φ). For M_f(f̄_0^{AFF})
and M_A(f̄_0^{AFF}, X̂) given by Thm. 2-1), define

σ_slow := (C_{K,1} M_f(f̄_0^{AFF}) + max_{i∈[I]} |d_i|_{K,1}) · h_{X̂,K}.    (16)

Moreover, for s ≥ 2, if K = ∪_{x∈S} B(x, r) for a given bounded set S ⊂ X and some r > 0 and, setting
Ω := ∪_{x∈S} B̊(x, r), we have X̂ = {x_m}_{m∈[M]} ⊂ Ω with h_{X̂,Ω} ≤ r min(1, 1/(18(s−1)²)), then we define

σ_fast := C₀ (C_{K,s} M_f(f̄_0^{AFF}) + max_{i∈[I]} |d_i|_{K,s} + 2^s D²_{Ω,s} M_A(f̄_0^{AFF}, X̂)) · (h_{X̂,K})^s    (17)

where C₀ = 3 max(√d, 3√(2d)(s−1))^{2s} / s! and C_{K,s} := max_{x∈K, i∈[I]} max_{α∈N^d, |α|=s} ‖∂_x^α[K(·, x)c_i(x)]‖_{H_K}.⁷ Then we
have the nested sequence

V_{σ,rest}^{SDP} ⊂ V_0^{AFF} ⊂ V_0^{SDP}  for σ := min(σ_slow, σ_fast).    (18)

If Assumption (A5) also holds, then we have the bound

L(f̄_{0,rest}^{SDP}) ≤ L(f̄_0^{AFF}) ≤ L_{σ,rest}^{SDP},    (19)

where L_{σ,rest}^{SDP}, the optimal value of L over V_{σ,rest}^{SDP}, may be infinite, but is finite provided f̄_0^{AFF} + σg₁ ∈ V_{0,rest}^{SDP}.
If L(·) is also β-Lipschitz on B_K(0, ‖f̄_{0,rest}^{SDP}‖_K + σ‖g₁‖_K), with g₁ as in Thm. 2-2), then we have the bound

L(f̄_{0,rest}^{SDP}) ≤ L(f̄_0^{AFF}) ≤ L(f̄_{0,rest}^{SDP}) + β ‖g₁‖_{H_K} σ.    (20)

⁷See (13), (14) and (15) for the definitions of these quantities.

Discussion of the results of Thms. 2 and 3. By Thm. 2-1) and the definition of V_{σ,rest}^{SDP} in (12),
it is clear that if there exists some A* ∈ S^+(H_φ) such that c(x)^⊤ f̄_0^{AFF}(x) + d(x) = ⟨φ(x), A* φ(x)⟩_{H_φ} for
all x ∈ K, then f̄_0^{AFF} ∈ V_{σ,rest}^{SDP} for M_f(f̄_0^{AFF}) = ‖f̄_0^{AFF}‖_K and M_A(f̄_0^{AFF}, X̂) = Tr(A*) < ∞. In every
previous article on kSoS, such as Rudi et al. (2020); Vacher et al. (2021), the proof of fast convergence
is divided into two steps: first the existence of A* is shown under some regularity assumptions, then a
bound similar to Thm. 3 is proven. We focused here on the second part, proposing a general approach to
ensuring convergence. The latter indeed occurs at the faster rate σ_fast if the problem exhibits some form
of smoothness. Nevertheless the scheme always converges, at worst at the rate σ_slow.

Proof. Here we set M_f = M_f(f̄_0^{AFF}) and M_A = M_A(f̄_0^{AFF}, X̂) to shorten the notations, and we deal
separately with the cases of σ_fast and σ_slow since the assumptions on K and X̂ differ.

Case of σ_fast: Take f ∈ V_{σ_fast,rest}^{SDP} and i ∈ [I]. By the definition of the constraint set, see (11), the
traces of the corresponding A_i ∈ S^+(H_φ) are bounded by M_A. The extra assumptions on K allow us to
apply Lemma 5 to ξ_i(·) := c_i(·)^⊤ f(·) + d_i(·) − σ_fast and τ = 0. Hence ξ_i(x) ≥ −ε_{0,i} for all x ∈ Ω, with
ε_{0,i} := C₀ (|ξ_i|_{Ω,s} + 2^s D²_{Ω,s} M_A) · (h_{X̂,K})^s. Now we just have to upper bound |ξ_i|_{Ω,s}. Let α ∈ N^d with |α| = s.
Since |ξ_i|_{Ω,s} ≤ |c_i(·)^⊤ f(·)|_{Ω,s} + |d_i|_{Ω,s}, the focus goes to ∂^α[c_i(·)^⊤ f(·)](x). By the reproducing property
for derivatives (Aubin-Frankowski and Szabó, 2022, Lemma 1, with a direct extension to non-constant
coefficients c_i(x)), and using the definition of C_{K,s} in (17),

|∂^α[c_i(·)^⊤ f(·)](x)| = |⟨f, ∂_x^α[K(·, x) c_i(x)]⟩_{H_K}| ≤ C_{K,s} M_f.    (21)

Consequently, ξ_i(x) ≥ −C₀ (C_{K,s} M_f + max_{i∈[I]} |d_i|_{K,s} + 2^s D²_{Ω,s} M_A) · (h_{X̂,K})^s = −σ_fast for all x ∈ Ω. Hence,
by definition of ξ_i, c_i(x)^⊤ f(x) + d_i(x) ≥ 0, and f ∈ V_{0,rest}^{AFF}.

Case of σ_slow: In this case we just use a Lipschitzianity argument, without applying Lemma 5. Take
f ∈ V_{σ_slow,rest}^{SDP} and i ∈ [I]. Define ξ_i as above (with σ_slow in place of σ_fast), so ξ_i(x_m) ≥ 0 for all m ∈ [M],
again by definition (11). Then, for any x ∈ K, there exists m ∈ [M] such that ‖x − x_m‖ ≤ h_{X̂,K}. Applying (21)
for |α| = 1, and then using that f ∈ V_{σ_slow,rest}^{SDP}, we obtain that

c_i(x)^⊤ f(x) + d_i(x) ≥ c_i(x_m)^⊤ f(x_m) + d_i(x_m) − h_{X̂,K} C_{K,1} M_f − h_{X̂,K} max_{i∈[I]} |d_i|_{K,1} ≥ σ_slow − σ_slow = 0,

so f ∈ V_{0,rest}^{AFF}.

As σ = min(σ_slow, σ_fast), the set V_{σ,rest}^{SDP} is either equal to V_{σ_slow,rest}^{SDP} or to V_{σ_fast,rest}^{SDP}, whence,
by the proof above, V_{σ,rest}^{SDP} ⊂ V_0^{AFF}. By Thm. 2-1), V_0^{AFF} ⊂ V_0^{SDP}, so (18) holds. Putting the two
inclusions together, we obtain that

V_{σ,rest}^{SDP} ⊂ V_{0,rest}^{AFF} ⊂ V_{0,rest}^{SDP}.    (22)

By Thm. 2-3), we have the existence of f̄_0^{AFF} and of f̄_{0,rest}^{SDP}. By Thm. 2-1), we have f̄_0^{AFF} ∈ V_{0,rest}^{AFF}, so it is
also optimal for V_{0,rest}^{AFF} ⊂ V_0^{AFF}. We then directly deduce (19) from the nestedness (22) of the constraint
sets.

We will now show (24), first proving that f̄_{0,rest}^{SDP} + σg₁ ∈ V_0^{AFF}.⁸ By definition of g₁, defining ξ_i(·) as
follows, we get, for any x ∈ K,

c_i(x)^⊤ (f̄_{0,rest}^{SDP} + σg₁)(x) + d_i(x) =: ξ_i(x) + σ c_i(x)^⊤ g₁(x) ≥ ξ_i(x) + σ.    (23)

With computations similar to the two cases discussed above, either applying a Lipschitzianity argument
or Lemma 5, we derive that ξ_i(x) ≥ −σ. Hence f̄_{0,rest}^{SDP} + σg₁ ∈ V_0^{AFF}, for which f̄_0^{AFF} is optimal. Since
L(·) is now also assumed to be β-Lipschitz on B_K(0, ‖f̄_{0,rest}^{SDP}‖_K + σ‖g₁‖_K), we derive the bounds

L(f̄_{0,rest}^{SDP}) ≤ L(f̄_0^{AFF}) ≤ L(f̄_{0,rest}^{SDP} + σg₁) ≤ L(f̄_{0,rest}^{SDP}) + β ‖g₁‖_{H_K} σ.    (24)

Note that in (22), V_{σ,rest}^{SDP} may be empty, as the restriction depends on f̄_0^{AFF} rather than on f̄_σ^{AFF} (the bounds
on the functions may be too tight). Consequently L_{σ,rest}^{SDP} may be infinite. If f̄_0^{AFF} + σg₁ ∈ V_{0,rest}^{SDP}, then,
proceeding as in (23), we can show that f̄_0^{AFF} + σg₁ ∈ V_{σ,rest}^{AFF} ⊂ V_{σ,rest}^{SDP}, which implies that L_{σ,rest}^{SDP} is finite,
as the constraint set is non-empty, and due to Assumption (A5). ∎

⁸Note that f̄_{0,rest}^{SDP} + σg₁ does not belong to V_{0,rest}^{AFF} in general, due to the restriction of the constraint set by M_f and M_A.

4 Numerical application
We illustrate our methodology on the problem of learning vector fields under constraints, since the two
other examples mentioned in Sec. 2 were already showcased in earlier works (Rudi et al., 2020; Vacher
et al., 2021). Inspired by the example in Ahmadi and El Khadir (2023), we consider a competitive
Lotka-Volterra model in two dimensions over the set X := [0, 1]²,

ẋ(t) = f̄(x(t)),  where x(t) ∈ R² and f̄(x) = (x₁(1 − x₁ − 0.2·x₂), x₂(1 − x₂ − 0.4·x₁)).    (25)
The structure of the model leads us to introduce the following interpolation and invariance constraints, for
K := ∂X:

f(0) = 0;    (Interp)
⟨n⃗(x), f(x)⟩_{R²} ≤ 0, ∀ x ∈ K;    (Inv)

where n⃗(x) is the outer-pointing normal to K = ∂([0, 1]²) at point x (we consider two normals at the corners).
As discussed above, we discretize the inequality constraint at M points {x̃_m}_{m∈[M]} ⊂ K (with possible
repetition), defining the sampled and kSoS invariance constraints (for some A ∈ S^+(H_φ)):

⟨n⃗(x̃_m), f(x̃_m)⟩_{R²} ≤ 0, ∀ m ∈ [M];    (Sampled-Inv)
−⟨n⃗(x̃_m), f(x̃_m)⟩_{R²} = ⟨φ(x̃_m), Aφ(x̃_m)⟩_{H_φ}, ∀ m ∈ [M].    (kSoS-Inv)

Suppose we are given a few noisy snapshots forming a training dataset

D := {(x_n, y_n)}_{n∈[N]} = {(x_n, f̄(x_n) + 10⁻² · (ε_{n,1}, ε_{n,2}))}_{n∈[N]},

where the noises ε_{n,j} are independently sampled from the standard normal distribution. We then solve
the kernel ridge regression (KRR) problem
min_{f∈H_K, A∈S^+(H_φ)} Σ_{n∈[N]} ‖y_n − f(x_n)‖²_{R²} + λ_K ‖f‖²_K + λ_φ Tr(A)  (+ constraints),    (26)

either without constraints (λ_φ = 0), or with sampled constraints (λ_φ = 0, f satisfying (Interp) and
(Sampled-Inv)), or with kSoS-sampled constraints (λ_φ > 0, f satisfying (Interp) and (kSoS-Inv)).
Denoting the Gaussian kernel of bandwidth σ_K > 0 by k_{σ_K}(x, y) = exp(−‖x − y‖²_{R²}/(2σ_K²)), we choose K to
be a diagonal Gaussian kernel, K(x, y) = k_{σ_K}(x, y) Id₂, and k_φ(x, y) = k_{σ_φ}(x, y) for some σ_K, σ_φ > 0.
We thus have up to four hyperparameters to select. We start with σ_K, which is optimized by grid search over
[0.1, 2], with λ_K deterministically obtained by generalized cross-validation (Golub et al., 1979). We then
use the heuristic σ_φ = σ_K/2, motivated by the sum-of-squares formulation of (kSoS-Inv), and fix λ_φ by
grid search over the logarithmic interval [10⁻⁶, 10⁻¹]. In the experiment we set N = 5 and draw {x_n}_{n∈[N]}
uniformly at random in X. The M sample points are taken on a regular grid on K; consequently there
are M/4 constraints per side of X. The optimal hyperparameters resulted in σ_K = 1 and λ_φ = 10⁻³.
In Sec. 3 we studied constraints on Tr(A); however, these require a good estimate of the bound M_A
beforehand, with the risk that the problem has no solution if M_A is too small. The penalization
hyperparameter λ_φ is instead much easier to tune, and (26) always has a solution.
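The finite-dimensional implementation of (26) is not spelled out above; the following minimal Python/CVXPY sketch is our illustrative assumption, not the authors' code: it parametrizes f(·) = Σ_j k(·, z_j) c_j over anchors z given by the data points, the constraint points, and the origin, reduces A to an M × M PSD matrix as in Lemma 4, and imposes (Interp) and (kSoS-Inv):

```python
import numpy as np
import cvxpy as cp

def gauss(X, Y, s):
    return np.exp(-np.sum((X[:, None, :] - Y[None, :, :]) ** 2, -1) / (2 * s**2))

def fit_ksos_krr(Xd, Yd, Xc, Nc, sK=1.0, sphi=0.5, lamK=1e-3, lamphi=1e-3):
    """Xd, Yd: N noisy snapshots; Xc: M boundary samples; Nc: outer normals at Xc."""
    Z = np.vstack([Xd, Xc, np.zeros((1, 2))])          # anchors z_j
    Kz = gauss(Z, Z, sK) + 1e-9 * np.eye(len(Z))
    C = cp.Variable((len(Z), 2))                       # f(.) = sum_j k(., z_j) c_j
    f = lambda X: gauss(X, Z, sK) @ C
    K_phi = gauss(Xc, Xc, sphi) + 1e-9 * np.eye(len(Xc))
    R = np.linalg.cholesky(K_phi).T                    # K_phi = R^T R
    B = cp.Variable((len(Xc), len(Xc)), PSD=True)      # finite surrogate for A
    cons = [f(np.zeros((1, 2))) == 0]                  # (Interp)
    fc = f(Xc)
    for m in range(len(Xc)):                           # (kSoS-Inv)
        cons.append(-(Nc[m] @ fc[m, :]) == R[:, m] @ B @ R[:, m])
    reg = sum(cp.quad_form(C[:, p], Kz) for p in range(2))   # ||f||_K^2
    obj = cp.sum_squares(Yd - f(Xd)) + lamK * reg + lamphi * cp.trace(B)
    cp.Problem(cp.Minimize(obj), cons).solve(solver=cp.SCS)
    return lambda X: gauss(X, Z, sK) @ C.value         # fitted vector field
```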
We report the performance of our reconstruction in Fig. 2, based on the following four criteria: i)
an angular measure of error, (1/|X|) ∫_X arccos(⟨f̂(x), f̄(x)⟩_{R^P} / (‖f̂(x)‖ · ‖f̄(x)‖)) dx, which reflects whether
the phase portraits as in Fig. 1 are similar, since it disregards the magnitude of the fields; ii) the L²
reconstruction error, (1/|X|) ∫_X ‖f̂(x) − f̄(x)‖² dx; iii) the L^∞ measure of violation of the constraints,
−min_{x∈K, i∈[I]} min(0, c_i(x)^⊤ f̂(x) + d_i(x)); iv) the L¹ measure of violation of the constraints,
−(1/|K|) ∫_K min_{i∈[I]} min(0, c_i(x)^⊤ f̂(x) + d_i(x)) dx. All the integrals are approximated on a uniform grid
with 100 points per dimension. The first quartile, the median and the third quartile of the computed
values are obtained by repeating each experiment 100 times while varying the number M of sampled
constraints.
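For completeness, here is a rough sketch of how these criteria can be evaluated on such grids, assuming f̂ and f̄ are callables mapping arrays of points to arrays of vectors, and that the normals at the grid points of K are available (for (Inv), c(x)^⊤f̂(x) + d(x) = −⟨n⃗(x), f̂(x)⟩); the names are illustrative:

```python
import numpy as np

def criteria(fhat, fbar, Xgrid, Kgrid, normals):
    """Approximate criteria i)-iv) above on uniform grids over X and K."""
    Fh, Fb = fhat(Xgrid), fbar(Xgrid)
    cos = np.sum(Fh * Fb, 1) / (np.linalg.norm(Fh, axis=1) * np.linalg.norm(Fb, axis=1))
    angular = np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))     # i) angular error
    l2 = np.mean(np.sum((Fh - Fb) ** 2, axis=1))              # ii) L2 error
    slack = -np.sum(normals * fhat(Kgrid), axis=1)            # c(x)^T fhat(x) + d(x)
    viol = np.minimum(0.0, slack)                             # nonpositive violations
    return angular, l2, -viol.min(), -viol.mean()             # iii) L-inf, iv) L1
```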
In Fig. 1b, we observe that the solution of unconstrained KRR fails to satisfy the invariance assumption
(Inv). On the contrary, both constrained solutions satisfy it. Moreover, the constraints (kSoS-Inv)
provide a streamplot (Fig. 1d) that resembles the true f̄ (Fig. 1a) more closely than the result obtained through
sampled constraints (Sampled-Inv) (Fig. 1c). This is further ascertained numerically in Fig. 2, where the
kSoS constraint has a better reconstruction error in both the angular and L² measures (Figs. 2a and 2b).
Fig. 2c shows that the constraint (Inv) is nevertheless violated in most experiments, with either
(kSoS-Inv) or (Sampled-Inv). However, Fig. 2d illustrates that the L¹ measure of violation decreases
faster for the kSoS sampling than for the sampled constraints; the kSoS constraint indeed leverages the
smoothness of the functions. If the constraints were satisfied in an experiment, the error was set to
machine precision so as to be able to take the logarithm, since the corresponding violation was null.

[Figure 1: streamplots over [0, 1]² — (a) True vector field; (b) Unconstrained KRR; (c) Interp ∩ Sampled-Inv; (d) Interp ∩ kSoS-Inv.]

Figure 1: Streamplots of the vector field in (25) (Fig. 1a) and of kernel solutions with different constraints (Figs. 1b to 1d).
Red arrows: N = 5 noisy observations. Black points: M = 20 samples on the constraint set K.

5 Technical results on closedness of sets and on scattering inequalities


Lemma 4 (Closed constraint sets). Fix ε ∈ R^I, M_f ≥ 0 and M_A ≥ 0. Then V_0^{AFF} is a weakly closed
convex subset of H_K, and, if Assumption (A3) also holds, so are V_ε^{SDP} and V_{ε,rest}^{SDP}.

[Figure 2: quartile curves of the four criteria as a function of M ∈ [10, 60] — (a) Angular error; (b) L² error; (c) L^∞ violation on K; (d) L¹ violation on K.]

Figure 2: Performance criteria for the reconstruction of the vector field in (25) as a function of the number M of inequality
constraints imposed.

Proof. The sets V_0^{AFF} and V_ε^{SDP} are clearly convex, and so is V_{ε,rest}^{SDP}, using the triangle inequality with the
nuclear/trace norm on S^+(H_φ). Using Mazur's lemma (see e.g. Beauzamy, 1985, Proposition 1, p34), it
then suffices to show that the sets are strongly closed to conclude that they are weakly closed in H_K.
For V_ε^{SDP}, the proof is quite straightforward provided Assumption (A3) is satisfied. Indeed, fix (f_n)_{n∈N} ∈
(V_ε^{SDP})^N converging to f ∈ H_K. As H_K is a RKHS, for any m ∈ [M], y_{m,n} := C(x_m)f_n(x_m) + d(x_m) − ε
converges to y_m := C(x_m)f(x_m) + d(x_m) − ε. Since f_n ∈ V_ε^{SDP}, we have that y_{m,n} ≥ 0, so y_m ≥ 0, and
Lemma 1 gives the (A_i)_{i∈[I]} corresponding to f, so f ∈ V_ε^{SDP}. Hence V_ε^{SDP} is strongly closed in H_K.
The same nonnegativity argument, paired with the fact that H_K is a RKHS, gives the strong closedness of
V_0^{AFF}.

For V_{ε,rest}^{SDP} instead, we have to bound Tr(A_i) by M_A. For this, we will have to explicitly relate V_{ε,rest}^{SDP} to
a compact set in finite dimensions. Similarly to the proof of Rudi et al. (2020, Lemma 2), let Ŝ : H_φ → R^M
be the linear operator acting as follows:

Ŝg = (⟨φ(x₁), g⟩_{H_φ}, …, ⟨φ(x_M), g⟩_{H_φ}) ∈ R^M, ∀ g ∈ H_φ.

Consider the adjoint of Ŝ, Ŝ* : R^M → H_φ, defined consequently as Ŝ*β = Σ_{m∈[M]} β_m φ(x_m) for β ∈ R^M.
Let K_{φ,X̂} = [k_φ(x_i, x_j)]_{i,j∈[M]} = R^⊤R, where R := [Φ₁, …, Φ_M] comes from the Cholesky decomposition of the
symmetric K_{φ,X̂} ∈ R^{M×M}, and Φ_m = R e_m are its columns. Note, in particular, that K_{φ,X̂} = Ŝ Ŝ* and
that Ŝ* e_m = φ(x_m), where e_m is the m-th element of the canonical basis of R^M.
As K_{φ,X̂} is invertible by Assumption (A3), we define the operator V = R^{−⊤}Ŝ and its adjoint V* =
Ŝ*R^{−1}. Using the definition of V, that K_{φ,X̂} = R^⊤R, and that K_{φ,X̂} = ŜŜ*, we derive two facts. On
the one hand,

V V* = R^{−⊤} Ŝ Ŝ* R^{−1} = R^{−⊤} K_{φ,X̂} R^{−1} = R^{−⊤} R^⊤ R R^{−1} = Id_{R^M}.

On the other hand, P := V*V : H_φ → H_φ is a projection operator, i.e. P² = P, P is positive semidefinite
and its range is span{φ(x_m) | m ∈ [M]}, implying P φ(x_m) = φ(x_m) for all m ∈ [M]. Indeed, using the
equation above, P² = V*VV*V = V*(VV*)V = V*V = P, and the positive semidefiniteness of P is
given by construction since it is the product of an operator and its adjoint. Moreover, the range of P is
the same as that of V*, which in turn is the same as that of Ŝ*, since R is invertible.
Let C_X̂ : H_K → R^{I×M} be the linear operator defined as

C_X̂ f = (C(x₁)f(x₁), …, C(x_M)f(x_M)) ∈ R^{I×M}, ∀ f ∈ H_K.

Consider the subspace E := C_X̂(H_K) ⊂ R^{I×M}. From now on, we restrict the output space, i.e. we focus
on the surjective C_X̂ : H_K → E. We may thus introduce its injective right inverse C_X̂⁺ : E → H_K defined
as

C_X̂⁺ V = arg min_{f∈H_K, C_X̂ f = V} ‖f‖_{H_K}, ∀ V ∈ E.

The continuous operator C_X̂⁺ is related to a subspace of H_K since H_K = ker(C_X̂) ⊕ C_X̂⁺(E). As C_X̂⁺ is
linear, C_X̂⁺(E) is finite dimensional. Fix the compact set

R_{K,φ} := {V ∈ E | C_X̂⁺ V ∈ B_K(0, M_f), ∃ (B_i)_{i∈[I]} ∈ S^+(R^M)^I, Tr(B_i) ≤ M_A,
    V e_m + d(x_m) = ε + [Φ_m^⊤ B_i Φ_m]_{i∈[I]}, ∀ m ∈ [M]}.

For any f ∈ V_{ε,rest}^{SDP} and g ∈ ker(C_X̂), as C_X̂(f + g) = C_X̂ f, the values appearing in the definition of V_{ε,rest}^{SDP} do
not change, so f + g ∈ V_{ε,rest}^{SDP}. Hence, setting G_rest := C_X̂⁺(E) ∩ V_{ε,rest}^{SDP}, we have that V_{ε,rest}^{SDP} = G_rest ⊕ ker(C_X̂).

We now show that G_rest = C_X̂⁺(R_{K,φ}). Let V ∈ R_{K,φ}, and consider the corresponding (B_i)_{i∈[I]} ∈
S^+(R^M)^I. Let A_i := V* B_i V ∈ S^+(H_φ); then, by the cyclicity of the trace and since VV* = Id,
Tr(A_i) = Tr(B_i VV*) = Tr(B_i) ≤ M_A. So C_X̂⁺(V) ∈ C_X̂⁺(E) ∩ V_{ε,rest}^{SDP} = G_rest. Conversely, let f ∈ G_rest
and consider the corresponding (A_i)_{i∈[I]} ∈ S^+(H_φ)^I. Then B_i := V A_i V* ∈ S^+(R^M) satisfies Tr(B_i) =
Tr(A_i V*V) ≤ ‖V*V‖_op Tr(A_i) ≤ M_A, as V*V is a projector and, for any operators B, C ∈ S^+(H_φ),
Tr(BC) ≤ ‖B‖_op Tr(C) (e.g. Wang et al., 1986, Lemma 1). So C_X̂ f ∈ R_{K,φ} and, as C_X̂⁺ is injective,
f ∈ C_X̂⁺(R_{K,φ}). We thus conclude that G_rest = C_X̂⁺(R_{K,φ}), so G_rest is a compact set for the strong
topology of H_K. Hence V_{ε,rest}^{SDP} = G_rest ⊕ ker(C_X̂) is strongly closed in H_K, which concludes the proof. ∎

Rudi et al. (2020) have focused mostly on multiplicative algebras,⁹ among which Sobolev spaces. Here
instead we tackle more general RKHSs.

⁹A vector space H_k of real-valued functions is a multiplicative algebra if there exists c ∈ R₊ such that, for all f, g ∈ H_k, the
pointwise product f·g belongs to H_k and ‖f·g‖ ≤ c‖f‖‖g‖.
Lemma 5 (Modified Rudi et al. (2020, Theorem 4, p14)). Let X = R^d, take s_φ ≥ 1 and s such that
1 ≤ s ≤ s_φ. Consider a bounded set S ⊂ X and some r > 0, and set Ω := ∪_{x∈S} B̊(x, r).¹⁰ Take X̂ =
{x_m}_{m∈[M]} ⊂ Ω with h_{X̂,Ω} ≤ r min(1, 1/(18(s−1)²)). Let g ∈ C^s(X, R) and assume there exist A ∈ S^+(H_φ)
and τ ≥ 0 such that

|g(x_m) − ⟨φ(x_m), Aφ(x_m)⟩_{H_φ}| ≤ τ, ∀ m ∈ [M].    (27)

Then the following statement holds:

g(x) ≥ −(ε + 2τ), ∀ x ∈ Ω, where ε = C (h_{X̂,Ω})^s    (28)

and C = C₀ (|g|_{Ω,s} + 2^s D²_{Ω,s} Tr(A)), with C₀ = 3 max(√d, 3√(2d)(s−1))^{2s} / s!.

Proof. The proof is mostly similar to that of Rudi et al. (2020, Theorem 4, p14), also relying heavily on
the technical Rudi et al. (2020, Theorem 13, p39), but discarding Rudi et al. (2020, Lemma 9, p41), which
requires a multiplicative algebra. Define the functions u, r_A : Ω → R as follows:

r_A(x) = ⟨φ(x), Aφ(x)⟩, u(x) = g(x) − r_A(x), ∀ x ∈ Ω.

By assumption, |u(x_m)| ≤ τ for any m ∈ [M]. Since h_{X̂,Ω} ≤ r min(1, 1/(18(s−1)²)) and Ω = ∪_{x∈S} B̊(x, r),
we can apply Rudi et al. (2020, Theorem 13, p39) (related to
classical results on functions with scattered zeros (Wendland, 2005, Chapter 11)) to bound sup_{x∈Ω} |u(x)|,
whence

sup_{x∈Ω} |u(x)| ≤ 2τ + ε,  ε = c R_s(u) h^s_{X̂,Ω},

where c = 3 max(1, 18(s−1)²)^s and R_s(v) = Σ_{|α|=s} (1/α!) sup_{x∈Ω} |∂^α v(x)| for any v ∈ C^s(Ω). Since r_A(x) =
⟨φ(x), Aφ(x)⟩ ≥ 0 for any x ∈ Ω, as A ∈ S^+(H_φ), we have that

g(x) ≥ g(x) − r_A(x) = u(x) ≥ −|u(x)| ≥ −(2τ + ε), ∀ x ∈ Ω.    (29)
The last step is bounding R_s(u). Recall the definition of |u|_{Ω,s} from (14). Since A is trace class, it
admits a singular value decomposition A = Σ_{i∈N} σ_i u_i ⊗ v_i. Here, (σ_i)_{i∈N} is a non-increasing sequence
of non-negative singular values converging to zero, and (u_i)_{i∈N} and (v_i)_{i∈N} are two orthonormal families of
corresponding singular vectors (a family (e_i) is said to be orthonormal if, for i, j ∈ N, ⟨e_i, e_j⟩ = 1 if i = j
and ⟨e_i, e_j⟩ = 0 otherwise). Note that we can write r_A using this decomposition as, by the reproducing
property, r_A(x) = Σ_{i∈N} σ_i u_i(x)·v_i(x) with u_i, v_i ∈ H_φ ⊂ C^{s_φ}(X, R).

Since k_φ(·,·) ∈ C^{s_φ,s_φ}(X × X, R), the u_i, v_i belong to C^{s_φ}(X, R), so we can take derivatives of r_A. By the
Leibniz formula for partial derivatives of products, also using the reproducing property for derivatives
(Aubin-Frankowski and Szabó, 2022, Lemma 1) and the Cauchy–Schwarz inequality, for any α ∈ N^d with
|α| ≤ s,

|∂^α r_A(x)| = |Σ_{i∈N} σ_i ∂^α(u_i · v_i)(x)| ≤ Σ_{i∈N} σ_i Σ_{β≤α} binom(α, β) |⟨u_i, ∂_x^β k_φ(·, x)⟩_{H_φ}| · |⟨v_i, ∂_x^{α−β} k_φ(·, x)⟩_{H_φ}|
≤ Σ_{i∈N} σ_i 2^s D²_{Ω,s} ‖u_i‖_{H_φ} ‖v_i‖_{H_φ},

since Σ_{β≤α} binom(α, β) = 2^{|α|} ≤ 2^s and ‖∂_x^β k_φ(·, x)‖_{H_φ} ≤ D_{Ω,s} for |β| ≤ s.
¹⁰Scattering inequalities such as Rudi et al. (2020, Theorem 4, p14) unfortunately require prescribing the form of K. If K
is an arbitrary compact set, we cannot get away with just a dilation. Indeed, if X̂ ⊂ K with K + B̊(0, r) ⊂ ∪_{m∈[M]} B_{R^d}(x_m, δ_m), then we
face a problem, because r ≤ ‖δ‖_∞, but r ≤ ‖δ‖_∞ ≤ r min(1, 1/(18(s−1)²)) cannot hold.

Since ‖u_i‖_{H_φ} = ‖v_i‖_{H_φ} = 1, we obtain that |r_A|_{Ω,s} ≤ 2^s D²_{Ω,s} Tr(A). To conclude, note that, by the
multinomial theorem,

R_s(u) = Σ_{|α|=s} (1/α!) sup_{x∈Ω} |∂^α u(x)| ≤ |u|_{Ω,s} Σ_{|α|=s} (1/α!) = (d^s / s!) |u|_{Ω,s}.

Since |u|_{Ω,s} ≤ |g|_{Ω,s} + |r_A|_{Ω,s}, combining all the previous bounds, we obtain

ε ≤ C₀ (|g|_{Ω,s} + 2^s D²_{Ω,s} Tr(A)) h^s_{X̂,Ω},  where C₀ = 3 (d^s / s!) max(1, 18(s−1)²)^s.

The proof is concluded by bounding ε in (29) with the inequality above. Since g ∈ C^0(X, R), (29) also
holds for x ∈ Ω̄ by continuity. ∎

6 Appendix: Selection theorem and lower semicontinuity of a set-valued map
In this section we return to (1) without the affine subspace constraint, and study the existence of a
solution in more generality:

f̄ ∈ arg min_{f ∈ C^0(X, R^P)} L(f)
s.t. f(x) ∈ F(x), ∀ x ∈ X.

For this problem to have a solution, we need to exhibit at least one f ∈ C^0(X, R^P) satisfying f(x) ∈ F(x)
for all x ∈ X. There are general methods from set-valued analysis to do so, known as selection theorems, such as the famed
Michael (1956, Theorem 1). In Proposition 7 below, we show that one can even select f ∈ C^m(X, R^P)
when allowing the constraints to be violated by some small ε. The key notion in order to select a function
is the lower semicontinuity of the set-valued map F. It is then easy to specialize the notion to affine
constraints, for which F(x) is a polytope, as in Lemma 8 and Cor. 9.

Definition 1 (Lower semicontinuity of a set-valued map). Let X be a Hausdorff topological space. A
set-valued map F : X ⇝ R^P is said to be lower semicontinuous at a point x ∈ Dom F (i.e. F(x) ≠ ∅) if
for any open subset Ω ⊂ R^P such that Ω ∩ F(x) ≠ ∅ there exists a neighborhood U of x such that

∀ x′ ∈ U, Ω ∩ F(x′) ≠ ∅.    (30)

We say that F is lower semicontinuous if it is lower semicontinuous at every x ∈ X.

An equivalent definition is that, for any sequence (x_n)_{n∈N} converging to x and any y ∈ F(x), there is a
sequence y_n ∈ F(x_n) converging to y. It is not always easy to check that a set-valued map is lower
semicontinuous. A sometimes easier criterion is that, for any ε > 0, there exists a neighborhood U of x
such that:

∀ x′ ∈ U, F(x) ⊂ F(x′) + ε B_{R^P}.    (31)

This statement is equivalent to lower semicontinuity if F(x) is a compact set (Aubin and Cellina, 1984,
pp. 43–45). For instance, it allows one to check the lower semicontinuity of maps defined by inequalities.

Lemma 6. Fix g ∈ C^{0,1}(X × R^P, R^I) such that, for any (x, y) satisfying g(x, y) ≤ 0 component-wise,
there exists a unit-norm v ∈ R^P for which Jac_y g(x, y) v < 0. Then F(x) := {y | g(x, y) ≤ 0} is lower
semicontinuous wherever it is non-empty.

Proof. Fix ε₀ > 0 such that, for all ε ∈ (0, ε₀], g(x, y + εv) < −ε (‖Jac_y g(x, y)‖/2) 1. Let ε ∈ (0, ε₀] and fix η such that
η < ε ‖Jac_y g(x, y)‖/3. Let U be a neighborhood of x such that g(x′, y + εv) ≤ g(x, y + εv) + η1 for x′ ∈ U.
Then for all x′ ∈ U, g(x′, y + εv) ≤ −ε (‖Jac_y g(x, y)‖/6) 1 < 0, hence F satisfies (31).

Proposition 7 (Smooth selections). Assume that X is a metric space and F : X ⇝ R^P is a lower
semicontinuous set-valued map with non-empty, closed, convex values. Then there exists f ∈ C^0(X, R^P)
such that f(x) ∈ F(x) for all x ∈ X. If, for a given m ∈ N ∪ {∞}, X is a manifold of class C^m,
modeled on a separable Hilbert space E, then, for every ε > 0, there exists f ∈ C^m(X, R^P) such that
f(x) ∈ F(x) + ε B_{R^P}(0, 1) for all x ∈ X.

Proof. The first statement is simply an application of a renowned theorem by Michael (Michael, 1956,
Theorem 1), which gives a continuous selection under these assumptions on F for any paracompact space X,
metric spaces being paracompact (Stone, 1948, Corollary 1). Actually X could just be taken paracompact,
while R^P can be replaced by a Banach space (Michael, 1956) or even a Fréchet space (Aliprantis and Border, 2006).

The second statement is an extension of a result by Aubin and Cellina (1984, Proof of Theorem 1, p.
82), where it was shown for f ∈ C^0(X, R^P) and X a metric space. We combine the construction of the
selection with another result (Lang, 1995, Corollary 3.8, p. 36) on smooth partitions of unity. Since F is
lower semicontinuous, for every x ∈ X we can fix y_x ∈ F(x) and δ_x > 0 such that (y_x + εB̊) ∩ F(x′) ≠ ∅
for any x′ ∈ B(x, δ_x). Since X is a manifold modeled on E, we can also find a chart (G_x, γ_x) at every
x ∈ X (i.e. a C^m-diffeomorphism γ_x from an open neighborhood G_x of x ∈ X to an open subset of E)
with G_x ⊂ B(x, δ_x). As any open ball of E of finite radius is C^∞-diffeomorphic to E, by composing γ_x with
such a C^∞-diffeomorphism, without reindexing, we can take γ_x(G_x) = E. By definition, {G_x}_{x∈X} is an
open covering of X. Since X is a metric space, it is paracompact (Stone, 1948, Corollary 1), so we can
find an open refinement¹¹ {U_x}_{x∈X} of the covering {G_x}_{x∈X} (i.e. each U_x is an open set contained in G_x
and X = ∪_{x∈X} U_x) which is locally finite (i.e. for every x₀ ∈ X there exists a neighborhood of x₀ which
intersects only a finite number of the U_x). Let ϕ_x be the restriction of γ_x to U_x. Again by paracompactness
of X, applying (Lang, 1995, Proposition 3.3, p. 31), we can find two locally finite open refinements of
{U_x}_{x∈X}, namely {V_x}_{x∈X} and {W_x}_{x∈X}, such that

W̄_x ⊂ V_x ⊂ V̄_x ⊂ U_x,

the bar denoting closure in X. Each V̄_x being closed in X, as ϕ_x is a diffeomorphism, we consequently
have that ϕ_x(V̄_x) is closed in E, and so is ϕ_x(W̄_x). Using Lang (1995, Theorem 3.7, p. 36), for each x ∈ X,
we can fix a C^∞-function φ_x : E → [0, 1] such that φ_x(ϕ_x(W̄_x)) = {1} and φ_x(E \ ϕ_x(V_x)) = {0} (which
can be constructed based on bump functions). Setting ψ_x = φ_x ∘ ϕ_x ∈ C^m(X, [0, 1]), ψ_x is equal to 1 on
W̄_x and 0 on X \ V_x. Thus the mapping f : X → R^P defined as

f(x′) := (Σ_{x∈X} ψ_x(x′) y_x) / (Σ_{x∈X} ψ_x(x′))

belongs to C^m(X, R^P), since it is locally a finite sum of C^m functions (this also justifies the definition of f,
since only a finite number of indices are active). For any given x′ ∈ X, ψ_x(x′) > 0 implies that x′ ∈ V_x.
However V_x ⊂ U_x ⊂ G_x ⊂ B(x, δ_x). Hence, by definition of y_x, y_x ∈ F(x′) + εB̊. Since F is convex-valued,
the convex combination which defines f leads to f(x′) ∈ F(x′) + εB̊, so f(·) is the approximate selection
of class C^m which was sought.

¹¹We can choose for {U_x}_{x∈X} the same index set as {G_x}_{x∈X} owing to (Lang, 1995, Proposition 3.1, p. 31).

Lemma 8 (l.s.c. affine maps). For X a metric space, K a non-empty closed subset of X and C(·) ∈
C^0(K, R^{I×P}), assume that for any x ∈ K there exists z_x ∈ R^P such that C(x)z_x > 0 (i.e. the constraints
are non-degenerate). Then the set-valued map F : x ∈ K ⇝ {y | C(x)y ≥ 1} is lower semicontinuous.

Proof. Note that the considered F is in general not compact-valued, so we show directly that, given x ∈ K,
y ∈ F(x), and a sequence (x_n)_{n∈N} ∈ K^N converging to x, there is a sequence y_n ∈ F(x_n) converging to y.

Fix z_x ∈ R^P such that C(x)z_x > 0; then y_x := 2 z_x / min_i (C(x)z_x)_i satisfies C(x)y_x ≥ 2·1. Then, for any
s ∈ (0, 1),

C(x_n)(s y_x + (1−s)y) = C(x)(s y_x + (1−s)y) + (C(x_n) − C(x))(s y_x + (1−s)y)
≥ (1+s)·1 + (C(x_n) − C(x))(s y_x + (1−s)y),

whence, by continuity of C(·), for every s ∈ (0, 1) there exists N_s such that s y_x + (1−s)y ∈ F(x_n) for all n ≥ N_s.
Taking a sequence (s_n)_{n∈N} ∈ (0, 1)^N converging to 0 slowly enough, y_n := s_n y_x + (1−s_n)y ∈ F(x_n) converges
to y, showing that F is lower semicontinuous at x.

Corollary 9 (Selection in affine maps). Under the assumptions of Lemma 8, if C is bounded over K,
then for every ζ ≥ 0 there exists f_ζ ∈ C^0(X, R^P) such that f_ζ(x) + ζ B_{R^P}(0, 1) ⊂ F(x) for all x ∈ K. If C is
unbounded over K, then the result still holds for ζ = 0. If, for a given m ∈ N ∪ {∞}, X is a manifold of
class C^m modeled on a separable Hilbert space E, then f_ζ can be taken in C^m(X, R^P).

Proof. Fix ζ ≥ 0, set M_c := sup_{x∈K} ‖C(x)‖ and define the set-valued map F_ζ : x ∈ K ⇝ {y | C(x)y ≥
(1 + 2ζM_c)1}. This map satisfies the same assumptions as F, so F_ζ is also lower semicontinuous over
K and has non-empty, closed, convex values. By Proposition 7, there thus exists f₀ ∈ C^0(K, R^P) such
that f₀(x) ∈ F_ζ(x) for all x ∈ K. By definition, F_ζ(x) + 2ζ B_{R^P}(0, 1) ⊂ F(x) for all x ∈ K, so
f₀(x) + ζ B_{R^P}(0, 1) ⊂ F(x) for all x ∈ K. Since a metric space is normal and K is closed, we can apply the
Urysohn–Brouwer lemma to continuously extend f₀ to some f_ζ defined over the whole of X. If M_c = +∞,
then, for ζ = 0, we can still apply Proposition 7, which directly yields the result.

If X is more regular, we again apply Proposition 7, but to the map F̃ defined by F̃(x) := R^P if x ∉ K
and F̃(x) := {f₀(x)} if x ∈ K. Since f₀ ∈ C^0(K, R^P) and K is closed, the latter map is also lower
semicontinuous with non-empty, closed, convex values.

References
Ahmadi, A. A. and El Khadir, B. (2023). Learning dynamical systems with side information. SIAM
Review. (to appear).

Aliprantis, C. D. and Border, K. C. (2006). Infinite Dimensional Analysis. Springer-Verlag.

Attouch, H., Buttazzo, G., and Michaille, G. (2014). Variational Analysis in Sobolev and BV Spaces.
Society for Industrial and Applied Mathematics.

Aubin, J.-P. and Cellina, A. (1984). Differential Inclusions. Springer Berlin Heidelberg.

Aubin-Frankowski, P.-C. (2021). Linearly constrained linear quadratic regulator from the viewpoint of
kernel methods. SIAM Journal on Control and Optimization, 59(4):2693–2716.

Aubin-Frankowski, P.-C. and Szabó, Z. (2020). Hard shape-constrained kernel machines. In Advances in
Neural Information Processing Systems (NeurIPS), volume 33, pages 384–395.

Aubin-Frankowski, P.-C. and Szabó, Z. (2022). Handling hard affine SDP shape constraints in RKHSs.
Journal of Machine Learning Research, 23(297):1–54.

Baldassarre, L., Rosasco, L., Barla, A., and Verri, A. (2012). Multi-output learning via spectral filtering.
Machine Learning, 87(3):259–301.

Beauzamy, B. (1985). Introduction to Banach spaces and their geometry. North-Holland Sole distributors
for the U.S.A. and Canada, Elsevier Science Pub. Co.

Berthier, E., Carpentier, J., Rudi, A., and Bach, F. R. (2022). Infinite-dimensional sums-of-squares for
optimal control. In Control and Decision Conference (CDC).

Canu, S., Mary, X., and Rakotomamonjy, A. (2009). Functional learning through kernels.

Carmeli, C., Vito, E. D., Toigo, A., and Umanitá, V. (2010). Vector valued reproducing kernel Hilbert
spaces and universality. Analysis and Applications, 8:19–61.

Golub, G. H., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing
a good ridge parameter. Technometrics, 21(2):215–223.

Guntuboyina, A. and Sen, B. (2018). Nonparametric shape-restricted regression. Statistical Science,
33(4):568–594.

Lang, S., editor (1995). Differential and Riemannian Manifolds. Springer New York.

Marteau-Ferey, U., Bach, F., and Rudi, A. (2020). Non-parametric models for non-negative functions. In
Advances in Neural Information Processing Systems (NeurIPS), pages 12816–12826.

Marteau-Ferey, U., Bach, F., and Rudi, A. (2022). Sampling from arbitrary functions via psd models.
In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 151, pages
2823–2861. PMLR.

Michael, E. (1956). Selected selection theorems. The American Mathematical Monthly, 63(4):233–238.

Micheli, M. and Glaunès, J. A. (2014). Matrix-valued kernels for shape deformation analysis. Geometry,
Imaging and Computing, 1(1):57–139.

Muzellec, B., Bach, F., and Rudi, A. (2021). Learning psd-valued functions using kernel sums-of-squares.

Novak, E. (1988). Deterministic and Stochastic Error Bounds in Numerical Analysis. Springer Berlin
Heidelberg.

Rudi, A., Marteau-Ferey, U., and Bach, F. (2020). Finding global minima via kernel approximations.

Stone, A. H. (1948). Paracompactness and product spaces. Bulletin of the American Mathematical Society,
54(10):977–982.

Vacher, A., Muzellec, B., Rudi, A., Bach, F. R., and Vialard, F. (2021). A dimension-free computational
upper-bound for smooth optimal transport estimation. In Conference on Learning Theory, COLT 2021,
volume 134 of Proceedings of Machine Learning Research, pages 4143–4173. PMLR.

Villani, C. (2009). Optimal Transport. Springer Berlin Heidelberg.

Wang, S.-D., Kuo, T.-S., and Hsu, C.-F. (1986). Trace bounds on the solution of the algebraic matrix
Riccati and Lyapunov equation. IEEE Transactions on Automatic Control, 31(7):654–656.

Wendland, H. (2005). Scattered data approximation. Cambridge University Press.

