Abstract— This article introduces a neural approximation-based method for solving continuous optimization problems with probabilistic constraints. After reformulating the probabilistic constraints as the quantile function, a sample-based neural network model is used to approximate the quantile function. The statistical guarantees of the neural approximation are discussed by showing the convergence and feasibility analysis. Then, by introducing the neural approximation, a simulated annealing-based algorithm is revised to solve the probabilistic constrained programs. An interval predictor model (IPM) of wind power is investigated to validate the proposed method.

Index Terms— Neural network model, nonlinear optimization, probabilistic constraints, quantile function, sample average approximation.

I. INTRODUCTION

[...] replaced by deterministic constraints imposed for finite sets of independently extracted samples of random variables. The scenario approach ensures that the solution of the approximately formulated deterministic program satisfies the original probabilistic constraints with a determined bound of probability. The bound of probability is determined by the number of extracted samples. Afterward, the scenario approach has been improved to ensure tighter confidence bounds of probability by a sampling-and-discarding approach [8]. In the sampling-and-discarding scenario approach, a certain proportion of the extracted samples is used to define the deterministic constraints, while the rest are discarded. With a smaller number of samples, the violation probabilities of the original probabilistic constraints are still preserved. However, the scenario approach [...]
where M = ⌈(1 − α)N⌉ and h̄_M(u) denotes the Mth smallest observation of the values H_N for a given u.

We can use the (1 − α)-empirical quantile as an approximation of the probabilistic constraints and form the following approximate problem:

min_{u∈U} J(u)
s.t. Q̃_{1−α}(h̄(u, δ)) ≤ 0, δ ∈ Δ, α ∈ (0, 1). (11)

Using the (1 − α)-empirical quantile as an approximation of the probabilistic constraints has two drawbacks.

1) Q̃_{1−α}(h̄(u, δ)) is usually not differentiable. M changes for different values of u. Thus, even if the constraint h(u, δ) is smooth for a fixed value of δ, Q̃_{1−α}(h̄(u, δ)) does not have to be smooth.
2) For small N, the uncertainty of Q̃_{1−α}(h̄(u, δ)) is large and the inward kinks of the feasible boundary will be very nonsmooth. Large N can bring some smoothness back; however, this also increases the computation burden.

Thus, it is necessary to look for a smooth approximation of the (1 − α)-empirical quantile.
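To make the sample approximation concrete, the following minimal Python sketch computes the (1 − α)-empirical quantile Q̃_{1−α}(h̄(u, δ)) used in (11) from N extracted samples; the constraint function h_bar used in the illustration is a hypothetical placeholder, not the paper's application.

```python
import numpy as np

def empirical_quantile(h_bar, u, delta_samples, alpha):
    """(1 - alpha)-empirical quantile of h_bar(u, delta) over N extracted samples.

    h_bar: callable (u, delta) -> float, placeholder for the constraint function.
    delta_samples: N independently extracted samples of the random variable delta.
    """
    values = np.sort(np.array([h_bar(u, d) for d in delta_samples]))  # the values H_N
    N = values.size
    M = int(np.ceil((1.0 - alpha) * N))   # index of the Mth smallest observation
    return values[M - 1]

# Toy illustration with a hypothetical constraint h_bar(u, delta) = u + delta.
rng = np.random.default_rng(0)
q_tilde = empirical_quantile(lambda u, d: u + d, u=0.5,
                             delta_samples=rng.normal(size=1000), alpha=0.05)
print(q_tilde <= 0.0)  # the approximate constraint in (11) for this u
```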
Given the sample set Δ_N, the (1 − α)-empirical quantile Q̃_{1−α}(h̄(u, δ)) is essentially a function of u. Q_{1−α}(h̄(u, δ)) is also a function of u. By using Q̃_{1−α}(h̄(u, δ)) as an approximation of Q_{1−α}(h̄(u, δ)), a smooth function can be used to approximate the (1 − α) quantile Q_{1−α}(h̄(u, δ)).

In this study, a single-layer neural network model is used to approximate the (1 − α) quantile Q_{1−α}(h̄(u, δ)). Using N independently extracted samples of δ, the neural approximation of Q_{1−α}(h̄(u, δ)) with S hidden nodes and activation function g(·) is defined as

Ĥ_S(u) = Σ_{i=1}^S β_i g(u, a_i, b_i) (12)

where β_i denotes the weight vector connecting the ith hidden node and the output nodes, a_i = [a_{i,1}, ..., a_{i,k}] represents the weight vector toward u, and b_i is the scalar threshold of the ith hidden node. Here, the activation function adopts the sigmoid function expressed as

g(u, a_i, b_i) = 1 / (1 + e^{−a_i^T u + b_i}). (13)

The purpose is to achieve

Ĥ_S(u) ≈ Q_{1−α}(h̄(u, δ)). (14)

Then, Ĥ_S(·) maps the decision variable u ∈ R^k into the image of Q_{1−α}(h̄(u, δ)), which is R. In order to find solutions for (1) and (6), it is equivalent to solve the following problem:

min_{u∈U} J(u)
s.t. Ĥ_S(u) ≤ 0. (15)

The above formulation is a sample-based neural approximation of the original PCP, which is obtained by a two-layer approximation: sample approximation and neural approximation. The convergence and feasibility of the two-layer approximation should be analyzed.
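As a rough illustration of (12) and (13), the sketch below fits Ĥ_S by drawing the hidden-layer parameters a_i and b_i at random and solving for β by least squares against empirical-quantile targets. This extreme-learning-machine-style fit is an assumption made here for brevity; it is not the training algorithm summarized later in Section IV.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_neural_quantile(U_train, q_train, S, seed=0):
    """Fit H_S(u) = sum_i beta_i * g(u, a_i, b_i) with g as in (13).

    U_train: (n, k) array of decision variables u.
    q_train: (n,) array of empirical quantiles Q_tilde_{1-alpha}(h_bar(u, delta)).
    Assumption: a_i, b_i are sampled randomly and beta is a least-squares fit.
    """
    rng = np.random.default_rng(seed)
    k = U_train.shape[1]
    A = rng.normal(size=(S, k))              # weight vectors a_i toward u
    b = rng.normal(size=S)                   # scalar thresholds b_i
    G = sigmoid(U_train @ A.T - b)           # hidden-layer outputs, shape (n, S)
    beta, *_ = np.linalg.lstsq(G, q_train, rcond=None)

    def H_S(u):
        return sigmoid(np.atleast_2d(u) @ A.T - b) @ beta
    return H_S

# Usage: build (u_j, q_j) pairs with the empirical_quantile sketch above on a grid
# of u values, then solve (15) with the smooth surrogate H_S in place of (11).
```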
C. Convergence and Feasibility Analysis

We make the following assumptions.

Assumption 1: h̄(u, δ) is a Carathéodory function. For every fixed u ∈ U, h̄(u, δ) is a continuous random variable, which is measurable and has a continuous distribution. For every fixed δ ∈ Δ, h̄(u, δ) is a continuous function of u. Besides, for every u ∈ U, F_h̄(h̄(u, δ)), the CDF of h̄(u, δ), is continuously differentiable and has a strictly positive derivative (strict monotonic increase), f_h̄(h̄(u, δ)), over the domain of h̄(u, δ): (−∞, +∞).

Remark 1: The CDF F_h̄(h̄(u, δ)) is continuous and strictly monotonically increasing over (−∞, +∞). Thus, the quantile function Q(h̄(u, δ)) : [0, 1] → D_{h̄(u,δ)} is the inverse of F_h̄ and thus continuous on [0, 1]. Moreover, for a fixed δ ∈ Δ, h̄(·, δ) is continuous on U. Thus, the values of the CDF F_h̄(h̄(u, δ)) and the quantile function Q(h̄(u, δ)) vary continuously on U, since the composition of continuous functions is continuous. Thus, Assumption 1 implies that the (1 − α)-level quantile Q_{1−α}(h̄(u, δ)) is a continuous function of u on U.

Assumption 2: The problem expressed by (1) has a globally optimal solution u*, such that for any ε_u > 0, there is u ∈ U such that ‖u − u*‖ ≤ ε_u and F(0, u) > 1 − α (or equivalently Q_{1−α}(h̄(u, δ)) < 0).

Remark 2: Assumption 2 implies that there exists a sequence {u_k}_{k=1}^∞ ⊆ U that converges to an optimal solution u* such that F(0, u_k) > 1 − α (or equivalently Q_{1−α}(h̄(u_k, δ)) < 0) for all k ∈ N.

By applying [14, Corollary 21.5], we have the following lemma about the convergence of Q̃_{1−α}(h̄(u, δ)).

Lemma 1: Suppose that Assumption 1 holds. For any fixed α ∈ (0, 1) and any fixed u ∈ U, if N samples of δ, δ_i, i = 1, ..., N, are independently extracted, then

Q̃_{1−α} − Q_{1−α} = −(1/N) Σ_{i=1}^N [ I{h̄_i ≤ Q_{1−α}} − (1 − α) ] / f(Q_{1−α}) + o(1) (16)

where Q̃_{1−α}, Q_{1−α}, and h̄_i are short for Q̃_{1−α}(h̄(u, δ)), Q_{1−α}(h̄(u, δ)), and h̄(u, δ_i), respectively.

Consequently, the sequence {Q̃_{1−α} − Q_{1−α}} is asymptotically normal with mean 0 and variance (α(1 − α))/(N · f²(Q_{1−α})). As N increases to ∞, the variance also vanishes to zero.
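A quick Monte Carlo check of Lemma 1, under the assumption that h̄(u, δ) is standard normal for the fixed u considered (chosen only so that Q_{1−α} and f are known in closed form), compares the observed variance of Q̃_{1−α} − Q_{1−α} with α(1 − α)/(N · f²(Q_{1−α})):

```python
import numpy as np
from scipy.stats import norm

alpha, N, reps = 0.1, 2000, 500
rng = np.random.default_rng(1)
q_true = norm.ppf(1.0 - alpha)                 # Q_{1-alpha} of a standard normal h_bar
M = int(np.ceil((1.0 - alpha) * N))

errors = []
for _ in range(reps):
    values = np.sort(rng.normal(size=N))       # N extracted samples of h_bar(u, delta)
    errors.append(values[M - 1] - q_true)      # Q_tilde_{1-alpha} - Q_{1-alpha}

print(np.var(errors))                                        # observed variance
print(alpha * (1.0 - alpha) / (N * norm.pdf(q_true) ** 2))   # variance predicted by Lemma 1
```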
For the convergence of Ĥ_S(u) to Q_{1−α}(h̄(u, δ)), we have the following theorem.

Theorem 1: Suppose that Assumption 1 holds. As N → ∞, for any ε_H > 0, there exist S, a, b, and β such that

sup_{u∈U_c} |Ĥ_S(u) − Q_{1−α}(h̄(u, δ))| < ε_H, w.p.1 (17)

where U_c ⊆ U represents any compact set inside the feasible area.

Proof (Theorem 1): Due to Assumption 1 and Remark 1, Q_{1−α}(h̄(u, δ)) is a continuous function of u. Then, according to the universal approximation theorem [15], [16], ∀ε_H > 0, ∀Q_{1−α}(h̄(u, δ)), ∃S ∈ N_+, β, b ∈ R^S, and a ∈ R^{S×k} such that

|Ĥ_S(u) − Q_{1−α}(h̄(u, δ))| < ε_H (18)
for all u ∈ U. However, Q_{1−α}(h̄(u, δ)) is not used directly for fitting the quantile function. The one used is the approximate value Q̃_{1−α}(h̄(u, δ)), which has a bias compared with Q_{1−α}(h̄(u, δ)). The bias is defined by (16). Denote dQ_{1−α} = Q̃_{1−α} − Q_{1−α}. According to Lemma 1, dQ_{1−α} is normal with mean 0 and variance (α(1 − α))/(N · f²(Q_{1−α})). As N → ∞, (α(1 − α))/(N · f²(Q_{1−α})) → 0. Namely, dQ_{1−α} → 0 with probability 1.

Since we can only obtain Q̃_{1−α}(h̄(u, δ)) instead of (18), which needs Q_{1−α}(h̄(u, δ)), the following inequality

|Ĥ_S(u) − Q̃_{1−α}(h̄(u, δ))| < ε_H (19)

can be obtained ∀ε_H > 0, ∀Q_{1−α}(h̄(u, δ)), ∃S ∈ N_+, β, b ∈ R^S, and a ∈ R^{S×k}. Since |Ĥ_S(u) − Q_{1−α}(h̄(u, δ))| = |Ĥ_S(u) − Q̃_{1−α}(h̄(u, δ)) + Q̃_{1−α}(h̄(u, δ)) − Q_{1−α}(h̄(u, δ))| ≤ |Ĥ_S(u) − Q̃_{1−α}(h̄(u, δ))| + |dQ_{1−α}|, we have

|Ĥ_S(u) − Q_{1−α}(h̄(u, δ))| < ε_H + |dQ_{1−α}|. (20)

Namely, for any ε_H > 0, we have parameters to satisfy (20). Notice that dQ_{1−α} becomes 0 with probability 1 as N → ∞. Besides, for any N, there exist S, a, b, and β to satisfy (20) for any ε_H > 0. Thus, (17) is proven.

Let A and A^S_N denote the sets of optimal solutions for problems (1) and (15), respectively. Notice that problem (6) shares the same set of optimal solutions as problem (1). Besides, denote J* and J^S_N the optimal values of the cost functions in problems (1) and (15), respectively. The convergences of J^S_N and A^S_N as N and S increase are summarized in the following.

Theorem 2: Suppose that U is compact, the function J(·) is continuous, and Assumptions 1 and 2 hold. Then, J^S_N → J* and D(A^S_N, A) → 0 w.p.1.

Proof (Theorem 2): Due to Assumption 2, the set A is nonempty and there exists u ∈ U such that Q_{1−α}(h̄(u, δ)) < 0. According to Theorem 1, Ĥ_S(u) converges to Q_{1−α}(h̄(u, δ)) < 0, and thus, we can find S_0, a_0, b_0, β_0, and N_0 such that Ĥ_S(u) − ε_H − |dQ_{1−α}| ≤ 0. (ε_H can be any value arbitrarily close to 0, and |dQ_{1−α}| decreases to 0 with probability 1 as N → ∞.) Because Ĥ_S(u) is continuous in u and U is compact, the feasible set of the approximation problem (15) is compact as well. Besides, A^S_N is not empty w.p.1 for all N ≥ N_0 and all S, a, b, and β that give a smaller error bound (with higher probability) ε_H + |dQ_{1−α}| in (20). Here, we denote Ĥ_{S,0}(u) for Ĥ_S(u) with parameters {S_0, a_0, b_0, β_0, N_0} and input u. Analogously, Ĥ_{S,k}(u_k) is for Ĥ_S(u) with parameters {S_k, a_k, b_k, β_k, N_k} and input u_k. With parameters {S_k, a_k, b_k, β_k, N_k}, the corresponding notations for the optimal solution set and the optimal cost value are defined as A^{S,k}_{N,k} and J^{S,k}_{N,k}.

Let {N_k}_{k=1}^∞ ≥ N_0 and {S_k, a_k, b_k, β_k, N_k} be two sequences such that the corresponding error bound ε_{H,k} as in (20) decreases to 0. Let û_k ∈ A^{S,k}_{N,k}, which means that û_k ∈ U, Ĥ_{S,k}(û_k) ≤ 0, and J^{S,k}_{N,k} = J(û_k). Let û ∈ U be any cluster point of {û_k}_{k=1}^∞. Define {û_l}_{l=1}^∞ as a subsequence converging to û. Since Ĥ_{S,l}(u) defined by {S_l, a_l, b_l, β_l, N_l} is continuous and converges uniformly to Q_{1−α}(h̄(u, δ)) on U w.p.1, we have that Q_{1−α}(h̄(û, δ)) = lim_{l→∞} Ĥ_{S,l}(û_l) w.p.1. Therefore, Q_{1−α}(h̄(û, δ)) ≤ 0 and û is feasible for the true problem, and J(û) ≥ J*. Moreover, J(û_l) → J(û) w.p.1, which means that lim_{l→∞} J^{S,l}_{N,l} ≥ J*. The above is true for any cluster point of {û_k}_{k=1}^∞ in the compact set U, and we have

lim inf_{k→∞} J^{S,k}_{N,k} ≥ J*, w.p.1. (21)

Now, by Assumption 2 and Remark 2, there exist an optimal solution u* and a sequence {û_l}_{l=1}^∞ converging to u* with Q_{1−α}(h̄(û_l, δ)) < 0. Note that Ĥ_{S,l}(û_l) converges to Q_{1−α}(h̄(û_l, δ)) w.p.1, and thus, there exists K(l) such that Ĥ_{S,k}(û_l) ≤ 0 for every k ≥ K(l) and every l, w.p.1. Assume that K(l) < K(l + 1) for every l without loss of generality and define the sequence {ū_k}_{k=K(1)}^∞ by setting ū_k = û_l for all k and l with K(l) ≤ k < K(l + 1). We then have Ĥ_{S,k}(ū_k) ≤ 0, which implies that J^{S,k}_{N,k} ≤ J(ū_k) for all k ≥ K(1). Since J is continuous and ū_k also converges to u*, we have

lim sup_{k→∞} J^{S,k}_{N,k} ≤ J(u*) = J*, w.p.1. (22)

Thus, J^{S,k}_{N,k} → J(u*) w.p.1 when k → ∞. Namely, J^S_N → J*. For the proof of D(A^S_N, A) → 0 w.p.1, refer to [18, Th. 5.3].

For the finite-sample feasibility analysis of the approximation problem solutions, we will make use of Hoeffding's inequality [18], [19].

Theorem 3: Denote Z_1, ..., Z_N independent random variables with bounded sample spaces, namely Pr{Z_i ∈ [z_{i,min}, z_{i,max}]} = 1, ∀i ∈ {1, ..., N}. Then, if s > 0,

Pr{ Σ_{i=1}^N (Z_i − E{Z_i}) ≥ sN } ≤ exp( −2N²s² / Σ_{i=1}^N (z_{i,max} − z_{i,min})² ). (23)

Based on Hoeffding's inequality, the probabilistic feasibility guarantee of the sample approximation method is proven in [9] and summarized here as follows.

Theorem 4: Let u ∈ U be such that u ∉ U_f. Then,

Pr{ F̃_N(−γ, u) ≥ 1 − β } ≤ e^{−2Nτ_u²} (24)

where γ > 0, β ∈ [0, α], and τ_u > 0 is written as

τ_u = F(0, u) − F(−γ, u) + (α − β). (25)

Theorem 4 shows an important property of approximating the cumulative probability function by samples. Set γ ≈ 0 and β ≈ α; as the sample number goes to ∞, Pr{F̃_N(−γ, u) ≥ 1 − β} goes to 0. Namely, with more samples, the solution of the sample-based neural approximation problem satisfies the original probabilistic constraint with higher probability.
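The decay of the bound in (24) and (25) with the sample number can be tabulated directly. The values of F(0, u), F(−γ, u), α, and β below are purely illustrative assumptions for an infeasible u, not quantities reported in the paper.

```python
import numpy as np

def infeasibility_bound(N, F0, F_gamma, alpha, beta):
    """Right-hand side of (24), exp(-2 N tau_u^2), with tau_u from (25)."""
    tau_u = F0 - F_gamma + (alpha - beta)
    return np.exp(-2.0 * N * tau_u ** 2)

# Hypothetical infeasible point: F(0, u) = 0.90 < 1 - alpha, F(-gamma, u) = 0.89.
for N in (150, 750, 7500):
    print(N, infeasibility_bound(N, F0=0.90, F_gamma=0.89, alpha=0.05, beta=0.045))
```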
IV. PROPOSED ALGORITHMS FOR PCPs

Two algorithms are used to solve PCPs. First, a sample-based algorithm is summarized to train the deterministic constraint, which has a neural network form. Then, a simulated annealing algorithm is summarized to solve the deterministic program obtained by the neural approximation approach.
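Since the full pseudocode of the two algorithms lies outside this excerpt, the following is only a generic simulated annealing sketch for the deterministic program (15). The Gaussian move rule, box projection, rejection of candidates violating Ĥ_S(u) ≤ 0, and geometric cooling are assumptions, not necessarily the paper's exact algorithm.

```python
import numpy as np

def simulated_annealing(J, H_S, u0, bounds, T0=1.0, cooling=0.95, iters=2000, seed=0):
    """Minimize J(u) subject to H_S(u) <= 0, cf. (15). u0 is assumed feasible."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    u = np.asarray(u0, dtype=float)
    best_u, best_J, T = u.copy(), J(u), T0
    for _ in range(iters):
        cand = np.clip(u + T * rng.normal(size=u.shape), lo, hi)  # Gaussian move in the box
        if float(H_S(cand)) <= 0.0:                               # keep the neural constraint
            dJ = J(cand) - J(u)
            if dJ <= 0.0 or rng.random() < np.exp(-dJ / max(T, 1e-12)):
                u = cand
                if J(u) < best_J:
                    best_u, best_J = u.copy(), J(u)
        T *= cooling                                              # geometric cooling schedule
    return best_u, best_J
```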
I : ϕ → I(ϕ) ⊆ Y ⊂ R (34)

where ϕ ∈ R^{n_ϕ} is a regression vector containing input variables u ∈ R^{n_u} or decision variables on which the system output y depends. Given an observed ϕ, I(ϕ) is an informative interval, or the prediction interval, containing y with a given guaranteed probability 1 − α, α ∈ [0, 1). Output intervals are obtained by considering the span of parametric families of functions. Here, we consider the family of linear regression functions defined by

M = { y = θ^T ϕ + e, θ ∈ Θ ⊆ R^{n_θ}, |e| ≤ ε ∈ R }. (35)

Then, a parametric IPM is obtained by associating to each ϕ the set of all possible outputs given by θ^T ϕ + e as θ varies over Θ and e varies over E = {e ∈ R : |e| ≤ ε}, which is defined as

I(ϕ) = { y : y = θ^T ϕ + e, ∀θ ∈ Θ, ∀e ∈ E }. (36)

A possible choice for the set Θ is a ball with center c and radius r

Θ = B_{c,r} = { θ ∈ R^{n_θ} : ‖θ − c‖ ≤ r }. (37)

Then, the interval output of the IPM can be explicitly computed as

I(ϕ) = [ c^T ϕ − (r‖ϕ‖ + ε), c^T ϕ + (r‖ϕ‖ + ε) ]. (38)
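Equation (38) translates directly into code; a minimal sketch, assuming the Euclidean norm for ‖ϕ‖, is:

```python
import numpy as np

def ipm_interval(phi, c, r, eps):
    """Interval output I(phi) of (38): [c^T phi - (r*||phi|| + eps), c^T phi + (r*||phi|| + eps)]."""
    center = float(np.dot(c, phi))
    half_width = r * np.linalg.norm(phi) + eps
    return center - half_width, center + half_width
```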
For the wind power prediction problem, the input is wind speed and the output is wind power, i.e., u = v and y = p. Besides, ϕ is chosen as ϕ = [u u²]^T. Then, ϕ = [v v²]^T in the wind power prediction problem. Identifying the IPM is essentially identifying (c, r, ε). Without loss of generality, we use I_{(c,r,ε)}(v) for I(ϕ) in the remainder of the IPM identification problem of wind power prediction.

Due to the above discussions, the IPM identification problem of wind power prediction is to find the IPM I_{(c,r,ε)}(v) (or, equivalently, to find (c, r, ε)) such that

min_{(c,r,ε)} w·r + ε (39)
s.t. r, ε ≥ 0
Pr{ p ∈ I_{(c,r,ε)}(v) } ≥ 1 − α, α ∈ [0, 1). (40)

Notice that w is a weight and is often chosen as w = E{‖ϕ‖} [31]. In this study, w is chosen as 20.
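As a sketch of how a candidate (c, r, ε) can be scored for (39) and (40) on recorded wind data, the code below evaluates the objective w·r + ε and an empirical coverage count with ϕ = [v v²]^T. In the paper, the probabilistic constraint is handled through the sample-based neural approximation rather than by this direct counting, which is used here only for illustration.

```python
import numpy as np

def evaluate_candidate(c, r, eps, v_data, p_data, w=20.0, alpha=0.05):
    """Objective (39) and an empirical check of (40) for one candidate (c, r, eps)."""
    Phi = np.column_stack([v_data, v_data ** 2])          # regression vectors [v, v^2]
    centers = Phi @ np.asarray(c)
    half_widths = r * np.linalg.norm(Phi, axis=1) + eps   # r * ||phi|| + eps
    coverage = np.mean(np.abs(p_data - centers) <= half_widths)
    return w * r + eps, coverage, coverage >= 1.0 - alpha
```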
C. Results and Discussion

Fig. 3 shows one example of the estimation results of the quantile function by using neural networks; 5000 different combinations of (c, r, ε) are considered. The dataset was divided into a train set and a test set. The blue circles show the quantile function values given by the trained fitting model. The red dotted line shows the empirical quantile function values calculated by using the data of the test set. The mean error is −0.0421 and the mean absolute error is 12.5882.

We compare the proposed method with the scenario approach proposed in [7]. The required α is set as 0.05. Figs. 4 and 5 show the results of interval predictions by the scenario approach and the proposed method, respectively. The number of used data samples is 150, 750, or 7500. As the number of samples increases, the scenario approach gives more conservative results, while the proposed method is more robust with respect to the sample number.

Fig. 4. Results of interval predictions by the scenario approach using different choices of sample number.
[30] M. G. Khalfallah and A. M. Koliub, "Suggestions for improving wind turbines power curves," Desalination, vol. 209, nos. 1–3, pp. 221–229, Apr. 2007.
[31] M. C. Campi, G. Calafiore, and S. Garatti, "Interval predictor models: Identification and reliability," Automatica, vol. 45, no. 2, pp. 382–392, Feb. 2009.

Xun Shen (Member, IEEE) was born in Yueyang, China. He received the B.S. degree in electrical engineering from Wuhan University, Wuhan, China, in 2012, and the Ph.D. degree in mechanical engineering from Sophia University, Tokyo, Japan, in 2018. He is currently pursuing the Ph.D. degree with the Department of Statistical Sciences, Graduate University for Advanced Studies, Tokyo.
He has been a Post-Doctoral Research Associate with the Department of Engineering and Applied Sciences, Sophia University, since April 2018. Since June 2019, he has been an Assistant Professor with the Department of Mechanical Systems Engineering, Tokyo University of Agriculture and Technology, Fuchu, Japan. His current research interests include chance-constrained optimization, point processes, and their industrial applications.

Tinghui Ouyang received the B.S. and Ph.D. degrees in electrical engineering and automation from Wuhan University, Wuhan, Hubei, China, in 2012 and 2017, respectively.
He is currently a Research Scientist at the Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan. Before joining AIST, he was a Research Fellow with Nanyang Technological University (NTU), Singapore, from 2018 to 2019, and a Post-Doctoral Fellow at the University of Alberta, Edmonton, AB, Canada, from 2017 to 2018. His major research interests include computational intelligence, data mining, machine learning, and deep learning.

Nan Yang was born in Xiangyang, China. He received the B.S. degree in electrical engineering from the College of Electrical and Power Engineering, Taiyuan University of Technology, Taiyuan, China, in 2009, and the Ph.D. degree in electrical engineering from Wuhan University, Wuhan, China, in 2014.
He is currently an Associate Professor with the Hubei Provincial Collaborative Innovation Center for New Energy Microgrid, China Three Gorges University, Yichang, China. His major research interests include the optimal operation of power systems, power system planning, uncertainty modeling, and artificial intelligence technology.

Jiancang Zhuang received the Ph.D. degree in statistics from the Department of Statistical Science, Graduate University for Advanced Studies, Tokyo, Japan, in 2003.
From 2003 to 2004, he was a Lecturer with The Institute of Statistical Mathematics, Tokyo. From 2004 to 2006, he was a Post-Doctoral Fellow at The Institute of Statistical Mathematics, funded by the Japanese Science Promotion Society. From 2006 to 2007, he was a Post-Doctoral Fellow at the Department of Earth and Space Sciences, University of California at Los Angeles, Los Angeles, CA, USA. In 2007, he returned to The Institute of Statistical Mathematics, where he was an Assistant Professor until 2011. Since 2011, he has been an Associate Professor with The Institute of Statistical Mathematics and the Department of Statistical Sciences, Graduate University for Advanced Studies. His research interests include statistics, probability, and modeling.