Abstract: In this paper, we discuss how to approximate the conditional expectation of a random variable Y
given a random variable X, i.e. E(Y|X). We propose and compare two different non-parametric methodologies
to approximate E(Y|X). The first approach (namely the OLP method) is based on a suitable approximation of
the σ-algebra generated by X. The second procedure is based on the well-known kernel non-parametric regression
method. We analyze the convergence properties of the OLP estimator and we compare the two approaches with
a simulation study.
$$\hat a_j=\frac{1}{n_j}\sum_{i:\,x_i\in \hat A_j} y_i \;\xrightarrow{\;n\to\infty\;}\; E(Y|A_j). \qquad (8)$$

Therefore, we are always able to approximate $E(Y|\mathcal{A}_k)$, and thereby the conditional expectation $E(Y|X)$, by using the following estimator:

$$\hat g_{OLP}(X)=\sum_{j=1}^{k}\mathbf{1}_{\{X\in \hat A_j\}}\,\frac{1}{n_j}\sum_{i:\,x_i\in \hat A_j} y_i=\sum_{j=1}^{k}\mathbf{1}_{\{X\in \hat A_j\}}\,\hat a_j, \qquad (9)$$

where $X$ is assumed independent from the i.i.d. observations $(x_i,y_i)$. Note that $\hat g_{OLP}$ is a simple $\mathcal{A}_k$-measurable function, and it is conceptually different from the classical estimators, which are generally aimed at estimating an unknown parameter rather than a random variable. A further property of the OLP estimator is that $E(\hat g_{OLP}(X))=E(Y)$:

$$E(\hat g_{OLP}(X))=E\Big(\sum_{j=1}^{k}\mathbf{1}_{\{X\in \hat A_j\}}\hat a_j\Big)=\sum_{j=1}^{k}P(\hat A_j)\,E(\hat a_j)=\sum_{j=1}^{k}P(\hat A_j)\,E(Y|\hat A_j)=E(Y). \qquad (10)$$

The second procedure is based on the well-known kernel regression estimator

$$\hat g(x)=\frac{\sum_{i=1}^{n}y_i\,K\!\left(\frac{x-x_i}{h(n)}\right)}{\sum_{i=1}^{n}K\!\left(\frac{x-x_i}{h(n)}\right)}, \qquad (11)$$

where $K(x)$, denoted by kernel, is a density function (typically unimodal and symmetric around zero) such that: i) $K(x)<C<\infty$; ii) $\lim_{x\to\pm\infty}|xK(x)|=0$ (see, among others, [1] and [2]). Moreover, $h(n)$ is the smoothing parameter, often referred to as the bandwidth of the kernel, and it is a positive number such that $h(n)\to 0$ when $n\to\infty$. When the kernel $K$ is the probability density function of a standard normal distribution, the bandwidth is the standard deviation. It was proved in [1] that if $Y$ is quadratically integrable (see also [12]) then $\hat g(x)$ is a consistent estimator for $g(x)$. In particular, observe that, if we denote by $f(x,y)$ the joint density of $(X,Y)$, the denominator of (11) converges to the marginal density of $X$, while the numerator converges to $\int_{-\infty}^{\infty}y\,f(x,y)\,dy$. As a consequence, we know that $\hat g(X)\xrightarrow{a.s.} g(X)$. From a practical point of view, if we apply the kernel estimator to the bi-variate random sample $(x_1,y_1),\dots,(x_n,y_n)$ we obtain the vector $(g_1,\dots,g_n)=(\hat g(x_1),\dots,\hat g(x_n))$. In other words, each value $g_i$ is a weighted average of kernels, centered at each sample observation $x_i$. Since we know that $g_i\to E(Y|X=x_i)$ when $n\to\infty$, we can also estimate the distribution function (say $F(x)=P(g(X)\le x)$) of $g(X)$ with any parametric or non-parametric method, based on the outcomes $(g_1,\dots,g_n)$.
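As an illustration, the two estimators can be sketched as follows. This is a minimal sketch, not the authors' code: the names `olp_estimate` and `nw_estimate` are ours, and we take the OLP partition to be k equiprobable, quantile-based intervals of the x-sample.

```python
import numpy as np

def olp_estimate(x, y, k):
    """OLP estimator (9): split the x-sample into k equiprobable
    (quantile-based) intervals A_j and average y within each interval."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, k + 1))   # interval endpoints
    # bin index of each observation; clip so the maximum falls in the last bin
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, k - 1)
    a_hat = np.array([y[idx == j].mean() for j in range(k)])  # averages a_j
    return a_hat[idx]                                      # g_OLP at each x_i

def nw_estimate(x, y, h):
    """Nadaraya-Watson kernel estimator (11) with a standard normal kernel
    and bandwidth h, evaluated at the sample points themselves."""
    u = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u**2)          # normal kernel (constant factor cancels)
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)
```

A quantile-based partition keeps the interval counts $n_j$ approximately equal to $n/k$, which is the balance between i) and ii) sought by rule (15) below.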
If $(X,Y)$ is jointly normally distributed, then we also know the true distribution of $g(X)$. The main motivation of this study is that, if we show that a method provides a good estimate in the normal case, when $g(X)$ is known, then we argue that it can provide similar results in many other cases, when the distribution of $g(X)$ is unknown. This is especially true for the OLP method, which does not depend on any particular specification except for the choice of the number of intervals $k$. Hence, assuming that $(X,Y)\sim N(\mu,\Sigma)$ (where $\mu=(\mu_X,\mu_Y)$ and $\Sigma=((\sigma_X^2,\rho\sigma_X\sigma_Y),(\rho\sigma_X\sigma_Y,\sigma_Y^2))$), we propose to simulate a bivariate random sample from $(X,Y)$ and to apply the OLP and the kernel methods to the observed data, just as described in section 2, in order to evaluate which method yields a better approximation of the distribution of $g(X)$. Indeed, in both cases we obtain $n$ outcomes $(g_1,\dots,g_n)$, which are used to estimate the probability distribution $F(x)=P(g(X)\le x)$ of the r.v. $g(X)$ (i.e. the Gaussian distribution $N(\mu_Y,|\rho|\sigma_Y)$ in this particular case). For this purpose, we simply apply the empirical distribution function to the vector $(g_1,\dots,g_n)$. The empirical distribution is actually the natural consistent estimator of $F$ (see e.g. [13]), and it is defined by

$$F_n(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{\{g_i\le x\}}, \qquad (12)$$

where $\mathbf{1}_A$ is the indicator function of the set $A$. In this paper we propose to use $F_n$ as a non-parametric estimator for the distribution of $g(X)$. Obviously, according to how the outcomes $(g_1,\dots,g_n)$ are generated (OLP or kernel) we obtain two different empirical distributions $F_n$, which approximate the true distribution, given by $N(\mu_Y,|\rho|\sigma_Y)$. Then, in order to evaluate which method provides the best-fitting distribution, we compute two different descriptive measures of fit based on probability distances, namely the Kolmogorov-Smirnov (or uniform) metric, defined by

$$D(F_n,F)=\sup_{x\in\mathbb{R}}|F_n(x)-F(x)|, \qquad (13)$$

and the Kantorovich metric [14] (i.e. the $L^1$ metric for distribution functions), defined by

$$K(F_n,F)=\int_{-\infty}^{\infty}|F_n(x)-F(x)|\,dx. \qquad (14)$$

Generally, provided that both methods can capture the shape of $F$, we would expect that the kernel method yields a better fit, because of its larger number of distinct outcomes $g_i$: indeed, a continuous distribution is typically better estimated by a large number of distinct observations. However, it should be stressed that the OLP method has several other advantages compared to the kernel method. While the OLP method only requires that $Y$ is an integrable random variable, the kernel method is suitable only for continuous random variables and also requires the assumption of finite variance. Moreover, for the OLP method we only need to specify how to determine the number of intervals $k$, while, for the kernel method, we have to choose the "best" kernel density and bandwidth parameter. In particular, for the proposed analysis we used the following specifications.

i) OLP - number of intervals.
Obviously, the selected number of intervals $k$ can vary between 1 and $n$ and, in order to improve the accuracy of the estimate, it must generally be an increasing function of $n$ (we shall discuss this point in the next section). Note that, for $k=1$, we approximate the random variable $g(X)$ with a number, i.e. the sample mean $\bar y$, which is obviously not appropriate. On the other hand, for $k=n$ we approximate $g(X)$ with the marginal distribution of $Y$, given by $y_1,\dots,y_n$, which is also generally inappropriate. Hence, in order to maximize i) the number of intervals and ii) the number of observations in each interval ($n_j$), in this analysis we propose to use

$$k=\lfloor\sqrt{n}\rfloor, \qquad (15)$$

where $\lfloor x\rfloor$ is the integer part of $x$. By doing so, we obtain $k$ intervals containing (approximately) $k$ observations. If we do not have any information about the dependence between $X$ and $Y$, this method is actually the most robust, in that it provides the largest possible number of conditional averages $\hat a_j$, where each $\hat a_j$ is computed based on the largest possible number of values ($n_j$). In the next section we prove that the rule identified by (15) is actually appropriate and yields a consistent estimator. Moreover, in section 4 we propose a new rule for the choice of $k$, based on the correlation between $X$ and $Y$, that might further enhance the performance of the OLP estimator.

ii) Kernel - density and bandwidth.
Generally, the choice of the kernel density and especially of the bandwidth parameter can be troublesome, and this could be a drawback of the method. Indeed, there are several sophisticated techniques to choose the
optimal bandwidth, which is still an open problem in the literature (see e.g. [15]). Since we know that $(X,Y)$ is jointly normally distributed, we simply use the normal kernel, which is obviously the most appropriate choice in this particular case. As for the bandwidth parameter, we can use Sturges' or Scott's rule (see [16] and [17]), which are especially suitable under normality assumptions (see also [18]). In particular, in what follows we shall show just the results obtained by applying Scott's rule, because it provided better approximations of $F$ in our analyses. We recall that the optimal bandwidth, according to Scott's rule, is given by $3.5\,\sigma\,n^{-1/3}$.

Note that, in the Gaussian case, the distribution of the conditional expectation, that is, $g(X)\sim N(\mu_Y,|\rho|\sigma_Y)$, mainly depends on the correlation between the variables. Hence, we generated several random samples of different sizes from $(X,Y)\sim N(\mu,\Sigma)$ (where we assume the marginals to be standard normal random variables, i.e., $\mu_X=\mu_Y=0$, $\sigma_X=\sigma_Y=1$), for different (positive) values of $\rho$, and analyzed the results accordingly. Table 1 shows the results in terms of the probability metrics defined above. First, note that the K-S distance is quite high (about 0.5) for $\rho=0$. The obvious reason is that $\rho=0$ yields $g(X)\stackrel{d}{=}E(Y)$, which is a degenerate distribution (at 0, in this case), and therefore the Kolmogorov-Smirnov metric $D(F_n,\mathbf{1}_{\{x\ge 0\}})$ is generally close to 0.5, while the Kantorovich metric, which is based on the area between the functions, better captures the distance between the distributions in this particular case. Note also that the consistency of both methods is apparent from Tables 1 and 2, in that, increasing the sample size (from $n=500$ in Table 1 to $n=10^5$ in Table 2), the distance between the true and the estimated distribution approaches zero, for any fixed value of $\rho$ (except for the Kolmogorov-Smirnov distance at $\rho=0$, as explained above). However, we observe that the kernel method generally outperforms the OLP method, although in several cases the results are similar (the OLP method is better only in some rare cases), as expected and discussed above. In particular, note that the kernel method generally provides more accurate estimates than the OLP for small or large values of $\rho$: indeed, the value of $\rho$ will be critical for the optimal choice of $k$, as discussed in the next section. Nevertheless, if we consider that the kernel method has been calibrated just to provide the best possible estimates under the assumption of normality, the results of the OLP method are surprisingly good.

            Kernel          OLP
    ρ       D       K       D       K
    0       0.784   0.059   0.565   0.207
    0.09    0.894   0.070   0.609   0.203
    0.18    0.255   0.044   0.284   0.088
    0.27    0.227   0.073   0.122   0.060
    0.36    0.225   0.105   0.177   0.085
    0.45    0.059   0.031   0.118   0.085
    0.54    0.076   0.057   0.111   0.069
    0.63    0.111   0.103   0.169   0.122
    0.72    0.103   0.106   0.072   0.078
    0.81    0.098   0.129   0.094   0.091
    0.9     0.067   0.132   0.077   0.082

Table 1: simulations with $n=500$ from a multivariate normal distribution.

            Kernel          OLP
    ρ       D       K       D       K
    0       0.601   0.009   0.502   0.045
    0.09    0.067   0.008   0.062   0.015
    0.18    0.038   0.011   0.030   0.009
    0.27    0.034   0.013   0.033   0.011
    0.36    0.017   0.008   0.029   0.010
    0.45    0.017   0.009   0.017   0.010
    0.54    0.012   0.010   0.017   0.013
    0.63    0.013   0.010   0.015   0.013
    0.72    0.009   0.010   0.014   0.012
    0.81    0.007   0.009   0.015   0.013
    0.9     0.005   0.008   0.014   0.014

Table 2: simulations with $n=10^5$ from a multivariate normal distribution.

Similarly, we performed a simulation analysis for the Student's t distribution. We generated several random samples of different sizes from $(X,Y)\sim St(\mu,\Sigma,4)$ (with $\mu=(0,2)$ and $\sigma_X=\sigma_Y=1$): in this case, we know that $E(Y|X)\sim St(2,|\rho|,4)$. The results confirm what was discussed above, as shown in Table 3, for $n=10^5$. In particular, observe that the OLP estimator outperforms the kernel estimator for values of $\rho$ between 0.18 and 0.45, but in the other cases the kernel estimator is more accurate. Finally, Fig. 1 and Fig. 2 show that the OLP estimator well captures the shape of the distribution.
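The distances reported in the tables can be computed with a short sketch. This is our own illustration, not the authors' code: the names `ks_metric` and `kantorovich_metric` are hypothetical, and the Kantorovich integral (14) is approximated on a finite grid.

```python
import numpy as np
from math import erf, sqrt

def ks_metric(g, cdf):
    """Kolmogorov-Smirnov distance (13) between the empirical distribution
    (12) of the outcomes g_1..g_n and a reference CDF (a callable)."""
    g = np.sort(np.asarray(g))
    m = len(g)
    Fg = cdf(g)
    # the supremum is attained at a jump of the ECDF: check both sides
    return max(np.max(np.arange(1, m + 1) / m - Fg),
               np.max(Fg - np.arange(m) / m))

def kantorovich_metric(g, cdf, grid):
    """Kantorovich (L1) distance (14), by trapezoidal integration on a grid."""
    Fn = np.searchsorted(np.sort(np.asarray(g)), grid, side="right") / len(g)
    d = np.abs(Fn - cdf(grid))
    return float(np.sum(0.5 * (d[1:] + d[:-1]) * np.diff(grid)))

# demo: outcomes drawn from the true law g(X) ~ N(0, |rho|) itself
rho = 0.45
cdf = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / (rho * sqrt(2.0)))))
g = rho * np.random.default_rng(0).standard_normal(2000)
print(ks_metric(g, cdf), kantorovich_metric(g, cdf, np.linspace(-2, 2, 801)))
```

Both distances are small, of order $n^{-1/2}$, consistently with the decrease observed when moving from Table 1 to Table 2.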
            Kernel          OLP
    ρ       D       K       D       K
    0       0.670   0.023   0.590   0.057

Table 3: simulations with $n=10^5$ from a multivariate Student's t distribution.

Fig. 1. $(X,Y)\sim N(\mu,\Sigma)$, $\mu=(0,2)$, $\sigma_X=\sigma_Y=1$ and $\rho=0.8$. Green = true dist., Red = estimated dist. (kernel), Blue = estimated dist. (OLP).

Fig. 3. $(X,Y)\sim St(\mu,\Sigma,4)$, $\mu=(0,2)$, $\sigma_X=\sigma_Y=1$ and $\rho=0.1$, $n=1000$. Green = true dist., Red = estimated dist. (kernel), Blue = estimated dist. (OLP).

4 On the Optimal Number of Intervals

The simulation comparison in section 3 was apparently "rigged" in favor of the kernel method. Indeed, we used the information about the distributional assumptions (normality) to improve the results of the kernel method as much as possible, i.e. by using the normal kernel and the optimal bandwidth. On the other hand, this information was not used to enhance the performance of the OLP estimator. However, even in this adverse situation, the OLP estimator yields surprisingly good results. By contrast, in this section we propose to use the information about the joint distribution in order to further improve the OLP estimator, under some particular conditions.

As discussed in section 3, the number of intervals $k$ for the OLP method can vary between 1 and $n$.
However, if we assume that the random vector $(X,Y)$ is jointly normally distributed, then we know that, for $\rho=0$, we obtain $g(X)\stackrel{d}{=}E(Y)$, and therefore the distribution of $g(X)$ is better estimated by a single number, that is, the sample mean $\bar y$. Hence, in this particular case, the optimal value of $k$ is exactly 1, rather than $\lfloor\sqrt n\rfloor$. On the other hand, for $|\rho|=1$ (i.e. $Y=a+bX$), we obtain $g(X)=E(a+bX|X)=a+bX=Y$, and therefore the distribution of $g(X)$ can be estimated with the marginal observations of $Y$, $(y_1,\dots,y_n)$. Thus, when $|\rho|=1$ the optimal value of $k$ is exactly $n$: in this case we would get the maximum possible number ($n\ge\lfloor\sqrt n\rfloor$) of distinct observations.

From these considerations, we argue that, also in the case $\rho\in(0,1)$, the dependence between $X$ and $Y$ should influence the choice of the number of intervals $k$. In particular, $k$ should be chosen according to the mean-dependence structure between the random variables. We recall that, generally, stochastic independence implies mean-independence, which in turn implies uncorrelation (nevertheless, if $(X,Y)$ is jointly Gaussian, the three conditions are equivalent). In the following proposition we derive the formula of the mean squared error (MSE) between the OLP estimator and the conditional expectation $g(X)$ in the Gaussian case with standard marginals, for $k$ equiprobable intervals $\hat A_j$, $j=1,\dots,k$.

Proposition 1. The mean squared error of the OLP estimator is given by:

$$E\big[(\hat g_{OLP}(X)-g(X))^2\big]=\frac{k}{n}+\rho^2\left\{1-k\left(1+\frac{1}{n}\right)\sum_{j=1}^{k}\left[f\!\left(F^{-1}\!\left(\tfrac{j-1}{k}\right)\right)-f\!\left(F^{-1}\!\left(\tfrac{j}{k}\right)\right)\right]^2\right\}, \qquad (16)$$

where $f$ is the (Gaussian) marginal density of $X$ and $F$ the corresponding distribution function.

Proof.
We know that $E(Y|X)=\rho X\sim N(0,|\rho|)$. Moreover,

$$E\big[(\hat g_{OLP}(X)-g(X))^2\big]=E\big[(\hat g_{OLP}(X)-\rho X)^2\big]=E\big(\hat g_{OLP}(X)^2\big)+\rho^2-2\rho\,E\big(X\hat g_{OLP}(X)\big),$$

where

$$E\big(\hat g_{OLP}(X)^2\big)=E\Big(\Big(\sum_{j=1}^{k}\mathbf{1}_{\{X\in\hat A_j\}}\hat a_j\Big)^2\Big)=E\Big(\sum_{j=1}^{k}\mathbf{1}_{\{X\in\hat A_j\}}\hat a_j^2\Big)=\frac{1}{k}\sum_{j=1}^{k}E\big(\hat a_j^2\big),$$

because $\sum_{j\ne l}\hat a_j\hat a_l\mathbf{1}_{\{X\in\hat A_j\cap\hat A_l\}}=0$. Note that

$$E\big(\hat a_j^2\big)=\Big(\frac{k}{n}\Big)^2 E\Big(\Big[\sum_{i=1}^{n} Y_i\,\mathbf{1}_{\{x_i\in\hat A_j\}}\Big]^2\Big).$$
Expanding the square and summing over $j$, since $\sum_{j=1}^{k}E\big(Y^2\mathbf{1}_{\{X\in\hat A_j\}}\big)=\int y^2 f(y)\,dy=V(Y)=1$, we obtain:

$$\frac{1}{k}\sum_{j=1}^{k}E\big(\hat a_j^2\big)=\frac{k}{n}\left(1+(n-1)\frac{\rho^2}{2\pi}\sum_{j=1}^{k}\left[e^{-F^{-1}((j-1)/k)^2/2}-e^{-F^{-1}(j/k)^2/2}\right]^2\right).$$

Furthermore, since

$$E(\hat a_j)=\frac{k\rho}{\sqrt{2\pi}}\left(e^{-F^{-1}((j-1)/k)^2/2}-e^{-F^{-1}(j/k)^2/2}\right),$$

we obtain

$$E\big(X\hat g_{OLP}(X)\big)=\sum_{j=1}^{k}E(\hat a_j)\,E\big(X\mathbf{1}_{\{X\in\hat A_j\}}\big)=\frac{k\rho}{2\pi}\sum_{j=1}^{k}\left[e^{-F^{-1}((j-1)/k)^2/2}-e^{-F^{-1}(j/k)^2/2}\right]^2,$$

which yields the thesis.

Obviously, in practical cases we do not know the true percentiles, but we estimate them with the empirical distribution: these estimates are consistent, that is, for fixed $k$ and for $n\to\infty$ the sample percentiles converge to the true ones (as an obvious consequence of the law of large numbers). Nevertheless, we also need $k\to\infty$ in order to obtain a consistent estimator of $g(X)$; therefore, if the number of estimands grows as fast as $n$ does, then the MSE of the OLP method will not converge to 0. In view of Proposition 1, it is apparent that, in the case $|\rho|\ne 1$, the necessary and sufficient conditions for the convergence of the OLP estimator are: i) $k(n)\to\infty$; ii) $k/n\to 0$. This is stated in the following corollary, which is a straightforward consequence of Proposition 1.

Corollary 1. Let $(X,Y)$ be a bivariate Gaussian vector, i.e. $(X,Y)\sim N(\mu,\Sigma)$, where $|\rho|\in(0,1)$, and let $(x_1,y_1),\dots,(x_n,y_n)$ be a random sample of independent observations from $(X,Y)$. A necessary and sufficient condition for $MSE(\hat g_{OLP}(X))\to 0$ is that $k\to\infty$ and $k/n\to 0$.

In other words, the general rule is that the number of percentiles (which have to be estimated) must grow slower than the number of observations. Corollary 1 gives necessary and sufficient conditions for the consistency of the OLP estimator in the Gaussian case. We argue that the same rule holds also if the joint distribution is not normal. Obviously, the rule $k=\lfloor n^a\rfloor$, where $a\in(0,1)$ (e.g. $k=\lfloor n^{0.5}\rfloor$, used in the simulation analysis), satisfies the conditions of Corollary 1. In order to further increase the convergence rate of the OLP estimator, we argue that, when $|\rho|$ is close to 0, the exponent $a$ should be close to 0, and, when $|\rho|$ is close to 1, the exponent $a$ should be close to 1. Indeed, observe that the second term in the MSE expression approaches zero only as $k$ tends to infinity, but it can be negligible for small values of $|\rho|$, while in this case the asymptotic behavior of the first term is critical. On the other hand, we obtain exactly the opposite situation when the variables are highly correlated; thus, in this case, a larger value of $k$ would increase the convergence rate. Hence, as a rule of thumb, we propose to use

$$k=k^*=\big\lfloor n^{|r|^{2/3}}\big\rfloor, \qquad (17)$$

(where $r$ is the value of the empirical correlation between the data) which yields $k=1$ for $|r|\approx 0$, $k\approx n$ for $|r|\approx 1$, and ensures that $MSE(\hat g_{OLP}(X))\to 0$ for any other value of $\rho$. The possible usefulness of this simple rule is well described by the following examples.

Example. Let $(X,Y)\sim N(\mu,\Sigma)$ with $\mu=(0,2)$, $\sigma_X=\sigma_Y=1$ and $\rho=0.7$, which yields $g(X)\sim N(2,0.7)$. We generate a bivariate sample of size 1000 from the random vector $(X,Y)$ and estimate the distribution function $F$ of $g(X)$ with the OLP method, using the following values of $k$: 1) $k=k^*=\lfloor n^{|r|^{2/3}}\rfloor=223$ (where $r=0.69$); 2) $k=\lfloor n^{0.3}\rfloor=7$; 3) $k=\lfloor\sqrt n\rfloor=31$; and 4) $k=\lfloor n^{0.95}\rfloor=707$. Fig. 4 below shows that the best performance of the OLP estimator is obtained in case 1), $k=\lfloor n^{|r|^{2/3}}\rfloor$, as the estimated distribution fits the true one remarkably well. The other choices clearly perform worse. Indeed, the estimated distributions yielded by 2) and 3) seem to capture the shape of the reference distribution ($F$, that is, $N(2,0.7)$), but approximate it with a "lower resolution", in that the number of intervals (i.e. of distinct values $\hat a_j$) is quite small (especially in case 2)). Conversely, in case 4) we have a higher number of distinct values $\hat a_j$ and consequently a "higher resolution" in the plot, but the estimated distribution has apparently a different shape from the true one. Nevertheless, these results also confirm that with $k=\lfloor\sqrt n\rfloor$ we generally obtain a good compromise between closeness to the true distribution and the number of distinct observations $\hat a_j$.
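Formula (16) and rule (17) are easy to evaluate numerically. The sketch below is ours, not the authors' code: the names `olp_mse` and `k_star` are hypothetical, and `scipy.stats.norm` supplies $F^{-1}$ and $f$.

```python
import numpy as np
from scipy.stats import norm

def olp_mse(k, n, rho):
    """Theoretical MSE (16) of the OLP estimator for a bivariate standard
    normal vector with correlation rho and k equiprobable intervals."""
    q = norm.ppf(np.arange(k + 1) / k)   # endpoints F^{-1}(j/k), j = 0..k
    phi = norm.pdf(q)                    # f(F^{-1}(j/k)); equals 0 at +/-inf
    s = np.sum((phi[:-1] - phi[1:]) ** 2)
    return k / n + rho**2 * (1.0 - k * (1.0 + 1.0 / n) * s)

def k_star(n, r):
    """Rule of thumb (17): k* = floor(n^(|r|^(2/3))), at least 1."""
    return max(1, int(n ** (abs(r) ** (2.0 / 3.0))))

# for k = 1 the MSE reduces to 1/n + rho^2 (estimation by the sample mean)
n = 1000
for rho in (0.1, 0.5, 0.9):
    k = k_star(n, rho)
    print(f"rho = {rho}: k* = {k}, MSE = {olp_mse(k, n, rho):.4f}")
```

As a sanity check, $k=1$ gives $1/n+\rho^2$, and increasing $k$ trades the bias term $\rho^2\{\cdot\}$ against the variance term $k/n$, which is exactly the balance that rule (17) exploits.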
Fig. 4. Estimated (blue) and true (red) distribution functions for different values of $k$ in the Gaussian case.

Furthermore, we repeat the same experiment for a bivariate t distribution, that is, $(X,Y)\sim St(\mu,\Sigma,5)$ with $\mu=(0,2)$, $\sigma_X=\sigma_Y=1$ and $\rho=0.3$. In this case we obtain $g(X)\sim St(2,0.3,5)$. We generate a bivariate sample of size 10000 from the random vector $(X,Y)$ and estimate the distribution function $F$ of $g(X)$ with the OLP method, using the following values of $k$: 1) $k=k^*$ (where $r=0.32$); 2) $k=\lfloor n^{0.1}\rfloor$; 3) $k=\lfloor\sqrt n\rfloor=100$; and 4) $k=125$. Fig. 5 shows that in cases 1) and 3) we apparently obtain the best approximations. However, the Kantorovich metric, slightly smaller for case 1 than for case 3, confirms that the best choice is $k^*$.

Fig. 5. Estimated (blue) and true (red) distribution functions for different values of $k$ in the Student's t case.

We finally argue that the proposed rule can be appropriate for dealing with distributions which are approximately Gaussian, and therefore can be usefully applied to several kinds of data, because of the central limit theorem for random vectors. Indeed, the multidimensional central limit theorem states that the standardized sum of i.i.d. random vectors (with finite variance) converges to a multivariate normal distribution. Therefore, just based on the assumption that $(X,Y)$ have a joint distribution (i.e. they are defined on the same probability space), we can generally use the empirical correlation between the observations of $X$ and $Y$ in order to get information about the dependence between the variables, and thereby to determine the optimal number of intervals for the OLP estimator.

5 Conclusion

In this paper, we proposed two different procedures (OLP and kernel) to estimate the distribution function of a conditional expectation, based on a bivariate random sample. In particular, the properties of the OLP estimator have been studied thoroughly. It has been shown that the method can provide a consistent approximation of the random variable $E(Y|X)$, based on a suitable choice of a parameter $k$. Both the OLP and kernel methods make it possible to estimate the distribution function of $E(Y|X)$ non-parametrically. Our simulation results show that the methods are comparable. However, it should be stressed that the OLP method presents several advantages compared to the kernel, in that it does not require any particular assumption in order to be applied. Conversely, the kernel method relies on more restrictive conditions, such as continuity and finite variance, and its performance depends on a suitable choice of the kernel function and bandwidth parameter. Finally, since the performance of the OLP estimator depends just on the chosen number of intervals $k$, we provided some general criteria for the choice of $k$ under normality assumptions. Consequently, we proposed a practical rule in order to optimize the performance of the method.

Both estimators (kernel and OLP) can be used in optimization procedures as required in several financial applications. In particular, we can use the conditional expectation estimators to i) order the investors' choices or ii) evaluate and exercise arbitrage strategies in the market (see [10]). In the first case, we know that any non-satiable risk-averse investor prefers the future wealth $W_T$ at time $T$ to the wealth $W_t$ at time $t$ ($t<T$) only if $E(W_t|W_T)\le W_T$ a.s. Thus, by using the proposed approximation of the conditional expected value, we can attempt to order and optimize the choices of non-satiable risk-averse investors, as suggested by [8]. In the second case, as a consequence of the fundamental theorem of arbitrage, we know that there exists no arbitrage opportunity in the market if there exists a risk-neutral measure under which the discounted price process is a martingale. So, if we consider the augmented filtration $\{\mathcal{F}_t\}_{t\ge 0}$ associated to the Markov price process $\{X_t\}_{t\ge 0}$, then we obtain, for $s\le t$, that $E(X_t|\mathcal{F}_s)=E(X_t|X_s)$. Therefore, the conditional expected value estimator and the fundamental theorem of arbitrage can be used to estimate the risk-neutral measure and to optimize arbitrage strategies in the market.
Acknowledgements

This paper has been supported by the Italian funds ex MURST 60% 2014, 2015 and the MIUR PRIN MISURA Project, 2013-2015, and the ITALY project (Italian Talented Young researchers). The research was supported through the Czech Science Foundation (GACR) under project 15-23699S and through SP2015/15, an SGS research project of VSB-TU Ostrava, and furthermore by the European Social Fund in the framework of CZ.1.07/2.3.00/20.0296.

References:
[1] E. A. Nadaraya, On estimating regression, Theory of Probability and its Applications, vol. 9, no. 1, 1964, pp. 141-142.
[2] G. S. Watson, Smooth regression analysis, Sankhya, Series A, vol. 26, no. 4, 1964, pp. 359-372.
[3] D. Ruppert, M. P. Wand, R. J. Carroll, Semiparametric Regression. Cambridge University Press, 2003.
[4] G. Mzyk, Generalized kernel regression for the identification of Hammerstein series, International Journal of Applied Mathematics and Computer Science, vol. 17, no. 2, 2007, pp. 189-197.
[5] S. Ortobelli, F. Petronio, T. Lando, Portfolio problems based on returns consistent with the investor's preferences, Advances in Applied and Pure Mathematics, ISBN: 978-960-474-380-3, 2014, pp. 340-347.
[6] S. G. Steckley, S. G. Henderson, A kernel approach to estimating the density of a conditional expectation, Proceedings of the 2003 Winter Simulation Conference, S. Chick, P. J. Sánchez, D. Ferrin, and D. J. Morrice, eds., 2003, pp. 383-391.
[7] A. N. Avramidis, A cross-validation approach to bandwidth selection for a kernel-based estimate of the density of a conditional expectation, Proceedings of the 2011 Winter Simulation Conference, S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds., 2011, pp. 439-443.
[8] S. Ortobelli, F. Petronio, T. Lando, A portfolio return definition coherent with the investors' preferences, under revision in IMA Journal of Management Mathematics.
[9] M. Roth, On the multivariate t distribution, Technical report, Automatic Control, Linköpings universitet, 2013.
[10] S. Ortobelli, T. Lando, On the use of conditional expectation estimators, New Developments in Pure and Applied Mathematics, ISBN: 978-1-61804-287-3, 2015, pp. 244-246.
[11] W. Hardle, M. Muller, S. Sperlich, A. Werwatz, Nonparametric and Semiparametric Models. Springer Verlag, Heidelberg, 2004.
[12] V. A. Epanechnikov, Non-parametric estimation of a multivariate probability density, Theory of Probability and its Applications, vol. 14, no. 1, 1969, pp. 153-158.
[13] T. Lando, L. Bertoli-Barsotti, Statistical functionals consistent with a weak relative majorization ordering: applications to the minimum divergence estimation, WSEAS Transactions on Mathematics, vol. 13, 2014, pp. 666-675.
[14] S. T. Rachev, Probability Metrics and the Stability of Stochastic Models. Wiley, New York, 1991.
[15] V. C. Raykar, R. Duraiswami, Fast optimal bandwidth selection for kernel density estimation, Proceedings of the Sixth SIAM International Conference on Data Mining, 2006, pp. 522-526.
[16] D. W. Scott, On optimal and data-based histograms, Biometrika, vol. 66, no. 3, 1979, pp. 605-610.
[17] H. A. Sturges, The choice of a class interval, Journal of the American Statistical Association, vol. 21, no. 153, 1926, pp. 65-66.
[18] E. Bura, A. Zhmurov, V. Barsegov, Nonparametric density estimation and optimal bandwidth selection for protein unfolding and unbinding data, The Journal of Chemical Physics, vol. 130, 2009, article no. 015102.