
Neural Networks, Vol. 3, pp. 535-549, 1990. Printed in the USA. All rights reserved.
0893-6080/90 $3.00 + .00. Copyright © 1990 Pergamon Press plc.
ORIGINAL CONTRIBUTION

Connectionist Nonparametric Regression: Multilayer
Feedforward Networks Can Learn Arbitrary Mappings

HALBERT WHITE
University of California, San Diego

(Received 27 February 1989; revised and accepted 31 January 1990)

Abstract--It has been recently shown (e.g., Hornik, Stinchcombe & White, 1989, 1990) that sufficiently complex multilayer feedforward networks are capable of representing arbitrarily accurate approximations to arbitrary mappings. We show here that these approximations are learnable by proving the consistency of a class of connectionist nonparametric regression estimators for arbitrary (square integrable) regression functions. The consistency property ensures that as network "experience" accumulates (as indexed by the size of the training set), the probability of network approximation error exceeding any specified level tends to zero. A key feature of the demonstration of consistency is the proper control of the growth of network complexity as a function of network experience. We give specific growth rates for network complexity compatible with consistency. We also consider automatic and semi-automatic data-driven methods for determining network complexity in applications, based on minimization of a cross-validated average squared error measure of network performance. We recommend cross-validated average squared error as a generally applicable criterion for comparing relative performance of differing network architectures and configurations.

Keywords--Learning, Feedforward networks, Nonparametric, Regression, Convergence.

1. INTRODUCTION

Recently, Hornik, Stinchcombe and White (1989, 1990) (HSWa,b) have demonstrated that sufficiently complex multilayer feedforward networks (e.g., Rumelhart, Hinton, & Williams, 1986) are capable of arbitrarily accurate approximations to arbitrary mappings (i.e., measurable or continuous functions). (See also Carroll and Dickinson, 1989; Cybenko, 1989; Funahashi, 1989; Hecht-Nielsen, 1989; and Stinchcombe & White, 1989a.) An unresolved issue is that of "learnability," that is, whether there exist methods allowing the network weights corresponding to these approximations to be learned from empirical observation of such mappings. It is the purpose of this paper to prove that these approximations are indeed learnable by establishing the statistical property of consistency for a certain class of learning methods for such networks. The consistency property ensures that as network experience accumulates (as indexed by the size of the training set n) the probability of network approximation error exceeding any specified level tends to zero. This establishes the nonparametric regression capability of multilayer feedforward network models.

A key feature of our analysis is the proper control of network complexity as a function of network experience n. Arbitrarily accurate approximation to an arbitrary function will generally require an arbitrarily complex network. In practice, only networks of finite complexity can be implemented. Too complex a network (relative to n) learns too much--it memorizes the training set and therefore generally performs poorly at generalization tasks. Too simple a network learns too little--it fails to extract the desired relationships from the training set and thus also performs poorly at generalization tasks. Our theory provides specific growth rates for network complexity that asymptotically (i.e., as n → ∞) avoid the dangers of both overfitting and underfitting the training set. Moreover, these rates can be combined with data-driven methods for determining network complexity in a manner that guarantees their consistency also.

Acknowledgments: This research was supported by a grant from the Guggenheim Foundation and by NSF grant SES-8806990. The author is grateful for helpful conversations with Maxwell Stinchcombe, Whitney Newey, and James Powell, and for helpful suggestions by the editor and referees. The author is responsible for any errors.

Requests for reprints should be sent to Halbert White, Department of Economics, D-008, University of California, San Diego, La Jolla, CA 92093.


The learning methods treated here are extremely computationally demanding. Thus, they lay no claim to biological or cognitive plausibility. They do, however, provide some appreciation for the computational effort required to train multilayer feedforward networks to approximate arbitrary mappings arbitrarily well. Our methods are feasible in modern (especially parallel) computing environments and are thus of some value to researchers interested in practical applications. In particular, we recommend general application of cross-validated measures of network performance.

To keep this paper to a manageable size, we limit its scope strictly to the issue of consistency (learnability). This by no means exhausts the relevant issues; it leaves many relevant questions unaddressed, and indeed makes possible the formulation of further important questions. We leave these to other work, and merely identify them as they arise.

2. HEURISTICS

Consider a sequence {Z_t} ≡ {Z_t, t = 1, 2, ...} of identically distributed random column vectors. Suppose we are interested in the relationship between some elements Y_t of Z_t and the remaining elements X_t. We write Z_t = (Y_t', X_t')', where a prime superscript denotes vector transposition. For example, in classification or pattern recognition problems Y_t is a binary or multinomial variable designating class of membership and X_t is a set of variables influencing the classification. In forecasting problems, Y_t is the set of variables (digital or analog) that we wish to forecast on the basis of variables X_t, which may itself contain past values of Y_t. In image enhancement problems, Y_t encodes a target image, and X_t encodes a degraded version of the image. In pattern completion problems, Y_t is the missing part of a pattern, and X_t is the supplied part of the pattern (we suppose the same part is always to be supplied for this application).

Regardless of whether a deterministic or stochastic relationship exists between Y_t and X_t, a natural object of interest in such situations is the conditional expectation of Y_t given X_t, written E(Y_t | X_t). (See White, 1989a.) This can be represented as a regression function,

θ₀(X_t) = E(Y_t | X_t).

For example, when Y_t can take on only the values 0 or 1, θ₀(x) gives the probability that Y_t = 1 given that X_t = x. When Y_t can assume a continuum of values, θ₀(x) gives the expected value for Y_t given that X_t = x. We may also write

Y_t = θ₀(X_t) + ε_t,

where ε_t ≡ Y_t − E(Y_t | X_t) is a random error with conditional expectation zero given X_t. When the relationship between Y_t and X_t is deterministic, ε_t is always zero; otherwise, ε_t is nonzero with positive probability. In the situation treated here, the regression function θ₀ is assumed to be entirely unknown. Our problem is to learn (estimate, approximate) the mapping θ₀ from a realization of the sequence {Z_t}.
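
For concreteness, a minimal Python sketch of this data-generating structure (the particular θ₀ below is an arbitrary stand-in; any square integrable choice fits the framework):

    import numpy as np

    rng = np.random.default_rng(0)

    def theta_0(x):
        # Stand-in for the unknown regression function theta_0 on [0, 1];
        # any square integrable choice fits the framework.
        return np.sin(2 * np.pi * x) + 0.5 * x

    n = 1000                               # training set size
    X = rng.uniform(0.0, 1.0, size=n)      # patterns X_t, here i.i.d. on [0, 1]
    eps = rng.normal(0.0, 0.1, size=n)     # errors with E(eps_t | X_t) = 0
    Y = theta_0(X) + eps                   # targets Y_t = theta_0(X_t) + eps_t
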
In practice, we observe a realization of only a finite part of the sequence {Z_t}, a "training set" or "sample" of size n (i.e., a realization of Z^n ≡ (Z_1', ..., Z_n')'). Because θ₀ is an element of a space of functions (say Θ), we have essentially no hope of learning θ₀ in any complete sense from a sample of fixed finite size. Nevertheless, it is possible to approximate θ₀ to some degree of accuracy using a sample of size n, and to construct increasingly accurate approximations with increasing n. In what follows we will refer to such a procedure interchangeably as learning, estimation, or approximation. A learning rule is correspondingly defined in the weakest possible sense as a sequence of mappings, say {θ̂_n} ≡ {θ̂_n: n = 1, 2, ...}, from the underlying probability space generating the observed phenomenon of interest to the space Θ in which θ₀, the object of interest describing this phenomenon, lies. Because θ̂_n is a mapping from a probability space, it is stochastic.

A minimal property for any learning rule is that of consistency. A stochastic sequence {θ̂_n} is consistent for θ₀ if the probability that θ̂_n exceeds any specified level of approximation error relative to θ₀ tends to zero as the sample size n tends to infinity. Procedures that are not consistent will always make errors in classification, recognition, forecasting, enhancement, or pattern completion (forms of generalization) that are eventually avoided by a consistent procedure. The only errors ultimately made by a consistent procedure are the inherently unavoidable errors (ε_t) arising from any fundamental randomness or fuzziness in the true relation between X_t and Y_t. Consequently, our focus is on using connectionist networks to obtain learning rules {θ̂_n} consistent for an arbitrary regression function θ₀, and we therefore identify the issue of the "learnability" of a particular class of mappings with the existence of a consistent learning rule for that class. This distinguishes our focus of attention from such standard concept-learnability issues as treated for example by Haussler (1989).

We apply the method of sieves (Grenander, 1981; Geman & Hwang, 1982; White & Wooldridge, 1990), a general nonparametric statistical procedure in which an object of interest θ₀ lying in a general (i.e., not necessarily finite-dimensional) space Θ is approximated using a sequence of parametric models in which the dimensionality of the parameter space grows along with the sample size. To succeed, the approximating parametric models must be capable of arbitrarily accurate approximation to elements of Θ as the underlying parameter space grows. For this reason, Fourier series (e.g., Gallant & Nychka, 1987) and spline functions (e.g., Wahba, 1975; Wahba & Wold, 1975; Cox, 1984) are commonly used in this context. Among others, HSWa,b establish that multilayer feedforward networks also have universal approximation properties. Without this, attempts at nonparametric estimation using these network models would be doomed from the outset.

For concreteness and simplicity, we consider only single output single hidden layer feedforward networks. Our approach generalizes straightforwardly to the multi-output multi-hidden layer case. Specifically, we write the output of a q hidden unit feedforward network given input x as

f^q(x, δ^q) ≡ β₀ + Σ_{j=1}^q β_j ψ(x̃'γ_j),

where δ^q ≡ (β^q', γ^q')' is the p × 1 (p = q(r + 2) + 1) vector of network weights (parameters) (here β^q ≡ (β₀, β₁, ..., β_q)', γ^q ≡ (γ_1', ..., γ_q')', γ_j ≡ (γ_j0, ..., γ_jr)' for j = 1, ..., q), ψ is the (given) hidden unit activation function, and x̃ ≡ (1, x')'. For simplicity and without loss of generality, we put no squashing function at the output unit. Figure 1 depicts the corresponding network architecture.

[FIGURE 1. Single hidden layer feedforward network: inputs X₁, ..., X₄ feed the hidden units through the hidden weights, and the hidden units feed the output through the output weights.]

We construct a sequence of approximations to θ₀ by letting network complexity q grow with n at an appropriate rate, and for given n (hence given q) selecting weights δ̂_n so that θ̂_n ≡ f^{q_n}(·, δ̂_n) provides an approximation to the unknown regression function θ₀ that is the best possible in an appropriate sense, given the sample information.

To formulate precisely a solution to the problem of finding θ̂_n, we define

T(ψ, q, Δ) ≡ {θ ∈ Θ: θ(·) = f^q(·, δ^q), Σ_{j=0}^q |β_j| ≤ Δ, Σ_{j=1}^q Σ_{i=0}^r |γ_ji| ≤ qΔ},

the set of output functions of all single hidden layer feedforward networks with q hidden units having activation function ψ, and with weights satisfying a particular restriction on their sum norm, indexed by Δ. We construct a sequence of "sieves" {Θ_n(ψ)} by specifying sequences {q_n} and {Δ_n} and setting Θ_n(ψ) = T(ψ, q_n, Δ_n), n = 1, 2, .... The sieve Θ_n(ψ) becomes finer (less escapes) as q_n → ∞ and Δ_n → ∞. For given sequences {q_n} and {Δ_n}, the "connectionist sieve estimator" θ̂_n is defined as a solution to the least squares problem (appropriate for learning E(Y_t | X_t))

min_{θ∈Θ_n(ψ)} n^(-1) Σ_{t=1}^n [Y_t − θ(X_t)]², n = 1, 2, ....     (2.1)

Associated with θ̂_n is an estimator δ̂_n of dimension p_n × 1 (p_n ≡ q_n(r + 2) + 1) such that θ̂_n(·) = f^{q_n}(·, δ̂_n). Sections 3 and 4 are devoted to specifying precise conditions on {Z_t}, ψ, {q_n}, and {Δ_n} that ensure the consistency of θ̂_n for θ₀.

Before describing these conditions, we must emphasize that obtaining at least a near global solution to (2.1) is central to this approach. Consequently, optimization methods that deliver only local optima, such as the method of back-propagation or iterative Newton methods, will be inadequate. Instead, global optimization methods (as described by Baba, 1989; or White, 1989a, Section 4.a) are required. This is one source of the advertised computational burden of this approach. For simplicity, we suppose for now that a global solution to (2.1) is available. At the end of this section, we describe the modifications to our set-up that permit treatment of cases in which only approximately optimal solutions are available.

For simplicity throughout we suppose that {Z_t} is a bounded stochastic process, without loss of generality taking values in the hypercube I^v ≡ ×_{i=1}^v [0, 1]. We also assume that θ₀ is a square integrable function on I^r, r = v − 1. The activation function ψ is assumed to be bounded and to satisfy a Lipschitz condition. The Lipschitz condition holds whenever ψ is continuously differentiable with bounded derivative, as for the logistic and hyperbolic tangent squashers commonly used in applications. Non-sigmoid activation functions are also permitted.

We first consider deterministic methods for network complexity growth. We prove that the appropriate choice for Δ_n is such that Δ_n → ∞ as n → ∞ and Δ_n = o(n^(1/4)), that is, n^(-1/4) Δ_n → 0 as n → ∞. The proper choice for {q_n} depends on {Δ_n} and on the dependence properties of {Z_t}. When {Z_t} is an independent identically distributed (i.i.d.) sequence, Theorem 3.3 below establishes that θ̂_n is consistent for θ₀ provided that q_n → ∞ as n → ∞ and q_n Δ_n^4 log q_n Δ_n = o(n). When {Z_t} is a stationary mixing process, Theorem 3.3 establishes that θ̂_n is consistent for θ₀ provided that q_n → ∞ as n → ∞ and q_n Δ_n^4 log q_n Δ_n = o(n^(1/2)).
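
In code, the network output f^q(x, δ^q) and the sum-norm restrictions defining T(ψ, q, Δ) take the following form (a Python sketch using the logistic squasher; the array layout is an illustrative choice):

    import numpy as np

    def psi(a):
        # Logistic squasher: bounded, monotonic, and Lipschitz, so an
        # admissible activation function.
        return 1.0 / (1.0 + np.exp(-a))

    def f_q(x, beta, gamma):
        # Single hidden layer network output f^q(x, delta^q), where beta
        # is (q + 1,) = (beta_0, ..., beta_q) and gamma is (q, r + 1),
        # one row of input-to-hidden weights per hidden unit.
        x_tilde = np.concatenate(([1.0], np.atleast_1d(x)))   # x~ = (1, x')'
        return beta[0] + beta[1:] @ psi(gamma @ x_tilde)

    def in_sieve(beta, gamma, q, Delta):
        # Membership in T(psi, q, Delta): sum_j |beta_j| <= Delta and
        # sum_j sum_i |gamma_ji| <= q * Delta.
        return np.abs(beta).sum() <= Delta and np.abs(gamma).sum() <= q * Delta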

Mixing processes are a class of time series processes that can exhibit considerable short-run dependence, but display a form of asymptotic independence, in that events involving elements of {Z_t} separated by increasingly greater time intervals are increasingly closer to independence. The allowable growth rate for network complexity is slower in the case of mixing processes than for the case of independent processes because new information arrives at a slower rate. (See White, 1984, for a further discussion of mixing processes.)

The allowed growth rates for network complexity are not immediately obvious from the conditions stated. To gain some insight we consider the allowed rates in more detail. First, we see a clear trade-off between the allowed rate of increase of q_n and that of Δ_n: the slower the rate of increase of Δ_n, the greater the permissible rate for q_n. Because the approximation improves as network complexity increases, it is desirable to permit Δ_n to grow at only a modest rate in order to accommodate rapid growth in q_n.

Accordingly, suppose Δ_n = O(log n) (i.e., Δ_n ≤ c log n for some 0 < c < ∞). This satisfies Δ_n = o(n^(1/4)), because n^(-1/4) Δ_n ≤ c n^(-1/4) log n → 0 as n → ∞. For the independent case, we now require q_n Δ_n^4 log q_n Δ_n = o(n) (i.e., n^(-1) q_n Δ_n^4 log q_n Δ_n → 0). With Δ_n = O(log n), it suffices that n^(-1) q_n (log n)^4 log(q_n log n) → 0. We put q_n = O(n^α), α > 0, and seek admissible choices for α; it now suffices that n^(α−1) (log n)^4 (α log n + log log n) → 0. Because the parenthesized term is O(log n), we need n^(α−1) (log n)^5 → 0. Because n^(-ε) (log n)^5 → 0 for any ε > 0, we can take α = 1 − ε for any small ε > 0. Thus, q_n = O(n^(1−ε)) and Δ_n = O(log n) satisfy the required conditions for independent observations. Similarly, for mixing processes q_n = O(n^((1−ε)/2)) and Δ_n = O(log n) suffice.

These allowed growth rates for network complexity q_n are fairly rapid, as Table 1 illustrates. The entries are scaled so that unit complexity is achieved at n = 100. Thus, if five hidden units are appropriate in the independent case when n = 100, 477 hidden units can be supported by a sample of 10,000 and 45,600 hidden units can be supported by a sample of one million. With mixing processes, allowable growth rates are slower. If five hidden units are appropriate in the mixing case when n = 100, then 49 hidden units can be supported by a sample of 10,000. A sample of one million supports 477 hidden units in this case. Even though the growth rates in the mixing case appear dramatically slower than for the independent case, they are still rather generous.

TABLE 1
Relative Network Complexity for Consistency as a Function of Sample Size, ε = .01

    n               10^2     10^4        10^6        10^8        10^10
    n^(1−ε)         1        95.50       9.12 × 10^3 8.70 × 10^5 8.32 × 10^7
    n^((1−ε)/2)     1        9.77        95.50       9.33 × 10^2 9.12 × 10^3
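
The entries of Table 1 follow directly from these rates; a Python sketch reproducing them (ε = .01, entries normalized to unit complexity at n = 100):

    # Relative network complexity supported at sample size n, scaled so
    # that complexity equals 1 at n = 100 (as in Table 1), with eps = .01.
    eps = 0.01
    for n in [1e2, 1e4, 1e6, 1e8, 1e10]:
        iid = (n / 1e2) ** (1 - eps)        # independent case: q_n = O(n^(1 - eps))
        mix = (n / 1e2) ** ((1 - eps) / 2)  # mixing case: q_n = O(n^((1 - eps)/2))
        print(f"n = {n:.0e}: independent {iid:12.2f}, mixing {mix:10.2f}")
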
These relatively rapid growth rates for network complexity are achieved by taking Δ_n to grow at the modest rate of log n. It is helpful to explore the implications of this choice. Recall that we require Σ_{j=0}^{q_n} |β_j| ≤ Δ_n and Σ_{j=1}^{q_n} Σ_{i=0}^r |γ_ji| ≤ q_n Δ_n. For the latter condition, it suffices that Σ_{i=0}^r |γ_ji| ≤ Δ_n for each j. Because r is fixed and Δ_n ≤ c log n will increase with n, the restrictions imposed on the γ_j are quite mild. (In fact, there is even room for letting the number of inputs r increase with n, so that greater amounts of information can be exploited as n → ∞ in attempting to hit the target; we leave this to other work for the sake of simplicity.) Because of the mildness of the restrictions on the γ_j, we do not consider them further.

The force of our condition is borne by the β_j. We require that Σ_{j=0}^{q_n} |β_j| ≤ c log n for some 0 < c < ∞. With q_n increasing at a rate only slightly slower than n, this inequality can be satisfied only if an increasing proportion of the β_j's are close to zero. An elementary calculation makes this precise. Because

Σ_{j=2}^{q_n} (j − 1)^(-1) ≤ 1 + ∫_1^{q_n − 1} x^(-1) dx = 1 + log(q_n − 1) ≤ 1 + log q_n,

we have Σ_{j=2}^{q_n} (j − 1)^(-1) = O(log q_n). When q_n = O(n^(1−ε)), we obtain from this inequality that Σ_{j=2}^{q_n} (j − 1)^(-1) = O(log n). Hence, for Σ_{j=0}^{q_n} |β_j| = O(log n) it suffices that |β_j| = O(j^(-1)). When j is large, |β_j| will be decreasing at the rate j^(-1) under this condition; this is a rather rapid approach to zero.

If less stringent restrictions are desired for Σ_{j=0}^{q_n} |β_j|, they can be achieved by trading off restrictions on Σ_{j=0}^{q_n} |β_j| for restrictions on q_n. For example, if restrictions analogous to those encountered for the γ_j in the previous example are desired, we could set Δ_n ∝ q_n log n. This choice is compatible with q_n = O(n^(1/5−ε)) for the independent case or q_n = O(n^(1/10−ε)) for the mixing case, as some algebra will verify. These are much slower growth rates than in the previous example.

Useful as they are, these growth rate results provide no practical method for choosing network complexity for a sample of given size n. One appealing method for making this choice is the method of cross-validation (Stone, 1974). The rationale for this method is straightforward. First, consider a naive approach. Suppose {Δ_n} is given, and put Θ_n(ψ, q) = T(ψ, q, Δ_n).

For given n, network performance with q hidden units can be (naively) measured by the smallest attainable average squared error over the training set,

min_{θ∈Θ_n(ψ,q)} n^(-1) Σ_{t=1}^n [Y_t − θ(X_t)]².

Unfortunately, this measure is over-optimistic: it is downward biased (as a measure of minimum risk, min_{θ∈Θ_n(ψ,q)} E([Y_t − θ(X_t)]²)) and is increasingly downward biased as q increases. The bias arises because the observation for (Y_t, X_t')' is used in arriving at θ̂^q_n (say), the solution to the above minimization problem. This makes the contribution to squared error of the tth observation ([Y_t − θ̂^q_n(X_t)]²) too small. Indeed, if network complexity were selected by choosing q to make average squared error as small as possible, average squared error could be driven to zero by choosing q = n, a clearly inappropriate choice.

The method of cross-validation effectively avoids these difficulties. The "delete one" cross-validated average squared error averages an estimate of squared error for each observation (say observation t) that uses an estimator for θ₀ (say θ̂^q_{n(t)}) that ignores the information contributed by that (tth) observation. (For example, a "brute force" choice for θ̂^q_{n(t)} can be obtained as the solution to the problem min_{θ∈Θ_n(ψ,q)} n^(-1) Σ_{s≠t} [Y_s − θ(X_s)]².) The cross-validated performance measure is formally defined as

C_n(q) = n^(-1) Σ_{t=1}^n [Y_t − θ̂^q_{n(t)}(X_t)]².

Because θ̂^q_{n(t)} ignores information from the tth observation, [Y_t − θ̂^q_{n(t)}(X_t)]² is a measure of out-of-sample performance and is thus less biased than [Y_t − θ̂^q_n(X_t)]², an in-sample measure. The cross-validated average squared error inherits this relative lack of bias by construction and consequently provides a measure of network performance superior to average squared error. Because of its relative unbiasedness, we recommend its general use.

A completely automatic method for determining network complexity appropriate for any specific application is given by choosing the number of hidden units q̂_n to be the smallest solution to the problem

min_{q∈N_n} C_n(q),

where N_n is some appropriate choice set, a subset of N ≡ {1, 2, ...}. For the moment, we can suppose that N_n = {1, 2, ..., n}. The choice q̂_n will be called the "cross-validated complexity" of the network. The "cross-validated connectionist sieve estimator" is then the solution θ̂_n to the problem

min_{θ∈Θ_n(ψ,q̂_n)} n^(-1) Σ_{t=1}^n [Y_t − θ(X_t)]².
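
A brute force rendering of this procedure in Python (a sketch only: `fit_network` stands for a hypothetical routine returning a global least squares solution over Θ_n(ψ, q), which in practice requires the global optimization methods discussed above):

    import numpy as np

    def cv_mse(X, Y, q, fit_network):
        # "Delete one" cross-validated average squared error C_n(q): for
        # each t, refit on the sample with observation t removed and score
        # the held-out point.
        n = len(Y)
        errs = np.empty(n)
        for t in range(n):
            keep = np.arange(n) != t
            theta_t = fit_network(X[keep], Y[keep], q)   # ignores observation t
            errs[t] = (Y[t] - theta_t(X[t])) ** 2
        return errs.mean()

    def cross_validated_complexity(X, Y, choice_set, fit_network):
        # Smallest minimizer of C_n(q) over the choice set N_n.
        scores = [cv_mse(X, Y, q, fit_network) for q in choice_set]
        return choice_set[int(np.argmin(scores))]        # argmin returns first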

The method of cross-validation has proven widely successful in other statistical applications; we may reasonably expect it or some variant (e.g., White, 1989b) to be similarly successful here.

To ensure the consistency of θ̂_n for θ₀, we cannot choose {Δ_n} and {N_n} arbitrarily. In both the independent and mixing cases, it is again appropriate to take Δ_n = o(n^(1/4)). As a crude but expedient approach we ensure consistency here by taking N_n = {q̲_n, ..., q̄_n}, where q̲_n → ∞ and q̄_n satisfies the rate conditions for q_n discussed earlier. These conditions control q̂_n in just the right way to ensure that Θ_n(ψ, q̂_n) eventually fills Θ appropriately. It may be possible to remove the controls on q̲_n without affecting consistency, but we leave this to other work.

In applications it is convenient to pick q̲_n = λq̄_n for some λ < 1. All of the discussion regarding {q_n} and {Δ_n} for the deterministically determined sieves {Θ_n(ψ)} then applies directly to {q̄_n} and {Δ_n}. Applications entail optimization of C_n(q) over N_n = [λq̄_n, q̄_n] ∩ N. An advantage of choosing N_n in this way is that it can greatly reduce the range of values that must be considered for q. Even so, the range of N_n may still be large. Exhaustive search over N_n is thus likely to be expensive; more sophisticated univariate optimization methods are available (e.g., Scales, 1982).

A disadvantage of these procedures is that they are not fully automatic, as we require values for auxiliary control parameters Δ_n, q̄_n and either q̲_n or λ. For this reason, we call these consistent procedures "semi-automatic." Nevertheless, it is typically straightforward to pick intuitively plausible values for these control parameters for a given problem with given n. Given these, values for all other n can be readily computed (e.g., using the numbers of Table 1 or numbers similarly constructed).

In practice, full optimization of network performance over the sample can be difficult or impossible to attain. Instead, estimated weights will be more or less optimal depending on the intensity of computational effort. Results analogous to the consistency results just described continue to hold when approximate rather than exact optimization is carried out, although the strength of the results varies with the assumed degree of computational effort, as one should expect. Our analysis is conducted by letting the data determine an appropriate (e.g., approximately optimal) degree of network complexity, say q̃_n. As before, we require q̲_n ≤ q̃_n ≤ q̄_n. For this choice of complexity, again let the fully optimal solution to the least squares problem be denoted θ̂_n, so that θ̂_n now solves

min_{θ∈Θ_n(ψ,q̃_n)} n^(-1) Σ_{t=1}^n [Y_t − θ(X_t)]².

Instead of obtaining θ̂_n, we suppose that we obtain an approximate solution θ̃_n to this problem, characterized by the closeness of network performance at θ̃_n to optimal network performance. Specifically, given a "tolerance level" ζ_n (some positive number), we assume that θ̃_n is chosen so that

n^(-1) Σ_{t=1}^n [Y_t − θ̃_n(X_t)]² − n^(-1) Σ_{t=1}^n [Y_t − θ̂_n(X_t)]² ≤ ζ_n.

The smaller is ζ_n, the closer is θ̃_n to being an exact solution to the network learning problem, and, in general, the more the computational effort required to obtain θ̃_n. When ζ_n is large, we can afford to be more sloppy in finding θ̃_n.

To obtain our results, we suppose that a sequence {ζ_n} of tolerances is specified, together with a limit ζ₀ to which the tolerances tend as n → ∞. The limiting tolerance ζ₀ may be zero, in which case we demand an exact global solution in the limit. In this case, we can again establish the consistency of θ̃_n for θ₀. Alternatively, the limiting tolerance may be nonzero; for example, we could require only a fixed tolerance ζ_n = ζ₀ > 0 for all n. In this case, consistency of θ̃_n for θ₀ cannot be guaranteed (so that learning is incomplete), but network performance in the limit is adversely affected to a precise extent governed by ζ₀. The conditions on {Z_t}, ψ, {q_n}, {q̃_n}, and {Δ_n} under which these results hold are identical to those previously described.

To summarize, multilayer feedforward networks can learn arbitrarily accurate approximations to unknown functions in the sense that they admit a learning rule possessing the consistency property. To ensure consistency, network complexity must be properly controlled; this control can be data-driven. Exact optimization for each n is not necessary; approximate optimization is permitted. However, consistency is guaranteed only when exact optimization occurs in the limit.

3. CONSISTENCY OF CONNECTIONIST SIEVE ESTIMATORS

We now state formal results establishing the consistency of connectionist sieve estimators. We first treat deterministic choice of connectionist sieves {Θ_n}, and then treat the cross-validated connectionist sieve estimators, assuming exact optimization. We next give results assuming approximate optimization. Our results follow from more general statements given in the next section.

The first assumption describes the data generating process.

ASSUMPTION A.1. Let (Ω, F, P) be a complete probability space. The observed data are the realization of a stochastic process {Z_t: Ω → I^v, v ∈ N, t = 1, 2, ...} on (Ω, F, P), and P is such that either
(i) {Z_t} is i.i.d.; or
(ii) {Z_t} is a stationary mixing process with either φ(k) = φ₀ρ₀^k or α(k) = α₀ρ₀^k, φ₀, α₀ > 0, 0 < ρ₀ < 1. (Formal definitions of φ and α are given in the next section.) □

For convenience, we assume that Z_t takes values in the unit cube. When {Z_t} is a bounded sequence, this can always be ensured by appropriate scaling and shifting. This assumption ensures the existence of E(Y_t) and hence of E(Y_t | X_t), where as before we partition Z_t as Z_t ≡ (Y_t, X_t')', with Y_t a scalar random variable for simplicity. Let B^v ≡ B[I^v] denote the Borel σ-field generated by the open sets of I^v. An element A of B^r is a subset of I^r for which the probability P[X_t ∈ A] is defined. Define the measure μ as μ(A) ≡ P[X_t ∈ A] for each A in B^r; μ measures the relative frequency with which patterns X_t belong to the set A. Let L²(I^r, μ) denote the collection of all measurable functions θ: I^r → R such that ∫_{I^r} θ(x)² μ(dx) < ∞ (the square integrable functions), and define the metric ρ₂(θ₁, θ₂) ≡ [∫_{I^r} (θ₁(x) − θ₂(x))² μ(dx)]^(1/2) for θ₁, θ₂ ∈ L²(I^r, μ). The metric ρ₂ measures the distance between two functions θ₁ and θ₂ belonging to L²(I^r, μ) in terms of weighted root mean squared error, where the weighting is with respect to μ. Two functions θ₁ and θ₂ differing on a set of μ-measure zero will have ρ₂(θ₁, θ₂) = 0. We shall not distinguish between such functions; technically, functions differing only on a set of μ-measure zero are taken as forming an equivalence class. Uniqueness of functions of L²(I^r, μ) is with respect to these equivalence classes.

The space (L²(I^r, μ), ρ₂) is a complete separable metric space. We take the object of interest to be the function θ₀: I^r → R such that E(Y_t | X_t) = θ₀(X_t), and we impose the following structure on it.

ASSUMPTION A.2. With (Θ, ρ) = (L²(I^r, μ), ρ₂), θ₀ is the unique element of Θ such that E(Y_t | X_t) = θ₀(X_t). □

This structure is convenient and plausible in many applications.
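
Because μ is simply the distribution of the patterns X_t, ρ₂ can be approximated by averaging over draws from μ; a minimal Python sketch (`draw_X` is a hypothetical stand-in for whatever generates patterns in a given application):

    import numpy as np

    def rho_2(theta_1, theta_2, draw_X, m=100_000, seed=0):
        # Monte Carlo estimate of rho_2(theta_1, theta_2), the weighted root
        # mean squared error metric: E_mu[(theta_1(X) - theta_2(X))^2]^(1/2).
        rng = np.random.default_rng(seed)
        X = draw_X(rng, m)                   # m i.i.d. patterns drawn from mu
        return np.sqrt(np.mean((theta_1(X) - theta_2(X)) ** 2))
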
Next we formally define the connectionist sieve used to approximate elements of Θ.

DEFINITION 3.1. Let ψ: R → R be a given bounded function. For any q ∈ N and Δ ∈ R⁺, define a subset T(ψ, q, Δ) of Θ as

T(ψ, q, Δ) ≡ {θ ∈ Θ: θ(x) = β₀ + Σ_{j=1}^q β_j ψ(x̃'γ_j), Σ_{j=0}^q |β_j| ≤ Δ, Σ_{j=1}^q Σ_{i=0}^r |γ_ji| ≤ qΔ, x̃ ≡ (1, x')'}. □



For given ψ and sequences {q_n}, {Δ_n}, define the sequence of (single hidden layer) connectionist sieves {Θ_n(ψ)} as

Θ_n(ψ) ≡ T(ψ, q_n, Δ_n), n = 1, 2, ....

This definition implies that when θ ∈ Θ_n(ψ), we have θ(x) = f^{q_n}(x, δ^{q_n}) in the notation of section 2. We consider only single hidden layer connectionist sieves for notational simplicity. The results of section 4 also apply to feedforward networks with an arbitrary number of hidden layers; the results of this section can be extended to such cases.

To obtain the consistency result, we must place specific conditions on ψ, q_n and Δ_n. A necessary condition for consistency is that U_{n=1}^∞ Θ_n(ψ) is ρ-dense in Θ (i.e., U_{n=1}^∞ Θ_n(ψ) contains an element θ as close as we like to any element θ* of Θ, with distance measured by ρ). The universal approximation results for single hidden layer feedforward networks established by HSWa,b make it possible to specify conditions on ψ ensuring the required denseness property. To state these conditions, we first define some terms. We say that ψ is a squashing function if ψ: R → [0, 1], ψ(a) → 0 as a → −∞, ψ(a) → 1 as a → ∞, and ψ is monotonic. (In probability theoretic terms, ψ is a cumulative distribution function.) We say that ψ is l-finite if ψ: R → R is continuously differentiable of order l (0 ≤ l < ∞) and 0 < ∫|D^l ψ(x)| dx < ∞, where D^l ψ denotes the lth derivative of ψ. (D⁰ψ ≡ ψ; ψ is assumed continuous when l = 0.) If ψ is a differentiable squashing function, then ψ is 1-finite, but not 0-finite. A continuous nondifferentiable squashing function is not l-finite for any l.

Conditions on ψ, q_n and Δ_n ensuring the required denseness are given by the next result.

LEMMA 3.2. Let {q_n ∈ N} and {Δ_n ∈ R⁺} be increasing sequences tending to infinity with n. If ψ is a squashing function or if ψ is l-finite for any integer l, 0 ≤ l < ∞, then U_{n=1}^∞ Θ_n(ψ) is ρ-dense (ρ = ρ₂) in Θ = L²(I^r, μ). □

Lemma 3.2 ensures that by proper choice of ψ, q_n, and Δ_n we can make Θ_n(ψ) sufficiently "big," avoiding underfitting. To avoid overfitting, Θ_n(ψ) must be prevented from becoming too big too fast. This will require restrictions on q_n and Δ_n. The derivation of these restrictions is rather technical, and is postponed to the next section. A further restriction on ψ is also helpful. We say that ψ satisfies a Lipschitz condition if |ψ(a₁) − ψ(a₂)| ≤ L|a₁ − a₂| for all a₁, a₂ ∈ R and some L ∈ R⁺. For convenience, let L denote the set of all activation functions ψ: R → R such that ψ is bounded, satisfies a Lipschitz condition, and is either a squashing function or is l-finite, 0 ≤ l < ∞. The following condition ensures that Θ_n(ψ) does not permit overfitting in the limit.

ASSUMPTION A.3. Θ_n(ψ) = T(ψ, q_n, Δ_n), where ψ ∈ L, and {q_n} and {Δ_n} are such that q_n and Δ_n are increasing with n, q_n → ∞ and Δ_n → ∞ as n → ∞, Δ_n = o(n^(1/4)), and either (i) q_n Δ_n^4 log q_n Δ_n = o(n) or (ii) q_n Δ_n^4 log q_n Δ_n = o(n^(1/2)). □

We can now state the desired consistency result.

THEOREM 3.3. Given Assumptions A.1(i), A.2 and A.3(i), or A.1(ii), A.2, and A.3(ii), there exists a measurable connectionist sieve estimator θ̂_n: Ω → Θ such that

n^(-1) Σ_{t=1}^n [Y_t − θ̂_n(X_t)]² = min_{θ∈Θ_n(ψ)} n^(-1) Σ_{t=1}^n [Y_t − θ(X_t)]², n = 1, 2, ....

Further, ρ(θ̂_n, θ₀) → 0 in probability (i.e., for all ε > 0, P[ω ∈ Ω: ρ(θ̂_n(ω), θ₀) > ε] → 0 as n → ∞). □

This theorem delivers the fundamental result that connectionist network models are capable of consistently estimating (learning) an arbitrary conditional expectation function θ₀, an element of the class of functions square integrable on I^r. The result gives precise specifications for the growth in network complexity sufficient to achieve consistency, although as we saw in section 2 these specifications still permit a wide latitude for choice of network complexity. The consistency result holds for a wide variety of activation functions, such as the logistic or hyperbolic tangent squashers, choices common in practice.

Consistency of cross-validation-based estimates for θ₀ obtains under an appropriate modification of Assumption A.3.

ASSUMPTION B.3. Let Θ̲_n(ψ) = T(ψ, q̲_n, Δ_n) and Θ̄_n(ψ) = T(ψ, q̄_n, Δ_n), where ψ ∈ L, and suppose that {q̲_n}, {q̄_n} and {Δ_n} are such that q̲_n ≤ q̄_n, {q̲_n}, {q̄_n} and {Δ_n} are increasing, q̲_n → ∞ and Δ_n → ∞ as n → ∞, Δ_n = o(n^(1/4)), and either (i) q̄_n Δ_n^4 log q̄_n Δ_n = o(n) or (ii) q̄_n Δ_n^4 log q̄_n Δ_n = o(n^(1/2)). □

The consistency result for the cross-validated sieve estimator is a straightforward consequence of Theorems 4.2 and 4.1 of the next section, provided that Θ̂_n(ψ) ≡ Θ_n(ψ, q̂_n) = T(ψ, q̂_n, Δ_n) has an analytic graph. This demonstration is straightforward, and is relegated to the Appendix. We have the following result.

THEOREM 3.4. Given Assumptions A.1(i), A.2 and B.3(i), or A.1(ii), A.2 and B.3(ii), there exists a measurable cross-validated sieve estimator θ̂_n: Ω → Θ such that

n^(-1) Σ_{t=1}^n [Y_t − θ̂_n(X_t)]² = min_{θ∈Θ̂_n(ψ)} n^(-1) Σ_{t=1}^n [Y_t − θ(X_t)]², n = 1, 2, ...,



where Θ̂_n(ψ) = Θ_n(ψ, q̂_n) and q̂_n is the cross-validated network complexity given in section 2, based on θ̂^q_{n(t)} measurable-F/B(Θ), t = 1, ..., n, q = q̲_n, ..., q̄_n, n = 1, 2, .... Further, ρ(θ̂_n, θ₀) → 0 in probability. □

The data-driven cross-validation procedure for determination of network complexity thus delivers a consistent estimator for θ₀ (i.e., a learning procedure capable of eventually approximating an arbitrary conditional expectation to any desired level of accuracy with probability arbitrarily close to one).

To obtain results for the case of approximate optimization, we impose the following condition.

ASSUMPTION B.4. Let q̃_n: Ω → {q̲_n, ..., q̄_n} be a measurable mapping, n = 1, 2, ..., and with θ̂_n: Ω → Θ̄_n denoting a measurable solution to the problem min_{θ∈Θ_n(ψ,q̃_n)} n^(-1) Σ_{t=1}^n [Y_t − θ(X_t)]², let θ̃_n: Ω → Θ̄_n be a measurable mapping such that θ̃_n(ω) ∈ Θ_n(ψ, q̃_n(ω)) for each ω in Ω and

n^(-1) Σ_{t=1}^n [Y_t − θ̃_n(X_t)]² − n^(-1) Σ_{t=1}^n [Y_t − θ̂_n(X_t)]² ≤ ζ_n,

n = 1, 2, ..., where {ζ_n} is a sequence of positive real numbers such that ζ_n → ζ₀ ≥ 0 as n → ∞. □

We define the set Θ*(ε) ≡ {θ ∈ Θ: |E([Y_t − θ(X_t)]²) − E([Y_t − θ₀(X_t)]²)| = ρ(θ, θ₀)² ≤ ε}. The desired result is

THEOREM 3.5. Given Assumptions A.1(i), A.2, B.3(i), and B.4, or A.1(ii), A.2, B.3(ii), and B.4, for any ε > 0 we have P[θ̃_n ∈ Θ*(ζ₀ + ε)] → 1 as n → ∞. If in addition ζ₀ = 0, then ρ(θ̃_n, θ₀) → 0 in probability. □

The final conclusion delivers the consistency of θ̃_n for θ₀ when optimization becomes exact in the limit (ζ₀ = 0). The first conclusion provides a precise statement of the effects of approximate optimization. In the limit, the approximate estimator belongs to a neighborhood of the true mapping θ₀ with probability approaching unity. The smaller is ζ₀, the smaller is the neighborhood.
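
The tolerance criterion of Assumption B.4 is simple to operationalize: a candidate network is accepted once its average squared error is within ζ_n of the best value attained. A Python sketch (in practice the exactly optimal θ̂_n is unavailable, so the best solution found by the global search plays its role):

    import numpy as np

    def avg_sq_err(theta, X, Y):
        # In-sample average squared error n^{-1} sum_t [Y_t - theta(X_t)]^2.
        return np.mean((Y - theta(X)) ** 2)

    def within_tolerance(theta_approx, theta_best, X, Y, zeta_n):
        # Assumption B.4: accept theta_approx once its average squared error
        # exceeds that of the (near) optimal theta_best by at most zeta_n.
        return avg_sq_err(theta_approx, X, Y) - avg_sq_err(theta_best, X, Y) <= zeta_n
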
4. THEORETICAL FOUNDATIONS

This section contains theoretical results underlying those of the previous section. A variety of other network learning results can be obtained from the results given here. Our results follow as a corollary to Theorem 2.1 of White and Wooldridge (1990) (WW). For simplicity, the result here is stated so as to be easily applied to stationary stochastic processes. Its validity is not limited to this case, however. Where possible, notation follows that of WW. Other notation and definitions are as in Stinchcombe and White (1989b) (SW). We write B(·) to denote the Borel σ-field generated by the open sets of the argument set, gr(·) denotes the graph of the indicated correspondence, and A(·) is the collection of analytic sets of the indicated σ-field.

THEOREM 4.1. (a) Let (Ω, F, P) be a complete probability space and let (Θ, ρ) be a metric space. For n = 1, 2, ..., let Θ̄_n be a complete separable Borel subset of Θ and let Θ_n: Ω → Θ be a correspondence with gr Θ_n ∈ A(F ⊗ B(Θ̄_n)) such that for each ω in Ω, Θ_n(ω) ⊆ Θ̄_n, and the set Θ_n(ω) is nonempty and compact. Let Q_n: Ω × Θ → R̄ be F ⊗ B(Θ)-measurable, and suppose that Q_n(ω, ·) is lower semicontinuous on Θ_n for each ω in Ω, n = 1, 2, ....

Then for each n = 1, 2, ... there exists a function θ̂_n: Ω → Θ̄_n measurable-F/B(Θ̄_n) (hence -F/B(Θ)) such that Q_n(ω, θ̂_n(ω)) = min_{θ∈Θ_n(ω)} Q_n(ω, θ) for all ω in Ω.

(b) In addition, suppose {Θ̲_n} and {Θ̄_n} are increasing sequences of compact subsets of Θ such that U_{n=1}^∞ Θ̲_n is dense in Θ and Θ̲_n ⊆ Θ_n(ω) ⊆ Θ̄_n for all ω in Ω, n = 1, 2, .... Suppose there exists a function Q̄: Θ → R such that for all ε > 0

P*[ω: sup_{θ∈Θ̄_n} |Q_n(ω, θ) − Q̄(θ)| > ε] → 0 as n → ∞,     (4.1)

and for θ₀ ∈ Θ

inf_{θ∈η^c(θ₀,ε)} Q̄(θ) − Q̄(θ₀) > 0,     (4.2)

where η^c(θ₀, ε) ≡ {θ ∈ Θ: ρ(θ, θ₀) ≥ ε}, and Q̄ is continuous at θ₀. Then ρ(θ̂_n, θ₀) → 0 in probability. □

In our application, (Ω, F, P) is the space on which the stochastic process {Z_t} is defined. The properties of P determine whether {Z_t} is an independent or a mixing sequence. The space Θ contains the object of interest (the unknown regression function θ₀), and ρ is a metric that measures distance in this space (weighted root mean squared error). Q_n is the criterion function (average squared error) optimized to arrive at an estimator θ̂_n. The set Θ_n(ω) over which optimization is carried out may depend on the data through ω; this permits treatment of cross-validation procedures. In some applications it may be natural to have Q_n defined only on the graph of Θ_n(ω) or on Ω × Θ̄_n, instead of on all of Ω × Θ as we assume. Lemma 2.1 of SW establishes that defining Q_n on Ω × Θ results in no loss of generality, as there generally exists an appropriate measurable extension to Ω × Θ of Q_n originally defined on gr Θ_n or Ω × Θ̄_n.

Part (a) establishes the existence of a measurable estimator θ̂_n.

Without measurability we cannot make probability statements about θ̂_n, such as statements about consistency. Part (b) establishes consistency of θ̂_n for θ₀. The object θ₀ is distinguished by its role as minimizer of Q̄ (condition (4.2)), the limit to which Q_n converges uniformly (condition (4.1)). This uniform convergence can be verified in particular stochastic contexts; our next result permits this for i.i.d. and stationary mixing processes. The other notable assumption of part (b) is the existence of nonstochastic sets Θ̲_n and Θ̄_n bounding Θ_n(ω). The behavior of Θ̲_n ensures that Θ_n(ω) becomes sufficiently dense in Θ, so that an element of Θ_n(ω) can well approximate an element of Θ. The behavior of Θ̄_n ensures that Θ_n(ω) does not increase too fast and that Q_n converges to Q̄ uniformly in an appropriate sense. The constants q̲_n and q̄_n of the previous section determine Θ̲_n and Θ̄_n in our application.

We now give a result permitting verification of condition (4.1), related to Corollary 2 of Haussler (1989). Instead of using the concept of V-C dimension (Vapnik & Chervonenkis, 1971) we use the concept of metric entropy (Kolmogorov & Tihomirov, 1961; see also Lorentz, 1966). The metric entropy of a closed set K, denoted H(ε), is the logarithm of the number of sets of radius ε (with respect to a specified metric, not necessarily the same as that in Theorem 4.1) required to cover K. We also make use of exponential inequalities available for both independent and dependent stochastic processes. To specify precisely the allowed dependence, we utilize mixing measures of stochastic dependence, in particular, uniform (φ-) and strong (α-) mixing. These are defined as

φ(k) ≡ sup_t sup_{A∈F_1^t, B∈F_{t+k}^∞: P(A)>0} |P(B | A) − P(B)|,
α(k) ≡ sup_t sup_{A∈F_1^t, B∈F_{t+k}^∞} |P(A ∩ B) − P(A)P(B)|,

where F_1^t ≡ σ(Z_1, ..., Z_t) is the σ-field generated by {Z_1, ..., Z_t}, and F_{t+k}^∞ ≡ σ(Z_{t+k}, ...) is the σ-field generated by {Z_{t+k}, Z_{t+k+1}, ...}. For a discussion of φ(k) and α(k) and the properties of mixing processes {Z_t} (i.e., processes for which φ(k) → 0 or α(k) → 0 as k → ∞), we refer to White (1984). For our next result, the metric ρ need not be identical to that of Theorem 4.1.

THEOREM 4.2. Let (Ω, F, P) be a complete probability space, let (Θ, ρ) be a metric space, and let {Θ̄_n} be a (nonstochastic) increasing sequence of separable subsets of Θ. Let H_n(ε) be the metric entropy of Θ̄_n, and put G_n(ε) ≡ exp H_n(ε).

Let {s_n: I^v × Θ̄_n → R, n = 1, 2, ...} and {m_n: I^v × Θ̄_n → R⁺, n = 1, 2, ...} be sequences of functions such that s_n(·, θ) and m_n(·, θ) are continuous on I^v for each θ in Θ̄_n. Suppose there exists a sequence {d_n: Θ̄_n → R⁺} and a constant λ > 0 such that for each z ∈ I^v and θ⁰ in Θ̄_n

|s_n(z, θ) − s_n(z, θ⁰)| ≤ m_n(z, θ⁰) ρ(θ, θ⁰)^λ

for all θ in η_n(θ⁰) ≡ {θ ∈ Θ̄_n: ρ(θ, θ⁰) < d_n(θ⁰)}. Put s̄_n ≥ sup_{z∈I^v} sup_{θ∈Θ̄_n} |s_n(z, θ)| and m̄_n ≥ sup_{z∈I^v} sup_{θ∈Θ̄_n} m_n(z, θ). Put d̄_n ≡ inf_{θ∈Θ̄_n} d_n(θ)^λ, and suppose m̄_n ≥ d̄_n^(-1).

Let {Z_t: Ω → I^v} be a stochastic process on (Ω, F, P). (i) If {Z_t} is an i.i.d. sequence, then for any ε > 0 and for all n sufficiently large

P[sup_{θ∈Θ̄_n} |n^(-1) Σ_{t=1}^n [s_n(Z_t, θ) − E(s_n(Z_t, θ))]| > ε]
    ≤ 2 G_n([ε/6m̄_n]^(1/λ)) [exp(−3n/7) + exp(−ε²n/(s̄_n²[18 + 4ε]))].

If in addition {Θ̄_n} is such that n^(-1) s̄_n² → 0 as n → ∞ and for all ε > 0

(s̄_n²/n) H_n([ε/6m̄_n]^(1/λ)) → 0 as n → ∞,     (4.3)

then for all ε > 0

P[sup_{θ∈Θ̄_n} |n^(-1) Σ_{t=1}^n [s_n(Z_t, θ) − E(s_n(Z_t, θ))]| > ε] → 0 as n → ∞.

(ii) If {Z_t} is a stationary mixing process with either φ(k) = φ₀ρ₀^k or α(k) = α₀ρ₀^k, φ₀, α₀ > 0, 0 < ρ₀ < 1, k ≥ 1, then there exist constants 0 < c₁, c₂ < ∞ not depending on n such that for any ε > 0 and all n sufficiently large

P[sup_{θ∈Θ̄_n} |n^(-1) Σ_{t=1}^n [s_n(Z_t, θ) − E(s_n(Z_t, θ))]| > ε]
    ≤ c₁ G_n([ε/6m̄_n]^(1/λ)) [exp(−c₂ε²n^(1/2)) + exp(−c₂εn^(1/2)/s̄_n)].

If in addition {Θ̄_n} is such that n^(-1) s̄_n² → 0 as n → ∞ and for all ε > 0

(s̄_n²/n^(1/2)) H_n([ε/6m̄_n]^(1/λ)) → 0 as n → ∞,     (4.4)

then for any ε > 0

P[sup_{θ∈Θ̄_n} |n^(-1) Σ_{t=1}^n [s_n(Z_t, θ) − E(s_n(Z_t, θ))]| > ε] → 0 as n → ∞. □

Except for the metric entropy conditions (4.3) or (4.4), the conditions of this result are straightforward to verify in applications. The metric entropy H_n(ε) is merely a little tedious to determine in most applications; our next result provides bounds for our application. Given H_n(ε), choosing Θ̄_n appropriately is fairly direct, allowing verification of condition (4.1) with Q_n(·, θ) = n^(-1) Σ_{t=1}^n s_n(Z_t, θ).

A (loose) bound for the metric entropy of single hidden layer feedforward networks is given by the following result.

LEMMA 4.3. Let ψ ∈ L, q ∈ N and Δ ∈ R⁺ be given. Let ρ_∞ denote the uniform metric, ρ_∞(θ₁, θ₂) ≡ sup_{x∈I^r} |θ₁(x) − θ₂(x)|, θ₁, θ₂ ∈ T(ψ, q, Δ), and let H(ψ, q, Δ) denote the metric entropy of T(ψ, q, Δ) with respect to ρ_∞. Then for all ε > 0 sufficiently small,

H(ψ, q, Δ) ≤ p log(8/ε) + p log[Δ + rLΔ] + p log(4Δ),

where p ≡ q(r + 2) + 1. □

To apply Theorem 4.2 to connectionist regression, we must specify Q_n, s_n and m_n; from these the s̄_n and m̄_n needed for (4.3) and (4.4) follow. The method of least squares is implemented by taking

Q_n(·, θ) = n^(-1) Σ_{t=1}^n [Y_t − θ(X_t)]²,

so that s_n(z, θ) = [y − θ(x)]². Because a² − b² = (a + b)(a − b), we have

|s_n(z, θ) − s_n(z, θ⁰)| ≤ |2y − (θ(x) + θ⁰(x))| · |θ(x) − θ⁰(x)| ≤ (sup_x 2|y − θ(x)|) ρ_∞(θ, θ⁰),

so we take m_n(z, θ⁰) = sup_{θ∈η_n(θ⁰)} 2|y − θ(x)|, λ = 1 and d_n(θ) = 1. Without loss of generality, we may take sup_{x∈I^r} |θ(x)| ≥ 1 (this supremum is finite because the Lipschitz condition on ψ ensures the continuity of θ), so that

sup_z m_n(z, θ) ≤ sup_z sup_θ 2|y| + 2|θ(x)| ≤ sup_z sup_θ 4|θ(x)|,

as |y| ≤ 1. As ψ is assumed bounded, we may take the bound to be unity, without loss of generality. We then put m̄_n = 4Δ_n ≥ sup_{z∈I^v} sup_{θ∈Θ̄_n(ψ)} 4|θ(x)|. Similarly, using

[y − θ(x)]² ≤ y² + 2|y| · |θ(x)| + θ(x)² ≤ sup_x 4θ(x)²,

we take s̄_n = 4Δ_n² ≥ sup_{z∈I^v} sup_{θ∈Θ̄_n(ψ)} 4θ(x)².

Provided that the metric entropy conditions (4.3) and (4.4) hold, we obtain from Theorem 4.2 that (4.1) holds (i.e., Q_n(·, θ) converges uniformly to Q̄(θ) = E([Y_t − θ(X_t)]²)). Using Lemma 4.3, we obtain conditions on {q_n} and {Δ_n} ensuring the validity of the metric entropy conditions (4.3) and (4.4).

LEMMA 4.4. Let {q_n ∈ N} and {Δ_n ∈ R⁺} be sequences such that {Δ_n} is increasing and Δ_n → ∞ as n → ∞. Put s̄_n = 4Δ_n², m̄_n = 4Δ_n, λ = 1, and for given ψ ∈ L and ε > 0 put H_n(ε) = H(ψ, q_n, Δ_n).
(i) If q_n Δ_n^4 log q_n Δ_n = o(n), then for any ε > 0

(s̄_n²/n) H_n([ε/6m̄_n]^(1/λ)) → 0 as n → ∞;

(ii) If q_n Δ_n^4 log q_n Δ_n = o(n^(1/2)), then for any ε > 0

(s̄_n²/n^(1/2)) H_n([ε/6m̄_n]^(1/λ)) → 0 as n → ∞. □

Here we see the growth rates for network complexity appearing in Assumptions A.3 and B.3. The growth rate for Δ_n, Δ_n = o(n^(1/4)), follows from the requirement in Theorem 4.2 that n^(-1) s̄_n² → 0, with s̄_n set to 4Δ_n².
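
These rate conditions are easy to inspect numerically; a Python sketch for the comfortably admissible choices q_n = n^(1/2), Δ_n = log n (at the boundary rate q_n = n^(1−ε), the decline of the ratio sets in only at extremely large n):

    import numpy as np

    # Condition (i) of Lemma 4.4 (and Assumption A.3): the ratio
    # q_n * Delta_n^4 * log(q_n * Delta_n) / n must tend to zero.
    for n in [1e4, 1e6, 1e8, 1e10, 1e12]:
        q_n, Delta_n = n ** 0.5, np.log(n)
        ratio = q_n * Delta_n ** 4 * np.log(q_n * Delta_n) / n
        print(f"n = {n:.0e}: ratio = {ratio:.1f}")
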
The result for approximate optimization follows as a corollary to Theorem 4.1.

COROLLARY 4.5. Let the conditions and definitions of Theorem 4.1 hold. For ζ ≥ 0, define Θ_n^ζ(ω) ≡ {θ ∈ Θ_n(ω): |Q_n(ω, θ) − Q_n(ω, θ̂_n(ω))| ≤ ζ} and Θ*(ζ) ≡ {θ ∈ Θ: Q̄(θ) − Q̄(θ₀) ≤ ζ}. Let {ζ_n} satisfy ζ_n → ζ₀ ≥ 0 as n → ∞, and let θ̃_n: Ω → Θ be a sequence of measurable functions such that θ̃_n(ω) ∈ Θ_n^{ζ_n}(ω), n = 1, 2, .... Then for all ε > 0, P[θ̃_n ∈ Θ*(ζ₀ + ε)] → 1 as n → ∞. If in addition ζ₀ = 0, then ρ(θ̃_n, θ₀) → 0 in probability. □

5. SUMMARY AND DIRECTIONS FOR FURTHER RESEARCH

Our results resolve the learnability issue for multilayer feedforward networks by establishing the consistency of connectionist sieve estimators for an unknown regression function. These networks thus possess nonparametric regression capabilities. Network complexity plays a fundamental role in our analysis. We consider both deterministic and data-driven methods for controlling the growth of network complexity as a function of network experience. The data-driven methods are based on a cross-validation measure of network performance, cross-validated average squared error, that we advocate for general use in evaluating network performance. This is not the only appropriate or useful measure (e.g., see Barron, 1989), but it offers a considerable improvement over naive methods. We also consider the implications of approximate rather than exact optimization of network performance.

Consistency is a minimal property for a learning rule; the analysis here is inherently limited by our focus on this property. There are numerous important areas for further investigation. Especially desirable is further analysis regarding the rate of convergence of connectionist sieve estimators. Among other things, this will help make possible more informed application of the procedures described here. The results and approach of Severini and Wong (1987) may prove helpful in this investigation. Such results will permit further evaluation of the trade-offs between Δ_n and q_n required to ensure convergence at the best possible rate, and will facilitate comparisons with other nonparametric regression methods, such as kernel (Bierens, 1987) and spline (Cox, 1984) methods.

Rate of convergence results may also be useful in establishing asymptotic optimality properties (as in Li, 1987) for cross-validation procedures. Another area for further research is greater automation of consistent and optimal procedures. Developing and implementing well-behaved methods for computing θ̂^q_{n(t)} is especially important for practical application of the cross-validation approach.

Straightforward extensions of our analysis yield nonparametric estimators of conditional functionals other than the conditional expectations studied here, such as conditional variances or conditional quantiles. These can be estimated with proper choice of Q_n, as discussed in WW. In fact, it is possible to use multilayer feedforward networks to obtain nonparametric estimates of joint or conditional densities, using the approach described by Geman and Hwang (1982, section 6). Such multilayer feedforward networks would be essentially self-organizing, and could be used in pattern completion problems in which arbitrary parts of the pattern are supplied and the arbitrary remainder has to be completed.

Finally, we mention that it is often useful to have available statistical confidence regions for linear or nonlinear functionals of the nonparametric estimator. For example, we might want a 95% confidence interval for the conditional expectation of Y_t given X_t = x, a specified value. This interval may be obtainable using methods similar to those of Andrews (1988).

MATHEMATICAL APPENDIX

Unless otherwise noted, all definitions and notations are as given in the text.

Proof of Lemma 3.2. Let Σ^r(ψ) ≡ {g: I^r → R: g(x) = β₀ + Σ_{j=1}^q β_j ψ(x̃'γ_j), β_j ∈ R, j = 0, ..., q, γ_j ∈ R^(r+1), j = 1, ..., q, q ∈ N}. It follows immediately from Theorem 2.4 of Hornik, Stinchcombe, and White (1989) or Corollary 3.6 of Hornik, Stinchcombe, and White (1990) and Theorem 3.14 of Rudin (1974) that Σ^r(ψ) is ρ₂-dense in L²(I^r, μ). Let g: I^r → R be an arbitrary element of Σ^r(ψ), so that for some q ∈ N, β_j ∈ R, j = 0, ..., q, γ_j ∈ R^(r+1), j = 1, ..., q, we have g(x) = β₀ + Σ_{j=1}^q β_j ψ(x̃'γ_j). Because q_n → ∞ and Δ_n → ∞, we can always pick n sufficiently large that Σ_{j=0}^q |β_j| ≤ Δ_n, Σ_{j=1}^q Σ_{i=0}^r |γ_ji| ≤ q_n Δ_n, and q ≤ q_n. Thus, for n sufficiently large g belongs to Θ_n(ψ) = T(ψ, q_n, Δ_n) and therefore to U_{n=1}^∞ Θ_n(ψ). Because g is arbitrary, Σ^r(ψ) ⊆ U_{n=1}^∞ Θ_n(ψ). It follows from the ρ₂-denseness of Σ^r(ψ) in L²(I^r, μ) that U_{n=1}^∞ Θ_n(ψ) is ρ₂-dense in L²(I^r, μ). □

Proof of Theorem 3.3. We apply Theorem 4.1. By Assumption A.1, (Ω, F, P) is a complete probability space; (Θ, ρ) = (L²(I^r, μ), ρ₂) is a complete separable metric space (e.g., Kolmogorov and Fomin, 1970, Theorem 37.5, Problem 37.4). Consequently, we may take Θ̄_n = Θ, n = 1, 2, .... Θ_n(·) = Θ_n(ψ) ≡ T(ψ, q_n, Δ_n) is nonempty and compact (for all n sufficiently large) given Assumption A.3, and Θ_n is an analytic correspondence because it is nonstochastic. Q_n(ω, θ) = n^(-1) Σ_{t=1}^n [Y_t(ω) − θ(X_t(ω))]² is F ⊗ B(Θ) measurable as a consequence of Lemma 2.2 of Stinchcombe and White (1989b) (SW), because for every ω in Ω, Q_n(ω, ·) is (ρ₂-) continuous and for every θ in the separable metric space Θ, Q_n(·, θ) is measurable. Continuity of Q_n(ω, ·) implies lower semi-continuity. The existence of a measurable connectionist sieve estimator θ̂_n: Ω → Θ now follows from Theorem 4.1(a).

By Assumption A.3, {Θ̲_n = Θ̄_n = T(ψ, q_n, Δ_n)} is an increasing sequence of compact subsets of Θ. By Lemma 3.2, U_{n=1}^∞ Θ̲_n is dense in Θ. Arguments sketched in the text of section 4 and Lemmas 4.3 and 4.4 apply to establish the existence of Q̄: Θ → R, Q̄(θ) = E([Y_t − θ(X_t)]²), such that P[ω: sup_{θ∈Θ̄_n} |Q_n(ω, θ) − Q̄(θ)| > ε] → 0 as n → ∞, as a consequence of Theorem 4.2, given Assumptions A.1(i) and A.3(i) or A.1(ii) and A.3(ii). Now Q̄(θ) − Q̄(θ₀) = ρ(θ, θ₀)², so inf_{θ∈η^c(θ₀,ε)} Q̄(θ) − Q̄(θ₀) = inf_{θ∈η^c(θ₀,ε)} ρ(θ, θ₀)² ≥ ε² > 0, and Q̄(θ) = ρ(θ, θ₀)² + Q̄(θ₀) is continuous at θ₀. That ρ(θ̂_n, θ₀) → 0 in probability now follows from Theorem 4.1(b). □

The proof of Theorem 3.4 makes use of the following lemmas.

LEMMA A.1. Let (Θ, ρ) = (L²(I^r, μ), ρ₂) and let T(ψ, q, Δ) be as in Definition 3.1 with ψ ∈ L. Let (Ω, F) be a measurable space and let q̂: Ω → N̄ be measurable-F/B(N̄), where N̄ ⊆ N and N̄ is endowed with the discrete topology. Then for every ψ and Δ, T(ψ, q̂, Δ): Ω → Θ has gr T(ψ, q̂, Δ) ∈ F ⊗ B(Θ). □

Proof. For each ψ and Δ, T(ψ, q, Δ) is a compact set, as it is the continuous image of the compact set B × Γ, B ≡ {β: Σ_{j=0}^q |β_j| ≤ Δ}, Γ ≡ {γ: Σ_{j=1}^q Σ_{i=0}^r |γ_ji| ≤ qΔ}. Then gr T(ψ, q̂, Δ) = U_{n∈N̄} (q̂^(-1)(n) × T(ψ, n, Δ)) ∈ F ⊗ B(Θ), because q̂ is measurable (q̂^(-1)(n) ∈ F for all n ∈ N̄) and T(ψ, n, Δ) is compact for all n ∈ N̄ (so T(ψ, n, Δ) ∈ B(Θ)). □

LEMMA A.2. Suppose Assumption A.1 holds, and put N_n = [q̲_n, q̄_n] ∩ N for q̲_n, q̄_n ∈ N, q̲_n ≤ q̄_n, n = 1, 2, .... For each n = 1, 2, ..., t = 1, ..., n and q ∈ N_n, let θ̂^q_{n(t)} be measurable-F/B(Θ). Then for each n there exists q̂_n: Ω → N_n measurable-F/B(N_n) such that

q̂_n ∈ N̂_n ≡ argmin_{q∈N_n} n^(-1) Σ_{t=1}^n [Y_t − θ̂^q_{n(t)}(X_t)]²,

and for all ω in Ω and any q_n(ω) ∈ N̂_n(ω) we have q̂_n(ω) ≤ q_n(ω). □

The proof of Theorem 3.4 makes use of the following lemmas.

LEMMA A.1. Let $(\Theta, \rho) = (L^2(\Gamma, \mu), \rho_\mu)$ and let $T(\psi, q, \Delta)$ be as in Definition 3.1, with $\psi$ continuous. Let $(\Omega, F)$ be a measurable space and let $\tilde{q}: \Omega \to N$ be measurable-$F/B(N)$, where $N \subset \mathbb{N}$ and $N$ is endowed with the discrete topology. Then for every $\psi$ and $\Delta$, $T(\psi, \tilde{q}(\cdot), \Delta): \Omega \to \Theta$ has $\operatorname{gr} T(\psi, \tilde{q}, \Delta) \in F \otimes B(\Theta)$.

Proof. For each $\psi$ and $\Delta$, $T(\psi, q, \Delta)$ is a compact set, as it is the continuous image of the compact set $B \times \Gamma^q$, $B \equiv \{\beta: \sum_{j=0}^{q} |\beta_j| \le \Delta\}$, $\Gamma \equiv \{\gamma: \sum_{i=0}^{r} |\gamma_i| \le q\Delta\}$. Then $\operatorname{gr} T(\psi, \tilde{q}, \Delta) = \bigcup_{n \in N} (\tilde{q}^{-1}(n) \times T(\psi, n, \Delta)) \in F \otimes B(\Theta)$, because $\tilde{q}$ is measurable ($\tilde{q}^{-1}(n) \in F$ for all $n \in N$) and $T(\psi, n, \Delta)$ is compact for all $n \in N$ (so that $T(\psi, n, \Delta) \in B(\Theta)$). ∎

LEMMA A.2. Suppose Assumption A.1 holds, and let $N_n \equiv [\underline{q}_n, \bar{q}_n] \cap \mathbb{N}$ for $\underline{q}_n, \bar{q}_n \in \mathbb{N}$, $\underline{q}_n \le \bar{q}_n$, $n = 1, 2, \ldots$. For each $n = 1, 2, \ldots$ and $q \in N_n$, let $\hat{\theta}_{n,q}$ be measurable-$F/B(\Theta)$. Then for each $n$ there exists $\hat{q}_n: \Omega \to N_n$ measurable-$F/B(N_n)$ such that
$$\hat{q}_n \in \tilde{N}_n \equiv \operatorname*{argmin}_{q \in N_n}\; n^{-1} \sum_{t=1}^{n} [Y_t - \hat{\theta}_{n,q}(X_t)]^2,$$
and for all $\omega$ in $\Omega$ and $\tilde{q}_n \in \tilde{N}_n$ we have $\hat{q}_n(\omega) \le \tilde{q}_n(\omega)$.

Proof. We apply Corollary 2.4 of SW. Condition (i) of SW holds given Assumption A.1, and condition (ii) holds as $(N_n, B(N_n))$ is a complete separable metric space under the discrete topology. For each $q$ in $N_n$, $\hat{Q}_n(\cdot, q) \equiv n^{-1} \sum_{t=1}^{n} [Y_t - \hat{\theta}_{n,q}(X_t)]^2$ is measurable-$F$ given the measurability of the elements of $\{\hat{\theta}_{n,q}\}$ and $\{Z_t\}$. Because $N_n$ is a finite set, $\hat{Q}_n: \Omega \times N_n \to \mathbb{R}$ is measurable-$F \otimes B(N_n)$, so that condition (iii) holds. Condition (iv.b) holds as $N_n$ does not depend on $\omega$, so that $N_n$ is an analytic set. Because $N_n$ is a finite set, a minimum exists. Thus, by Corollary 2.4 of SW, $\hat{q}_n \in \tilde{N}_n$ and $\hat{q}_n$ is measurable-$F/B(N_n)$. Further, the arbitrary selection of Corollary 2.4 can be taken to be such that for all $\omega$ in $\Omega$ and $\tilde{q}_n \in \tilde{N}_n$ we have $\hat{q}_n(\omega) \le \tilde{q}_n(\omega)$. ∎
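Lemma A.2 is what licenses a data-driven choice of network size. A minimal sketch of the selection rule it justifies (hypothetical names; `fit_and_loss` is a stand-in for computing $\hat{Q}_n(\cdot, q)$, by training a $q$-hidden-unit network and returning its average squared error, or a cross-validated analogue of it):

```python
def select_network_size(q_lo, q_hi, fit_and_loss):
    """Return the smallest q in [q_lo, q_hi] minimizing average squared error.

    fit_and_loss(q): trains a network with q hidden units and returns
    n^{-1} * sum_t (Y_t - theta_hat_{n,q}(X_t))^2 (or a cross-validated
    version).  Breaking ties toward the smallest q mirrors the selection
    q_hat_n(omega) <= q_tilde_n(omega) of Lemma A.2.
    """
    losses = {q: fit_and_loss(q) for q in range(q_lo, q_hi + 1)}
    best = min(losses.values())
    return min(q for q, loss in losses.items() if loss == best)
```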

Proof of Theorem 3.4. The proof is analogous to that of Theorem 3.3, except for some details. Lemmas A.1–A.2 immediately imply that $\operatorname{gr} \hat{\Theta}_n(\psi) \in F \otimes B(\Theta) \subset \mathcal{A}(F \otimes B(\Theta))$. The denseness of $\bigcup_{n=1}^{\infty} \underline{\Theta}_n$ follows from Lemma 3.2 under Assumption B.3. By construction, $\underline{\Theta}_n \subset \hat{\Theta}_n(\omega) \subset \bar{\Theta}_n$, as $\underline{q}_n \le \hat{q}_n \le \bar{q}_n$. Condition (4.1) holds by Theorem 4.2 for $\{\bar{\Theta}_n\}$, using the same argument as in Theorem 3.3. As the rest of the argument is unchanged, the conditions of Theorem 4.1 hold and the result follows. ∎

Proof of Theorem 3.5. The result follows immediately from Theorem 4.5, as the conditions of Theorem 4.1 hold as established in the proof of Theorem 3.4. ∎

Proof of Theorem 4.1. (a) The result follows immediately from Corollary 2.4 of SW, as we directly impose their conditions (i)–(iii) and (iv.b) on $(\Omega, F)$ and $(\Theta_n, \rho)$. (Completeness and separability of $\Theta$ ensure that it is Souslin, condition (ii).) We also impose their conditions (v.a) ($\Theta_n(\omega)$ nonempty and compact) and (vi.b) ($Q_n(\omega, \cdot)$ lower semicontinuous on $\Theta_n$ for each $\omega$), which ensure that $\{\omega: Q_n(\omega, \cdot)$ achieves its infimum in $\Theta_n(\omega)\} = \Omega$.

(b) We first verify the conditions of Theorem 2.1 of White and Wooldridge (1990) (WW). The conditions imposed in (a) suffice for all of their conditions through their (2.1) and (2.2), as $\hat{\theta}_n(\omega)$ minimizes $Q_n(\omega, \theta)$ on $\Theta_n(\omega)$. It remains to verify their (2.3) and (2.4), that is, that there exists a nonstochastic sequence $\{\theta^*_n \in \Theta\}$ such that
$$P^*[\omega: \theta^*_n \in \Theta_n(\omega)] \to 1 \text{ as } n \to \infty \qquad (A.1)$$
and for all $\varepsilon > 0$
$$P^*\Bigl[\omega: \inf_{\theta \in \eta^c(\theta^*_n, \varepsilon, \omega)} Q_n(\omega, \theta) - Q_n(\omega, \theta^*_n) > 0\Bigr] \to 1 \text{ as } n \to \infty, \qquad (A.2)$$
where $\eta^c(\theta^*_n, \varepsilon, \omega) \equiv \{\theta \in \Theta_n(\omega): \rho(\theta, \theta^*_n) \ge \varepsilon\}$ and $P^*$ is the outer measure associated with $(\Omega, F, P)$.

It follows by an argument identical to the proof of Proposition 2.4 of WW that there exists a nonstochastic sequence $\{\theta^*_n \in \underline{\Theta}_n\}$ such that: $\theta^*_n$ is a solution to $\inf_{\theta \in \underline{\Theta}_n} \bar{Q}(\theta)$; for all $\varepsilon > 0$
$$\liminf_n \Bigl\{ \inf_{\theta \in \eta^c(\theta^*_n, \varepsilon)} \bar{Q}(\theta) - \bar{Q}(\theta^*_n) \Bigr\} > 0,$$
where $\eta^c(\theta^*_n, \varepsilon) \equiv \{\theta \in \Theta: \rho(\theta, \theta^*_n) \ge \varepsilon\}$; and $\rho(\theta^*_n, \theta_0) \to 0$ as $n \to \infty$. Because $\underline{\Theta}_n \subset \Theta_n(\omega)$, $\{\theta^*_n\}$ satisfies (A.1).

Because $\eta^c(\theta^*_n, \varepsilon)$ contains $\bar{\eta}^c(\theta^*_n, \varepsilon) \equiv \{\theta \in \bar{\Theta}_n: \rho(\theta, \theta^*_n) \ge \varepsilon\}$, it follows that
$$\liminf_n \Bigl\{ \inf_{\theta \in \bar{\eta}^c(\theta^*_n, \varepsilon)} \bar{Q}(\theta) - \bar{Q}(\theta^*_n) \Bigr\} > 0.$$

By assumption, $P[\omega: \sup_{\theta \in \bar{\Theta}_n} |Q_n(\omega, \theta) - \bar{Q}(\theta)| \ge \varepsilon] \to 0$ as $n \to \infty$ for any $\varepsilon > 0$; because $\underline{\Theta}_n \subseteq \bar{\Theta}_n$, we also have $\theta^*_n \in \bar{\Theta}_n$. It now follows immediately from the argument of Corollary 2.3 of WW that
$$P^*\Bigl[\omega: \inf_{\theta \in \bar{\eta}^c(\theta^*_n, \varepsilon)} Q_n(\omega, \theta) - Q_n(\omega, \theta^*_n) > 0\Bigr] \to 1 \text{ as } n \to \infty.$$

Because $\Theta_n(\omega) \subseteq \bar{\Theta}_n$, $\bar{\eta}^c(\theta^*_n, \varepsilon)$ contains $\eta^c(\theta^*_n, \varepsilon, \omega)$, so that
$$\Bigl\{\omega: \inf_{\theta \in \bar{\eta}^c(\theta^*_n, \varepsilon)} Q_n(\omega, \theta) - Q_n(\omega, \theta^*_n) > 0\Bigr\} \subseteq \Bigl\{\omega: \inf_{\theta \in \eta^c(\theta^*_n, \varepsilon, \omega)} Q_n(\omega, \theta) - Q_n(\omega, \theta^*_n) > 0\Bigr\}.$$

It follows immediately from the monotonicity of $P^*$ that (A.2) holds, so that $\rho(\hat{\theta}_n, \theta^*_n) \to 0$ in outer probability by Theorem 2.1 of WW. Because $\hat{\theta}_n$ is measurable (by (a) above), $\rho(\hat{\theta}_n, \theta^*_n) \stackrel{p}{\to} 0$. Because $\rho(\theta^*_n, \theta_0) \to 0$, the triangle inequality yields $\rho(\hat{\theta}_n, \theta_0) \stackrel{p}{\to} 0$. ∎
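For readers wishing to experiment, the metric $\rho_\mu(\theta_1, \theta_2) = (\int (\theta_1 - \theta_2)^2 \, d\mu)^{1/2}$ in which this consistency conclusion is stated is easily approximated by Monte Carlo; a minimal sketch, assuming draws from $\mu$ are available (all names are illustrative):

```python
import numpy as np

def rho_mu(theta1, theta2, draw_mu, m=100_000, seed=0):
    """Monte Carlo estimate of rho_mu(theta1, theta2) = (E_mu[(theta1 - theta2)^2])^{1/2}.

    theta1, theta2: vectorized callables mapping an (m, r) array to an (m,) array.
    draw_mu: callable (m, rng) -> (m, r) array of i.i.d. draws from mu.
    """
    rng = np.random.default_rng(seed)
    x = draw_mu(m, rng)
    diff = theta1(x) - theta2(x)
    return float(np.sqrt(np.mean(diff ** 2)))
```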

Proof of Theorem 4.2. We verify the conditions of Theorem 2.5 of WW. The conditions on $(\Omega, F, P)$, $(\Theta, \rho)$, and $\{\Theta_n\}$ are imposed directly. The functions $s_n(Z_t, \cdot)$ and $m_n(Z_t, \cdot)$ correspond to $s_n$ and $m_n$ of WW. The required measurability and Lipschitz conditions for these hold by assumption. We take $M_n$ of WW to be $M_n = n\tilde{\Delta}_n^2$.

Under condition (i) ($\{Z_t\}$ i.i.d.) it follows from the Bernstein inequality (e.g., Proposition 3.2 of WW) that
$$P\Bigl[\Bigl|\sum_{t=1}^{n} s_n(Z_t, \theta) - E(s_n(Z_t, \theta))\Bigr| > \lambda\Bigr] \le 2\exp[-\lambda^2/(2\bar{\sigma}_n^2 + 4\tilde{\Delta}_n\lambda/3)],$$
where $\bar{\sigma}_n^2 \equiv \operatorname{var}(\sum_{t=1}^{n} s_n(Z_t, \theta))$, and we use the fact that $|s_n(Z_t, \theta) - E(s_n(Z_t, \theta))| \le 2\tilde{\Delta}_n$. Further, $\bar{\sigma}_n^2 \le n\tilde{\Delta}_n^2$, and we may take $\tilde{\Delta}_n \ge 1$ (implying $\tilde{\Delta}_n \le \tilde{\Delta}_n^2$) without loss of generality, so that
$$P\Bigl[\Bigl|\sum_{t=1}^{n} s_n(Z_t, \theta) - E(s_n(Z_t, \theta))\Bigr| > \lambda\Bigr] \le 2\exp[-\lambda^2/\tilde{\Delta}_n^2(2n + 4\lambda/3)].$$
Note that this bound is independent of $\theta$, and define $F_n^s(\lambda) \equiv 2\exp[-\lambda^2/\tilde{\Delta}_n^2(2n + 4\lambda/3)]$. Application of the Bernstein inequality with $m_n$ in place of $s_n$ leads to an analogous inequality,
$$P\Bigl[\Bigl|\sum_{t=1}^{n} m_n(Z_t, \theta) - E(m_n(Z_t, \theta))\Bigr| > \lambda\Bigr] \le 2\exp[-\lambda^2/(2n\tilde{\Delta}_n^2 + 4\tilde{\Delta}_n\lambda/3)].$$
This bound is also independent of $\theta$; we define $F_n^m(\lambda) \equiv 2\exp[-\lambda^2/(2n\tilde{\Delta}_n^2 + 4\tilde{\Delta}_n\lambda/3)]$.

Theorem 2.5 of WW applies to yield the desired inequality, provided that $n = O(M_n)$. Because $M_n = n\tilde{\Delta}_n^2$ and $\tilde{\Delta}_n \ge 1$, we see immediately that this holds. Consequently, by Theorem 2.5 of WW it follows that for all $\varepsilon > 0$ and all $n$ sufficiently large
$$P\Bigl[\sup_{\theta \in \Theta_n}\Bigl|n^{-1}\sum_{t=1}^{n} [s_n(Z_t, \theta) - E(s_n(Z_t, \theta))]\Bigr| > \varepsilon\Bigr] \le G_n(\varepsilon/6\tilde{\Delta}_n)\,[F_n^s(2M_n) + F_n^m(\varepsilon n/3)].$$
Now $F_n^s(2M_n) \le 2\exp[-6n/7]$ and $F_n^m(\varepsilon n/3) \le 2\exp[-\varepsilon^2 n/\tilde{\Delta}_n^2(18 + 4\varepsilon)]$. Substituting these expressions into the expression above gives the first result of part (i).

To obtain the second result (uniform convergence to zero), it suffices that
$$G_n(\varepsilon/6\tilde{\Delta}_n)\exp[-6n/7] \to 0 \text{ as } n \to \infty$$
and
$$G_n(\varepsilon/6\tilde{\Delta}_n)\exp[-\varepsilon^2 n/\tilde{\Delta}_n^2(18 + 4\varepsilon)] \to 0 \text{ as } n \to \infty$$
for any $\varepsilon > 0$. Because the force of the condition occurs when $\varepsilon$ is small and because we take $\tilde{\Delta}_n \ge 1$, the second condition suffices for the first. For this it suffices that
$$\varepsilon^2 n/\tilde{\Delta}_n^2(18 + 4\varepsilon) - H_n(\varepsilon/6\tilde{\Delta}_n) \to \infty \text{ as } n \to \infty$$
for any $\varepsilon > 0$. Now
$$\varepsilon^2 n/\tilde{\Delta}_n^2(18 + 4\varepsilon) - H_n(\varepsilon/6\tilde{\Delta}_n) = (n/\tilde{\Delta}_n^2)\bigl[\varepsilon^2/(18 + 4\varepsilon) - (\tilde{\Delta}_n^2/n)H_n(\varepsilon/6\tilde{\Delta}_n)\bigr] \ge (n/\tilde{\Delta}_n^2)\bigl[\varepsilon^2/2(18 + 4\varepsilon)\bigr],$$
where the inequality holds for all $n$ sufficiently large because $(\tilde{\Delta}_n^2/n)H_n(\varepsilon/6\tilde{\Delta}_n) \to 0$. Because $n^{-1}\tilde{\Delta}_n^2 \to 0$ we have $n/\tilde{\Delta}_n^2 \to \infty$, and the required divergence is verified.

The argument for the mixing case (ii) is analogous, using a mixing analog of the Bernstein inequality (Bosq, 1975; Theorem 3.3 of WW). The condition $(\tilde{\Delta}_n/n^{1/2})H_n(\varepsilon/6\tilde{\Delta}_n) \to 0$ can be similarly verified to suffice for the uniform convergence result. ∎
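The way the exponential tail bounds trade off against the metric entropy term can be seen numerically. A minimal sketch, assuming the form of the bound reconstructed above (the function `H_n` and all other names here are illustrative, not from the text):

```python
import math

def uniform_bound(n, delta_tilde, eps, H_n):
    """Tail bound from the proof of Theorem 4.2 (i.i.d. case):
    G_n(eps/(6*delta_tilde)) * (F_s + F_m), with G_n = exp(H_n),
    F_s = 2*exp(-6n/7), and F_m = 2*exp(-eps^2 n / (delta_tilde^2 (18 + 4 eps))).

    H_n: user-supplied entropy bound H_n(delta) = log covering number of
    Theta_n at radius delta (e.g., obtained from Lemma 4.3).
    """
    G = math.exp(H_n(eps / (6.0 * delta_tilde)))
    F_s = 2.0 * math.exp(-6.0 * n / 7.0)
    F_m = 2.0 * math.exp(-eps ** 2 * n / (delta_tilde ** 2 * (18.0 + 4.0 * eps)))
    return G * (F_s + F_m)
```

Consistency requires the entropy $H_n$ to grow slowly enough that the exponential factors dominate, which is exactly what the rate conditions of Lemma 4.4 below guarantee.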

Proof of Lemma 4.3. We construct an $\varepsilon$-net for $T(\psi, q, \Delta)$. (Recall that $T_\varepsilon = \{t_1, \ldots, t_h\}$ is an $\varepsilon$-net for $T$ if for each $\theta$ in $T$ there is at least one element $t_k$ of $T_\varepsilon$ such that $\rho_\infty(\theta, t_k) \le \varepsilon$.) Then $H(\varepsilon, q, \Delta) \le \log h$ (e.g., Lorentz, 1966, chap. 10).

Let $\eta > 0$ be given, and let $B_\eta \equiv \{b_k \in B,\ k = 1, \ldots, l\}$ and $C_\eta \equiv \{c_k \in \Gamma,\ k = 1, \ldots, m\}$ be ($l_1$-norm) $\eta$-nets for $B \equiv \{\beta: \|\beta\| \le \Delta\} \subset \mathbb{R}^{q+1}$ and $\Gamma \equiv \{\gamma: \|\gamma\| \le q\Delta\} \subset \mathbb{R}^{q(r+1)}$. (For notational convenience $\|\cdot\|$ here denotes the $l_1$-norm in spaces $\mathbb{R}^a$ of whatever dimension.) Let $D_\eta \equiv B_\eta \times C_\eta$, and put $T_\eta \equiv \{\theta \in T(\psi, q, \Delta): \delta \in D_\eta\}$ in an obvious use of notation. We choose $\eta$ so that $T_\eta$ is an $\varepsilon$-net.

Let $\theta \in T(\psi, q, \Delta)$ be arbitrary, with corresponding parameters $\delta = (\beta', \gamma')'$. Then there exists $d = (b', c')'$ in $D_\eta$ such that $\|\beta - b\| \le \eta$ and $\|\gamma - c\| \le \eta$. Let $t$ be the element of $T_\eta$ corresponding to $d$. The triangle inequality gives
$$|\theta(x) - t(x)| = \Bigl|\beta_0 + \sum_{j=1}^{q} \beta_j\psi(\tilde{x}'\gamma_j) - b_0 - \sum_{j=1}^{q} b_j\psi(\tilde{x}'c_j)\Bigr| \le \Bigl|\beta_0 - b_0 + \sum_{j=1}^{q} (\beta_j - b_j)\psi(\tilde{x}'\gamma_j)\Bigr| + \Bigl|\sum_{j=1}^{q} b_j(\psi(\tilde{x}'\gamma_j) - \psi(\tilde{x}'c_j))\Bigr|.$$
For the first term, the fact that $\psi$ is bounded (set the bound to unity) and the triangle inequality give
$$\Bigl|\beta_0 - b_0 + \sum_{j=1}^{q} (\beta_j - b_j)\psi(\tilde{x}'\gamma_j)\Bigr| \le \sum_{j=0}^{q} |\beta_j - b_j| = \|\beta - b\| \le \eta.$$
For the second term, the triangle inequality gives
$$\Bigl|\sum_{j=1}^{q} b_j(\psi(\tilde{x}'\gamma_j) - \psi(\tilde{x}'c_j))\Bigr| \le \sum_{j=1}^{q} |b_j|\,|\psi(\tilde{x}'\gamma_j) - \psi(\tilde{x}'c_j)| \le \Delta \sum_{j=1}^{q} |\psi(\tilde{x}'\gamma_j) - \psi(\tilde{x}'c_j)|.$$
The Lipschitz condition imposed on $\psi$ and the triangle inequality give
$$|\psi(\tilde{x}'\gamma_j) - \psi(\tilde{x}'c_j)| \le L|\tilde{x}'\gamma_j - \tilde{x}'c_j| = L\Bigl|\sum_{i=0}^{r} \tilde{x}_i(\gamma_{ji} - c_{ji})\Bigr| \le rL\sum_{i=0}^{r} |\gamma_{ji} - c_{ji}|.$$
Consequently, $\sum_{j=1}^{q} |\psi(\tilde{x}'\gamma_j) - \psi(\tilde{x}'c_j)| \le rL\sum_{j=1}^{q}\sum_{i=0}^{r} |\gamma_{ji} - c_{ji}| \le rL\eta$, so that
$$\Bigl|\sum_{j=1}^{q} b_j(\psi(\tilde{x}'\gamma_j) - \psi(\tilde{x}'c_j))\Bigr| \le \Delta r L \eta.$$
It follows that
$$\rho_\infty(\theta, t) \equiv \sup_x |\theta(x) - t(x)| \le \eta(1 + rL\Delta) = \varepsilon,$$
provided that we choose $\eta = \varepsilon/(1 + rL\Delta)$. Because $\theta$ is arbitrary and $t \in T_\eta$, $T_\eta$ with $\eta = \varepsilon/(1 + rL\Delta)$ is an $\varepsilon$-net for $T(\psi, q, \Delta)$.

Let $\#$ be the cardinality operator. Because $\#T_\eta = \#D_\eta$ and $D_\eta = B_\eta \times C_\eta$, we have $\#T_\eta = (\#B_\eta)(\#C_\eta)$. It follows from Theorems IX, X, and V of Kolmogorov and Tihomirov (1961) and elementary geometrical arguments that for all $\varepsilon$ (hence $\eta$) sufficiently small, $\#B_\eta \le 2(2\Delta/\eta)^{q+1}$ and $\#C_\eta \le 2(2q\Delta/\eta)^{q(r+1)}$. Therefore, with $\eta = \varepsilon/(1 + rL\Delta)$ and $p \equiv q(r+2) + 1$, we have
$$\log \#T_\eta \le \log 4 + [q(r+2)+1]\log(2\Delta/\eta) + q(r+1)\log q \le \log 4 + p\log(2/\varepsilon) + p\log(\Delta + rL\Delta^2) + p\log p.$$
Because $H(\varepsilon, q, \Delta) = \log \#T_\eta$ and $\log 4 \le p\log 4$, we have
$$H(\varepsilon, q, \Delta) \le p\log(8/\varepsilon) + p\log(\Delta + rL\Delta^2) + p\log p.\;∎$$

Proof of Lemma 4.4. (i) With $\tilde{\Delta}_n = 4\Delta_n^2$, for any $\varepsilon > 0$
$$H_n(\varepsilon/6\tilde{\Delta}_n) = H_n(\varepsilon/24\Delta_n^2) \le p_n\log(192\Delta_n^2/\varepsilon) + p_n\log[\Delta_n + rL\Delta_n^2] + p_n\log p_n$$
from Lemma 4.3. Because $\Delta_n \to \infty$, for any $\varepsilon > 0$ there exists $N_\varepsilon$ sufficiently large that $\Delta_n \ge 192/\varepsilon$ for all $n > N_\varepsilon$. Also, for all $n$ sufficiently large $\Delta_n^2 \ge \Delta_n$ and $\Delta_n > 1 + rL$. Without loss of generality suppose this is true for all $n > N_\varepsilon$. Consequently,
$$p_n\log(192\Delta_n^2/\varepsilon) + p_n\log[\Delta_n + rL\Delta_n^2] \le p_n\log\Delta_n^3 + p_n\log\Delta_n^2(1 + rL) = p_n\log\Delta_n^3 + p_n\log\Delta_n^2 + p_n\log(1 + rL) \le 6\,p_n\log\Delta_n,$$
so that $H_n(\varepsilon/6\tilde{\Delta}_n) \le 6\,p_n\log p_n\Delta_n$. With $\tilde{\Delta}_n = 4\Delta_n^2$,
$$(\tilde{\Delta}_n^2/n)\,H_n(\varepsilon/6\tilde{\Delta}_n) \le 96\,n^{-1}p_n\Delta_n^4\log p_n\Delta_n.$$
Because $q_n\Delta_n^4\log q_n\Delta_n = o(n)$, we have $p_n\Delta_n^4\log p_n\Delta_n = o(n)$, so that $(\tilde{\Delta}_n^2/n)\,H_n(\varepsilon/6\tilde{\Delta}_n) \to 0$ as required.

(ii) Similarly,
$$(\tilde{\Delta}_n/n^{1/2})\,H_n(\varepsilon/6\tilde{\Delta}_n) \le 24\,n^{-1/2}p_n\Delta_n^2\log p_n\Delta_n.$$
Because $q_n\Delta_n^2\log q_n\Delta_n = o(n^{1/2})$, the result follows. ∎

Proof of Corollary 4.5. Suppose that for some $\varepsilon > 0$ it is not the case that $P[\hat{\theta}_n \in \Theta^*_n(\cdot, \varepsilon)] \to 1$. Then $P[E] \equiv P[\omega: \hat{\theta}_n(\omega) \notin \Theta^*_n(\omega, \varepsilon)$ i.o.$] > 0$. Pick $\omega \in E$ and a subsequence $\{n'\}$ of $\{n\}$ such that $\hat{\theta}_{n'}(\omega) \notin \Theta^*_{n'}(\omega, \varepsilon)$. Because $\sup_{\theta \in \bar{\Theta}_n}|Q_n(\cdot, \theta) - \bar{Q}(\theta)| \stackrel{p}{\to} 0$ and $\rho(\hat{\theta}_n, \theta_0) \stackrel{p}{\to} 0$, we can without loss of generality pick $\omega \in E$ and a further subsequence $\{n''\}$ such that $\sup_{\theta \in \bar{\Theta}_{n''}}|Q_{n''}(\omega, \theta) - \bar{Q}(\theta)| \to 0$ and $\rho(\hat{\theta}_{n''}(\omega), \theta_0) \to 0$ (e.g., Theorem 2.4.4 of Lukacs, 1975). By the triangle inequality,
$$|Q_{n''}(\omega, \hat{\theta}_{n''}(\omega)) - \bar{Q}(\theta_0)| \le |Q_{n''}(\omega, \hat{\theta}_{n''}(\omega)) - \bar{Q}(\hat{\theta}_{n''}(\omega))| + |\bar{Q}(\hat{\theta}_{n''}(\omega)) - \bar{Q}(\theta_0)|.$$
By uniform convergence, the fact that $\rho(\hat{\theta}_{n''}(\omega), \theta_0) \to 0$, and the continuity of $\bar{Q}$ at $\theta_0$, it follows that given any $\tilde{\varepsilon} > 0$ we have for all $n''$ sufficiently large
$$|Q_{n''}(\omega, \hat{\theta}_{n''}(\omega)) - Q_{n''}(\omega, \tilde{\theta}_{n''}(\omega))| \le |Q_{n''}(\omega, \hat{\theta}_{n''}(\omega)) - \bar{Q}(\theta_0)| + |\bar{Q}(\theta_0) - Q_{n''}(\omega, \tilde{\theta}_{n''}(\omega))| < 3\tilde{\varepsilon}.$$
But because $\hat{\theta}_{n''}(\omega) \notin \Theta^*_{n''}(\omega, \varepsilon)$, the construction of $\Theta^*_{n''}(\omega, \varepsilon)$ implies $|Q_{n''}(\omega, \hat{\theta}_{n''}(\omega)) - Q_{n''}(\omega, \tilde{\theta}_{n''}(\omega))| \ge 3\tilde{\varepsilon}$ for $\tilde{\varepsilon} > 0$ sufficiently small and all $n''$ sufficiently large, a contradiction. Hence, for all $\varepsilon > 0$, $P[\hat{\theta}_n \in \Theta^*_n(\cdot, \varepsilon)] \to 1$.

When $\delta = 0$, we have that $P[\hat{\theta}_n \in \Theta^*_n(\varepsilon)] \to 1$ for all $\varepsilon > 0$. It follows from (4.2) that for all $\varepsilon > 0$, $P[\tilde{\theta}_n \in \eta(\hat{\theta}_n, \varepsilon)] \to 1$, so that $\rho(\tilde{\theta}_n, \hat{\theta}_n) \stackrel{p}{\to} 0$. ∎

REFERENCES

Andrews, D.W.K. (1988). Asymptotic normality of series estimators for various nonparametric and semiparametric estimators (Cowles Foundation Discussion Paper 874). Yale University.

Baba, N. (1989). A new approach for finding the global minimum of error function of neural networks. Neural Networks, 2, 367-373.

Barron, A. (1989). Statistical properties of artificial neural networks. In Proceedings of the 28th IEEE Conference on Decision and Control, Tampa, Florida (pp. 280-285). New York: IEEE Press.

Bierens, H. (1987). Kernel estimators of regression functions. In T. Bewley (Ed.), Advances in econometrics: Fifth World Congress (Vol. I, pp. 99-144). New York: Cambridge University Press.

Bosq, D. (1975). Inégalité de Bernstein pour les processus stationnaires et mélangeants. Applications. C.R. Acad. Sc. Paris, Série A, 281, 1095-1098.

Carroll, S.M., & Dickinson, B.W. (1989). Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C. (pp. I:607-611). New York: IEEE Press.

Cox, D.D. (1984). Multivariate smoothing splines. SIAM Journal on Numerical Analysis, 21, 789-813.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 303-314.

Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183-192.

Gallant, A.R., & Nychka, D.W. (1987). Seminonparametric maximum likelihood estimation. Econometrica, 55, 363-390.

Geman, S., & Hwang, C. (1982). Nonparametric maximum likelihood estimation by the method of sieves. The Annals of Statistics, 10, 401-414.

Grenander, U. (1981). Abstract inference. New York: Wiley.

Haussler, D. (1989). Generalizing the PAC model for neural net and other learning applications (UC Santa Cruz Computer Research Laboratory Technical Report UCSC-CRL-89-30).

Hecht-Nielsen, R. (1989). Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C. (pp. I:593-606). New York: IEEE Press.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366.

Hornik, K., Stinchcombe, M., & White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3, this issue.

Kolmogorov, A.N., & Fomin, S.V. (1970). Introductory real analysis. New York: Dover.

Kolmogorov, A.N., & Tihomirov, V.M. (1961). ε-entropy and ε-capacity of sets in functional spaces. American Mathematical Society Translations, Series 2, 17, 277-364.

Li, K-C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: Discrete index set. The Annals of Statistics, 15, 958-975.

Lorentz, G.G. (1966). Approximation of functions. New York: Holt, Rinehart and Winston.

Lukacs, E. (1975). Stochastic convergence. New York: Academic Press.

Rudin, W. (1974). Real and complex analysis. New York: McGraw-Hill.

Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart and J.L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318-362). Cambridge, MA: MIT Press.

Scales, L.E. (1985). Introduction to non-linear optimization. New York: Springer-Verlag.

Severini, T.A., & Wong, W.H. (1987). Convergence rates of maximum likelihood and related estimates in general parameter spaces (University of Chicago Department of Statistics Working Paper).

Stinchcombe, M., & White, H. (1989a). Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C. (pp. I:613-617). New York: IEEE Press.

Stinchcombe, M., & White, H. (1989b). Some measurability results for extrema of random functions over random sets (UCSD Department of Economics Discussion Paper 89-18).

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36, 111-133.

Vapnik, V.N., & Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16, 264-280.

Wahba, G. (1975). Smoothing noisy data with spline functions. Numerische Mathematik, 24, 383-393.

Wahba, G., & Wold, S. (1975). A completely automatic French curve: Fitting spline functions by cross validation. Communications in Statistics, 4, 1-17.

White, H. (1984). Asymptotic theory for econometricians. Orlando: Academic Press.

White, H. (1989a). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1, 425-464.

White, H. (1989b). A practical cross-validation method for multilayer feedforward networks (Internal HNC Report #89-03).

White, H., & Wooldridge, J. (1990). Some results on sieve estimation with dependent observations. In W. Barnett, J. Powell, & G. Tauchen (Eds.), Nonparametric and semi-parametric methods in econometrics and statistics. New York: Cambridge University Press, to appear.
