Printed in the USA. All rights reserved. Copyright © 1989 Pergamon Press plc

ORIGINAL CONTRIBUTION

On the Approximate Realization of Continuous Mappings by Neural Networks

KEN-ICHI FUNAHASHI

ATR Auditory and Visual Perception Research Laboratories

(Received 6 May 1988; revised and accepted 14 September 1988)
Abstract--In this paper, we prove that any continuous mapping can be approximately realized by Rumelhart-Hinton-Williams' multilayer neural networks with at least one hidden layer whose output functions are sigmoid functions. The starting point of the proof for the one hidden layer case is an integral formula recently proposed by Irie-Miyake, and from this the general case (for any number of hidden layers) can be proved by induction. The two hidden layer case is also proved by using the Kolmogorov-Arnold-Sprecher theorem, and this proof also gives non-trivial realizations.

Keywords--Neural network, Back propagation, Output function, Sigmoid function, Hidden layer, Unit, Realization, Continuous mapping.
1. INTRODUCTION

...does not directly give the realization theorem of functions by networks with finite units.

In neural networks of the feed-forward type by Rumelhart-Hinton-Williams, bounded and monotone increasing differentiable functions such as the sigmoid function φ(x) = 1/(1 + e^{−x}) are used as output functions of units. This is a point of difference from the McCulloch-Pitts model and the perceptron, which use the Heaviside function as the output function of units, and it is the reason why a learning algorithm can be derived for multilayer networks.

In a feed-forward type network, the input-output relationship defines a mapping which is called the input-output mapping of the network. We study the problem of network capabilities from the point of view of input-output mappings.

In this paper, we start from an integral formula recently proposed by Irie and Miyake (1988) and prove the theorem which guarantees the approximate realization, in the sense of uniform topology, of continuous mappings by three-layer (one hidden layer) networks whose output functions for the hidden layer are sigmoid and whose output functions for the input and output layers are linear. It is easy to prove the theorem for k (≥ 3)-layer networks by using the theorem for the three-layer case, but the proof for the case k > 3 gives only trivial approximate realizations of given mappings. Therefore we give another proof for the four-layer case by using the Kolmogorov-Arnold-Sprecher theorem (Kolmogorov, 1957; Sprecher, 1965).

McCulloch-Pitts showed that any logical circuit can be designed using their model. Correspondingly, our assertion shows that any continuous mapping can be approximately represented by the Rumelhart-Hinton-Williams multilayer network.

We prove the following theorems and corollaries in this paper.

2. MULTILAYER NEURAL NETWORKS

The Rumelhart-Hinton-Williams multilayer network that we consider here is a feed-forward type network with connections between adjoining layers only. Networks generally have hidden layers between the input and output layers. Each layer consists of computational units. The input-output relationship of each unit is represented by inputs x_i, output y, connection weights w_i, threshold θ, and a differentiable function φ as follows:

    y = φ(Σ_i w_i x_i − θ).

The learning rule of this network is known as the back propagation algorithm (Rumelhart, Hinton, & Williams, 1986). The back propagation algorithm uses a gradient descent method to modify weights and thresholds so that the error between the desired output and the output signal of the network is minimized. We generally use a bounded and monotone increasing differentiable function, called a sigmoid function, as each unit's output function.

If a multilayer network has n input units and m output units, then the input-output relationship defines a continuous mapping from n-dimensional Euclidean space to m-dimensional Euclidean space. We call this mapping the input-output mapping of the network. We study the problem of network capabilities from the point of view of input-output mappings. It is observed that for the study of mappings defined by multilayer networks it is sufficient to consider networks whose output functions for hidden layers are the above φ(x) and whose output functions for the input and output layers are linear.

3. APPROXIMATE REALIZATION OF CONTINUOUS MAPPINGS BY NEURAL NETWORKS

We shall consider the possibility of representing continuous mappings by neural networks whose output functions in hidden layers are sigmoid, for example, φ(x) = 1/(1 + e^{−x}). It is simply noted here that general continuous mappings cannot be exactly represented by Rumelhart-Hinton-Williams' networks. For example, if a real analytic output function such as the sigmoid function φ(x) = 1/(1 + e^{−x}) is used, then the input-output mapping of the network is analytic and generally cannot represent all continuous mappings.

Let points of n-dimensional Euclidean space R^n be denoted by x = (x_1, ..., x_n), and let the norm of x be defined by |x| = (Σ_{i=1}^{n} x_i^2)^{1/2}.

Theorem 1. Let φ(x) be a nonconstant, bounded and monotone increasing continuous function. Let K be a compact subset (bounded closed subset) of R^n, and let f(x_1, ..., x_n) be a real-valued continuous function on K. Then for an arbitrary ε > 0, there exist an integer N and real constants c_i, θ_i (i = 1, ..., N), w_{ij} (i = 1, ..., N; j = 1, ..., n) such that

    f̃(x_1, ..., x_n) = Σ_{i=1}^{N} c_i φ(Σ_{j=1}^{n} w_{ij} x_j − θ_i)

satisfies max_{x ∈ K} |f(x_1, ..., x_n) − f̃(x_1, ..., x_n)| < ε. In other words, for an arbitrary ε > 0, there exists a three-layer network whose output functions for the hidden layer are φ(x), whose output functions for the input and output layers are linear, and which has an input-output function f̃(x_1, ..., x_n) such that max_{x ∈ K} |f(x_1, ..., x_n) − f̃(x_1, ..., x_n)| < ε.
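Numerically, the form asserted by Theorem 1 is easy to exercise. The following sketch (ours, not the paper's; the target function, the number of hidden units, and the use of least squares to find the output coefficients are all illustrative assumptions) builds f̃(x) = Σ_i c_i φ(w_i x − θ_i) for a one-dimensional target on K = [−1, 1]:

```python
# A minimal numerical sketch (ours, not the paper's) of the network form in
# Theorem 1: f~(x) = sum_i c_i * phi(w_i x - theta_i) with sigmoid hidden
# units and linear input/output units.
import numpy as np

def phi(x):
    # sigmoid output function, clipped for numerical safety
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

rng = np.random.default_rng(0)
N = 50                                   # number of hidden units
w = rng.normal(scale=5.0, size=N)        # hidden weights w_i (n = 1 input)
theta = rng.uniform(-5.0, 5.0, size=N)   # thresholds theta_i

x = np.linspace(-1.0, 1.0, 400)          # the compact set K = [-1, 1]
f = np.sin(np.pi * x)                    # a continuous target on K

# Hidden activations phi(w_i x - theta_i); solve for linear output weights.
H = phi(np.outer(x, w) - theta)          # shape (400, N)
c, *_ = np.linalg.lstsq(H, f, rcond=None)

print("sup-norm error on K:", np.max(np.abs(f - H @ c)))
```

The theorem only asserts the existence of suitable constants; least squares over randomly drawn hidden weights is merely one convenient way to exhibit a small sup-norm error.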
...(∫_K |f(x_1, ..., x_n) − f̃(x_1, ..., x_n)|^2 dx)^{1/2} < ε.

The following examples are included in the class of sigmoid functions considered here. Sigmoid functions φ(x) with φ(−∞) = 0 and φ(∞) = 1 are appropriate as output functions in the neural model, because if we set φ_ε(x) = φ(x/ε) (ε > 0), then these converge to the threshold function of the McCulloch-Pitts neural model and the perceptron as ε → +0.

McCulloch-Pitts showed that one can design any logical circuit using their model. Correspondingly, Theorem 2 above shows that any continuous mapping can be approximately represented by multilayer networks with sigmoid output functions.
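The limit φ_ε → threshold can be checked directly; a small sketch (ours, illustration only) evaluates the sup-distance away from the discontinuity at x = 0:

```python
# Illustration (ours): phi_eps(x) = phi(x / eps) approaches the Heaviside
# threshold function of the McCulloch-Pitts model as eps -> +0, uniformly
# away from the discontinuity at x = 0.
import numpy as np

def phi(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

heaviside = lambda x: (x > 0).astype(float)

x = np.concatenate([np.linspace(-5, -0.1, 200), np.linspace(0.1, 5, 200)])
for eps in (1.0, 0.1, 0.01):
    err = np.max(np.abs(phi(x / eps) - heaviside(x)))
    print(f"eps = {eps:5.2f}   sup |phi_eps - H| for |x| >= 0.1: {err:.2e}")
```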
4. NOTATION AND PRELIMINARY 1

The L^p-norm (p ≥ 1) of f(x) on R^n is defined by

    ‖f(x)‖_{L^p} = (∫_{R^n} |f(x)|^p dx)^{1/p},

and the convergence f_n(x) → f(x) in L^p(R^n) is defined by

    lim_{n→∞} ‖f_n(x) − f(x)‖_{L^p} = 0.

Generally, for any measurable set K, L^p(K) (p ≥ 1) is defined similarly.

Let ρ(x) be a function on R^n which satisfies the following conditions:

(i) ρ(x) ≥ 0, ρ(x) has continuous partial derivatives of all orders, and the support is contained in the unit sphere |x| ≤ 1;
(ii) ∫_{R^n} ρ(x) dx = 1.

Then, for ε > 0, set ρ_ε(x) = (1/ε)^n ρ(x/ε). If u(x) ∈ L^1_loc, that is, u(x) is locally summable, consider

    ρ_ε * u(x) = ∫_{R^n} ρ_ε(x − y) u(y) dy;

then the following assertions hold: (a) ρ_ε * u(x) ∈ C^∞(R^n); (b) if u(x) is continuous with compact support, then ρ_ε * u(x) → u(x) (ε → +0) uniformly on R^n.
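A sketch of the mollifier construction (the choice of u and the grids are ours): ρ is the standard C^∞ bump, and ρ_ε * u converges uniformly to a continuous u as ε → +0.

```python
# Sketch (ours) of mollification: rho is the C-infinity bump supported in
# |x| <= 1, normalized to integral 1; rho_eps * u -> u uniformly for a
# continuous u as eps -> +0.
import numpy as np

def rho(x):
    out = np.zeros_like(x)
    m = np.abs(x) < 1.0
    out[m] = np.exp(-1.0 / (1.0 - x[m] ** 2))
    return out

x, dx = np.linspace(-3.0, 3.0, 6001, retstep=True)
Z = np.sum(rho(x)) * dx                   # normalizing constant for rho

u_vals = np.abs(np.sin(2.0 * x))          # a continuous, non-smooth u(x)
interior = np.abs(x) <= 2.0               # avoid convolution edge effects

for eps in (0.5, 0.1, 0.02):
    kernel = rho(x / eps) / (eps * Z)     # rho_eps(x) = (1/eps) rho(x/eps)
    smooth = np.convolve(u_vals, kernel, mode="same") * dx  # rho_eps * u
    print(eps, np.max(np.abs(smooth - u_vals)[interior]))   # -> 0 as eps -> 0
```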
For f(x) ∈ L^1(R^n), the Fourier transform

    F(ξ) = ∫_{R^n} f(x) e^{−i(x,ξ)} dx,   (1)

where (x, ξ) = Σ_{i=1}^{n} x_i ξ_i, can be defined, and we set f̂(ξ) = 𝔉f(ξ). If f(x) satisfies the additional condition that f(x) has continuous partial derivatives of order up to n, then f(x) can be represented at each point by the inverse Fourier transform of f̂(ξ) as follows:

    f(x) = (2π)^{−n} ∫_{R^n} f̂(ξ) e^{i(x,ξ)} dξ.   (2)

5. PRELIMINARY 2 (IRIE-MIYAKE'S INTEGRAL FORMULA)

The following theorem is a starting point for the proof of Theorem 1.

Theorem (Irie-Miyake). Let ψ(x) ∈ L^1(R), that is, let ψ(x) be absolutely integrable, and let f(x_1, ..., x_n) ∈ L^2(R^n). Let Ψ(ξ) and F(w_1, ..., w_n) be the Fourier transforms of ψ(x) and f(x_1, ..., x_n) respectively. If Ψ(1) ≠ 0, then

    f(x_1, ..., x_n) = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} ψ(Σ_{i=1}^{n} x_i w_i − w_0) · F(w_1, ..., w_n)/((2π)^n Ψ(1)) · exp(i w_0) dw_0 dw_1 ⋯ dw_n.

Remark. This formula precisely asserts that if we set

    I_{∞,A}(x_1, ..., x_n) = ∫_{−A}^{A} ⋯ ∫_{−A}^{A} [∫_{−∞}^{∞} ψ(Σ_{i=1}^{n} x_i w_i − w_0) · F(w_1, ..., w_n)/((2π)^n Ψ(1)) · exp(i w_0) dw_0] dw_1 ⋯ dw_n,

then I_{∞,A} converges to f as A → ∞.
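The integral formula can be checked numerically in one dimension. In the sketch below (ours), ψ is taken as a difference of shifted sigmoids, ψ(x) = φ(x + α) − φ(x − α), anticipating Lemma 1 of Section 6 (this ψ is absolutely integrable with Ψ(1) ≠ 0), and f is a Gaussian whose Fourier transform under convention (1) is known in closed form:

```python
# One-dimensional numerical check (ours) of the Irie-Miyake formula.
import numpy as np

def phi(u):
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30.0, 30.0)))

alpha = 2.0
psi = lambda u: phi(u + alpha) - phi(u - alpha)   # absolutely integrable

f = lambda x: np.exp(-0.5 * x ** 2)
F = lambda w: np.sqrt(2.0 * np.pi) * np.exp(-0.5 * w ** 2)  # F(w) for n = 1

w0, dw0 = np.linspace(-50.0, 50.0, 4001, retstep=True)
w, dw = np.linspace(-10.0, 10.0, 801, retstep=True)
Psi1 = np.sum(psi(w0) * np.exp(-1j * w0)) * dw0             # Psi(1)

def reconstruct(x):
    # inner integral over w0, then the outer integral over w
    inner = psi(x * w[:, None] - w0[None, :]) @ (np.exp(1j * w0) * dw0)
    return np.real(np.sum(inner * F(w)) * dw / (2.0 * np.pi * Psi1))

for x in (-1.0, 0.0, 0.5, 2.0):
    print(x, f(x), reconstruct(x))        # the two columns agree closely
```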
The sigmoid function 1/(1 + e^{−x}) itself does not satisfy the condition of this formula that ψ(x) be absolutely integrable, and so the formula does not directly give the realization theorem of functions by networks. For the proof of Theorem 1, ψ(x) is therefore taken of the form

    ψ(x) = φ(x + α) − φ(x − α)

for some α, so that ψ(x) satisfies Lemma 1 in Section 6. In the proof of Lemma 1, since the Fourier transform G_1(ξ) of g_1(x) = φ(x + α) − φ(x − α) ∈ L^1(R) is continuous, it follows from (1) and (2) that G_1(ξ) is identically zero (for any ξ > 0); therefore

    ∫_{−L}^{L} |g(x)| dx = ∫_{−L}^{L} g(x) dx = ⋯

A further lemma states that the integral of a continuous function h(x_1, ..., x_m, t_1, ..., t_n) over [−A_1, A_1] × ⋯ × [−A_n, A_n] can be approximated uniformly on K by the Riemann sum

    Σ_{k_1=1}^{N} ⋯ Σ_{k_n=1}^{N} h(x_1, ..., x_m, −A_1 + (2A_1 k_1)/N, ..., −A_n + (2A_n k_n)/N) · (2A_1 ⋯ 2A_n)/N^n,

and the assertion of the Lemma is obvious from the corresponding estimate. q.e.d.

The essential part of the proof of Irie-Miyake's integral formula is the equality

    I_{∞,A}(x_1, ..., x_n) = J_A(x_1, ..., x_n),   (7)

and this is derived from

    ∫_{−∞}^{∞} ψ(Σ_{i=1}^{n} x_i w_i − w_0) exp(i w_0) dw_0 = exp(i Σ_{i=1}^{n} x_i w_i) · Ψ(1).   (8)
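Equality (8) is just the substitution t = Σ_i x_i w_i − w_0 in the inner integral; a numerical spot-check (ours, with an illustrative constant c standing for Σ_i x_i w_i):

```python
# Spot-check (ours) of equality (8): integral of psi(c - w0) e^{i w0} dw0
# equals e^{i c} Psi(1).
import numpy as np

def phi(u):
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30.0, 30.0)))

alpha = 2.0
psi = lambda u: phi(u + alpha) - phi(u - alpha)   # absolutely integrable

u, du = np.linspace(-60.0, 60.0, 200001, retstep=True)
c = 1.3                                  # plays the role of sum_i x_i w_i

lhs = np.sum(psi(c - u) * np.exp(1j * u)) * du    # left side of (8)
Psi1 = np.sum(psi(u) * np.exp(-1j * u)) * du      # Psi(1)
print(abs(lhs - np.exp(1j * c) * Psi1))           # ~ 0 up to quadrature error
```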
7. PROOF OF THEOREMS

We will prove our theorems of Section 3 under the above preliminaries.

Proof of Theorem 1. Step 1. ρ_α * f(x) → f(x) (α → +0) uniformly on R^n. Therefore we may suppose f(x) is a C^∞-function with compact support for proving Theorem 1. By the Paley-Wiener theorem (see, e.g., Yosida, 1968), the Fourier transform F(w) (w = (w_1, ..., w_n)) of f(x) is real analytic and, for any integer N, there exists a constant C_N such that

    |F(w)| ≤ C_N (1 + |w|)^{−N}.   (3)

In particular, F(w) ∈ L^1 ∩ L^2(R^n). We define I_A(x_1, ..., x_n), I_{∞,A}(x_1, ..., x_n) and J_A(x_1, ..., x_n) as follows:

    I_A(x_1, ..., x_n) = ∫_{−A}^{A} ⋯ ∫_{−A}^{A} ψ(Σ_i x_i w_i − w_0) · F(w_1, ..., w_n)/((2π)^n Ψ(1)) · exp(i w_0) dw_0 dw_1 ⋯ dw_n,   (4)

    I_{∞,A}(x_1, ..., x_n) = ∫_{−A}^{A} ⋯ ∫_{−A}^{A} [∫_{−∞}^{∞} ψ(Σ_i x_i w_i − w_0) · F(w_1, ..., w_n)/((2π)^n Ψ(1)) · exp(i w_0) dw_0] dw_1 ⋯ dw_n,   (5)

    J_A(x_1, ..., x_n) = (2π)^{−n} ∫_{−A}^{A} ⋯ ∫_{−A}^{A} F(w_1, ..., w_n) exp(i Σ_i x_i w_i) dw_1 ⋯ dw_n.   (6)

By (7) and (8), I_{∞,A} = J_A, and since F(w) is summable, J_A(x) → f(x) (A → ∞) uniformly on R^n. That is to say, we can state that for any ε > 0 there exists A > 0 such that

    max_{x ∈ R^n} |I_{∞,A}(x_1, ..., x_n) − f(x_1, ..., x_n)| < ε/2.   (i)

Step 2. We will approximate I_{∞,A} by finite integrals on K. For ε > 0, fix A which satisfies (i). For A' > 0, set

    I_{A',A}(x_1, ..., x_n) = ∫_{−A}^{A} ⋯ ∫_{−A}^{A} [∫_{−A'}^{A'} ψ(Σ_i x_i w_i − w_0) · F(w_1, ..., w_n)/((2π)^n Ψ(1)) · exp(i w_0) dw_0] dw_1 ⋯ dw_n.

We will show that, for ε > 0, we can take A' > 0 so that

    max_{x ∈ K} |I_{A',A}(x_1, ..., x_n) − I_{∞,A}(x_1, ..., x_n)| < ε/2.
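The remaining step of the proof (not reproduced in this excerpt) replaces the finite integral by the Riemann sum of the lemma above, whose terms are exactly hidden-unit responses ψ(Σ_i x_i w_i − w_0). The following toy check (ours; the integrand h is a hypothetical example) illustrates the uniform convergence of such Riemann sums on a compact set:

```python
# Uniform Riemann-sum approximation of a parameterized integral, in the
# spirit of the lemma; h is an illustrative continuous integrand, K = [0, 1].
import numpy as np

h = lambda x, t: np.cos(x[:, None] * t[None, :])
A = 3.0
x = np.linspace(0.0, 1.0, 101)                    # grid over the compact K

exact = 2.0 * np.sin(A * x) / np.where(x == 0, 1.0, x)  # int_{-A}^{A} cos(xt) dt
exact[x == 0] = 2.0 * A                                  # limit at x = 0

for N in (100, 1000, 10000):
    t_k = -A + 2.0 * A * np.arange(1, N + 1) / N         # nodes -A + 2A k / N
    riemann = h(x, t_k).sum(axis=1) * (2.0 * A / N)
    print(N, np.max(np.abs(exact - riemann)))            # sup over x in K
```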
Each term ψ(Σ_i x_i w_i − w_0) of the Riemann sum can be represented by a three-layer network. Therefore f(x) can be represented approximately by three-layer networks. q.e.d.

Theorem (Sprecher). For each integer n ≥ 2, there exists a real, monotone increasing function ψ(x), ψ([0, 1]) = [0, 1], dependent on n and having the following property:

For each preassigned number δ > 0 there is a rational number ε, 0 < ε < δ, such that every real continuous
function of n variables, f(x), defined on [0, 1]^n, can be represented as

    f(x) = Σ_{j=1}^{2n+1} χ(Σ_{i=1}^{n} λ^i ψ(x_i + ε(j − 1)) + j − 1),

where the function χ is real and continuous and λ is a constant independent of f.

In the network realization of this representation, ψ is used for the first hidden layer, and the χ_p (p = 1, ..., m), given by Sprecher's theorem for f_p(x), are used for the second hidden layer.

9. ALTERNATIVE PROOF OF THEOREM 2 FOR THE CASE k = 4

In Section 8, we reviewed Kolmogorov's theorem and its refinement from the point of view of neural networks. The Kolmogorov-Arnold-Sprecher theorem and the following proposition are used to prove our Theorem 2 for the case k = 4. This proposition is a special case (the one variable case) of Theorem 1 in Section 3.

Proposition. Let g(x) be a continuous function on R and φ(x) a bounded and monotone increasing continuous function. For an arbitrary compact subset (bounded closed subset) K of R and an arbitrary ε > 0, there are an integer N and real constants a_i, b_i, c_i (i = 1, ..., N) such that

    |g(x) − Σ_{i=1}^{N} c_i φ(a_i x + b_i)| < ε

holds on K.

In the appendix, we shall state a direct proof of the above proposition by a different method, without using Fourier transforms, under the additional condition that φ(x) has a weak derivative which is summable.

Next we prove Theorem 2 for the case k = 4. By Sprecher's theorem,

    f_p(x) = Σ_{j=1}^{2n+1} χ_p(Σ_{i=1}^{n} λ^i ψ(x_i + ε(j − 1)) + j − 1)   (p = 1, ..., m),

where λ and ε are constants. We apply our proposition to the functions χ_p and ψ, and approximate these functions using a sigmoid function φ.

Let K_j (j = 1, ..., 2n + 1) be the images of [0, 1]^n by the mappings

    τ_j : x ↦ Σ_{i=1}^{n} λ^i ψ(x_i + ε(j − 1)) + j − 1.

We approximate χ_p by

    χ_{p,N}(x) = Σ_{i=1}^{N} c_{i,N} φ(a_{i,N} x + b_{i,N})   (9)

so that

    |χ_p(x) − χ_{p,N}(x)| < ε/(4n + 2)   (p = 1, ..., m)   (10)

on K̃_j, a neighborhood of K_j. As the χ_{p,N}(x) are uniformly continuous on K̃_j, a sufficiently small η can be taken so that if |x − y| < η (x, y ∈ K̃_j), then |χ_{p,N}(x) − χ_{p,N}(y)| < ε/(4n + 2) (p = 1, ..., m).

We apply our lemma to τ_j and approximate τ_j on [0, 1]^n by τ_{j,N'} so that

    |τ_j(x) − τ_{j,N'}(x)| < min(η, δ),   (11)

where the τ_{j,N'}(x) (j = 1, ..., 2n + 1) are defined as follows: we approximate ψ(x) by

    ψ_{N'}(x) = Σ_{i=1}^{N'} e_i φ(ã_i x + b̃_i)   (12)

on the 2nε-neighborhood of [0, 1] and set τ_{j,N'}(x) = Σ_{i=1}^{n} λ^i ψ_{N'}(x_i + ε(j − 1)) + j − 1, so that the above inequality (11) is satisfied. Replacing

    Σ_{j=1}^{2n+1} χ_p[τ_j(x)]   by   Σ_{j=1}^{2n+1} χ_{p,N}[τ_{j,N'}(x)]   (p = 1, ..., m),

we approximate f_p(x) on [0, 1]^n so that the errors are less than ε. Looking at the form of this approximation, the theorem is obtained. q.e.d.
10. NEURAL NETWORK AND INFORMATION PROCESSING IN THE BRAIN

In the Rumelhart-Hinton-Williams multilayer neural network, the input and output values of each unit correspond to pulse frequencies in a neuron, and thus each unit, disregarding time characteristics, is a very simple model of the neuron. When a neural network is implemented for pattern recognition in engineering fields, output units correspond to gnostic cells in the brain.

The approximate realization of continuous mappings using neural networks, which are simple models of the neural system, suggests that there are several gnostic cells in the brain. It also shows the possibility of revealing information processing in the brain through neural network approaches.

11. SUMMARY

We proved the approximate realization theorem of continuous functions by three-layer networks. This theorem leads to the approximate realization theorem of continuous mappings by k (≥ 3)-layer networks, and we showed that any mapping whose components are summable on a compact subset can be approximately represented by k (≥ 3)-layer networks in the sense of the L^2-norm. We also showed an alternative proof of the theorem for the case k = 4 by using the Kolmogorov-Arnold-Sprecher theorem and a proposition which is a special case of the three-layer case. We consider that one of the problems of analyzing neural network capabilities is solved in the form of an existence theorem for networks which are approximately capable of representing any given mapping.

Presently, for applications of neural networks to pattern recognition and related engineering fields, up to four-layer networks are used (Waibel, Hanazawa, Hinton, Shikano, & Lang, 1988; Tamura & Waibel, 1988). The theorems proved here provide the mathematical base, and their use would be fundamental in further discussions of neural network system theory.
REFERENCES

Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC-16, 299-307.

Duda, R. O., & Fossum, H. (1966). Pattern classification by iteratively determined linear and piecewise linear discriminant functions. IEEE Transactions on Electronic Computers, EC-15, 220-232.

Gel'fand, I. M., & Shilov, G. E. (1964). Generalized functions (Vol. 1, Chap. 1). New York: Academic Press.

Hecht-Nielsen, R. (1987). Kolmogorov mapping neural network existence theorem. IEEE First International Conference on Neural Networks, 3, 11-13.

Huang, W. Y., & Lippmann, R. P. (1987). Neural net and traditional classifiers. In D. Z. Anderson (Ed.), Neural information processing systems, Denver, Colorado, 1987 (pp. 387-396). New York: American Institute of Physics.

Irie, B., & Miyake, S. (1988). Capabilities of three-layered perceptrons. IEEE International Conference on Neural Networks, 1, 641-648.

Kolmogorov, A. N. (1957). On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR, 144, 679-681; American Mathematical Society Translation, 28, 55-59 [1963].

Lippmann, R. P. (1987, April). An introduction to computing with neural nets. IEEE ASSP Magazine, 4, 4-22.

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.

Poggio, T. (1983). Visual algorithms. In O. J. Braddick & A. C. Sleigh (Eds.), Physical and biological processing of images (pp. 128-135). New York: Springer-Verlag.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Eds.), Parallel distributed processing (Vol. 1, pp. 318-362). Cambridge, MA: MIT Press.

Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145-168.

Sprecher, D. A. (1965). On the structure of continuous functions of several variables. Transactions of the American Mathematical Society, 115, 340-355.

Tamura, S., & Waibel, A. (1988). Noise reduction using connectionist models. 1988 International Conference on Acoustics, Speech, and Signal Processing (pp. 553-556).

Uesaka, Y. (1971). Analog perceptrons: On additive representation of functions. Information and Control, 19, 41-65.

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1988). Phoneme recognition: Neural networks vs. hidden Markov models. 1988 International Conference on Acoustics, Speech, and Signal Processing (pp. 107-110).

Wieland, A., & Leighton, R. (1987). Geometric analysis of neural network capabilities. IEEE First International Conference on Neural Networks, 3, 385-392.

Yosida, K. (1968). Functional analysis. New York: Springer-Verlag.
APPENDIX (DIRECT PROOF OF THE PROPOSITION IN SECTION 9 BY A DIFFERENT METHOD)

Proof. There is a continuous function g̃(x) on R which has compact support such that g̃(x) = g(x) on K. We may prove the proposition for g̃(x), and so we may initially suppose that g(x) has compact support. We may also suppose that φ(∞) − φ(−∞) = 1. For an arbitrary ε > 0, we will approximate g(x) on K by a summation of sigmoid functions whose variables are shifted and scaled.

Initially, we can approximate g(x) by a simple function (step function) c(x) with compact support so that

    |g(x) − c(x)| < ε/2   (A.1)

on R and whose step variances are less than ε/4. Here c(x) is represented using the Heaviside function H(x) as follows:

    c(x) = Σ_i c_i H(x − x_i).

For a sigmoid function φ(x), set φ_α(x) = φ(x/α) (α > 0). Then φ'_α(x) = (d/dx) φ_α(x) converges to the delta function as α → 0. We consider the convolution c*φ'_α(x) of c(x) and φ'_α(x). We set 2ε' = the minimum width of the steps and obtain

    c(x) − c*φ'_α(x) = ∫_{−∞}^{∞} φ'_α(y)[c(x) − c(x − y)] dy.

Divide the integral of the right term into (−∞, −ε'), [−ε', ε'] and (ε', ∞), and estimate these using the properties of sigmoid functions; the terms are arbitrarily small for a sufficiently small α. Therefore we obtain

    |c(x) − c*φ'_α(x)| < ε/4.

As c*φ'_α(x) = c'*φ_α(x) and c'(x) is given by c'(x) = Σ_i c_i δ(x − x_i), c*φ'_α(x) is represented as follows:

    c*φ'_α(x) = Σ_{i=1}^{n} c_i φ_α(x − x_i),

and so

    |c(x) − Σ_{i=1}^{n} c_i φ_α(x − x_i)| < ε/2.   (A.2)

Using (A.1) and (A.2), we obtain

    |g(x) − Σ_{i=1}^{n} c_i φ_α(x − x_i)| < ε.
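The appendix argument translates directly into a computation. In the sketch below (ours; the target g, the step edges, and the values of α are illustrative), a step-function approximation of g is built first, and each Heaviside step H(x − x_i) is then replaced by the scaled sigmoid φ_α(x − x_i) = φ((x − x_i)/α):

```python
# The appendix construction as a computation (illustrative choices only).
import numpy as np

def phi(u):
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30.0, 30.0)))

g = lambda x: np.maximum(0.0, 1.0 - x ** 2)     # continuous, compact support

x = np.linspace(-2.0, 2.0, 2001)
edges = np.linspace(-2.0, 2.0, 81)              # step edges x_i
jumps = np.diff(np.concatenate([[0.0], g(edges)]))   # c_i: partial sums of
                                                      # the jumps give g(edges)

def approx(alpha):
    # sum_i c_i phi((x - x_i) / alpha), replacing each Heaviside step
    return phi((x[:, None] - edges[None, :]) / alpha) @ jumps

for alpha in (0.5, 0.1, 0.02):
    print(alpha, np.max(np.abs(g(x) - approx(alpha))))
```

As α shrinks, the error settles at the resolution of the underlying step function, mirroring the split of the estimate into (A.1) and (A.2).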