
Neural Networks, Vol. 9, No. 5, pp. 819-835, 1996
Pergamon
Copyright © 1996 Elsevier Science Ltd. All rights reserved
Printed in Great Britain
0893-6080/96 $15.00 + .00
0893-6080(95)00107-7

CONTRIBUTED ARTICLE

Characterization of a Class of Sigmoid Functions with Applications to Neural Networks

ANIL MENON, KISHAN MEHROTRA, CHILUKURI K. MOHAN AND SANJAY RANKA

Syracuse University

(Received 21 November 1994; revised and accepted 15 August 1995)

Abstract-We study two classes of sigmoids: the simple sigmoids, defined to be odd, asymptotically bounded, completely monotone functions in one variable, and the hyperbolic sigmoids, a proper subset of simple sigmoids and a natural generalization of the hyperbolic tangent. We obtain a complete characterization for the inverses of hyperbolic sigmoids using Euler's incomplete beta functions, and describe composition rules that illustrate how such functions may be synthesized from others. These results are applied to two problems. First we show that with respect to simple sigmoids the continuous Cohen-Grossberg-Hopfield model can be reduced to the (associated) Legendre differential equations. Second, we show that the effect of using simple sigmoids as node transfer functions in a one-hidden layer feedforward network with one summing output may be interpreted as representing the output function as a Fourier series sine transform evaluated at the hidden layer node inputs, thus extending and complementing earlier results in this area. Copyright © 1996 Elsevier Science Ltd

Keywords-Sigmoid functions, Hypergeometric series, Legendre equation, Cohen-Grossberg-Hopfield model, Additive model, Fourier transform.

Acknowledgements: We would like to acknowledge the helpful comments of the reviewers of this paper. We would also like to thank K. George Joseph for several useful discussions.

Requests for reprints should be sent to K. Mehrotra, School of Computer and Information Science, 2-120 Center for Science and Technology, Syracuse University, Syracuse, NY 13244-4100, USA; e-mail: kishan@top.ek.syr.edu

1. INTRODUCTION

Sigmoid functions, whose graphs are "S-shaped" curves, appear in a great variety of contexts, such as the transfer functions used in many neural networks.^1 Their ubiquity is no accident; these curves are among the simplest non-linear curves, striking a graceful balance between linear and non-linear behavior. Grossberg (1973) gives an illuminating and pertinent discussion of this point.

1 Other examples of the use of sigmoid functions are the logistic function in population models, the hyperbolic tangent in spin models, the Langevin function in magnetic dipole models, the Gudermannian function in special functions theory, the (cumulative) distribution functions in mathematical statistics, the piecewise approximators in nonlinear approximation theory, and the hysteresis curves in certain nonlinear systems.

Figure 1 shows three sigmoidal functions, viz. the hyperbolic tangent $y = \tanh(x)$ (graph A), the "logistic" sigmoid $y = 1/(1 + \exp(-x))$ (graph B), and the "algebraic" sigmoid, $y = x/\sqrt{1 + x^2}$ (graph C). Figure 2 shows their inverses, $x = \tanh^{-1}(y)$, $x = \ln(y/(1 - y))$, and $x = y/\sqrt{1 - y^2}$, labeled A', B' and C' respectively. In a few cases, sigmoid curves can be described by formulae; this rubric includes power series expansions (e.g., hyperbolic tangent), integral expressions (e.g., error function), compositions of simpler functions (e.g., the Gudermannian function), inverses of functions definable by formulae [e.g., the "complexified" Langevin function, a sigmoid defined as the inverse of the function, $1/x - \cot(x)$], and differential equations.

FIGURE 1. Some sigmoids.

FIGURE 2. Some inverse sigmoids.

Although the level of abstraction in many problems is such that one does not need to work with explicit formulae,^2 it is useful to study networks with specific transfer functions, as demonstrated by the following considerations.

2 For example, in neural net approximation theory, significant results can be obtained about the existence of realizations within preassigned tolerances, with very few constraints on the nature of the node transfer function; classic results along these lines are found in Cybenko (1989), Diaconis and Shahshahani (1984), Funahashi (1989), and Hornik et al. (1989).

1. In determining whether a single layered feedforward net is uniquely determined by its corresponding input-output map, Sussmann (1992) in his elegant proof of uniqueness specifically used the properties of the tanh(.) function. A later analysis by Albertini et al. (1993) obtained the same results with fewer assumptions on the node transfer function, but still requires such functions to be odd, and satisfy certain "independence" properties. With respect to the uniqueness problem, all node transfer functions are not equivalent.^3
2. Without tractable analytical forms to work with, many problems relating to sigmoids are resistant to theory. Neural net theory offers many examples. For example, there have been claims in the literature about the advantage (with respect to computability, training times, etc.) of certain sigmoidal transfer functions over others in back-propagation networks (Elliot, 1993; Barrington, 1993). Some theoretical support comes from considering the first derivatives (if defined) of the various transfer functions proposed; the first derivatives are partially responsible for controlling the step size in the weight adjustment phase of the back-propagation algorithms, which in turn influences the rate of convergence. Explicit expressions for sigmoids are useful in such considerations.
3. The dynamical system describing the continuous Cohen-Grossberg-Hopfield (CGH) model (Grossberg, 1988), also known as the additive STM model and the Cohen-Grossberg model, raises an intriguing query. If one assumes a tanh(.) node transfer function, one can show that the CGH model is transformable to the Legendre differential equation (see Section 6.1). An important question is whether this relationship is robust with respect to the choice of the transfer function.
4. The recent study of sigmoidal derivatives by Minai and Williams (1993) is another case in point; they derived a connection with Eulerian numbers [see Graham et al. (1989) for definitions], but restricted their inquiry to the very specific logistic sigmoid. Any generalization of their results requires a careful look at sigmoids representable by formulae.
5. There are other related issues. For instance, the hyperbolic tangent and logistic sigmoid are essentially equivalent, in that one can be obtained from the other by simple translation and scaling transformations. Specifically,

$$\frac{1}{1 + \exp(-x)} - \frac{1}{2} = \frac{1}{2}\tanh\left(\frac{x}{2}\right). \tag{1.1}$$

Many sigmoids have power series expansions which alternate in sign. Many have inverses with hypergeometric series expansions. On the other hand, many sigmoids have no such simple forms, or obvious connections with well-known sigmoids. It is natural to ask whether these varied analytical expressions for sigmoids have anything in common. It is difficult to answer such questions without a thorough understanding of the analytical expressions for sigmoid functions.

3 Another example of the non-equivalence of "sigmoids" is offered by the work of Macintyre and Sontag (1993) on the Vapnik-Chervonenkis (VC) dimension of feedforward networks, which showed that it is finite only for a class of sigmoidal functions they call the exp-RA functions. They showed that analyticity of the transfer function is crucial, and cannot be relaxed, for instance by making the function $C^\infty$.

In view of these considerations, this paper undertakes a study of two classes of sigmoids: the simple sigmoids, defined to be odd, asymptotically bounded, completely monotone functions in one variable, and the hyperbolic sigmoids, a proper subset of simple sigmoids and a natural generalization of the hyperbolic tangent. The class of hyperbolic sigmoids includes a surprising number of well-known sigmoids.
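The translation-and-scaling equivalence between the logistic sigmoid and the hyperbolic tangent stated in eqn (1.1) can be checked numerically. The following is a minimal sketch, not part of the original paper; the function names `logistic`, `logistic_via_tanh` and `max_identity_gap`, and the sample grid on [-8, 8], are illustrative assumptions.

```python
import math


def logistic(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_via_tanh(x: float) -> float:
    """Logistic sigmoid rebuilt from tanh: scale by 1/2, shift by 1/2 (eqn 1.1)."""
    return 0.5 + 0.5 * math.tanh(x / 2.0)


def max_identity_gap(points) -> float:
    """Largest absolute difference between the two expressions over the points."""
    return max(abs(logistic(x) - logistic_via_tanh(x)) for x in points)


# Hypothetical sample grid; any finite set of reals would do.
sample_points = [i / 10.0 for i in range(-80, 81)]
gap = max_identity_gap(sample_points)
print(f"largest gap on [-8, 8]: {gap:.2e}")
```

The gap should sit at the level of floating-point rounding, which is what "essentially equivalent" means here: results stated for tanh carry over to the logistic sigmoid after rescaling the input by 1/2 and the output by 1/2, then shifting the output by 1/2.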
The main contributions of the paper are as follows:

● Simple sigmoids, hyperbolic sigmoids and their inverses are completely characterized in Sections 4 and 5.
● Using series inversion techniques, in Section 5, we obtain the series expansions of hyperbolic sigmoids from those of their inverses. These results extend those of Minai and Williams (1993) for the logistic function.
● In Section 4, we study the composition of simple sigmoids via differentiation, addition, multiplication, and functional composition. These results also completely specify the relationship between Euler's incomplete beta function and the parametrized sigmoids.
● In Section 6.1 we show that the continuous Cohen-Grossberg-Hopfield equations belong to the class of non-homogeneous Legendre differential equations if the neural transfer function is a simple sigmoid.
● In Section 6.2 we establish a connection between Fourier transforms and feedforward nets with one summing output and one hidden layer whose nodes contain simple sigmoidal transfer functions.

We do not purport to have discovered a general framework to describe all sigmoids; indeed, such a quest is largely meaningless; nor are we arguing for limiting the notion of sigmoids to the classes considered in this paper. Simple sigmoids are rather special sigmoids, but their regular structure often makes the related theory tractable, paving the way for more general analysis.

2. PRELIMINARIES

Notation: $\mathbb{R}$ and $\mathbb{R}^+$ denote real space, and the set of positive real numbers, respectively; $(a, b)$ and $[a, b]$ denote the open and closed intervals from $a$ to $b$. If $A$ is a set, then $|A|$ is the cardinality of $A$. Given a function $f$, its domain and range are denoted by Dom($f$) and Ran($f$), respectively; $f^{(k)}$ refers to the $k$th derivative of $f$ (if it exists). Occasionally, we shall use $f'(x)$ in place of $f^{(1)}(x)$. If a function is $k$ times continuously differentiable on a given interval $I$, then we write $f \in C^k(I)$. $C^\infty$ functions are called smooth functions. The term "propositions" refers to results cited from external sources.

The concepts of real analytic functions (Krantz & Parks, 1992), absolute monotonic and completely monotonic functions (Widder, 1946) and hypergeometric functions (Erdélyi et al., 1953), are central to what follows; for convenience they are reviewed below.

DEFINITION 2.1 (real analyticity). Let $U \subseteq \mathbb{R}$ be an open set. A function $f : U \to \mathbb{R}$ is said to be real analytic^4 at $x_0 \in U$, if the function may be represented by a convergent power series on some interval of positive radius centered at $x_0$, i.e., $f(x) = \sum_{j=0}^{\infty} a_j (x - x_0)^j$. The function is said to be real analytic on $V \subseteq U$, if it is real analytic at each $x_0 \in V$. ■

4 Real analytic functions are also referred to as regular, holomorphic, or monogenic functions.

DEFINITION 2.2 (monotonicity). A function $f : \mathbb{R} \to \mathbb{R}$ is absolutely monotonic in $(a, b)$ iff it has non-negative derivatives of all orders, i.e., $f \in C^\infty((a, b))$ and,

$$f^{(k)}(x) \ge 0, \quad a < x < b, \quad k = 0, 1, 2, \ldots \tag{2.1}$$

A function $f : \mathbb{R} \to \mathbb{R}$ is completely monotonic in $(a, b)$, iff $f(-x)$ is absolutely monotonic in $(-b, -a)$. Equivalently, $f$ is completely monotonic in $(a, b)$ iff $f \in C^\infty((a, b))$ and,

$$(-1)^k f^{(k)}(x) \ge 0, \quad a < x < b, \quad k = 0, 1, 2, \ldots \tag{2.2}$$

A function $f : \mathbb{R} \to \mathbb{R}$ is completely convex in $(a, b)$, iff $f \in C^\infty((a, b))$, and for all non-negative $k$ and $x \in (a, b)$, $(-1)^k f^{(2k)}(x) \ge 0$. ■

A fundamental property of absolutely monotone and completely monotone functions is that they are necessarily real analytic on their domains.^5 Additionally, if $f$ is absolutely monotone on an interval $I \subseteq \mathbb{R}$, then it is non-negative, nondecreasing, convex, and continuous on $I$.

5 This is the celebrated "Bernstein's theorem" (Polya, 1974). In full, Bernstein's theorem asserts that given a function $f(z)$, if infinitely many of its derivatives $f^{(n_1)}, f^{(n_2)}, \ldots$ are of constant sign in the open interval $I$, and if the sequence $n_1, n_2, \ldots$ does not increase more rapidly than a geometric progression (i.e., there is a fixed quantity $C$, such that $\forall k\; n_{k+1}/n_k < C$), then $f(z)$ is analytic on the interval $I$.

DEFINITION 2.3. The generalized Gauss hypergeometric (GH) series ${}_pF_q(\alpha_1, \ldots, \alpha_p; \gamma_1, \ldots, \gamma_q; z)$ is defined by,

$${}_pF_q(\alpha_1, \ldots, \alpha_p; \gamma_1, \ldots, \gamma_q; z) = \sum_{k=0}^{\infty} \frac{(\alpha_1)_k (\alpha_2)_k \cdots (\alpha_p)_k}{(\gamma_1)_k (\gamma_2)_k \cdots (\gamma_q)_k} \frac{z^k}{k!}, \quad \forall i : \gamma_i \ne 0, -1, -2, \ldots \tag{2.3}$$

where $(a)_n = (a)(a + 1) \cdots (a + n - 1)$ is the rising factorial or Pochhammer's polynomial in $a$. By definition, $(a)_0 = 1$. The $\alpha_i$'s are the numerator
parameters, and the $\gamma_i$'s are referred to as the denominator parameters of the GH series. ■

In particular, the classical GH series^6 in $z$, ${}_2F_1(\alpha, \beta; \gamma; z)$ is defined by,

$${}_2F_1(\alpha, \beta; \gamma; z) = F(\alpha, \beta; \gamma; z) = \sum_{k=0}^{\infty} \frac{(\alpha)_k (\beta)_k}{(\gamma)_k} \frac{z^k}{k!}. \tag{2.4}$$

6 The classical GH series is referred to as the Gauss function in the literature (Spanier and Oldham, 1987).

REMARK 2.1. The ${}_pF_q$ representation of a hypergeometric series, though standard, is not a unique one. For example, the series

$$\sum_{k=0}^{\infty} \frac{z^k}{k!}$$

could be viewed as ${}_0F_0(;; z)$, or as ${}_1F_1(1; 1; z)$, or as ${}_3F_3(1, 1, 1; 1, 1, 1; z)$, etc. We shall henceforth use the "minimal" representation, in this case ${}_0F_0(;; z)$.

REMARK 2.2. In general, the parameters $\alpha_i$ and $\gamma_i$, as well as the variable $z$, are allowed to be complex; however, we follow common practice and restrict our attention to real values, i.e., $\forall i : \alpha_i, \gamma_i, z \in \mathbb{R}$. Even with this restriction, the hypergeometric function is amazingly versatile. Spanier and Oldham (1987) list over 170 functions that are representable in terms of the hypergeometric function. The hypergeometric function is a periodic table à la Mendeleev for mathematical functions; different functions get neatly pegged into various groups^7 by the values of the parameters and the form of the dependent variable.

7 "There must be many universities today where 95 per cent, if not 100 per cent, of the functions studied by physics, engineering, and even mathematics students, are covered by this single symbol $F(a, b; c; z)$." (W. W. Sawyer, cited by Graham et al., 1989.)

3. SIMPLE AND HYPERBOLIC SIGMOIDS

DEFINITION 3.1 (simple sigmoids). A function $\sigma : \mathbb{R} \to (-1, 1)$ is said to be a simple sigmoid if it satisfies the following conditions:

1. $\sigma(\cdot)$ is a smooth function, i.e., $\sigma(\cdot)$ is $C^\infty$.
2. $\sigma(\cdot)$ is an odd function, i.e., $\sigma(-x) = -\sigma(x)$.
3. $\sigma(\cdot)$ has $y = \pm 1$ as horizontal asymptotes, i.e., $\lim_{x \to \infty} \sigma(x) = 1$.
4. $\sigma(x)/x$ is a completely convex function in $(0, 1)$. ■

Simple sigmoids are required to be odd smooth functions bound by horizontal asymptotes; these constraints impose a degree of standardization on the kinds of sigmoids being considered. The following results clarify the implications of the fourth constraint.

PROPOSITION 3.1 (Feller, 1965). A function $f : (0, 1) \to \mathbb{R}$ is absolutely monotone on $(0, 1)$ iff it possesses a power series expansion with non-negative coefficients, converging for $0 < x < 1$. ■

LEMMA 3.1. A function $f : (0, 1) \to \mathbb{R}$ is completely monotone on $(0, 1)$ iff it possesses an alternating power series expansion, converging for $0 < x < 1$.

Proof.^8 If $f$ is completely monotone in $(0, 1)$, then the power series expansion of $f$ in $(0, 1)$ has to be alternating [because $(-1)^k f^{(k)} \ge 0$]. On the other hand, consider an alternating power series $f(x)$ converging for all $0 < x < 1$ and its derivatives:

$$\begin{aligned} f(x) &= a_0 - a_1 x + a_2 x^2 - a_3 x^3 + \cdots \\ (-1) f^{(1)}(x) &= a_1 - 2 a_2 x + 3 a_3 x^2 - \cdots \\ f^{(2)}(x) &= 2 a_2 - 6 a_3 x + \cdots \end{aligned} \tag{3.1}$$

where $a_i \ge 0\ \forall i$. From real analysis we know that each of $(-1)^n f^{(n)}(x)$ has the same convergence properties as eqn (3.1). Also, the sum of a convergent infinite alternating series is always less than or equal to the first term. This fact, along with the above equations, implies that $(-1)^k f^{(k)}(x) \ge 0$, i.e., $f(x)$ is completely monotone on $(0, 1)$. ■

8 Lemma 3.1 appears to be "folklore"; we have been unable to find a reference.

COROLLARY 3.1. $\sigma(x)/x$ is a completely convex function in $(0, 1)$ iff $\sigma(\sqrt{x})/\sqrt{x}$ is a completely monotone function in $(0, 1)$.

Proof. If $\sigma(x)/x$ is completely convex in $(0, 1)$, then it has to be analytic in $(0, 1)$ (Widder, 1946). Also, $\sigma(x)/x$ is an even function, implying that its power series expansion will consist only of even powers in $x$, which alternate in sign. From Lemma 3.1, $\sigma(\sqrt{x})/\sqrt{x}$ will hence be completely monotone in $(0, 1)$. The same argument suffices for the converse. ■

If a simple sigmoid is also strictly increasing, then a much stronger statement can be made, as demonstrated by the following proposition.

PROPOSITION 3.2 (Krantz & Parks, 1992). Let $y = \sigma(x)$ be a strictly increasing simple sigmoid (i.e., $\forall x \in \mathbb{R}$, $\sigma'(x) > 0$). Then:

1. $\eta \equiv \sigma^{-1} : (-1, 1) \to \mathbb{R}$ exists.
2. $\eta(y)$ is a strictly increasing function, analytic in the interval $(-1, 1)$.
3. $\eta'(y) = 1/\sigma'(\eta(y))$, where $\eta'$ and $\sigma'$ are the first derivatives of $\eta$ and $\sigma$ respectively.
4. $\eta(y)/y$ is absolutely monotone in $(0, 1)$.

REMARK 3.1. If $\sigma(x)/x$ is completely monotone on $(0, 1)$ and $\sigma$ is invertible then $\eta(y)/y$ is absolutely monotone on $(0, 1)$, where $\eta$ denotes the inverse of $\sigma$. The converse is also true, and is an immediate consequence of Lemma 3.1.

REMARK 3.2. A simple sigmoid has two horizontal asymptotes, hence its inverse (if it exists) will have two vertical asymptotes (i.e., $\lim_{y \to \pm 1} \eta(y) \to \pm\infty$).

It will be seen that as they have been defined, sigmoids and their inverses are quite similar; both are odd, increasing, univalent, analytical functions. However, the two differ fundamentally in that sigmoids are asymptotically bounded, while their inverses are not.

Simple sigmoids encompass many of the often used sigmoids described by formulae. The hyperbolic tangent and its close relative, the "exponential" or logistic sigmoid, are often used in many neural network theoretical studies and applications. For example, most of the "spin-glass explanations" of the CGH net use the hyperbolic tangent.^9 The hyperbolic tangent has, among others, the following properties:

1. It is an odd, strictly increasing analytical function, asymptotically bounded by the lines $y = \pm 1$.
2. Its inverse $\tanh^{-1}(y)$ has a GH expansion given by $yF(1, 1/2; 3/2; y^2)$.
3. The first derivative of $\tanh^{-1}(y)$ is given by $1/(1 - y^2) = {}_1F_0(1;; y^2)$, i.e., the GH expansion of the first derivative of $\tanh^{-1}(y)$ is dependent on only one numerator parameter.

9 Stochastic versions of neural nets often replace deterministic state assignment rules by probabilistic ones, obtained from some distribution, usually the Gibbs distribution (e.g., Boltzmann machines and stochastic CGH models). Computing expected values for the states of the system then leads to the hyperbolic tangent function. See Hertz et al. (1991) for a typical example.

It can be shown that many other simple sigmoids, such as that of Elliot (1993), and the Gudermannian (Section 4.2), also have inverses with classical GH series representations.^10 The function $\tanh^{-1}(y)/y$ satisfies a second order linear homogeneous differential equation, with three regular singular points, located at 0, 1 and $\infty$. A sigmoid with similar analytical behavior could be expected to have an inverse that is a solution to some second order Fuchsian equations.^11 Since any second order Fuchsian equation with three singularities can be transformed into the Gauss hypergeometric differential equation, one solution of which is the classical GH series (Klein-Bôcher theorem; Whittaker & Watson, 1927), it follows that the inverses would have classical series expansions. These considerations motivate the following definition.

10 The phenomenon is not unduly surprising. A heuristic explanation is that if the graphs of two functions "look" the same, their respective differential equations are usually members of the same family.

11 Fuchsian equations are linear differential equations each of whose singular points are regular (Rainville, 1964).

DEFINITION 3.2 (hyperbolic sigmoids). A function $\sigma : \mathbb{R} \to (-1, 1)$ is said to be a hyperbolic sigmoid function if it satisfies the following conditions:

1. $\sigma$ is a real analytic, odd, strictly increasing sigmoid, such that $\lim_{x \to \infty} \sigma(x) = 1$.
2. Let $\eta : (-1, 1) \to \mathbb{R}$ denote the inverse of $\sigma$, and $\eta'$ its first derivative. Then,
   (a) $\eta(y)/y$ has a Gauss hypergeometric series expansion in $y^2$ with at most three parameters.
   (b) $\eta'(y)$ has a Gauss hypergeometric series expansion in $y^2$ with at most one parameter.

4. CHARACTERIZATION: INVERSE HYPERBOLIC SIGMOIDS

The following result is a complete characterization for the inverses of hyperbolic sigmoids. Proofs are presented in the Appendix.

THEOREM 4.1 (inverses). Let $y = \sigma(x)$ be a hyperbolic sigmoid, and let $\eta : (-1, 1) \to \mathbb{R}$ be its inverse. Then, either

$$\eta(y) = yF(\alpha, 1/2; 3/2; y^2) \tag{4.1}$$

or

$$\eta(y) = yF(\alpha, -; -; y^2) = \frac{y}{(1 - y^2)^{\alpha}}, \quad \alpha > 0 \tag{4.2}$$

where, by $F(\alpha, -; -; y^2)$, we mean $F(\alpha, \beta; \beta; y^2)$ ($\beta \in \mathbb{R}$). ■

Notation. Each inverse hyperbolic sigmoid is denoted by $\eta_\alpha$ and is characterized by a single parameter $\alpha$.

COROLLARY 4.1. The set of hyperbolic sigmoids is a proper subset of the set of simple sigmoids.
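The two forms of the inverse in Theorem 4.1 can be evaluated directly from the GH series of eqn (2.4). The sketch below is illustrative and not from the paper: it assumes a plain truncated sum (400 terms, adequate only for |y| well inside (-1, 1)), and the names `gh_2f1`, `eta_first_form` and `eta_second_form` are hypothetical. It checks the first form with α = 1 against the inverse hyperbolic tangent, and the second form against its closed expression y/(1 - y²)^α.

```python
import math


def gh_2f1(a: float, b: float, c: float, z: float, terms: int = 400) -> float:
    """Truncated classical GH series F(a, b; c; z), intended for |z| < 1."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        # Ratio of consecutive terms: (a+k)(b+k) / ((c+k)(k+1)) * z
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * z
    return total


def eta_first_form(alpha: float, y: float) -> float:
    """Inverse of the form y F(alpha, 1/2; 3/2; y^2), as in eqn (4.1)."""
    return y * gh_2f1(alpha, 0.5, 1.5, y * y)


def eta_second_form(alpha: float, y: float, beta: float = 1.0) -> float:
    """Inverse of the form y F(alpha, beta; beta; y^2), as in eqn (4.2).

    The value of beta is arbitrary (assumed 1.0 here) because it cancels.
    """
    return y * gh_2f1(alpha, beta, beta, y * y)


# alpha = 1 in the first form should reproduce the inverse hyperbolic tangent.
print(eta_first_form(1.0, 0.5), math.atanh(0.5))
# alpha = 1/2 in the second form should reproduce y / sqrt(1 - y^2).
print(eta_second_form(0.5, 0.6), 0.6 / math.sqrt(1.0 - 0.36))
```

A fixed term count is a deliberate simplification; near y = ±1 the series converges slowly and a different evaluation strategy would be needed.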
A proof of Corollary 4.1 may be given along the following lines. If $\sigma$ is a hyperbolic sigmoid, then it is simple on the interval $(-1, 1)$. This follows from Theorem 4.1. The series representation for its inverse in $(-1, 1)$ has non-negative coefficients, and this implies $\eta(y)/y$ is absolutely monotone (Proposition 3.1). Hence $\sigma(x)/x$ is completely monotone, and therefore simple (Lemma 3.1 and Remark 3.1). The converse is not true. Simple sigmoids need not be hyperbolic. The error function erf(.) is simple, but one can use the study of Carlitz (1963) on this function to show that it does not have an inverse representable by a classical hypergeometric series. It follows that erf(.) is not a hyperbolic sigmoid, and hence the set of hyperbolic sigmoids is a proper subset of the set of simple sigmoids. ■

For specific values of its parameters, the hypergeometric function often reduces to other well known special functions. When inverse hyperbolic sigmoids are characterized by eqn (4.1), there is an intimate connection with Euler's incomplete beta function.

PROPOSITION 4.1 (Spanier and Oldham, 1987). Let $\alpha$, $\beta$ and $\gamma$ be such that $\beta = \gamma - 1$. Then,

$$z^{\beta} F(\alpha, \beta; \gamma; z) = \beta\, B(\beta; 1 - \alpha; z) \tag{4.3}$$

where $B(\nu; \mu; z)$ is the incomplete beta function, defined by $\int_0^z t^{\nu - 1}(1 - t)^{\mu - 1}\,dt$, where $0 < z < 1$. In particular,

$$\frac{1}{2} B(1/2; 1 - \alpha; z^2) = \int_0^{\tanh^{-1}(z)} \cosh^{2(\alpha - 1)}(t)\,dt. \quad ■$$

Spanier and Oldham (1987) give a detailed description of the many properties of this important special function. The following corollary is an immediate consequence of Theorem 4.1 and Proposition 4.1. It gives the connection between inverse hyperbolic sigmoids and Euler's incomplete beta function.

COROLLARY 4.2. If $\eta_\alpha(y) = yF(\alpha, 1/2; 3/2; y^2)$, then

$$\eta_\alpha(y) = \frac{1}{2} B(1/2; 1 - \alpha; y^2). \quad ■$$

The relationship between hyperbolic sigmoids and the incomplete beta function also makes explicit the relationship between the hyperbolic tangent function, and inverse hyperbolic sigmoids of form $yF(\alpha, 1/2; 3/2; y^2)$. Other consequences include:

1. The availability of good approximations for small values of $y$ and $(1 - y)$.
2. Rapidly converging series expansions for $y$ close to 1.
3. Connections with other indefinite integrals of powers of trigonometric or hyperbolic functions.
4. Connections with statistics via the function $L(p, q)$ (Spanier and Oldham, 1987).

When inverse hyperbolic sigmoids are characterized by eqn (4.2), we can use the identity,

$$\cosh(\tanh^{-1}(y)) = \frac{1}{\sqrt{1 - y^2}} \tag{4.4}$$

to show that,

$$y \cosh^{2\alpha}(\tanh^{-1}(y)) = \frac{y}{(1 - y^2)^{\alpha}}. \tag{4.5}$$

The fundamental role played by the hyperbolic tangent is once again evident. Here, it relates the two types of hyperbolic sigmoids defined by eqns (4.1) and (4.2).

4.1. New Inverses from Old

Theorem 4.1 makes it possible to generate new inverse hyperbolic sigmoids from others. The key idea is that if $yF(\alpha, 1/2; 3/2; y^2)$ is an inverse hyperbolic sigmoid, then so is $yF(\alpha + 1, 1/2; 3/2; y^2)$. A similar statement may be made for inverse hyperbolic sigmoids of the form $yF(\alpha, -; -; y^2)$. GH functions such as $F(\alpha, \beta; \gamma; z)$ and $F(\alpha + 1, \beta; \gamma; z)$ are said to be contiguous. Erdélyi et al. (1953) give a complete listing of the many identities that relate them. Lemma 4.1 is a straightforward consequence of three such identities.

LEMMA 4.1. If $\eta_\alpha : (-1, 1) \to \mathbb{R}$ is an inverse hyperbolic sigmoid, then the functions $\eta_{\alpha+1}$ and $\eta_{\alpha-1}$ defined by:

$$\eta_{\alpha+1}(y) = \frac{y^{2(1 - \alpha)}}{2\alpha} \frac{d}{dy}\left( y^{2\alpha - 1} \eta_\alpha(y) \right), \quad \alpha \ge 1 \tag{4.6}$$

$$\eta_{\alpha-1}(y) = \frac{y^{2\alpha - 1}}{(3 - 2\alpha)(1 - y^2)^{\alpha - 2}} \frac{d}{dy}\left\{ \frac{(1 - y^2)^{\alpha - 1}\, \eta_\alpha(y)}{y^{2\alpha - 2}} \right\}, \quad \alpha \ge 2 \tag{4.7}$$

are also inverse hyperbolic sigmoids. Also, there exist functions $K_1(\alpha, z)$, $K_2(\alpha, z)$ and $K_3(\alpha, z)$ such that the following relation holds:

$$K_1(\alpha, z)\,\eta_{\alpha-1}(y) + K_2(\alpha, z)\,\eta_\alpha(y) + K_3(\alpha, z)\,\eta_{\alpha+1}(y) = 0. \tag{4.8}$$

Proof. Equation (4.6), which defines $\eta_{\alpha+1}(y)$, results from the following identity:

$$(\alpha)_n z^{\alpha - 1} F(\alpha + n, \beta; \gamma; z) = \frac{d^n}{dz^n}\left[ z^{\alpha + n - 1} F(\alpha, \beta; \gamma; z) \right]. \tag{4.9}$$
In the following, we will use $F(\alpha)$ as an abbreviation for $F(\alpha, \beta; \gamma; z)$. Equation (4.7) follows from the identity:

$$(\gamma - \alpha)_n z^{\gamma - \alpha - 1}(1 - z)^{\alpha + \beta - \gamma - n} F(\alpha - n) = \frac{d^n}{dz^n}\left[ z^{\gamma - \alpha + n - 1}(1 - z)^{\alpha + \beta - \gamma} F(\alpha) \right]. \tag{4.10}$$

Equation (4.8) relating $\eta_{\alpha-1}(y)$, $\eta_\alpha(y)$, and $\eta_{\alpha+1}(y)$, is a consequence of the identity:

$$(\gamma - \alpha) F(\alpha - 1) + (2\alpha - \gamma - \alpha z + \beta z) F(\alpha) + \alpha(z - 1) F(\alpha + 1) = 0. \tag{4.11}$$

Inverse hyperbolic sigmoids come in two forms; one form has three parameters [eqn (4.1)], while the other has two "missing" parameters [eqn (4.2)]. Subject to a minor condition, the latter form is always obtainable from the former.

LEMMA 4.2. Let $\eta_\alpha = yF(\alpha, 1/2; 3/2; y^2)$, where $\alpha > 1$. Then the function $\eta_{\alpha-1}$ defined by:

$$\eta_{\alpha-1}(y) = y(1 - y^2)\frac{d}{dy}\eta_\alpha(y) = yF(\alpha - 1, -; -; y^2) \tag{4.12}$$

is an inverse hyperbolic sigmoid, with parameter $\alpha - 1$. ■

For inverse hyperbolic sigmoids with "missing" parameters, there is a very simple composition rule.

LEMMA 4.3. If $\eta_\alpha(y) = y/(1 - y^2)^{\alpha}$ and $\eta_{\alpha'}(y) = y/(1 - y^2)^{\alpha'}$ are two inverse hyperbolic sigmoids with $\alpha, \alpha' > 0$, then the function $(\eta_\alpha(y)\,\eta_{\alpha'}(y))/y$ is also an inverse hyperbolic sigmoid with parameter $(\alpha + \alpha')$. ■

In general, the set of inverse hyperbolic sigmoids is not closed under multiplication or addition. But if $\eta_\alpha$ and $\eta_{\alpha'}$ are inverses of two hyperbolic sigmoids then their sum would also be an inverse hyperbolic sigmoid $\eta_\mu$ for some $\mu \in \mathbb{R}$, i.e., $\eta_\alpha + \eta_{\alpha'} = K\eta_\mu$, for some $K$, if and only if

$$\eta_\alpha + \eta_{\alpha'} = K\eta_\mu \iff (\alpha)_n + (\alpha')_n = K(\mu)_n \quad \forall n \ge 1 \tag{4.13}$$

which in turn, is possible^12 if and only if $\alpha = \alpha'$, or $\alpha = 0$, or $\alpha' = 0$.

12 Equation (4.13), with $K = 1$, provides an amusing application for Fermat's last theorem; if we accept that for all $n > 2$, there cannot exist positive integers $a$, $b$ and $c$ satisfying the identity $a^n + b^n = c^n$, then we may conclude that the sum of inverse hyperbolic sigmoids with different integral parameters cannot be an inverse hyperbolic sigmoid with an integral parameter.

The definition of hyperbolic sigmoids implies that their inverses have GH expansions in $y^2$. Theorem 4.2 relaxes this requirement by only requiring GH expansions in some odd, injective $C^\infty$ function $g(y)$. A proof is provided in the Appendix.

THEOREM 4.2. Let $\sigma : \mathbb{R} \to (-1, 1)$ be a real analytic, odd, strictly increasing sigmoid, such that its inverse $\eta : (-1, 1) \to \mathbb{R}$ has a GH series expansion in some injective, odd, increasing $C^\infty$ function $g(\cdot)$, with at most three parameters, convergent in $(-1, 1)$. Also let $\eta'$ have a GH series expansion in $g(\cdot)$, with at most one parameter. Then, either

$$\eta(y) = g(y)\,F(\alpha, 1/2; 3/2; (g(y))^2) \tag{4.14}$$

or,

$$\eta(y) = g(y)\,F(\alpha, -; -; (g(y))^2) = \frac{g(y)}{(1 - (g(y))^2)^{\alpha}} \quad \text{with } \alpha > 0 \tag{4.15}$$

provided [a condition on $g'(y)$ that is illegible in this copy], where $g'(\cdot)$ is the first derivative of $g(\cdot)$. ■

In the case $g(y) = y$, we obtain the characterization for inverse hyperbolic sigmoids. Another interesting special case is when $g(y) = \eta(y)$, where $\eta(y)$ is an inverse hyperbolic sigmoid (since $\eta(y)$ is an injective, smooth, odd, increasing function the conditions of the theorem are satisfied). The elementary composition rules presented here allow the generation of an infinite variety of inverse hyperbolic sigmoids.^13 The next section presents some examples.

13 An interesting case is the piecewise rational sigmoid defined by Elliot (1993), $\sigma(x) = x/(1 + |x|)$. Although its inverse $\eta(y) = y/(1 - |y|)$ does not fit in an obvious way into the framework developed in the last few sections, it is easy to relax the conditions placed on $g(y)$, in Theorem 4.2, so as to include this sigmoid as well.

4.2. Examples

Any function of the form $y/(1 - y^2)^{\alpha}$, where $\alpha > 0$, is the inverse of a hyperbolic sigmoid. For example, for $\alpha = 1/2$, the function $y/\sqrt{1 - y^2}$ is the inverse of the hyperbolic sigmoid $x/\sqrt{1 + x^2}$.

Of all inverse hyperbolic sigmoids of the form
$yF(\alpha, 1/2; 3/2; y^2)$, the function tanh(.) is noteworthy: first, it corresponds to the case $\alpha = 1$; secondly, all inverse hyperbolic sigmoids with integral values of $\alpha$ may be generated from tanh(x) by a process of differentiation (Lemma 4.1); and thirdly, it is a function often encountered in neural nets. As was mentioned in the Introduction, the logistic function may be thought of as a translated and scaled version of the hyperbolic tangent.

There is a good example of the hypergeometric composition described in Theorem 4.2. Since $\tan(\beta y)$ is an odd, injective, smooth, increasing function of $y$ (for some constant $\beta > 0$), from Theorem 4.2, one may conclude that for positive $\alpha$, the function $\tan(\beta y)\,F(\alpha, 1/2; 3/2; \tan^2(\beta y))$ is the inverse of some real analytic, odd, strictly increasing sigmoid. It turns out that the inverse Gudermannian function^14 may be obtained from this function, by choosing $\alpha = 1$ as follows:

$$\begin{aligned} \mathrm{gd}^{-1}(y) &= \ln(\sec(y) + \tan(y)) \quad \text{for } -\frac{\pi}{2} < y < \frac{\pi}{2} \\ &= 2\tan(y/2)\,F(1, 1/2; 3/2; \tan^2(y/2)). \end{aligned}$$

Many such examples could be generated.^15

14 The inverse Gudermannian function finds use in relating circular and hyperbolic functions, without the use of complex functions.

15 The tables of Spanier and Oldham (1987), and Hansen (1975) in particular, contain many such functions and expansions.

5. CHARACTERIZATION: HYPERBOLIC SIGMOIDS

It is often desirable and necessary to work with sigmoids themselves, rather than their inverses. In this section, we obtain power series expansions of sigmoids.

If $x = \eta(y)$ is an inverse hyperbolic sigmoid, then $\sigma \equiv \eta^{-1}$ must have a Maclaurin series expansion of the following form:

$$\sigma(x) = \sum_{k=0}^{\infty} b_{2k+1} \frac{x^{2k+1}}{(2k+1)!}.$$

We are interested in determining the coefficients $\{b_{2k+1}\}_{k=0}^{\infty}$ associated with the inverse hyperbolic sigmoids:

$$\frac{y}{(1 - y^2)^{\alpha}} \quad \text{and} \quad yF(\alpha, 1/2; 3/2; y^2).$$

5.1. Hyperbolic Sigmoids of the First Kind

When an inverse hyperbolic sigmoid is of the form $y/(1 - y^2)^{\alpha}$, a remarkably explicit form for the coefficients $\{b_{2k+1}\}_{k=0}^{\infty}$ may be given:

THEOREM 5.1 (hyperbolic sigmoids-I). If the inverse sigmoid is given by $y/(1 - y^2)^{\alpha}$, $\alpha > 0$, then in some neighborhood of the origin, we have the valid expansion

$$\sigma(x) = \sum_{k=0}^{\infty} b_{2k+1} \frac{x^{2k+1}}{(2k+1)!}$$

where,

$$b_{2k+1} = (-1)^k (2k)! \binom{(2k+1)\alpha}{k}. \tag{5.1}$$

Proof. See the Appendix. ■

5.2. Hyperbolic Sigmoids of the Second Kind

When an inverse hyperbolic sigmoid is of the form $x = yF(\alpha, 1/2; 3/2; y^2)$, the problem is much harder. The Lagrange inversion formula leads to an intractable expression. Kamber's formulae, as presented by Goodman (1983), can be used to give explicit expressions for the coefficients. Unfortunately, the resulting expressions involve determinants, and are of little computational value. The method of repeated differentiation is more successful. The starting point for this line of attack is the observation that if $x = \eta(y)$ is an inverse hyperbolic sigmoid, then:

$$\frac{dx}{dy} = \frac{d}{dy}\eta(y) = \eta'(y) = \frac{1}{(1 - y^2)^{\alpha}}. \tag{5.2}$$

From Proposition 3.2, we see that for $y = \sigma(x)$,

$$\frac{dy}{dx} = \frac{d}{dx}\sigma(x) = \frac{1}{\eta'(y)} = (1 - y^2)^{\alpha}. \tag{5.3}$$

By virtue of eqn (5.3), we can compute the higher derivatives of $\sigma(\cdot)$ and hence compute

$$b_{2k+1} = \left.\frac{d^{2k+1}\sigma(x)}{dx^{2k+1}}\right|_{x=0}.$$

Note that $dy/dx$ is expressed in terms of $y$; this necessitates the use of the chain rule. For example, to calculate the second derivative:

$$\frac{d^2 y}{dx^2} = \frac{d}{dx}\left(\frac{dy}{dx}\right) = \frac{dy}{dx}\,\frac{d}{dy}\left(\frac{dy}{dx}\right) = (1 - y^2)^{\alpha}\left(\frac{d}{dy}(1 - y^2)^{\alpha}\right). \tag{5.4}$$
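The repeated chain-rule differentiation of eqns (5.3) and (5.4) can be carried out mechanically when $(1 - y^2)^{\alpha}$ is a polynomial. The sketch below is illustrative and not from the paper: it assumes α is a positive integer, stores each derivative $D^n y$ as a list of polynomial coefficients in y, and reads off $b_n$ as the constant term (since y = 0 at x = 0). The helper names `poly_mul`, `poly_diff` and `maclaurin_numerators` are hypothetical.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_diff(p):
    """Differentiate a polynomial given as a coefficient list."""
    return [i * c for i, c in enumerate(p)][1:] or [0]


def maclaurin_numerators(alpha: int, count: int):
    """Return [b_1, ..., b_count], where b_n is the n-th derivative of sigma at 0.

    Assumes alpha is a positive integer, so dy/dx = (1 - y^2)^alpha is a polynomial.
    """
    base = [1]
    for _ in range(alpha):
        base = poly_mul(base, [1, 0, -1])  # multiply by (1 - y^2)
    current = base  # D^1 y = (1 - y^2)^alpha, eqn (5.3)
    values = []
    for _ in range(count):
        values.append(current[0])  # evaluate at y = 0
        # Chain rule, as in eqn (5.4): D^{n+1} y = (dy/dx) * d/dy (D^n y)
        current = poly_mul(base, poly_diff(current))
    return values


print(maclaurin_numerators(1, 7))  # alpha = 1 is tanh: [1, 0, -2, 0, 16, 0, -272]
```

For α = 1 the nonzero entries are the derivatives of tanh at the origin, which is a quick sanity check on the procedure. Non-integer α needs the symbolic bookkeeping that the theorems below provide.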
The following theorem presents an efficient way to implement this procedure.

THEOREM 5.2 (hyperbolic sigmoids-IIA). Let the inverse hyperbolic sigmoid be $\eta_\alpha = yF(\alpha, 1/2; 3/2; y^2)$, and $\sigma \equiv \eta_\alpha^{-1}$. Let $D \equiv d/dx$. Then,

$$D^n(y) = D^n(\sigma(x)) = G_{n-1}(y)(1 - y^2)^{n\alpha} \tag{5.5}$$

where $G_n : (-1, 1) \to \mathbb{R}$ is a function satisfying the recursion

$$G_0(y) = 1, \qquad G_n(y) = \frac{d}{dy}G_{n-1}(y) - \frac{2yn\alpha}{1 - y^2}G_{n-1}(y), \quad n \ge 1. \tag{5.6}$$

In particular, $b_{2k} = 0$, and $b_{2k+1} = D^{2k+1}(\sigma(x))|_{x=0} = G_{2k}(0)$.

Proof. Proved by induction on $n$. ■

While the procedure implicit in Theorem 5.2 is efficient, it does involve the computation of the derivative of $G_n(y)$. Equation (5.6) is a partial difference equation with variable coefficients. Therefore, there is little hope of solving it in any generality and obtaining a closed form expression. Even more sophisticated methods, such as Truesdell's generating function technique and Weisner's group theoretic approach (McBride, 1970), do not give any special insight into the nature of the polynomials $G_n(y)$.^16 The next theorem offers a somewhat different approach to the method of repeated derivatives.

THEOREM 5.3 (hyperbolic sigmoids-IIB). Let

$$\sigma(x) = \sum_{k=0}^{\infty} b_k \frac{x^k}{k!}$$

be an expansion for a hyperbolic sigmoid, whose inverse is of the form $yF(\alpha, 1/2; 3/2; y^2)$, valid in some neighborhood of the origin. Then $b_{2k} = 0$, and $b_{2k+1} = C(2k + 1, k)$, where the sequence $C(n, k)$ satisfies:

$$\begin{aligned} C(1, 0) &= 1 \\ C(n, k) &= 0 \quad \forall k \ge n,\ k < 0 \\ C(n + 1, k) &= (2k - n + 1)\,C(n, k) - 2(n\alpha - k + 1)\,C(n, k - 1) \end{aligned} \tag{5.7}$$

where $\alpha$ is as above and $n$ and $k$ are natural numbers. Furthermore, $D^n(\sigma(x))$, the $n$th derivative of $\sigma$, is given by:

$$D^n(y) = D^n(\sigma(x)) = \sum_{k=0}^{n-1} C(n, k)\, y^{2k - n + 1}(1 - y^2)^{n\alpha - k}, \quad \text{for } n \ge 1. \tag{5.8}$$

Proof. See Appendix. ■

The recursive system described by eqn (5.7) does not involve any differentiation. The desired value $b_{2k+1}$ may be obtained by computing the value of $C(2k + 1, k)$. Equation (5.8) gives information about the shapes of the derivatives of the hyperbolic sigmoid. From eqn (5.7),

$$b_1 = 1, \qquad b_3 = -2\alpha, \tag{5.9}$$

$$b_5 = 4\alpha(7\alpha - 3), \qquad b_7 = -8\alpha(127\alpha^2 - 123\alpha + 30). \tag{5.10}$$

Theorem 5.3 may be viewed as a generalization of the work of Minai and Williams (1993) on the derivatives of the logistic sigmoid. They obtained relations similar to eqn (5.7).^17 In general, eqn (5.7) is a partial difference equation with variable coefficients, and the system does not appear to be related to any well known sets of numbers. Obtaining a closed form solution for the numbers $C(n, k)$ appears to be intractable.

6. APPLICATIONS

In this section, we present two applications. The first shows that if the neural network transfer function is a hyperbolic sigmoid, then the dynamical equations describing the CGH neural network (Hopfield & Tank, 1986; Grossberg, 1988) can be transformed into a set of non-homogeneous associated Legendre differential equations. Some conclusions regarding the behavior of the CGH model can be drawn, as the outputs saturate (i.e., output $\to \pm 1$).

The second application derives an interesting connection between Fourier transforms and one-hidden layer feedforward nets (one-HL nets). Subject
16 Equation (5.6) is a differential-difference


system of the 17Interestingly,in theeaseof the Io@stic siwoid, th~
ascending type; it ean then be shown that the polynomials relations happenedto be the recursionscorrespondingto the
{Gn(y)}~l satisfy Truesdell’sF-equation. Unfortunately,the Euleriannumbers (Grahamet al., 1989). In other words, the
resultinggeneratingfunction for Gn(y) is too complicatedfor coeftkientsarisingin the computationof higherorderderivatives
anypracticaluse. of the logisticsigmoidturnout to be the Euleriannumbers.

to an additional minor constraint, we show that the use of one-HL nets with simple sigmoidal transfer functions for function approximation is tantamount to assuming that the function being approximated is the product of two functions; one the derivative of a bounded non-negative function, and the other satisfying some linear nth order differential equation, where n is the number of nodes in the hidden layer.

6.1. Continuous CGH Nets and Legendre Differential Equations

The continuous CGH network with N neurons is described by the following dynamics:18

$$\frac{du_i}{dt} + g_i u_i = \sum_j T_{ij} v_j + I_i \equiv E_i = -\frac{\partial E}{\partial v_i}, \quad \forall\, i \in \{1, \ldots, N\} \tag{6.1}$$

where $u_i$ and $v_i$ are the net input and net output of the ith neuron, respectively, $I_i$ is a constant external excitation, and E is the "energy" of the network, given by:

$$E = -\frac{1}{2}\sum_{i,j} T_{ij} v_i v_j - \sum_i I_i v_i. \tag{6.2}$$

18 The synaptic weights $T_{ij}$ are assumed to be symmetric. It makes the derivations simpler with no loss in generality.

Assume $-1 < v_i < 1$. Let $v_i = \sigma(u_i)$, where $\sigma(\cdot)$ is a hyperbolic sigmoid. Let $\eta \equiv \sigma^{-1}$, or, $u_i = \eta(v_i)$. There are two cases to consider.

Case I. Let $\eta(v_i) = v_i F(a, 1/2; 3/2; v_i^2)$. In this case

$$\frac{d\eta}{dv_i} = \frac{1}{(1-v_i^2)^a} \tag{6.3}$$

and substituting eqn (6.3) into eqn (6.1), we get:

$$\frac{1}{(1-v_i^2)^a}\,\frac{dv_i}{dt} + g_i\,\eta(v_i) = E_i. \tag{6.4}$$

The following sequence of operations is applied to eqn (6.4):

1. Substitute $y_i = dv_i/dt$, and differentiate with respect to $v_i$.
2. Multiply throughout by $(1-v_i^2)^{a+1}$.
3. Differentiate once more with respect to $v_i$.

Equation (6.4) is then transformed into:

$$(1-v_i^2)\frac{d^2 y_i}{dv_i^2} - 2(1-a)\,v_i\frac{dy_i}{dv_i} + 2a\,y_i = Q_i \tag{6.5}$$

where

$$Q_i = \frac{d}{dv_i}\left[(1-v_i^2)^{a+1}\frac{dE_i}{dv_i}\right] + 2 g_i v_i.$$

Finally, substitute $y_i = z_i (1-v_i^2)^{a/2}$ in eqn (6.5), yielding

$$(1-v_i^2)\frac{d^2 z_i}{dv_i^2} - 2v_i\frac{dz_i}{dv_i} + \left[a(a+1) - \frac{a^2}{1-v_i^2}\right] z_i = R_i \tag{6.6}$$

where $R_i = (1-v_i^2)^{-a/2} Q_i$. The associated Legendre differential equation is of the form (Spanier & Oldham, 1987),

$$(1-x^2)\frac{d^2 f}{dx^2} - 2x\frac{df}{dx} + \left[\nu(\nu+1) - \frac{\mu^2}{1-x^2}\right] f = 0. \tag{6.7}$$

It is clear that the left hand side in eqn (6.6) is the associated Legendre differential equation with parameters $\mu = -a$ [eqn (6.5) requires us to choose $\mu = -a$, rather than $+a$], and $\nu = a$. In other words, the continuous CGH model with a neural transfer function whose inverse is $\eta(v_i) = v_i F(a, 1/2; 3/2; v_i^2)$ is reducible to the non-homogeneous associated Legendre differential equation with parameters $\mu = -a$ and $\nu = a$.

Case II. Let $\eta(v_i) = v_i F(a, -; -; v_i^2)$. An analogous approach leads to the very same conclusion as in Case I, i.e., it is possible to transform the continuous CGH equation with the above transfer function to a non-homogeneous associated Legendre equation. However, the right hand side of the transformed equation is complicated and we do not consider this case further.

We emphasize that the link between the continuous CGH equation and the Legendre differential equation is not accidental, given that it can be established for all hyperbolic sigmoidal transfer functions. For $u_i = \tanh^{-1}(v_i)$, $a = 1$, and the above equations have a rather elementary form.

An immediate application of the above transformation is in studying the saturation behavior of the CGH neural net. By saturation, we mean that the

outputs of the neurons tend to $\pm 1$. As discussed by Hopfield and Tank (1986), this usually occurs when the network is heading towards a critical point (local or global). Saturation implies that as a node output $v_i \to \pm 1$, the quantity $R_i \to 0$. In other words, we may study the saturation behavior of the continuous CGH model by considering the homogeneous version of eqn (6.6), viz.,

$$(1-v_i^2)\frac{d^2 z_i}{dv_i^2} - 2v_i\frac{dz_i}{dv_i} + \left[a(a+1) - \frac{a^2}{1-v_i^2}\right] z_i = 0. \tag{6.8}$$

From the theory of associated Legendre equations, it is seen that eqn (6.8) has a solution in terms of the associated Legendre functions, $P_\nu^{(\mu)}(x)$ and $Q_\nu^{(\mu)}(x)$ (Erdélyi et al., 1953). Here, $\mu = -a$, $\nu = a$, and $x = v_i$, and we have:

$$\begin{aligned}
z_i &= c_1 P_a^{(-a)}(v_i) + c_2 Q_a^{(-a)}(v_i) \\
\frac{y_i}{(1-v_i^2)^{a/2}} &= c_1 P_a^{(-a)}(v_i) + c_2 Q_a^{(-a)}(v_i) \\
\frac{1}{(1-v_i^2)^{a/2}}\,\frac{dv_i}{dt} &= c_1 P_a^{(-a)}(v_i) + c_2 Q_a^{(-a)}(v_i).
\end{aligned} \tag{6.9}$$

Neglecting the effect of $g_i$, as is common practice, we obtain from eqn (6.4):

$$E_i = \frac{1}{(1-v_i^2)^a}\,\frac{dv_i}{dt}. \tag{6.10}$$

Equation (6.9), in conjunction with eqn (6.10), implies:

$$E_i = -\frac{\partial E}{\partial v_i} = (1-v_i^2)^{-a/2}\left[c_1 P_a^{(-a)}(v_i) + c_2 Q_a^{(-a)}(v_i)\right]. \tag{6.11}$$

Equation (6.11) in conjunction with eqn (6.2) implies that the overall energy at saturation may be written as follows:

$$E = \cdots + c_2 Q_a^{(-a)}(v_i)\bigr]. \tag{6.12}$$

Here, $E_i$ does not depend on $E_j$ for $i \ne j$. Thus, to a crude first approximation, the CGH network "dissociates" at saturation into independent units, and the quadratic energy function may be written as a linear sum of non-linear univalent functions, given by eqns (6.11) and (6.12).

We wish to stress the possibilities revealed by dealing with the CGH equation in a general context. For example, in eqn (6.6),

$$(1-v_i^2)\frac{d^2 z_i}{dv_i^2} - 2v_i\frac{dz_i}{dv_i} + \left[a(a+1) - \frac{a^2}{1-v_i^2}\right] z_i = R_i \tag{6.13}$$

where $R_i = (1-v_i^2)^{-a/2} Q_i$, and

$$Q_i = \frac{d}{dv_i}\left[(1-v_i^2)^{a+1}\frac{dE_i}{dv_i}\right] + 2 g_i v_i.$$

Consider the case when $Q_i = K$ is a constant. Then the above equation reduces to the non-homogeneous equation,

$$(1-v_i^2)\frac{d^2 z_i}{dv_i^2} - 2v_i\frac{dz_i}{dv_i} + \left[a(a+1) - \frac{a^2}{1-v_i^2}\right] z_i = K(1-v_i^2)^{-a/2} \tag{6.14}$$

which may be solved using the special functions defined and described by Babister (1967). Equation (6.14) first arose in the context of solving Poisson's equation in spherical polar coordinates.

6.2. Fourier Transforms and Feedforward Nets

There have been many different attempts to describe the behavior of feedforward networks, such as the group theoretic analysis of the perceptron proposed by Minsky and Papert (1988), the space partition (via hyperplanes) interpretation discussed by Lippmann (1987) (and many others), the metric synthesis viewpoint introduced by Pao and Sobajic (1987), and the statistical interpretation emphasized by White (1989). Gallant and White (1988) showed that a one-HL feedforward net with "monotone cosine" squashing at the hidden layer, and a summing output node, embeds as a special case a "Fourier network" that yields a Fourier series approximation to a given function as its output. We present a related construction in this section; it is shown that a one hidden layer (one-HL) net with simple sigmoidal convex transfer functions (at the hidden layer), and a single summing output, can be thought of as performing trigonometric approximation (regression). Specifically, the inverse Fourier transform of the function (to be learned) is approximated as a linear combination of weighted sinusoids.

The result is a consequence of a connection between a class of simple sigmoids and Fourier transforms that facilitates a novel interpretation of one-HL feedforward nets. The following classic theorem due to Polya (1949) is a starting point.

PROPOSITION 6.1 (Polya, 1949). A real valued and

continuous function f(x) defined for all real x and satisfying the following properties:

● $f(0) = 1$,
● $f(x) = f(-x)$,
● $f(x)$ is convex for $x > 0$,
● $\lim_{x \to \infty} f(x) = 0$,

is always a characteristic function (Fourier transform) of an absolutely continuous distribution function, i.e.,

$$f(x) = \mathcal{F}(h(t); x) = \int_{-\infty}^{\infty} \exp(ixt)\,h(t)\,dt.$$

Furthermore, the density h(t) is an even function, and is continuous everywhere except possibly at t = 0.

Recall that an absolutely continuous function F(x) is a distribution function if it can be written in the form $F(x) = \int_{-\infty}^{x} h(t)\,dt$, where h(t) is called the density of F(x).

The following result connects simple sigmoids with Fourier transforms.

THEOREM 6.1. Let $\sigma(x)$ be a simple sigmoid. If $\sigma(x)/x$ is a convex function, then it is the Fourier transform of an absolutely continuous distribution function, i.e.,

$$\frac{\sigma(x)}{x} = \mathcal{F}(h(t); x) = \int_{-\infty}^{\infty} \exp(ixt)\,h(t)\,dt. \tag{6.15}$$

Proof. It suffices to prove that $\sigma(x)/x$ satisfies the conditions of Polya's theorem; $\sigma(x)$ being simple is bounded, and hence $\lim_{x \to \infty} \sigma(x)/x = 0$. Also, $\sigma(-x)/(-x) = -\sigma(x)/(-x) = \sigma(x)/x$. Since $\sigma(x)$ is completely monotone in (0, 1), it follows that $\lim_{x \to 0} \sigma(x)/x = K$ (some positive constant). There is no loss of generality in assuming K = 1, since one can always scale $\sigma(\cdot)$ appropriately. Finally, the convexity of $\sigma(x)/x$ ensures that all of the conditions of Polya's theorem are satisfied and the conclusion follows. ■

REMARK 6.1. Polya's theorem is a sufficient but not necessary condition for f(x) to be the Fourier transform of some function h(t). Hence, Theorem 6.1 is also only a sufficient condition for a simple sigmoid to be a Fourier transform. The case in point is the non-convex function tanh(x), which is still the Fourier transform of a well defined function [this and other examples may be found in Oberhettinger (1973)]. In other words, the conclusions we draw in the next few paragraphs may be valid for some non-convex simple sigmoids as well.

REMARK 6.2. In eqn (6.15), h(t) is an even function. Hence the transform is a Fourier cosine transform. The sine component vanishes during the course of an integration.

Consider a one-HL net, with k input nodes, n hidden layer nodes with convex simple sigmoidal transfer functions $\sigma(\cdot)$, and one summing output node. Let $w_{ij}$ denote the weight of the connection between the ith node in the hidden layer and the jth node in the input layer; similarly, let $c_i$ denote the weight of the connection between the ith hidden layer node and the output node. Then the output O may be expressed as,

$$O = \sum_{i=1}^{n} c_i v_i = \sum_{i=1}^{n} c_i\,\sigma(u_i) = \sum_{i=1}^{n} c_i\,\sigma\!\left(\sum_{j=1}^{k} w_{ij} x_j + \theta_i\right) \tag{6.17}$$

where $u_i$ and $\theta_i$ are the input and bias for the ith hidden node, respectively. Since $\sigma(\cdot)$ is a convex simple sigmoid, using Theorem 6.1, eqn (6.17) may be rewritten as,

$$O(t) = \sum_{i=1}^{n} c_i v_i = \sum_{i=1}^{n} c_i u_i\,\mathcal{F}(h(t); u_i) \tag{6.18}$$

where $\mathcal{F}(h(t); u_i)$ denotes that $\mathcal{F}(h(t); x)$ is evaluated at the point

$$u_i = \sum_{j=1}^{k} w_{ij} x_j + \theta_i.$$

Using the well known property of Fourier transforms (Davies, 1978), that if $f(x) = \mathcal{F}(h(t); x)$, then $x f(x) = i\,\mathcal{F}(h'(t); x) = \mathcal{F}(-ih'(t); x)$, where $h'(\cdot)$ is the first derivative of $h(\cdot)$, and $i = \sqrt{-1}$, eqn (6.18) may be rewritten19 as,

$$O(t) = \sum_{i=1}^{n} c_i\,\mathcal{F}(-ih'(t); u_i). \tag{6.19}$$

19 In eqn (6.19), the i term in $\mathcal{F}(-ih'(t); u)$ converts the Fourier cosine transform representation of $\sigma(x)/x$ (see Remark 6.2) into a Fourier sine transform.

20 For convenience, we restate a simple version of the formula: If the Laplace transform of a function h(t) is given by f(z), i.e., $f(z) = \mathcal{L}(h(t); z) = \int_0^{\infty} h(t)\exp(-zt)\,dt$, and f(z) has only first order poles at $z_1, z_2, \ldots, z_n$, then $h(t) = \sum_{k=1}^{n} F_k(z_k)$, where $F_k(z_k)$ is the residue or pole-coefficient of $f(z)\exp(zt)$ (Bohn, 1963).

Equation (6.19) can be recognized as being analogous to the Heaviside expansion formula in Laplace transform theory,20 which allows the reconstruction

of a time varying function using information relating to its spectral components. Equation (6.19) suggests that one-HL nets with convex simple sigmoidal transfer functions can be thought of as implementing a spectral reconstruction of the output using the weighted inputs $u_i$ to evaluate the associated pole coefficients (residues) of the Heaviside expansion.

In particular, it can be demonstrated that the results of Gallant and White (1988) are implied by eqn (6.19). In what follows, we shall use $\mathcal{F}_s(h; x)$ and $\mathcal{F}_c(h; x)$ to indicate the Fourier sine and cosine transforms of h(t).

Since h(t), the density of the continuous distribution function corresponding to $\sigma(x)/x$, is an even function (from Polya's theorem), it follows that $\sigma(x) = x\,\mathcal{F}(h(t); x) = x\,\mathcal{F}_c(h(t); x)$. Using the property of Fourier transforms that $x\,\mathcal{F}_c(g(t); x) = \mathcal{F}_s(-g'(t); x)$ (Davies, 1978), we may conclude that $\sigma(x) = \mathcal{F}_s(-h'(t); x)$.

Let $u_i = u + r_i$, where the $r_i$ are appropriate functions of the $x_i$ (since the $u_i$ are functions of the inputs $x_i$):

$$O(u) = \sum_{i=1}^{n} c_i\,\mathcal{F}_s(-h'(t); u + r_i). \tag{6.20}$$

From the frequency shifting property of Fourier transforms (Davies, 1978),

$$\mathcal{F}_s(f(t); x + a) = \mathcal{F}_s(f(t)\cos(at); x) + \mathcal{F}_c(f(t)\sin(at); x), \tag{6.21}$$

it follows that

$$\begin{aligned}
O(u) &= \sum_{i=1}^{n} c_i\,\mathcal{F}_s(-h'(t); u + r_i) \\
&= \sum_{i=1}^{n} c_i\left\{\mathcal{F}_s(-h'(t)\cos(r_i t); u) + \mathcal{F}_c(-h'(t)\sin(r_i t); u)\right\} \\
&= \mathcal{F}_s\!\left\{\sum_{i=1}^{n} c_i(-h'(t)\cos(r_i t)); u\right\} + \mathcal{F}_c\!\left\{\sum_{i=1}^{n} c_i(-h'(t)\sin(r_i t)); u\right\} \\
\Rightarrow\ \mathcal{F}^{-1}(O(u)) &= -h'(t)\sum_{i=1}^{n} c_i \sin\big((r_i + u)t\big).
\end{aligned} \tag{6.22}$$

But we may choose u arbitrarily: let u = 0, implying

$$r_i = u_i = \sum_{j=1}^{k} w_{ij} x_j + \theta_i,$$

and eqn (6.22) becomes,

$$\mathcal{F}^{-1}(O(u)) = -h'(t)\sum_{i=1}^{n} c_i \sin\!\left(\Big(\sum_{j=1}^{k} w_{ij} x_j + \theta_i\Big)t\right). \tag{6.23}$$

Equation (6.23) may be used as a starting point for an analysis identical to that adopted by Gallant and White (1988) in their study of one-HL nets with "cosine squashing" functions. It is then straightforward to show that the weights may be chosen (hardwired) so that the one-HL net embeds as a special case a Fourier network, which yields a Fourier series approximation to a given function as its output. In this sense, the results of this section extend those of Gallant and White.

More generally, one can draw similar conclusions by considering sigmoids that are the Laplace transforms of some function; for example tanh(x)/x is the Laplace transform of $\mathrm{sgn}(\sin(\pi t/2))$, where sgn(x) is +1, 0 or -1 depending on whether x is greater than, equal to or less than zero (Spanier & Oldham, 1987). A similar analysis would lead to a connection with real exponential approximation (rather than trigonometric approximation). Efficient algorithms, such as Prony's, exist for certain restricted forms of the exponential approximation problem (Su, 1971).

Also related are the considerations of Marks and Arabshahi (1994) on the multidimensional Fourier transforms of the output of a one-HL feedforward net; they showed that the transform of the output is the sum of certain scaled Dirac delta functions. Here, we view the sigmoid itself as the Fourier transform of some function; the main advantage of our interpretation is the algorithms it suggests for training one-HL nets of the type considered in this section.

Another potential use of eqn (6.23) is in exploring the "goodness" of the approximation obtained by a one-HL net with simple sigmoidal transfer functions. In the last 200 years, much has been learned about the errors associated with exponential and trigonometric approximation, and ways to deal with them; however, consideration of these issues is beyond the scope of this paper.

7. CONCLUSION

We have analyzed the behavior of important classes of sigmoid functions, called simple and hyperbolic sigmoids, instances of which are extensively used as node transfer functions in artificial neural network implementations. We have obtained a complete characterization for the inverses of hyperbolic sigmoids using Euler's incomplete beta functions, and have described composition rules that illustrate how such functions may be synthesized from others. We have obtained power series expansions of hyperbolic sigmoids, and suggested procedures for obtaining coefficients of the expansions. For a large class of node functions, we have shown that the continuous CGH net equations can be reduced to Legendre differential equations. The fact that the connection between Legendre differential equations and the CGH equation holds for such a wide variety of sigmoids, and is not just an accidental consequence of a particular sigmoid, strongly indicates that further exploration of this connection is warranted. Finally, we have shown that a large class of feedforward networks represent the output function as a Fourier series sine transform evaluated at the hidden layer node inputs, thus extending an earlier result due to Gallant and White.

REFERENCES

Albertini, F., Sontag, E., & Maillot, V. (1993). Uniqueness of weights for neural networks. In R. Mammone (Ed.), Artificial neural networks with applications in speech and vision. London: Chapman and Hall.
Babister, A. W. (1967). Transcendental functions satisfying nonhomogeneous linear differential equations. New York: Macmillan.
Bohn, E. V. (1963). The transform analysis of linear systems. Reading, MA: Addison-Wesley.
Carlitz, L. (1963). The inverse of the error function. Pacific J. Math. 13(2), 459-470.
Cybenko, G. (1989). Approximation by superposition of a sigmoidal function. Math. Control, Signals and Systems, 2, 303-314.
Davies, B. (1978). Integral transforms and their applications. New York: Springer Verlag.
Diaconis, P., & Shahshahani, M. (1984). On nonlinear functions of linear combinations. SIAM J. Sci. Stat. Comput. 5, 175-191.
Elliot, L. D. (1993). A better activation function for artificial neural networks. Technical Report TR 93-8, Institute for Systems Research, University of Maryland, College Park, MD.
Erdélyi, A., Magnus, W., Oberhettinger, F., & Tricomi, F. G. (1953). Higher transcendental functions (Vol. 1). New York: McGraw-Hill.
Feller, W. (1965). An introduction to probability theory and its applications (Vol. II). New York: John Wiley.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3), 183-192.
Gallant, A. R., & White, H. (1988). There exists a neural network that does not make avoidable mistakes. In IEEE International Conf. on Neural Networks (Vol. 1, pp. 657-664). San Diego, CA.
Goodman, A. W. (1983). Univalent functions (Vol. I). New York: Mariner Publishing Co.
Graham, R. L., Knuth, D. E., & Patashnik, O. (1989). Concrete mathematics. Reading, MA: Addison-Wesley.
Grossberg, S. (1973). Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies Appl. Math., 52, 217-257.
Grossberg, S. (1988). Nonlinear neural networks: principles, mechanisms and architectures. Neural Networks, 1, 17-61.
Hansen, E. R. (1975). A table of series and products. Englewood Cliffs, NJ: Prentice-Hall.
Harrington, P. (1993). Sigmoid transfer functions in backpropagation neural networks. Analyt. Chem. 65(15), 2167-2168.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). An introduction to the theory of neural computation (Vol. 1), Santa Fe Institute studies in the sciences of complexity. Redwood City, CA: Addison-Wesley.
Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625-633.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multi-layer feedforward networks are universal approximators. Neural Networks, 2, 359-366.
Krantz, S. G., & Parks, H. R. (1992). A primer of real analytic functions. Berlin: Birkhäuser Verlag.
Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, 4, 4-22.
Macintyre, A., & Sontag, E. (1993). Finiteness results for sigmoidal "neural" networks. In Proc. 25th Annual Symp. Theory Computing, San Diego. New York: Association for Computing Machinery (ACM).
Marks, R. J., & Arabshahi, P. (1994). Fourier analysis and filtering of a single hidden layer perceptron. In Int. Conf. Artificial Neural Networks (IEEE/ENNS), Sorrento, Italy.
McBride, R. E. (1970). Obtaining generating functions (Vol. 21). Berlin: Springer Verlag.
Minai, A., & Williams, R. (1993). On the derivatives of the sigmoid. Neural Networks, 6(6), 845-853.
Minsky, M., & Papert, S. A. (1988). Perceptrons, an introduction to computational geometry. Cambridge, MA: The MIT Press.
Oberhettinger, F. (1973). Fourier transforms of distributions and their inverses. New York: Academic Press.
Pao, Y. H., & Sobajic, D. J. (1987). Metric synthesis and concept discovery with connectionist networks. In Proc. IEEE Systems, Man and Cybernetics Conf., Alexandria, VA.
Polya, G. (1949). Remarks on the characteristic function. In Proc. 4th Berkeley Symp. Math. Statist. & Probab. (pp. 115-123).
Polya, G. (1974). On the zeroes of the derivatives of a function and its analytic character. In R. P. Boas (Ed.), George Polya: collected works (pp. 178-189). Cambridge, MA: MIT Press.
Rainville, E. D. (1964). Intermediate differential equations. New York: Macmillan.
Spanier, J., & Oldham, K. (1987). An atlas of functions. Washington, DC: Hemisphere.
Su, K. L. (1971). Time-domain synthesis of linear networks. Englewood Cliffs, NJ: Prentice-Hall.
Sussmann, H. J. (1992). Uniqueness of weights for minimal feedforward nets with a given input-output map. Neural Networks, 5(4), 589-593.
White, H. (1989). Learning in artificial neural networks: a statistical perspective. Neural Computation, 1, 425-464.
Whittaker, E. T., & Watson, G. N. (1927). Modern analysis (4th ed.). Cambridge: Cambridge University Press.
Widder, D. V. (1946). The Laplace transform. Princeton, NJ: Princeton University Press.
Wilf, H. S. (1989). Generatingfunctionology. New York: Academic Press.

APPENDIX

THEOREM 4.1. Let $y = \sigma(x)$ be a hyperbolic sigmoid, and let $\eta : (-1, 1) \to \Re$ be its inverse. Then, either

$$\eta(y) = yF(a, 1/2; 3/2; y^2), \quad a \ge 1 \tag{A.1}$$

or

$$\eta(y) = yF(a, -; -; y^2) = \frac{y}{(1-y^2)^a}, \quad a > 0 \tag{A.2}$$

where, by $F(a, -; -; y^2)$, we mean $F(a, \beta; \beta; y^2)$ $(\beta \in \Re)$.

Proof. Since $\sigma(\cdot)$ is hyperbolic, by definition $\eta(\cdot)/x$ is described by a
theory of neural computation (Vol. 1), Santa Fe Institute studies Proof Since.u(.) is hyperbolic,by definitionrI(.)/x is &scribedbya

GH series with at most three parameters. There are four possibilities:

$$\eta(x) = x\,{}_3F_0(\alpha_1, \alpha_2, \alpha_3; \ ; x^2) \quad \leftarrow \text{Case 1} \tag{A.3}$$
$$\eta(x) = x\,{}_2F_1(\alpha_1, \alpha_2; \gamma_1; x^2) \quad \leftarrow \text{Case 2} \tag{A.4}$$
$$\eta(x) = x\,{}_1F_2(\alpha_1; \gamma_1, \gamma_2; x^2) \quad \leftarrow \text{Case 3} \tag{A.5}$$
$$\eta(x) = x\,{}_0F_3(\ ; \gamma_1, \gamma_2, \gamma_3; x^2) \quad \leftarrow \text{Case 4} \tag{A.6}$$

The following proposition shows why there is no need to consider cases 1, 3, and 4 as possible forms for hyperbolic sigmoids.

PROPOSITION A (Spanier & Oldham, 1987). Let ${}_pF_q(\alpha_1, \ldots, \alpha_p; \gamma_1, \ldots, \gamma_q; z)$ be a GH series in z with p + q parameters. If none of the numerator parameters are non-positive integers, i.e., $\forall i : \alpha_i \ne 0, -1, -2, \ldots$, then the convergence behavior of ${}_pF_q$ is as follows:

$$\begin{aligned}
p < q + 1 &: {}_pF_q \text{ necessarily converges for all finite } z. \\
p = q + 1 &: \text{convergence of } {}_pF_q \text{ is limited to } -1 < z < 1, \text{ and depends on the parameters } \alpha_i \text{ and } \gamma_i. \\
p > q + 1 &: {}_pF_q \text{ necessarily diverges for all non-zero } z.
\end{aligned} \tag{A.7}$$

Since $\lim_{z \to \pm 1} \eta(z) \to \pm\infty$, but is finite in the interval (-1, 1), it follows that if a GH series is to represent $\eta(\cdot)$, then it has to converge in the interval (-1, 1), but diverge at $z = \pm 1$.

This rules out non-positive integral values for the numerator parameters; otherwise, the series would converge for all $z \in \Re$ (and not just in the interval (-1, 1)). Yet, even if the numerator parameters do not have non-positive integral values, in three of the above cases, the number of numerator parameters relative to denominator parameters is such that each series diverges for all non-zero z (case 1), or converges for all finite z (cases 3 and 4). That leaves just one case to consider, viz., the classical series, ${}_2F_1(\alpha_1, \alpha_2; \gamma_1; z) = F(a, \beta; \gamma; z)$, i.e., we may take $\eta(x) = xF(a, \beta; \gamma; x^2)$.

Since $\eta(\cdot)$ has to be a GH series with at most three parameters, some of the parameters are allowed to be "missing". In other words, case 2 spawns the following possibilities:

$$\eta(x) = xF(a, \beta; \gamma; x^2) \quad \leftarrow \text{Case 2(a)} \tag{A.8}$$
$$\eta(x) = xF(a, \beta; -; x^2) \quad \leftarrow \text{Case 2(b)} \tag{A.9}$$
$$\eta(x) = xF(a, -; \gamma; x^2) \quad \leftarrow \text{Case 2(c)} \tag{A.10}$$
$$\eta(x) = xF(a, -; -; x^2) \quad \leftarrow \text{Case 2(d)} \tag{A.11}$$
$$\eta(x) = xF(-, -; \gamma; x^2) \quad \leftarrow \text{Case 2(e)} \tag{A.12}$$
$$\eta(x) = xF(-, -; -; x^2) \quad \leftarrow \text{Case 2(f)} \tag{A.13}$$

Proposition A can be used once again to weed out all but two of the above set, viz., cases 2(a) and 2(d). The rest lead to inappropriate divergence or convergence behavior in the interval. The following property of GH functions will be needed.

PROPOSITION B (Spanier & Oldham, 1987). If $y = F(a, \beta; \gamma; x)$, then

$$\frac{dy}{dx} = \frac{a\beta}{\gamma}\,F(a+1, \beta+1; \gamma+1; x). \ ■$$

Case 2(a): three parameter GH series:

$$\eta(x) = xF(a, \beta; \gamma; x^2) = x\sum_{k \ge 0} \frac{(a)_k(\beta)_k}{(\gamma)_k}\,\frac{x^{2k}}{k!}. \tag{A.14}$$

Let $A : (-1, 1) \to \Re^+$, with

$$A(x) \equiv \frac{d\eta(x)}{dx}.$$

Then

$$\begin{aligned}
A(x) &= \frac{d\eta}{dx} = \frac{d}{dx}\left(xF(a, \beta; \gamma; x^2)\right) \\
&= F(a, \beta; \gamma; x^2) + x\frac{d}{dx}F(a, \beta; \gamma; x^2) \\
&= F(a, \beta; \gamma; x^2) + 2x^2\,\frac{a\beta}{\gamma}\,F(a+1, \beta+1; \gamma+1; x^2) \\
&= \sum_{k \ge 0} \frac{(a)_k(\beta)_k}{(\gamma)_k}\,\frac{x^{2k}}{k!} + 2\sum_{k \ge 1} \frac{(a)_k(\beta)_k}{(\gamma)_k}\,\frac{x^{2k}}{(k-1)!} \\
&= \sum_{k \ge 0} \frac{(a)_k(\beta)_k}{(\gamma)_k}\,(2k+1)\,\frac{x^{2k}}{k!} \\
&= \sum_{k \ge 0} \frac{(a)_k(\beta)_k}{(\gamma)_k}\,\frac{(3/2)_k}{(1/2)_k}\,\frac{x^{2k}}{k!}.
\end{aligned} \tag{A.15}$$

From the definition of hyperbolic sigmoids, A(x) is representable by a GH function with at most three parameters; we must therefore make the identification $\beta = 1/2$ and $\gamma = 3/2$. From the symmetry properties of the GH function, we need not consider the case when $a = 1/2$, $\gamma = 3/2$. It follows that,

$$y = xF(a, 1/2; 3/2; x^2), \qquad \frac{dy}{dx} = \frac{d\eta(x)}{dx} = F(a, -; -; x^2) = \frac{1}{(1-x^2)^a}. \tag{A.16}$$

The parameter a cannot take any arbitrary real value. The behavior of $\eta(x)$ at the endpoints of its interval requires that,

$$\lim_{x \to \pm 1} \eta(x) \to \pm\infty \ \Rightarrow\ \lim_{x \to \pm 1} A(x) \to +\infty. \tag{A.17}$$

Equations (A.16) and (A.17) taken together imply that a > 0. This is a necessary but not sufficient condition. The following two propositions allow us to pin down the value of a more precisely.

PROPOSITION C (Erdélyi et al., 1953). If a and $\beta$ are different from 0, -1, ... then $F(a, \beta; \gamma; z)$ converges absolutely for |z| < 1. For |z| = 1:

$$F(a, \beta; \gamma; z) \text{ converges absolutely if } (a + \beta - \gamma) < 0 \tag{A.18}$$
$$F(a, \beta; \gamma; z) \text{ converges conditionally if } 0 \le (a + \beta - \gamma) < 1 \tag{A.19}$$
$$F(a, \beta; \gamma; z) \text{ diverges if } 1 \le (a + \beta - \gamma). \ ■ \tag{A.20}$$

PROPOSITION D (Erdélyi et al., 1953). If $(\gamma - a - \beta) > 0$ then

$$F(a, \beta; \gamma; 1) = \frac{\Gamma(\gamma)\,\Gamma(\gamma - a - \beta)}{\Gamma(\gamma - a)\,\Gamma(\gamma - \beta)}$$

where

$$\Gamma(x) = \int_0^{\infty} \exp(-t)\,t^{x-1}\,dt$$

is Euler's gamma function. ■

If a < 1, from Proposition C we see that the series converges absolutely at $z = x^2 = 1$. From Proposition D, this in turn implies that $\eta(x)/x$ will have a finite value at the endpoint of its domain interval. Therefore, $a \ge 1$. The final form for the three parameter GH representation for $\eta(x)$ is, therefore, $xF(a, 1/2; 3/2; x^2)$ where $a \ge 1$.

Case 2(d): one-parameter GH series:
In this case,

$$\eta(x) = xF(a, -; -; x^2) = x\sum_{k \ge 0} (a)_k\,\frac{x^{2k}}{k!} = \frac{x}{(1-x^2)^a}.$$

The situation is much simpler, since we have to place bounds on the value of one parameter alone. An argument almost identical to the one above allows us to conclude that for $\eta(x)/x$ to satisfy the properties of a hyperbolic sigmoid, it is both necessary and sufficient that we take a > 0. ■

THEOREM 4.2. Let $\sigma : \Re \to (-1, 1)$ be a real analytic, odd, strictly increasing sigmoid, such that its inverse $\eta : (-1, 1) \to \Re$ has a GH series expansion in some injective, odd, increasing $C^1$ function $g(\cdot)$, with at most three parameters, convergent in (-1, 1). Also let $\eta'$ have a GH series expansion in $g(\cdot)$, with at most one parameter. Then, either

$$\eta(y) = g(y)\,F(a, 1/2; 3/2; (g(y))^2) = g(y)\sum_{k \ge 0} \frac{(a)_k}{2k+1}\,\frac{(g(y))^{2k}}{k!}, \quad \text{with } a \ge 1, \tag{A.21}$$

or

$$\eta(y) = g(y)\,F(a, -; -; (g(y))^2) = \frac{g(y)}{(1-(g(y))^2)^a}, \quad \text{with } a > 0 \tag{A.22}$$

provided

$$\lim_{y \to \pm 1} \frac{g'(y)}{(1-(g(y))^2)^a} \to \infty$$

where $g'(\cdot)$ is the first derivative of $g(\cdot)$.

Proof. The proof of Theorem 4.2 is very similar to that for Theorem 4.1. Note that with $\eta(x) = g(x)F(a, 1/2; 3/2; (g(x))^2)$,

$$\eta'(x) = \frac{g'(x)}{(1-(g(x))^2)^a} \tag{A.23}$$

where $g'(x)$ is the first derivative of g(x). Since $g'(x) > 0$ for all $x \in \mathrm{Dom}(g)$, and a > 0, it follows that $\eta'(x) > 0$ for all $x \in \mathrm{Dom}(\eta)$, i.e., $\eta(x)$ is a strictly increasing function. The analyticity, continuity and oddness of $\eta(\cdot)$ follow from the respective properties of the GH function. We ensure that $\lim_{x \to 1} \eta(x) \to \infty$ by forcing its derivative $\eta'(x)$ to go to infinity at the endpoints of its interval. ■

THEOREM 5.1. If the inverse sigmoid is given by $y/(1-y^2)^a$, a > 0, then in some neighborhood of the origin, we have the valid expression

$$\sigma(x) = \sum_{k \ge 0} b_{2k+1}\,\frac{x^{2k+1}}{(2k+1)!}$$

where,

$$b_{2k+1} = (-1)^k\,(2k)!\,\binom{(2k+1)a}{k}. \tag{A.24}$$

Proof. We will need the Lagrange inversion formula, stated below (Wilf, 1989).
Consider the functional equation: $u = t\phi(u)$. Suppose $f(u)$ and $\phi(u)$ are analytic in some neighborhood of the origin (u-plane), with $\phi(0) = 1$. Then there is a neighborhood of the origin (in the t-plane) in which the equation $u = t\phi(u)$ has exactly one root for u. Let

$$f(u(t)) = \sum_{n \ge 0} a_n t^n$$

be the Maclaurin expansion of $f(u(t))$ in t, and

$$f'(u)[\phi(u)]^n = \sum_{k \ge 0} c^{(n)}_k u^k$$

be the Maclaurin expansion of the function $f'(u)[\phi(u)]^n$. Then

$$a_n = \frac{1}{n}\,c^{(n)}_{n-1}.$$

Here, $y \equiv u$, $x \equiv t$, and $\phi(u) = (1-y^2)^a$. Take $f(u) = u \equiv y$, and the theorem follows from the Lagrange inversion formula.

THEOREM 5.3. Let

$$\sigma(x) = \sum_{k \ge 0} b_{2k+1}\,\frac{x^{2k+1}}{(2k+1)!}$$

be an expansion for a hyperbolic sigmoid, with an inverse of the form $yF(a, 1/2; 3/2; y^2)$, valid in some neighborhood of the origin. Then, $b_{2k} = 0$, and $b_{2k+1} = C(2k+1, k)$, where we define the sequence C(n, k) as follows:

$$\begin{aligned}
C(1, 0) &= 1 \\
C(n, k) &= 0 \quad \forall\, k \ge n,\ k < 0 \\
C(n+1, k) &= (2k - n + 1)\,C(n, k) - 2(na - k + 1)\,C(n, k-1), \quad n \ge 1.
\end{aligned} \tag{A.25}$$

Here, n and k are natural numbers. $D^n(\sigma(x))$, the nth derivatives of $\sigma$, are given by:

$$D^n(y) = D^n(\sigma(x)) = \sum_{k=0}^{n-1} C(n, k)\,y^{2k-n+1}(1-y^2)^{na-k}. \tag{A.26}$$

Proof. This theorem was obtained by a process identical to that described in Minai and Williams (1993), on the derivatives of the logistic sigmoid. We therefore restrict ourselves to an outline.
It is given that $y = \eta(x) = xF(a, 1/2; 3/2; x^2)$, and $x = \sigma(y)$. It can be shown that

$$D(x) \equiv \frac{d}{dy}\sigma(y) = 1/\eta'(x) = (1-x^2)^a.$$

Consider the derivatives of the polynomial $f_{k,l}(x) = x^k(1-x^2)^l$,

$$\begin{aligned}
D(f_{k,l}(x)) = \frac{d}{dy}f_{k,l}(x) &= k\,x^{k-1}(1-x^2)^{a+l} - 2l\,x^{k+1}(1-x^2)^{a+l-1} \\
&= (k)\,f_{k-1,\,a+l}(x) + (-2l)\,f_{k+1,\,a+l-1}(x) \\
&= L(f_{k,l}(x)) + R(f_{k,l}(x)).
\end{aligned} \tag{A.27}$$

In eqn (A.27) we have split the effect of the operator

$$D \equiv \frac{d}{dy}$$

into the sum of the actions of two operators L and R (Minai and

FIGURE 3. Binary "derivation" tree for hyperbolic sigmoids. Level n holds the terms $C(n, k)\,f_{2k-n+1,\,na-k}(x)$, $k = 0, \ldots, n-1$, shown for n = 1 to 4.

Williams refer to them as ~ and Al). With respect to the polynomials $f_{k,l}$, these operators are defined by:

$$L(A f_{k,l}(x)) = kA\,f_{k-1,\,l+a}(x) \tag{A.28}$$

$$R(A f_{k,l}(x)) = -2lA\,f_{k+1,\,l+a-1}(x) \tag{A.29}$$

where A is a constant. The main advantage of introducing these operators is that they give a systematic way of visualizing the production of $D^{n+1}(x)$ from $D^n(x)$; L and R may be thought of as being applied to a binary tree of expressions, where each node is some polynomial $f_{k,l}(x)$, and the root is the polynomial $f_{0,a} = (1-x^2)^a$. The action of L on each node of this tree is to produce a left child, given by eqn (A.28), and that of R is to produce a right child, given by eqn (A.29); L acting upon $f_{k,l}(x)$ does three things: multiplies it by k (= the degree of x), reduces the degree of x by 1, and increases the degree of $(1-x^2)$ by a. On the other hand, R increases the degree of x by 1, that of $(1-x^2)$ by $(a-1)$, and multiplies the operand by $-2l$, where l is the degree of $(1-x^2)$. Figure 3 depicts the process for the first four levels. By a detailed study of this "derivative" tree, the following observations may be proved:

1. The nth level of the tree corresponds to the nth derivative of $\sigma(y)$, $D^n(x) = D^n(\sigma(y)) = L(D^{n-1}(x)) + R(D^{n-1}(x))$ (the root of the tree is designated n = 1, and $D^0(f_{k,l}(x)) = f_{k,l}(x)$).
2. At the nth level, the tree has n nodes, and the kth node (k runs from 0 through n - 1) is a polynomial in x, given by $C(n, k)\,f_{2k-n+1,\,na-k} = C(n, k)\,x^{2k-n+1}(1-x^2)^{na-k}$, where C(n, k) is a constant. It can be seen that the nth derivatives of $\sigma$ satisfy:

$$D^n(x) = \sum_{k=0}^{n-1} C(n, k)\,x^{2k-n+1}(1-x^2)^{na-k}.$$

3. There are two sources contributing to the value of C(n, k). One is the action of R on the (k - 1)th term, and the other is that of L on the kth term on the (n - 1)th level.

Induction arguments in conjunction with the above arguments then give:

$$\begin{aligned}
C(1, 0) &= 1 \\
C(n, k) &= 0 \quad \forall\, k \ge n,\ k < 0 \\
C(n+1, k) &= (2k - n + 1)\,C(n, k) - 2(na - k + 1)\,C(n, k-1), \quad n \ge 1.
\end{aligned} \tag{A.30}$$

Now, all terms in $D^n(x)$ containing an x term having positive degree will vanish when evaluated at x = 0. For even n, all the nodes have an x term with an odd degree, and hence $D^n(x)$ vanishes identically at x = 0. For odd n, all terms, except the term corresponding to $k = (n-1)/2$, vanish at x = 0. Since $b_n = D^n(x)\big|_{x=0}$, it follows that $b_{2k} = 0$ and $b_{2k+1} = C(2k+1, k)$.
