Ken Kreutz–Delgado
Electrical and Computer Engineering
Irwin and Joan Jacobs School of Engineering
University of California, San Diego
Contents
1 Introduction 2
Stochastic Neural Networks — K. Kreutz-Delgado — Version PRL-SNNIM-2017.v2.0c 2
1 Introduction
Drawing from basic concepts of Equilibrium Statistical Mechanics and Thermodynamics [6, 24]; the classic textbooks on
spin-glass model-based neural networks by Amit [3] and by Hertz, Krogh and Palmer [21]; and the theory of finite-state
Markov Chains and Markov Chain Monte-Carlo (MCMC) stochastic sampling [30, 5]; a summary is given of the basic theory
of stochastic binary artificial neural networks.
We show that the Ising distribution can be developed 1) from statistical mechanics; 2) from entropy maximization; and 3)
as an approximation to the probabilities fully describing a binary categorical random vector.
Consider a physical system composed of a large number of particles subject to mutual interaction forces, each of which can be in a finite number of individual particle states.
2 P(X) is the algebra of events associated with a measurement of X.
have $P(X) = P_X(X) =$ probability that $X = X \in \mathcal{X} = \{X^{(1)}, \cdots, X^{(K)}\}$, so that $P(X)$ takes its values in the finite set
$\{p_1, \cdots, p_K\}$ where $p_i = P(X^{(i)})$. Therefore $P(X)$ can be written as
$$ P(X) = \sum_{i=1}^{K} p_i \, \delta_{X,\, X^{(i)}} . $$
$$ P(X) = \frac{e^{-\beta E(X)}}{Z} > 0, \qquad \beta = \frac{1}{kT} > 0 \tag{1} $$
where it is assumed that the partition function $Z = Z(\beta)$ is a finite-valued normalization factor
$$ Z = \sum_{i=1}^{K} e^{-\beta E(X^{(i)})} < \infty . $$
Note that a Boltzmann distribution is positive, P (X) > 0, for all states X ∈ X under our assumption that the energy is finite.
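As a concrete numerical check of these properties, here is a minimal Python sketch (the four energy values and the choice of β are hypothetical) that builds a Boltzmann distribution over a small finite state space and verifies normalization, positivity, and the preference for low-energy states:

```python
import math

def boltzmann(energies, beta=1.0):
    """Boltzmann probabilities p_i = exp(-beta * E_i) / Z over a finite state space."""
    weights = [math.exp(-beta * E) for E in energies]
    Z = sum(weights)                   # partition function Z(beta)
    return [w / Z for w in weights], Z

# Hypothetical 4-state system with finite energies E(X^(i)).
probs, Z = boltzmann([0.0, 1.0, 2.0, 3.0], beta=0.5)
assert abs(sum(probs) - 1.0) < 1e-12   # normalization
assert all(p > 0 for p in probs)       # positivity: P(X) > 0 when energies are finite
assert probs[0] > probs[-1]            # lower energy => higher probability
```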
In the equilibrium statistical mechanical formalism used to describe a many-particle physical system, the energy E(X) is
a well-defined physical quantity and the Boltzmann distribution arises as a consequence of the theory. For real-world physical
systems k is set equal to the Boltzmann constant, an important physical parameter that relates temperature to thermal energy. In
this note, where we are exploiting the mathematical model of the Boltzmann distribution for information processing purposes,
we can set k = 1 and refer to T as the pseudo-temperature.4
Several properties of the Boltzmann distribution are derived and described in Appendices B and C. In particular:
• The Boltzmann distribution is the maximum entropy (MaxEnt) distribution given the constraint that the system has a
fixed known average energy. (See Appendix C.2.)
• The Boltzmann distribution is the minimum free energy distribution. (See Appendix C.1.)
$$ P(X) = \frac{e^{-\beta E(X)}}{Z}, \qquad E(X) = -\frac{1}{2} X^T W X - b^T X \tag{2} $$
where the elements of the n × n matrix W = W T and the n × 1 vector b are continuous and real. This distribution is the
Boltzmann Distribution with Quadratic Energy (BDQE).
The quadratic form (2) can arise from invoking physical models, as is done in Section 3 below for the Ising spin glass
model. Alternatively, one can justify the use of the quadratic-energy Boltzmann distribution by invoking the following property:
• The Boltzmann Distribution with Quadratic Energy (2) is the maximum entropy (MaxEnt) distribution subject to con-
straints on the first and second moments of X. (See Appendix C.3.)
3 This distribution is also variously known as the Boltzmann-Gibbs distribution, the Gibbs-Boltzmann distribution, the Gibbs distribution, or the canonical
distribution.
4 I.e., in this note we are looking at the properties of stochastic artificial neural networks (ANNs), not the properties of a true physical system where the
Boltzmann constant would have physical relevance. We will usually drop the adjective “artificial”, with the understanding that we are not concerned with
modeling the behavior of a true physical system and, consequently, we have agreed to set k = 1 throughout the note.
The Boltzmann distribution with quadratic energy (2) also arises in the theory of categorical random vectors, where it is a
(multivariate) quadratic exponential categorical (QEC) distribution.5 Note that it is the discrete random variable analogue of
the Gaussian distribution used to model the behavior of continuous random variables.6
Thus the use of the Boltzmann distribution (1) with quadratic energy (2) as a viable stochastic model can be motivated from a
variety of perspectives, including arguments drawn from the following areas:
• Physics, invoking equilibrium statistical mechanics (as discussed in the body of this note).
• Information Theory, invoking the principle of maximum entropy (see Appendix C.3).
• Statistics, invoking the theory of multivariate categorical variables (see Appendix D).
• Statistics, invoking the theory of probabilistic graphical models (see Appendix F).
that each encode a vector of realized site values. We call the random vector X the state of the random field where the
components of the random vector X correspond to finite random variables on the sites of the random field. The random field
perspective is used when the sites i, and the variables xi attached to the sites, are each associated with a fixed physical location
in space.8
A (finite-site) random field has n sites (or nodes, or vertices) and to each site i is associated a random variable $x_i$ that takes
a finite number, m, of possible realization values $x_i = x_i \in X = \{\xi^{(1)}, \cdots, \xi^{(m)}\}$, $|X| = m$, with $\xi^{(j)} \neq 0$.9 Note that
$X_j^{(\ell)} = \xi^{(j)}$ is the j-th component of a realization $X = X^{(\ell)}$ of the random vector X. The random vector X denotes a random
configuration state of the random field as
$$ X = (x_1, \cdots, x_n)^T . $$
X takes realization values
$$ X = X = (x_1, \cdots, x_n)^T \in \mathcal{X} \subset \mathbb{R}^n $$
in the state space
$$ \mathcal{X} = \underbrace{X \times \cdots \times X}_{n \text{ times}} = X^n \quad \text{with} \quad K = |\mathcal{X}| = |X^n| = |X|^n = m^n . $$
Summarizing, to each site i is associated a particle where xi denotes the random i-th particle-state value. Thus the state X
denotes the random configuration of a system of n interacting particles. Note that even in the simplest nontrivial case where
each $x_i$ is binary, m = 2, the size of $K = |\mathcal{X}|$, i.e., the number of possible configurations of the random field, quickly becomes
astronomical as the number of particles, n, increases in size.10
The system of interacting particles with random state vector X is assumed to have a constant number of particles, n, a
constant volume, and to be in thermal equilibrium with a heat bath at temperature T . Heat can flow between the system and
the heat bath, but no mechanical energy (work) or particles can be exchanged between them. Thus, according to the theory
of equilibrium statistical mechanics [6, 24], a Boltzmann distribution describes the probability that the multi-particle system
described by the random vector X = (x1 , · · · , xn )T is in a particular realization state X = X,
$$ P(X) = \frac{e^{-\beta E(X)}}{Z} > 0, \qquad \beta = \frac{1}{kT} > 0 $$
5 See Appendix D.
6 See Appendix E and the discussion in [9, 10].
7 This is a change in perspective from component particles being free to move around (as for a gas), to particles that are now viewed as being pinned to a
specific “site”.
8 For example, the fixed locations of a network of physical spin-1/2 particles attached to sites on a crystalline solid, or the immobile neurons in an artificial brain.
9 Setting $x_i \equiv 0$ encodes the removal of site i from consideration, as discussed below. Note that we have the same m realization values for each site i. Also note that the case m = 2 was discussed in Section 3.
10 E.g., m = 2 for a binary image and n = 10,000 for a 100 × 100 pixel grid gives $K = 2^{10{,}000} = (2^{10})^{1{,}000} \approx 10^{3{,}000}$. To get a sense of the vastness of this number, note that the number of elementary particles in the universe is estimated to be about $10^{86}$, while the age of the universe in nanoseconds is estimated to be less than $10^{27}$.
where E(X) is the energy of the system when it is in the realized configuration X = X. Again note that a Boltzmann distribution
is positive, P(X) > 0, for all realized state vectors X under the assumption that the energy is finite, $|E(X)| < \infty$ for all $X \in \mathcal{X}$.
The partition function $Z = Z(\beta)$ is a finite normalization factor
$$ Z = \sum_{X \in \mathcal{X}} e^{-\beta E(X)} < \infty . $$
Because of the frequently astronomical size of the value of K, the computation of the partition function Z is usually
intractable, precluding the determination of the Boltzmann distribution in closed form. For this reason, and others, Markov
Chain Monte Carlo stochastic sampling techniques are frequently resorted to, as discussed in Section 4 below.
If we limit the site values (components of X) to be binary,11 |X| = m = 2, with realization values xi = ±1, the resulting
random field model is known as a spin glass model. If in addition the energy function is a quadratic function of the form (2),
$$ E(X) = -\frac{1}{2} X^T W X - X^T b \tag{3} $$
where the n × n matrix W = W T and the n × 1 vector b are real and continuous, then we have the Ising spin glass model
and the Ising distribution describing the behavior of a magnetic particle system [24]. As we elaborate below, the binary-sites
Ising spin glass model is used to model stochastic binary node neural networks, resulting in the Boltzmann Machine (BM)
distribution [1, 21].
The Ising Model is the Boltzmann Distribution with Quadratic Energy (BDQE, see Appendix C.3) restricted to the state
vector X having binary components, which was mentioned at the end of Section 2.1. The binary-components Boltzmann distribution
with quadratic energy function is also known in the statistics literature as the quadratic exponential binary distribution
[10].
[24]. We will use the terminology “Ising Model” and “Boltzmann Machine” interchangeably.
13 $x_i$ denotes the i-th spin random variable, $x_i = x_i$ denotes an unknown realization value for spin particle i, while stating that $x_i = +1$ indicates a known
realization value of +1. The particles can all be simultaneously measured, i.e., the vector X = x can be instantaneously measured. If the system is quantum
mechanical, prior to a measurement the particles can be in an entangled state.
Note that this pairwise, bilinear interaction model forces the symmetry condition wij = wji . We take the energy of self-
interaction to be zero by setting wii = 0, for all i.14
In addition, each particle i feels the local (vertical) field intensity $b_i$ of an external (vertical) magnetic field, resulting in an energy contribution $-b_i x_i$.
Summing up these energies for a system of n spin-1/2 magnetic particles yields the total energy
$$ E(X) = -\sum_{\substack{i,j=1 \\ i<j}}^{n} w_{ij}\, x_i x_j - \sum_{i=1}^{n} b_i x_i = -\frac{1}{2} \sum_{i,j=1}^{n} w_{ij}\, x_i x_j - \sum_{i=1}^{n} x_i b_i $$
using the facts that $w_{ij} = w_{ji}$ and $w_{ii} = 0$. Note that we can write the total energy in vector-matrix form as the quadratic form
$$ \text{Total Energy for the Ising Model} = E(X) = -\frac{1}{2} X^T W X - X^T b \tag{4} $$
with
$$ W = W^T \quad \text{and} \quad \operatorname{diag}(W) = 0 . $$
The local magnetic field at particle site i is denoted by15
$$ h_i = h_i(X) = \sum_{j=1}^{n} w_{ij}\, x_j + b_i , \qquad i = 1, \cdots, n . $$
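The quadratic Ising energy (4) and the local field $h_i$ can be sketched in a few lines of Python (the random couplings W, biases b, and system size below are hypothetical); the final assertion checks that, because $w_{ii} = 0$, flipping spin i changes the total energy by exactly $2 x_i h_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.normal(size=(n, n)); W = (W + W.T) / 2   # enforce symmetry w_ij = w_ji
np.fill_diagonal(W, 0.0)                          # no self-interaction: w_ii = 0
b = rng.normal(size=n)

def energy(x):                    # E(X) = -1/2 X^T W X - X^T b
    return -0.5 * x @ W @ x - b @ x

def local_field(x, i):            # h_i(X) = sum_j w_ij x_j + b_i  (w_ii = 0)
    return W[i] @ x + b[i]

x = rng.choice([-1.0, 1.0], size=n)
i = 2
x_flip = x.copy(); x_flip[i] = -x_flip[i]
# Because w_ii = 0, h_i does not depend on x_i, and flipping spin i
# changes the total energy by 2 x_i h_i:
assert np.isclose(energy(x_flip) - energy(x), 2 * x[i] * local_field(x, i))
```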
$$ \pi(X) = \frac{e^{-\beta E(X)}}{Z} \tag{5} $$
where $E : \mathcal{X} \to \mathbb{R}$ is an energy function (or potential) defined on $\mathcal{X}$ that is assumed to be finite-valued.
Because typically $|\mathcal{X}| = K \gg 1$ (see footnote 10), it is generally intractable to compute the partition function Z and/or
draw a sample X = X from the distribution π. For this reason, we will construct a homogeneous (i.e., time-independent)
14 This can be done with no loss of generality because $x_i^2 \equiv 1$, so that $w_{ii} x_i^2 = w_{ii}$ is a constant that merely changes the energy baseline ground state value.
15 Because $w_{ii} = 0$, we actually have $h_i(X) = h_i(X_{-i})$, where $X_{-i}$ denotes the fact that particle i is ignored. This is discussed later below.
16 Here we need the entire realization vector X.
17 Note that $h_i(X) = 0$ defines a separating hyperplane in $\mathbb{R}^n$ [20].
18 Note that this has nothing to do with equilibrium statistical mechanics, but is a purely mathematical fact due to the fact that π is positive. For specified, arbitrary choices of a reference state $X_0$ and a value for $E(X_0)$, one takes $E(X) = E(X_0) - \frac{1}{\beta} \ln \frac{\pi(X)}{\pi(X_0)}$. See Property 1 of Appendix B.
discrete-time finite-state Markov chain which has π as its asymptotic, equilibrium (steady-state) distribution. This type
of dynamic stochastic sampling procedure is known as Markov chain stochastic sampling or Markov Chain Monte Carlo
(MCMC) sampling [30, 5].
A homogeneous discrete-time Markov Chain on K = |X| states is defined by a K × K stochastic matrix P whose
components are independent of time,
To prevent confusion, we denote conditioning with respect to a previous time-step random variable using a double vertical
stroke.19 Note that a realization in the first argument of the homogeneous transition probability $P(\cdot \,\|\, \cdot)$ is an event that, in
time, immediately follows a realization in the second argument.
At time k = 0, the Markov chain is initialized to a probability $\pi_0 = \lambda > 0$, $e^T \lambda = 1$, and then subsequently evolves
according to20,21
$$ \pi_{k+1}(X') = \sum_{X \in \mathcal{X}} P(X' \,\|\, X)\, \pi_k(X) \quad \text{with} \quad X', X \in \mathcal{X}, \ |\mathcal{X}| = K . $$
$$ \text{Balance:} \qquad \pi = \mathbf{P} \pi \tag{6} $$
I.e., the desired distribution π is an invariant distribution for the transition matrix $\mathbf{P}$.
If these three conditions hold, then π is the unique invariant distribution for P and limk→∞ πk = π independently of the choice
of the initial distribution π0 = λ. In this case π is the unique equilibrium distribution.
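The evolution $\pi_{k+1} = \mathbf{P}\pi_k$ and its convergence to the unique invariant distribution, independently of $\pi_0$, can be illustrated numerically. In this sketch the 3-state transition matrix is hypothetical and is stored column-stochastically so that $\pi_{k+1} = \mathbf{P}\pi_k$ matches the convention of the text:

```python
import numpy as np

# Hypothetical 3-state chain, stored column-stochastically: column j holds
# the probabilities P(. || j), so that pi_{k+1} = P pi_k.
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.3],
              [0.2, 0.2, 0.4]])
assert np.allclose(P.sum(axis=0), 1.0)

pi  = np.array([1.0, 0.0, 0.0])     # initial distribution pi_0 = lambda
pi2 = np.array([0.0, 0.0, 1.0])     # a different initial distribution
for _ in range(200):                # Chapman-Kolmogorov evolution
    pi, pi2 = P @ pi, P @ pi2

assert np.allclose(P @ pi, pi, atol=1e-10)   # balance: pi = P pi
assert np.allclose(pi, pi2, atol=1e-8)       # limit independent of pi_0
```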
For a large state space, $K \gg 1$, and/or a complicated form for π, it can be difficult to find a transition matrix P that satisfies
the Balance Condition (6). For this reason it is usually the case that one replaces the Balance Condition (6) with the stronger
condition of Detailed Balance.
To do so, note that the components-level statement of balance,
$$ \text{Balance:} \qquad \pi(X') = \sum_{X \in \mathcal{X}} P(X' \,\|\, X)\, \pi(X) \qquad \forall\, X, X' \in \mathcal{X}, \tag{7} $$
19 With the single vertical stroke denoting same-time-step conditioning. Thus P ( · k · ) refers to a transition probability while P ( · ) and P ( · | · ) do not.
20 This is known as the Chapman-Kolmogorov Equation. The proof is straightforward:
$$ \sum_{X \in \mathcal{X}} P(X' \,\|\, X)\, \pi_k(X) = \sum_{X \in \mathcal{X}} P(X_{k+1} = X' \,\|\, X_k = X)\, P(X_k = X) = \sum_{X \in \mathcal{X}} P(X_{k+1} = X', X_k = X) = P(X_{k+1} = X') = \pi_{k+1}(X') $$
21 Note that this sum, similarly to the sum needed to compute the partition function, is generally over a vast number $|\mathcal{X}| = K$ of terms. Fortunately we will not have to compute these sums, but instead will simulate the dynamical behavior of the Markov chain, which will be a tractable computation.
22 See [30, 5] for rigorous treatments. A nice summary description of these conditions can be found in [20]. A self-contained and very readable set of
can be written as
$$ 0 = \underbrace{\Big( \sum_{X \in \mathcal{X}} P(X \,\|\, X') \Big)}_{=1} \pi(X') - \sum_{X \in \mathcal{X}} P(X' \,\|\, X)\, \pi(X) = \sum_{X \in \mathcal{X}} \underbrace{\Big[ P(X \,\|\, X')\, \pi(X') - P(X' \,\|\, X)\, \pi(X) \Big]}_{\triangleq\, T(X,\, X')} . $$
Therefore a sufficient condition for balance between π and P to hold (i.e., for π to be an equilibrium distribution for P as
required by the Balance Condition C3) is that $T(X, X') = 0$ for all $X, X' \in \mathcal{X}$. This yields the Detailed Balance Condition:
C3′ The desired asymptotic distribution π and the transition matrix P satisfy the condition of
$$ \text{Detailed Balance:} \qquad P(X \,\|\, X')\, \pi(X') = P(X' \,\|\, X)\, \pi(X) \qquad \forall\, X, X' \in \mathcal{X} \tag{8} $$
which is a sufficient condition for π and P to be in balance (i.e., a sufficient condition for π to be an invariant distribution
for P).
Note that if $P(X \,\|\, X') \neq 0$, then the Detailed Balance Condition C3′ can be written as23
$$ \text{Detailed Balance:} \qquad \frac{P(X' \,\|\, X)}{P(X \,\|\, X')} = \frac{\pi(X')}{\pi(X)} \qquad \text{for } P(X \,\|\, X') \neq 0 \tag{9} $$
The Detailed Balance Condition is a symmetry condition that requires the rate of the transition X → X′ to be equal to the
rate of the transition in the reverse direction, X′ → X. Because so many references merely state, without explanation, that the
Convergence Conditions C1, C2 and C3′ are requirements for a Markov chain to converge to a desired equilibrium distribution
π, many people appear to be unaware that the Detailed Balance Condition C3′ is generally not a necessary condition, but
only a sufficient condition for convergence to occur (assuming that C1 and C2 hold), and can be replaced by the weaker (but
generally harder to verify) requirement of Balance as defined in Eq. (6).
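The point that detailed balance is sufficient but not necessary can be seen on a toy example. The hypothetical cyclic 3-state chain below has the uniform distribution as its invariant (balanced) distribution, yet violates detailed balance at every pair of distinct states:

```python
import numpy as np

# Hypothetical cyclic 3-state chain (column j holds P(. || j)):
# from state j move to (j+1) mod 3 with probability 0.9, stay with 0.1.
P = np.zeros((3, 3))
for j in range(3):
    P[(j + 1) % 3, j] = 0.9
    P[j, j] = 0.1
pi = np.full(3, 1.0 / 3.0)          # uniform distribution

# Balance holds: pi is an invariant distribution ...
assert np.allclose(P @ pi, pi)
# ... but detailed balance P(x'||x) pi(x) = P(x||x') pi(x') fails:
assert not np.isclose(P[1, 0] * pi[0], P[0, 1] * pi[1])
```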
Given a desired stationary distribution π there are many possible choices of the transition matrix P that satisfy the Convergence
Conditions C1, C2 and C3′. One class of choices leads to the class of Metropolis-Hastings algorithms. Another set of
choices results in the Gibbs Sampler family of stochastic sampling algorithms.24
Imposed Transition Constraint: P (X0 k X) = 0 if X and X0 differ at more than one site
note that this reduces the number of states that one can transition into from K = mn , a generally vast number, to nm.
The sites can be treated as nodes on a graph where the edges of this graph are determined from the pairwise conditional
dependencies between nodes that are encoded in π(X), and hence, equivalently, in the energy function (potential) E(X).26
The instantaneous configuration of the random vector X on this graph is denoted by the realization X = X, and sampling
23 Recall that our positivity assumption on π ensures that π(X) 6= 0.
24 The Metropolis-Hastings family includes the Gibbs Sampler family as a special case, as we show below.
25 Remember, we are considering MCMC samplers because drawing from the unconditional distribution π(X) is supposed to be hard. Unless we are
careful, why should drawing from the conditional distribution P (X k X0 ) be any easier?
26 This graph is known as a dependency graph [19]. An edge is drawn between site i and site j if the random variables $x_i$ and $x_j$ are not independent
conditioned on the remaining components of X.
one site at a time is equivalent to 1) moving from node i, with current value xi , to node j, with current value xj , followed by
2) a transition from xj to a new value x0j . Doing these two steps in some principled manner corresponds to the Markov chain
transition X → X0 .
To more easily analyze this transition, it is useful to consider the subgraph and random variable that ensue when a single
site i is removed from the full graph. To indicate this mathematically, we set $x_i \equiv 0$. The resulting induced random subgraph
on (n − 1) nodes is then associated to the random vector27
$$ X_{-i} = X - x_i e_i , \quad \text{so that} \quad X = X_{-i} + x_i e_i . \tag{10} $$
Remember that we are building the sampler to have the desired properties, so we get to choose the functional form of
$P(X' \,\|\, X)$. We do this by imposing the efficient sampling constraint
where
A common choice for $P(i \,\|\, X)$ is the random scan choice, where site i is chosen uniformly over all possible sites, independently of X:
$$ P(i \,\|\, X) = P(i) = \frac{1}{n} . \tag{18} $$
Thus the transition matrix for the random scan Gibbs Sampler is given by:
$$ P(X' \,\|\, X) = \begin{cases} 0 & \text{if } X \text{ and } X' \text{ differ at more than one site} \\[4pt] \dfrac{\pi(x'_i \,|\, X_{-i})}{n} & \text{if } X \text{ and } X' \text{ possibly differ (only) at site } i \end{cases} \tag{19} $$
The last line of (19) says that one first randomly selects a site i with uniform probability $P(i) = \frac{1}{n}$, followed by the selection
of a new realization value $x'_i$ drawn from the conditional probability $\pi(x'_i \,|\, X_{-i})$.
It is easy to verify that the Detailed Balance Condition, in either form (8) or (15), is satisfied.28 Furthermore, because all
sites are eventually selected, the entire state space forms a single communicating class, so that the Markov chain is irreducible
for reasonable forms of $\pi(x'_i \,|\, X_{-i})$. Finally, because with nonzero probability a state does not change in value during a
single time-step transition, the chain is aperiodic. Then, with Convergence Conditions C1, C2 and C3′
satisfied, we have $\lim_{k\to\infty} \pi_k = \pi$ independently of the choice of the initial distribution $\pi_0 = \lambda$.
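A minimal random-scan Gibbs sampler sketch in Python (the 3-site energy function, β, and run lengths are hypothetical choices). The single-site conditional is computed from the energies of the two completions at site i, and the empirical state frequencies, after discarding burn-in samples, are compared with the exact distribution obtained by enumeration:

```python
import math, itertools, random

random.seed(1)
beta = 0.7

def E(x):
    """Hypothetical energy on 3 binary (+/-1) sites."""
    return -0.5 * (x[0]*x[1] + x[1]*x[2]) - 0.2 * sum(x)

def gibbs_step(x):
    """One random-scan Gibbs update: pick site i uniformly, then redraw x_i
    from pi(x_i | X_{-i}), computed from the two completions' energies."""
    i = random.randrange(len(x))
    xp, xm = list(x), list(x)
    xp[i], xm[i] = +1, -1
    p_plus = 1.0 / (1.0 + math.exp(-beta * (E(xm) - E(xp))))
    x[i] = +1 if random.random() < p_plus else -1
    return x

# Exact distribution by enumeration (K = 2^3 is small enough here).
states = list(itertools.product([-1, +1], repeat=3))
w = {s: math.exp(-beta * E(s)) for s in states}
Z = sum(w.values())
exact = {s: w[s] / Z for s in states}

# Run the chain, discard burn-in, and compare empirical frequencies.
x, counts = [1, 1, 1], {s: 0 for s in states}
for k in range(60000):
    x = gibbs_step(x)
    if k >= 5000:                    # burn-in discarded
        counts[tuple(x)] += 1
N = sum(counts.values())
tv = 0.5 * sum(abs(counts[s]/N - exact[s]) for s in states)
assert tv < 0.03   # empirical distribution close to pi in total variation
```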
With this choice for $P(X' \,\|\, X)$ it is easy30 to see that the Detailed Balance Condition (8) is satisfied. If $R(X' \,\|\, X)$ is nonzero,
we can write
$$ P(X' \,\|\, X) = \left( 1 \wedge \frac{R(X \,\|\, X')\, \pi(X')}{R(X' \,\|\, X)\, \pi(X)} \right) R(X' \,\|\, X) \triangleq A(X, X')\, R(X' \,\|\, X), \tag{21} $$
where $A(X, X')$ is the acceptance probability. If $A(X, X') = 1$, then the proposal transition X → X′, determined by drawing
a sample from the proposal distribution $R(X' \,\|\, X)$, is accepted (with probability $1 = A(X, X')$). On the other hand, if
$A(X, X') < 1$, then the proposal transition X → X′ is accepted with lower probability $A(X, X') < 1$ and rejected (in which
case X → X) with probability $1 - A(X, X') > 0$.
Note that $A(X, X') = 1$ suggests that the transition probability is possibly imbalanced in favor of transitions into X from X′
(relative to the Detailed Balance condition (8)), so we keep transitions X → X′ drawn from the proposal distribution $R(X' \,\|\, X)$
in order to rectify the imbalance. On the other hand, $A(X, X') < 1$ suggests that the transition probability is imbalanced in
favor of transitions into X′, so the algorithm compensates by decreasing the probability of a transition X → X′.
Note that A(X, X0 ) ≡ 1 for all nonzero probability transitions between X and X0 means that the proposal distribution
satisfies the Detailed Balance condition (8) (i.e., there is never any imbalance), and in this case P (X0 k X) = R(X0 k X) for
all X and X0 . In particular, if we select the proposal distribution R(X0 k X) to satisfy the Gibbs Sampler conditions shown in
Eq. (19),
$$ R(X' \,\|\, X) = \begin{cases} 0 & \text{if } X \text{ and } X' \text{ differ at more than one site} \\[4pt] \dfrac{\pi(x'_i \,|\, X_{-i})}{n} & \text{if } X \text{ and } X' \text{ possibly differ (only) at site } i \end{cases} $$
Then P (X0 k X) = R(X0 k X) = 0 when X and X0 differ at more than one site, and otherwise A(X, X0 ) ≡ 1 (because, as we’ve
seen, the Gibbs Sampler transition probabilities satisfy the Detailed Balance condition (8)). Therefore P (X0 k X) = R(X0 k X)
for all X and X0 , showing that the Gibbs Sampler is a Metropolis-Hastings Algorithm.
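The construction (21) can also be checked numerically. In the sketch below (a hypothetical 4-state target π with a uniform, symmetric proposal R), the induced Metropolis-Hastings transition matrix is verified to satisfy detailed balance, and hence balance:

```python
import numpy as np

# Hypothetical target distribution pi on 4 states and a symmetric proposal R.
pi = np.array([0.1, 0.2, 0.3, 0.4])
K = len(pi)
R = np.full((K, K), 1.0 / K)          # R(x'||x) = 1/K (uniform, symmetric)

# Metropolis-Hastings: P(x'||x) = A(x,x') R(x'||x) off the diagonal, with
# acceptance A(x,x') = min(1, R(x||x') pi(x') / (R(x'||x) pi(x))).
P = np.zeros((K, K))                  # column j holds P(. || j)
for x in range(K):
    for xp in range(K):
        if xp != x:
            A = min(1.0, R[x, xp] * pi[xp] / (R[xp, x] * pi[x]))
            P[xp, x] = A * R[xp, x]
    P[x, x] = 1.0 - P[:, x].sum()     # rejection mass stays at x

# Detailed balance P(x'||x) pi(x) = P(x||x') pi(x') => pi is invariant.
for x in range(K):
    for xp in range(K):
        assert np.isclose(P[xp, x] * pi[x], P[x, xp] * pi[xp])
assert np.allclose(P @ pi, pi)
```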
28 When examining Eq. (8), note that if X differs from X0 only at site i, then X0 must differ from X at the same site i.
29 For any real numbers a and b, $a \wedge b \triangleq \min(a, b) = \min(b, a) = b \wedge a$.
30 Just use the definition (20) and the fact that a ∧ b = b ∧ a.
For example, set $f(X_k) = 1(X_k = X) = \delta_{X_k,\, X}$, so that $E_\pi\{f(X)\} = \pi(X)$. Then,
$$ \frac{1}{N} \sum_{k=0}^{N-1} \delta_{X_k,\, X} \to \pi(X) \quad \text{as } N \to \infty \quad \text{almost surely.} $$
1. First, the sample values generated during the burn-in time are not drawn from the steady-state equilibrium distribution.
To diminish this problem, one usually does not begin the averaging procedure until a time has elapsed that is (hopefully)
past the burn-in time. I.e., initial samples are thrown away and only samples beyond the burn-in time are averaged.
2. Being drawn from a Markov chain, the samples are not independent, and the resulting inter-sample correlations slow
down the convergence process, as well as make the convergence analysis more difficult. If only every ℓ-th sample is
used, for some judiciously chosen value of ℓ, then the chosen samples are approximately iid and an asymptotic analysis
based on the use of iid samples can be invoked.
3. The state space is very large and only neighboring states (differing at only one site) are sequentially visited. However,
the fact that higher probability regions of the state space are visited more frequently (so that a type of “importance
sampling” is occurring) somewhat ameliorates this problem.
$$ \pi(X) = \frac{e^{-\beta E(X)}}{Z} \quad \text{with energy} \quad E(X) = -\frac{1}{2} X^T W X - b^T X . $$
The resulting Markov chain stochastic dynamical behavior of the system of spin-1/2 magnetic particles is known as the Glauber
Dynamics, after Roy J. Glauber, the Nobelist who first proposed it in 1963 [15].
Let $E(X_{-i})$ denote the energy for the system of magnetic particles obtained by physically removing the i-th particle,
$$ E(X_{-i}) = -\frac{1}{2} X_{-i}^T W X_{-i} - b^T X_{-i} \tag{22} $$
31 Recall that C1, C2 and C3′ are sufficient for C1, C2, and C3 to hold.
32 It is common to use a singular distribution $\lambda(X) = 1(X = X_0) = \delta_{X,\, X_0}$.
and define the energy difference
$$ E_\delta(x_i) \triangleq E(X) - E(X_{-i}) . \tag{23} $$
Also note that because $w_{ii} = 0$, the local magnetic field, $h_i = h_i(X_{-i})$, at site i depends only on $X_{-i}$ and not on the value
of $x_i$, so it can equally be computed from the full state vector X:33
$$ h_i(X) = h_i(X_{-i}) = \sum_{j=1}^{n} w_{ij}\, x_j + b_i = e_i^T W X_{-i} + b_i = w_i^T X_{-i} + b_i \qquad (w_{ii} = 0) \tag{24} $$
IDENTITY34
$$ E_\delta(x_i) = -\, x_i\, h_i(X_{-i}) \tag{25} $$
Proof:
$$ \begin{aligned} E(X) = E(X_{-i} + x_i e_i) &= -\frac{1}{2} (X_{-i} + x_i e_i)^T W (X_{-i} + x_i e_i) - b^T (X_{-i} + x_i e_i) \\ &= E(X_{-i}) - \frac{1}{2} \left( X_{-i}^T W x_i e_i + x_i e_i^T W X_{-i} \right) - x_i b^T e_i \qquad \text{(using } w_{ii} = 0\text{)} \\ &= E(X_{-i}) - x_i \left( e_i^T W X_{-i} + b_i \right) = E(X_{-i}) - x_i h_i(X_{-i}) = E(X_{-i}) + E_\delta(x_i) . \end{aligned} $$
Marginalizing,
$$ \pi(X_{-i}) = \sum_{x_i = \pm 1} \pi(x_i, X_{-i}) = Z^{-1} e^{-\beta E(X_{-i})} \sum_{x_i = \pm 1} e^{-\beta E_\delta(x_i)} , $$
Note that $E_\delta(x_i) = \pm|h_i(X_{-i})|$ and that the value of $x_i$ which leads to a negative value of $E_\delta(x_i)$, i.e., to the lower of the two
possible values for $E(X) = E(X_{-i}) + E_\delta(x_i)$, has the higher conditional probability of occurring. I.e., the system “prefers” to
transition to a lower energy state.35
Using the identity $E_\delta(x_i) = -x_i h_i(X_{-i})$, we obtain
$$ \pi(x_i \,|\, X_{-i}) = \frac{e^{-\beta E_\delta(x_i)}}{\sum_{x'_i = \pm 1} e^{-\beta E_\delta(x'_i)}} = \frac{e^{\beta x_i h_i(X_{-i})}}{e^{-\beta h_i(X_{-i})} + e^{+\beta h_i(X_{-i})}} $$
transitions to lower energy states, see also the discussion in Section 5.2 below.
or36
$$ \pi(x_i = \pm 1 \,|\, X_{-i}) = \frac{e^{\pm\beta h_i(X_{-i})}}{e^{-\beta h_i(X_{-i})} + e^{+\beta h_i(X_{-i})}} = \frac{e^{\pm\beta h_i(X_{-i})}}{e^{\mp\beta h_i(X_{-i})} + e^{\pm\beta h_i(X_{-i})}} = \frac{1}{1 + e^{\mp 2\beta h_i(X_{-i})}} . \tag{29} $$
Note that in the limit β → 0 (i.e., T → ∞) it is immediately evident that
$$ \pi(x_i = +1 \,|\, X_{-i}) = \pi(x_i = -1 \,|\, X_{-i}) = \frac{1}{2} \quad \text{for } T \to \infty . \tag{30} $$
Using the logistic regression function
$$ \sigma(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}} , $$
we obtain
$$ \pi(x_i \,|\, X_{-i}) = \sigma\big( 2\beta\, h_i(X_{-i})\, x_i \big) $$
with
$$ h_i(X_{-i}) = h_i(X) = \sum_{j=1}^{n} w_{ij}\, x_j + b_i = e_i^T W X_{-i} + b_i = w_i^T X_{-i} + b_i , \tag{32} $$
or as
$$ \pi(x_i \,|\, X_{-i}) = \sigma\big( {-2\beta E_\delta(x_i)} \big) \tag{34} $$
with $E_\delta(x_i) = -x_i h_i(X_{-i})$.
The last form for π(xi | X−i ) makes it particularly clear that a higher energy configuration is a lower probability configura-
tion.37
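A quick numerical check (with hypothetical values for β and the local field $h_i$) that the Boltzmann ratio form (29) and the logistic form $\pi(x_i \,|\, X_{-i}) = \sigma(2\beta h_i(X_{-i}) x_i)$ agree, and that the two outcomes are complementary:

```python
import math

def sigma(x):                        # logistic function: sigma(x) = 1/(1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

beta, h = 0.8, 0.35                  # hypothetical inverse temperature and local field
for xi in (+1, -1):
    # Boltzmann ratio e^{beta h x_i} / (e^{beta h} + e^{-beta h}) ...
    ratio = math.exp(beta * h * xi) / (math.exp(beta * h) + math.exp(-beta * h))
    # ... equals the logistic form sigma(2 beta h x_i):
    assert math.isclose(ratio, sigma(2.0 * beta * h * xi))
# The two conditional outcomes sum to one:
assert math.isclose(sigma(2*beta*h) + sigma(-2*beta*h), 1.0)
```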
Another useful form of the conditional probability π(xi | X−i ) is38
To implement the stochastic Glauber Dynamics (i.e., the Gibbs Sampler): given the current state X, after a site i has
been randomly selected with probability $P(i) = \frac{1}{n}$, decide whether or not to choose the realization value $x'_i = x_i^{\text{next}} = 1$
according to the conditional probability
$$ \pi(x'_i = 1 \,|\, X_{-i}) = \sigma\big( 2\beta h_i(X_{-i}) \big) , $$
which corresponds to
$$ \pi(x'_i = -1 \,|\, X_{-i}) = 1 - \sigma\big( 2\beta h_i(X_{-i}) \big) = \sigma\big( {-2\beta h_i(X_{-i})} \big) . $$
Of course, any of these three probability statements suffices to completely describe the conditional probability of the binary
random variable $x_i$.
Summarizing, the Random Scan Gibbs Sampler (19) applied to the Ising Model gives:
GLAUBER DYNAMICS
$$ P(X' \,\|\, X) = \begin{cases} 0 & \text{if } X, X' \text{ differ at more than one site} \\[4pt] P(x'_i \,\|\, i, X_{-i})\, P(i) = \sigma\big( 2\beta h_i(X_{-i})\, x'_i \big) \dfrac{1}{n} & \text{if } X, X' \text{ possibly differ at site } i \end{cases} \tag{38} $$
The last line of (38) says that one first randomly selects a site i with uniform probability $P(i) = \frac{1}{n}$, followed by the selection
of a new realization value $x'_i$ drawn from the conditional probability $\pi(x'_i \,|\, X_{-i})$ given in Eq. (33).
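A sketch of one Glauber update and a short simulation (the 4-spin ferromagnetic chain, β, and run lengths below are hypothetical choices); the final check verifies that positive couplings tend to align neighboring spins:

```python
import numpy as np

rng = np.random.default_rng(42)

def glauber_step(x, W, b, beta, rng):
    """One Glauber update: pick a site i uniformly (random scan), then set
    x_i = +1 with probability sigma(2*beta*h_i(X_{-i}))."""
    i = rng.integers(len(x))
    h_i = W[i] @ x + b[i]                       # local field (uses w_ii = 0)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * h_i))
    x[i] = 1.0 if rng.random() < p_plus else -1.0
    return x

# Hypothetical 4-spin ferromagnetic open chain (w = +1 couplings, no field).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
b = np.zeros(4)
beta = 1.0

x = rng.choice([-1.0, 1.0], size=4)
for _ in range(1000):                           # burn-in
    x = glauber_step(x, W, b, beta, rng)

corr, M = 0.0, 20000
for _ in range(M):                              # time-average of x_0 * x_1
    x = glauber_step(x, W, b, beta, rng)
    corr += x[0] * x[1]
assert corr / M > 0.3    # positive couplings align neighbouring spins
```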
Suppose, once a site i has been chosen, that rather than randomly deciding if $x'_i = +1$ according to the probability law
$P(x'_i \,\|\, i, X_{-i})$, one instead chooses the most probable outcome,39
$$ x'_i = x_i^{\text{map}} = \arg\max_{x_i = \pm 1} P(x'_i \,\|\, i, X_{-i}) = \arg\max_{x_i = \pm 1} \pi(x_i \,|\, X_{-i}) . \tag{39} $$
The “MAP update” procedure is independent of β > 0 (i.e., of the temperature) and corresponds to the update rule40
$$ x'_i = x_i^{\text{map}} = \arg\min_{x_i = \pm 1} E_\delta(x_i) = \arg\max_{x_i = \pm 1} x_i\, h_i(X_{-i}) \tag{40} $$
The MAP algorithm corresponds to the Glauber dynamics in the limit T → 0 (equivalently, β → ∞) and in this context is
known as the Hopfield Dynamics or the Hopfield Algorithm,41 and the corresponding zero-temperature network is called a
Hopfield Network.42
The Hopfield Algorithm is provably convergent to a local minimum of the energy function E(X) [21]. This readily follows
from Definition (23) which, with the optimality of $x'_i = x_i^{\text{map}}$ shown in (40), implies
$$ E(X') = E(X_{-i}) + E_\delta(x'_i) \le E(X_{-i}) + E_\delta(x_i) = E(X) . $$
Thus at every iteration the total energy either decreases or remains the same. Because the total energy is bounded from below,
and the state-space is finite, after a finite number of iterations the algorithm will have converged to a local minimum.
The Hopfield Algorithm (40) is implemented as
HOPFIELD DYNAMICS
$$ x'_i = \operatorname{sign}\big( h_i(X_{-i}) \big) = \operatorname{sign}\Big( \sum_{j=1}^{n} w_{ij}\, x_j + b_i \Big) = \operatorname{sign}\big( w_i^T X_{-i} + b_i \big) \qquad (w_{ii} = 0) \tag{41} $$
where $w_i^T$ is the i-th row of the weighting matrix W and i has been chosen according to the random scan procedure $i \sim P(i) = \frac{1}{n}$.
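The convergence argument can be exercised directly. The sketch below (with a hypothetical random W and b, and the tie-breaking convention sign(0) = +1) runs deterministic MAP updates until no spin changes, then verifies that no single flip can lower the energy, i.e., that a local minimum has been reached:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 8
W = rng.normal(size=(n, n)); W = (W + W.T) / 2   # symmetric couplings
np.fill_diagonal(W, 0.0)                         # w_ii = 0
b = rng.normal(size=n)

def energy(x):                                   # E(X) = -1/2 X^T W X - b^T X
    return -0.5 * x @ W @ x - b @ x

def hopfield_run(x, max_sweeps=100):
    """Deterministic MAP updates x_i <- sign(h_i(X_{-i})) until no spin changes."""
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(n):             # random-scan visit order
            h_i = W[i] @ x + b[i]
            new = 1.0 if h_i >= 0 else -1.0      # tie (h_i = 0) resolved to +1
            if new != x[i]:
                x[i], changed = new, True
        if not changed:
            break
    return x

x = hopfield_run(rng.choice([-1.0, 1.0], size=n))
# Fixed point is a local minimum: no single spin flip lowers the energy.
for i in range(n):
    xf = x.copy(); xf[i] = -xf[i]
    assert energy(xf) >= energy(x) - 1e-9
```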
The Hopfield algorithm step shown in (41) is just the well-known perceptron algorithm that decides which side of a separating
hyperplane the vector $X_{-i}$ points into [20]. Thus, unlike the hierarchically multilayered perceptron networks, which
have great current popularity and utility [20, 16], the Hopfield network and dynamics correspond to a massively interconnected,
massively parallel network of perceptron elements.43 Note that the Glauber Dynamics correspond to a stochastic
generalization of the Hopfield algorithm, and therefore the Ising model-plus-Glauber Dynamics can be viewed as a massively
interconnected, massively parallel network of stochastic perceptron elements, which is a huge generalization of the basic
deterministic, multilayered perceptron architecture.
39 “MAP” stands for “maximum a posteriori” [31].
40 See Equations (33) and (34).
41 Note that while this step of the algorithm is now deterministic, the overall algorithm still has a stochastic component as we are still implementing the
It is evident that we can implement the Glauber dynamics by determining the conditional probability of a flip at a randomly
chosen single site i at time k + 1 given X−i at time k,
provided that we choose the flip probability in a manner consistent with the Glauber dynamics (38). Note that the probability
of a flip at the chosen site i either occurring or not occurring is given by
$$ P_i(1 \to -1) = P_i(-1 \to -1) = \pi(x_i = -1 \,|\, X_{-i}) = \sigma\big( {-2\beta h_i(X_{-i})} \big) $$
and
$$ P_i(-1 \to 1) = P_i(1 \to 1) = \pi(x_i = 1 \,|\, X_{-i}) = \sigma\big( 2\beta h_i(X_{-i}) \big) , $$
where $P_i(1 \to -1)$ and $P_i(-1 \to 1)$ are the two possible spin-flip transitions.44 The probabilities of these two spin-flip possibilities
occurring can be summarized by
$$ P_i(x_i \to -x_i) = \sigma\big( {-2\beta h_i(X_{-i})\, x_i} \big) \tag{42} $$
with
$$ h_i(X_{-i}) = \sum_{j=1}^{n} w_{ij}\, x_j + b_i . $$
Note that
$$ P_i(x_i \to x_i) + P_i(x_i \to -x_i) = 1 $$
as expected since flipping or not flipping are the only two possibilities at the selected site i.
Equations (42) and (43) completely describe the Glauber Dynamics for the Ising model in terms of spin flipping, once a
site i has been randomly selected. However, it is illuminating to rewrite these two equations into a form that makes it clear that
transitions resulting in a lowering of the overall energy of the system are preferred (i.e., have higher probability of occurring).
To do so, recall that for $X = X_{-i} + x_i e_i$ we have
$$ E(X) = E(X_{-i}) + E_\delta(x_i) $$
and
$$ E_\delta(x_i) = -x_i\, h_i(X_{-i}) . $$
44 Note that the emphasis has shifted from concern with a particular state-value $X' = X_{-i} + x'_i e_i$ that is realized at time k + 1 to the transition X → X′
itself as an entity of interest. It is well-known that a Markov chain on a set of states is equivalent to a larger Markov chain on the set of transitions viewed as
states [14]. In essence, this is the procedure being pursued here, where a “flip” refers to either of the two site-i transitions −1 → 1 or 1 → −1 viewed as
entities of interest in their own right.
Also recall that the realizations $X_k = X = X_{-i} + x_i e_i$ and $X_{k+1} = X' = X_{-i} + x'_i e_i$ possibly differ only at the selected site
i. Define energies of transition by
$$ \Delta E(X \to X') \triangleq E(X') - E(X) \quad \text{and} \quad \Delta E_\delta(x_i \to x'_i) \triangleq E_\delta(x'_i) - E_\delta(x_i) . $$
Then, because spin-flip transitions can occur at most at the single site i, we have that the change in energy due to a configuration
transition X → X′ is given by
$$ \Delta E(X \to X') = \Delta E_\delta(x_i \to x'_i) = -(x'_i - x_i)\, h_i(X_{-i}) . $$
Of course, a change in energy is only due to a change in configuration at site i, and the energy does not change if there is no
spin flip at site i. Note that
$$ \Delta E_\delta(-1 \to 1) = -2 h_i(X_{-i}) = 2 h_i(X_{-i})\, x_i \quad \text{and} \quad \Delta E_\delta(1 \to -1) = 2 h_i(X_{-i}) = 2 h_i(X_{-i})\, x_i , $$
where in each case $x_i$ denotes the pre-flip value.
Note that if ∆Eδ (xi → −xi ) < 0, then the probability of a flip increases, whereas if ∆Eδ (xi → −xi ) > 0 the probability
decreases.45 Thus the system tends to evolve to lower energy state configurations.
with
$$ P(x'_i, x'_j \,\|\, i, j, X_{-ij}) = \pi(x'_i, x'_j \,|\, X_{-ij}) , \qquad X_{-ij} = X - x_i e_i - x_j e_j = X_{-ji} , $$
and
$$ P(i, j) = \frac{1}{\binom{n}{2}} = \frac{2}{n(n-1)} . $$
45 And if ∆Eδ (xi → −xi ) = 0, the probability of a flip is 0.5, the same as that of a non-flip.
46 For simplicity, assume that the global configuration of the network is instantaneously communicated to all sites whenever a transition occurs.
47 This is an infinite-precision argument. In reality for a discrete-time implementation this will be approximately true to a reasonable degree of accuracy
for a small enough sampling time interval that depends on the number of units, n.
48 Note, however, that this removes the possibility of a simple distributed processing algorithm as the r sites will need to simultaneously fire in a coordinated
fashion.
with
hi (X−ij ) = eTi W X−ij + bi = wTi X−ij + bi .
Similarly, one can compute arbitrary r-site Glauber dynamics. It is interesting to conjecture whether higher site-order transitions can speed up the burn-in (mixing) time of the Markov chain defined by the Glauber dynamics, or whether even an occasional high-r update during the standard 1-site Glauber dynamics can speed up the time to equilibrium.
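The two-site case can be sketched concretely: draw an unordered pair {i, j} uniformly (probability 2/(n(n−1))) and resample (xi, xj) jointly from the conditional π(x'i, x'j | X−ij). A minimal NumPy sketch, with names of our own choosing:

```python
import numpy as np
from itertools import product

def glauber_pair_step(x, W, b, beta, rng):
    """Two-site Glauber update for spins in {-1,+1}: choose an unordered
    pair {i,j} uniformly (probability 2/(n(n-1))) and resample (x_i, x_j)
    jointly from pi(x_i', x_j' | X_{-ij}) of the Ising distribution."""
    n = len(x)
    i, j = rng.choice(n, size=2, replace=False)
    # unnormalized log-probabilities -beta*E for the 4 joint configurations
    logits = {}
    for si, sj in product((-1, 1), repeat=2):
        y = x.copy(); y[i], y[j] = si, sj
        logits[(si, sj)] = beta * (0.5 * y @ W @ y + b @ y)
    m = max(logits.values())
    w = {k: np.exp(v - m) for k, v in logits.items()}   # stabilized weights
    r = rng.random() * sum(w.values())
    for k, v in w.items():                              # inverse-CDF draw
        r -= v
        if r <= 0:
            x[i], x[j] = k
            break
    return x
```

For n = 2 the pair is always the whole network, so each step is an exact draw from the Ising distribution, which makes the sketch easy to sanity-check.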
Various models are obtained by truncating the right-hand-side expansion to various orders of products of the components of
X. In particular, keeping terms up to second order results in the quadratic-energy Boltzmann Machine distribution used in the
standard Ising model discussed earlier.
The single-site random scan Gibbs sampler (19) requires knowledge of π(xi | X−i ), which is computable for models of
higher order than quadratic.
As an illustrative example, let us determine π(xi | X−i) for the cubic approximation to the exact representation (47) obtained by truncating terms of order four and higher. This approximation is of the form
π(X) = e^{−βE(X)} / Z   with   E(X) = B(X) + W(X, X) + C(X, X, X)    (48)
where B(X), W(X, Y) and C(X, Y, Z) are multilinear, real-valued functions of their vector arguments.50 These functions are assumed to be invariant with respect to all permutations of their arguments,51 and to satisfy
B(X) = −b^T X   and   W(X, Y) = −(1/2) X^T W Y
with52
wii = 0 and wij = wji for all i, j = 1, ..., n ,
and
C(X, Y, Z) = −(1/6) Σ_{i,j,k=1}^n cijk xi yj zk
with cijk = 0 whenever any two of its indices are identical, and with cijk invariant under permutation of its indices.53
Subject to these constraint conditions, the multilinear operators applied to X = X−i + xi ei yield
W(X, X) = W(X−i, X−i) + 2 xi W(ei, X−i) = W(X−i, X−i) − xi e_i^T W X−i = W(X−i, X−i) − xi w_i^T X−i
and
C(X, X, X) = C(X−i, X−i, X−i) + 3 xi C(ei, X−i, X−i) = C(X−i, X−i, X−i) − (1/2) xi X−i^T C^(i) X−i
where the matrices C^(i), i = 1, ..., n, have elements
[C^(i)]_jk = cijk   for j, k = 1, ..., n .
Note that for each i the matrix C^(i) is symmetric and has zero diagonal elements.
If we define
hi(X−i) = bi + w_i^T X−i + (1/2) X−i^T C^(i) X−i ,    (49)
then the single-site conditional probability again takes the logistic form π(x'i | X−i) = σ(2β hi(X−i) x'i), with hi(X−i) given by Eq. (49). Note that if the cubic energy terms are "turned off," cijk ≡ 0, then we recover the quadratic case shown in Equations (32)–(33).
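The cubic local field (49) can be checked against the energy directly: writing E(X) = const − xi hi(X−i), the energy difference between the two values of xi must equal 2 hi(X−i). A short NumPy sketch (our own illustrative names, with c[i] playing the role of C^(i)):

```python
import numpy as np

def energy(x, b, W, c):
    """Cubic-energy model: E(X) = -b'X - 0.5 X'WX - (1/6) sum_ijk c_ijk x_i x_j x_k,
    i.e., E = B(X) + W(X,X) + C(X,X,X) in the notation of Eq. (48)."""
    return -(b @ x + 0.5 * x @ W @ x + np.einsum('ijk,i,j,k->', c, x, x, x) / 6.0)

def local_field(i, x, b, W, c):
    """h_i(X_{-i}) = b_i + w_i'X_{-i} + 0.5 X_{-i}'C^(i)X_{-i}, Eq. (49)."""
    xm = x.copy()
    xm[i] = 0.0                      # X_{-i}: site i removed
    return b[i] + W[i] @ xm + 0.5 * xm @ c[i] @ xm
```

The test below builds a fully symmetric tensor c with zeros on repeated indices, as required by the constraint conditions, and verifies E(xi = −1) − E(xi = +1) = 2 hi(X−i) at every site.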
Note that in this case the complexity of having to decide which of the K^n possible site configurations to update with nonzero probability has been simplified by attempting to update all sites according to the transition probability (52), in contrast to the single-site random-scan algorithm that randomly selects one site, say xi, and updates it according to π(x'i | X−i).
In general, Eq. (52) will not have our distribution of interest, π(X) of Eq. (5), as its equilibrium distribution. In our
previous examples, we first specified the desired equilibrium distribution, Eq. (5), and then afterwards defined the transition
probabilities of our Markov chain accordingly. Here we are first mandating the form of the transition probabilities and now
53 Note that there are six possibilities. This accounts for the factor of 1/6 in the components-level form of C(X, X, X).
54 The material presented here draws heavily from reference [35].
must determine the consequences of this choice on the nature of the resulting equilibrium distribution, assuming it exists. Because the transition probabilities are strictly positive, there does exist a unique equilibrium distribution,55 which we designate as π̃(X),
π̃(X') = Σ_{values of X} P̃(X' ‖ X) π̃(X) .    (53)
In terms of the vector of equilibrium probabilities and the matrix of transition probabilities, we have π̃ = P̃ π̃, with56
P̃(X' ‖ X) = Π_{i=1}^n π(x'i | X−i) = Π_{i=1}^n σ(2β hi(X) x'i) .    (54)
To simplify matters, absorb the factor 2β into the weights W and b,57 so that
hi(X) = e_i^T W X + bi ← 2β hi(X) = 2β (e_i^T W X + bi)
and
π(x'i | X−i) = σ(hi(X) x'i) ← π(x'i | X−i) = σ(2β hi(X) x'i) .
Also note that zi = (1/2)(xi + 1) ∈ {0, 1} and xi = 2zi − 1 ∈ {−1, +1} convey the same information58 so that π̃(xi) = π̃(zi), π̃(X) = π̃(Z), and P(X' ‖ X) = P(Z' ‖ Z) when X and Z are so related. We also use the fact that59
π(x'i | X−i) = σ(x'i hi(X)) = e^{z'i hi(X)} / (1 + e^{hi(X)})   with   z'i = (x'i + 1)/2 .    (56)
Now substitute X = 2Z − e into hi(X) and define the quantities
W̃ = 2W,   b̃i = bi − e_i^T W e = bi − Σ_j wij,   and   h̃i(Z) = e_i^T W̃ Z + b̃i    (57)
to arrive at
hi(X−i) = hi(X) = e_i^T W X + bi = e_i^T W̃ Z + b̃i = h̃i(Z) = h̃i(Z−i)    (58)
and
π(x'i | X−i) = e^{z'i h̃i(Z)} / (1 + e^{h̃i(Z)}) = π(z'i | Z−i)   for   z'i = (x'i + 1)/2    (59)
55 This is a consequence of the Perron–Frobenius Theorem.
56 Recall that hi(X) = hi(X−i) because wii = 0.
57 This can be easily reversed when desired.
58 For X = (x1, ..., xn)^T and Z = (z1, ..., zn)^T the corresponding relationships are Z = (1/2)(X + e) and X = 2Z − e for e = (1, ..., 1)^T.
59 See Eq.s (33) and (36).
60 The conditional probability π(zi | Z−i) given in (59) is denoted by ps(x, ys) in Section 2.1 of reference [35]. Note that
z'i h̃i(Z) = z'i ( Σ_j w̃ij zj + b̃i )   with   w̃ii = 0.
for X and Z related as X = 2Z − e. Thus, equivalently to solving (53) for the Ising model, one can instead solve for the stationary distribution that satisfies
π̃(Z') = Σ_{values of Z} P̃(Z' ‖ Z) π̃(Z) .    (61)
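The reparameterization (57) is easy to verify numerically: the ±1-spin local field and the 0/1-unit local field must agree at corresponding configurations. A minimal sketch with our own function name:

```python
import numpy as np

def spin_to_binary_params(W, b):
    """Map the +-1-spin parameters (W, b) to the 0/1-unit parameters
    (W~, b~) = (2W, b - W e) of Eq. (57), so that the local fields agree:
    h_i(X) = h~_i(Z) whenever Z = (X + e)/2."""
    e = np.ones(W.shape[0])
    return 2.0 * W, b - W @ e
```

The agreement W̃ Z + b̃ = W X + b follows because 2W (X + e)/2 + b − W e = W X + b; the test checks it on random parameters.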
One can find a stationary solution π̃(Z) by determining a solution to the detailed balance condition61
π̃(Z') / π̃(Z) = P̃(Z' ‖ Z) / P̃(Z ‖ Z') .    (62)
Note that
z'i h̃i(Z) = z'i ( e_i^T W̃ Z + b̃i ) = z'i ( Σ_j w̃ij zj + b̃i ) .
Comparison with Eq. (62) shows that the equilibrium distribution is given by
π̃(Z) = π̃(X) = e^{b̃^T Z} Π_i (1 + e^{h̃i(Z)}) / Z   for   Z = (1/2)(X + e)    (64)
where Z is the partition function (normalization factor)
Z = Σ_{values of Z} e^{b̃^T Z} Π_i (1 + e^{h̃i(Z)}) .
Using the facts that h̃i(Z) = hi(X) and b̃ = b − W e allows us to rewrite the equilibrium distribution Eq. (64) as
π̃(X) = π̃(Z) = e^{(1/2)(b^T X − e^T W X)} Π_i (1 + e^{hi(X)}) / Z'   for   X = 2Z − e ,    (65)
with
Z' = e^{(1/2)(e^T W e − e^T b)} Z   and   hi(X) = e_i^T W X + bi = e_i^T h(X) .
Now note that
−e^T W X = e^T b − e^T (W X + b) = e^T b − e^T h(X) = e^T b − Σ_i hi(X) ,
so that
π̃(X) = π̃(Z) = e^{(1/2) b^T X} Π_i cosh( (1/2) hi(X) ) / Z'' = e^{(1/2) b^T X} Π_i cosh( (1/2) e_i^T (W X + b) ) / Z''    (66)
with
X = 2Z − e   and   Z'' = 2^{−n} e^{−(1/2) e^T b} Z' .
Finally, undoing the value reassignments shown in Eq. (55), we obtain the equilibrium distribution
π̃(X) = e^{β b^T X} Π_i cosh( β e_i^T (W X + b) ) / Z''    (67)
for an appropriately defined partition function Z''. The distribution (67) should be compared to the equilibrium distribution for the single-site random-scan Glauber dynamics with quadratic energy E(X) = −(1/2) X^T W X − b^T X,
π(X) = e^{−βE(X)} / Z = e^{β((1/2) X^T W X + b^T X)} / Z = e^{β b^T X} e^{(β/2) X^T W X} / Z .
To make some further points of connection with reference [35], let us return to Eq.s (60) and (64). In particular, we can use these equations to compute the joint distribution of two time-adjacent states of the Markov chain,
P̃(Z^{k+1} = Z', Z^k = Z) = P̃(Z' ‖ Z) π̃(Z) = Z^{−1} e^{b̃^T Z} Π_i [ e^{z'i h̃i(Z)} / (1 + e^{h̃i(Z)}) ] (1 + e^{h̃i(Z)}) = e^{Z'^T h̃(Z) + b̃^T Z} / Z
or
P̃(Z^{k+1} = Z', Z^k = Z) = e^{Z'^T W̃ Z + b̃^T Z' + b̃^T Z} / Z .    (68)
Eq. (68) is Equation (3) of reference [35]. The symmetry between Z' and Z shown in (68),
P̃(Z', Z) = P̃(Z, Z') ,    (69)
exists because a Markov chain that has a stationary distribution which satisfies detailed balance is a reversible Markov chain.
As noted in [35], once P̃(Z^{k+1} = Z', Z^k = Z) is in hand, the marginalization
Σ_{values of Z} P̃(Z^{k+1} = Z', Z^k = Z)
must yield the stationary distribution π̃(Z').63 This is straightforward to show:64
Σ_{values of Z} P̃(Z', Z) = Z^{−1} e^{b̃^T Z'} Σ_{values of Z} e^{Z^T h̃(Z')} = Z^{−1} e^{b̃^T Z'} Σ_{values of Z} e^{Σ_{i=1}^n zi h̃i(Z')} .
Recalling that zi ∈ {0, 1}, we break up the sum over values of Z = (z1, ..., zn)^T as follows:
Σ_{values of Z} = Σ_{all components zero} + Σ_{one component nonzero} + Σ_{two components nonzero} + ··· + Σ_{all components nonzero} .
This results in
Σ_{values of Z} e^{Σ_{i=1}^n zi h̃i(Z')} = 1 + Σ_{i=1}^n e^{h̃i(Z')} + Σ_{i<j} e^{h̃i(Z') + h̃j(Z')} + Σ_{i<j<k} e^{h̃i(Z') + h̃j(Z') + h̃k(Z')} + ··· + e^{h̃1(Z') + ··· + h̃n(Z')}
= Π_{i=1}^n (1 + e^{h̃i(Z')}) .
63 Note that this is just a two-step way of saying that Eq. (61) holds.
64 Note that Z'^T W̃ Z + b̃^T Z' + b̃^T Z = Z^T b̃ + Z'^T h̃(Z) = Z'^T b̃ + Z^T h̃(Z').
and therefore
Σ_{values of Z} P̃(Z^{k+1} = Z', Z^k = Z) = e^{b̃^T Z'} Π_i (1 + e^{h̃i(Z')}) / Z = π̃(Z')
as expected.
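For a small network the whole chain of identities can be verified by brute force: build the parallel transition matrix of (54)/(60), the candidate stationary distribution (64), and check stationarity (61) and detailed balance (62) directly. A NumPy sketch (our own names, not the author's code):

```python
import numpy as np
from itertools import product

def parallel_chain(Wt, bt):
    """Build the fully parallel transition matrix
    P~(Z'||Z) = prod_i e^{z_i' h~_i(Z)} / (1 + e^{h~_i(Z)})
    together with the normalized candidate stationary distribution
    pi~(Z) proportional to e^{b~'Z} prod_i (1 + e^{h~_i(Z)}), Eq. (64)."""
    n = len(bt)
    states = [np.array(s, dtype=float) for s in product((0, 1), repeat=n)]
    P = np.zeros((len(states), len(states)))
    for k, Z in enumerate(states):
        h = Wt @ Z + bt
        for kp, Zp in enumerate(states):
            P[kp, k] = np.prod(np.exp(Zp * h) / (1.0 + np.exp(h)))
    pi = np.array([np.exp(bt @ Z) * np.prod(1.0 + np.exp(Wt @ Z + bt))
                   for Z in states])
    return P, pi / pi.sum()
```

Note that detailed balance hinges on W̃ being symmetric with zero diagonal, exactly as assumed in the derivation.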
or latent units. Note that G = V ∪ H and V ∩ H = ∅. For convenience, and with no loss of generality, we assume that the components of X are ordered such that
X = (Xv^T, Xh^T)^T   with   Xv = XV and Xh = XH ,
and we define
nv = |V| ,  nh = |H|  =⇒  nv + nh = n = |G| ,
so that the binary random vectors Xv, Xh and X satisfy
Xv ∈ R^{nv} ,  Xh ∈ R^{nh} ,  X ∈ R^n .
Note that with this partitioning, the binary variables xi, i = 1, ..., nv are all of the components of Xv and the first nv components of X.
Let us focus on the Boltzmann Machine distribution.71 Consistent with the state partitioning X = (Xv^T, Xh^T)^T we partition W and b as
W = [ Wvv  Wvh ; Whv  Whh ]   and   b = (bv^T, bh^T)^T    (70)
with
Wvv = Wvv^T ,  Whh = Whh^T ,  and  Wvh = Whv^T ,    (71)
where Wvv and Whh both have zero-valued diagonals. The Boltzmann distribution then has the form
π(X) = π(Xv, Xh) = (1/Z) e^{−βE(Xv, Xh)}    (72)
with
E(X) = −(1/2) X^T W X − b^T X = Ev^iso(Xv) + Eh^iso(Xh) − Xv^T Wvh Xh    (73)
where
Ev^iso(Xv) = −(1/2) Xv^T Wvv Xv − bv^T Xv   and   Eh^iso(Xh) = −(1/2) Xh^T Whh Xh − bh^T Xh    (74)
are the energies that the visible and hidden units respectively would have if they were isolated by setting the cross-coupling energy to zero (i.e., by taking Wvh = 0). In general, for Wvh ≠ 0 these are not equal to the energy functions Ev(Xv) and Eh(Xh) arising from marginalizing the joint distribution.72
Our working assumption is that the existence of the latent variables XH means that the joint distribution for Xv and Xh has the particularly nice Boltzmann Machine form (72), namely the Boltzmann distribution with an energy function that is a simple quadratic form in the joint variables X = (Xv^T, Xh^T)^T. However, although the quadratic energy form is assumed to hold for the joint distribution, in general it does not hold for the marginal distribution π(Xv) [9, 10].
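This claim can be illustrated numerically: couple three visible ±1 spins to a single biased hidden spin, marginalize the hidden spin exactly, and Möbius-invert ln π(Xv) over subsets of visible sites to read off the interaction coefficients of each order. A nonzero third-order coefficient shows the marginal is not of quadratic-energy form. This is a sketch under our own naming, not the author's code:

```python
import numpy as np
from itertools import product, combinations

def visible_log_coeffs(W, b, nv):
    """Marginalize a +-1 Boltzmann machine over its hidden units and
    Moebius-invert ln pi(Xv) to expose interaction coefficients of every
    order (in the 0/1 coordinates z_i = (v_i + 1)/2)."""
    n = len(b)
    log_pv = {}
    for V in product((-1, 1), repeat=nv):
        tot = 0.0
        for H in product((-1, 1), repeat=n - nv):
            x = np.array(V + H, dtype=float)
            tot += np.exp(0.5 * x @ W @ x + b @ x)   # unnormalized pi(V, H)
        log_pv[V] = np.log(tot)
    coeffs = {}
    for k in range(nv + 1):
        for S in combinations(range(nv), k):
            c = 0.0
            for m in range(k + 1):
                for T in combinations(S, m):
                    V = tuple(1 if i in T else -1 for i in range(nv))
                    c += (-1) ** (k - m) * log_pv[V]
            coeffs[S] = c
    return coeffs
```

With zero hidden bias the marginal is an even function and the third-order term vanishes; the bias breaks that symmetry, so the test uses a biased hidden unit.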
Suppose the visible units Xv correspond to objects in a visible world of interest on which we make observations. Ideally we would know the true distribution πtrue(Xv), but often we do not, or it is too complex to work with tractably. One might use a Boltzmann Machine distribution on the visible units alone as a model for the unknown distribution πtrue(Xv), but this is too restrictive. Although it can capture all second-order behavior between sites, a quadratic energy model represents higher-order site interactions (e.g., between three or more sites) poorly or not at all. However, if one marginalizes over the latent variables of a Boltzmann Machine (Ising model) distribution,
π(Xv) = Σ_{Xh-values} π(Xv, Xh) ,    (75)
71 I.e., on the quadratic-energy Boltzmann Distribution for a binary-components random vector, aka the Ising model.
72 If the simplifying assumptions that Wvv = 0 and Whh = 0 are made, then the resulting distribution is known as the Harmonium distribution or the
Restricted Boltzmann Machine (RBM) distribution.
then one obtains a binary Boltzmann distribution that is not of Boltzmann Machine form73 and which can provide a reasonably
good approximate model for a general Boltzmann distribution on Xv of the fully general form shown in Eq. (47). Indeed, for
a sufficiently large number of hidden variables, the marginalized distribution π(Xv ) can be made arbitrarily close to the true
distribution πtrue (Xv ) [9, 10]. This is called the universal approximation property of the marginalized Boltzmann machine,
which is discussed further in Section 6.3.
Motivated by the universal approximation property, for a given, fixed number of hidden units, XH , one determines the
values of the parameters, θ = vec(W, b, Z), of a joint Boltzmann machine distribution πθ (Xv , Xh ) that result in a best fit
of the marginal πθ (Xv ) to the true distribution πtrue (Xv ). This is done by collecting repeated independent measurements of
Xv (which presumably is repeatedly drawn from the unknown distribution πtrue (Xv )) and then applying the procedure of
maximum likelihood estimation (MLE)74 to obtain an estimate of the unknown parameter vector θ. The MLE procedure for
learning a best fit of πθ (Xv ) to πtrue (Xv ) is detailed below in Section 6.4.
This distribution is completely specified by knowledge of the values of the Kv = 2^{nv} parameters
ϑv^true = (ϑ_{0;v}^true, ..., ϑ_{12···nv;v}^true)^T ∈ R^{Kv} ,   Kv = 2^{nv} .
Given nv visible binary variables Xv and assuming the existence of nh hidden binary variables Xh, we set X = (Xv^T, Xh^T)^T ∈ R^n for
n = nv + nh    (77)
and define a latent variable model on X which is of Boltzmann Machine (Ising model) form,76
πθ(X) = (1/Zθ) e^{−βEθ(X)}    (78)
with
Eθ(X) = −(1/2) X^T W X − b^T X ,   θ = vec(W, b, Z) ∈ R^p    (79)
and
Zθ = Σ_{values of X} e^{−βEθ(X)} .    (80)
The latent variable model (78)–(79) has
p = C(n, 0) + C(n, 1) + C(n, 2) = 1 + n + n(n − 1)/2 = (n² + n + 2)/2 = (n(n + 1) + 2)/2    (81)
parameter values as a function of n = nv + nh. As an example, take nv = n = 100; then
Kv = 2^100 = (2^10)^10 ≈ 10^30   whereas   p = ((100)(101) + 2)/2 = 50(101) + 1 = 5,051 .
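The count (81) and its comparison with Kv can be checked directly; a one-function sketch (the function name is ours):

```python
from math import comb

def ising_param_count(n):
    """p = C(n,0) + C(n,1) + C(n,2) = (n^2 + n + 2)/2, Eq. (81): one
    normalization constant, n biases b_i, and n(n-1)/2 distinct weights
    w_ij of the symmetric, zero-diagonal matrix W."""
    return comb(n, 0) + comb(n, 1) + comb(n, 2)
```

Even at nv = 100 the polynomial count p is vanishingly small compared to Kv = 2^100.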
73 I.e., marginalization of a quadratic-energy binary-components Boltzmann Machine distribution (Ising model) does not yield a distribution of the same quadratic energy form [9, 10]. Put another way, the class of Boltzmann Machine distributions is not closed under marginalization. However, the class of positive distributions is closed under marginalization, so it is true that the marginal of a binary Boltzmann Machine (Ising) model is a binary Boltzmann distribution.
74 The MLE procedure is described in Appendix A.
75 I.e., by an n-order polynomial where in each term the partial degree of each variable, xi, is at most one. See Appendix D.2. The parameter ϑ_{0;v}^true serves to normalize the distribution to sum to one and is equivalent to knowing the partition function for the Boltzmann distribution πtrue(Xv). Note that the parameter β has been absorbed into the coefficients.
76 See Equations (70)–(74).
However, note that the computation of the partition function for Eq. (78),
Zθ = Σ_{all values of X} e^{−βEθ(X)} ,
requires a sum over K = 2^n terms, which is usually a truly vast number. One therefore should not be surprised to learn that a driving force throughout the history of recent machine learning research has been the issue of how to circumvent the need to compute the partition function.
With the latent variable Boltzmann Machine model (78) at hand, it is natural to approximate πtrue(Xv) by the marginalization
πθ(Xv) = Σ_{all values of Xh} πθ(Xv, Xh) .    (82)
Note that the number of terms in the marginalization sum can also be vast, so how to handle this will need to be addressed as well.
The question at hand is how well the Kv = 2^{nv}–parameter true distribution shown in Eq. (76) can be approximated by the p = (n² + n + 2)/2–parameter latent variable model distribution given by Equations (78) and (82). That the latent variable model (78) can exactly represent (76) for a sufficient number of hidden units is proved in references [33, 35] and discussed below in Section 6.3.3.
Generally Kv degrees of freedom are needed to specify the true distribution (76), and therefore a necessary condition for the p-parameter model (78)–(82) to fit an arbitrarily chosen true distribution is that the model have at least as many degrees of freedom, p ≥ Kv. Solving the quadratic equation
p = (n² + n + 2)/2 = 2^{nv} = Kv
for a positive value of n = nv + nh gives the necessary condition
nh ≥ ( √(2^{nv+3} − 7) − 2nv − 1 ) / 2 = O(2^{nv/2}) .    (83)
2
We see that in the most general situation, the number of hidden units necessarily must grow at least exponentially in the
number of visible units to ensure an exact fit of the marginalized Boltzmann distribution to the true visible units distribution.
Of course, the practical hope is that actual distributions encountered in the world can be well approximated using far fewer hidden neurons.
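The necessary condition (83) is easy to tabulate: for each nv, the smallest nh making p at least Kv can be found by direct search, and it matches the closed form in (83). A sketch (the function name is ours):

```python
def min_hidden_units(nv):
    """Smallest nh satisfying the necessary condition (83):
    (n^2 + n + 2)/2 >= 2^nv for n = nv + nh.  Necessary, not sufficient:
    it only says the model has enough parameters, not that a fit exists."""
    target = 2 ** nv
    nh = 0
    while True:
        n = nv + nh
        if (n * n + n + 2) // 2 >= target:
            return nh
        nh += 1
```

The exponential growth O(2^{nv/2}) is visible already for small nv; the test cross-checks the search against the closed-form bound of Eq. (83).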
πθ (Xh | Xv ) = πtrue (Xh | Xv ) > 0 and πθ (Xv ) = πtrue (Xv ) > 0 (85)
distribution. Given a Boltzmann Machine distribution, one can compute its moments up to second order and use them to compute the MaxEnt distribution,
which must be the same distribution as the original one.
Boltzmann Machine distributions in Eq. (84) will be equal if and only if their moments up to second order are equal. To denote this, define the model moments up to second order by
m̂0(θ) = Eθ{x0} = 1 (with the convention x0 ≡ 1),   m̂i(θ) = Eθ{xi},   and   m̂ij(θ) = Eθ{xi xj}    (87)
for 1 ≤ i < j ≤ n. Note that there are p = (n² + n + 2)/2 moments in all, with n = nv + nh, and that for binary vectors whose components take values 0 or 1, or values ±1, there are no other second-order moments to consider.79
In principle one can equate the model moments (87) to the true moments (86) and solve the resulting p (generally nonlinear) equations to determine the unknown parameter vector θ, a procedure known as the Method of Moments (MOM) [17]. Because we are working with a marginalization model, we can recast the typical MOM equation as follows. Equating the model moments (87) to the true moments (86) results in the set of equations
Σ_{values of Xv} m̂0(θ | Xv) ( πtrue(Xv) − πθ(Xv) ) = 0   with m̂0(θ | Xv) ≡ 1
Σ_{values of Xv} m̂i(θ | Xv) ( πtrue(Xv) − πθ(Xv) ) = 0   for 1 ≤ i ≤ n    (88)
Σ_{values of Xv} m̂ij(θ | Xv) ( πtrue(Xv) − πθ(Xv) ) = 0   for 1 ≤ i < j ≤ n
where M(θ) is a p × Kv matrix that generally is a nonlinear function of θ. This is the MOM parameter estimation problem recast in terms of the marginal distributions πtrue^v and πθ^v. Summarizing the above discussion, and that of the previous section, we have:

Let πtrue(X) = πtrue(Xv, Xh) be a Boltzmann Machine distribution for the n-component binary random vector X = (x1, ..., xn) that partitions as X = vec(Xv, Xh), where Xv has nv components and Xh has nh components, n = nv + nh. Let πθ(X) be a fully general Boltzmann Machine distribution for the sample random vector X as shown in Equations (78)–(82), and let n be large enough that πtrue = πθ for some value of the parameter vector θ. Let πtrue^v and πθ^v respectively denote the marginals πtrue(Xv) and πθ(Xv). Then
πtrue^v = πθ^v  ⇐⇒  M(θ) ( πtrue^v − πθ^v ) = 0
where M(θ) is p × Kv with Kv = 2^{nv} and p = (n² + n + 2)/2 ≥ Kv.
79 See Appendix D.2.
We shall see in Section 6.4 that the Boltzmann Machine learning rule of [1, 22, 23] iteratively seeks a solution to the MOM
problem (89). Once we accept that marginalization of a Boltzmann Machine (Ising model) distribution πθ (X) = πθ (Xv , Xh )
can exactly match an arbitrarily chosen Xv –distribution πtrue (Xv ) for an appropriate number, n, of elements of X, and choice
of parameter vector θ, then the primary design consideration is the number of hidden units nh that determines n = nv + nh
subject to the necessary condition shown in Eq. (83).
Note that the development is agnostic as to whether each binary variable xi takes the values 0 or 1, or the values ±1. Assume that xi ∈ {0, 1}. Then the moments are simple to understand: mi is the probability that xi takes the value one and mij is the probability that xi = xj = 1. Since all the probabilities under consideration are positive, these moments, as probabilities, are all nonzero. Thus, in this case, all of the elements of M are nonzero,80 as are all of the values of πtrue^v and πθ^v.
Note that if the "tall" (p ≥ Kv) matrix M(θ) has full column rank for the value θ, then it must be true that πθ^v = πtrue^v at that same value of θ. Therefore, perhaps one can find a sufficient condition on the number, nh, of hidden units by determining
nh* = argmin_{nh} { nh | rank M(θ) = Kv for some parameter vector θ } .
If such a value exists, and is finite, nh* < ∞, then a sufficient condition for πθ^v = πtrue^v to hold for some θ would be nh ≥ nh*. Thus to have a marginalized Ising model match a general categorical distribution for Xv requires a number of hidden units in the range
O(2^{nv/2}) = ( √(2^{nv+3} − 7) − 2nv − 1 ) / 2 ≤ nh ≤ 2^{nv} − nv − 1 = O(2^{nv}) .    (92)
2
In all cases the number of independent parameters needed to fit a marginalized Boltzmann Machine distribution to the most general categorical distribution is
p ≥ Kv = 2^{nv}    (93)
where Kv is the number of parameters (equivalent to Kv − 1 probability values) needed to specify the most general categorical distribution for the binary, nv-component random vector Xv.
Examination of the proof given in [35], discussed below, shows that (91) is also a sufficient condition for the marginalization of a Restricted Boltzmann Machine (RBM) to match a general distribution on Xv. Because the RBM has a restricted architecture of Boltzmann Machine type,81 if the marginalized RBM is a universal approximator, then so is the Boltzmann Machine. In fact, reference [35] shows universality of a further restriction of the RBM architecture.
The remainder of this section details the proof of [35]. We start by imposing the RBM restriction on the Boltzmann
Machine latent variable model shown in Equations (70)–(75), which corresponds to
Restricted Boltzmann Machine (RBM) Assumption: Wvv = 0 and Whh = 0.
To simplify the notation we take Xv = V with components vi ∈ {0, 1} for i = 1, ..., nv, and we take Xh = H with components hj ∈ {0, 1} for j = 1, ..., nh. The corresponding realization values are V, vi, H, and hj. There is no loss of generality in assuming that the binary random variables vi and hj take values 0 or 1.82 We also define
A = βWhv ∈ R^{nh × nv} ,   r = βbh ∈ R^{nh} ,   s = βbv ∈ R^{nv} ,   and   s0 = −ln Z .
80 I.e., M is a positive matrix for xi ∈ {0, 1}.
81 See footnote 72.
82 A change from xi ∈ {−1, +1} to xi ∈ {0, 1} corresponds to the invertible transformation xi ← (1/2)(xi + 1), which does not change the functional form of the Ising model Equations (70)–(75). (However, the values of the matrix W and the vector b are changed.) See the discussions in Sections D.2 and 5.3.4. The mathematical analysis is much more straightforward with the 0–1 choice, which is used in the proof given in [35].
With these changes the latent variable model (70)–(75) becomes the RBM model
πθ(X) = πθ(V, H) = e^{−Eθ(X)} / Z = e^{H^T A V + H^T r + s^T V + s0}    (94)
with
θ = vec(A, r, s, s0) .
We then make the further restrictive assumption on A that H^T A V has the form
H^T A V = Σ_{i=1}^{nh} Σ_{j=1}^{ni} aij hi vj = Σ_{i=1}^{nh} ai hi Σ_{j=1}^{ni} v̄j^(i) = Σ_{i=1}^{nh} ai S_i^{εi}(V^(i)) hi    (95)
where
S_i^{εi}(V^(i)) = Σ_{j=1}^{ni} v̄j^(i)    (96)
for i = 1, ..., nh, with
V^(i) = (v_{i1}, ..., v_{i,ni})^T ,   Cardinality(V^(i)) = |V^(i)| = ni ≤ nv ,   i = 1, ..., nh ,   nh = 2^{nv} = Kv    (97)
and
v̄j^(i) = v_{ij} for 1 ≤ i1 < ··· < ij < ··· < i_{ni} ≤ nv ,   and   v̄_{ni}^(i) = εi v_{i,ni} with εi = ±1 .    (98)
The model restrictions (95)–(98) merit some discussion. Let V = {1, ..., nv} be the nodes on which the elements of the visible units V = (v1, ..., v_{nv})^T are defined. Let P(V) be the power set of V,
Vi ∈ P(V) ⇐⇒ Vi ⊂ V ,
and
ni = |Vi| ≤ |V| = nv .
Note that the total number of subsets is |P(V)| = 2^{nv} = Kv, so that nh = Kv = 2^{nv}. Let each indexed subset Vi be distinct, so that i ≠ j =⇒ Vi ≠ Vj, and order the elements of Vi as 1 ≤ i1 < i2 < ··· < i_{ni} ≤ nv. Then
S_i^{εi}(V^(i)) = Σ_{j=1}^{ni} v̄j^(i) = Σ_{j ∈ Vi \ {i_{ni}}} vj + εi v_{i,ni}    (99)
with
ni = |V^(i)| for i = 1, ..., nh ,   nh = 2^{nv} = Kv ,
is a function of the elements of V that are defined on the ni nodes in the subset Vi ⊂ V, where there are nh = 2^{nv} = Kv such subsets.
From (94), (95) and (99) we have
πθ(X) = πθ(V, H) = exp( Σ_{i=1}^{nh} hi ( ai S_i^{εi}(V^(i)) + ri ) + s^T V + s0 ) = e^{s^T V + s0} exp( Σ_{i=1}^{nh} hi Ci )    (100)
where Ci = Ci^{εi}(V^(i)) = ai S_i^{εi}(V^(i)) + ri.
Comment
Note that a nonzero value of ai in Eq. (100) serves to couple all of the ni = |V^(i)| binary variables v_{ij} in the subset V^(i) through the single latent variable hi,
ai hi S_i^{εi}(V^(i)) = ai hi ( v_{i1} + ··· + v_{i,ni−1} + εi v_{i,ni} ) .
In particular, for ni ≥ 3, the coupling through the single latent variable hi allows for higher order (beyond second order) coupling between the binary variables in V. Note that the model (100) uses this single-latent-variable coupling for every possible subset of the elements of V,83 and in this way all possible higher correlations amongst the elements of V are captured.
and πtrue(V) given by the fully general nv-th order polynomial form shown in Equation (76),
ln πtrue(V) = ϑ0 + Σ_{i=1}^{nv} ϑi vi + Σ_{i<j} ϑij vi vj + Σ_{i<j<k} ϑijk vi vj vk + ··· + ϑ_{12···nv} v1 v2 ··· v_{nv} .    (102)
Comment
Note that showing πtrue (V) = πθ (V) for πθ (V) given by Eq. (101) will demonstrate that marginalizations of both
the Restricted Boltzmann Machine (RBM) and Boltzmann Machine models yield universal approximators since
the model (94)–(100) is a restricted version of both of those model families.
Recalling that hi ∈ {0, 1}, we break up the sum over the Kh = 2^{nh} values of H = (h1, ..., h_{nh})^T shown in Eq. (101) into (nh + 1) smaller sums as follows:
Σ_{values of H} = Σ_{all components zero} + Σ_{one component nonzero} + Σ_{two components nonzero} + ··· + Σ_{all components nonzero} .
This results in
Σ_{values of H} e^{Σ_{i=1}^{nh} hi Ci} = 1 + Σ_{i=1}^{nh} e^{Ci} + Σ_{i<j} e^{Ci + Cj} + Σ_{i<j<k} e^{Ci + Cj + Ck} + ··· + e^{C1 + ··· + C_{nh}}
= Π_{i=1}^{nh} (1 + e^{Ci(V^(i))}) = Π_{i=1}^{nh} (1 + e^{ai S_i^{εi}(V^(i)) + ri}) .
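The factorization of the hidden-unit sum into a product is easy to confirm by brute force for a handful of hidden units; a short NumPy sketch (names ours):

```python
import numpy as np
from itertools import product

def marginalize_hidden(C):
    """Check the factorization sum_H exp(sum_i h_i C_i) = prod_i (1 + e^{C_i})
    over h_i in {0,1} -- the step that yields the product form of Eq. (103)."""
    brute = sum(np.exp(np.dot(H, C)) for H in product((0, 1), repeat=len(C)))
    closed = np.prod(1.0 + np.exp(np.asarray(C)))
    return brute, closed
```

The brute-force sum has 2^{nh} terms while the product has nh factors, which is exactly why the RBM marginal (103) is tractable.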
Therefore
πθ(V) = e^{s^T V + s0} Π_{i=1}^{nh} (1 + e^{ai S_i^{εi}(V^(i)) + ri})    (103)
so that84
log πθ(V) = Σ_{i=1}^{nh} log( 1 + e^{ai S_i^{εi}(V^(i)) + ri} ) + s^T V + s0 = Σ_{i=1}^{nh} φ_i^{εi}(V^(i)) + s^T V + s0    (104)
where φ_i^{εi}(V^(i)) = log( 1 + e^{ai S_i^{εi}(V^(i)) + ri} ).
with ρ = ϑ_{12···nv}, and where POLY(V; k) denotes a k-th order polynomial in the elements of V. Note that there are nh = 2^{nv} functions φ_i^{εi}(V^(i)) shown on the right-hand side (RHS) of (106). We will need to determine an appropriate choice of parameters for each one of these functions in order to match the most general form of the distribution of the binary vector V.
The fact that condition (106) can be made true by an appropriate choice of parameters is a consequence of the following lemma.
LEMMA [35]
Choose any positive integer N > 1. Consider a collection of N binary variables Z = (z1, ..., zN)^T, zi ∈ {0, 1}, i = 1, ..., N. For real a and r and ε = ±1, define the parameterized function
φ_{a,r,ε}(Z) ≜ log( 1 + e^{a(z1 + ··· + z_{N−1} + ε zN) + r} ) .
Choose any real number ρ. Then for ε = sign(ρ) there exist values of a and r such that
φ_{a,r,ε}(Z) = ρ z1 z2 ··· zN + POLY(Z; N − 1)    (107)
where POLY(Z; N − 1) is some (N−1)-order polynomial in the elements of Z. In particular, to attain the value of ρ one can always determine some positive value of a and set
r = (1/2 − N) a = −(N − 1/2) a < 0   if   ε = sign(ρ) = +1
r = (3/2 − N) a = −(N − 3/2) a < 0   if   ε = sign(ρ) = −1    (108)
Proof: Because φ(Z) = φ_{a,r,ε}(Z) is a function of a finite, binary vector Z, it can always be represented by an N-order polynomial of the form shown on the right-hand side of Eq. (107):85
φ_{a,r,ε}(Z) = c0 + Σ_{i=1}^N ci zi + Σ_{i<j} cij zi zj + ··· + Σ_{i1<···<iN−1} c_{i1 i2 ··· iN−1} z_{i1} ··· z_{iN−1} + c_{1···N} z1 ··· zN ,
where c_{1···N} = gN(a, r, ε). The trick, then, is to find the coefficient gN(a, r, ε) of the N-order term of the polynomial expansion of φ_{a,r,ε}(Z) as a function of (a, r, ε) and show that one can always choose values of (a, r, ε) to force gN(a, r, ε) = ρ regardless of the value of ρ.
First assume that ρ ≥ 0 and take ε = sign(ρ) = 1. Then one can determine that
gN(a, r, 1) = Σ_{j=0}^N (−1)^{N−j} C(N, j) log( 1 + e^{aj + r} )    (109)
via an iterative procedure based on selectively setting various subsets of the variables zi to zero and the variables on the complements of those subsets to one. Starting this procedure we have
g0(a, r, 1) = c0 = log(1 + e^r) = Σ_{j=0}^0 (−1)^{0−j} C(0, j) log(1 + e^{aj + r})
g1(a, r, 1) = ci = log(1 + e^{a+r}) − c0 = log(1 + e^{a+r}) − log(1 + e^r) = Σ_{j=0}^1 (−1)^{1−j} C(1, j) log(1 + e^{aj + r})
g2(a, r, 1) = cij = log(1 + e^{2a+r}) − ci − cj − c0 = log(1 + e^{2a+r}) − 2 log(1 + e^{a+r}) + log(1 + e^r) = Σ_{j=0}^2 (−1)^{2−j} C(2, j) log(1 + e^{aj + r})
g3(a, r, 1) = cijk = log(1 + e^{3a+r}) − cij − cik − cjk − ci − cj − ck − c0 = log(1 + e^{3a+r}) − 3 log(1 + e^{2a+r}) + 3 log(1 + e^{a+r}) − log(1 + e^r) = Σ_{j=0}^3 (−1)^{3−j} C(3, j) log(1 + e^{aj + r})
···
noting that then ja + r < 0 for j = 0, 1, ..., N − 1. In particular, to ensure that the conditions shown in (110) hold, one can always enforce the constraint
r = (1/2 − N) a = −(N − 1/2) a < 0 .
Given that the conditions in (110) are satisfied, gN(ta, tr, 1) is zero for t = 0 and goes to infinity as t → ∞; thus, via an appropriate choice of the value of t, one can have gN(ta, tr, 1) = ρ for any nonnegative value of ρ.
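The constructive step of this argument can be carried out numerically: compute gN from Eq. (109), impose r = (1/2 − N) a, and bisect on the common scale t until gN equals the desired ρ. A sketch with names of our own:

```python
from math import comb, log, exp

def g_top(N, a, r):
    """g_N(a, r, 1) of Eq. (109): the coefficient of z_1...z_N in the
    multilinear expansion of phi(Z) = log(1 + e^{a(z_1+...+z_N)+r})."""
    return sum((-1) ** (N - j) * comb(N, j) * log(1.0 + exp(a * j + r))
               for j in range(N + 1))

def fit_top_coeff(N, rho, t_hi=400.0):
    """For rho >= 0, bisect on the scale t (with a = t, r = (1/2 - N) t),
    using that g_N is 0 at t = 0 and grows without bound, to find a with
    g_N(a, (1/2 - N) a, 1) = rho -- the constructive step of the lemma."""
    lo, hi = 0.0, t_hi
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g_top(N, mid, (0.5 - N) * mid) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

At a = r = 0 every term is log 2 and the signed binomial sum cancels, so gN starts at zero, consistent with the claim above.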
Now assume that ρ is nonpositive, ρ ≤ 0. Define the new 0–1 binary variables yi by yi = zi for i = 1, ..., N − 1 and yN = 1 − zN, and note that the right-hand side of Eq. (107) becomes
ρ z1 z2 ··· z_{N−1} zN + POLY(Z; N − 1) = (−ρ) y1 y2 ··· y_{N−1} yN + ( ρ y1 y2 ··· y_{N−1} + POLY(Z; N − 1) ) ≜ ρ' y1 ··· yN + POLY(Y; N − 1)
with ρ' ≜ −ρ ≥ 0. For ρ' ≥ 0, we have just shown that there exist values of a' > 0 and r' such that
log( 1 + e^{a'(y1 + ··· + y_{N−1} + yN) + r'} ) = ρ' y1 y2 ··· y_{N−1} yN + POLY(Y; N − 1)
which we rewrite as
log( 1 + e^{a(z1 + ··· + z_{N−1} − zN) + r} ) = ρ z1 z2 ··· z_{N−1} zN + POLY(Z; N − 1) .
With the validity of Eq. (107) having been established, we can now verify the universal approximation condition Eq. (106) by iteratively selecting the parameters ai (and hence the parameters ri, via the constraint (108)) in the functions φ_i^{εi} on the left-hand side of (106) as follows:

Stage 0. Choose ai (and hence ri via (108)) so that a single (note that 1 = C(nv, nv)) φ_i^{εi}(V^(i); ai)-term on the left-hand side (LHS) of (106) is fitted to the single highest-order (nv-th order) monomial term on the right-hand side (RHS),
φ_i^{εi}(V^(i); ai) |_{ni = nv} = ρ v1 ··· v_{nv} + Q(nv − 1) ,
where ρ v1 ··· v_{nv} is the highest-order term on the RHS of (106) and Q(nv − 1) is an (nv − 1)-order polynomial in V. Call the entire right-hand side of (106) F0(V) and cancel the nv-order monomial term by forming F1(V) = F0(V) − φ_i^{εi}(V^(i); ai). F1(V) is a polynomial of at most degree nv − 1 with at most C(nv, 1) = nv monomial terms of order nv − 1.86

Stage 1. For each of the up to nv monomials of order nv − 1 in F1(V), fit individual φ_i^{εi}(V^(i); ai) terms shown on the left-hand side (LHS) of (106), with ni = nv − 1, to match one of those monomials for highest-order-term cancellation purposes. Once this is done, subtract all of these fitted functions from F1(V) to create a function F2(V) that is a polynomial of degree at most nv − 2 with at most C(nv, 2) monomial terms of order nv − 2.
.. .. ..
. . .
Stage k − 1. Continuing in this manner, assume that an at-most $(n_v - (k-1))$-order polynomial function $F_{k-1}(V)$ has been constructed in the previous Stage $k-2$ that contains at most $\binom{n_v}{k-1}$ monomial terms of order $(n_v - (k-1)) = (n_v - k + 1)$. For each of these monomial terms of order $(n_v - k + 1)$, fit one of the terms $\phi_i^{n_i}(V(i); a_i)$, $n_i = n_v - k + 1$, on the left-hand side (LHS) of (106) to it for highest-order-term cancellation purposes. After all $(n_v - k + 1)$-order monomial terms have been fitted, subtract the fitted $\phi_i^{n_i}(V(i); a_i)$ functions from $F_{k-1}(V)$ to form a function $F_k(V)$ that is a polynomial of degree at most $(n_v - k)$ containing at most $\binom{n_v}{k}$ monomial terms.
$\vdots$
Stage $n_v$ − 2. Assume that an at-most $(n_v - (n_v - 2)) = 2$-order (quadratic order) polynomial function $F_{n_v-2}(V)$ has been constructed in the previous Stage $n_v - 3$ that contains at most $\binom{n_v}{n_v-2} = \binom{n_v}{2}$ monomial terms of order $(n_v - (n_v - 2)) = 2$. For each of these monomial terms of order 2, fit one of the terms $\phi_i^{n_i}(V(i); a_i)$, $n_i = 2$, on the left-hand side (LHS) of (106) to it for highest-order-term cancellation purposes. After all order-2 monomial terms have been fitted, subtract the corresponding $\phi_i^{n_i}(V(i); a_i)$ functions from $F_{n_v-2}(V)$ to form a function $F_{n_v-1}(V)$ that is a polynomial of degree at most $(n_v - (n_v - 1)) = 1$ containing at most $\binom{n_v}{n_v-1} = \binom{n_v}{1} = n_v$ monomial terms.
For the remaining $n_v$ unused functions $\phi_i^{n_i}(V(i); a_i)$ (recall that we started the iterative process with $n_h = 2^{n_v}$ of them), set the parameters $a_i$ to zero. Subtract the resulting constant values of the $n_v$ remaining functions $\phi_i^{n_i}(V(i); a_i)$ from $F_{n_v-1}(V)$:
$$F_{n_v-1}(V) \leftarrow F_{n_v-1}(V) - \sum \text{Constants}\,.$$
86 Each such $(n_v-1)$-order monomial term has the form $a_{i_1 \cdots i_{n_v-1}}\, v_{i_1} \cdots v_{i_{n_v-1}}$.
Stage $n_v$ − 1. At most a polynomial of degree one in $V$ remains. Fit the $\binom{n_v}{n_v-1} = \binom{n_v}{1} = n_v$ parameter term $s^T V$ shown on the LHS of Eq. (106) to the sum of the remaining $n_v$ linear-order monomials. Subtract the fitted function $s^T V$ from $F_{n_v-1}(V)$, $F_{n_v}(V) = F_{n_v-1}(V) - s^T V$. Then $F_{n_v}(V)$ is at most a nonzero constant.
Stage $n_v$. At most a nonzero constant function (zeroth-order polynomial), $F_{n_v}(V) = \text{constant}$, remains. Fit the one ($1 = \binom{n_v}{n_v} = \binom{n_v}{0}$) constant $s_0$ shown on the LHS of (106) to it. Subtract $s_0$ from $F_{n_v}(V)$. What remains is the value zero.
Conclusion. Sum all the fitted functions $\phi_i^{n_i}(V(i); a_i)$ and $s^T V$, and the fitted constant $s_0$, as shown in Eq. (104), to form
$$\pi_{\text{true}}(V) \overset{\text{fitted}}{=} e^{s^T V + s_0} \prod_{i=n_v+1}^{n_h} e^{\phi_i^{n_i}(V(i);\, a_i)} = e^{s^T V + s_0} \prod_{i=n_v+1}^{n_h} \left(1 + e^{a_i S_i(V(i)) + r_i}\right) = \pi_\theta(V)\,.$$
Note from Eq.s (95)–(100) and the above procedure that every distinct, nonzero value $a_i \neq 0$, $i > n_v$, is associated with a distinct hidden unit $h_i$. Counting the maximum number of parameters (and hence hidden units) needed from Stage 0 through, and including, Stage $n_v - 2$, we determine that at most
$$\sum_{j=2}^{n_v} \binom{n_v}{j} = \sum_{j=0}^{n_v} \binom{n_v}{j} - \binom{n_v}{1} - \binom{n_v}{0} = 2^{n_v} - n_v - 1$$
hidden units are needed. In Stages $n_v - 1$ and $n_v$ we encounter the $n_v$ visible units $v_i$ and $n_v + 1$ additional parameters (including the normalization parameter). Thus a sufficient condition for a marginalized Boltzmann Machine latent variable model to fit the most general possible binary distribution on $n_v$ visible units is for the model to have (at most87) $n_h = 2^{n_v} - n_v - 1$ hidden units, for a total of $n = n_v + n_h = 2^{n_v} - 1$ units in all.
The above procedure determines that the restricted model (94)–(95) has $2^{n_v} - n_v - 1$ parameters associated with the hidden units, $n_v$ parameters associated with the visible units, and one normalization parameter, for a total of $p = 2^{n_v}$ parameters. From a practical perspective this is not an encouraging result, given that a general categorical distribution $\pi_{\text{true}}(V)$ on an $n_v$-element binary vector $V$ itself requires the specification of $2^{n_v}$ probabilities.
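The counting above is easy to verify mechanically. The following sketch (the function names are mine) tabulates, for small $n_v$, the sufficient hidden-unit count consumed by Stages 0 through $n_v - 2$ and checks it against the binomial identity used in the derivation:

```python
from math import comb

def sufficient_hidden_units(nv: int) -> int:
    # Hidden units consumed in Stages 0 through nv - 2:
    # one per monomial of each order j = 2, ..., nv.
    return sum(comb(nv, j) for j in range(2, nv + 1))

def total_parameters(nv: int) -> int:
    # nh hidden-unit parameters + nv visible parameters + 1 normalization.
    return sufficient_hidden_units(nv) + nv + 1

for nv in range(2, 10):
    nh = sufficient_hidden_units(nv)
    # Binomial identity from the text: sum_{j=2}^{nv} C(nv, j) = 2^nv - nv - 1.
    assert nh == 2**nv - nv - 1
    # The total parameter count matches the 2^nv probabilities that specify
    # a general distribution on nv binary variables.
    assert total_parameters(nv) == 2**nv
```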
However, one should not ignore the potential positive aspects of using a marginalized Boltzmann Machine model. First, this model requires no more than a quadratic energy on the total set of nodes (visible nodes + latent variable nodes), which is consistent with the fact that the use of latent variables can simplify analysis and algorithm development. Secondly, the universal approximation property shows that it is not futile to expect that a marginalized Boltzmann Machine model can model a given binary-vector distribution (even though it might be computationally infeasible to do so). Lastly, the hope when using neural network-based probability models ("generative models") is that nature has special structure that does not require learning the "most general" probability model, and the very many successes demonstrated by the use of neural networks in practice appear to suggest that this is indeed the case. These successful real-world applications of stochastic networks suggest that Boltzmann Machine-like distributions are well-fitted to the type of stochastic structure that apparently exists in nature [tegmark].
$$\theta = \mathrm{vec}(\theta_0, b, W) = (\theta_0, b_1, \cdots, b_n, w_{12}, \cdots, w_{(n-1)n})^T \in \mathbb{R}^p\,, \qquad p = \frac{n^2 + n + 2}{2} \qquad (113)$$
87 Note that we have shown an exact match for the model (94)–(95), which is a restriction of the Restricted Boltzmann Machine; i.e., for a "restriction of a restriction" of the general Boltzmann Machine Distribution. For the unrestricted Boltzmann Machine Distribution fewer hidden units are needed to obtain the same number of parameters, $p$, which always needs to be at least equal to $K_v = 2^{n_v}$ to exactly match the most general categorical distribution on $V$.
88 See Equations (70)–(74) and (77)–(82). In particular note that $W = W^T$ with zero-valued diagonal elements. Also $Z_\theta = e^{-\beta\theta_0}$ as discussed in Appendix D.2.
$$\pi_\theta(X_v) = \sum_{\text{values of } X_h} \pi_\theta(X) = \sum_{\text{values of } X_h} \pi_\theta(X_v, X_h) = \frac{1}{Z_\theta} \sum_{\text{values of } X_h} e^{-\beta E_\theta(X_v, X_h)} \qquad (114)$$
to $\pi_{\text{true}}(X_v)$ using the procedure originally proposed in [1, 22, 23].89 Note that
$$\pi_\theta(X_v, X_h) = \frac{1}{Z_\theta}\, e^{-\beta E_\theta(X_v, X_h)} \qquad (115)$$
and recall
$$Z_\theta = \sum_{\text{values of } X_v,\, X_h} e^{-\beta E_\theta(X_v, X_h)}\,. \qquad (116)$$
Let us assume for a while that $\pi_{\text{true}}(X_v)$ is known and that we are interested in approximating it by a distribution of the form $\pi_\theta(X_v)$. As mentioned in Appendix A, there are sound information-theoretic reasons to determine an approximation that minimizes the Kullback–Leibler Divergence
$$D(\pi_{\text{true}}^v \,\|\, \pi_\theta^v) = E_{\pi_{\text{true}}^v}\!\left\{ \log \frac{\pi_{\text{true}}(X_v)}{\pi_\theta(X_v)} \right\} = \sum_{\text{values of } X_v} \pi_{\text{true}}(X_v) \log \frac{\pi_{\text{true}}(X_v)}{\pi_\theta(X_v)} \qquad (117)$$
of $\pi_\theta(X_v)$ from the true distribution $\pi_{\text{true}}(X_v)$. Note that minimizing $D(\pi_{\text{true}}^v \,\|\, \pi_\theta^v)$ with respect to $\theta$ is equivalent to maximizing the expected per-sample log-likelihood of $\theta$,90
$$\mathrm{ELL}(\theta) \triangleq E_{\pi_{\text{true}}^v}\{\log L(\theta; X_v)\} = E_{\pi_{\text{true}}^v}\{\log \pi_\theta(X_v)\} = \sum_{\text{values of } X_v} \pi_{\text{true}}(X_v) \log \pi_\theta(X_v)\,. \qquad (118)$$
This maximization can be performed by gradient ascent,
$$\theta \leftarrow \theta + \lambda_\theta \nabla_\theta \mathrm{ELL}(\theta)\,, \qquad (119)$$
where $\nabla_\theta \mathrm{ELL}(\theta)$ is the gradient91 of $\mathrm{ELL}(\theta)$ and $\lambda_\theta > 0$ is a step-size parameter used to control the stability and rate of convergence.92 Note from Eq. (118) that the gradient of $\mathrm{ELL}(\theta)$ is equal to the expected gradient of the per-sample log-likelihood,
$$\nabla_\theta \mathrm{ELL}(\theta) = E_{\pi_{\text{true}}^v}\{\nabla_\theta \log \pi_\theta(X_v)\} = \sum_{\text{values of } X_v} \pi_{\text{true}}(X_v)\, \nabla_\theta \log \pi_\theta(X_v)\,. \qquad (120)$$
Since $\nabla_\theta \log \pi_\theta(X_v) = \nabla_\theta \pi_\theta(X_v)/\pi_\theta(X_v)$, the last equation shows that if the value of $\pi_\theta(X_v)$ is less than the true value, then the effect of the gradient of the per-sample likelihood, $\nabla_\theta \pi_\theta(X_v)$ (which is the direction of maximum increase of the value of $\pi_\theta(X_v)$), on the overall, ensemble-averaged $\nabla_\theta \mathrm{ELL}(\theta)$ is enhanced, whereas if its value is greater than the true value, it is diminished. Note that
$$\nabla_\theta D(\pi_{\text{true}} \,\|\, \pi_\theta) = -\nabla_\theta \mathrm{ELL}(\theta) = -E_{\pi_{\text{true}}^v}\{\nabla_\theta \log \pi_\theta(X_v)\} = -\sum_{\text{values of } X_v} \frac{\pi_{\text{true}}(X_v)}{\pi_\theta(X_v)}\, \nabla_\theta \pi_\theta(X_v)\,. \qquad (122)$$
To determine the gradient of $\log \pi_\theta(X_v)$, note from Eq.s (114) and (116) that
$$\begin{aligned}
\frac{\partial}{\partial\theta} \log \pi_\theta(X_v)
&= \frac{\partial}{\partial\theta} \log\!\left(\sum_{\text{values of } X_h} e^{-\beta E_\theta(X_v, X_h)}\right) - \frac{\partial}{\partial\theta} \log\!\left(\sum_{\text{values of } X_v, X_h} e^{-\beta E_\theta(X_v, X_h)}\right) \\
&= -\beta\left( \frac{\sum\limits_{\text{values of } X_h} \frac{\partial E_\theta(X_v, X_h)}{\partial\theta}\, e^{-\beta E_\theta(X_v, X_h)}}{\sum\limits_{\text{values of } X_h} e^{-\beta E_\theta(X_v, X_h)}} - \frac{\sum\limits_{\text{values of } X_v, X_h} \frac{\partial E_\theta(X_v, X_h)}{\partial\theta}\, e^{-\beta E_\theta(X_v, X_h)}}{\sum\limits_{\text{values of } X_v, X_h} e^{-\beta E_\theta(X_v, X_h)}} \right) \\
&= \beta\left( \sum_{\text{values of } X_v, X_h} \pi_\theta(X_v, X_h)\, \frac{\partial E_\theta(X_v, X_h)}{\partial\theta} - \sum_{\text{values of } X_h} \pi_\theta(X_h \mid X_v)\, \frac{\partial E_\theta(X_v, X_h)}{\partial\theta} \right) \\
&= \beta\left( E_\theta\!\left\{\frac{\partial E_\theta(X_v, X_h)}{\partial\theta}\right\} - E_\theta\!\left\{\frac{\partial E_\theta(X_v, X_h)}{\partial\theta} \,\Big|\, X_v\right\} \right)
\end{aligned}$$
so that
$$\begin{aligned}
\nabla_\theta \mathrm{ELL}(\theta) &= E_{\pi_{\text{true}}^v}\{\nabla_\theta \log \pi_\theta(X_v)\} \\
&= \beta\left( E_\theta\{\nabla_\theta E_\theta\} - E_{\pi_{\text{true}}^v}\!\big\{ E_\theta\{\nabla_\theta E_\theta \mid X_v\} \big\} \right) \\
&= \beta\left( E_\theta\!\big\{ E_\theta\{\nabla_\theta E_\theta \mid X_v\} \big\} - E_{\pi_{\text{true}}^v}\!\big\{ E_\theta\{\nabla_\theta E_\theta \mid X_v\} \big\} \right)\,. \qquad (124)
\end{aligned}$$
The term $E_\theta\{\nabla_\theta E_\theta\}$ is the average of the gradient $\nabla_\theta E_\theta$ for the complete (unmarginalized) latent variable model $\pi_\theta(X_v, X_h)$. As noted in Appendix B (see Property 6), the term $-E_\theta\{\nabla_\theta E_\theta\}$ is the vector of moments of $\pi_\theta(X)$ and therefore the right-hand side of Eq. (124) is a difference of moments. Since there is no conditioning, $E_\theta\{\nabla_\theta E_\theta\}$ is an average obtained when the latent variable model is allowed to run "freely", and we say that the model is unpinned (or unclamped). The term $E_\theta\{\nabla_\theta E_\theta \mid X_v\}$ is the average of the gradient $\nabla_\theta E_\theta$ for the latent variable model conditioned on the visible units being pinned to a value $X_v$ obtained by sampling from the true distribution $\pi_{\text{true}}(X_v)$. Because the visible units of the latent variable model are pinned (aka clamped) to a fixed value, in this case the latent variable model is not allowed to run freely. The term $E_{\pi_{\text{true}}^v}\{E_\theta\{\nabla_\theta E_\theta \mid X_v\}\}$ is the $\pi_{\text{true}}(X_v)$-average of the clamped conditional expectations $E_\theta\{\nabla_\theta E_\theta \mid X_v\}$, which are therefore averaged over all possible visible-unit realizations $X_v$ generated according to the true distribution $\pi_{\text{true}}(X_v)$.
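As a sanity check on this unpinned-minus-pinned decomposition, the sketch below exhaustively evaluates both phases for a small hypothetical 3-unit model (with $\beta = 1$; the biases and couplings are arbitrary values of mine) and compares the resulting gradient of $\log \pi_\theta(X_v)$ against a central finite difference:

```python
import itertools, math

# Tiny Boltzmann machine: nv = 2 visible, nh = 1 hidden, beta = 1.
nv, nh, n = 2, 1, 3
b = [0.3, -0.2, 0.5]                            # arbitrary biases
w = {(0, 1): 0.4, (0, 2): -0.1, (1, 2): 0.25}   # arbitrary couplings, i < j

def energy(x, b):
    # E(x) = -sum_i b_i x_i - sum_{i<j} w_ij x_i x_j, with x_i in {-1,+1}.
    return -sum(b[i] * x[i] for i in range(n)) \
           - sum(w[i, j] * x[i] * x[j] for (i, j) in w)

states = list(itertools.product([-1, 1], repeat=n))

def log_pv(xv, b):
    # log pi_theta(Xv) = log sum_{Xh} e^{-E(Xv, Xh)} - log Z
    num = sum(math.exp(-energy(xv + xh, b))
              for xh in itertools.product([-1, 1], repeat=nh))
    Z = sum(math.exp(-energy(x, b)) for x in states)
    return math.log(num) - math.log(Z)

xv = (1, -1)
for i in range(n):
    # Unclamped minus clamped average of dE/db_i = -x_i.
    Z = sum(math.exp(-energy(x, b)) for x in states)
    free = sum(-x[i] * math.exp(-energy(x, b)) / Z for x in states)
    clamped = [xv + xh for xh in itertools.product([-1, 1], repeat=nh)]
    Zc = sum(math.exp(-energy(x, b)) for x in clamped)
    clamp = sum(-x[i] * math.exp(-energy(x, b)) / Zc for x in clamped)
    analytic = free - clamp
    # Central finite difference of log pi_theta(xv) with respect to b_i.
    eps = 1e-6
    bp, bm = b[:], b[:]
    bp[i] += eps
    bm[i] -= eps
    numeric = (log_pv(xv, bp) - log_pv(xv, bm)) / (2 * eps)
    assert abs(analytic - numeric) < 1e-6
```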
Eq. (124) says that $\nabla_\theta \mathrm{ELL}(\theta)$ vanishes (and hence learning stops) when the unpinned (freely running) latent variable model produces unconditional moments that match the true distributional average over the conditional moments produced by the pinned latent variable model. To make this clearer, let us see what this corresponds to in terms of the parameter values $b_i$ and $w_{ij}$ of the latent variable model (112). Note that
$$\frac{\partial}{\partial b_i} E_\theta(X) = -x_i \qquad \text{and} \qquad \frac{\partial}{\partial w_{ij}} E_\theta(X) = -x_i x_j\,,$$
so that, with $\hat m_i(\theta \mid X_v) \triangleq E_\theta\{x_i \mid X_v\}$,
$$\frac{\partial}{\partial b_i}\mathrm{ELL}(\theta) = \beta\Big( \underbrace{E_{\pi_{\text{true}}^v}\{\hat m_i(\theta \mid X_v)\}}_{\hat m_i^+(\theta)} - \underbrace{E_\theta\{\hat m_i(\theta \mid X_v)\}}_{\hat m_i^-(\theta)\, =\, E_\theta\{x_i\}} \Big) = \beta\big(\hat m_i^+(\theta) - \hat m_i^-(\theta)\big) \qquad (125)$$
and
$$\begin{aligned}
\frac{\partial}{\partial w_{ij}}\mathrm{ELL}(\theta) &= \beta\Big( E_{\pi_{\text{true}}^v}\big\{\underbrace{E_\theta\{x_i x_j \mid X_v\}}_{\hat m_{ij}(\theta \mid X_v)}\big\} - E_\theta\big\{\underbrace{E_\theta\{x_i x_j \mid X_v\}}_{\hat m_{ij}(\theta \mid X_v)}\big\} \Big) \\
&= \beta\Big( \underbrace{E_{\pi_{\text{true}}^v}\{\hat m_{ij}(\theta \mid X_v)\}}_{\hat m_{ij}^+(\theta)} - \underbrace{E_\theta\{\hat m_{ij}(\theta \mid X_v)\}}_{\hat m_{ij}^-(\theta)\, =\, E_\theta\{x_i x_j\}} \Big) = \beta\big(\hat m_{ij}^+(\theta) - \hat m_{ij}^-(\theta)\big) \qquad (127)
\end{aligned}$$
according to the model distribution πθ . Note that this is a one-step averaging procedure.
Note that the gradients (125) and (127) vanish (in which case learning ceases) when the moments $\hat m_i^-(\theta)$ and $\hat m_{ij}^-(\theta)$ of the unpinned latent variable model have learned to follow the true-data averages of the conditional moments $\hat m_i^+(\theta)$ and $\hat m_{ij}^+(\theta)$ of the pinned latent variable model. If this occurs, the trained latent variable model can then naturally (i.e., without pinning) generate samples on the visible units that mimic the statistics of the training samples that were drawn from the true distribution $\pi_{\text{true}}(X_v)$.
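The moment-matching learning rule just described can be exercised exactly on a model small enough to enumerate. The sketch below (the "true" visible distribution, step size, and iteration count are illustrative choices of mine) runs gradient ascent on $\mathrm{ELL}(\theta)$ using the exact positive- and negative-phase moments of Eq.s (125) and (127), for $n_v = 2$ visible and $n_h = 1$ hidden units with $\beta = 1$:

```python
import itertools, math

nv, nh, n = 2, 1, 3
vis_states = list(itertools.product([-1, 1], repeat=nv))
all_states = list(itertools.product([-1, 1], repeat=n))
# An arbitrary positive "true" distribution on the visible units.
pi_true = {(-1, -1): 0.4, (-1, 1): 0.1, (1, -1): 0.2, (1, 1): 0.3}

def energy(x, b, w):
    return -sum(b[i] * x[i] for i in range(n)) \
           - sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))

def moments(dist):
    # First and second moments of a distribution over full states.
    mi = [sum(p * x[i] for x, p in dist.items()) for i in range(n)]
    mij = {(i, j): sum(p * x[i] * x[j] for x, p in dist.items())
           for i in range(n) for j in range(i + 1, n)}
    return mi, mij

b = [0.0] * n
w = [[0.0] * n for _ in range(n)]
lam = 0.1                                      # step size (beta absorbed)
for step in range(5000):
    Z = sum(math.exp(-energy(x, b, w)) for x in all_states)
    p_model = {x: math.exp(-energy(x, b, w)) / Z for x in all_states}
    mi_neg, mij_neg = moments(p_model)         # negative (unpinned) phase
    # Positive (pinned) phase: conditional moments averaged over pi_true.
    p_pos = {}
    for xv in vis_states:
        clamped = [xv + xh for xh in itertools.product([-1, 1], repeat=nh)]
        Zc = sum(math.exp(-energy(x, b, w)) for x in clamped)
        for x in clamped:
            p_pos[x] = p_pos.get(x, 0.0) + pi_true[xv] * math.exp(-energy(x, b, w)) / Zc
    mi_pos, mij_pos = moments(p_pos)
    for i in range(n):
        b[i] += lam * (mi_pos[i] - mi_neg[i])
    for (i, j) in mij_pos:
        w[i][j] += lam * (mij_pos[i, j] - mij_neg[i, j])

# After training, the visible marginal should closely match pi_true.
Z = sum(math.exp(-energy(x, b, w)) for x in all_states)
marginal = {xv: sum(math.exp(-energy(xv + xh, b, w)) / Z
                    for xh in itertools.product([-1, 1], repeat=nh))
            for xv in vis_states}
for xv in vis_states:
    assert abs(marginal[xv] - pi_true[xv]) < 0.02
```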
When the gradients (125) and (127) vanish we have94
$$\sum_{\text{values of } X_v} \big(\pi_{\text{true}}(X_v) - \pi_\theta(X_v)\big) = 0$$
$$\sum_{\text{values of } X_v} \hat m_i(\theta \mid X_v)\,\big(\pi_{\text{true}}(X_v) - \pi_\theta(X_v)\big) = 0$$
$$\sum_{\text{values of } X_v} \hat m_{ij}(\theta \mid X_v)\,\big(\pi_{\text{true}}(X_v) - \pi_\theta(X_v)\big) = 0$$
93 $\beta$ has been absorbed into the step-size parameter $\lambda_{\hat b_i}$.
94 The first equation holds because both probabilities must sum to one.
for $i, j = 1, \cdots, n$ with $i < j$ and $n = n_v + n_h$, which is the system of $p = \frac{n^2 + n + 2}{2}$ Method of Moment (MOM) equations (89) described in Section 6.3.2. As mentioned there, and described in Section 6.3.1, a necessary condition for these equations to imply $\pi_{\text{true}}(X_v) = \pi_\theta(X_v)$ in the most general case is that $p \geq K_v = 2^{n_v}$, which is equivalent to the condition (83),
$$n_h \;\geq\; \frac{\sqrt{2^{n_v+3} - 7} - 2 n_v - 1}{2} \;=\; O\!\left(2^{n_v/2}\right)\,.$$
Of course, the hope is that for learning an approximate distribution for $X_v$ that is of practical utility one can get by with
$$n_h \;\ll\; \frac{\sqrt{2^{n_v+3} - 7} - 2 n_v - 1}{2}\,.$$
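The bound is easy to evaluate numerically. This sketch checks, for small $n_v$, that the smallest integer $n_h$ satisfying the bound yields $p \geq K_v = 2^{n_v}$ while one fewer hidden unit falls short:

```python
from math import sqrt, ceil

def nh_lower_bound(nv: int) -> float:
    # Condition (83): smallest real nh with p = (n^2 + n + 2)/2 >= 2^nv,
    # where n = nv + nh.
    return (sqrt(2**(nv + 3) - 7) - 2 * nv - 1) / 2

for nv in range(1, 12):
    nh = max(0, ceil(nh_lower_bound(nv)))
    n = nv + nh
    p = (n * n + n + 2) // 2
    assert p >= 2**nv                      # enough parameters to reach K_v
    if nh >= 1:
        m = n - 1                          # one fewer hidden unit falls short
        assert (m * m + m + 2) // 2 < 2**nv
```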
Discussion of value of β and annealing.
Let us now assume that the true distribution πtrue (Xv ) is either unavailable or intractable to use. In this case, we replace it
with the sample distribution π̂(Xv ) determined from N independent and identically distributed samples of Xv presumed to be
drawn from πtrue (Xv ).
APPENDICES
Note that $N = |D|$. For a given distribution taken from the statistical family, $Q_\theta \in \mathcal{Q}$, equivalently, for a particular choice of $\theta \in \Theta$, we can form the joint distribution on the $N$-sample data set $D_N$:
$$Q_\theta(D_N) = \prod_{j=1}^{N} Q_\theta(X[j]) \qquad \theta \text{ fixed, } D_N \text{ free.}$$
95 We say "truth distribution" rather than "true distribution" because we are assuming that it is true for mathematical convenience. Perhaps a better way to describe $P$ is that it is a reference distribution assumed to describe the stochastic behavior of a world of interest.
96 Note that $X[j] = X^{(\ell)} \in \mathbb{R}^n$ means that the $j$-th sample in $D_N$ takes the $\ell$-th possible realization value in the sample space $\mathcal{X} \subset \mathbb{R}^n$.
Fixing the sample data argument $D_N$ in $F(\theta, D_N)$ produces a $D_N$-parameterized function of $\theta$ called the (full data) likelihood function,
$$L(\theta; D_N) = \prod_{j=1}^{N} L(\theta; X[j]) = \prod_{j=1}^{N} Q_\theta(X[j]) \qquad \theta \text{ free, } D_N \text{ fixed,}$$
where
$$L(\theta; X) = Q_\theta(X) \qquad \theta \text{ free, } X \text{ fixed}$$
is the per-sample likelihood function given the single-sample realization $X$.
The Maximum Likelihood Estimate of $\theta \in \Theta$ given the $N$-sample iid data $D_N$ is defined by
$$\hat\theta_{\text{ml}}(N) = \arg\max_{\theta \in \Theta} L(\theta; D_N)\,.$$
Because the statistical family is often of exponential family type, it is usually convenient to equivalently maximize the log-likelihood function
$$\log L(\theta; D_N) = \sum_{j=1}^{N} \log Q_\theta(X[j])\,.$$
When the distribution $Q_\theta(X)$ belongs to an Exponential Family of Distributions,97 one typically uses the base-$e$ logarithm function, $\log a = \ln a$.
Defining the sample average of the per-sample log-likelihood function $\log L(\theta; X) = \log Q_\theta(X)$ on the $N$-sample iid data set $D$ by
$$\big\langle \log Q_\theta(X) \big\rangle_N = \frac{1}{N} \sum_{j=1}^{N} \log Q_\theta(X[j])$$
then yields
$$\hat\theta_{\text{ml}}(N) = \arg\max_{\theta \in \Theta} \big\langle \log L(\theta; X) \big\rangle_N = \arg\max_{\theta \in \Theta} \big\langle \log Q_\theta(X) \big\rangle_N\,. \qquad (129)$$
Note that under the assumption that $\mathcal{Q}$ is identifiable, this is equivalent to solving98
$$\hat Q(N) = \arg\max_{Q \in \mathcal{Q}} \big\langle \log Q(X) \big\rangle_N\,.$$
I.e., at least mathematically, density estimation is parameter estimation, and parameter estimation is density estimation. Also note from Eq. (129) that the MLE is obtained by maximizing the data-averaged per-sample log-likelihood function
$$\big\langle \log L(\theta; X) \big\rangle_N = \big\langle \log Q_\theta(X) \big\rangle_N\,.$$
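As a concrete instance of Eq. (129), the sketch below (the Bernoulli family and sample size are illustrative choices of mine) verifies by grid search that the maximizer of the data-averaged per-sample log-likelihood is the sample relative frequency:

```python
import math, random

# For the Bernoulli family Q_q(x) = q^x (1 - q)^(1 - x), the maximizer of
# <log Q_q(X)>_N is the relative frequency N1/N of ones in the data.
random.seed(0)
data = [1 if random.random() < 0.7 else 0 for _ in range(500)]
N1 = sum(data)

def avg_loglik(q):
    # Data-averaged per-sample log-likelihood <log Q_q(X)>_N.
    return sum(math.log(q) if x else math.log(1 - q) for x in data) / len(data)

grid = [i / 1000 for i in range(1, 1000)]
q_ml = max(grid, key=avg_loglik)
# The grid maximizer coincides with the sample frequency (a grid point).
assert abs(q_ml - N1 / len(data)) < 1e-3
```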
The MLE θ̂ml (N ) has many interesting and useful finite– and infinite–sample properties which are described well in many
references.99
97 Also known as an Exponential Model [17].
98 Under the assumption of identifiability, θ̂ml (N ) and Q̂(N ) are related in a one-to-one way by Q̂(N ) = Qθ̂ml (N ) .
99 For example, see [17, 18].
Above we discussed the sample average for the particular case of $f(X) = \log Q_\theta(X)$. The Strong Law of Large Numbers (SLLN) yields
$$\big\langle f(X) \big\rangle_N \xrightarrow{\;N \to \infty\;} E_P\{f(X)\} = \sum_{X \in \mathcal{X}} f(X)\, P(X) = \sum_{i=1}^{K} p_i\, f(X^{(i)}) \qquad \text{almost surely.}$$
Thus in the limit of large N a sample average can provide an accurate approximation to a distributional average.100 This fact
leads one to define, and use, the empirical distribution as a surrogate for the unknown truth distribution P .
The empirical distribution $\hat P = \hat P^{(N)}$ is a function of the measured iid data $D = D_N = \{X[1], \cdots, X[N]\}$ and is defined for any realization, known or unknown, as
$$\hat P(X) = \hat P^{(N)}(X) = \frac{1}{|D|} \sum_{X' \in D} \mathbb{1}(X = X') = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}(X = X[j]) = \frac{1}{N} \sum_{j=1}^{N} \delta_{X,\, X[j]}\,,$$
and therefore
$$E_{\hat P^{(N)}}\{f(X)\} = \big\langle f(X) \big\rangle_N\,.$$
As a consequence, the SLLN can be written as
$$E_{\hat P^{(N)}}\{f(X)\} \xrightarrow{\;N \to \infty\;} E_P\{f(X)\} \qquad \text{almost surely.}$$
Note that taking the function to be $f(X) = \mathbb{1}(X = X^{(i)}) = \delta_{X,\, X^{(i)}}$ yields
$$\hat p_i = \hat P(X^{(i)}) = E_{\hat P^{(N)}}\big\{\mathbb{1}(X = X^{(i)})\big\} = \frac{N_i}{N}$$
where
$$\frac{N_i}{N} = \frac{1}{N} \sum_{j=1}^{N} \delta_{X[j],\, X^{(i)}}$$
is the relative frequency of occurrence of the particular realization value $X^{(i)} \in \mathcal{X}$ in the data set of measurements $D_N = \{X[j],\ 1 \leq j \leq N\}$. Note that the empirical distribution can be written as
$$\hat P(X) = \sum_{i=1}^{K} \hat p_i\, \delta_{X,\, X^{(i)}}\,.$$
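A small simulation (the three-state distribution and test function are arbitrary illustrations of mine) of the empirical distribution and the SLLN statements above:

```python
import random
from collections import Counter

# Relative frequencies p_hat_i = N_i/N approach the true probabilities,
# and the empirical expectation of any f approaches E_P{f(X)}.
random.seed(1)
support = ["a", "b", "c"]
p = {"a": 0.5, "b": 0.3, "c": 0.2}
N = 200_000
data = random.choices(support, weights=[p[s] for s in support], k=N)

counts = Counter(data)
p_hat = {s: counts[s] / N for s in support}   # the empirical distribution

f = {"a": 1.0, "b": -2.0, "c": 4.0}           # an arbitrary test function
emp_mean = sum(f[x] for x in data) / N        # <f(X)>_N = E_{P_hat}{f(X)}
true_mean = sum(p[s] * f[s] for s in support)

for s in support:
    assert abs(p_hat[s] - p[s]) < 0.01
assert abs(emp_mean - true_mean) < 0.05
```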
a statement that is often phrased as “the property of ergodicity holds.” The SLLN will hold given appropriate sufficient conditions, such as iid finite-state
random variables having bounded moments.
We interpret IP (X) as providing a measure of the amount of information one has gained by performing a measurement of the
random variable X and obtaining the realization value X = X given that we believe that the probability distribution describing
the random behavior of X is P . IP (X) is also called the surprise occurring when a measurement of X results in the realization
X . In information theory it is typical to use a base-2 logarithm so that the amount of information (or surprise) arising from
the measurement X = X is in units of bits. Note that the occurrence of a low (near zero) probability-P event is interpreted
as producing a lot of information (and is quite surprising) whereas the occurrence of a P -certain (probability one) event is
interpreted as providing no information (or surprise) at all.
If $P$ is the truth distribution, then we take the measure of surprise provided by $I_P$ to be objectively valid. However, it is usually the case that we do not know the truth distribution and instead are using a model of the stochastic behavior of reality that is captured by a parameterized distribution $Q_\theta \in \mathcal{Q}$. In this latter situation, we can only estimate that the amount of information provided by the measurement $X = X$ is $I_{Q_\theta}(X)$. Because the real world, assumed to be behaving according to the truth distribution $P$, is the source of all measurements, on the average we expect to see values of $I_{Q_\theta}$ and $I_P$ given respectively by $E_P\{I_{Q_\theta}(X)\}$ and $E_P\{I_P(X)\}$.
The function
$$H(P, Q_\theta) = E_P\{I_{Q_\theta}(X)\} = E_P\!\left\{\log \frac{1}{Q_\theta(X)}\right\} = -E_P\{\log Q_\theta(X)\} \qquad (131)$$
is the Cross-Entropy101 between the truth distribution P and the model distribution Qθ . It gives the amount of surprise we will
measure on the average when applying IQθ to data repeatedly generated by the real world according to P in an iid manner.
The Cross-Entropy between $P$ and itself is the Self-Entropy, $H(P, P)$, or, more simply, the (Shannon) Entropy, $H(P)$, of $P$:
$$H(P) = H(P, P) = E_P\{I_P(X)\} = E_P\!\left\{\log \frac{1}{P(X)}\right\} = -E_P\{\log P(X)\}\,. \qquad (132)$$
The entropy H(P ) provides an objective measure of the intrinsic amount of information (surprise) that we gain (encounter) on
the average as we repeatedly, in an iid manner, query a world whose stochastic behavior is truly described by the distribution
P.
Define the Information Discrepancy $\Delta I(P, Q_\theta)$ between the model and truth distributions by
$$\Delta I(P, Q_\theta)(X) = I_{Q_\theta}(X) - I_P(X) = \log \frac{P(X)}{Q_\theta(X)}\,.$$
This gives a measure, in bits, of how much one overestimates or underestimates the amount of information obtained in a
measurement X = X assuming the model distribution Qθ when the world actually behaves according to P .102
The Kullback–Leibler (KL) Divergence, $D(P \| Q_\theta)$, of $Q_\theta$ from $P$ is defined to be the average Information Discrepancy assuming that the world behaves according to the truth distribution,
$$D(P \| Q_\theta) = E_P\big\{\Delta I(P, Q_\theta)(X)\big\} = E_P\{I_{Q_\theta}(X)\} - E_P\{I_P(X)\} = H(P, Q_\theta) - H(P)\,. \qquad (133)$$
It is evident that $D(P \| Q_\theta)$ gives a measure of how much the estimated average amount of information (surprise) in the world, given by $H(P, Q_\theta)$, diverges from the true average amount of information (surprise) in the world, as given by $H(P)$.
The Divergence can be equivalently written as
$$D(P \| Q_\theta) = E_P\!\left\{\log \frac{P(X)}{Q_\theta(X)}\right\} = \sum_{X \in \mathcal{X}} P(X) \log \frac{P(X)}{Q_\theta(X)}\,, \qquad (135)$$
and it is in this form that it is commonly encountered in the literature. The KL Divergence obeys the Gibbs Inequality,103
$$D(P \| Q) \geq 0 \ \text{ for all } P, Q\,, \quad \text{with} \quad D(P \| Q) = 0 \ \text{ iff } Q = P \text{ almost surely.} \qquad (136)$$
I.e., on the average we measure more surprise in the world using an incorrect distributional model Qθ than if we measure
surprise using the correct distributional model P .
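The definitions (131)–(133), the form (135), and the Gibbs Inequality (136) can be spot-checked directly for finite distributions (base-2 logarithms, so all quantities are in bits; the particular distributions are my illustrations):

```python
import math, random

def cross_entropy(P, Q):
    # H(P, Q) = -E_P{log2 Q(X)}  (Eq. 131)
    return -sum(P[i] * math.log2(Q[i]) for i in P)

def entropy(P):
    # H(P) = H(P, P)  (Eq. 132)
    return cross_entropy(P, P)

def kl(P, Q):
    # D(P || Q) = H(P, Q) - H(P)  (Eq. 133)
    return cross_entropy(P, Q) - entropy(P)

P = {0: 0.5, 1: 0.25, 2: 0.25}
Q = {0: 0.25, 1: 0.25, 2: 0.5}

assert abs(entropy(P) - 1.5) < 1e-12    # H(P) = 1.5 bits
assert abs(kl(P, P)) < 1e-12            # D(P || P) = 0
assert kl(P, Q) > 0                     # Gibbs Inequality, Q != P

# Gibbs Inequality spot-checked on random positive models:
random.seed(2)
for _ in range(100):
    r = [random.random() + 1e-3 for _ in range(3)]
    Qr = {i: ri / sum(r) for i, ri in enumerate(r)}
    assert kl(P, Qr) >= 0.0
```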
3. $D(P \| Q_\theta) = 0$ if and only if the cross-entropy between $P$ and $Q_\theta$ is equal to the self-entropy of $P$.
Note that Eq. (139) shows that the parameter vector $\theta_{\text{kd}}$ that is optimal in the sense of minimizing the Kullback–Leibler divergence is also optimal in the sense that it maximizes the expected per-sample log-likelihood function,
$$E_P\big\{\log L(\theta; X)\big\} = E_P\big\{\log Q_\theta(X)\big\}\,.$$
Unfortunately, very often the truth distribution $P$ is not known, or is intractably complex, so the procedure of Eq. (139) cannot be carried out in practice.
103 Also known as the Information Inequality.
Thus we can approximate the solution to Eq. (139) by the empirical distribution approximation
$$\hat\theta_{\text{kd}}(N) = \arg\max_{\theta \in \Theta} E_{\hat P^{(N)}}\big\{\log Q_\theta(X)\big\}\,. \qquad (141)$$
Now compare the parameter estimates shown in equations (130) and (141). We see that they are equivalent!
$$\hat\theta_{\text{ml}}(N) = \hat\theta_{\text{kd}}(N)\,. \qquad (142)$$
Thus the MLE estimate $\hat\theta_{\text{ml}}(N)$ is a finite-sample, empirical estimate of the KL Divergence estimator $\theta_{\text{kd}}$, for which we expect that
$$\hat\theta_{\text{ml}}(N) \longrightarrow \theta_{\text{kd}} \quad \text{as} \quad N \to \infty\,.$$
where $\beta = \frac{1}{kT} > 0$.105
By assumption, the energy function $E(X)$ is known and finite, and therefore $P(X) > 0$ for all $X \in \mathcal{X}$. I.e., every finite-energy Boltzmann distribution is a positive distribution.
PROPERTY 1
A positive distribution for a discrete random variable on a finite sample space is equivalent to a finite-energy Boltzmann distribution on that sample space and vice versa.
Proof. The "vice versa" statement has already been discussed. Assuming that $P$ is a positive distribution, then $P(X) > 0$ for all $X \in \mathcal{X}$. Select any state $X_0 \in \mathcal{X}$ to serve as a "reference state". Define $E(X_0)$ to be any finite real number.106 Noting that $P(X_0) > 0$, define the finite value $E(X)$ for any $X \in \mathcal{X}$ by107
$$E(X) \triangleq E(X_0) - \frac{1}{\beta} \ln \frac{P(X)}{P(X_0)}\,.$$
104 The stochastic setup is described at the beginning of Appendix A.1.
105 For completeness we set $k$ equal to Boltzmann's constant. When working with artificial neural networks we can take $k = 1$.
106 It is simplest to take $E(X_0) = 0$.
107 Note that in this appendix we are making it explicit that we are working with the natural logarithm. This is unlike much of the note where we
predominantly use the notation “log” and the precise nature of the logarithm depends on the context.
Then
$$\frac{P(X)}{P(X_0)} = e^{-\beta\left(E(X) - E(X_0)\right)} \;\Longrightarrow\; P(X) = \frac{e^{-\beta E(X)}}{Z}$$
with
$$Z = \frac{e^{-\beta E(X_0)}}{P(X_0)}\,.$$
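Property 1 can be exercised numerically: starting from an arbitrary positive distribution, construct the energies from the formula in the proof and recover the distribution as a Boltzmann distribution (the particular $P$ and $\beta$ below are my choices):

```python
import math

beta = 2.0                       # any beta > 0 works
P = [0.1, 0.4, 0.25, 0.25]       # an arbitrary positive distribution
E0 = 0.0                         # reference-state energy (footnote 106)
# E(X) = E(X0) - (1/beta) ln(P(X)/P(X0)), with X0 the first state.
E = [E0 - math.log(p / P[0]) / beta for p in P]

Z = sum(math.exp(-beta * e) for e in E)
P_rec = [math.exp(-beta * e) / Z for e in E]

# Z = e^{-beta E(X0)} / P(X0), and the Boltzmann form recovers P exactly.
assert abs(Z - math.exp(-beta * E0) / P[0]) < 1e-12
for p, pr in zip(P, P_rec):
    assert abs(p - pr) < 1e-12
```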
To facilitate some of the mathematical derivations below, it is convenient to put $p_j = P(X^{(j)})$ and $E_j = E(X^{(j)})$ and write
$$p_j = \frac{1}{Z} e^{-\beta E_j} \quad \text{with} \quad Z = \sum_{j=1}^{K} e^{-\beta E_j}$$
and
$$\ln p_j = -\beta E_j - \ln Z\,.$$
$$\text{Energy Average:} \quad U = E\{E(X)\} = \sum_{j=1}^{K} p_j E_j = \frac{1}{Z} \sum_{j=1}^{K} E_j\, e^{-\beta E_j}$$
we have
$$\text{PROPERTY 2} \qquad U = -\frac{\partial}{\partial\beta} \ln Z$$
Proof.
$$-\frac{\partial}{\partial\beta} \ln Z = -\frac{1}{Z} \frac{\partial Z}{\partial\beta} = \frac{1}{Z} \sum_{j=1}^{K} E_j\, e^{-\beta E_j} = U\,.$$
The dispersion or standard deviation108 of the energy about its mean is defined as
$$\text{Energy Dispersion:} \quad D_E = \sqrt{\mathrm{Var}_E} = \sqrt{E\{(E(X) - U)^2\}}$$
$$\text{PROPERTY 3} \qquad \mathrm{Var}_E = \frac{\partial^2}{\partial\beta^2} \ln Z = -\frac{\partial}{\partial\beta} U$$
108 Also called the RMS (for “root mean-square”) value.
Proof.
$$-\frac{\partial}{\partial\beta} U = -\frac{\partial}{\partial\beta}\left( \frac{1}{Z} \sum_j E_j\, e^{-\beta E_j} \right) = -\underbrace{\left(\frac{1}{Z}\sum_j E_j\, e^{-\beta E_j}\right)}_{U}\, \underbrace{\left(\frac{1}{Z}\sum_{i=1}^{K} E_i\, e^{-\beta E_i}\right)}_{U} + \underbrace{\frac{1}{Z}\sum_j E_j^2\, e^{-\beta E_j}}_{E\{E^2(X)\}} = \mathrm{Var}_E\,.$$
Properties 2 and 3 are a consequence of the fact that the function $\ln Z(\beta)$ is the energy-moment generating function for the Boltzmann distribution. The ratio of the dispersion to the mean, $D_E/U$, gives a measure of how peaked the energy of the system is about its mean value $U$; this ratio goes to zero as $n \to \infty$. For physical systems where $n$ is of the order of Avogadro's number, this ratio is essentially zero and therefore repeated macroscopic measurements of the energy (essentially) always yield the average value $U$.109
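Properties 2 and 3 can be confirmed by finite differences of $\ln Z(\beta)$ for an arbitrary finite energy spectrum (the energies and $\beta$ below are illustrative):

```python
import math

Es = [0.0, 1.0, 1.5, 3.0]   # an arbitrary finite energy spectrum

def lnZ(beta):
    return math.log(sum(math.exp(-beta * e) for e in Es))

def U(beta):
    # Energy average under the Boltzmann distribution.
    Z = sum(math.exp(-beta * e) for e in Es)
    return sum(e * math.exp(-beta * e) for e in Es) / Z

def varE(beta):
    # Energy variance under the Boltzmann distribution.
    Z = sum(math.exp(-beta * e) for e in Es)
    m2 = sum(e * e * math.exp(-beta * e) for e in Es) / Z
    return m2 - U(beta) ** 2

beta = 0.7
h1, h2 = 1e-5, 1e-4
dlnZ = (lnZ(beta + h1) - lnZ(beta - h1)) / (2 * h1)
d2lnZ = (lnZ(beta + h2) - 2 * lnZ(beta) + lnZ(beta - h2)) / h2**2
assert abs(U(beta) + dlnZ) < 1e-8       # Property 2: U = -d lnZ / d beta
assert abs(varE(beta) - d2lnZ) < 1e-5   # Property 3: Var_E = d^2 lnZ / d beta^2
```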
The entropy of a distribution on a finite sample space $\mathcal{X} = \{X^{(1)}, \cdots, X^{(K)}\}$ with distribution $p_j = P(X^{(j)})$ is defined110 by
$$\text{Entropy:} \quad H(P) = -k \sum_{j=1}^{K} p_j \ln p_j$$
The Boltzmann distribution is the unique distribution, P , on X that maximizes the entropy H(P ) subject to the
constraint that U (P ) takes a specified value U.
Assume that $x_i$ takes binary realization values $x_i \in \{-1, +1\}$, for $i = 1, \cdots, n$. The random variables $x_i$ are binary, or dichotomous, categorical variables and $X$ is a binary categorical random vector. We have $K = 2^n$ and in general $P(X)$ takes $K = 2^n$ values, which are constrained to sum to one. If we assume that $P$ is positive, $P(X) > 0$ for all $X \in \mathcal{X}$, then any such $P$ can be represented as $P(X) = P(X; \theta) = P_\theta(X)$ where
$$\frac{1}{\beta} \ln P(X; \theta) = \theta_0 + \sum_{i=1}^{n} \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \sum_{i<j<k} \theta_{ijk} x_i x_j x_k + \cdots + \theta_{12 \cdots n}\, x_1 x_2 \cdots x_n\,.$$
109 To within measurement error. Noise in the macroscopic measurement device will swamp the microscopic fluctuations of the energy $E(X)$ about its mean value $U$.
Note therefore that any positive distribution, $P(X)$, of a binary random vector $X$ can be written as a Boltzmann Distribution
$$P(X) = P(X; \theta) = P_\theta(X) = \frac{1}{Z_\theta} e^{-\beta E_\theta(X)}$$
for some $K$-dimensional parameter vector $\theta$ with
$$E_\theta(X) = -\sum_{i=1}^{n} \theta_i x_i - \sum_{i<j} \theta_{ij} x_i x_j - \sum_{i<j<k} \theta_{ijk} x_i x_j x_k - \cdots - \theta_{12 \cdots n}\, x_1 x_2 \cdots x_n$$
and
$$Z_\theta = e^{-\beta\theta_0} = \sum_{\text{values of } X} e^{-\beta E_\theta(X)}\,.$$
Note that the form of the partition function $Z_\theta$ clearly gives the constraint condition that must exist amongst the elements of $\theta$ in order to ensure that the distribution properly sums to one,
$$\theta_0 = -\frac{1}{\beta} \ln\!\left( \sum_{\text{values of } X} e^{-\beta E_\theta(X)} \right)\,.$$
The Boltzmann distribution for a binary-components categorical vector X is also known as the Boltzmann Machine dis-
tribution and the Ising model. Property 6 states the existence of a general log-linear model for any Ising model. See the
discussion in Appendix D.2. The following two properties are consequences of the log-linearity of the Ising model.
where we define
$$i_0 = 0 \quad \text{and} \quad m_0(\theta) = E_{P_\theta}\{x_1^0\} = 1\,.$$
Proof. This is a straightforward consequence of the fact that $x_i \in \{-1, +1\}$ with $i \leq n$. See the discussion in Appendix D.2.
Proof. The general element of the vector $\theta$ is equal to $\theta_{i_1 i_2 \cdots i_k}$ for $0 \leq i_1 \leq i_2 \leq \cdots \leq i_k \leq n$, $1 \leq k \leq n$, and equal to $\theta_0$ for $k = 0$. Then the corresponding component of $-E_{P_\theta}\{\nabla_\theta E_\theta(X)\}$ is
$$-E_{P_\theta}\!\left\{ \frac{\partial E_\theta(X)}{\partial \theta_{i_1 i_2 \cdots i_k}} \right\} = E_{P_\theta}\{x_{i_1} x_{i_2} \cdots x_{i_k}\} = m_{i_1 i_2 \cdots i_k}(\theta)$$
for $1 \leq k \leq n$, and $-1$ for $k = 0$. Property 7 states that no other moments of the Boltzmann Machine Distribution need be considered.
Proof. That no moments other than the ones stated need be considered is Property 7. Using the quantities defined in the statement of Property 6, we have
$$\frac{1}{\beta} \frac{\partial}{\partial \theta_{i_1 i_2 \cdots i_k}} \ln Z_\theta = \frac{1}{\beta Z_\theta} \frac{\partial Z_\theta}{\partial \theta_{i_1 i_2 \cdots i_k}} = -\sum_{\text{values of } X} \frac{e^{-\beta E_\theta(X)}}{Z_\theta}\, \frac{\partial E_\theta(X)}{\partial \theta_{i_1 i_2 \cdots i_k}} = \sum_{\text{values of } X} P_\theta(X)\, x_{i_1} x_{i_2} \cdots x_{i_k} = E_{P_\theta}\big\{x_{i_1} x_{i_2} \cdots x_{i_k}\big\}\,.$$
$$\text{Free Energy:} \quad F = -kT \ln Z = -\frac{1}{\beta} \ln Z$$
$$\text{PROPERTY 10} \qquad H = -\frac{\partial F}{\partial T}$$
Proof.
$$-\frac{\partial F}{\partial T} = k \ln Z + kT\, \frac{\partial \ln Z}{\partial \beta}\, \frac{\partial \beta}{\partial T} = k\,(\ln Z + \beta U)
= k\left( \sum_j p_j \ln Z + \beta \sum_j E_j\, p_j \right) = k \sum_j p_j\, (\ln Z + \beta E_j) = -k \sum_j p_j \ln p_j = H\,.$$
PROPERTY 11 $\quad F = U - TH$
Proof.
$$U - F = \sum_{j=1}^{K} E_j\, p_j + kT \ln Z = kT\left( \sum_j \beta E_j\, p_j + \sum_j p_j \ln Z \right)
= kT \sum_j p_j\, (\beta E_j + \ln Z) = -kT \sum_j p_j \ln p_j = TH\,.$$
The fact that $U = F + TH$ for a fixed-volume, constant-particle system in thermal equilibrium with a heat bath (and therefore described by the Boltzmann distribution) is a fundamental result in equilibrium statistical mechanics. It says that of the total (average) energy $U$ stored in the system, only a portion, the free energy $F$, is (at most) available to do useful work; the remaining part, the entropy portion $TH$, being purposeless, chaotic thermal agitation that can never be used to perform useful work.
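Properties 10 and 11 admit the same kind of numerical check (here with $k = 1$ and an illustrative energy spectrum of my choosing):

```python
import math

Es = [0.0, 0.5, 2.0]   # arbitrary finite energy spectrum, k = 1

def quantities(T):
    # Boltzmann distribution at temperature T, with U, H, and F = -T ln Z.
    beta = 1.0 / T
    Z = sum(math.exp(-beta * e) for e in Es)
    p = [math.exp(-beta * e) / Z for e in Es]
    U = sum(pj * e for pj, e in zip(p, Es))
    H = -sum(pj * math.log(pj) for pj in p)
    F = -T * math.log(Z)
    return U, H, F

T, h = 1.3, 1e-6
U, H, F = quantities(T)
assert abs(F - (U - T * H)) < 1e-10            # Property 11: F = U - TH
Fp = quantities(T + h)[2]
Fm = quantities(T - h)[2]
assert abs(H + (Fp - Fm) / (2 * h)) < 1e-6     # Property 10: H = -dF/dT
```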
$$\text{Heat Capacity:} \quad C_V = \frac{\partial U}{\partial T} = -\frac{\beta}{T} \frac{\partial U}{\partial \beta} = -\frac{1}{kT^2} \frac{\partial U}{\partial \beta}$$
$$\text{PROPERTY 12} \qquad C_V = \frac{\mathrm{Var}_E}{kT^2}$$
This is the rationale underlying the heuristic optimization procedure known as simulated annealing [25].
111 If it is exactly the case that $T = 0$, then $F = U = E(X)$ (since then $\mathrm{Var}_E = 0 \Rightarrow E(X) = U$ while $TH = 0 \Rightarrow F = U$) and $F = E(X)$ can only be a minimum if $X$ makes $E(X)$ a minimum.
with β > 0, satisfies certain optimality properties. We assume that the energy is always finite,
For convenience, we define $p_j = P(X^{(j)})$ and $E_j = E(X^{(j)})$. Recall that a distribution on $\mathcal{X}$ is subject to the constraints
$$p_j \geq 0 \quad \text{and} \quad \sum_{j=1}^{K} p_j = 1\,.$$
Note that the second condition imposes a linear constraint on the space of admissible probabilities.
The important statistical mechanical quantities of Entropy, $H$, Average Energy, $U$, and Free Energy, $F$, are given by [6, 24],112
$$H(P) = -\sum_{j=1}^{K} p_j \ln p_j\,, \qquad U(P) = \sum_{j=1}^{K} p_j E_j\,, \qquad \text{and} \qquad F(P) = U(P) - T\, H(P)\,,$$
where the (pseudo) temperature $T$ and inverse temperature $\beta$ are reciprocally related, $\beta = T^{-1}$.
where $\lambda$ is a continuous, real-valued Lagrange multiplier. Setting the derivative of $\ell(P, \lambda)$ with respect to $p_j$ to zero yields
$$\frac{\partial}{\partial p_j} \ell(P) = \beta E_j + 1 + \ln p_j + \lambda = 0 \;\Longrightarrow\; p_j = \frac{1}{Z} e^{-\beta E_j} > 0$$
where
$$Z = e^{1+\lambda} \;\Longleftrightarrow\; \lambda = \ln Z - 1\,.$$
The normalization requirement forces $Z = \sum_j e^{-\beta E_j}$, which fixes the value of $Z$, and hence that of the Lagrange multiplier $\lambda$. Note that the solution is unique and always satisfies the positivity constraint $p_j > 0$.114
We have established that the stationary solution of the Lagrangian is of the form
$$P(X) = \frac{1}{Z} e^{-\beta E(X)} > 0$$
for all $X \in \mathcal{X} = \{X^{(1)}, \cdots, X^{(K)}\}$ and finite $\beta > 0$. Note that we did not impose the nonnegativity inequality constraint $p_j \geq 0$ on the distribution, and therefore our solution is the unique stationary point of the Lagrangian regardless of the nonnegativity constraint. Thus the inequality constraint can be ignored when asking if the stationary point is a minimum, a maximum, or a
112 See also Appendix B.
113 Physically, this corresponds to the system being in thermal equilibrium with a system at a known, fixed temperature $T$.
114 Recall that at the outset we made the assumption that $|E(X)| < \infty$ for all $X \in \mathcal{X}$.
saddle point of the linearly constrained optimization problem.115 So we ask: is the stationary solution a linearly constrained minimum of the free energy $F(P)$? Note that the second derivative of $\beta F(P)$ with respect to $p_j = P(X^{(j)})$ is given by $\frac{1}{p_j}$, which is strictly positive for all $p_j > 0$. Further note that the Hessian of $\beta F(P)$ is a purely diagonal matrix containing the second derivatives $\frac{1}{p_j}$ on the diagonal, which means that the Hessian of the free energy $F(P)$ is strictly positive-definite for all $P > 0$. This further implies that the projection of the Hessian onto any admissible subspace of first-order variations of the values of $p_j$ about the stationary point is (locally) strictly positive-definite on that subspace.116 Thus our solution is the unique global minimizer of the free energy subject to the normalization constraint.117
where we again assume the existence of a specified, finite energy function $E(X)$, $E_j = E(X^{(j)})$. Note that the average energy constraint is linear in $P$, so that together with the normalization constraint $\sum_j p_j = 1$, we are imposing two linear equality constraints on the set of admissible probability values.
We now show that for this situation, the Boltzmann distribution is the unique maximum entropy distribution subject to the
given average energy constraint. In this case the inverse temperature β (equivalently T ) is a Lagrange multiplier that needs to
be determined.118
Incorporating the additional constraint that the probabilities sum to one, the Lagrangian for the problem of maximizing the entropy subject to the given constraints is119
$$\ell(P; \lambda, \beta) = H(P) - \beta\, U(P) - \lambda \sum_j p_j = \sum_j \big( -p_j \ln p_j - \beta p_j E_j - \lambda p_j \big)\,,$$
where $\beta$ and $\lambda$ are continuous, real-valued Lagrange multipliers. Take the derivative with respect to $p_j$ and set it to zero to determine the unique stationary solution,
$$\frac{\partial}{\partial p_j} \ell(P) = -1 - \ln p_j - \beta E_j - \lambda = 0 \;\Longrightarrow\; p_j = \frac{1}{Z} e^{-\beta E_j}$$
where
$$Z = e^{1+\lambda} \;\Longleftrightarrow\; \lambda = \ln Z - 1\,.$$
Note that $Z = Z(\beta) = \sum_j e^{-\beta E_j}$, so that $\lambda = \lambda(\beta)$ and $p_j = p_j(\beta)$. The value of $\beta$ (equivalently $T$) is determined, in principle, by solving the (highly nonlinear in $\beta$) constraint equation
$$\sum_{j=1}^{K} p_j(\beta)\, E_j = U\,.$$
Again, the probabilities are strictly positive, $p_j > 0$ for all $j$, so we do not have to worry about the existence of nondifferentiable boundary-point solutions.
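Since $\partial U/\partial\beta = -\mathrm{Var}_E < 0$ (Property 3), $U(\beta)$ is strictly decreasing in $\beta$ and the constraint equation can be solved for $\beta$ by simple bisection. A sketch with illustrative energies and target average energy (both my choices):

```python
import math

Es = [0.0, 1.0, 2.0, 4.0]   # arbitrary finite energy spectrum
U_target = 1.2              # must lie below the beta -> 0 average, here 1.75

def U(beta):
    # Average energy under the Boltzmann distribution at inverse temperature beta.
    Z = sum(math.exp(-beta * e) for e in Es)
    return sum(e * math.exp(-beta * e) for e in Es) / Z

# U is strictly decreasing in beta: U(lo) ~ 1.75 > U_target > U(hi) ~ 0.
lo, hi = 1e-9, 50.0
for _ in range(200):
    mid = (lo + hi) / 2
    if U(mid) > U_target:
        lo = mid
    else:
        hi = mid
beta = (lo + hi) / 2
assert abs(U(beta) - U_target) < 1e-9
```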
115 I.e., because the inequality constraint is irrelevant, we do not have to worry about the possibility of an optimal value of the solution being at a nondifferentiable boundary point, which would correspond to at least one of the probability values, pj , being zero, pj = 0, if the inequality constraint on pj were relevant and active.
116 Let p = (p1 , · · · , pK )T . Because of the single linear constraint eT p ≡ 1, all admissible first-order variations δp of p must satisfy eT δp = 0. I.e.,
p can only (infinitesimally) vary on the (K − 1)-dimensional subspace that is orthogonal to the vector e. The restriction of the hessian of βF (P ) to that
subspace must be strictly positive-definite for infinitesimal variations about the stationary point, which therefore must be at least a local minimum. Because it
is the unique stationary point of the Lagrangian (and the inequality constraints are irrelevant), i.e., because there are no other possible points to consider, the
stationary solution must be a global minimum.
117 Another way to think about this is that F (P ) having a strictly positive-definite hessian on the domain P > 0 means that F (P ) is strictly convex in
P on that domain. This means that F (P ) restricted to any subspace of that domain is strictly convex and therefore can have at most one
minimum on that domain, which must be a stationary point of the Lagrangian, and which is unique if it exists. Thus our solution, which is the unique
stationary point of the Lagrangian, is the unique minimizing solution that satisfies the linear constraint eT p = 1.
118 Remember that physically the Boltzmann distribution describes the equilibrium thermal behavior of a system in thermal equilibrium with a heat bath
at temperature T . If U is large, the system likely contains a lot of thermal energy, in which case T is likely large in value (equivalently, β is small). On the
other hand, if U is small, then likely the temperature T is small (equivalently, β is large).
119 Note that in the theory of Lagrange multipliers, we are free to choose the signs in front of the multipliers, with a particular choice usually made for convenience.
Note that there is one, and only one, stationary solution for the Lagrangian variational problem. That the stationary
solution maximizes the entropy subject to the two imposed linear constraints is confirmed by noting that the hessian of the
entropy function is diagonal with diagonal elements −1/pj < 0 that are everywhere strictly negative in the neighborhood of the
stationary solution. This means that the hessian restricted to any subspace of allowable variations of pj about the stationary
solution is strictly negative definite.120 As a consequence, the unique stationary point is the unique global maximum on the set
of admissible solutions.121
C.3 Maximum Entropy Distribution Subject to 1st and 2nd Moment Constraints
In the previous two situations, the energy function E(x) is assumed to be a priori known. Here we show that entropy maximization subject to constraints on the first and second non-central moments of the discrete random vector X uniquely yields the
quadratic exponential (Boltzmann-form) distribution.122 The imposed moment constraints are

f (P ) ≜ EP {X} = Σ_{j=1}^{K} pj X(j) ≡ µ ∈ R^n ,   X ∈ X = {X(1) , · · · , X(K) } ⊂ R^n ,

and

g(P ) ≜ EP {XX T } = Σ_{j=1}^{K} pj X(j) X(j)T ≡ R = [Rk,ℓ ] ∈ R^{n×n} .
We call the matrix R of second non-central moments, somewhat erroneously, the correlation matrix.123 Of course, we also
have the normalization condition
c(P ) ≜ Σ_{j=1}^{K} pj ≡ 1 .
Note that the first (vector) constraint f (P ) = µ imposes n linear equality constraint conditions on P while the second
(matrix) constraint g(P ) = R = RT imposes n(n + 1)/2 linear equality constraint conditions. Taken together with the
normalization condition c(P ) = 1, we have a total of (n^2 + 3n + 2)/2 linear equality constraints that must be satisfied by P , which is also
the number of Lagrange multipliers we will need to incorporate into our Lagrangian.
It is convenient to note that for b ∈ R^n ,

bT f (P ) = Σj pj bT X(j) ,

and for W ∈ R^{n×n} ,

tr( g(P )W ) = Σj pj tr( X(j) X(j)T W ) = Σj pj X(j)T W X(j) .
120 In this case, because of the existence of two linear constraints, the space of admissible infinitesimal variations has dimension (K − 2), where K is the
dimension of the probability vector P .
121 The entropy having a strictly negative-definite hessian on the domain P > 0 means that it is strictly concave on that domain, and
therefore on any subspace of that domain. Thus, on any subspace of the domain P > 0 the entropy has at most one unique maximum, which we identify as
the unique stationary solution of the Lagrangian.
122 This result is the discrete state version of the well-known fact that for continuous random vectors the maximum entropy distribution subject to constraints
on the first two moments is the multivariate Gaussian distribution.
123 Note that if µ = 0 then R is also the second central moment of X, so in fact we could drop the adjective “non-central” in the description of this particular problem.
Because the last expression is a quadratic form, with no loss of generality we can take W = W T , which then means that W
contains n(n + 1)/2 independent elements.
Maximizing the entropy H(P ) = − Σj pj ln pj subject to the (n^2 + 3n + 2)/2 linear equality constraints f (P ) ≡ µ, g(P ) ≡ R,
and c(P ) ≡ 1 leads to the Lagrangian124

ℓ(P ; λ, βb, (β/2)W ) = H(P ) + β bT f (P ) + (β/2) tr( g(P )W ) − λ c(P )

where λ, βb, and (β/2)W = (β/2)W T comprise 1 + n + n(n + 1)/2 = (n^2 + 3n + 2)/2 continuous, real-valued Lagrange
multipliers corresponding to the (n^2 + 3n + 2)/2 linear equality constraints. We set the derivative of the Lagrangian with
respect to pj equal to zero,
(∂/∂pj ) ℓ(P ) = −1 − ln pj + β bT X(j) + (β/2) X(j)T W X(j) − λ = 0
to find the unique stationary solution
pj = (1/Z) e^{−βEj} > 0

where

Ej = E(X(j) ) = −(1/2) X(j)T W X(j) − bT X(j)

and the partition function (normalizing factor) Z satisfies

Z = e^{1+λ} ⇐⇒ λ = ln Z − 1 .
In principle the (n^2 + 3n + 2)/2 constraints allow one to solve for the (n^2 + 3n + 2)/2 Lagrange multipliers. Again note that the probabilities
2 Lagrange multipliers. Again note that the probabilities
are strictly positive, pj > 0, for all j, so we don’t have to worry about the existence of nondifferentiable boundary point
solutions.
The unique stationary solution to the Lagrangian maximizes the entropy subject to the imposed first and second moment
constraints because the hessian of the entropy function is diagonal with diagonal elements −1/pj < 0 that are everywhere strictly
negative in the neighborhood of the stationary solution. This means that the hessian restricted to any subspace of allowable
variations of pj about the stationary solution is strictly negative definite.125 As a consequence, the solution is the unique global
maximum on the set of admissible solutions.126
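As a concrete check of the form of this solution, one can enumerate a small ±1-valued state space, build pj ∝ e^{−βEj} with the quadratic energy above, and read off the resulting moments. The parameters W and b below are hypothetical, chosen only for illustration:

```python
import itertools
import numpy as np

n, beta = 3, 1.0
rng = np.random.default_rng(0)
W = rng.normal(size=(n, n)); W = 0.5 * (W + W.T)   # symmetric multiplier matrix
b = rng.normal(size=n)

# All K = 2^n states with components x_i in {-1, +1}
X = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))

# p_j proportional to exp(-beta E_j), with E_j = -(1/2) X^T W X - b^T X
logits = beta * (0.5 * np.einsum('ki,ij,kj->k', X, W, X) + X @ b)
p = np.exp(logits - logits.max())
p /= p.sum()

mu = p @ X                                   # first moment E{X}
R = np.einsum('k,ki,kj->ij', p, X, X)        # second non-central moment E{X X^T}
assert np.allclose(np.diag(R), 1.0)          # x_i^2 = 1 forces a unit diagonal
assert abs(p.sum() - 1.0) < 1e-12
```

Note the unit diagonal of R: for ±1-valued components the diagonal second-moment constraints are satisfied automatically, which is why only the off-diagonal entries of W carry information.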
A careful reading of the literature on categorical data analysis shows that there is a close relationship between the positive
distribution discrete random variable models considered in this note and those used in the modeling of categorical data [9, 29,
10, 2, 13].
124 We can, and do, take β to be any fixed, positive value, β > 0, by definition. Of course b and W (and λ) must still be determined from the requirement that the imposed moment and normalization constraints be satisfied.
125 The entropy having a strictly negative-definite hessian on the domain P > 0 means that it is strictly concave on that domain, and
therefore on any subspace of that domain. Thus, on any subspace of the domain P > 0 the entropy has at most one unique maximum, which we identify as
the unique stationary solution of the Lagrangian.
determination of a Lagrange multiplier λ which sets the value of Z, and vice versa.
129 For fixed DN the likelihood is just the multinomial distribution P (DN ) of Equation (145) viewed as a function of the probabilities P (X(ℓ) ).
130 With P positive, for the maximum likelihood method to be sound, at a minimum we want N large enough that Nℓ ≠ 0 for all ℓ = 1, · · · , K. We also
want N large enough so that the estimation errors for the K estimates are reasonably small.
131 See footnote 10.
132 For both distributions, the elements of the parameter vector ϑ are given by the θ-parameters shown on the far right hand side of the equations.
In the next subsection, where we discuss binary categorical variables, we provide a different motivation for viewing the
three distributions as reasonable approximations to the true K-probabilities categorical distribution (144).
The Logistic Regression Categorical (LRC) distribution (148) has only133 n+1 parameters to learn instead of the K = mn
parameters of the true distribution (144), and its log-probability depends at most linearly on realization values X. However
this simplification comes at a price—note that the LRC distribution can only model very simple behavior in that it corresponds
to the assumption of complete independence among the components of X,
P (X; ϑ) = P (x1 ; ϑ) · · · P (xn ; ϑ) with P (xi ; ϑ) ∝ e−θi xi .
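The complete-independence structure can be seen directly by assembling the joint distribution from the n single-component factors. A small sketch with hypothetical θ-parameters:

```python
import itertools
import numpy as np

theta = np.array([0.5, -1.0, 0.2])   # hypothetical theta_i parameters

def p_comp(i, x):
    """Single-component probability P(x_i = x), with x in {-1, +1}."""
    return np.exp(-theta[i] * x) / (np.exp(-theta[i]) + np.exp(theta[i]))

# Joint LRC probabilities: the product of the independent component factors
joint = {xs: np.prod([p_comp(i, x) for i, x in enumerate(xs)])
         for xs in itertools.product([-1, 1], repeat=len(theta))}
assert abs(sum(joint.values()) - 1.0) < 1e-12   # a properly normalized joint
```

Because each factor normalizes separately, the product is automatically a valid joint distribution; no interaction between components can be represented.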
The Quadratic Exponential Categorical Distribution (equivalently, the BQE distribution) of Eq. (149) is an approximation
to the true multivariate categorical distribution (144) that captures pairwise dependencies between the random categorical
components xi and xj . Empirically, these dependencies are captured in contingency tables [2, 13], which can be thought of as
multidimensional arrays of joint outcome counts of the realization values for an ordered set of random variables. For example,
if the ordered set of random variables is (xi , xj ) and they can both take binary values a, b then we have a two-dimensional
2 × 2 array of counts for the possible joint outcomes (a, a), (a, b), (b, b), (b, a). These raw counts must sum up to the total
amount of collected data, Nij , and the joint outcome counts divided by Nij give the sample frequencies of the joint outcomes.
If we have three categorical variables (xi , xj , xk ) taking binary values then we have a three-dimensional 2 × 2 × 2 array of
joint outcome counts. In the n-components case where X = (x1 , · · · , xn )T we have an n-dimensional array of joint outcome
counts between the components of X. Such arrays of outcome counts are known as contingency tables. It is one of the goals
of categorical data analysis to determine which variables xi are contingent on (i.e., statistically dependent on) variables xj .
Since the dependency structure is contained in the joint distribution P (X) = P (x1 , · · · , xn ), determining good estimates of
this distribution is a primary objective of categorical data analysis. Obviously, the LRC distribution will not detect any nontrivial
contingencies since it encodes the independence assumption. However the QECD distribution (149) can detect contingencies.
Notice that the QECD (aka BQE) distribution requires the specification or estimation of (n^2 + 3n + 2)/2 parameters, which
should be compared to the K = m^n parameters required for the true multivariate categorical distribution (144). For example,
if m = 4 and n = 10, we have 66 versus K = 4^10 = 1, 048, 576 parameters respectively. Note that in this case the simple
LRC distribution, which cannot encode dependencies between the variables, requires only 11 parameters.
134 K can still be quite vast, even in relatively simple situations. See footnote 10.
For mathematical convenience, and with no loss of generality,135 unless otherwise indicated, we take the components of
X to take the values xi = xi ∈ {−1, +1}.136 With this choice, we have x2i = 1 which has several nice consequences.
One useful consequence of xi^2 = 1 is that higher powers of xi reduce to lower powers. Specifically, let ℓ be any nonnegative
integer; then

xi^ℓ = 1 when ℓ is even,   xi^ℓ = xi when ℓ is odd.   (151)

Thus in an expansion in powers of xi there is no need to consider powers other than zero and one.137
Another consequence of the fact that xj^2 = 1 is that in sums of products of the components of X, higher-order products
reduce to lower-order products when component indices are equal.138 For example,

i = j =⇒ xi xj = 1   and   j = ℓ =⇒ xi xj xk xℓ = xi xk .   (152)
Another useful property due to the choice xi = ±1 is that for any real-valued quantity α
Let x1 , · · · , xn be binary random variables with realization values xi = ±1, i = 1, · · · , n. For any function
f (x1 , · · · , xn ),

Σ_{x1 = ±1, ··· , xn = ±1} f (x1 , · · · , xn ) = Σ_{α1 = ±1, ··· , αn = ±1} f (α1 x1 , · · · , αn xn )   (154)
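Both properties (151) and (154) are easy to verify exhaustively for small n. The function f below is an arbitrary hypothetical example:

```python
import itertools

# Property (151): powers of x_i in {-1, +1} reduce to 1 (even) or x_i (odd)
for x in (-1, 1):
    for l in range(8):
        assert x**l == (1 if l % 2 == 0 else x)

# Property (154): for any fixed realization x_i = +-1, summing f over all
# sign flips alpha_i gives the same result as summing f over all states
f = lambda x1, x2, x3: x1 - 2.0 * x2 * x3 + 0.5 * x1 * x2 * x3
full = sum(f(*xs) for xs in itertools.product([-1, 1], repeat=3))
for xs in itertools.product([-1, 1], repeat=3):          # each fixed realization
    flipped = sum(f(a * xs[0], b * xs[1], c * xs[2])
                  for a, b, c in itertools.product([-1, 1], repeat=3))
    assert flipped == full
```

The identity (154) holds because, for fixed xi = ±1, the map αi ↦ αi xi is a bijection of {−1, +1} onto itself.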
Because the true K-parameter binary categorical distribution P of Eq. (144) is assumed to be positive, we can write

(1/β) ln P (X) = g(X)   ∀X ∈ X   (155)

for some function g and any fixed β > 0. Now select g to be of the following form,139
g(X) = θ0 + Σ_{i=1}^{n} θi xi + Σ_{i<j} θij xi xj + Σ_{i<j<k} θijk xi xj xk + · · · + θ12···n x1 x2 · · · xn   (156)

or, equivalently,140

g(X) = θ0 + θT x + (1/2) xT Θ x + Σ_{i<j<k} θijk xi xj xk + · · · + θ12···n x1 x2 · · · xn ,   Θ = ΘT   (157)
Because equalities between the component indices cause higher-order products of the components to reduce to lower order
products, we can use strict inequalities in the sums. Note that Eq. (156) is the most fully general polynomial expansion of g(X)
135 Any two dichotomous random variables Y ∈ {α1 , α2 } and Z ∈ {β1 , β2 } are related by a simple affine transformation Y = aZ + b. For example if
Y ∈ {0, 1} and Z ∈ {−1, +1} we have Y = (1 + Z)/2 and Z = 2Y − 1.
136 An alternative nice choice is to have xi take values 0 or 1.
137 Note that this property also holds if we instead take xi ∈ {0, 1}, as then xi^k = xi for all integers k ≥ 1.
138 This property also holds if we instead take xi ∈ {0, 1}, as then xi^k = xi for all integers k ≥ 1.
139 Here, and in the subsequent development, we follow references [9] and [10].
140 Note that Θ = ΘT where [Θ]ij = θij for i ≠ j and [Θ]ii = 0.
because all products and powers of higher order than shown in the sum will reduce to terms already in the sum. Also note that
Eq. (156) is linear in the unknown θ-parameters.
Are there enough θ-parameters in (155)–(156) to specify the K = 2^n values of the true binary distribution (144)? Note
that in each sum in (156) there are as many parameters as there are terms in the sum. The number of terms in an order-r sum
is the number of ways that r of the n components of X can be chosen without regard to order and with no repetition, which is
the binomial coefficient C(n, r) = n!/(r!(n − r)!).141 Thus the number of θ-parameters is

C(n, 0) + C(n, 1) + C(n, 2) + · · · + C(n, n − 2) + C(n, n − 1) + C(n, n) = (1 + 1)^n = 2^n = K .
We see, then, that the number of independent parameters needed to specify g is K, the same as for the true distribution P . In
(156), conceptually, the parameter θ0 serves to normalize the distribution to sum to one and the remaining K − 1 parameters
are used to set the values of the unnormalized probabilities. Knowledge of the K values of P (X) (and hence of g(X)) yields
K linear equations in the K θ-parameters that can be solved to uniquely determine the parameter values provided that the
collection of K = 2^n polynomial product terms

B = { 1, xi (1 ≤ i ≤ n), xi xj (i < j), xi xj xk (i < j < k), · · · , x1 x2 · · · xn }   (158)

form a linearly independent set of functions. This is indeed the case (the somewhat tedious details142 are given below),
and therefore the representation given by equations (155)–(157) can completely match any possible binary distribution (144)
[9, 10].
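The linear independence claim can also be checked numerically for small n by evaluating all K = 2^n monomials at all K states and verifying that the resulting K × K matrix is nonsingular; the θ-parameters of any positive distribution then follow by solving a linear system. A sketch:

```python
import itertools
import numpy as np

n = 3
states = list(itertools.product([-1, 1], repeat=n))          # the K = 2^n states
index_sets = [S for r in range(n + 1)
              for S in itertools.combinations(range(n), r)]  # monomial index sets

# A[s, m] = value of monomial m (product of x_i over its index set) at state s;
# the empty index set gives the constant monomial 1.
A = np.array([[np.prod([x[i] for i in S]) for S in index_sets] for x in states],
             dtype=float)
assert np.linalg.matrix_rank(A) == 2**n   # the monomials are linearly independent

# Unique theta-coordinates of (1/beta) ln P for an arbitrary positive distribution
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(2**n))          # a random positive distribution
theta = np.linalg.solve(A, np.log(P))
assert np.allclose(A @ theta, np.log(P))
```

Full rank of A is exactly the statement that the K linear equations in the K θ-parameters have a unique solution for every positive P.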
and let the parameterized distribution P (X; ϑ) be the representation given by Equations (155)–(157). We call the fully parameterized representation (155)–(157) for the true K-parameter binary categorical distribution P of Eq. (144) the General
Log-Linear Binary distribution model for the true categorical distribution P (x):
xi cannot be directly applied to the case where xi is discrete. If someone is aware of a published proof, or an outline of a proof, let me know and I’ll add the
reference to this note.
for all
f (·) ∈ F = {f | f : X → R} . (161)
In words, every function of X = (x1 , · · · , xn )T can be represented by an n-th order polynomial of partial degree one with
respect to each component variable xi , i = 1, · · · , n. Note that the space of functions F is finite-dimensional,

dim(F) = |B| = K = 2^n .
Note, in particular, that the elements of B can be used to represent the log-probability of any positive distribution over X by
taking the function f ∈ F to satisfy

f (X) = −(1/β) ln P (X) ∈ R .   (162)
Note that f (X) constrained as in Eq. (162) means that f belongs to a submanifold of F which is defined by the constraint
that the probability P must sum to one, a requirement that imposes a constraint condition between θ0 and the remaining
components of the parameter vector θ in the representation (160).
It is evident that a rational way to approximate the true K-probabilities distribution (144) for a binary components cate-
gorical vector X is to simplify the GLLB distribution P (x; ϑ) by setting some of the component parameters of ϑ identically
equal to zero in order to selectively remove higher-order terms from the full expansion shown in (159). When we do this, we
will say we are working with a reduced parameter vector and log-linear distribution, which we still denote by ϑ and P (x; ϑ)
respectively, allowing context to disambiguate the situation.144 Three such approximations that one can use are:
Number of parameters = C(n, 0) + C(n, 1) = 1 + n.

Number of parameters = C(n, 0) + C(n, 1) + C(n, 2) = 1 + n + n(n − 1)/2 = (n^2 + n + 2)/2.
143 The parameter β is known and fixed to be any positive real number.
144 Such models are known as non-saturated or unsaturated log-linear models in the literature on the analysis of categorical variables [2].
Number of parameters = C(n, 0) + C(n, 1) + C(n, 2) + C(n, 3) = (n^2 + n + 2)/2 + n(n − 1)(n − 2)/6.
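The three truncation levels and the full expansion have parameter counts that are quickly confirmed with binomial coefficients:

```python
from math import comb

n = 10
assert comb(n, 0) + comb(n, 1) == 1 + n                                # first order
assert comb(n, 0) + comb(n, 1) + comb(n, 2) == (n * n + n + 2) // 2   # second order
assert (comb(n, 0) + comb(n, 1) + comb(n, 2) + comb(n, 3)
        == (n * n + n + 2) // 2 + n * (n - 1) * (n - 2) // 6)          # third order
assert sum(comb(n, r) for r in range(n + 1)) == 2**n                   # full GLLB
```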
To show that (158) forms a linearly independent set we need to demonstrate that

0 = θ0 + Σ_{i=1}^{n} θi xi + Σ_{i<j} θij xi xj + Σ_{i<j<k} θijk xi xj xk + · · · + Σ_{i1 <i2 <···<in−1} θi1 i2 ···in−1 xi1 xi2 · · · xin−1 + θ12···n x1 x2 · · · xn   (167)

if and only if all of the K = 2^n coefficients are zero, θ0 = θ1 = · · · = θ12···n = 0. If we assemble all of the coefficients in
equation (167) into a vector θ ∈ R^K , K = 2^n , then the condition for independence of the terms in B can be succinctly stated
as the requirement that (167) holds if and only if θ = 0.
It will be easier to work with the binary variables Z = (z1 , · · · , zn )T ,

zi = (xi + 1)/2 ∈ {0, 1}.
Using the equivalent transformation xi = 2zi − 1, Eq. (167) is transformed to the equation

0 = α0 + Σ_{i=1}^{n} αi zi + Σ_{i<j} αij zi zj + Σ_{i<j<k} αijk zi zj zk + · · · + Σ_{i1 <i2 <···<in−1} αi1 i2 ···in−1 zi1 zi2 · · · zin−1 + α12···n z1 z2 · · · zn   (168)

If we assemble the K = 2^n coefficients of (168) into a vector α ∈ R^K , then the terms in Z are linearly independent provided
that equation (168) is true if and only if α = 0.
LEMMA
The elements of the set B of (158) are linearly independent if and only if the elements of the set Z of (169) are
linearly independent.
PROOF
One way to proceed is to show that the linear, invertible mapping xi = 2zi − 1 induces a linear, invertible mapping
between the collective elements of Z and X , which therefore guarantees the equivalence (i.e., an “if and only if”
relationship) of linear independence of the elements of B and linear independence of Z. However, we use a
different approach based on showing that

α = 0 ⇐⇒ θ = 0   (170)

for α of Eq. (168) and θ of Eq. (167), where the right hand sides of (168) and (167) are identical functions related
by the transformation x = 2z − e, as this then implies

Eq. (167) ⇐⇒ Eq. (168) ⇐⇒ α = 0 ⇐⇒ θ = 0 ,

where the first equivalence follows from the transformation x = 2z − e, the span “Eq. (168) ⇐⇒ α = 0” states
that the elements of Z are linearly independent, and the span “Eq. (167) ⇐⇒ θ = 0” states that the elements
of B are linearly independent.

Note that then “Eq. (168) ⇐⇒ α = 0” if and only if “Eq. (167) ⇐⇒ θ = 0”, which is the statement of the
lemma.
To show the validity of (170), we sequentially make substitutions xi = 2zi − 1 beginning with the last term in
Eq. (167) and moving to the next-to-last term, etc., until all terms in (167) have been exhausted. At the same
time that these substitutions are iteratively being made, we continually apply the consequences of the previous
iterations.
First Step. Making the substitution xi = 2zi − 1 in the last term of Eq. (167) results in

α12···n = 2^n θ12···n ,

which implies α12···n = 0 ⇐⇒ θ12···n = 0.
Second Step. Now making the substitution xi = 2zi − 1 in the penultimate term of Eq. (167) and using the
result of our previous step yields

αi1 i2 ···in−1 = 2^{n−1} θi1 i2 ···in−1 − (term proportional to θ12···n ) = 2^{n−1} θi1 i2 ···in−1   for all i1 < i2 < · · · < in−1 ,

where the θ12···n term vanishes because the previously considered parameter has been set to zero. This implies

αi1 i2 ···in−1 = 0 ⇐⇒ θi1 i2 ···in−1 = 0   for all i1 < i2 < · · · < in−1 .
Third Step. Progressing to the third-to-last term, making the substitution xi = 2zi − 1 and using the results of
our previous steps, setting the values of the previously considered parameters to zero, gives

αi1 i2 ···in−2 = 2^{n−2} θi1 i2 ···in−2 + (terms proportional to θi1 i2 ···in−2 ;k and θ12···n ) = 2^{n−2} θi1 i2 ···in−2

for all i1 < i2 < · · · < in−2 , where θi1 i2 ···in−2 ;k denotes an order-(n − 1) parameter whose index set contains
{i1 , · · · , in−2 } together with one additional index k. This gives, assuming that the previously considered
parameters are set to zero,

αi1 i2 ···in−2 = 0 ⇐⇒ θi1 i2 ···in−2 = 0   for all i1 < i2 < · · · < in−2 .
k-th Step. Continuing in this manner, at the k-th-to-last term use the results of our previous steps, setting the
values of the previously considered higher-order parameters to zero, to obtain αi1 i2 ···in−k+1 = 2^{n−k+1} θi1 i2 ···in−k+1 .
This gives, assuming that the previously considered parameters are set to zero,
αi1 i2 ···in−k+1 = 0 ⇐⇒ θi1 i2 ···in−k+1 = 0 for all i1 < i2 < · · · < in−k+1 .
Last Step. Finally, at the (k = n + 1)-st stage, with i0 = 0, using the results of the previous stages and
assuming that the previously considered parameters are all set to zero, we have

α0 = θ0 , so that α0 = 0 ⇐⇒ θ0 = 0.
With the Lemma proved, it remains to show that the terms in Z are linearly independent, i.e., that Eq. (168) holds if and only
if α = 0. Since sufficiency is trivial, we must show that Eq. (168) implies α = 0, where zi ∈ {0, 1}, i = 1, · · · , n. We do
this by proceeding recursively on the right-hand-side of (168), beginning with the first term α0 and proceeding term-by-term
to the last term α12···n z1 z2 · · · zn . Note that at each step we use the results determined in the previous steps.

Step 0. Set all components to zero, z1 = · · · = zn = 0. Eq. (168) then implies α0 = 0.

Step 1. For each i, set zi = 1 and set all other components to zero. Together with Eq. (168) and the result of the
previous step, this implies αi = 0 for i = 1, · · · , n.

Step 2. For each i and j which are pairwise distinct (i.e., i ≠ j), and ordered as i < j, set zi = zj = 1 and set
all other components to zero. Together with Eq. (168) and the results of the previous steps, this implies

αij = 0, for i, j = 1, · · · , n, i < j.
Step 3. For each pairwise distinct triple i, j, k, ordered as i < j < k set zi = zj = zk = 1 and set all other
components to zero. Together with Eq. (168) and the results of the previous steps, this implies αijk = 0, for
i, j, k = 1, · · · , n, i < j < k.
Step k. For each pairwise distinct collection of k indices Ik = {i1 , · · · ik }, ordered as i1 < i2 < · · · < ik , set
zi` = 1 for i` ∈ Ik 6= ∅ and then set all other components to zero. Together with Eq. (168) and the results of
the previous steps, this implies αi1 ,··· ,ik = 0 for all index collections Ik = {i1 , · · · ik }.
Step n. Set all components equal to one, z1 = · · · = zn = 1. Together with Eq. (168) and the results of the previous
steps, this implies α12···n = 0.
Conclusion: α = 0.
Note that if we define I0 = ∅ in Step k, then the proof that α = 0 is summarized as:
Loop for k = 0, · · · , n;
Do Step k;
End Loop.
The multivariate Gaussian distribution (171) can be rewritten into the form

P (X) = (1/Z) e^{−E(X)} ,   E(X) = (1/2) X T W X − bT X ,   W = C −1   (172)
and, in turn, any multivariate distribution of the form (172) with W invertible can be placed in the standard Gaussian form
(171). Thus the two forms (171) and (172) are entirely equivalent, assuming the invertibility of W . Whereas C is known as
the covariance matrix, its inverse W = C −1 is known as the concentration matrix. The Gaussian distribution (171)–(172) has
many special and important properties, a few of which we list here:146
Properties of the Quadratic Exponential Continuous (Gaussian) Distribution (171) and (172)
G1. Distributional Closure Property I: Marginals of a multivariate Gaussian are Gaussian.
G2. Distributional Closure Property II: The conditional pdfs P (A|B) computed from any two disjoint subsets A and B of
the elements of a Gaussian vector X are Gaussian.
G3. The Gaussian pdf is the maximum (differential) Shannon entropy pdf subject to constraints on the first two moments of
X. W is a matrix of Lagrange multipliers, which determines C = W −1 .
G4. C = W −1 is the covariance matrix. In particular, C is symmetric and positive definite.
G5. xi ⊥⊥ xj ⇐⇒ cij = [C]ij = 0.

G7. xi ⊥⊥ xj | (X − xi ei − xj ej ) ⇐⇒ wij = [W ]ij = 0.
It is interesting to ask which of these properties, or analogies of these properties, if any, also hold for the Quadratic
Exponential Binary (QEB) distribution (164) which describes the behavior of a categorical random vector X with binary
components. As we have noted, the QEB distribution is equivalent to the Ising (aka Boltzmann Machine) distribution,
P (X) = (1/Z) e^{−βE(X)} ,   E(X) = −(1/2) X T W X − bT X .   (173)

We will list the properties of the Quadratic Exponential Binary (QEB) distribution corresponding to the Gaussian
Properties G1–G7 as B1–B7, and will discuss them point-by-point. However, relative to the properties G1–G7 of the
Gaussian distribution, many of the properties of the QEB are negative properties.
Properties of the Quadratic Exponential Binary (QEB) Distribution (164) , (166) and (173)
The negative results B1 and B2 are discussed in references [9] and [10].
B3. The QEB distribution is the maximum entropy distribution subject to constraints on the first and second moments of the
binary-components categorical vector X.

B4. There is no straightforward analogy to property G4 for the QEB distribution. W −1 does not exist because its diagonal
elements are all zero. Furthermore, although it is symmetric, in general W is not positive-semidefinite.
It is possible to modify W to a positive definite version W +dI without modifying the QEB distribution by adding a sufficiently
large positive constant d > 0,147 but W + dI in general will not have the interpretation of being the covariance matrix of X.148
B5. There is no straightforward analogy to property G5. W −1 does not exist in a meaningful way that one can identify as
being a covariance matrix.

B6. There is no straightforward analogy to property G6. Because it has zero diagonal elements, W is not positive-semidefinite.
Although, as per the discussion for property B4, one can create positive definite versions of W , the matrix W −1 would still,
in general, not be a covariance matrix, which is a necessary condition for W to be a concentration matrix [31].

B7. xi ⊥⊥ xj | (X − xi ei − xj ej ) ⇐⇒ wij = [W ]ij = 0.
Thus property G7 which holds for the Gaussian distribution is also true for the QEB distribution. This is noted in references
[10, 11] but without proof. For completeness, a proof of property B7 is given below.
An important consequence of property B7 is that it is straightforward to construct a probabilistic dependency graph for the
QEB distribution if one knows the matrix W .149 One simply creates an edge between sites i and j, viewed as vertices on a
graph, if and only if wij ≠ 0.
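A minimal sketch of this construction, using a hypothetical 3 × 3 coupling matrix W with zero diagonal:

```python
import numpy as np

W = np.array([[0.0,  1.2,  0.0],
              [1.2,  0.0, -0.7],
              [0.0, -0.7,  0.0]])   # hypothetical symmetric couplings, zero diagonal

# Edge {i, j} is present iff w_ij != 0 (property B7)
n = W.shape[0]
edges = {(i, j) for i in range(n) for j in range(i + 1, n) if W[i, j] != 0.0}
assert edges == {(0, 1), (1, 2)}    # units 0 and 2 share no direct edge
```

The absent edge {0, 2} encodes the conditional independence of x1 and x3 given x2.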
X̄ = X−ij = X − xi ei − xj ej
147 This is because xi^2 = 1 allows us to harmlessly transform the right hand side of Eq. (164) by the changes θ0 ← (θ0 − nd) and θii ← d, i = 1, · · · , n.
148 Which value of d is the correct value to use? More generally, one can make a modification W + D > 0, where D > 0 is a diagonal matrix, but which
matrix D do we use?
149 See [27] for a discussion of dependency graphs.
A necessary condition for P (xi , xj | X̄) = P (xi | X̄)P (xj | X̄) to hold is that P (xi , xj | X̄) is capable of being factored into two
functions as
If it is impossible for the joint conditional probability to have such a factorization, then xi and xj cannot be conditionally
independent. We will show that a necessary condition for the conditional probability to factor as shown in (174) is that
wij = 0. After we have shown necessity, we will then show sufficiency.
Note that
−E(X) = (1/2) X T W X + bT X = (1/2) (X̄ + xi ei + xj ej )T W (X̄ + xi ei + xj ej ) + bT (X̄ + xi ei + xj ej )
= −E(X̄) + xi h̄i + xj h̄j + wij xi xj ,

where h̄i = bi + (W X̄)i and h̄j = bj + (W X̄)j , and we have used the fact that the diagonal elements of W are zero (so the
xi^2 and xj^2 terms vanish).
Therefore,

P (X) = P (xi , xj , X̄) = (e^{−βE(X̄)} / Z) e^{β(xi h̄i + xj h̄j + wij xi xj )}

and

P (X̄) = Σ_{xi , xj ∈ {−1,+1}} P (xi , xj , X̄) = (e^{−βE(X̄)} / Z) Σ_{xi , xj ∈ {−1,+1}} e^{β(xi h̄i + xj h̄j + wij xi xj )} .
Thus,

P (xi , xj | X̄) = P (xi , xj , X̄) / P (X̄) = e^{β(xi h̄i + xj h̄j + wij xi xj )} / D ,

where, by property (154), the denominator is

D = e^{β(xi h̄i + xj h̄j + wij xi xj )} + e^{β(−xi h̄i + xj h̄j − wij xi xj )} + e^{β(xi h̄i − xj h̄j − wij xi xj )} + e^{β(−xi h̄i − xj h̄j + wij xi xj )} .

The conditional probability will factor as shown in Eq. (174) if and only if the denominator factors, which will be the case
if and only if wij = 0. Thus the vanishing of wij is a necessary condition for conditional independence of xi and xj given X̄.
resulting in
P (xi , xj | X̄) = P (xi | X̄)P (xj | X̄)
as claimed.
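Property B7 can also be checked numerically on a small Ising/QEB model. With the hypothetical couplings below, w12 = 0, so x1 and x2 should be conditionally independent given the remaining unit:

```python
import itertools
import numpy as np

beta = 0.8
W = np.array([[0.0,  0.0,  0.9],
              [0.0,  0.0, -0.4],
              [0.9, -0.4,  0.0]])   # w_12 = 0: no direct coupling between units 1, 2
b = np.array([0.3, -0.2, 0.1])

states = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
logits = beta * (0.5 * np.einsum('ki,ij,kj->k', states, W, states) + states @ b)
p = np.exp(logits - logits.max()); p /= p.sum()

def prob(x):
    """Probability of the full state vector x."""
    return p[np.all(states == np.asarray(x), axis=1)][0]

# Verify P(x1, x2 | x3) = P(x1 | x3) P(x2 | x3) for all state values
for x3 in (-1.0, 1.0):
    z = sum(prob((a, c, x3)) for a in (-1.0, 1.0) for c in (-1.0, 1.0))
    for x1 in (-1.0, 1.0):
        for x2 in (-1.0, 1.0):
            joint = prob((x1, x2, x3)) / z
            m1 = sum(prob((x1, c, x3)) for c in (-1.0, 1.0)) / z
            m2 = sum(prob((a, x2, x3)) for a in (-1.0, 1.0)) / z
            assert abs(joint - m1 * m2) < 1e-12
```

Setting the (1, 3) or (2, 3) coupling to zero instead would make the corresponding pair conditionally independent, mirroring the dependency-graph construction above.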
where
XG = XC1 ∪ · · · ∪ XCc .
In practice, both procedures are exploited for most models of interest.
References
[1] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. “A Learning Algorithm for Boltzmann Machines”.
Cognitive Science, 9(1):147–169, 1985.
[2] A. Agresti. Categorical Data Analysis. Wiley-Interscience, second edition, 2002.
[3] Daniel J Amit. Modeling Brain Function: The World of Attractor Neural Networks. Cambridge University Press, 1992.
[4] P. Billingsley. Probability and Measure. Wiley-Interscience, 3rd edition, 1995.
[5] P. Brémaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer, 1999.
[6] Herbert B Callen. Thermodynamics and an Introduction to Thermostatistics. John Wiley & Sons, 2nd edition, 1985.
[7] I.A. Cosma and L. Evers. Markov Chains and Monte Carlo Methods – Lecture Notes. AIMS - African Institute for
Mathematical Sciences, 2010.
[8] T.H. Cover and J.A. Thomas. Elements of Information Theory. Wiley, second edition, 2006.
[9] D.R. Cox. “The Analysis of Multivariate Binary Data”. Journal of the Royal Statistical Society, Series C (Applied
Statistics), 21(2):113–120, 1972.
[10] D.R. Cox and N. Wermuth. “A Note on the Quadratic Exponential Binary Distribution”. Biometrika, 81(2):403–408,
1994.
[11] D.R. Cox and N. Wermuth. “On Some Models for Multivariate Binary Variables Parallel in Complexity with the Multivariate Gaussian Distribution”. Biometrika, 89(2):462–469, 2002.
[12] P.J. Dhrymes. Mathematics for Econometrics. Springer, 4th edition, 2013.
[13] S. Fienberg. The Analysis of Cross-Classified Categorical Data. Springer, second edition, 2007.
[14] G.D. Forney Jr. “The Viterbi Algorithm”. Proceedings of the IEEE, 61(3):268–278, March 1973.
[15] R.J. Glauber. “Time-Dependent Statistics of the Ising Model”. Journal of Mathematical Physics, 4(2):294–307, 1963.
[16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[17] C. Gourieroux and A. Monfort. Statistics & Econometric Models – Volume One. Cambridge University Press, 1995.
[18] C. Gourieroux and A. Monfort. Statistics & Econometric Models – Volume Two. Cambridge University Press, 1996.
[19] G. Grimmett. Probability on Graphs: Random Processes on Graphs and Lattices. Cambridge University Press, 2011.
[20] S. Haykin. Neural Networks and Learning Machines. Prentice Hall, 3rd edition, 2008.
[21] John Hertz, Anders Krogh, and Richard G Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley,
1991.
[22] Geoffrey E Hinton and Terrence J Sejnowski. “Optimal Perceptual Inference”. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 448–453. IEEE, New York, 1983.
[23] Geoffrey E Hinton and Terrence J Sejnowski. “Learning and relearning in Boltzmann machines”. Parallel Distributed
Processing, 1, 1986.
[24] Kerson Huang. Statistical Mechanics. Wiley, 1987.
[25] S. Kirkpatrick, C. D. Gelatt Jr., and M.P. Vecchi. “Optimization by Simulated Annealing”. Science, 220(4598):671–680,
1983.
[26] K. Kreutz-Delgado. “Real Vector Derivatives, Gradients & Nonlinear Least-Squares”. Lecture Notes - Report Number
ECE275A-LS2-F17v1.0, 2017.
[27] S.L. Lauritzen. Graphical Models. Claredon Press, 1996.
[28] D.J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
[29] M. Nerlove and S.J. Press. Univariate and Multivariate Log-Linear and Logistic Models. Technical report, Rand Corpo-
ration Technical Report R-1306-EDA/NI, 1973.
[30] J.R. Norris. Markov Chains. Cambridge University Press, 1998.
[34] G. Tkacik, E. Schneidman, M.J. Berry II, and W. Bialek. “Spin Glass Models for a Network of Real Neurons”.
arXiv:0912.5409v1, 2009.
[35] L. Younes. “Synchronous Boltzmann Machines can be Universal Approximators”. Applied Mathematics Letters, 9(3):109–113, 1996.