Ken Kreutz–Delgado
Electrical and Computer Engineering
Irwin and Joan Jacobs School of Engineering
University of California, San Diego
Contents
1 Introduction 2
Stochastic Neural Networks — K. Kreutz-Delgado — Version PRL-SNNIM-2017.v2.0c 2
1 Introduction
Drawing from basic concepts of Equilibrium Statistical Mechanics and Thermodynamics [6, 24]; the classic textbooks on
spin-glass model-based neural networks by Amit [3] and by Hertz, Krogh and Palmer [21]; and the theory of finite-state
Markov Chains and Markov Chain Monte-Carlo (MCMC) stochastic sampling [30, 5]; a summary is given of the basic theory
of stochastic binary artificial neural networks.
We show that the Ising distribution can be developed 1) from statistical mechanics; 2) from entropy maximization; and 3)
as an approximation to the probabilities fully describing a binary categorical random vector.
Consider a physical system composed of a large number of particles subject to mutual interaction forces, each of which can be in a finite number of individual particle states.
2 P(X) is the algebra of events associated with a measurement of X.
have $P(X) = P_X(X) =$ probability that $X = X \in \mathcal{X} = \{X^{(1)}, \cdots, X^{(K)}\}$, so that $P(X)$ takes its values in the finite set
$\{p_1, \cdots, p_K\}$ where $p_i = P(X^{(i)})$. Therefore $P(X)$ can be written as
$$ P(X) = \sum_{i=1}^{K} p_i \, \delta_{X,\, X^{(i)}} . $$
$$ P(X) = \frac{e^{-\beta E(X)}}{Z} > 0, \qquad \beta = \frac{1}{kT} > 0 \tag{1} $$
where it is assumed that the partition function $Z = Z(\beta)$ is a finite-valued normalization factor
$$ Z = \sum_{i=1}^{K} e^{-\beta E(X^{(i)})} < \infty . $$
Note that a Boltzmann distribution is positive, P (X) > 0, for all states X ∈ X under our assumption that the energy is finite.
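As a concrete numerical check of these properties, here is a minimal Python sketch (the four energy values and the choice of β are hypothetical) that builds a Boltzmann distribution over a small finite state space and verifies normalization, positivity, and the preference for low-energy states:

```python
import math

def boltzmann(energies, beta=1.0):
    """Boltzmann probabilities p_i = exp(-beta * E_i) / Z over a finite state space."""
    weights = [math.exp(-beta * E) for E in energies]
    Z = sum(weights)                   # partition function Z(beta)
    return [w / Z for w in weights], Z

# Hypothetical 4-state system with finite energies E(X^(i)).
probs, Z = boltzmann([0.0, 1.0, 2.0, 3.0], beta=0.5)
assert abs(sum(probs) - 1.0) < 1e-12   # normalization
assert all(p > 0 for p in probs)       # positivity: P(X) > 0 when energies are finite
assert probs[0] > probs[-1]            # lower energy => higher probability
```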
In the equilibrium statistical mechanical formalism used to describe a many-particle physical system, the energy E(X) is
a well-defined physical quantity and the Boltzmann distribution arises as a consequence of the theory. For real-world physical
systems k is set equal to the Boltzmann constant, an important physical parameter that relates temperature to thermal energy. In
this note, where we are exploiting the mathematical model of the Boltzmann distribution for information processing purposes,
we can set k = 1 and refer to T as the pseudo-temperature.4
Several properties of the Boltzmann distribution are derived and described in Appendices B and C. In particular:
• The Boltzmann distribution is the maximum entropy (MaxEnt) distribution given the constraint that the system has a
fixed known average energy. (See Appendix C.2.)
• The Boltzmann distribution is the minimum free energy distribution. (See Appendix C.1.)
$$ P(X) = \frac{e^{-\beta E(X)}}{Z}, \qquad E(X) = -\frac{1}{2} X^T W X - b^T X \tag{2} $$
where the elements of the n × n matrix W = W T and the n × 1 vector b are continuous and real. This distribution is the
Boltzmann Distribution with Quadratic Energy (BDQE).
The quadratic form (2) can arise from invoking physical models, as is done in Section 3 below for the Ising spin glass
model. Alternatively, one can justify the use of the quadratic-energy Boltzmann distribution by invoking the following property:
• The Boltzmann Distribution with Quadratic Energy (2) is the maximum entropy (MaxEnt) distribution subject to con-
straints on the first and second moments of X. (See Appendix C.3.)
3 This distribution is also variously known as the Boltzmann-Gibbs distribution, the Gibbs-Boltzmann distribution, the Gibbs distribution, or the canonical
distribution.
4 I.e., in this note we are looking at the properties of stochastic artificial neural networks (ANNs), not the properties of a true physical system where the
Boltzmann constant would have physical relevance. We will usually drop the adjective “artificial”, with the understanding that we are not concerned with
modeling the behavior of a true physical system and, consequently, we have agreed to set k = 1 throughout the note.
The Boltzmann distribution with quadratic energy (2) also arises in the theory of categorical random vectors, where it is a
(multivariate) quadratic exponential categorical (QEC) distribution.5 Note that it is the discrete random variable analogue of
the Gaussian distribution used to model the behavior of continuous random variables.6
Thus the use of the Boltzmann distribution (1) with quadratic energy (2) as a viable stochastic model can be motivated from a
variety of perspectives, including arguments drawn from the following areas:
• Physics, invoking equilibrium statistical mechanics (as discussed in the body of this note).
• Information Theory, invoking the principle of maximum entropy (see Appendix C.3).
• Statistics, invoking the theory of multivariate categorical variables (see Appendix D).
• Statistics, invoking the theory of probabilistic graphical models (see Appendix F).
that each encode a vector of realized site values. We call the random vector X the state of the random field where the
components of the random vector X correspond to finite random variables on the sites of the random field. The random field
perspective is used when the sites i, and the variables xi attached to the sites, are each associated with a fixed physical location
in space.8
A (finite-site) random field has n sites (or nodes, or vertices) and to each site i is associated a random variable $x_i$ that takes
a finite number, m, of possible realization values $x_i = x_i \in X = \{\xi^{(1)}, \cdots, \xi^{(m)}\}$, $|X| = m$, with $\xi^{(j)} \neq 0$.9 Note that
$X_j^{(\ell)} = \xi^{(j)}$ is the j-th component of a realization $X = X^{(\ell)}$ of the random vector X. The random vector X denotes a random
configuration state of the random field as
$$ X = (x_1, \cdots, x_n)^T . $$
X takes realization values
$$ X = X = (x_1, \cdots, x_n)^T \in \mathcal{X} \subset \mathbb{R}^n $$
in the state space
$$ \mathcal{X} = \underbrace{X \times \cdots \times X}_{n \text{ times}} = X^n \quad \text{with} \quad K = |\mathcal{X}| = |X^n| = |X|^n = m^n . $$
Summarizing, to each site i is associated a particle where xi denotes the random i-th particle-state value. Thus the state X
denotes the random configuration of a system of n interacting particles. Note that even in the simplest nontrivial case where
each $x_i$ is binary, m = 2, the size of $K = |\mathcal{X}|$, i.e., the number of possible configurations of the random field, quickly becomes
astronomical as the number of particles, n, increases in size.10
The system of interacting particles with random state vector X is assumed to have a constant number of particles, n, a
constant volume, and to be in thermal equilibrium with a heat bath at temperature T . Heat can flow between the system and
the heat bath, but no mechanical energy (work) or particles can be exchanged between them. Thus, according to the theory
of equilibrium statistical mechanics [6, 24], a Boltzmann distribution describes the probability that the multi-particle system
described by the random vector X = (x1 , · · · , xn )T is in a particular realization state X = X,
$$ P(X) = \frac{e^{-\beta E(X)}}{Z} > 0, \qquad \beta = \frac{1}{kT} > 0 $$
5 See Appendix D.
6 See Appendix E and the discussion in [9, 10].
7 This is a change in perspective from component particles being free to move around (as for a gas), to particles that are now viewed as being pinned to a
specific “site”.
8 For example, the fixed locations of a network of physical spin-1/2 particles attached to sites on a crystalline solid, or the immobile neurons in an artificial brain.
9 Setting $x_i \equiv 0$ encodes the removal of site i from consideration, as discussed below. Note that we have the same m realization values for each site i. Also note that the case m = 2 was discussed in Section 3.
10 E.g., m = 2 for a binary image and n = 10,000 for a 100 × 100 pixel grid gives $K = 2^{10{,}000} = (2^{10})^{1{,}000} \approx 10^{3{,}000}$. To get a sense of the vastness of this number, note that the number of elementary particles in the universe is estimated to be about $10^{86}$, while the age of the universe in nanoseconds is estimated to be less than $10^{27}$.
where E(X) is the energy of the system when it is in the realized configuration X = X. Again note that a Boltzmann distribution
is positive, P(X) > 0, for all realized state vectors X under the assumption that the energy is finite, $|E(X)| < \infty$ for all $X \in \mathcal{X}$.
The partition function $Z = Z(\beta)$ is a finite normalization factor
$$ Z = \sum_{X \in \mathcal{X}} e^{-\beta E(X)} < \infty . $$
Because of the frequently astronomical size of the value of K, the computation of the partition function Z is usually
intractable, precluding the determination of the Boltzmann distribution in closed form. For this reason, and others, Markov
Chain Monte Carlo stochastic sampling techniques are frequently resorted to, as discussed in Section 4 below.
If we limit the site values (components of X) to be binary,11 |X| = m = 2, with realization values xi = ±1, the resulting
random field model is known as a spin glass model. If in addition the energy function is a quadratic function of the form (2),
$$ E(X) = -\frac{1}{2} X^T W X - X^T b \tag{3} $$
where the n × n matrix W = W T and the n × 1 vector b are real and continuous, then we have the Ising spin glass model
and the Ising distribution describing the behavior of a magnetic particle system [24]. As we elaborate below, the binary-sites
Ising spin glass model is used to model stochastic binary node neural networks, resulting in the Boltzmann Machine (BM)
distribution [1, 21].
The Ising Model is the Boltzmann Distribution with Quadratic Energy (BDQE, see Appendix C.3) restricted to the state
vector X having binary components, which was mentioned at the end of Section 2.1. The binary-components Boltzmann distribution
with quadratic energy function is also known in the statistics literature as the quadratic exponential binary distribution
[10].
[24]. We will use the terminology “Ising Model” and “Boltzmann Machine” interchangeably.
13 $x_i$ denotes the i-th spin random variable, $x_i = x_i$ denotes an unknown realization value for spin particle i, while stating that $x_i = +1$ indicates a known
realization value of +1. The particles can all be simultaneously measured, i.e., the vector X = x can be instantaneously measured. If the system is quantum
mechanical, prior to a measurement the particles can be in an entangled state.
Note that this pairwise, bilinear interaction model forces the symmetry condition wij = wji . We take the energy of self-
interaction to be zero by setting wii = 0, for all i.14
In addition, each particle i feels the local (vertical) field intensity $b_i$ of an external (vertical) magnetic field, resulting in an energy contribution $-b_i x_i$.
Summing up these energies for a system of n spin-1/2 magnetic particles yields the total energy
$$ E(X) = -\sum_{\substack{i,j=1 \\ i<j}}^{n} w_{ij}\, x_i x_j - \sum_{i=1}^{n} b_i x_i = -\frac{1}{2} \sum_{i,j=1}^{n} w_{ij}\, x_i x_j - \sum_{i=1}^{n} x_i b_i $$
using the facts that $w_{ij} = w_{ji}$ and $w_{ii} = 0$. Note that we can write the total energy in vector-matrix form as the quadratic form
$$ \text{Total Energy for the Ising Model} = E(X) = -\frac{1}{2} X^T W X - X^T b \tag{4} $$
with
$$ W = W^T \quad \text{and} \quad \operatorname{diag}(W) = 0 . $$
The local magnetic field at particle site i is denoted by15
$$ h_i = h_i(X) = \sum_{j=1}^{n} w_{ij}\, x_j + b_i , \qquad i = 1, \cdots, n . $$
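The quadratic Ising energy (4) and the local field $h_i$ can be sketched in a few lines of Python (the random couplings W, biases b, and system size below are hypothetical); the final assertion checks that, because $w_{ii} = 0$, flipping spin i changes the total energy by exactly $2 x_i h_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.normal(size=(n, n)); W = (W + W.T) / 2   # enforce symmetry w_ij = w_ji
np.fill_diagonal(W, 0.0)                          # no self-interaction: w_ii = 0
b = rng.normal(size=n)

def energy(x):                    # E(X) = -1/2 X^T W X - X^T b
    return -0.5 * x @ W @ x - b @ x

def local_field(x, i):            # h_i(X) = sum_j w_ij x_j + b_i  (w_ii = 0)
    return W[i] @ x + b[i]

x = rng.choice([-1.0, 1.0], size=n)
i = 2
x_flip = x.copy(); x_flip[i] = -x_flip[i]
# Because w_ii = 0, h_i does not depend on x_i, and flipping spin i
# changes the total energy by 2 x_i h_i:
assert np.isclose(energy(x_flip) - energy(x), 2 * x[i] * local_field(x, i))
```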
$$ \pi(X) = \frac{e^{-\beta E(X)}}{Z} \tag{5} $$
where $E : \mathcal{X} \to \mathbb{R}$ is an energy function (or potential) defined on $\mathcal{X}$ that is assumed to be finite-valued.
Because typically $|\mathcal{X}| = K \gg 1$ (see footnote 10), it is generally intractable to compute the partition function Z and/or
draw a sample X = X from the distribution π. For this reason, we will construct a homogeneous (i.e., time-independent)
14 This can be done with no loss of generality because $x_i^2 \equiv 1$, so that $w_{ii} x_i^2 = w_{ii}$ is a constant that merely changes the energy baseline ground state value.
15 Because $w_{ii} = 0$, we actually have $h_i(X) = h_i(X_{-i})$, where $X_{-i}$ denotes the fact that particle i is ignored. This is discussed later below.
16 Here we need the entire realization vector X.
17 Note that $h_i(X) = 0$ defines a separating hyperplane in $\mathbb{R}^n$ [20].
18 Note that this has nothing to do with equilibrium statistical mechanics, but is a purely mathematical fact due to the fact that π is positive. For specified, arbitrary choices of a reference state $X_0$ and a value for $E(X_0)$, one takes $E(X) = E(X_0) - \frac{1}{\beta} \ln \frac{\pi(X)}{\pi(X_0)}$. See Property 1 of Appendix B.
discrete-time finite-state Markov chain which has π as its asymptotic, equilibrium (steady-state) distribution. This type
of dynamic stochastic sampling procedure is known as Markov chain stochastic sampling or Markov Chain Monte Carlo
(MCMC) sampling [30, 5].
A homogeneous discrete-time Markov Chain on K = |X| states is defined by a K × K stochastic matrix P whose
components are independent of time,
To prevent confusion, we denote conditioning with respect to a previous time-step random variable using a double vertical
stroke.19 Note that a realization in the first argument of the homogeneous transition probability $P(\cdot \,\|\, \cdot)$ is an event that, in
time, immediately follows a realization in the second argument.
At time k = 0, the Markov chain is initialized to a probability $\pi_0 = \lambda > 0$, $e^T \lambda = 1$, and then subsequently evolves
according to20,21
$$ \pi_{k+1}(X') = \sum_{X \in \mathcal{X}} P(X' \,\|\, X)\, \pi_k(X) \quad \text{with} \quad X', X \in \mathcal{X}, \ |\mathcal{X}| = K . $$
$$ \text{Balance:} \qquad \pi = \mathbf{P} \pi \tag{6} $$
I.e., the desired distribution π is an invariant distribution for the transition matrix $\mathbf{P}$.
If these three conditions hold, then π is the unique invariant distribution for P and limk→∞ πk = π independently of the choice
of the initial distribution π0 = λ. In this case π is the unique equilibrium distribution.
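The evolution $\pi_{k+1} = \mathbf{P}\pi_k$ and its convergence to the unique invariant distribution, independently of $\pi_0$, can be illustrated numerically. In this sketch the 3-state transition matrix is hypothetical and is stored column-stochastically so that $\pi_{k+1} = \mathbf{P}\pi_k$ matches the convention of the text:

```python
import numpy as np

# Hypothetical 3-state chain, stored column-stochastically: column j holds
# the probabilities P(. || j), so that pi_{k+1} = P pi_k.
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.3],
              [0.2, 0.2, 0.4]])
assert np.allclose(P.sum(axis=0), 1.0)

pi  = np.array([1.0, 0.0, 0.0])     # initial distribution pi_0 = lambda
pi2 = np.array([0.0, 0.0, 1.0])     # a different initial distribution
for _ in range(200):                # Chapman-Kolmogorov evolution
    pi, pi2 = P @ pi, P @ pi2

assert np.allclose(P @ pi, pi, atol=1e-10)   # balance: pi = P pi
assert np.allclose(pi, pi2, atol=1e-8)       # limit independent of pi_0
```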
For a large state space, $K \gg 1$, and/or a complicated form for π, it can be difficult to find a transition matrix P that satisfies
the Balance Condition (6). For this reason it is usually the case that one replaces the Balance Condition (6) with the stronger
condition of Detailed Balance.
To do so, note that the components-level statement of balance,
$$ \text{Balance:} \qquad \pi(X') = \sum_{X \in \mathcal{X}} P(X' \,\|\, X)\, \pi(X) \qquad \forall\, X, X' \in \mathcal{X}, \tag{7} $$
19 With the single vertical stroke denoting same-time-step conditioning. Thus P ( · k · ) refers to a transition probability while P ( · ) and P ( · | · ) do not.
20 This is known as the Chapman-Kolmogorov Equation. The proof is straightforward:
$$ \sum_{X \in \mathcal{X}} P(X' \,\|\, X)\, \pi_k(X) = \sum_{X \in \mathcal{X}} P(X_{k+1} = X' \,\|\, X_k = X)\, P(X_k = X) = \sum_{X \in \mathcal{X}} P(X_{k+1} = X', X_k = X) = P(X_{k+1} = X') = \pi_{k+1}(X') $$
21 Note that this sum, similarly to the sum needed to compute the partition function, is generally over a vast number $|\mathcal{X}| = K$ of terms. Fortunately we will not have to compute these sums, but instead will simulate the dynamical behavior of the Markov chain, which will be a tractable computation.
22 See [30, 5] for rigorous treatments. A nice summary description of these conditions can be found in [20]. A self-contained and very readable set of
can be written as
$$ 0 = \underbrace{\Big( \sum_{X \in \mathcal{X}} P(X \,\|\, X') \Big)}_{=1} \pi(X') - \sum_{X \in \mathcal{X}} P(X' \,\|\, X)\, \pi(X) = \sum_{X \in \mathcal{X}} \underbrace{\Big[ P(X \,\|\, X')\, \pi(X') - P(X' \,\|\, X)\, \pi(X) \Big]}_{\triangleq\, T(X,\, X')} . $$
Therefore a sufficient condition for balance between π and P to hold (i.e., for π to be an equilibrium distribution for P as
required by the Balance Condition C3) is that $T(X, X') = 0$ for all $X, X' \in \mathcal{X}$. This yields the Detailed Balance Condition:
C3′ The desired asymptotic distribution π and the transition matrix P satisfy the condition of
$$ \text{Detailed Balance:} \qquad P(X \,\|\, X')\, \pi(X') = P(X' \,\|\, X)\, \pi(X) \qquad \forall\, X, X' \in \mathcal{X} \tag{8} $$
which is a sufficient condition for π and P to be in balance (i.e., a sufficient condition for π to be an invariant distribution
for P).
Note that if $P(X \,\|\, X') \neq 0$, then the Detailed Balance Condition C3′ can be written as23
$$ \text{Detailed Balance:} \qquad \frac{P(X' \,\|\, X)}{P(X \,\|\, X')} = \frac{\pi(X')}{\pi(X)} \qquad \text{for } P(X \,\|\, X') \neq 0 \tag{9} $$
The Detailed Balance Condition is a symmetry condition that requires the rate of the transition X → X′ to be equal to the
rate of the transition in the reverse direction, X′ → X. Because so many references merely state, without explanation, that the
Convergence Conditions C1, C2 and C3′ are requirements for a Markov chain to converge to a desired equilibrium distribution
π, many people appear to be unaware that the Detailed Balance Condition C3′ is generally not a necessary condition, but
only a sufficient condition for convergence to occur (assuming that C1 and C2 hold), and can be replaced by the weaker (but
generally harder to verify) requirement of Balance as defined in Eq. (6).
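The point that detailed balance is sufficient but not necessary can be seen on a toy example. The hypothetical cyclic 3-state chain below has the uniform distribution as its invariant (balanced) distribution, yet violates detailed balance at every pair of distinct states:

```python
import numpy as np

# Hypothetical cyclic 3-state chain (column j holds P(. || j)):
# from state j move to (j+1) mod 3 with probability 0.9, stay with 0.1.
P = np.zeros((3, 3))
for j in range(3):
    P[(j + 1) % 3, j] = 0.9
    P[j, j] = 0.1
pi = np.full(3, 1.0 / 3.0)          # uniform distribution

# Balance holds: pi is an invariant distribution ...
assert np.allclose(P @ pi, pi)
# ... but detailed balance P(x'||x) pi(x) = P(x||x') pi(x') fails:
assert not np.isclose(P[1, 0] * pi[0], P[0, 1] * pi[1])
```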
Given a desired stationary distribution π there are many possible choices of the transition matrix P that satisfy the Convergence
Conditions C1, C2 and C3′. One class of choices leads to the class of Metropolis-Hastings algorithms. Another set of
choices results in the Gibbs Sampler family of stochastic sampling algorithms.24
Imposed Transition Constraint: P (X0 k X) = 0 if X and X0 differ at more than one site
note that this reduces the number of states that one can transition into from K = mn , a generally vast number, to nm.
The sites can be treated as nodes on a graph where the edges of this graph are determined from the pairwise conditional
dependencies between nodes that are encoded in π(X), and hence, equivalently, in the energy function (potential) E(X).26
The instantaneous configuration of the random vector X on this graph is denoted by the realization X = X, and sampling
23 Recall that our positivity assumption on π ensures that π(X) 6= 0.
24 The Metropolis-Hastings family includes the Gibbs Sampler family as a special case, as we show below.
25 Remember, we are considering MCMC samplers because drawing from the unconditional distribution π(X) is supposed to be hard. Unless we are
careful, why should drawing from the conditional distribution P (X k X0 ) be any easier?
26 This graph is known as a dependency graph [19]. An edge is drawn between site i and site j if the random variables $x_i$ and $x_j$ are not independent
conditioned on the remaining components of X.
one site at a time is equivalent to 1) moving from node i, with current value xi , to node j, with current value xj , followed by
2) a transition from xj to a new value x0j . Doing these two steps in some principled manner corresponds to the Markov chain
transition X → X0 .
To more easily analyze this transition, it is useful to consider the subgraph and random variable that ensue when a single
site i is removed from the full graph. To indicate this mathematically, we set $x_i \equiv 0$. The resulting induced random subgraph
on (n − 1) nodes is then associated to the random vector27
$$ X_{-i} = X - x_i e_i , \quad \text{so that} \quad X = X_{-i} + x_i e_i . \tag{10} $$
Remember that we are building the sampler to have the desired properties, so we get to choose the functional form of
$P(X' \,\|\, X)$. We do this by imposing the efficient sampling constraint
where
A common choice for $P(i \,\|\, X)$ is the random scan choice, where site i is chosen uniformly over all possible sites, independently of X:
$$ P(i \,\|\, X) = P(i) = \frac{1}{n} . \tag{18} $$
Thus the transition matrix for the random scan Gibbs Sampler is given by:
$$ P(X' \,\|\, X) = \begin{cases} 0 & \text{if } X \text{ and } X' \text{ differ at more than one site} \\[4pt] \dfrac{\pi(x'_i \,|\, X_{-i})}{n} & \text{if } X \text{ and } X' \text{ possibly differ (only) at site } i \end{cases} \tag{19} $$
The last line of (19) says that one first randomly selects a site i with uniform probability $P(i) = \frac{1}{n}$, followed by the selection
of a new realization value $x'_i$ drawn from the conditional probability $\pi(x'_i \,|\, X_{-i})$.
It is easy to verify that the Detailed Balance Condition, in either form (8) or (15), is satisfied.28 Furthermore, because all
sites are eventually selected, the entire state space forms a single communicating class, so that the Markov chain is irreducible
for reasonable forms of $\pi(x'_i \,|\, X_{-i})$. Finally, because with nonzero probability a state does not change in value during a
single time-step transition, the chain is aperiodic. Then, with Convergence Conditions C1, C2 and C3′
satisfied, we have $\lim_{k\to\infty} \pi_k = \pi$ independently of the choice of the initial distribution $\pi_0 = \lambda$.
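A minimal random-scan Gibbs sampler sketch in Python (the 3-site energy function, β, and run lengths are hypothetical choices). The single-site conditional is computed from the energies of the two completions at site i, and the empirical state frequencies, after discarding burn-in samples, are compared with the exact distribution obtained by enumeration:

```python
import math, itertools, random

random.seed(1)
beta = 0.7

def E(x):
    """Hypothetical energy on 3 binary (+/-1) sites."""
    return -0.5 * (x[0]*x[1] + x[1]*x[2]) - 0.2 * sum(x)

def gibbs_step(x):
    """One random-scan Gibbs update: pick site i uniformly, then redraw x_i
    from pi(x_i | X_{-i}), computed from the two completions' energies."""
    i = random.randrange(len(x))
    xp, xm = list(x), list(x)
    xp[i], xm[i] = +1, -1
    p_plus = 1.0 / (1.0 + math.exp(-beta * (E(xm) - E(xp))))
    x[i] = +1 if random.random() < p_plus else -1
    return x

# Exact distribution by enumeration (K = 2^3 is small enough here).
states = list(itertools.product([-1, +1], repeat=3))
w = {s: math.exp(-beta * E(s)) for s in states}
Z = sum(w.values())
exact = {s: w[s] / Z for s in states}

# Run the chain, discard burn-in, and compare empirical frequencies.
x, counts = [1, 1, 1], {s: 0 for s in states}
for k in range(60000):
    x = gibbs_step(x)
    if k >= 5000:                    # burn-in discarded
        counts[tuple(x)] += 1
N = sum(counts.values())
tv = 0.5 * sum(abs(counts[s]/N - exact[s]) for s in states)
assert tv < 0.03   # empirical distribution close to pi in total variation
```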
With this choice for $P(X' \,\|\, X)$ it is easy30 to see that the Detailed Balance Condition (8) is satisfied. If $R(X' \,\|\, X)$ is nonzero,
we can write
$$ P(X' \,\|\, X) = \left( 1 \wedge \frac{R(X \,\|\, X')\, \pi(X')}{R(X' \,\|\, X)\, \pi(X)} \right) R(X' \,\|\, X) \triangleq A(X, X')\, R(X' \,\|\, X), \tag{21} $$
where $A(X, X')$ is the acceptance probability. If $A(X, X') = 1$, then the proposal transition X → X′, determined by drawing
a sample from the proposal distribution $R(X' \,\|\, X)$, is accepted (with probability $1 = A(X, X')$). On the other hand, if
$A(X, X') < 1$, then the proposal transition X → X′ is accepted with lower probability $A(X, X') < 1$ and rejected (in which
case X → X) with probability $1 - A(X, X') > 0$.
Note that $A(X, X') = 1$ suggests that the transition probability is possibly imbalanced in favor of transitions into X from X′
(relative to the Detailed Balance condition (8)), so we keep transitions X → X′ drawn from the proposal distribution $R(X' \,\|\, X)$
in order to rectify the imbalance. On the other hand, $A(X, X') < 1$ suggests that the transition probability is imbalanced in
favor of transitions into X′, so the algorithm compensates by decreasing the probability of a transition X → X′.
Note that A(X, X0 ) ≡ 1 for all nonzero probability transitions between X and X0 means that the proposal distribution
satisfies the Detailed Balance condition (8) (i.e., there is never any imbalance), and in this case P (X0 k X) = R(X0 k X) for
all X and X0 . In particular, if we select the proposal distribution R(X0 k X) to satisfy the Gibbs Sampler conditions shown in
Eq. (19),
$$ R(X' \,\|\, X) = \begin{cases} 0 & \text{if } X \text{ and } X' \text{ differ at more than one site} \\[4pt] \dfrac{\pi(x'_i \,|\, X_{-i})}{n} & \text{if } X \text{ and } X' \text{ possibly differ (only) at site } i \end{cases} $$
Then P (X0 k X) = R(X0 k X) = 0 when X and X0 differ at more than one site, and otherwise A(X, X0 ) ≡ 1 (because, as we’ve
seen, the Gibbs Sampler transition probabilities satisfy the Detailed Balance condition (8)). Therefore P (X0 k X) = R(X0 k X)
for all X and X0 , showing that the Gibbs Sampler is a Metropolis-Hastings Algorithm.
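The construction (21) can also be checked numerically. In the sketch below (a hypothetical 4-state target π with a uniform, symmetric proposal R), the induced Metropolis-Hastings transition matrix is verified to satisfy detailed balance, and hence balance:

```python
import numpy as np

# Hypothetical target distribution pi on 4 states and a symmetric proposal R.
pi = np.array([0.1, 0.2, 0.3, 0.4])
K = len(pi)
R = np.full((K, K), 1.0 / K)          # R(x'||x) = 1/K (uniform, symmetric)

# Metropolis-Hastings: P(x'||x) = A(x,x') R(x'||x) off the diagonal, with
# acceptance A(x,x') = min(1, R(x||x') pi(x') / (R(x'||x) pi(x))).
P = np.zeros((K, K))                  # column j holds P(. || j)
for x in range(K):
    for xp in range(K):
        if xp != x:
            A = min(1.0, R[x, xp] * pi[xp] / (R[xp, x] * pi[x]))
            P[xp, x] = A * R[xp, x]
    P[x, x] = 1.0 - P[:, x].sum()     # rejection mass stays at x

# Detailed balance P(x'||x) pi(x) = P(x||x') pi(x') => pi is invariant.
for x in range(K):
    for xp in range(K):
        assert np.isclose(P[xp, x] * pi[x], P[x, xp] * pi[xp])
assert np.allclose(P @ pi, pi)
```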
28 When examining Eq. (8), note that if X differs from X0 only at site i, then X0 must differ from X at the same site i.
29 For any real numbers a and b, $a \wedge b \triangleq \min(a, b) = \min(b, a) = b \wedge a$.
30 Just use the definition (20) and the fact that a ∧ b = b ∧ a.
For example, set $f(X_k) = 1(X_k = X) = \delta_{X_k,\, X}$, so that $E_\pi\{f(X)\} = \pi(X)$. Then,
$$ \frac{1}{N} \sum_{k=0}^{N-1} \delta_{X_k,\, X} \to \pi(X) \quad \text{as } N \to \infty \quad \text{almost surely.} $$
1. First, the sample values generated during the burn-in time are not drawn from the steady-state equilibrium distribution.
To diminish this problem, one usually does not begin the averaging procedure until a time has elapsed that is (hopefully)
past the burn-in time. I.e., initial samples are thrown away and only samples beyond the burn-in time are averaged.
2. Being drawn from a Markov chain, the samples are not independent, and the resulting inter-sample correlations slow
down the convergence process, as well as make the convergence analysis more difficult. If only every ℓ-th sample is
used, for some judiciously chosen value of ℓ, then the chosen samples are approximately iid and an asymptotic analysis
based on the use of iid samples can be invoked.
3. The state space is very large and only neighboring states (differing at only one site) are sequentially visited. However,
the fact that higher probability regions of the state space are visited more frequently (so that a type of “importance
sampling” is occurring) somewhat ameliorates this problem.
$$ \pi(X) = \frac{e^{-\beta E(X)}}{Z} \quad \text{with energy} \quad E(X) = -\frac{1}{2} X^T W X - b^T X . $$
The resulting Markov chain stochastic dynamical behavior of the system of spin-1/2 magnetic particles is known as the Glauber
Dynamics, after Roy J. Glauber, the Nobelist who first proposed it in 1963 [15].
Let $E(X_{-i})$ denote the energy for the system of magnetic particles obtained by physically removing the i-th particle,
$$ E(X_{-i}) = -\frac{1}{2} X_{-i}^T W X_{-i} - b^T X_{-i} \tag{22} $$
31 Recall that C1, C2 and C3′ are sufficient for C1, C2, and C3 to hold.
32 It is common to use a singular distribution $\lambda(X) = 1(X = X_0) = \delta_{X,\, X_0}$.
and define the energy difference
$$ E_\delta(x_i) \triangleq E(X) - E(X_{-i}) . \tag{23} $$
Also note that because $w_{ii} = 0$, the local magnetic field, $h_i = h_i(X_{-i})$, at site i depends only on $X_{-i}$ and not on the value
of $x_i$, so it can equally be computed from the full state vector X:33
$$ h_i(X) = h_i(X_{-i}) = \sum_{j=1}^{n} w_{ij}\, x_j + b_i = e_i^T W X_{-i} + b_i = w_i^T X_{-i} + b_i \qquad (w_{ii} = 0) \tag{24} $$
IDENTITY34
$$ E_\delta(x_i) = -\, x_i\, h_i(X_{-i}) \tag{25} $$
Proof:
$$ \begin{aligned} E(X) = E(X_{-i} + x_i e_i) &= -\frac{1}{2} (X_{-i} + x_i e_i)^T W (X_{-i} + x_i e_i) - b^T (X_{-i} + x_i e_i) \\ &= E(X_{-i}) - \frac{1}{2} \left( X_{-i}^T W x_i e_i + x_i e_i^T W X_{-i} \right) - x_i b^T e_i \qquad \text{(using } w_{ii} = 0\text{)} \\ &= E(X_{-i}) - x_i \left( e_i^T W X_{-i} + b_i \right) = E(X_{-i}) - x_i h_i(X_{-i}) = E(X_{-i}) + E_\delta(x_i) . \end{aligned} $$
Marginalizing,
$$ \pi(X_{-i}) = \sum_{x_i = \pm 1} \pi(x_i, X_{-i}) = Z^{-1} e^{-\beta E(X_{-i})} \sum_{x_i = \pm 1} e^{-\beta E_\delta(x_i)} , $$
Note that $E_\delta(x_i) = \pm|h_i(X_{-i})|$ and that the value of $x_i$ which leads to a negative value of $E_\delta(x_i)$, i.e., to the lower of the two
possible values for $E(X) = E(X_{-i}) + E_\delta(x_i)$, has the higher conditional probability of occurring. I.e., the system “prefers” to
transition to a lower energy state.35
Using the identity $E_\delta(x_i) = -x_i h_i(X_{-i})$, we obtain
$$ \pi(x_i \,|\, X_{-i}) = \frac{e^{-\beta E_\delta(x_i)}}{\sum_{x'_i = \pm 1} e^{-\beta E_\delta(x'_i)}} = \frac{e^{\beta x_i h_i(X_{-i})}}{e^{-\beta h_i(X_{-i})} + e^{+\beta h_i(X_{-i})}} $$
transitions to lower energy states, see also the discussion in Section 5.2 below.
or36
$$ \pi(x_i = \pm 1 \,|\, X_{-i}) = \frac{e^{\pm\beta h_i(X_{-i})}}{e^{-\beta h_i(X_{-i})} + e^{+\beta h_i(X_{-i})}} = \frac{e^{\pm\beta h_i(X_{-i})}}{e^{\mp\beta h_i(X_{-i})} + e^{\pm\beta h_i(X_{-i})}} = \frac{1}{1 + e^{\mp 2\beta h_i(X_{-i})}} . \tag{29} $$
Note that in the limit β → 0 (i.e., T → ∞) it is immediately evident that
$$ \pi(x_i = +1 \,|\, X_{-i}) = \pi(x_i = -1 \,|\, X_{-i}) = \frac{1}{2} \quad \text{for } T \to \infty . \tag{30} $$
Using the logistic regression function
$$ \sigma(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}} , $$
we obtain
$$ \pi(x_i \,|\, X_{-i}) = \sigma\big( 2\beta\, h_i(X_{-i})\, x_i \big) $$
with
$$ h_i(X_{-i}) = h_i(X) = \sum_{j=1}^{n} w_{ij}\, x_j + b_i = e_i^T W X_{-i} + b_i = w_i^T X_{-i} + b_i , \tag{32} $$
or as
$$ \pi(x_i \,|\, X_{-i}) = \sigma\big( {-2\beta E_\delta(x_i)} \big) \tag{34} $$
with $E_\delta(x_i) = -x_i h_i(X_{-i})$.
The last form for π(xi | X−i ) makes it particularly clear that a higher energy configuration is a lower probability configura-
tion.37
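A quick numerical check (with hypothetical values for β and the local field $h_i$) that the Boltzmann ratio form (29) and the logistic form $\pi(x_i \,|\, X_{-i}) = \sigma(2\beta h_i(X_{-i}) x_i)$ agree, and that the two outcomes are complementary:

```python
import math

def sigma(x):                        # logistic function: sigma(x) = 1/(1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

beta, h = 0.8, 0.35                  # hypothetical inverse temperature and local field
for xi in (+1, -1):
    # Boltzmann ratio e^{beta h x_i} / (e^{beta h} + e^{-beta h}) ...
    ratio = math.exp(beta * h * xi) / (math.exp(beta * h) + math.exp(-beta * h))
    # ... equals the logistic form sigma(2 beta h x_i):
    assert math.isclose(ratio, sigma(2.0 * beta * h * xi))
# The two conditional outcomes sum to one:
assert math.isclose(sigma(2*beta*h) + sigma(-2*beta*h), 1.0)
```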
Another useful form of the conditional probability π(xi | X−i ) is38
To implement the stochastic Glauber Dynamics (i.e., the Gibbs Sampler): given the current state X, after a site i has
been randomly selected with probability $P(i) = \frac{1}{n}$, decide whether or not to choose the realization value $x'_i = x_i^{\text{next}} = 1$
according to the conditional probability
$$ \pi(x'_i = 1 \,|\, X_{-i}) = \sigma\big( 2\beta h_i(X_{-i}) \big) , $$
which corresponds to
$$ \pi(x'_i = -1 \,|\, X_{-i}) = 1 - \sigma\big( 2\beta h_i(X_{-i}) \big) = \sigma\big( {-2\beta h_i(X_{-i})} \big) . $$
Of course, any of these three probability statements suffices to completely describe the conditional probability of the binary
random variable $x_i$.
Summarizing, the Random Scan Gibbs Sampler (19) applied to the Ising Model gives:
GLAUBER DYNAMICS
$$ P(X' \,\|\, X) = \begin{cases} 0 & \text{if } X, X' \text{ differ at more than one site} \\[4pt] P(x'_i \,\|\, i, X_{-i})\, P(i) = \sigma\big( 2\beta h_i(X_{-i})\, x'_i \big) \dfrac{1}{n} & \text{if } X, X' \text{ possibly differ at site } i \end{cases} \tag{38} $$
The last line of (38) says that one first randomly selects a site i with uniform probability $P(i) = \frac{1}{n}$, followed by the selection
of a new realization value $x'_i$ drawn from the conditional probability $\pi(x'_i \,|\, X_{-i})$ given in Eq. (33).
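A sketch of one Glauber update and a short simulation (the 4-spin ferromagnetic chain, β, and run lengths below are hypothetical choices); the final check verifies that positive couplings tend to align neighboring spins:

```python
import numpy as np

rng = np.random.default_rng(42)

def glauber_step(x, W, b, beta, rng):
    """One Glauber update: pick a site i uniformly (random scan), then set
    x_i = +1 with probability sigma(2*beta*h_i(X_{-i}))."""
    i = rng.integers(len(x))
    h_i = W[i] @ x + b[i]                       # local field (uses w_ii = 0)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * h_i))
    x[i] = 1.0 if rng.random() < p_plus else -1.0
    return x

# Hypothetical 4-spin ferromagnetic open chain (w = +1 couplings, no field).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
b = np.zeros(4)
beta = 1.0

x = rng.choice([-1.0, 1.0], size=4)
for _ in range(1000):                           # burn-in
    x = glauber_step(x, W, b, beta, rng)

corr, M = 0.0, 20000
for _ in range(M):                              # time-average of x_0 * x_1
    x = glauber_step(x, W, b, beta, rng)
    corr += x[0] * x[1]
assert corr / M > 0.3    # positive couplings align neighbouring spins
```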
Suppose, once a site i has been chosen, that rather than randomly deciding if $x'_i = +1$ according to the probability law
$P(x'_i \,\|\, i, X_{-i})$, one instead chooses the most probable outcome,39
$$ x'_i = x_i^{\text{map}} = \arg\max_{x_i = \pm 1} P(x'_i \,\|\, i, X_{-i}) = \arg\max_{x_i = \pm 1} \pi(x_i \,|\, X_{-i}) . \tag{39} $$
The “MAP update” procedure is independent of β > 0 (i.e., of the temperature) and corresponds to the update rule40
$$ x'_i = x_i^{\text{map}} = \arg\min_{x_i = \pm 1} E_\delta(x_i) = \arg\max_{x_i = \pm 1} x_i\, h_i(X_{-i}) \tag{40} $$
The MAP algorithm corresponds to the Glauber dynamics in the limit T → 0 (equivalently, β → ∞) and in this context is
known as the Hopfield Dynamics or the Hopfield Algorithm,41 and the corresponding zero-temperature network is called a
Hopfield Network.42
The Hopfield Algorithm is provably convergent to a local minimum of the energy function E(X) [21]. This readily follows
from Definition (23) which, with the optimality of $x'_i = x_i^{\text{map}}$ shown in (40), implies
$$ E(X') = E(X_{-i}) + E_\delta(x'_i) \le E(X_{-i}) + E_\delta(x_i) = E(X) . $$
Thus at every iteration the total energy either decreases or remains the same. Because the total energy is bounded from below,
and the state-space is finite, after a finite number of iterations the algorithm will have converged to a local minimum.
The Hopfield Algorithm (40) is implemented as
HOPFIELD DYNAMICS
$$ x'_i = \operatorname{sign}\big( h_i(X_{-i}) \big) = \operatorname{sign}\Big( \sum_{j=1}^{n} w_{ij}\, x_j + b_i \Big) = \operatorname{sign}\big( w_i^T X_{-i} + b_i \big) \qquad (w_{ii} = 0) \tag{41} $$
where $w_i^T$ is the i-th row of the weighting matrix W and i has been chosen according to the random scan procedure $i \sim P(i) = \frac{1}{n}$.
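The convergence argument can be exercised directly. The sketch below (with a hypothetical random W and b, and the tie-breaking convention sign(0) = +1) runs deterministic MAP updates until no spin changes, then verifies that no single flip can lower the energy, i.e., that a local minimum has been reached:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 8
W = rng.normal(size=(n, n)); W = (W + W.T) / 2   # symmetric couplings
np.fill_diagonal(W, 0.0)                         # w_ii = 0
b = rng.normal(size=n)

def energy(x):                                   # E(X) = -1/2 X^T W X - b^T X
    return -0.5 * x @ W @ x - b @ x

def hopfield_run(x, max_sweeps=100):
    """Deterministic MAP updates x_i <- sign(h_i(X_{-i})) until no spin changes."""
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(n):             # random-scan visit order
            h_i = W[i] @ x + b[i]
            new = 1.0 if h_i >= 0 else -1.0      # tie (h_i = 0) resolved to +1
            if new != x[i]:
                x[i], changed = new, True
        if not changed:
            break
    return x

x = hopfield_run(rng.choice([-1.0, 1.0], size=n))
# Fixed point is a local minimum: no single spin flip lowers the energy.
for i in range(n):
    xf = x.copy(); xf[i] = -xf[i]
    assert energy(xf) >= energy(x) - 1e-9
```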
The Hopfield algorithm step shown in (41) is just the well-known perceptron algorithm that decides which side of a separating
hyperplane the vector $X_{-i}$ points into [20]. Thus, unlike the hierarchically multilayered perceptron networks, which
have great current popularity and utility [20, 16], the Hopfield network and dynamics correspond to a massively interconnected,
massively parallel network of perceptron elements.43 Note that the Glauber Dynamics correspond to a stochastic
generalization of the Hopfield algorithm, and therefore the Ising model-plus-Glauber Dynamics can be viewed as a massively
interconnected, massively parallel network of stochastic perceptron elements, which is a huge generalization of the basic
deterministic, multilayered perceptron architecture.
39 “MAP” stands for “maximum a posteriori” [31].
40 See Equations (33) and (34).
41 Note that while this step of the algorithm is now deterministic, the overall algorithm still has a stochastic component as we are still implementing the
It is evident that we can implement the Glauber dynamics by determining the conditional probability of a flip at a randomly
chosen single site i at time k + 1 given X−i at time k,
provided that we choose the flip probability in a manner consistent with the Glauber dynamics (38). Note that the probability
of a flip at the chosen site i either occurring or not occurring is given by
$$ P_i(1 \to -1) = P_i(-1 \to -1) = \pi(x_i = -1 \,|\, X_{-i}) = \sigma\big( {-2\beta h_i(X_{-i})} \big) $$
and
$$ P_i(-1 \to 1) = P_i(1 \to 1) = \pi(x_i = 1 \,|\, X_{-i}) = \sigma\big( 2\beta h_i(X_{-i}) \big) , $$
where $P_i(1 \to -1)$ and $P_i(-1 \to 1)$ are the two possible spin-flip transitions.44 The probabilities of these two spin-flip possibilities
occurring can be summarized by
$$ P_i(x_i \to -x_i) = \sigma\big( {-2\beta h_i(X_{-i})\, x_i} \big) \tag{42} $$
with
$$ h_i(X_{-i}) = \sum_{j=1}^{n} w_{ij}\, x_j + b_i . $$
Note that
$$ P_i(x_i \to x_i) + P_i(x_i \to -x_i) = 1 $$
as expected since flipping or not flipping are the only two possibilities at the selected site i.
Equations (42) and (43) completely describe the Glauber Dynamics for the Ising model in terms of spin flipping, once a
site i has been randomly selected. However, it is illuminating to rewrite these two equations into a form that makes it clear that
transitions resulting in a lowering of the overall energy of the system are preferred (i.e., have higher probability of occurring).
To do so, recall that for $X = X_{-i} + x_i e_i$ we have
$$ E(X) = E(X_{-i}) + E_\delta(x_i) $$
and
$$ E_\delta(x_i) = -x_i\, h_i(X_{-i}) . $$
44 Note that the emphasis has shifted from concern with a particular state-value $X' = X_{-i} + x'_i e_i$ that is realized at time k + 1 to the transition X → X′
itself as an entity of interest. It is well-known that a Markov chain on a set of states is equivalent to a larger Markov chain on the set of transitions viewed as
states [14]. In essence, this is the procedure being pursued here, where a “flip” refers to either of the two site-i transitions −1 → 1 or 1 → −1 viewed as
entities of interest in their own right.
Also recall that the realizations $X_k = X = X_{-i} + x_i e_i$ and $X_{k+1} = X' = X_{-i} + x'_i e_i$ possibly differ only at the selected site
i. Define energies of transition by
$$ \Delta E(X \to X') \triangleq E(X') - E(X) \quad \text{and} \quad \Delta E_\delta(x_i \to x'_i) \triangleq E_\delta(x'_i) - E_\delta(x_i) . $$
Then, because spin-flip transitions can occur at most at the single site i, we have that the change in energy due to a configuration
transition X → X′ is given by
$$ \Delta E(X \to X') = \Delta E_\delta(x_i \to x'_i) = -(x'_i - x_i)\, h_i(X_{-i}) . $$
Of course, a change in energy is only due to a change in configuration at site i, and the energy does not change if there is no
spin flip at site i. Note that
$$ \Delta E_\delta(-1 \to 1) = -2 h_i(X_{-i}) = 2 h_i(X_{-i})\, x_i \quad \text{and} \quad \Delta E_\delta(1 \to -1) = 2 h_i(X_{-i}) = 2 h_i(X_{-i})\, x_i , $$
where in each case $x_i$ denotes the pre-flip value.
Note that if ∆Eδ (xi → −xi ) < 0, then the probability of a flip increases, whereas if ∆Eδ (xi → −xi ) > 0 the probability
decreases.45 Thus the system tends to evolve to lower energy state configurations.
with
$$ P(x'_i, x'_j \,\|\, i, j, X_{-ij}) = \pi(x'_i, x'_j \,|\, X_{-ij}) , \qquad X_{-ij} = X - x_i e_i - x_j e_j = X_{-ji} , $$
and
$$ P(i, j) = \frac{1}{\binom{n}{2}} = \frac{2}{n(n-1)} . $$
45 And if ∆Eδ (xi → −xi ) = 0, the probability of a flip is 0.5, the same as that of a non-flip.
46 For simplicity, assume that the global configuration of the network is instantaneously communicated to all sites whenever a transition occurs.
47 This is an infinite-precision argument. In reality for a discrete-time implementation this will be approximately true to a reasonable degree of accuracy
for a small enough sampling time interval that depends on the number of units, n.
48 Note, however, that this removes the possibility of a simple distributed processing algorithm as the r sites will need to simultaneously fire in a coordinated
fashion.
with
hi (X−ij ) = eTi W X−ij + bi = wTi X−ij + bi .
Similarly, one can compute arbitrary r-site Glauber dynamics. It is interesting to conjecture whether higher site-order transitions can speed up the burn-in (mixing) time of the Markov chain defined by the Glauber dynamics, or whether even an occasional high-r update during the standard 1-site Glauber dynamics can speed up the time to equilibrium.
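The two-site case can be sketched concretely: draw an unordered pair {i, j} uniformly (probability 2/(n(n−1))) and resample (xi, xj) jointly from the conditional π(x'i, x'j | X−ij). A minimal NumPy sketch, with names of our own choosing:

```python
import numpy as np
from itertools import product

def glauber_pair_step(x, W, b, beta, rng):
    """Two-site Glauber update for spins in {-1,+1}: choose an unordered
    pair {i,j} uniformly (probability 2/(n(n-1))) and resample (x_i, x_j)
    jointly from pi(x_i', x_j' | X_{-ij}) of the Ising distribution."""
    n = len(x)
    i, j = rng.choice(n, size=2, replace=False)
    # unnormalized log-probabilities -beta*E for the 4 joint configurations
    logits = {}
    for si, sj in product((-1, 1), repeat=2):
        y = x.copy(); y[i], y[j] = si, sj
        logits[(si, sj)] = beta * (0.5 * y @ W @ y + b @ y)
    m = max(logits.values())
    w = {k: np.exp(v - m) for k, v in logits.items()}   # stabilized weights
    r = rng.random() * sum(w.values())
    for k, v in w.items():                              # inverse-CDF draw
        r -= v
        if r <= 0:
            x[i], x[j] = k
            break
    return x
```

For n = 2 the pair is always the whole network, so each step is an exact draw from the Ising distribution, which makes the sketch easy to sanity-check.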
Various models are obtained by truncating the right-hand-side expansion to various orders of products of the components of
X. In particular, keeping terms up to second order results in the quadratic-energy Boltzmann Machine distribution used in the
standard Ising model discussed earlier.
The single-site random scan Gibbs sampler (19) requires knowledge of π(xi | X−i ), which is computable for models of
higher order than quadratic.
As an illustrative example, let us determine π(xi | X−i) for the cubic approximation to the exact representation (47) obtained by truncating terms of order four and higher. This approximation is of the form
π(X) = e^{−βE(X)} / Z   with   E(X) = B(X) + W(X, X) + C(X, X, X)    (48)
where B(X), W(X, Y) and C(X, Y, Z) are multilinear, real-valued functions of their vector arguments.50 These functions are assumed to be invariant with respect to all permutations of their arguments,51 and to satisfy
B(X) = −b^T X   and   W(X, Y) = −(1/2) X^T W Y
with52
wii = 0 and wij = wji for all i, j = 1, ..., n ,
and
C(X, Y, Z) = −(1/6) Σ_{i,j,k=1}^n cijk xi yj zk
with cijk = 0 whenever any two of its indices are identical, and with cijk invariant under permutation of its indices.53
Subject to these constraint conditions, the multilinear operators applied to X = X−i + xi ei yield
W(X, X) = W(X−i, X−i) + 2 xi W(ei, X−i) = W(X−i, X−i) − xi e_i^T W X−i = W(X−i, X−i) − xi w_i^T X−i
and
C(X, X, X) = C(X−i, X−i, X−i) + 3 xi C(ei, X−i, X−i) = C(X−i, X−i, X−i) − (1/2) xi X−i^T C^(i) X−i
where the matrices C^(i), i = 1, ..., n, have elements
[C^(i)]_jk = cijk   for j, k = 1, ..., n .
Note that for each i the matrix C^(i) is symmetric and has zero diagonal elements.
If we define
hi(X−i) = bi + w_i^T X−i + (1/2) X−i^T C^(i) X−i ,    (49)
then the single-site conditional probability again takes the logistic form π(x'i | X−i) = σ(2β hi(X−i) x'i), with hi(X−i) given by Eq. (49). Note that if the cubic energy terms are "turned off," cijk ≡ 0, then we recover the quadratic case shown in Equations (32)–(33).
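The cubic local field (49) can be checked against the energy directly: writing E(X) = const − xi hi(X−i), the energy difference between the two values of xi must equal 2 hi(X−i). A short NumPy sketch (our own illustrative names, with c[i] playing the role of C^(i)):

```python
import numpy as np

def energy(x, b, W, c):
    """Cubic-energy model: E(X) = -b'X - 0.5 X'WX - (1/6) sum_ijk c_ijk x_i x_j x_k,
    i.e., E = B(X) + W(X,X) + C(X,X,X) in the notation of Eq. (48)."""
    return -(b @ x + 0.5 * x @ W @ x + np.einsum('ijk,i,j,k->', c, x, x, x) / 6.0)

def local_field(i, x, b, W, c):
    """h_i(X_{-i}) = b_i + w_i'X_{-i} + 0.5 X_{-i}'C^(i)X_{-i}, Eq. (49)."""
    xm = x.copy()
    xm[i] = 0.0                      # X_{-i}: site i removed
    return b[i] + W[i] @ xm + 0.5 * xm @ c[i] @ xm
```

The test below builds a fully symmetric tensor c with zeros on repeated indices, as required by the constraint conditions, and verifies E(xi = −1) − E(xi = +1) = 2 hi(X−i) at every site.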
Note that in this case the complexity of having to decide which of the K^n possible site configurations to update with nonzero probability has been simplified by attempting to update all sites according to the transition probability (52), in contrast to the single-site random-scan algorithm that randomly selects one site, say xi, and updates it according to π(x'i | X−i).
In general, Eq. (52) will not have our distribution of interest, π(X) of Eq. (5), as its equilibrium distribution. In our
previous examples, we first specified the desired equilibrium distribution, Eq. (5), and then afterwards defined the transition
probabilities of our Markov chain accordingly. Here we are first mandating the form of the transition probabilities and now
53 Note that there are six possibilities. This accounts for the factor of 1/6 in the components-level form of C(X, X, X).
54 The material presented here draws heavily from reference [35].
must determine the consequences of this choice on the nature of the resulting equilibrium distribution, assuming it exists. Because the transition probabilities are strictly positive, there does exist a unique equilibrium distribution,55 which we designate as π̃(X),
π̃(X') = Σ_{values of X} P̃(X' ‖ X) π̃(X) .    (53)
In terms of the vector of equilibrium probabilities and the matrix of transition probabilities, we have π̃ = P̃ π̃, with56
P̃(X' ‖ X) = Π_{i=1}^n π(x'i | X−i) = Π_{i=1}^n σ(2β hi(X) x'i) .    (54)
To simplify matters, absorb the factor 2β into the weights W and b,57 so that
hi(X) = e_i^T W X + bi ← 2β hi(X) = 2β (e_i^T W X + bi)
and
π(x'i | X−i) = σ(hi(X) x'i) ← π(x'i | X−i) = σ(2β hi(X) x'i) .
Also note that zi = (1/2)(xi + 1) ∈ {0, 1} and xi = 2zi − 1 ∈ {−1, +1} convey the same information58 so that π̃(xi) = π̃(zi), π̃(X) = π̃(Z), and P(X' ‖ X) = P(Z' ‖ Z) when X and Z are so related. We also use the fact that59
π(x'i | X−i) = σ(x'i hi(X)) = e^{z'i hi(X)} / (1 + e^{hi(X)})   with   z'i = (x'i + 1)/2 .    (56)
Now substitute X = 2Z − e into hi(X) and define the quantities
W̃ = 2W,   b̃i = bi − e_i^T W e = bi − Σ_j wij,   and   h̃i(Z) = e_i^T W̃ Z + b̃i    (57)
to arrive at
hi(X−i) = hi(X) = e_i^T W X + bi = e_i^T W̃ Z + b̃i = h̃i(Z) = h̃i(Z−i)    (58)
and
π(x'i | X−i) = e^{z'i h̃i(Z)} / (1 + e^{h̃i(Z)}) = π(z'i | Z−i)   for   z'i = (x'i + 1)/2    (59)
55 This is a consequence of the Perron–Frobenius Theorem.
56 Recall that hi(X) = hi(X−i) because wii = 0.
57 This can be easily reversed when desired.
58 For X = (x1, ..., xn)^T and Z = (z1, ..., zn)^T the corresponding relationships are Z = (1/2)(X + e) and X = 2Z − e for e = (1, ..., 1)^T.
59 See Eq.s (33) and (36).
60 The conditional probability π(zi | Z−i) given in (59) is denoted by ps(x, ys) in Section 2.1 of reference [35]. Note that
z'i h̃i(Z) = z'i ( Σ_j w̃ij zj + b̃i )   with   w̃ii = 0.
for X and Z related as X = 2Z − e. Thus, equivalently to solving (53) for the Ising model, one can instead solve for the stationary distribution that satisfies
π̃(Z') = Σ_{values of Z} P̃(Z' ‖ Z) π̃(Z) .    (61)
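The reparameterization (57) is easy to verify numerically: the ±1-spin local field and the 0/1-unit local field must agree at corresponding configurations. A minimal sketch with our own function name:

```python
import numpy as np

def spin_to_binary_params(W, b):
    """Map the +-1-spin parameters (W, b) to the 0/1-unit parameters
    (W~, b~) = (2W, b - W e) of Eq. (57), so that the local fields agree:
    h_i(X) = h~_i(Z) whenever Z = (X + e)/2."""
    e = np.ones(W.shape[0])
    return 2.0 * W, b - W @ e
```

The agreement W̃ Z + b̃ = W X + b follows because 2W (X + e)/2 + b − W e = W X + b; the test checks it on random parameters.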
One can find a stationary solution π̃(Z) by determining a solution to the detailed balance condition61
π̃(Z') / π̃(Z) = P̃(Z' ‖ Z) / P̃(Z ‖ Z') .    (62)
Note that
z'i h̃i(Z) = z'i ( e_i^T W̃ Z + b̃i ) = z'i ( Σ_j w̃ij zj + b̃i ) .
Comparison with Eq. (62) shows that the equilibrium distribution is given by
π̃(Z) = π̃(X) = e^{b̃^T Z} Π_i (1 + e^{h̃i(Z)}) / Z   for   Z = (1/2)(X + e)    (64)
where Z is the partition function (normalization factor)
Z = Σ_{values of Z} e^{b̃^T Z} Π_i (1 + e^{h̃i(Z)}) .
Using the facts that h̃i(Z) = hi(X) and b̃ = b − W e allows us to rewrite the equilibrium distribution Eq. (64) as
π̃(X) = π̃(Z) = e^{(1/2)(b^T X − e^T W X)} Π_i (1 + e^{hi(X)}) / Z'   for   X = 2Z − e ,    (65)
with
Z' = e^{(1/2)(e^T W e − e^T b)} Z   and   hi(X) = e_i^T W X + bi = e_i^T h(X) .
Now note that
−e^T W X = e^T b − e^T (W X + b) = e^T b − e^T h(X) = e^T b − Σ_i hi(X) ,
so that
π̃(X) = π̃(Z) = e^{(1/2) b^T X} Π_i cosh( (1/2) hi(X) ) / Z'' = e^{(1/2) b^T X} Π_i cosh( (1/2) e_i^T (W X + b) ) / Z''    (66)
with
X = 2Z − e   and   Z'' = 2^{−n} e^{−(1/2) e^T b} Z' .
Finally, undoing the value reassignments shown in Eq. (55), we obtain the equilibrium distribution
π̃(X) = e^{β b^T X} Π_i cosh( β e_i^T (W X + b) ) / Z''    (67)
for an appropriately defined partition function Z''. The distribution (67) should be compared to the equilibrium distribution for the single-site random-scan Glauber dynamics with quadratic energy E(X) = −(1/2) X^T W X − b^T X,
π(X) = e^{−βE(X)} / Z = e^{β((1/2) X^T W X + b^T X)} / Z = e^{β b^T X} e^{(β/2) X^T W X} / Z .
To make some further points of connection with reference [35], let us return to Eq.s (60) and (64). In particular, we can use these equations to compute the joint distribution of two time-adjacent states of the Markov chain,
P̃(Z^{k+1} = Z', Z^k = Z) = P̃(Z' ‖ Z) π̃(Z) = Z^{−1} e^{b̃^T Z} Π_i [ e^{z'i h̃i(Z)} / (1 + e^{h̃i(Z)}) ] (1 + e^{h̃i(Z)}) = e^{Z'^T h̃(Z) + b̃^T Z} / Z
or
P̃(Z^{k+1} = Z', Z^k = Z) = e^{Z'^T W̃ Z + b̃^T Z' + b̃^T Z} / Z .    (68)
Eq. (68) is Equation (3) of reference [35]. The symmetry between Z' and Z shown in (68),
P̃(Z', Z) = P̃(Z, Z') ,    (69)
exists because a Markov chain that has a stationary distribution which satisfies detailed balance is a reversible Markov chain.
As noted in [35], once P̃(Z^{k+1} = Z', Z^k = Z) is in hand, the marginalization
Σ_{values of Z} P̃(Z^{k+1} = Z', Z^k = Z)
must yield the stationary distribution π̃(Z').63 This is straightforward to show:64
Σ_{values of Z} P̃(Z', Z) = Z^{−1} e^{b̃^T Z'} Σ_{values of Z} e^{Z^T h̃(Z')} = Z^{−1} e^{b̃^T Z'} Σ_{values of Z} e^{Σ_{i=1}^n zi h̃i(Z')} .
Recalling that zi ∈ {0, 1}, we break up the sum over values of Z = (z1, ..., zn)^T as follows:
Σ_{values of Z} = Σ_{all components zero} + Σ_{one component nonzero} + Σ_{two components nonzero} + ··· + Σ_{all components nonzero} .
This results in
Σ_{values of Z} e^{Σ_{i=1}^n zi h̃i(Z')} = 1 + Σ_{i=1}^n e^{h̃i(Z')} + Σ_{i<j} e^{h̃i(Z') + h̃j(Z')} + Σ_{i<j<k} e^{h̃i(Z') + h̃j(Z') + h̃k(Z')} + ··· + e^{h̃1(Z') + ··· + h̃n(Z')}
= Π_{i=1}^n (1 + e^{h̃i(Z')}) .
63 Note that this is just a two-step way of saying that Eq. (61) holds.
64 Note that Z'^T W̃ Z + b̃^T Z' + b̃^T Z = Z^T b̃ + Z'^T h̃(Z) = Z'^T b̃ + Z^T h̃(Z').
and therefore
Σ_{values of Z} P̃(Z^{k+1} = Z', Z^k = Z) = e^{b̃^T Z'} Π_i (1 + e^{h̃i(Z')}) / Z = π̃(Z')
as expected.
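For a small network the whole chain of identities can be verified by brute force: build the parallel transition matrix of (54)/(60), the candidate stationary distribution (64), and check stationarity (61) and detailed balance (62) directly. A NumPy sketch (our own names, not the author's code):

```python
import numpy as np
from itertools import product

def parallel_chain(Wt, bt):
    """Build the fully parallel transition matrix
    P~(Z'||Z) = prod_i e^{z_i' h~_i(Z)} / (1 + e^{h~_i(Z)})
    together with the normalized candidate stationary distribution
    pi~(Z) proportional to e^{b~'Z} prod_i (1 + e^{h~_i(Z)}), Eq. (64)."""
    n = len(bt)
    states = [np.array(s, dtype=float) for s in product((0, 1), repeat=n)]
    P = np.zeros((len(states), len(states)))
    for k, Z in enumerate(states):
        h = Wt @ Z + bt
        for kp, Zp in enumerate(states):
            P[kp, k] = np.prod(np.exp(Zp * h) / (1.0 + np.exp(h)))
    pi = np.array([np.exp(bt @ Z) * np.prod(1.0 + np.exp(Wt @ Z + bt))
                   for Z in states])
    return P, pi / pi.sum()
```

Note that detailed balance hinges on W̃ being symmetric with zero diagonal, exactly as assumed in the derivation.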
or latent units. Note that G = V ∪ H and V ∩ H = ∅. For convenience, and with no loss of generality, we assume that the components of X are ordered such that
X = (Xv^T, Xh^T)^T   with   Xv = XV and Xh = XH ,
and we define
nv = |V| ,  nh = |H|  =⇒  nv + nh = n = |G| ,
so that the binary random vectors Xv, Xh and X satisfy
Xv ∈ R^{nv} ,  Xh ∈ R^{nh} ,  X ∈ R^n .
Note that with this partitioning, the binary variables xi, i = 1, ..., nv are all of the components of Xv and the first nv components of X.
Let us focus on the Boltzmann Machine distribution.71 Consistent with the state partitioning X = (Xv^T, Xh^T)^T we partition W and b as
W = [ Wvv  Wvh ; Whv  Whh ]   and   b = (bv^T, bh^T)^T    (70)
with
Wvv = Wvv^T ,  Whh = Whh^T ,  and  Wvh = Whv^T ,    (71)
where Wvv and Whh both have zero-valued diagonals. The Boltzmann distribution then has the form
π(X) = π(Xv, Xh) = (1/Z) e^{−βE(Xv, Xh)}    (72)
with
E(X) = −(1/2) X^T W X − b^T X = Ev^iso(Xv) + Eh^iso(Xh) − Xv^T Wvh Xh    (73)
where
Ev^iso(Xv) = −(1/2) Xv^T Wvv Xv − bv^T Xv   and   Eh^iso(Xh) = −(1/2) Xh^T Whh Xh − bh^T Xh    (74)
are the energies that the visible and hidden units respectively would have if they were isolated by setting the cross-coupling energy to zero (i.e., by taking Wvh = 0). In general, for Wvh ≠ 0 these are not equal to the energy functions Ev(Xv) and Eh(Xh) arising from marginalizing the joint distribution.72
Our working assumption is that the existence of the latent variables XH means that the joint distribution for Xv and Xh has the particularly nice Boltzmann Machine form (72), namely the Boltzmann distribution with an energy function that is a simple quadratic form in the joint variables X = (Xv^T, Xh^T)^T. However, although the quadratic energy form is assumed to hold for the joint distribution, in general it does not hold for the marginal distribution π(Xv) [9, 10].
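This claim can be illustrated numerically: couple three visible ±1 spins to a single biased hidden spin, marginalize the hidden spin exactly, and Möbius-invert ln π(Xv) over subsets of visible sites to read off the interaction coefficients of each order. A nonzero third-order coefficient shows the marginal is not of quadratic-energy form. This is a sketch under our own naming, not the author's code:

```python
import numpy as np
from itertools import product, combinations

def visible_log_coeffs(W, b, nv):
    """Marginalize a +-1 Boltzmann machine over its hidden units and
    Moebius-invert ln pi(Xv) to expose interaction coefficients of every
    order (in the 0/1 coordinates z_i = (v_i + 1)/2)."""
    n = len(b)
    log_pv = {}
    for V in product((-1, 1), repeat=nv):
        tot = 0.0
        for H in product((-1, 1), repeat=n - nv):
            x = np.array(V + H, dtype=float)
            tot += np.exp(0.5 * x @ W @ x + b @ x)   # unnormalized pi(V, H)
        log_pv[V] = np.log(tot)
    coeffs = {}
    for k in range(nv + 1):
        for S in combinations(range(nv), k):
            c = 0.0
            for m in range(k + 1):
                for T in combinations(S, m):
                    V = tuple(1 if i in T else -1 for i in range(nv))
                    c += (-1) ** (k - m) * log_pv[V]
            coeffs[S] = c
    return coeffs
```

With zero hidden bias the marginal is an even function and the third-order term vanishes; the bias breaks that symmetry, so the test uses a biased hidden unit.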
Suppose the visible units Xv correspond to objects in a visible world of interest on which we make observations. Ideally we would know the true distribution πtrue(Xv), but often we do not, or it is too complex to work with tractably. One might use a Boltzmann Machine distribution on the visible units alone as a model for the unknown distribution πtrue(Xv), but this is too restrictive. Although it can capture all second-order behavior between sites, a quadratic energy model represents higher-order site interactions (e.g., between three or more sites) poorly or not at all. However, if one marginalizes over the latent variables of a Boltzmann Machine (Ising model) distribution,
π(Xv) = Σ_{Xh-values} π(Xv, Xh) ,    (75)
71 I.e., on the quadratic-energy Boltzmann Distribution for a binary-components random vector, aka the Ising model.
72 If the simplifying assumptions that Wvv = 0 and Whh = 0 are made, then the resulting distribution is known as the Harmonium distribution or the
Restricted Boltzmann Machine (RBM) distribution.
then one obtains a binary Boltzmann distribution that is not of Boltzmann Machine form73 and which can provide a reasonably
good approximate model for a general Boltzmann distribution on Xv of the fully general form shown in Eq. (47). Indeed, for
a sufficiently large number of hidden variables, the marginalized distribution π(Xv ) can be made arbitrarily close to the true
distribution πtrue (Xv ) [9, 10]. This is called the universal approximation property of the marginalized Boltzmann machine,
which is discussed further in Section 6.3.
Motivated by the universal approximation property, for a given, fixed number of hidden units, XH , one determines the
values of the parameters, θ = vec(W, b, Z), of a joint Boltzmann machine distribution πθ (Xv , Xh ) that result in a best fit
of the marginal πθ (Xv ) to the true distribution πtrue (Xv ). This is done by collecting repeated independent measurements of
Xv (which presumably is repeatedly drawn from the unknown distribution πtrue (Xv )) and then applying the procedure of
maximum likelihood estimation (MLE)74 to obtain an estimate of the unknown parameter vector θ. The MLE procedure for
learning a best fit of πθ (Xv ) to πtrue (Xv ) is detailed below in Section 6.4.
This distribution is completely specified by knowledge of the values of the Kv = 2^{nv} parameters
ϑv^true = (ϑ_{0;v}^true, ..., ϑ_{12···nv;v}^true)^T ∈ R^{Kv} ,   Kv = 2^{nv} .
Given nv visible binary variables Xv and assuming the existence of nh hidden binary variables Xh, we set X = (Xv^T, Xh^T)^T ∈ R^n for
n = nv + nh    (77)
and define a latent variable model on X which is of Boltzmann Machine (Ising model) form,76
πθ(X) = (1/Zθ) e^{−βEθ(X)}    (78)
with
Eθ(X) = −(1/2) X^T W X − b^T X ,   θ = vec(W, b, Z) ∈ R^p    (79)
and
Zθ = Σ_{values of X} e^{−βEθ(X)} .    (80)
The latent variable model (78)–(79) has
p = C(n, 0) + C(n, 1) + C(n, 2) = 1 + n + n(n − 1)/2 = (n² + n + 2)/2 = (n(n + 1) + 2)/2    (81)
parameter values as a function of n = nv + nh. As an example, take nv = n = 100; then
Kv = 2^100 = (2^10)^10 ≈ 10^30   whereas   p = ((100)(101) + 2)/2 = 50(101) + 1 = 5,051 .
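The count (81) and its comparison with Kv can be checked directly; a one-function sketch (the function name is ours):

```python
from math import comb

def ising_param_count(n):
    """p = C(n,0) + C(n,1) + C(n,2) = (n^2 + n + 2)/2, Eq. (81): one
    normalization constant, n biases b_i, and n(n-1)/2 distinct weights
    w_ij of the symmetric, zero-diagonal matrix W."""
    return comb(n, 0) + comb(n, 1) + comb(n, 2)
```

Even at nv = 100 the polynomial count p is vanishingly small compared to Kv = 2^100.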
73 I.e., marginalization of a quadratic-energy binary-components Boltzmann Machine distribution (Ising model) does not yield a distribution of the same quadratic energy form [9, 10]. Put another way, the class of Boltzmann Machine distributions is not closed under marginalization. However, the class of positive distributions is closed under marginalization, so it is true that the marginal of a binary Boltzmann Machine (Ising) model is a binary Boltzmann distribution.
74 The MLE procedure is described in Appendix A.
75 I.e., by an n-order polynomial where in each term the partial degree of each variable, xi, is at most one. See Appendix D.2. The parameter ϑ_{0;v}^true serves to normalize the distribution to sum to one and is equivalent to knowing the partition function for the Boltzmann distribution πtrue(Xv). Note that the parameter β has been absorbed into the coefficients.
76 See Equations (70)–(74).
However, note that the computation of the partition function for Eq. (78),
Zθ = Σ_{all values of X} e^{−βEθ(X)} ,
requires a sum over K = 2^n terms, which is usually a truly vast number. One therefore should not be surprised to learn that a driving force throughout the history of recent machine learning research has been the issue of how to circumvent the need to compute the partition function.
With the latent variable Boltzmann Machine model (78) at hand, it is natural to approximate πtrue(Xv) by the marginalization
πθ(Xv) = Σ_{all values of Xh} πθ(Xv, Xh) .    (82)
Note that the number of terms in the marginalization sum can also be vast, so how to handle this will need to be addressed as well.
The question at hand is how well the Kv = 2^{nv}–parameter true distribution shown in Eq. (76) can be approximated by the p = (n² + n + 2)/2–parameter latent variable model distribution given by Equations (78) and (82). That the latent variable model (78) can exactly represent (76) for a sufficient number of hidden units is proved in references [33, 35] and discussed below in Section 6.3.3.
Generally Kv degrees of freedom are needed to specify the true distribution (76), and therefore a necessary condition for the p-parameter model (78)–(82) to fit an arbitrarily chosen true distribution is that the model have at least as many degrees of freedom, p ≥ Kv. Solving the quadratic equation
p = (n² + n + 2)/2 = 2^{nv} = Kv
for a positive value of n = nv + nh gives the necessary condition
nh ≥ ( √(2^{nv+3} − 7) − 2nv − 1 ) / 2 = O(2^{nv/2}) .    (83)
2
We see that in the most general situation, the number of hidden units necessarily must grow at least exponentially in the
number of visible units to ensure an exact fit of the marginalized Boltzmann distribution to the true visible units distribution.
Of course, the practical hope is that actual distributions encountered in the world can be well approximated using far fewer hidden neurons.
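The necessary condition (83) is easy to tabulate: for each nv, the smallest nh making p at least Kv can be found by direct search, and it matches the closed form in (83). A sketch (the function name is ours):

```python
def min_hidden_units(nv):
    """Smallest nh satisfying the necessary condition (83):
    (n^2 + n + 2)/2 >= 2^nv for n = nv + nh.  Necessary, not sufficient:
    it only says the model has enough parameters, not that a fit exists."""
    target = 2 ** nv
    nh = 0
    while True:
        n = nv + nh
        if (n * n + n + 2) // 2 >= target:
            return nh
        nh += 1
```

The exponential growth O(2^{nv/2}) is visible already for small nv; the test cross-checks the search against the closed-form bound of Eq. (83).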
πθ (Xh | Xv ) = πtrue (Xh | Xv ) > 0 and πθ (Xv ) = πtrue (Xv ) > 0 (85)
distribution. Given a Boltzmann Machine distribution, one can compute its moments up to second order and use them to compute the MaxEnt distribution,
which must be the same distribution as the original one.
Boltzmann Machine distributions in Eq. (84) will be equal if and only if their moments up to second order are equal. To denote this, define the model moments up to second order by
m̂0(θ) = Eθ{x0} = 1 (with the convention x0 ≡ 1),   m̂i(θ) = Eθ{xi},   and   m̂ij(θ) = Eθ{xi xj}    (87)
for 1 ≤ i < j ≤ n. Note that there are p = (n² + n + 2)/2 moments in all, with n = nv + nh, and that for binary vectors whose components take values 0 or 1, or values ±1, there are no other second-order moments to consider.79
In principle one can equate the model moments (87) to the true moments (86) and solve the resulting p (generally nonlinear) equations to determine the unknown parameter vector θ, a procedure known as the Method of Moments (MOM) [17]. Because we are working with a marginalization model, we can recast the typical MOM equation as follows. Equating the model moments (87) to the true moments (86) results in the set of equations
Σ_{values of Xv} m̂0(θ | Xv) ( πtrue(Xv) − πθ(Xv) ) = 0   with m̂0(θ | Xv) ≡ 1
Σ_{values of Xv} m̂i(θ | Xv) ( πtrue(Xv) − πθ(Xv) ) = 0   for 1 ≤ i ≤ n    (88)
Σ_{values of Xv} m̂ij(θ | Xv) ( πtrue(Xv) − πθ(Xv) ) = 0   for 1 ≤ i < j ≤ n
where M(θ) is a p × Kv matrix that generally is a nonlinear function of θ. This is the MOM parameter estimation problem recast in terms of the marginal distributions πtrue^v and πθ^v. Summarizing the above discussion, and that of the previous section, we have:

Let πtrue(X) = πtrue(Xv, Xh) be a Boltzmann Machine distribution for the n-component binary random vector X = (x1, ..., xn) that partitions as X = vec(Xv, Xh), where Xv has nv components and Xh has nh components, n = nv + nh. Let πθ(X) be a fully general Boltzmann Machine distribution for the sample random vector X as shown in Equations (78)–(82), and let n be large enough that πtrue = πθ for some value of the parameter vector θ. Let πtrue^v and πθ^v respectively denote the marginals πtrue(Xv) and πθ(Xv). Then
πtrue^v = πθ^v  ⇐⇒  M(θ) ( πtrue^v − πθ^v ) = 0
where M(θ) is p × Kv with Kv = 2^{nv} and p = (n² + n + 2)/2 ≥ Kv.
79 See Appendix D.2.
We shall see in Section 6.4 that the Boltzmann Machine learning rule of [1, 22, 23] iteratively seeks a solution to the MOM
problem (89). Once we accept that marginalization of a Boltzmann Machine (Ising model) distribution πθ (X) = πθ (Xv , Xh )
can exactly match an arbitrarily chosen Xv –distribution πtrue (Xv ) for an appropriate number, n, of elements of X, and choice
of parameter vector θ, then the primary design consideration is the number of hidden units nh that determines n = nv + nh
subject to the necessary condition shown in Eq. (83).
Note that the development is agnostic as to whether each binary variable xi takes the values 0 or 1, or the values ±1. Assume that xi ∈ {0, 1}. Then the moments are simple to understand: mi is the probability that xi takes the value one and mij is the probability that xi = xj = 1. Since all the probabilities under consideration are positive, these moments, as probabilities, are all nonzero. Thus, in this case, all of the elements of M are nonzero,80 as are all of the values of πtrue^v and πθ^v.
Note that if the "tall" (p ≥ Kv) matrix M(θ) has full column rank for the value θ, then it must be true that πθ^v = πtrue^v at that same value of θ. Therefore, perhaps one can find a sufficient condition on the number, nh, of hidden units by determining
nh* = argmin_{nh} { nh | rank M(θ) = Kv for some parameter vector θ } .
If such a value exists, and is finite, nh* < ∞, then a sufficient condition for πθ^v = πtrue^v to hold for some θ would be nh ≥ nh*. Thus to have a marginalized Ising model match a general categorical distribution for Xv requires a number of hidden units in the range
O(2^{nv/2}) = ( √(2^{nv+3} − 7) − 2nv − 1 ) / 2 ≤ nh ≤ 2^{nv} − nv − 1 = O(2^{nv}) .    (92)
2
In all cases the number of independent parameters needed to fit a marginalized Boltzmann Machine distribution to the most general categorical distribution is
p ≥ Kv = 2^{nv}    (93)
where Kv is the number of parameters (equivalent to Kv − 1 probability values) needed to specify the most general categorical distribution for the binary, nv-component random vector Xv.
Examination of the proof given in [35], discussed below, shows that (91) is also a sufficient condition for the marginalization of a Restricted Boltzmann Machine (RBM) to match a general distribution on Xv. Because the RBM has a restricted architecture of Boltzmann Machine type,81 if the marginalized RBM is a universal approximator, then so is the Boltzmann Machine. In fact, reference [35] shows universality of a further restriction of the RBM architecture.
The remainder of this section details the proof of [35]. We start by imposing the RBM restriction on the Boltzmann
Machine latent variable model shown in Equations (70)–(75), which corresponds to
Restricted Boltzmann Machine (RBM) Assumption: Wvv = 0 and Whh = 0.
To simplify the notation we take Xv = V with components vi ∈ {0, 1} for i = 1, ..., nv, and we take Xh = H with components hj ∈ {0, 1} for j = 1, ..., nh. The corresponding realization values are V, vi, H, and hj. There is no loss of generality in assuming that the binary random variables vi and hj take values 0 or 1.82 We also define
A = βWhv ∈ R^{nh × nv} ,   r = βbh ∈ R^{nh} ,   s = βbv ∈ R^{nv} ,   and   s0 = −ln Z .
80 I.e., M is a positive matrix for xi ∈ {0, 1}.
81 See footnote 72.
82 A change from xi ∈ {−1, +1} to xi ∈ {0, 1} corresponds to the invertible transformation xi ← (1/2)(xi + 1), which does not change the functional form of the Ising model Equations (70)–(75). (However, the values of the matrix W and the vector b are changed.) See the discussions in Sections D.2 and 5.3.4. The mathematical analysis is much more straightforward with the 0–1 choice, which is used in the proof given in [35].
With these changes the latent variable model (70)–(75) becomes the RBM model
πθ(X) = πθ(V, H) = e^{−Eθ(X)} / Z = e^{H^T A V + H^T r + s^T V + s0}    (94)
with
θ = vec(A, r, s, s0) .
We then make the further restrictive assumption on A that H^T A V has the form
H^T A V = Σ_{i=1}^{nh} Σ_{j=1}^{ni} aij hi vj = Σ_{i=1}^{nh} ai hi Σ_{j=1}^{ni} v̄j^(i) = Σ_{i=1}^{nh} ai S_i^{εi}(V^(i)) hi    (95)
where
S_i^{εi}(V^(i)) = Σ_{j=1}^{ni} v̄j^(i)    (96)
for i = 1, ..., nh, with
V^(i) = (v_{i1}, ..., v_{i,ni})^T ,   Cardinality(V^(i)) = |V^(i)| = ni ≤ nv ,   i = 1, ..., nh ,   nh = 2^{nv} = Kv    (97)
and
v̄j^(i) = v_{ij} for 1 ≤ i1 < ··· < ij < ··· < i_{ni} ≤ nv ,   and   v̄_{ni}^(i) = εi v_{i,ni} with εi = ±1 .    (98)
The model restrictions (95)–(98) merit some discussion. Let V = {1, ..., nv} be the nodes on which the elements of the visible units V = (v1, ..., v_{nv})^T are defined. Let P(V) be the power set of V,
Vi ∈ P(V) ⇐⇒ Vi ⊂ V ,
and
ni = |Vi| ≤ |V| = nv .
Note that the total number of subsets is |P(V)| = 2^{nv} = Kv, so that nh = Kv = 2^{nv}. Let each indexed subset Vi be distinct, so that i ≠ j =⇒ Vi ≠ Vj, and order the elements of Vi as 1 ≤ i1 < i2 < ··· < i_{ni} ≤ nv. Then
S_i^{εi}(V^(i)) = Σ_{j=1}^{ni} v̄j^(i) = Σ_{j ∈ Vi \ {i_{ni}}} vj + εi v_{i,ni}    (99)
with
ni = |V^(i)| for i = 1, ..., nh ,   nh = 2^{nv} = Kv ,
is a function of the elements of V that are defined on the ni nodes in the subset Vi ⊂ V, where there are nh = 2^{nv} = Kv such subsets.
From (94), (95) and (99) we have
πθ(X) = πθ(V, H) = exp( Σ_{i=1}^{nh} hi ( ai S_i^{εi}(V^(i)) + ri ) + s^T V + s0 ) = e^{s^T V + s0} exp( Σ_{i=1}^{nh} hi Ci )    (100)
where Ci = Ci^{εi}(V^(i)) = ai S_i^{εi}(V^(i)) + ri.
Comment
Note that a nonzero value of ai in Eq. (100) serves to couple all of the ni = |V^(i)| binary variables v_{ij} in the subset V^(i) through the single latent variable hi,
ai hi S_i^{εi}(V^(i)) = ai hi ( v_{i1} + ··· + v_{i,ni−1} + εi v_{i,ni} ) .
In particular, for ni ≥ 3, the coupling through the single latent variable hi allows for higher order (beyond second order) coupling between the binary variables in V. Note that the model (100) uses this single-latent-variable coupling for every possible subset of the elements of V,83 and in this way all possible higher correlations amongst the elements of V are captured.
and πtrue(V) given by the fully general nv-th order polynomial form shown in Equation (76),
ln πtrue(V) = ϑ0 + Σ_{i=1}^{nv} ϑi vi + Σ_{i<j} ϑij vi vj + Σ_{i<j<k} ϑijk vi vj vk + ··· + ϑ_{12···nv} v1 v2 ··· v_{nv} .    (102)
Comment
Note that showing πtrue (V) = πθ (V) for πθ (V) given by Eq. (101) will demonstrate that marginalizations of both
the Restricted Boltzmann Machine (RBM) and Boltzmann Machine models yield universal approximators since
the model (94)–(100) is a restricted version of both of those model families.
Recalling that hi ∈ {0, 1}, we break up the sum over the Kh = 2^{nh} values of H = (h1, ..., h_{nh})^T shown in Eq. (101) into (nh + 1) smaller sums as follows:
Σ_{values of H} = Σ_{all components zero} + Σ_{one component nonzero} + Σ_{two components nonzero} + ··· + Σ_{all components nonzero} .
This results in
Σ_{values of H} e^{Σ_{i=1}^{nh} hi Ci} = 1 + Σ_{i=1}^{nh} e^{Ci} + Σ_{i<j} e^{Ci + Cj} + Σ_{i<j<k} e^{Ci + Cj + Ck} + ··· + e^{C1 + ··· + C_{nh}}
= Π_{i=1}^{nh} (1 + e^{Ci(V^(i))}) = Π_{i=1}^{nh} (1 + e^{ai S_i^{εi}(V^(i)) + ri}) .
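The factorization of the hidden-unit sum into a product is easy to confirm by brute force for a handful of hidden units; a short NumPy sketch (names ours):

```python
import numpy as np
from itertools import product

def marginalize_hidden(C):
    """Check the factorization sum_H exp(sum_i h_i C_i) = prod_i (1 + e^{C_i})
    over h_i in {0,1} -- the step that yields the product form of Eq. (103)."""
    brute = sum(np.exp(np.dot(H, C)) for H in product((0, 1), repeat=len(C)))
    closed = np.prod(1.0 + np.exp(np.asarray(C)))
    return brute, closed
```

The brute-force sum has 2^{nh} terms while the product has nh factors, which is exactly why the RBM marginal (103) is tractable.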
Therefore
πθ(V) = e^{s^T V + s0} Π_{i=1}^{nh} (1 + e^{ai S_i^{εi}(V^(i)) + ri})    (103)
so that84
log πθ(V) = Σ_{i=1}^{nh} log( 1 + e^{ai S_i^{εi}(V^(i)) + ri} ) + s^T V + s0 = Σ_{i=1}^{nh} φ_i^{εi}(V^(i)) + s^T V + s0    (104)
where φ_i^{εi}(V^(i)) = log( 1 + e^{ai S_i^{εi}(V^(i)) + ri} ).
with ρ = ϑ_{12···nv}, and where POLY(V; k) denotes a k-th order polynomial in the elements of V. Note that there are nh = 2^{nv} functions φ_i^{εi}(V^(i)) shown on the right-hand side (RHS) of (106). We will need to determine an appropriate choice of parameters for each one of these functions in order to match the most general form of the distribution of the binary vector V.
The fact that condition (106) can be made true by an appropriate choice of parameters is a consequence of the following lemma.
LEMMA [35]
Choose any positive integer N > 1. Consider a collection of N binary variables Z = (z1, ..., zN)^T, zi ∈ {0, 1}, i = 1, ..., N. For real a and r and ε = ±1, define the parameterized function
φ_{a,r,ε}(Z) ≜ log( 1 + e^{a(z1 + ··· + z_{N−1} + ε zN) + r} ) .
Choose any real number ρ. Then for ε = sign(ρ) there exist values of a and r such that
φ_{a,r,ε}(Z) = ρ z1 z2 ··· zN + POLY(Z; N − 1)    (107)
where POLY(Z; N − 1) is some (N−1)-order polynomial in the elements of Z. In particular, to attain the value of ρ one can always determine some positive value of a and set
r = (1/2 − N) a = −(N − 1/2) a < 0   if   ε = sign(ρ) = +1
r = (3/2 − N) a = −(N − 3/2) a < 0   if   ε = sign(ρ) = −1    (108)
Proof: Because φ(Z) = φ_{a,r,ε}(Z) is a function of a finite, binary vector Z, it can always be represented by an N-order polynomial of the form shown on the right-hand side of Eq. (107):85
φ_{a,r,ε}(Z) = c0 + Σ_{i=1}^N ci zi + Σ_{i<j} cij zi zj + ··· + Σ_{i1<···<iN−1} c_{i1 i2 ··· iN−1} z_{i1} ··· z_{iN−1} + c_{1···N} z1 ··· zN ,
where c_{1···N} = gN(a, r, ε). The trick, then, is to find the coefficient gN(a, r, ε) of the N-order term of the polynomial expansion of φ_{a,r,ε}(Z) as a function of (a, r, ε) and show that one can always choose values of (a, r, ε) to force gN(a, r, ε) = ρ regardless of the value of ρ.
First assume that ρ ≥ 0 and take ε = sign(ρ) = 1. Then one can determine that
gN(a, r, 1) = Σ_{j=0}^N (−1)^{N−j} C(N, j) log( 1 + e^{aj + r} )    (109)
via an iterative procedure based on selectively setting various subsets of the variables zi to zero and the variables on the complements of those subsets to one. Starting this procedure we have
g0(a, r, 1) = c0 = log(1 + e^r) = Σ_{j=0}^0 (−1)^{0−j} C(0, j) log(1 + e^{aj + r})
g1(a, r, 1) = ci = log(1 + e^{a+r}) − c0 = log(1 + e^{a+r}) − log(1 + e^r) = Σ_{j=0}^1 (−1)^{1−j} C(1, j) log(1 + e^{aj + r})
g2(a, r, 1) = cij = log(1 + e^{2a+r}) − ci − cj − c0 = log(1 + e^{2a+r}) − 2 log(1 + e^{a+r}) + log(1 + e^r) = Σ_{j=0}^2 (−1)^{2−j} C(2, j) log(1 + e^{aj + r})
g3(a, r, 1) = cijk = log(1 + e^{3a+r}) − cij − cik − cjk − ci − cj − ck − c0 = log(1 + e^{3a+r}) − 3 log(1 + e^{2a+r}) + 3 log(1 + e^{a+r}) − log(1 + e^r) = Σ_{j=0}^3 (−1)^{3−j} C(3, j) log(1 + e^{aj + r})
···
noting that then ja + r < 0 for j = 0, 1, ..., N − 1. In particular, to ensure that the conditions shown in (110) hold, one can always enforce the constraint
r = (1/2 − N) a = −(N − 1/2) a < 0 .
Given that the conditions in (110) are satisfied, gN(ta, tr, 1) is zero for t = 0 and goes to infinity as t → ∞; thus, via an appropriate choice of the value of t, one can have gN(ta, tr, 1) = ρ for any nonnegative value of ρ.
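The constructive step of this argument can be carried out numerically: compute gN from Eq. (109), impose r = (1/2 − N) a, and bisect on the common scale t until gN equals the desired ρ. A sketch with names of our own:

```python
from math import comb, log, exp

def g_top(N, a, r):
    """g_N(a, r, 1) of Eq. (109): the coefficient of z_1...z_N in the
    multilinear expansion of phi(Z) = log(1 + e^{a(z_1+...+z_N)+r})."""
    return sum((-1) ** (N - j) * comb(N, j) * log(1.0 + exp(a * j + r))
               for j in range(N + 1))

def fit_top_coeff(N, rho, t_hi=400.0):
    """For rho >= 0, bisect on the scale t (with a = t, r = (1/2 - N) t),
    using that g_N is 0 at t = 0 and grows without bound, to find a with
    g_N(a, (1/2 - N) a, 1) = rho -- the constructive step of the lemma."""
    lo, hi = 0.0, t_hi
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g_top(N, mid, (0.5 - N) * mid) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

At a = r = 0 every term is log 2 and the signed binomial sum cancels, so gN starts at zero, consistent with the claim above.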
Now assume that ρ is nonpositive, ρ ≤ 0. Define the new 0–1 binary variables yi by yi = zi for i = 1, ..., N − 1 and yN = 1 − zN, and note that the right-hand side of Eq. (107) becomes
ρ z1 z2 ··· z_{N−1} zN + POLY(Z; N − 1) = (−ρ) y1 y2 ··· y_{N−1} yN + ( ρ y1 y2 ··· y_{N−1} + POLY(Z; N − 1) ) ≜ ρ' y1 ··· yN + POLY(Y; N − 1)
with ρ' ≜ −ρ ≥ 0. For ρ' ≥ 0, we have just shown that there exist values of a' > 0 and r' such that
log( 1 + e^{a'(y1 + ··· + y_{N−1} + yN) + r'} ) = ρ' y1 y2 ··· y_{N−1} yN + POLY(Y; N − 1)
which we rewrite as
log( 1 + e^{a(z1 + ··· + z_{N−1} − zN) + r} ) = ρ z1 z2 ··· z_{N−1} zN + POLY(Z; N − 1) .
With the validity of Eq. (107) having been established, we can now verify the universal approximation condition Eq. (106) by iteratively selecting the parameters ai (and hence the parameters ri, via the constraint (108)) in the functions φ_i^{εi} on the left-hand side of (106) as follows:

Stage 0. Choose ai (and hence ri via (108)) so that a single (note that 1 = C(nv, nv)) φ_i^{εi}(V^(i); ai)-term on the left-hand side (LHS) of (106) is fitted to the single highest-order (nv-th order) monomial term on the right-hand side (RHS),
φ_i^{εi}(V^(i); ai) |_{ni = nv} = ρ v1 ··· v_{nv} + Q(nv − 1) ,
where ρ v1 ··· v_{nv} is the highest-order term on the RHS of (106) and Q(nv − 1) is an (nv − 1)-order polynomial in V. Call the entire right-hand side of (106) F0(V) and cancel the nv-order monomial term by forming F1(V) = F0(V) − φ_i^{εi}(V^(i); ai). F1(V) is a polynomial of at most degree nv − 1 with at most C(nv, 1) = nv monomial terms of order nv − 1.86

Stage 1. For each of the up to nv monomials of order nv − 1 in F1(V), fit individual φ_i^{εi}(V^(i); ai) terms shown on the left-hand side (LHS) of (106), with ni = nv − 1, to match one of those monomials for highest-order-term cancellation purposes. Once this is done, subtract all of these fitted functions from F1(V) to create a function F2(V) that is a polynomial of degree at most nv − 2 with at most C(nv, 2) monomial terms of order nv − 2.
.. .. ..
. . .
Stage k − 1. Continuing in this manner, assume that an at-most $(n_v - (k-1))$-order polynomial function $F_{k-1}(V)$ has been constructed in the previous Stage $k-2$ that contains at most $\binom{n_v}{k-1}$ monomial terms of order $(n_v - (k-1)) = (n_v - k + 1)$. For each of these monomial terms of order $(n_v - k + 1)$, fit one of the terms $\phi_i^{n_i}(V(i); a_i)$, $n_i = n_v - k + 1$, on the left-hand side (LHS) of (106) to it for highest-order-term cancellation purposes. After all $(n_v - k + 1)$-order monomial terms have been fitted, subtract the fitted $\phi_i^{n_i}(V(i); a_i)$ functions from $F_{k-1}(V)$ to form a function $F_k(V)$ that is a polynomial of degree at most $(n_v - k)$ containing at most $\binom{n_v}{k}$ monomial terms.
$\vdots$
Stage $n_v$ − 2. Assume that an at-most $(n_v - (n_v - 2)) = 2$-order (quadratic order) polynomial function $F_{n_v-2}(V)$ has been constructed in the previous Stage $n_v - 3$ that contains at most $\binom{n_v}{n_v-2} = \binom{n_v}{2}$ monomial terms of order $(n_v - (n_v - 2)) = 2$. For each of these monomial terms of order 2, fit one of the terms $\phi_i^{n_i}(V(i); a_i)$, $n_i = 2$, on the left-hand side (LHS) of (106) to it for highest-order-term cancellation purposes. After all order-2 monomial terms have been fitted, subtract the corresponding $\phi_i^{n_i}(V(i); a_i)$ functions from $F_{n_v-2}(V)$ to form a function $F_{n_v-1}(V)$ that is a polynomial of degree at most $(n_v - (n_v - 1)) = 1$ containing at most $\binom{n_v}{n_v-1} = \binom{n_v}{1} = n_v$ monomial terms.
For the remaining $n_v$ unused functions $\phi_i^{n_i}(V(i); a_i)$ (recall that we started the iterative process with $n_h = 2^{n_v}$ of them), set the parameters $a_i$ to zero. Subtract the resulting constant values of the $n_v$ remaining functions $\phi_i^{n_i}(V(i); a_i)$ from $F_{n_v-1}(V)$:
$$F_{n_v-1}(V) \leftarrow F_{n_v-1}(V) - \sum \text{Constants}\,.$$
86 Each such $(n_v-1)$-order monomial term has the form $a_{i_1 \cdots i_{n_v-1}}\, v_{i_1} \cdots v_{i_{n_v-1}}$.
Stage $n_v$ − 1. At most a polynomial of degree one in $V$ remains. Fit the $\binom{n_v}{n_v-1} = \binom{n_v}{1} = n_v$ parameter term $s^T V$ shown on the LHS of Eq. (106) to the sum of the remaining $n_v$ linear-order monomials. Subtract the fitted function $s^T V$ from $F_{n_v-1}(V)$, $F_{n_v}(V) = F_{n_v-1}(V) - s^T V$. Then $F_{n_v}(V)$ is at most a nonzero constant.
Stage $n_v$. At most a nonzero constant function (zeroth-order polynomial), $F_{n_v}(V) = \text{constant}$, remains. Fit the one ($1 = \binom{n_v}{n_v} = \binom{n_v}{0}$) constant $s_0$ shown on the LHS of (106) to it. Subtract $s_0$ from $F_{n_v}(V)$. What remains is the value zero.
Conclusion. Sum all the fitted functions $\phi_i^{n_i}(V(i); a_i)$ and $s^T V$, and the fitted constant $s_0$, as shown in Eq. (104), to form
$$\pi_{\text{true}}(V) \overset{\text{fitted}}{=} e^{s^T V + s_0} \prod_{i=n_v+1}^{n_h} e^{\phi_i^{n_i}(V(i);\, a_i)} = e^{s^T V + s_0} \prod_{i=n_v+1}^{n_h} \left(1 + e^{a_i S_i(V(i)) + r_i}\right) = \pi_\theta(V)\,.$$
Note from Eq.s (95)–(100) and the above procedure that every distinct, nonzero value $a_i \neq 0$, $i > n_v$, is associated with a distinct hidden unit $h_i$. Counting the maximum number of parameters (and hence hidden units) needed from Stage 0 through, and including, Stage $n_v - 2$, we determine that at most
$$\sum_{j=2}^{n_v} \binom{n_v}{j} = \sum_{j=0}^{n_v} \binom{n_v}{j} - \binom{n_v}{1} - \binom{n_v}{0} = 2^{n_v} - n_v - 1$$
hidden units are needed. In Stages $n_v - 1$ and $n_v$ we encounter the $n_v$ visible units $v_i$ and $n_v + 1$ additional parameters (including the normalization parameter). Thus a sufficient condition for a marginalized Boltzmann Machine latent variable model to fit the most general possible binary distribution on $n_v$ visible units is for the model to have (at most87) $n_h = 2^{n_v} - n_v - 1$ hidden units, for a total of $n = n_v + n_h = 2^{n_v} - 1$ units in all.
The above procedure determines that the restricted model (94)–(95) has $2^{n_v} - n_v - 1$ parameters associated with the hidden units, $n_v$ parameters associated with the visible units, and one normalization parameter, for a total of $p = 2^{n_v}$ parameters. From a practical perspective this is not an encouraging result, given that a general categorical distribution $\pi_{\text{true}}(V)$ on an $n_v$-element binary vector $V$ itself requires the specification of $2^{n_v}$ probabilities.
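The counting above is easy to verify mechanically. The following sketch (the function names are mine) tabulates, for small $n_v$, the sufficient hidden-unit count consumed by Stages 0 through $n_v - 2$ and checks it against the binomial identity used in the derivation:

```python
from math import comb

def sufficient_hidden_units(nv: int) -> int:
    # Hidden units consumed in Stages 0 through nv - 2:
    # one per monomial of each order j = 2, ..., nv.
    return sum(comb(nv, j) for j in range(2, nv + 1))

def total_parameters(nv: int) -> int:
    # nh hidden-unit parameters + nv visible parameters + 1 normalization.
    return sufficient_hidden_units(nv) + nv + 1

for nv in range(2, 10):
    nh = sufficient_hidden_units(nv)
    # Binomial identity from the text: sum_{j=2}^{nv} C(nv, j) = 2^nv - nv - 1.
    assert nh == 2**nv - nv - 1
    # The total parameter count matches the 2^nv probabilities that specify
    # a general distribution on nv binary variables.
    assert total_parameters(nv) == 2**nv
```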
However, one should not ignore the potential positive aspects of using a marginalized Boltzmann Machine model. First, this model requires no more than a quadratic energy on the total set of nodes (visible nodes + latent variable nodes), which is consistent with the fact that the use of latent variables can simplify analysis and algorithm development. Secondly, the universal approximation property shows that it is not futile to expect that a marginalized Boltzmann Machine model can model a given binary-vector distribution (even though it might be computationally infeasible to do so). Lastly, the hope when using neural network-based probability models ("generative models") is that nature has special structure that does not require learning the "most general" probability model, and the very many successes demonstrated by the use of neural networks in practice appear to suggest that this is indeed the case. These successful real-world applications of stochastic networks suggest that Boltzmann Machine-like distributions are well-fitted to the type of stochastic structure that apparently exists in nature [tegmark].
$$\theta = \mathrm{vec}(\theta_0, b, W) = (\theta_0, b_1, \cdots, b_n, w_{12}, \cdots, w_{(n-1)n})^T \in \mathbb{R}^p\,, \qquad p = \frac{n^2 + n + 2}{2} \qquad (113)$$
87 Note that we have shown an exact match for the model (94)–(95), which is a restriction of the Restricted Boltzmann Machine; i.e., for a "restriction of a restriction" of the general Boltzmann Machine Distribution. For the unrestricted Boltzmann Machine Distribution fewer hidden units are needed to obtain the same number of parameters, $p$, which always needs to be at least equal to $K_v = 2^{n_v}$ to exactly match the most general categorical distribution on $V$.
88 See Equations (70)–(74) and (77)–(82). In particular note that $W = W^T$ with zero-valued diagonal elements. Also $Z_\theta = e^{-\beta\theta_0}$ as discussed in Appendix D.2.
$$\pi_\theta(X_v) = \sum_{\text{values of } X_h} \pi_\theta(X) = \sum_{\text{values of } X_h} \pi_\theta(X_v, X_h) = \frac{1}{Z_\theta} \sum_{\text{values of } X_h} e^{-\beta E_\theta(X_v, X_h)} \qquad (114)$$
to $\pi_{\text{true}}(X_v)$ using the procedure originally proposed in [1, 22, 23].89 Note that
$$\pi_\theta(X_v, X_h) = \frac{1}{Z_\theta}\, e^{-\beta E_\theta(X_v, X_h)} \qquad (115)$$
and recall
$$Z_\theta = \sum_{\text{values of } X_v,\, X_h} e^{-\beta E_\theta(X_v, X_h)}\,. \qquad (116)$$
Let us assume for a while that $\pi_{\text{true}}(X_v)$ is known and that we are interested in approximating it by a distribution of the form $\pi_\theta(X_v)$. As mentioned in Appendix A, there are sound information-theoretic reasons to determine an approximation that minimizes the Kullback–Leibler Divergence
$$D(\pi_{\text{true}}^v \,\|\, \pi_\theta^v) = E_{\pi_{\text{true}}^v}\!\left\{ \log \frac{\pi_{\text{true}}(X_v)}{\pi_\theta(X_v)} \right\} = \sum_{\text{values of } X_v} \pi_{\text{true}}(X_v) \log \frac{\pi_{\text{true}}(X_v)}{\pi_\theta(X_v)} \qquad (117)$$
of $\pi_\theta(X_v)$ from the true distribution $\pi_{\text{true}}(X_v)$. Note that minimizing $D(\pi_{\text{true}}^v \,\|\, \pi_\theta^v)$ with respect to $\theta$ is equivalent to maximizing the expected per-sample log-likelihood of $\theta$,90
$$\mathrm{ELL}(\theta) \triangleq E_{\pi_{\text{true}}^v}\{\log L(\theta; X_v)\} = E_{\pi_{\text{true}}^v}\{\log \pi_\theta(X_v)\} = \sum_{\text{values of } X_v} \pi_{\text{true}}(X_v) \log \pi_\theta(X_v)\,. \qquad (118)$$
This maximization can be performed by gradient ascent,
$$\theta \leftarrow \theta + \lambda_\theta \nabla_\theta \mathrm{ELL}(\theta)\,, \qquad (119)$$
where $\nabla_\theta \mathrm{ELL}(\theta)$ is the gradient91 of $\mathrm{ELL}(\theta)$ and $\lambda_\theta > 0$ is a step-size parameter used to control the stability and rate of convergence.92 Note from Eq. (118) that the gradient of $\mathrm{ELL}(\theta)$ is equal to the expected gradient of the per-sample log-likelihood,
$$\nabla_\theta \mathrm{ELL}(\theta) = E_{\pi_{\text{true}}^v}\{\nabla_\theta \log \pi_\theta(X_v)\} = \sum_{\text{values of } X_v} \pi_{\text{true}}(X_v)\, \nabla_\theta \log \pi_\theta(X_v)\,. \qquad (120)$$
Since $\nabla_\theta \log \pi_\theta(X_v) = \nabla_\theta \pi_\theta(X_v)/\pi_\theta(X_v)$, the last equation shows that if the value of $\pi_\theta(X_v)$ is less than the true value, then the effect of the gradient of the per-sample likelihood, $\nabla_\theta \pi_\theta(X_v)$ (which is the direction of maximum increase of the value of $\pi_\theta(X_v)$), on the overall, ensemble-averaged $\nabla_\theta \mathrm{ELL}(\theta)$ is enhanced, whereas if its value is greater than the true value, it is diminished. Note that
$$\nabla_\theta D(\pi_{\text{true}} \,\|\, \pi_\theta) = -\nabla_\theta \mathrm{ELL}(\theta) = -E_{\pi_{\text{true}}^v}\{\nabla_\theta \log \pi_\theta(X_v)\} = -\sum_{\text{values of } X_v} \frac{\pi_{\text{true}}(X_v)}{\pi_\theta(X_v)}\, \nabla_\theta \pi_\theta(X_v)\,. \qquad (122)$$
To determine the gradient of $\log \pi_\theta(X_v)$, note from Eq.s (114) and (116) that
$$\begin{aligned}
\frac{\partial}{\partial\theta} \log \pi_\theta(X_v)
&= \frac{\partial}{\partial\theta} \log\!\left(\sum_{\text{values of } X_h} e^{-\beta E_\theta(X_v, X_h)}\right) - \frac{\partial}{\partial\theta} \log\!\left(\sum_{\text{values of } X_v, X_h} e^{-\beta E_\theta(X_v, X_h)}\right) \\
&= -\beta\left( \frac{\sum\limits_{\text{values of } X_h} \frac{\partial E_\theta(X_v, X_h)}{\partial\theta}\, e^{-\beta E_\theta(X_v, X_h)}}{\sum\limits_{\text{values of } X_h} e^{-\beta E_\theta(X_v, X_h)}} - \frac{\sum\limits_{\text{values of } X_v, X_h} \frac{\partial E_\theta(X_v, X_h)}{\partial\theta}\, e^{-\beta E_\theta(X_v, X_h)}}{\sum\limits_{\text{values of } X_v, X_h} e^{-\beta E_\theta(X_v, X_h)}} \right) \\
&= \beta\left( \sum_{\text{values of } X_v, X_h} \pi_\theta(X_v, X_h)\, \frac{\partial E_\theta(X_v, X_h)}{\partial\theta} - \sum_{\text{values of } X_h} \pi_\theta(X_h \mid X_v)\, \frac{\partial E_\theta(X_v, X_h)}{\partial\theta} \right) \\
&= \beta\left( E_\theta\!\left\{\frac{\partial E_\theta(X_v, X_h)}{\partial\theta}\right\} - E_\theta\!\left\{\frac{\partial E_\theta(X_v, X_h)}{\partial\theta} \,\Big|\, X_v\right\} \right)
\end{aligned}$$
so that
$$\begin{aligned}
\nabla_\theta \mathrm{ELL}(\theta) &= E_{\pi_{\text{true}}^v}\{\nabla_\theta \log \pi_\theta(X_v)\} \\
&= \beta\left( E_\theta\{\nabla_\theta E_\theta\} - E_{\pi_{\text{true}}^v}\!\big\{ E_\theta\{\nabla_\theta E_\theta \mid X_v\} \big\} \right) \\
&= \beta\left( E_\theta\!\big\{ E_\theta\{\nabla_\theta E_\theta \mid X_v\} \big\} - E_{\pi_{\text{true}}^v}\!\big\{ E_\theta\{\nabla_\theta E_\theta \mid X_v\} \big\} \right)\,. \qquad (124)
\end{aligned}$$
The term $E_\theta\{\nabla_\theta E_\theta\}$ is the average of the gradient $\nabla_\theta E_\theta$ for the complete (unmarginalized) latent variable model $\pi_\theta(X_v, X_h)$. As noted in Appendix B (see Property 6), the term $-E_\theta\{\nabla_\theta E_\theta\}$ is the vector of moments of $\pi_\theta(X)$ and therefore the right-hand side of Eq. (124) is a difference of moments. Since there is no conditioning, $E_\theta\{\nabla_\theta E_\theta\}$ is an average obtained when the latent variable model is allowed to run "freely", and we say that the model is unpinned (or unclamped). The term $E_\theta\{\nabla_\theta E_\theta \mid X_v\}$ is the average of the gradient $\nabla_\theta E_\theta$ for the latent variable model conditioned on the visible units being pinned to a value $X_v$ obtained by sampling from the true distribution $\pi_{\text{true}}(X_v)$. Because the visible units of the latent variable model are pinned (aka clamped) to a fixed value, in this case the latent variable model is not allowed to run freely. The term $E_{\pi_{\text{true}}^v}\{E_\theta\{\nabla_\theta E_\theta \mid X_v\}\}$ is the $\pi_{\text{true}}(X_v)$-average of the clamped conditional expectations $E_\theta\{\nabla_\theta E_\theta \mid X_v\}$, which are therefore averaged over all possible visible-unit realizations $X_v$ generated according to the true distribution $\pi_{\text{true}}(X_v)$.
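As a sanity check on this unpinned-minus-pinned decomposition, the sketch below exhaustively evaluates both phases for a small hypothetical 3-unit model (with $\beta = 1$; the biases and couplings are arbitrary values of mine) and compares the resulting gradient of $\log \pi_\theta(X_v)$ against a central finite difference:

```python
import itertools, math

# Tiny Boltzmann machine: nv = 2 visible, nh = 1 hidden, beta = 1.
nv, nh, n = 2, 1, 3
b = [0.3, -0.2, 0.5]                            # arbitrary biases
w = {(0, 1): 0.4, (0, 2): -0.1, (1, 2): 0.25}   # arbitrary couplings, i < j

def energy(x, b):
    # E(x) = -sum_i b_i x_i - sum_{i<j} w_ij x_i x_j, with x_i in {-1,+1}.
    return -sum(b[i] * x[i] for i in range(n)) \
           - sum(w[i, j] * x[i] * x[j] for (i, j) in w)

states = list(itertools.product([-1, 1], repeat=n))

def log_pv(xv, b):
    # log pi_theta(Xv) = log sum_{Xh} e^{-E(Xv, Xh)} - log Z
    num = sum(math.exp(-energy(xv + xh, b))
              for xh in itertools.product([-1, 1], repeat=nh))
    Z = sum(math.exp(-energy(x, b)) for x in states)
    return math.log(num) - math.log(Z)

xv = (1, -1)
for i in range(n):
    # Unclamped minus clamped average of dE/db_i = -x_i.
    Z = sum(math.exp(-energy(x, b)) for x in states)
    free = sum(-x[i] * math.exp(-energy(x, b)) / Z for x in states)
    clamped = [xv + xh for xh in itertools.product([-1, 1], repeat=nh)]
    Zc = sum(math.exp(-energy(x, b)) for x in clamped)
    clamp = sum(-x[i] * math.exp(-energy(x, b)) / Zc for x in clamped)
    analytic = free - clamp
    # Central finite difference of log pi_theta(xv) with respect to b_i.
    eps = 1e-6
    bp, bm = b[:], b[:]
    bp[i] += eps
    bm[i] -= eps
    numeric = (log_pv(xv, bp) - log_pv(xv, bm)) / (2 * eps)
    assert abs(analytic - numeric) < 1e-6
```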
Eq. (124) says that $\nabla_\theta \mathrm{ELL}(\theta)$ vanishes (and hence learning stops) when the unpinned (freely running) latent variable model produces unconditional moments that match the true distributional average over the conditional moments produced by the pinned latent variable model. To make this clearer, let us see what this corresponds to in terms of the parameter values $b_i$ and $w_{ij}$ of the latent variable model (112). Note that
$$\frac{\partial}{\partial b_i} E_\theta(X) = -x_i \qquad \text{and} \qquad \frac{\partial}{\partial w_{ij}} E_\theta(X) = -x_i x_j\,,$$
so that, with $\hat m_i(\theta \mid X_v) \triangleq E_\theta\{x_i \mid X_v\}$,
$$\frac{\partial}{\partial b_i}\mathrm{ELL}(\theta) = \beta\Big( \underbrace{E_{\pi_{\text{true}}^v}\{\hat m_i(\theta \mid X_v)\}}_{\hat m_i^+(\theta)} - \underbrace{E_\theta\{\hat m_i(\theta \mid X_v)\}}_{\hat m_i^-(\theta)\, =\, E_\theta\{x_i\}} \Big) = \beta\big(\hat m_i^+(\theta) - \hat m_i^-(\theta)\big) \qquad (125)$$
and
$$\begin{aligned}
\frac{\partial}{\partial w_{ij}}\mathrm{ELL}(\theta) &= \beta\Big( E_{\pi_{\text{true}}^v}\big\{\underbrace{E_\theta\{x_i x_j \mid X_v\}}_{\hat m_{ij}(\theta \mid X_v)}\big\} - E_\theta\big\{\underbrace{E_\theta\{x_i x_j \mid X_v\}}_{\hat m_{ij}(\theta \mid X_v)}\big\} \Big) \\
&= \beta\Big( \underbrace{E_{\pi_{\text{true}}^v}\{\hat m_{ij}(\theta \mid X_v)\}}_{\hat m_{ij}^+(\theta)} - \underbrace{E_\theta\{\hat m_{ij}(\theta \mid X_v)\}}_{\hat m_{ij}^-(\theta)\, =\, E_\theta\{x_i x_j\}} \Big) = \beta\big(\hat m_{ij}^+(\theta) - \hat m_{ij}^-(\theta)\big) \qquad (127)
\end{aligned}$$
according to the model distribution πθ . Note that this is a one-step averaging procedure.
Note that the gradients (125) and (127) vanish (in which case learning ceases) when the moments $\hat m_i^-(\theta)$ and $\hat m_{ij}^-(\theta)$ of the unpinned latent variable model have learned to follow the true-data averages of the conditional moments $\hat m_i^+(\theta)$ and $\hat m_{ij}^+(\theta)$ of the pinned latent variable model. If this occurs, the trained latent variable model can then naturally (i.e., without pinning) generate samples on the visible units that mimic the statistics of the training samples that were drawn from the true distribution $\pi_{\text{true}}(X_v)$.
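The moment-matching learning rule just described can be exercised exactly on a model small enough to enumerate. The sketch below (the "true" visible distribution, step size, and iteration count are illustrative choices of mine) runs gradient ascent on $\mathrm{ELL}(\theta)$ using the exact positive- and negative-phase moments of Eq.s (125) and (127), for $n_v = 2$ visible and $n_h = 1$ hidden units with $\beta = 1$:

```python
import itertools, math

nv, nh, n = 2, 1, 3
vis_states = list(itertools.product([-1, 1], repeat=nv))
all_states = list(itertools.product([-1, 1], repeat=n))
# An arbitrary positive "true" distribution on the visible units.
pi_true = {(-1, -1): 0.4, (-1, 1): 0.1, (1, -1): 0.2, (1, 1): 0.3}

def energy(x, b, w):
    return -sum(b[i] * x[i] for i in range(n)) \
           - sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))

def moments(dist):
    # First and second moments of a distribution over full states.
    mi = [sum(p * x[i] for x, p in dist.items()) for i in range(n)]
    mij = {(i, j): sum(p * x[i] * x[j] for x, p in dist.items())
           for i in range(n) for j in range(i + 1, n)}
    return mi, mij

b = [0.0] * n
w = [[0.0] * n for _ in range(n)]
lam = 0.1                                      # step size (beta absorbed)
for step in range(5000):
    Z = sum(math.exp(-energy(x, b, w)) for x in all_states)
    p_model = {x: math.exp(-energy(x, b, w)) / Z for x in all_states}
    mi_neg, mij_neg = moments(p_model)         # negative (unpinned) phase
    # Positive (pinned) phase: conditional moments averaged over pi_true.
    p_pos = {}
    for xv in vis_states:
        clamped = [xv + xh for xh in itertools.product([-1, 1], repeat=nh)]
        Zc = sum(math.exp(-energy(x, b, w)) for x in clamped)
        for x in clamped:
            p_pos[x] = p_pos.get(x, 0.0) + pi_true[xv] * math.exp(-energy(x, b, w)) / Zc
    mi_pos, mij_pos = moments(p_pos)
    for i in range(n):
        b[i] += lam * (mi_pos[i] - mi_neg[i])
    for (i, j) in mij_pos:
        w[i][j] += lam * (mij_pos[i, j] - mij_neg[i, j])

# After training, the visible marginal should closely match pi_true.
Z = sum(math.exp(-energy(x, b, w)) for x in all_states)
marginal = {xv: sum(math.exp(-energy(xv + xh, b, w)) / Z
                    for xh in itertools.product([-1, 1], repeat=nh))
            for xv in vis_states}
for xv in vis_states:
    assert abs(marginal[xv] - pi_true[xv]) < 0.02
```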
When the gradients (125) and (127) vanish we have94
$$\sum_{\text{values of } X_v} \big(\pi_{\text{true}}(X_v) - \pi_\theta(X_v)\big) = 0$$
$$\sum_{\text{values of } X_v} \hat m_i(\theta \mid X_v)\,\big(\pi_{\text{true}}(X_v) - \pi_\theta(X_v)\big) = 0$$
$$\sum_{\text{values of } X_v} \hat m_{ij}(\theta \mid X_v)\,\big(\pi_{\text{true}}(X_v) - \pi_\theta(X_v)\big) = 0$$
93 $\beta$ has been absorbed into the step-size parameter $\lambda_{\hat b_i}$.
94 The first equation holds because both probabilities must sum to one.
for $i, j = 1, \cdots, n$ with $i < j$ and $n = n_v + n_h$, which is the system of $p = \frac{n^2 + n + 2}{2}$ Method of Moment (MOM) equations (89) described in Section 6.3.2. As mentioned there, and described in Section 6.3.1, a necessary condition for these equations to imply $\pi_{\text{true}}(X_v) = \pi_\theta(X_v)$ in the most general case is that $p \geq K_v = 2^{n_v}$, which is equivalent to the condition (83),
$$n_h \;\geq\; \frac{\sqrt{2^{n_v+3} - 7} - 2 n_v - 1}{2} \;=\; O\!\left(2^{n_v/2}\right)\,.$$
Of course, the hope is that for learning an approximate distribution for $X_v$ that is of practical utility one can get by with
$$n_h \;\ll\; \frac{\sqrt{2^{n_v+3} - 7} - 2 n_v - 1}{2}\,.$$
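The bound is easy to evaluate numerically. This sketch checks, for small $n_v$, that the smallest integer $n_h$ satisfying the bound yields $p \geq K_v = 2^{n_v}$ while one fewer hidden unit falls short:

```python
from math import sqrt, ceil

def nh_lower_bound(nv: int) -> float:
    # Condition (83): smallest real nh with p = (n^2 + n + 2)/2 >= 2^nv,
    # where n = nv + nh.
    return (sqrt(2**(nv + 3) - 7) - 2 * nv - 1) / 2

for nv in range(1, 12):
    nh = max(0, ceil(nh_lower_bound(nv)))
    n = nv + nh
    p = (n * n + n + 2) // 2
    assert p >= 2**nv                      # enough parameters to reach K_v
    if nh >= 1:
        m = n - 1                          # one fewer hidden unit falls short
        assert (m * m + m + 2) // 2 < 2**nv
```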
Discussion of value of β and annealing.
Let us now assume that the true distribution πtrue (Xv ) is either unavailable or intractable to use. In this case, we replace it
with the sample distribution π̂(Xv ) determined from N independent and identically distributed samples of Xv presumed to be
drawn from πtrue (Xv ).
APPENDICES
Note that $N = |D|$. For a given distribution taken from the statistical family, $Q_\theta \in \mathcal{Q}$, equivalently, for a particular choice of $\theta \in \Theta$, we can form the joint distribution on the $N$-sample data set $D_N$:
$$Q_\theta(D_N) = \prod_{j=1}^{N} Q_\theta(X[j]) \qquad \theta \text{ fixed, } D_N \text{ free.}$$
95 We say "truth distribution" rather than "true distribution" because we are assuming that it is true for mathematical convenience. Perhaps a better way to describe $P$ is that it is a reference distribution assumed to describe the stochastic behavior of a world of interest.
96 Note that $X[j] = X^{(\ell)} \in \mathbb{R}^n$ means that the $j$-th sample in $D_N$ takes the $\ell$-th possible realization value in the sample space $\mathcal{X} \subset \mathbb{R}^n$.
Fixing the sample data argument $D_N$ in $F(\theta, D_N)$ produces a $D_N$-parameterized function of $\theta$ called the (full data) likelihood function,
$$L(\theta; D_N) = \prod_{j=1}^{N} L(\theta; X[j]) = \prod_{j=1}^{N} Q_\theta(X[j]) \qquad \theta \text{ free, } D_N \text{ fixed,}$$
where
$$L(\theta; X) = Q_\theta(X) \qquad \theta \text{ free, } X \text{ fixed}$$
is the per-sample likelihood function given the single-sample realization $X$.
The Maximum Likelihood Estimate of $\theta \in \Theta$ given the $N$-sample iid data $D_N$ is defined by
$$\hat\theta_{\text{ml}}(N) = \arg\max_{\theta \in \Theta} L(\theta; D_N)\,.$$
Because the statistical family is often of exponential family type, it is usually convenient to equivalently maximize the log-likelihood function
$$\log L(\theta; D_N) = \sum_{j=1}^{N} \log Q_\theta(X[j])\,.$$
When the distribution $Q_\theta(X)$ belongs to an Exponential Family of Distributions,97 one typically uses the base-$e$ logarithm function, $\log a = \ln a$.
Defining the sample average of the per-sample log-likelihood function $\log L(\theta; X) = \log Q_\theta(X)$ on the $N$-sample iid data set $D$ by
$$\big\langle \log Q_\theta(X) \big\rangle_N = \frac{1}{N} \sum_{j=1}^{N} \log Q_\theta(X[j])$$
then yields
$$\hat\theta_{\text{ml}}(N) = \arg\max_{\theta \in \Theta} \big\langle \log L(\theta; X) \big\rangle_N = \arg\max_{\theta \in \Theta} \big\langle \log Q_\theta(X) \big\rangle_N\,. \qquad (129)$$
Note that under the assumption that $\mathcal{Q}$ is identifiable, this is equivalent to solving98
$$\hat Q(N) = \arg\max_{Q \in \mathcal{Q}} \big\langle \log Q(X) \big\rangle_N\,.$$
I.e., at least mathematically, density estimation is parameter estimation, and parameter estimation is density estimation. Also note from Eq. (129) that the MLE is obtained by maximizing the data-averaged per-sample log-likelihood function
$$\big\langle \log L(\theta; X) \big\rangle_N = \big\langle \log Q_\theta(X) \big\rangle_N\,.$$
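As a concrete instance of Eq. (129), the sketch below (the Bernoulli family and sample size are illustrative choices of mine) verifies by grid search that the maximizer of the data-averaged per-sample log-likelihood is the sample relative frequency:

```python
import math, random

# For the Bernoulli family Q_q(x) = q^x (1 - q)^(1 - x), the maximizer of
# <log Q_q(X)>_N is the relative frequency N1/N of ones in the data.
random.seed(0)
data = [1 if random.random() < 0.7 else 0 for _ in range(500)]
N1 = sum(data)

def avg_loglik(q):
    # Data-averaged per-sample log-likelihood <log Q_q(X)>_N.
    return sum(math.log(q) if x else math.log(1 - q) for x in data) / len(data)

grid = [i / 1000 for i in range(1, 1000)]
q_ml = max(grid, key=avg_loglik)
# The grid maximizer coincides with the sample frequency (a grid point).
assert abs(q_ml - N1 / len(data)) < 1e-3
```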
The MLE θ̂ml (N ) has many interesting and useful finite– and infinite–sample properties which are described well in many
references.99
97 Also known as an Exponential Model [17].
98 Under the assumption of identifiability, θ̂ml (N ) and Q̂(N ) are related in a one-to-one way by Q̂(N ) = Qθ̂ml (N ) .
99 For example, see [17, 18].
Above we discussed the sample average for the particular case of $f(X) = \log Q_\theta(X)$. The Strong Law of Large Numbers (SLLN) yields
$$\big\langle f(X) \big\rangle_N \xrightarrow{\;N \to \infty\;} E_P\{f(X)\} = \sum_{X \in \mathcal{X}} f(X)\, P(X) = \sum_{i=1}^{K} p_i\, f(X^{(i)}) \qquad \text{almost surely.}$$
Thus in the limit of large N a sample average can provide an accurate approximation to a distributional average.100 This fact
leads one to define, and use, the empirical distribution as a surrogate for the unknown truth distribution P .
The empirical distribution $\hat P = \hat P^{(N)}$ is a function of the measured iid data $D = D_N = \{X[1], \cdots, X[N]\}$ and is defined for any realization, known or unknown, as
$$\hat P(X) = \hat P^{(N)}(X) = \frac{1}{|D|} \sum_{X' \in D} \mathbb{1}(X = X') = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}(X = X[j]) = \frac{1}{N} \sum_{j=1}^{N} \delta_{X,\, X[j]}\,,$$
and therefore
$$E_{\hat P^{(N)}}\{f(X)\} = \big\langle f(X) \big\rangle_N\,.$$
As a consequence, the SLLN can be written as
$$E_{\hat P^{(N)}}\{f(X)\} \xrightarrow{\;N \to \infty\;} E_P\{f(X)\} \qquad \text{almost surely.}$$
Note that taking the function to be $f(X) = \mathbb{1}(X = X^{(i)}) = \delta_{X,\, X^{(i)}}$ yields
$$\hat p_i = \hat P(X^{(i)}) = E_{\hat P^{(N)}}\big\{\mathbb{1}(X = X^{(i)})\big\} = \frac{N_i}{N}$$
where
$$\frac{N_i}{N} = \frac{1}{N} \sum_{j=1}^{N} \delta_{X[j],\, X^{(i)}}$$
is the relative frequency of occurrence of the particular realization value $X^{(i)} \in \mathcal{X}$ in the data set of measurements $D_N = \{X[j],\ 1 \leq j \leq N\}$. Note that the empirical distribution can be written as
$$\hat P(X) = \sum_{i=1}^{K} \hat p_i\, \delta_{X,\, X^{(i)}}\,.$$
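A small simulation (the three-state distribution and test function are arbitrary illustrations of mine) of the empirical distribution and the SLLN statements above:

```python
import random
from collections import Counter

# Relative frequencies p_hat_i = N_i/N approach the true probabilities,
# and the empirical expectation of any f approaches E_P{f(X)}.
random.seed(1)
support = ["a", "b", "c"]
p = {"a": 0.5, "b": 0.3, "c": 0.2}
N = 200_000
data = random.choices(support, weights=[p[s] for s in support], k=N)

counts = Counter(data)
p_hat = {s: counts[s] / N for s in support}   # the empirical distribution

f = {"a": 1.0, "b": -2.0, "c": 4.0}           # an arbitrary test function
emp_mean = sum(f[x] for x in data) / N        # <f(X)>_N = E_{P_hat}{f(X)}
true_mean = sum(p[s] * f[s] for s in support)

for s in support:
    assert abs(p_hat[s] - p[s]) < 0.01
assert abs(emp_mean - true_mean) < 0.05
```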
a statement that is often phrased as “the property of ergodicity holds.” The SLLN will hold given appropriate sufficient conditions, such as iid finite-state
random variables having bounded moments.
We interpret IP (X) as providing a measure of the amount of information one has gained by performing a measurement of the
random variable X and obtaining the realization value X = X given that we believe that the probability distribution describing
the random behavior of X is P . IP (X) is also called the surprise occurring when a measurement of X results in the realization
X . In information theory it is typical to use a base-2 logarithm so that the amount of information (or surprise) arising from
the measurement X = X is in units of bits. Note that the occurrence of a low (near zero) probability-P event is interpreted
as producing a lot of information (and is quite surprising) whereas the occurrence of a P -certain (probability one) event is
interpreted as providing no information (or surprise) at all.
If $P$ is the truth distribution, then we take the measure of surprise provided by $I_P$ to be objectively valid. However, it is usually the case that we do not know the truth distribution and instead are using a model of the stochastic behavior of reality that is captured by a parameterized distribution $Q_\theta \in \mathcal{Q}$. In this latter situation, we can only estimate that the amount of information provided by the measurement $X = X$ is $I_{Q_\theta}(X)$. Because the real world, assumed to be behaving according to the truth distribution $P$, is the source of all measurements, on the average we expect to see values of $I_{Q_\theta}$ and $I_P$ given respectively by $E_P\{I_{Q_\theta}(X)\}$ and $E_P\{I_P(X)\}$.
The function
$$H(P, Q_\theta) = E_P\{I_{Q_\theta}(X)\} = E_P\!\left\{\log \frac{1}{Q_\theta(X)}\right\} = -E_P\{\log Q_\theta(X)\} \qquad (131)$$
is the Cross-Entropy101 between the truth distribution P and the model distribution Qθ . It gives the amount of surprise we will
measure on the average when applying IQθ to data repeatedly generated by the real world according to P in an iid manner.
The Cross-Entropy between $P$ and itself is the Self-Entropy, $H(P, P)$, or, more simply, the (Shannon) Entropy, $H(P)$, of $P$:
$$H(P) = H(P, P) = E_P\{I_P(X)\} = E_P\!\left\{\log \frac{1}{P(X)}\right\} = -E_P\{\log P(X)\}\,. \qquad (132)$$
The entropy H(P ) provides an objective measure of the intrinsic amount of information (surprise) that we gain (encounter) on
the average as we repeatedly, in an iid manner, query a world whose stochastic behavior is truly described by the distribution
P.
Define the Information Discrepancy $\Delta I(P, Q_\theta)$ between the model and truth distributions by
$$\Delta I(P, Q_\theta)(X) = I_{Q_\theta}(X) - I_P(X) = \log \frac{P(X)}{Q_\theta(X)}\,.$$
This gives a measure, in bits, of how much one overestimates or underestimates the amount of information obtained in a
measurement X = X assuming the model distribution Qθ when the world actually behaves according to P .102
The Kullback–Leibler (KL) Divergence, $D(P \| Q_\theta)$, of $Q_\theta$ from $P$ is defined to be the average Information Discrepancy assuming that the world behaves according to the truth distribution,
$$D(P \| Q_\theta) = E_P\big\{\Delta I(P, Q_\theta)(X)\big\} = E_P\{I_{Q_\theta}(X)\} - E_P\{I_P(X)\} = H(P, Q_\theta) - H(P)\,. \qquad (133)$$
It is evident that $D(P \| Q_\theta)$ gives a measure of how much the estimated average amount of information (surprise) in the world, given by $H(P, Q_\theta)$, diverges from the true average amount of information (surprise) in the world, as given by $H(P)$.
The Divergence can be equivalently written as
$$D(P \| Q_\theta) = E_P\!\left\{\log \frac{P(X)}{Q_\theta(X)}\right\} = \sum_{X \in \mathcal{X}} P(X) \log \frac{P(X)}{Q_\theta(X)}\,, \qquad (135)$$
and it is in this form that it is commonly encountered in the literature. The KL Divergence obeys the Gibbs Inequality,103
$$D(P \| Q) \geq 0 \ \text{ for all } P, Q\,, \quad \text{with} \quad D(P \| Q) = 0 \ \text{ iff } Q = P \text{ almost surely.} \qquad (136)$$
I.e., on the average we measure more surprise in the world using an incorrect distributional model Qθ than if we measure
surprise using the correct distributional model P .
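The definitions (131)–(133), the form (135), and the Gibbs Inequality (136) can be spot-checked directly for finite distributions (base-2 logarithms, so all quantities are in bits; the particular distributions are my illustrations):

```python
import math, random

def cross_entropy(P, Q):
    # H(P, Q) = -E_P{log2 Q(X)}  (Eq. 131)
    return -sum(P[i] * math.log2(Q[i]) for i in P)

def entropy(P):
    # H(P) = H(P, P)  (Eq. 132)
    return cross_entropy(P, P)

def kl(P, Q):
    # D(P || Q) = H(P, Q) - H(P)  (Eq. 133)
    return cross_entropy(P, Q) - entropy(P)

P = {0: 0.5, 1: 0.25, 2: 0.25}
Q = {0: 0.25, 1: 0.25, 2: 0.5}

assert abs(entropy(P) - 1.5) < 1e-12    # H(P) = 1.5 bits
assert abs(kl(P, P)) < 1e-12            # D(P || P) = 0
assert kl(P, Q) > 0                     # Gibbs Inequality, Q != P

# Gibbs Inequality spot-checked on random positive models:
random.seed(2)
for _ in range(100):
    r = [random.random() + 1e-3 for _ in range(3)]
    Qr = {i: ri / sum(r) for i, ri in enumerate(r)}
    assert kl(P, Qr) >= 0.0
```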
3. $D(P \| Q_\theta) = 0$ if and only if the cross-entropy between $P$ and $Q_\theta$ is equal to the self-entropy of $P$.
Note that Eq. (139) shows that the parameter vector $\theta_{\text{kd}}$ that is optimal in the sense of minimizing the Kullback–Leibler divergence is also optimal in the sense that it maximizes the expected per-sample log-likelihood function,
$$E_P\big\{\log L(\theta; X)\big\} = E_P\big\{\log Q_\theta(X)\big\}\,.$$
Unfortunately, very often the truth distribution $P$ is not known, or is intractably complex, so the procedure of Eq. (139) cannot be carried out in practice.
103 Also known as the Information Inequality.
Thus we can approximate the solution to Eq. (139) by the empirical distribution approximation
$$\hat\theta_{\text{kd}}(N) = \arg\max_{\theta \in \Theta} E_{\hat P^{(N)}}\big\{\log Q_\theta(X)\big\}\,. \qquad (141)$$
Now compare the parameter estimates shown in equations (130) and (141). We see that they are equivalent!
$$\hat\theta_{\text{ml}}(N) = \hat\theta_{\text{kd}}(N)\,. \qquad (142)$$
Thus the MLE estimate $\hat\theta_{\text{ml}}(N)$ is a finite-sample, empirical estimate of the KL Divergence estimator $\theta_{\text{kd}}$, for which we expect that
$$\hat\theta_{\text{ml}}(N) \longrightarrow \theta_{\text{kd}} \quad \text{as} \quad N \to \infty\,.$$
where $\beta = \frac{1}{kT} > 0$.105
By assumption, the energy function $E(X)$ is known and finite, and therefore $P(X) > 0$ for all $X \in \mathcal{X}$. I.e., every finite-energy Boltzmann distribution is a positive distribution.
PROPERTY 1
A positive distribution for a discrete random variable on a finite sample space is equivalent to a finite-energy Boltzmann distribution on that sample space and vice versa.
Proof. The "vice versa" statement has already been discussed. Assuming that $P$ is a positive distribution, then $P(X) > 0$ for all $X \in \mathcal{X}$. Select any state $X_0 \in \mathcal{X}$ to serve as a "reference state". Define $E(X_0)$ to be any finite real number.106 Noting that $P(X_0) > 0$, define the finite value $E(X)$ for any $X \in \mathcal{X}$ by107
$$E(X) \triangleq E(X_0) - \frac{1}{\beta} \ln \frac{P(X)}{P(X_0)}\,.$$
104 The stochastic setup is described at the beginning of Appendix A.1.
105 For completeness we set $k$ equal to Boltzmann's constant. When working with artificial neural networks we can take $k = 1$.
106 It is simplest to take $E(X_0) = 0$.
107 Note that in this appendix we are making it explicit that we are working with the natural logarithm. This is unlike much of the note where we
predominantly use the notation “log” and the precise nature of the logarithm depends on the context.
Then
$$\frac{P(X)}{P(X_0)} = e^{-\beta\left(E(X) - E(X_0)\right)} \;\Longrightarrow\; P(X) = \frac{e^{-\beta E(X)}}{Z}$$
with
$$Z = \frac{e^{-\beta E(X_0)}}{P(X_0)}\,.$$
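Property 1 can be exercised numerically: starting from an arbitrary positive distribution, construct the energies from the formula in the proof and recover the distribution as a Boltzmann distribution (the particular $P$ and $\beta$ below are my choices):

```python
import math

beta = 2.0                       # any beta > 0 works
P = [0.1, 0.4, 0.25, 0.25]       # an arbitrary positive distribution
E0 = 0.0                         # reference-state energy (footnote 106)
# E(X) = E(X0) - (1/beta) ln(P(X)/P(X0)), with X0 the first state.
E = [E0 - math.log(p / P[0]) / beta for p in P]

Z = sum(math.exp(-beta * e) for e in E)
P_rec = [math.exp(-beta * e) / Z for e in E]

# Z = e^{-beta E(X0)} / P(X0), and the Boltzmann form recovers P exactly.
assert abs(Z - math.exp(-beta * E0) / P[0]) < 1e-12
for p, pr in zip(P, P_rec):
    assert abs(p - pr) < 1e-12
```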
To facilitate some of the mathematical derivations below, it is convenient to put $p_j = P(X^{(j)})$ and $E_j = E(X^{(j)})$ and write
$$p_j = \frac{1}{Z} e^{-\beta E_j} \quad \text{with} \quad Z = \sum_{j=1}^{K} e^{-\beta E_j}$$
and
$$\ln p_j = -\beta E_j - \ln Z\,.$$
$$\text{Energy Average:} \quad U = E\{E(X)\} = \sum_{j=1}^{K} p_j E_j = \frac{1}{Z} \sum_{j=1}^{K} E_j\, e^{-\beta E_j}$$
we have
$$\text{PROPERTY 2} \qquad U = -\frac{\partial}{\partial\beta} \ln Z$$
Proof.
$$-\frac{\partial}{\partial\beta} \ln Z = -\frac{1}{Z} \frac{\partial Z}{\partial\beta} = \frac{1}{Z} \sum_{j=1}^{K} E_j\, e^{-\beta E_j} = U\,.$$
The dispersion or standard deviation108 of the energy about its mean is defined as
$$\text{Energy Dispersion:} \quad D_E = \sqrt{\mathrm{Var}_E} = \sqrt{E\{(E(X) - U)^2\}}$$
$$\text{PROPERTY 3} \qquad \mathrm{Var}_E = \frac{\partial^2}{\partial\beta^2} \ln Z = -\frac{\partial}{\partial\beta} U$$
108 Also called the RMS (for “root mean-square”) value.
Proof.
$$-\frac{\partial}{\partial\beta} U = -\frac{\partial}{\partial\beta}\left( \frac{1}{Z} \sum_j E_j\, e^{-\beta E_j} \right) = -\underbrace{\left(\frac{1}{Z}\sum_j E_j\, e^{-\beta E_j}\right)}_{U}\, \underbrace{\left(\frac{1}{Z}\sum_{i=1}^{K} E_i\, e^{-\beta E_i}\right)}_{U} + \underbrace{\frac{1}{Z}\sum_j E_j^2\, e^{-\beta E_j}}_{E\{E^2(X)\}} = \mathrm{Var}_E\,.$$
Properties 2 and 3 are a consequence of the fact that the function $\ln Z(\beta)$ is the energy-moment generating function for the Boltzmann distribution. The ratio of the dispersion to the mean, $D_E/U$, gives a measure of how peaked the energy of the system is about its mean value $U$; this ratio goes to zero as $n \to \infty$. For physical systems where $n$ is of the order of Avogadro's number, this ratio is essentially zero and therefore repeated macroscopic measurements of the energy (essentially) always yield the average value $U$.109
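Properties 2 and 3 can be confirmed by finite differences of $\ln Z(\beta)$ for an arbitrary finite energy spectrum (the energies and $\beta$ below are illustrative):

```python
import math

Es = [0.0, 1.0, 1.5, 3.0]   # an arbitrary finite energy spectrum

def lnZ(beta):
    return math.log(sum(math.exp(-beta * e) for e in Es))

def U(beta):
    # Energy average under the Boltzmann distribution.
    Z = sum(math.exp(-beta * e) for e in Es)
    return sum(e * math.exp(-beta * e) for e in Es) / Z

def varE(beta):
    # Energy variance under the Boltzmann distribution.
    Z = sum(math.exp(-beta * e) for e in Es)
    m2 = sum(e * e * math.exp(-beta * e) for e in Es) / Z
    return m2 - U(beta) ** 2

beta = 0.7
h1, h2 = 1e-5, 1e-4
dlnZ = (lnZ(beta + h1) - lnZ(beta - h1)) / (2 * h1)
d2lnZ = (lnZ(beta + h2) - 2 * lnZ(beta) + lnZ(beta - h2)) / h2**2
assert abs(U(beta) + dlnZ) < 1e-8       # Property 2: U = -d lnZ / d beta
assert abs(varE(beta) - d2lnZ) < 1e-5   # Property 3: Var_E = d^2 lnZ / d beta^2
```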
The entropy of a distribution on a finite sample space $\mathcal{X} = \{X^{(1)}, \cdots, X^{(K)}\}$ with distribution $p_j = P(X^{(j)})$ is defined110 by
$$\text{Entropy:} \quad H(P) = -k \sum_{j=1}^{K} p_j \ln p_j$$
The Boltzmann distribution is the unique distribution, P , on X that maximizes the entropy H(P ) subject to the
constraint that U (P ) takes a specified value U.
Assume that $x_i$ takes binary realization values $x_i \in \{-1, +1\}$, for $i = 1, \cdots, n$. The random variables $x_i$ are binary, or dichotomous, categorical variables and $X$ is a binary categorical random vector. We have $K = 2^n$ and in general $P(X)$ takes $K = 2^n$ values, which are constrained to sum to one. If we assume that $P$ is positive, $P(X) > 0$ for all $X \in \mathcal{X}$, then any such $P$ can be represented as $P(X) = P(X; \theta) = P_\theta(X)$ where
$$\frac{1}{\beta} \ln P(X; \theta) = \theta_0 + \sum_{i=1}^{n} \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \sum_{i<j<k} \theta_{ijk} x_i x_j x_k + \cdots + \theta_{12 \cdots n}\, x_1 x_2 \cdots x_n\,.$$
109 To within measurement error. Noise in the macroscopic measurement device will swamp the microscopic fluctuations of the energy $E(X)$ about its mean value $U$.
Note therefore that any positive distribution, $P(X)$, of a binary random vector $X$ can be written as a Boltzmann Distribution
$$P(X) = P(X; \theta) = P_\theta(X) = \frac{1}{Z_\theta} e^{-\beta E_\theta(X)}$$
for some $K$-dimensional parameter vector $\theta$ with
$$E_\theta(X) = -\sum_{i=1}^{n} \theta_i x_i - \sum_{i<j} \theta_{ij} x_i x_j - \sum_{i<j<k} \theta_{ijk} x_i x_j x_k - \cdots - \theta_{12 \cdots n}\, x_1 x_2 \cdots x_n$$
and
$$Z_\theta = e^{-\beta\theta_0} = \sum_{\text{values of } X} e^{-\beta E_\theta(X)}\,.$$
Note that the form of the partition function $Z_\theta$ clearly gives the constraint condition that must exist amongst the elements of $\theta$ in order to ensure that the distribution properly sums to one,
$$\theta_0 = -\frac{1}{\beta} \ln\!\left( \sum_{\text{values of } X} e^{-\beta E_\theta(X)} \right)\,.$$
The Boltzmann distribution for a binary-components categorical vector X is also known as the Boltzmann Machine dis-
tribution and the Ising model. Property 6 states the existence of a general log-linear model for any Ising model. See the
discussion in Appendix D.2. The following two properties are consequences of the log-linearity of the Ising model.
where we define
$$i_0 = 0 \quad \text{and} \quad m_0(\theta) = E_{P_\theta}\{x_1^0\} = 1\,.$$
Proof. This is a straightforward consequence of the fact that $x_i \in \{-1, +1\}$ with $i \leq n$. See the discussion in Appendix D.2.
Proof. The general element of the vector $\theta$ is equal to $\theta_{i_1 i_2 \cdots i_k}$ for $0 \leq i_1 \leq i_2 \leq \cdots \leq i_k \leq n$, $1 \leq k \leq n$, and equal to $\theta_0$ for $k = 0$. Then the corresponding component of $-E_{P_\theta}\{\nabla_\theta E_\theta(X)\}$ is
$$-E_{P_\theta}\!\left\{ \frac{\partial E_\theta(X)}{\partial \theta_{i_1 i_2 \cdots i_k}} \right\} = E_{P_\theta}\{x_{i_1} x_{i_2} \cdots x_{i_k}\} = m_{i_1 i_2 \cdots i_k}(\theta)$$
for $1 \leq k \leq n$, and $-1$ for $k = 0$. Property 7 states that no other moments of the Boltzmann Machine Distribution need be considered.
Proof. That no moments other than the ones stated need be considered is Property 7. Using the quantities defined in the statement of Property 6, we have
$$\frac{1}{\beta} \frac{\partial}{\partial \theta_{i_1 i_2 \cdots i_k}} \ln Z_\theta = \frac{1}{\beta Z_\theta} \frac{\partial Z_\theta}{\partial \theta_{i_1 i_2 \cdots i_k}} = -\sum_{\text{values of } X} \frac{e^{-\beta E_\theta(X)}}{Z_\theta}\, \frac{\partial E_\theta(X)}{\partial \theta_{i_1 i_2 \cdots i_k}} = \sum_{\text{values of } X} P_\theta(X)\, x_{i_1} x_{i_2} \cdots x_{i_k} = E_{P_\theta}\big\{x_{i_1} x_{i_2} \cdots x_{i_k}\big\}\,.$$
$$\text{Free Energy:} \quad F = -kT \ln Z = -\frac{1}{\beta} \ln Z$$
$$\text{PROPERTY 10} \qquad H = -\frac{\partial F}{\partial T}$$
Proof.
$$-\frac{\partial F}{\partial T} = k \ln Z + kT\, \frac{\partial \ln Z}{\partial \beta}\, \frac{\partial \beta}{\partial T} = k\,(\ln Z + \beta U)
= k\left( \sum_j p_j \ln Z + \beta \sum_j E_j\, p_j \right) = k \sum_j p_j\, (\ln Z + \beta E_j) = -k \sum_j p_j \ln p_j = H\,.$$
PROPERTY 11 $\quad F = U - TH$
Proof.
$$U - F = \sum_{j=1}^{K} E_j\, p_j + kT \ln Z = kT\left( \sum_j \beta E_j\, p_j + \sum_j p_j \ln Z \right)
= kT \sum_j p_j\, (\beta E_j + \ln Z) = -kT \sum_j p_j \ln p_j = TH\,.$$
The fact that $U = F + TH$ for a fixed-volume, constant-particle system in thermal equilibrium with a heat bath (and therefore described by the Boltzmann distribution) is a fundamental result in equilibrium statistical mechanics. It says that of the total (average) energy $U$ stored in the system, only a portion, the free energy $F$, is (at most) available to do useful work; the remaining part, the entropy portion $TH$, being purposeless, chaotic thermal agitation that can never be used to perform useful work.
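Properties 10 and 11 admit the same kind of numerical check (here with $k = 1$ and an illustrative energy spectrum of my choosing):

```python
import math

Es = [0.0, 0.5, 2.0]   # arbitrary finite energy spectrum, k = 1

def quantities(T):
    # Boltzmann distribution at temperature T, with U, H, and F = -T ln Z.
    beta = 1.0 / T
    Z = sum(math.exp(-beta * e) for e in Es)
    p = [math.exp(-beta * e) / Z for e in Es]
    U = sum(pj * e for pj, e in zip(p, Es))
    H = -sum(pj * math.log(pj) for pj in p)
    F = -T * math.log(Z)
    return U, H, F

T, h = 1.3, 1e-6
U, H, F = quantities(T)
assert abs(F - (U - T * H)) < 1e-10            # Property 11: F = U - TH
Fp = quantities(T + h)[2]
Fm = quantities(T - h)[2]
assert abs(H + (Fp - Fm) / (2 * h)) < 1e-6     # Property 10: H = -dF/dT
```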
$$\text{Heat Capacity:} \quad C_V = \frac{\partial U}{\partial T} = -\frac{\beta}{T} \frac{\partial U}{\partial \beta} = -\frac{1}{kT^2} \frac{\partial U}{\partial \beta}$$
$$\text{PROPERTY 12} \qquad C_V = \frac{\mathrm{Var}_E}{kT^2}$$
This is the rationale underlying the heuristic optimization procedure known as simulated annealing [25].
111 If it is exactly the case that $T = 0$, then $F = U = E(X)$ (since then $\mathrm{Var}_E = 0 \Rightarrow E(X) = U$ while $TH = 0 \Rightarrow F = U$) and $F = E(X)$ can only be a minimum if $X$ makes $E(X)$ a minimum.
with β > 0, satisfies certain optimality properties. We assume that the energy is always finite,
For convenience, we define $p_j = P(X^{(j)})$ and $E_j = E(X^{(j)})$. Recall that a distribution on $\mathcal{X}$ is subject to the constraints
$$p_j \geq 0 \quad \text{and} \quad \sum_{j=1}^{K} p_j = 1\,.$$
Note that the second condition imposes a linear constraint on the space of admissible probabilities.
The important statistical mechanical quantities of Entropy, $H$, Average Energy, $U$, and Free Energy, $F$, are given by [6, 24],112
$$H(P) = -\sum_{j=1}^{K} p_j \ln p_j\,, \qquad U(P) = \sum_{j=1}^{K} p_j E_j\,, \qquad \text{and} \qquad F(P) = U(P) - T\, H(P)\,,$$
where the (pseudo) temperature $T$ and inverse temperature $\beta$ are reciprocally related, $\beta = T^{-1}$.
where $\lambda$ is a continuous, real-valued Lagrange multiplier. Setting the derivative of $\ell(P, \lambda)$ with respect to $p_j$ to zero yields
$$\frac{\partial}{\partial p_j} \ell(P) = \beta E_j + 1 + \ln p_j + \lambda = 0 \;\Longrightarrow\; p_j = \frac{1}{Z} e^{-\beta E_j} > 0$$
where
$$Z = e^{1+\lambda} \;\Longleftrightarrow\; \lambda = \ln Z - 1\,.$$
The normalization requirement forces $Z = \sum_j e^{-\beta E_j}$, which fixes the value of $Z$, and hence that of the Lagrange multiplier $\lambda$. Note that the solution is unique and always satisfies the positivity constraint $p_j > 0$.114
We have established that the stationary solution of the Lagrangian is of the form
$$P(X) = \frac{1}{Z} e^{-\beta E(X)} > 0$$
for all $X \in \mathcal{X} = \{X^{(1)}, \cdots, X^{(K)}\}$ and finite $\beta > 0$. Note that we did not impose the nonnegativity inequality constraint $p_j \geq 0$ on the distribution, and therefore our solution is the unique stationary point of the Lagrangian regardless of the nonnegativity constraint. Thus the inequality constraint can be ignored when asking if the stationary point is a minimum, a maximum, or a
112 See also Appendix B.
113 Physically, this corresponds to the system being in thermal equilibrium with a system at a known, fixed temperature $T$.
114 Recall that at the outset we made the assumption that $|E(X)| < \infty$ for all $X \in \mathcal{X}$.
saddle point of the linearly constrained optimization problem.115 So we ask: is the stationary solution a linearly constrained minimum of the free energy $F(P)$? Note that the second derivative of $\beta F(P)$ with respect to $p_j = P(X^{(j)})$ is given by $\frac{1}{p_j}$, which is strictly positive for all $p_j > 0$. Further note that the Hessian of $\beta F(P)$ is a purely diagonal matrix containing the second derivatives $\frac{1}{p_j}$ on the diagonal, which means that the Hessian of the free energy $F(P)$ is strictly positive-definite for all $P > 0$. This further implies that the projection of the Hessian onto any admissible subspace of first-order variations of the values of $p_j$ about the stationary point is (locally) strictly positive-definite on that subspace.116 Thus our solution is the unique global minimizer of the free energy subject to the normalization constraint.117
where we again assume the existence of a specified, finite energy function $E(X)$, $E_j = E(X^{(j)})$. Note that the average energy constraint is linear in $P$, so that together with the normalization constraint $\sum_j p_j = 1$, we are imposing two linear equality constraints on the set of admissible probability values.
We now show that for this situation, the Boltzmann distribution is the unique maximum entropy distribution subject to the
given average energy constraint. In this case the inverse temperature β (equivalently T ) is a Lagrange multiplier that needs to
be determined.118
Incorporating the additional constraint that the probabilities sum to one, the Lagrangian for the problem of maximizing the entropy subject to the given constraints is119
$$\ell(P; \lambda, \beta) = H(P) - \beta\, U(P) - \lambda \sum_j p_j = \sum_j \big( -p_j \ln p_j - \beta p_j E_j - \lambda p_j \big)\,,$$
where $\beta$ and $\lambda$ are continuous, real-valued Lagrange multipliers. Take the derivative with respect to $p_j$ and set it to zero to determine the unique stationary solution,
$$\frac{\partial}{\partial p_j} \ell(P) = -1 - \ln p_j - \beta E_j - \lambda = 0 \;\Longrightarrow\; p_j = \frac{1}{Z} e^{-\beta E_j}$$
where
$$Z = e^{1+\lambda} \;\Longleftrightarrow\; \lambda = \ln Z - 1\,.$$
Note that $Z = Z(\beta) = \sum_j e^{-\beta E_j}$, so that $\lambda = \lambda(\beta)$ and $p_j = p_j(\beta)$. The value of $\beta$ (equivalently $T$) is determined, in principle, by solving the (highly nonlinear in $\beta$) constraint equation
$$\sum_{j=1}^{K} p_j(\beta)\, E_j = U\,.$$
Again, the probabilities are strictly positive, $p_j > 0$ for all $j$, so we do not have to worry about the existence of nondifferentiable boundary-point solutions.
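Since $\partial U/\partial\beta = -\mathrm{Var}_E < 0$ (Property 3), $U(\beta)$ is strictly decreasing in $\beta$ and the constraint equation can be solved for $\beta$ by simple bisection. A sketch with illustrative energies and target average energy (both my choices):

```python
import math

Es = [0.0, 1.0, 2.0, 4.0]   # arbitrary finite energy spectrum
U_target = 1.2              # must lie below the beta -> 0 average, here 1.75

def U(beta):
    # Average energy under the Boltzmann distribution at inverse temperature beta.
    Z = sum(math.exp(-beta * e) for e in Es)
    return sum(e * math.exp(-beta * e) for e in Es) / Z

# U is strictly decreasing in beta: U(lo) ~ 1.75 > U_target > U(hi) ~ 0.
lo, hi = 1e-9, 50.0
for _ in range(200):
    mid = (lo + hi) / 2
    if U(mid) > U_target:
        lo = mid
    else:
        hi = mid
beta = (lo + hi) / 2
assert abs(U(beta) - U_target) < 1e-9
```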
115 I.e., because the inequality constraint is irrelevant, we do not have to worry about the possibility of an optimal value of the solution being at a nondifferentiable boundary point, which would correspond to at least one of the probability values, pj , being zero, pj = 0, if the inequality constraint on pj were relevant and active.
116 Let p = (p1 , · · · , pK )T . Because of the single linear constraint eT p ≡ 1, all admissible first-order variations δp of p must satisfy eT δp = 0. I.e.,
p can only (infinitesimally) vary on the (K − 1)-dimensional subspace that is orthogonal to the vector e. The restriction of the hessian of βF (P ) to that
subspace must be strictly positive-definite for infinitesimal variations about the stationary point, which therefore must be at least a local minimum. Because it
is the unique stationary point of the Lagrangian (and the inequality constraints are irrelevant), i.e., because there are no other possible points to consider, the
stationary solution must be a global minimum.
117 Another way to think about this is that F (P ) having a strictly positive-definite hessian on the domain P > 0 means that F (P ) is strictly convex in
P on that domain. This means that F (P ) restricted to any subspace of that domain is strictly convex and therefore can have at most one
minimum on that domain, which must be a stationary point of the Lagrangian, and which is unique if it exists. Thus our solution, which is the unique
stationary point of the Lagrangian, is the unique minimizing solution that satisfies the linear constraint eT p = 1.
118 Remember that physically the Boltzmann distribution describes the equilibrium thermal behavior of a system in thermal equilibrium with a heat bath
at temperature T . If U is large, the system likely contains a lot of thermal energy, in which case T is likely large in value (equivalently, β is small). On the
other hand, if U is small, then likely the temperature T is small (equivalently, β is large).
119 Note that in the theory of Lagrange multipliers, we are free to choose the signs in front of the multipliers, with a particular choice usually made for convenience.
Note that there is one, and only one, stationary solution for the Lagrangian variational problem. That the stationary
solution maximizes the entropy subject to the two imposed linear constraints is confirmed by noting that the hessian of the
entropy function is diagonal with diagonal elements −1/pj < 0 that are everywhere strictly negative in the neighborhood of the
stationary solution. This means that the hessian restricted to any subspace of allowable variations of pj about the stationary
solution is strictly negative definite.120 As a consequence, the unique stationary point is the unique global maximum on the set
of admissible solutions.121
C.3 Maximum Entropy Distribution Subject to 1st and 2nd Moment Constraints
In the previous two situations, the energy function E(x) is assumed to be a priori known. Here we show that entropy maximization subject to constraints on the first and second non-central moments of the discrete random vector X uniquely yields the
quadratic exponential (Boltzmann-form) distribution.122 The imposed moment constraints are

f (P ) ≜ EP {X} = Σ_{j=1}^{K} pj X(j) ≡ µ ∈ R^n ,   X ∈ X = {X(1) , · · · , X(K) } ⊂ R^n ,

and

g(P ) ≜ EP {XX T } = Σ_{j=1}^{K} pj X(j) X(j)T ≡ R = [Rk,ℓ ] ∈ R^{n×n} .
We call the matrix R of second non-central moments, somewhat erroneously, the correlation matrix.123 Of course, we also
have the normalization condition
c(P ) ≜ Σ_{j=1}^{K} pj ≡ 1 .
Note that the first (vector) constraint f (P ) = µ imposes n linear equality constraint conditions on P while the second
(matrix) constraint g(P ) = R = RT imposes n(n + 1)/2 linear equality constraint conditions. Taken together with the
normalization condition c(P ) = 1, we have a total of (n^2 + 3n + 2)/2 linear equality constraints that must be satisfied by P , which is also
the number of Lagrange multipliers we will need to incorporate into our Lagrangian.
It is convenient to note that for b ∈ R^n ,

bT f (P ) = Σj pj bT X(j) ,

and for W ∈ R^{n×n} ,

tr( g(P )W ) = Σj pj tr( X(j) X(j)T W ) = Σj pj X(j)T W X(j) .
120 In this case, because of the existence of two linear constraints, the space of admissible infinitesimal variations has dimension (K − 2), where K is the
dimension of the probability vector P .
121 The entropy having a strictly negative-definite hessian on the domain P > 0 means that it is strictly concave on that domain, and
therefore on any subspace of that domain. Thus, on any subspace of the domain P > 0 the entropy has at most one unique maximum, which we identify as
the unique stationary solution of the Lagrangian.
122 This result is the discrete state version of the well-known fact that for continuous random vectors the maximum entropy distribution subject to constraints
on the first two moments is the multivariate Gaussian distribution.
123 Note that if µ = 0 then R is also the second central moment of X, so in fact we could drop the adjective “non-central” in the description of this particular problem.
Because the last expression is a quadratic form, with no loss of generality we can take W = W T , which then means that W
contains n(n + 1)/2 independent elements.
Maximizing the entropy H(P ) = − Σj pj ln pj subject to the (n^2 + 3n + 2)/2 linear equality constraints f (P ) ≡ µ, g(P ) ≡ R,
and c(P ) ≡ 1 leads to the Lagrangian124

ℓ(P ; λ, βb, (β/2)W ) = H(P ) + β bT f (P ) + (β/2) tr( g(P )W ) − λ c(P )

where λ, βb, and (β/2)W = (β/2)W T comprise 1 + n + n(n + 1)/2 = (n^2 + 3n + 2)/2 continuous, real-valued Lagrange
multipliers corresponding to the (n^2 + 3n + 2)/2 linear equality constraints. We set the derivative of the Lagrangian with
respect to pj equal to zero,
(∂/∂pj ) ℓ(P ) = −1 − ln pj + β bT X(j) + (β/2) X(j)T W X(j) − λ = 0
to find the unique stationary solution
pj = (1/Z) e^{−βEj} > 0

where

Ej = E(X(j) ) = −(1/2) X(j)T W X(j) − bT X(j)

and the partition function (normalizing factor) Z satisfies

Z = e^{1+λ} ⇐⇒ λ = ln Z − 1 .
In principle the (n^2 + 3n + 2)/2 constraints allow one to solve for the (n^2 + 3n + 2)/2 Lagrange multipliers. Again note that the probabilities
2 Lagrange multipliers. Again note that the probabilities
are strictly positive, pj > 0, for all j, so we don’t have to worry about the existence of nondifferentiable boundary point
solutions.
The unique stationary solution to the Lagrangian maximizes the entropy subject to the imposed first and second moment
constraints because the hessian of the entropy function is diagonal with diagonal elements −1/pj < 0 that are everywhere strictly
negative in the neighborhood of the stationary solution. This means that the hessian restricted to any subspace of allowable
variations of pj about the stationary solution is strictly negative definite.125 As a consequence, the solution is the unique global
maximum on the set of admissible solutions.126
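As a concrete check of the form of this solution, one can enumerate a small ±1-valued state space, build pj ∝ e^{−βEj} with the quadratic energy above, and read off the resulting moments. The parameters W and b below are hypothetical, chosen only for illustration:

```python
import itertools
import numpy as np

n, beta = 3, 1.0
rng = np.random.default_rng(0)
W = rng.normal(size=(n, n)); W = 0.5 * (W + W.T)   # symmetric multiplier matrix
b = rng.normal(size=n)

# All K = 2^n states with components x_i in {-1, +1}
X = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))

# p_j proportional to exp(-beta E_j), with E_j = -(1/2) X^T W X - b^T X
logits = beta * (0.5 * np.einsum('ki,ij,kj->k', X, W, X) + X @ b)
p = np.exp(logits - logits.max())
p /= p.sum()

mu = p @ X                                   # first moment E{X}
R = np.einsum('k,ki,kj->ij', p, X, X)        # second non-central moment E{X X^T}
assert np.allclose(np.diag(R), 1.0)          # x_i^2 = 1 forces a unit diagonal
assert abs(p.sum() - 1.0) < 1e-12
```

Note the unit diagonal of R: for ±1-valued components the diagonal second-moment constraints are satisfied automatically, which is why only the off-diagonal entries of W carry information.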
A careful reading of the literature on categorical data analysis shows that there is a close relationship between the positive
distribution discrete random variable models considered in this note and those used in the modeling of categorical data [9, 29,
10, 2, 13].
124 We can, and do, take β to be any fixed, positive value, β > 0, by definition. Of course b and W (and λ) must still be determined from the requirement that the imposed moment and normalization constraints be satisfied.
125 The entropy having a strictly negative-definite hessian on the domain P > 0 means that it is strictly concave on that domain, and
therefore on any subspace of that domain. Thus, on any subspace of the domain P > 0 the entropy has at most one unique maximum, which we identify as
the unique stationary solution of the Lagrangian.
determination of a Lagrange multiplier λ which sets the value of Z, and vice versa.
129 For fixed DN the likelihood is just the multinomial distribution P (DN ) of Equation (145) viewed as a function of the probabilities P (X(ℓ) ).
130 With P positive, for the maximum likelihood method to be sound, at a minimum we want N large enough that Nℓ ≠ 0 for all ℓ = 1, · · · , K. We also
want N large enough so that the estimation errors for the K estimates are reasonably small.
131 See footnote 10.
132 For both distributions, the elements of the parameter vector ϑ are given by the θ-parameters shown on the far right hand side of the equations.
In the next subsection, where we discuss binary categorical variables, we provide a different motivation for viewing the
three distributions as reasonable approximations to the true K-probabilities categorical distribution (144).
The Logistic Regression Categorical (LRC) distribution (148) has only133 n+1 parameters to learn instead of the K = mn
parameters of the true distribution (144), and its log-probability depends at most linearly on realization values X. However
this simplification comes at a price—note that the LRC distribution can only model very simple behavior in that it corresponds
to the assumption of complete independence among the components of X,
P (X; ϑ) = P (x1 ; ϑ) · · · P (xn ; ϑ) with P (xi ; ϑ) ∝ e−θi xi .
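The complete-independence structure can be seen directly by assembling the joint distribution from the n single-component factors. A small sketch with hypothetical θ-parameters:

```python
import itertools
import numpy as np

theta = np.array([0.5, -1.0, 0.2])   # hypothetical theta_i parameters

def p_comp(i, x):
    """Single-component probability P(x_i = x), with x in {-1, +1}."""
    return np.exp(-theta[i] * x) / (np.exp(-theta[i]) + np.exp(theta[i]))

# Joint LRC probabilities: the product of the independent component factors
joint = {xs: np.prod([p_comp(i, x) for i, x in enumerate(xs)])
         for xs in itertools.product([-1, 1], repeat=len(theta))}
assert abs(sum(joint.values()) - 1.0) < 1e-12   # a properly normalized joint
```

Because each factor normalizes separately, the product is automatically a valid joint distribution; no interaction between components can be represented.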
The Quadratic Exponential Categorical Distribution (equivalently, the BQE distribution) of Eq. (149) is an approximation
to the true multivariate categorical distribution (144) that captures pairwise dependencies between the random categorical
components xi and xj . Empirically, these dependencies are captured in contingency tables [2, 13], which can be thought of as
multidimensional arrays of joint outcome counts of the realization values for an ordered set of random variables. For example,
if the ordered set of random variables is (xi , xj ) and they can both take binary values a, b then we have a two-dimensional
2 × 2 array of counts for the possible joint outcomes (a, a), (a, b), (b, b), (b, a). These raw counts must sum up to the total
amount of collected data, Nij , and the joint outcome counts divided by Nij give the sample frequencies of the joint outcomes.
If we have three categorical variables (xi , xj , xk ) taking binary values then we have a three-dimensional 2 × 2 × 2 array of
joint outcome counts. In the n-components case where X = (x1 , · · · , xn )T we have an n-dimensional array of joint outcome
counts between the components of X. Such arrays of outcome counts are known as contingency tables. It is one of the goals
of categorical data analysis to determine which variables xi are contingent on (i.e., statistically dependent on) variables xj .
Since the dependency structure is contained in the joint distribution P (X) = P (x1 , · · · , xn ), determining good estimates of
this distribution is a primary objective of categorical data analysis. Obviously, the LRC distribution will not detect any nontrivial
contingencies since it encodes the independence assumption. However the QECD distribution (149) can detect contingencies.
Notice that the QECD (aka BQE) distribution requires the specification or estimation of (n^2 + 3n + 2)/2 parameters, which
should be compared to the K = m^n parameters required for the true multivariate categorical distribution (144). For example,
if m = 4 and n = 10, we have 66 versus K = 4^10 = 1, 048, 576 parameters respectively. Note that in this case the simple
LRC distribution, which cannot encode dependencies between the variables, requires only 11 parameters.
134 K can still be quite vast, even in relatively simple situations. See footnote 10.
For mathematical convenience, and with no loss of generality,135 unless otherwise indicated, we take the components of
X to take the values xi = xi ∈ {−1, +1}.136 With this choice, we have x2i = 1 which has several nice consequences.
One useful consequence of xi^2 = 1 is that higher powers of xi reduce to lower powers. Specifically, let ℓ be any nonnegative
integer; then

xi^ℓ = 1 when ℓ is even,   xi^ℓ = xi when ℓ is odd.   (151)

Thus in an expansion in powers of xi there is no need to consider powers other than zero and one.137
Another consequence of the fact that xj^2 = 1 is that in sums of products of the components of X, higher-order products
reduce to lower-order products when component indices are equal.138 For example,

i = j =⇒ xi xj = 1   and   j = ℓ =⇒ xi xj xk xℓ = xi xk .   (152)
Another useful property due to the choice xi = ±1 is that for any real-valued quantity α
Let x1 , · · · , xn be binary random variables with realization values xi = ±1, i = 1, · · · , n. For any function
f (x1 , · · · , xn ),

Σ_{x1 = ±1, ··· , xn = ±1} f (x1 , · · · , xn ) = Σ_{α1 = ±1, ··· , αn = ±1} f (α1 x1 , · · · , αn xn )   (154)
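Both properties (151) and (154) are easy to verify exhaustively for small n. The function f below is an arbitrary hypothetical example:

```python
import itertools

# Property (151): powers of x_i in {-1, +1} reduce to 1 (even) or x_i (odd)
for x in (-1, 1):
    for l in range(8):
        assert x**l == (1 if l % 2 == 0 else x)

# Property (154): for any fixed realization x_i = +-1, summing f over all
# sign flips alpha_i gives the same result as summing f over all states
f = lambda x1, x2, x3: x1 - 2.0 * x2 * x3 + 0.5 * x1 * x2 * x3
full = sum(f(*xs) for xs in itertools.product([-1, 1], repeat=3))
for xs in itertools.product([-1, 1], repeat=3):          # each fixed realization
    flipped = sum(f(a * xs[0], b * xs[1], c * xs[2])
                  for a, b, c in itertools.product([-1, 1], repeat=3))
    assert flipped == full
```

The identity (154) holds because, for fixed xi = ±1, the map αi ↦ αi xi is a bijection of {−1, +1} onto itself.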
Because the true K-parameter binary categorical distribution P of Eq. (144) is assumed to be positive, we can write

(1/β) ln P (X) = g(X)   ∀X ∈ X   (155)

for some function g and any fixed β > 0. Now select g to be of the following form,139
g(X) = θ0 + Σ_{i=1}^{n} θi xi + Σ_{i<j} θij xi xj + Σ_{i<j<k} θijk xi xj xk + · · · + θ12···n x1 x2 · · · xn   (156)

or, equivalently,140

g(X) = θ0 + θT x + (1/2) xT Θ x + Σ_{i<j<k} θijk xi xj xk + · · · + θ12···n x1 x2 · · · xn ,   Θ = ΘT   (157)
Because equalities between the component indices cause higher-order products of the components to reduce to lower order
products, we can use strict inequalities in the sums. Note that Eq. (156) is the most fully general polynomial expansion of g(X)
135 Any two dichotomous random variables Y ∈ {α1 , α2 } and Z ∈ {β1 , β2 } are related by a simple affine transformation Y = aZ + b. For example if
Y ∈ {0, 1} and Z ∈ {−1, +1} we have Y = (1 + Z)/2 and Z = 2Y − 1.
136 An alternative nice choice is to have xi take values 0 or 1.
137 Note that this property also holds if we instead take xi ∈ {0, 1}, as then xi^k = xi for all integers k ≥ 1.
138 This property also holds if we instead take xi ∈ {0, 1}, as then xi^k = xi for all integers k ≥ 1.
139 Here, and in the subsequent development, we follow references [9] and [10].
140 Note that Θ = ΘT where [Θ]ij = θij for i ≠ j and [Θ]ii = 0.
because all products and powers of higher order than shown in the sum will reduce to terms already in the sum. Also note that
Eq. (156) is linear in the unknown θ-parameters.
Are there enough θ-parameters in (155)–(156) to specify the K = 2^n values of the true binary distribution (144)? Note
that in each sum in (156) there are as many parameters as there are terms in the sum. The number of terms in an order-r sum
is the number of ways that r of the n components of X can be chosen without regard to order and with no repetition, which is
the binomial coefficient C(n, r) = n!/(r!(n − r)!).141 Thus the number of θ-parameters is

C(n, 0) + C(n, 1) + C(n, 2) + · · · + C(n, n − 2) + C(n, n − 1) + C(n, n) = (1 + 1)^n = 2^n = K .
We see, then, that the number of independent parameters needed to specify g is K, the same as for the true distribution P . In
(156), conceptually, the parameter θ0 serves to normalize the distribution to sum to one and the remaining K − 1 parameters
are used to set the values of the unnormalized probabilities. Knowledge of the K values of P (X) (and hence of g(X)) yields
K linear equations in the K θ-parameters that can be solved to uniquely determine the parameter values provided that the
collection of K = 2^n polynomial product terms

B = { 1, xi (1 ≤ i ≤ n), xi xj (i < j), xi xj xk (i < j < k), · · · , x1 x2 · · · xn }   (158)

form a linearly independent set of functions. This is indeed the case (the somewhat tedious details142 are given below),
and therefore the representation given by equations (155)–(157) can completely match any possible binary distribution (144)
[9, 10].
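The linear independence claim can also be checked numerically for small n by evaluating all K = 2^n monomials at all K states and verifying that the resulting K × K matrix is nonsingular; the θ-parameters of any positive distribution then follow by solving a linear system. A sketch:

```python
import itertools
import numpy as np

n = 3
states = list(itertools.product([-1, 1], repeat=n))          # the K = 2^n states
index_sets = [S for r in range(n + 1)
              for S in itertools.combinations(range(n), r)]  # monomial index sets

# A[s, m] = value of monomial m (product of x_i over its index set) at state s;
# the empty index set gives the constant monomial 1.
A = np.array([[np.prod([x[i] for i in S]) for S in index_sets] for x in states],
             dtype=float)
assert np.linalg.matrix_rank(A) == 2**n   # the monomials are linearly independent

# Unique theta-coordinates of (1/beta) ln P for an arbitrary positive distribution
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(2**n))          # a random positive distribution
theta = np.linalg.solve(A, np.log(P))
assert np.allclose(A @ theta, np.log(P))
```

Full rank of A is exactly the statement that the K linear equations in the K θ-parameters have a unique solution for every positive P.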
and let the parameterized distribution P (X; ϑ) be the representation given by Equations (155)–(157). We call the fully parameterized representation (155)–(157) for the true K-parameter binary categorical distribution P of Eq. (144) the General
Log-Linear Binary distribution model for the true categorical distribution P (x):
xi cannot be directly applied to the case where xi is discrete. If someone is aware of a published proof, or an outline of a proof, let me know and I’ll add the
reference to this note.
for all
f (·) ∈ F = {f | f : X → R} . (161)
In words, every function of X = (x1 , · · · , xn )T can be represented by an n-th order polynomial of partial degree one with
respect to each component variable xi , i = 1, · · · , n. Note that the space of functions F is finite-dimensional,

dim(F) = |B| = K = 2^n .
Note, in particular, that the elements of B can be used to represent the log-probability of any positive distribution over X by
taking the function f ∈ F to satisfy

f (X) = −(1/β) ln P (X) ∈ R .   (162)
Note that f (X) constrained as in Eq. (162) means that f belongs to a submanifold of F which is defined by the constraint
that the probability P must sum to one, a requirement that imposes a constraint condition between θ0 and the remaining
components of the parameter vector θ in the representation (160).
It is evident that a rational way to approximate the true K-probabilities distribution (144) for a binary components cate-
gorical vector X is to simplify the GLLB distribution P (x; ϑ) by setting some of the component parameters of ϑ identically
equal to zero in order to selectively remove higher-order terms from the full expansion shown in (159). When we do this, we
will say we are working with a reduced parameter vector and log-linear distribution, which we still denote by ϑ and P (x; ϑ)
respectively, allowing context to disambiguate the situation.144 Three such approximations that one can use are:
Number of parameters = C(n, 0) + C(n, 1) = 1 + n.

Number of parameters = C(n, 0) + C(n, 1) + C(n, 2) = 1 + n + n(n − 1)/2 = (n^2 + n + 2)/2.
143 The parameter β is known and fixed to be any positive real number.
144 Such models are known as non-saturated or unsaturated log-linear models in the literature on the analysis of categorical variables [2].
Number of parameters = C(n, 0) + C(n, 1) + C(n, 2) + C(n, 3) = (n^2 + n + 2)/2 + n(n − 1)(n − 2)/6.
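The three truncation levels and the full expansion have parameter counts that are quickly confirmed with binomial coefficients:

```python
from math import comb

n = 10
assert comb(n, 0) + comb(n, 1) == 1 + n                                # first order
assert comb(n, 0) + comb(n, 1) + comb(n, 2) == (n * n + n + 2) // 2   # second order
assert (comb(n, 0) + comb(n, 1) + comb(n, 2) + comb(n, 3)
        == (n * n + n + 2) // 2 + n * (n - 1) * (n - 2) // 6)          # third order
assert sum(comb(n, r) for r in range(n + 1)) == 2**n                   # full GLLB
```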
To show that (158) forms a linearly independent set we need to demonstrate that

0 = θ0 + Σ_{i=1}^{n} θi xi + Σ_{i<j} θij xi xj + Σ_{i<j<k} θijk xi xj xk + · · · + Σ_{i1 <i2 <···<in−1} θi1 i2 ···in−1 xi1 xi2 · · · xin−1 + θ12···n x1 x2 · · · xn   (167)

if and only if all of the K = 2^n coefficients are zero, θ0 = θ1 = · · · = θ12···n = 0. If we assemble all of the coefficients in
equation (167) into a vector θ ∈ R^K , K = 2^n , then the condition for independence of the terms in B can be succinctly stated
as the requirement that (167) holds if and only if θ = 0.
It will be easier to work with the binary variables Z = (z1 , · · · , zn )T ,

zi = (xi + 1)/2 ∈ {0, 1}.
Using the equivalent transformation xi = 2zi − 1, Eq. (167) is transformed to the equation

0 = α0 + Σ_{i=1}^{n} αi zi + Σ_{i<j} αij zi zj + Σ_{i<j<k} αijk zi zj zk + · · · + Σ_{i1 <i2 <···<in−1} αi1 i2 ···in−1 zi1 zi2 · · · zin−1 + α12···n z1 z2 · · · zn   (168)

If we assemble the K = 2^n coefficients of (168) into a vector α ∈ R^K , then the terms in Z are linearly independent provided
that equation (168) is true if and only if α = 0.
LEMMA
The elements of the set B of (158) are linearly independent if and only if the elements of the set Z of (169) are
linearly independent.
PROOF
One way to proceed is to show that the linear, invertible mapping xi = 2zi − 1 induces a linear, invertible mapping
between the collective elements of Z and X , which therefore guarantees the equivalence (i.e., an “if and only if”
relationship) of linear independence of the elements of B and linear independence of Z. However, we use a
different approach based on showing that

α = 0 ⇐⇒ θ = 0   (170)

for α of Eq. (168) and θ of Eq. (167), where the right hand sides of (168) and (167) are identical functions related
by the transformation x = 2z − e, as this then implies

Eq. (167) ⇐⇒ Eq. (168) ⇐⇒ α = 0 ⇐⇒ θ = 0 ,

where the first equivalence follows from the transformation x = 2z − e, the span “Eq. (168) ⇐⇒ α = 0” states
that the elements of Z are linearly independent, and the span “Eq. (167) ⇐⇒ θ = 0” states that the elements
of B are linearly independent.

Note that then “Eq. (168) ⇐⇒ α = 0” if and only if “Eq. (167) ⇐⇒ θ = 0”, which is the statement of the
lemma.
To show the validity of (170), we sequentially make substitutions xi = 2zi − 1 beginning with the last term in
Eq. (167) and moving to the next-to-last term, etc., until all terms in (167) have been exhausted. At the same
time that these substitutions are iteratively being made, we continually apply the consequences of the previous
iterations.
First Step. Making the substitution xi = 2zi − 1 in the last term of Eq. (167) results in

α12···n = 2^n θ12···n ,

which implies α12···n = 0 ⇐⇒ θ12···n = 0.
Second Step. Now making the substitution xi = 2zi − 1 in the penultimate term of Eq. (167) and using the
result of our previous step yields

αi1 i2 ···in−1 = 2^{n−1} θi1 i2 ···in−1 − (term proportional to θ12···n ) = 2^{n−1} θi1 i2 ···in−1   for all i1 < i2 < · · · < in−1 ,

where the θ12···n term vanishes because the previously considered parameter has been set to zero. This implies

αi1 i2 ···in−1 = 0 ⇐⇒ θi1 i2 ···in−1 = 0   for all i1 < i2 < · · · < in−1 .
Third Step. Progressing to the third-to-last term, making the substitution xi = 2zi − 1 and using the results of
our previous steps, setting the values of the previously considered parameters to zero, gives

αi1 i2 ···in−2 = 2^{n−2} θi1 i2 ···in−2 + (terms proportional to θi1 i2 ···in−2 ;k and θ12···n ) = 2^{n−2} θi1 i2 ···in−2

for all i1 < i2 < · · · < in−2 , where θi1 i2 ···in−2 ;k denotes an order-(n − 1) parameter whose index set contains
{i1 , · · · , in−2 } together with one additional index k. This gives, assuming that the previously considered
parameters are set to zero,

αi1 i2 ···in−2 = 0 ⇐⇒ θi1 i2 ···in−2 = 0   for all i1 < i2 < · · · < in−2 .
k-th Step. Continuing in this manner, at the k-th-to-last term use the results of our previous steps, setting the
values of the previously considered higher-order parameters to zero, to obtain αi1 i2 ···in−k+1 = 2^{n−k+1} θi1 i2 ···in−k+1 .
This gives, assuming that the previously considered parameters are set to zero,
αi1 i2 ···in−k+1 = 0 ⇐⇒ θi1 i2 ···in−k+1 = 0 for all i1 < i2 < · · · < in−k+1 .
Last Step. Finally, at the (k = n + 1)-st stage, with i0 = 0, using the results of the previous stages and
assuming that the previously considered parameters are all set to zero, we have

α0 = θ0 , so that α0 = 0 ⇐⇒ θ0 = 0.
With the Lemma proved, it remains to show that the terms in Z are linearly independent, i.e., that Eq. (168) holds if and only
if α = 0. Since sufficiency is trivial, we must show that Eq. (168) implies α = 0, where zi ∈ {0, 1}, i = 1, · · · , n. We do
this by proceeding recursively on the right-hand-side of (168), beginning with the first term α0 and proceeding term-by-term
to the last term α12···n z1 z2 · · · zn . Note that at each step we use the results determined in the previous steps.

Step 0. Set all components to zero, z1 = · · · = zn = 0. Eq. (168) then implies α0 = 0.

Step 1. For each i, set zi = 1 and set all other components to zero. Together with Eq. (168) and the result of the
previous step, this implies αi = 0 for i = 1, · · · , n.

Step 2. For each i and j which are pairwise distinct (i.e., i ≠ j), and ordered as i < j, set zi = zj = 1 and set
all other components to zero. Together with Eq. (168) and the results of the previous steps, this implies

αij = 0, for i, j = 1, · · · , n, i < j.
Step 3. For each pairwise distinct triple i, j, k, ordered as i < j < k set zi = zj = zk = 1 and set all other
components to zero. Together with Eq. (168) and the results of the previous steps, this implies αijk = 0, for
i, j, k = 1, · · · , n, i < j < k.
Step k. For each pairwise distinct collection of k indices Ik = {i1 , · · · ik }, ordered as i1 < i2 < · · · < ik , set
zi` = 1 for i` ∈ Ik 6= ∅ and then set all other components to zero. Together with Eq. (168) and the results of
the previous steps, this implies αi1 ,··· ,ik = 0 for all index collections Ik = {i1 , · · · ik }.
Step n. Set all components equal to one, z1 = · · · = zn = 1. Together with Eq. (168) and the results of the previous
steps, this implies α12···n = 0.
Conclusion: α = 0.
Note that if we define I0 = ∅ in Step k, then the proof that α = 0 is summarized as:
Loop for k = 0, · · · , n;
Do Step k;
End Loop.
The multivariate Gaussian distribution (171) can be rewritten into the form

P (X) = (1/Z) e^{−E(X)} ,   E(X) = (1/2) X T W X − bT X ,   W = C −1   (172)
and, in turn, any multivariate distribution of the form (172) with W invertible can be placed in the standard Gaussian form
(171). Thus the two forms (171) and (172) are entirely equivalent, assuming the invertibility of W . Whereas C is known as
the covariance matrix, its inverse W = C −1 is known as the concentration matrix. The Gaussian distribution (171)–(172) has
many special and important properties, a few of which we list here:146
Properties of the Quadratic Exponential Continuous (Gaussian) Distribution (171) and (172)
G1. Distributional Closure Property I: Marginals of a multivariate Gaussian are Gaussian.
G2. Distributional Closure Property II: The conditional pdfs P (A|B) computed from any two disjoint subsets A and B of
the elements of a Gaussian vector X are Gaussian.
G3. The Gaussian pdf is the maximum (differential) Shannon entropy pdf subject to constraints on the first two moments of
X. W is a matrix of Lagrange multipliers, which determines C = W −1 .
G4. C = W −1 is the covariance matrix. In particular, C is symmetric and positive definite.
G5. xi ⊥⊥ xj ⇐⇒ cij = [C]ij = 0.

G7. xi ⊥⊥ xj | (X − xi ei − xj ej ) ⇐⇒ wij = [W ]ij = 0.
It is interesting to ask which of these properties, or analogies of these properties, if any, also hold for the Quadratic
Exponential Binary (QEB) distribution (164) which describes the behavior of a categorical random vector X with binary
components. As we have noted, the QEB distribution is equivalent to the Ising (aka Boltzmann Machine) distribution,
P (X) = (1/Z) e^{−βE(X)} ,   E(X) = −(1/2) X T W X − bT X .   (173)

We will list the properties of the Quadratic Exponential Binary (QEB) distribution corresponding to the Gaussian
Properties G1–G7 as B1–B7, and will discuss them point-by-point. However, relative to the properties G1–G7 of the
Gaussian distribution, many of the properties of the QEB are negative properties.
Properties of the Quadratic Exponential Binary (QEB) Distribution (164) , (166) and (173)
The negative results B1 and B2 are discussed in references [9] and [10].
B3. The QEB distribution is the maximum entropy distribution subject to constraints on the first and second moments of the
binary-components categorical vector X.

B4. There is no straightforward analogy to property G4 for the QEB distribution. W −1 does not exist because its diagonal
elements are all zero. Furthermore, although it is symmetric, in general W is not positive-semidefinite.
It is possible to modify W to a positive definite version W +dI without modifying the QEB distribution by adding a sufficiently
large positive constant d > 0,147 but W + dI in general will not have the interpretation of being the covariance matrix of X.148
B5. There is no straightforward analogy to property G5. W −1 does not exist in a meaningful way that one can identify as
being a covariance matrix.

B6. There is no straightforward analogy to property G6. Because it has zero diagonal elements, W is not positive-semidefinite.
Although, as per the discussion for property B4, one can create positive definite versions of W , the matrix W −1 would still,
in general, not be a covariance matrix, which is a necessary condition for W to be a concentration matrix [31].

B7. xi ⊥⊥ xj | (X − xi ei − xj ej ) ⇐⇒ wij = [W ]ij = 0.
Thus property G7 which holds for the Gaussian distribution is also true for the QEB distribution. This is noted in references
[10, 11] but without proof. For completeness, a proof of property B7 is given below.
An important consequence of property B7 is that it is straightforward to construct a probabilistic dependency graph for the
QEB distribution if one knows the matrix W .149 One simply creates an edge between sites i and j, viewed as vertices on a
graph, if and only if wij ≠ 0.
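A minimal sketch of this construction, using a hypothetical 3 × 3 coupling matrix W with zero diagonal:

```python
import numpy as np

W = np.array([[0.0,  1.2,  0.0],
              [1.2,  0.0, -0.7],
              [0.0, -0.7,  0.0]])   # hypothetical symmetric couplings, zero diagonal

# Edge {i, j} is present iff w_ij != 0 (property B7)
n = W.shape[0]
edges = {(i, j) for i in range(n) for j in range(i + 1, n) if W[i, j] != 0.0}
assert edges == {(0, 1), (1, 2)}    # units 0 and 2 share no direct edge
```

The absent edge {0, 2} encodes the conditional independence of x1 and x3 given x2.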
X̄ = X−ij = X − xi ei − xj ej
147 This is because xi^2 = 1 allows us to harmlessly transform the right hand side of Eq. (164) by the changes θ0 ← (θ0 − nd) and θii ← d, i = 1, · · · , n.
148 Which value of d is the correct value to use? More generally, one can make a modification W + D > 0, where D > 0 is a diagonal matrix, but which
matrix D do we use?
149 See [27] for a discussion of dependency graphs.
A necessary condition for P (xi , xj | X̄) = P (xi | X̄)P (xj | X̄) to hold is that P (xi , xj | X̄) is capable of being factored into two
functions as
If it is impossible for the joint conditional probability to have such a factorization, then xi and xj cannot be conditionally
independent. We will show that a necessary condition for the conditional probability to factor as shown in (174) is that
wij = 0. After we have shown necessity, we will then show sufficiency.
Note that
−E(X) = (1/2) X T W X + bT X = (1/2) (X̄ + xi ei + xj ej )T W (X̄ + xi ei + xj ej ) + bT (X̄ + xi ei + xj ej )
= −E(X̄) + xi h̄i + xj h̄j + wij xi xj ,

where h̄i = bi + (W X̄)i and h̄j = bj + (W X̄)j , and we have used the fact that the diagonal elements of W are zero (so the
xi^2 and xj^2 terms vanish).
Therefore,

P (X) = P (xi , xj , X̄) = (e^{−βE(X̄)} / Z) e^{β(xi h̄i + xj h̄j + wij xi xj )}

and

P (X̄) = Σ_{xi , xj ∈ {−1,+1}} P (xi , xj , X̄) = (e^{−βE(X̄)} / Z) Σ_{xi , xj ∈ {−1,+1}} e^{β(xi h̄i + xj h̄j + wij xi xj )} .
Thus,

P (xi , xj | X̄) = P (xi , xj , X̄) / P (X̄) = e^{β(xi h̄i + xj h̄j + wij xi xj )} / D ,

where, by property (154), the denominator is

D = e^{β(xi h̄i + xj h̄j + wij xi xj )} + e^{β(−xi h̄i + xj h̄j − wij xi xj )} + e^{β(xi h̄i − xj h̄j − wij xi xj )} + e^{β(−xi h̄i − xj h̄j + wij xi xj )} .

The conditional probability will factor as shown in Eq. (174) if and only if the denominator factors, which will be the case
if and only if wij = 0. Thus the vanishing of wij is a necessary condition for conditional independence of xi and xj given X̄.
resulting in
P (xi , xj | X̄) = P (xi | X̄)P (xj | X̄)
as claimed.
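Property B7 can also be checked numerically on a small Ising/QEB model. With the hypothetical couplings below, w12 = 0, so x1 and x2 should be conditionally independent given the remaining unit:

```python
import itertools
import numpy as np

beta = 0.8
W = np.array([[0.0,  0.0,  0.9],
              [0.0,  0.0, -0.4],
              [0.9, -0.4,  0.0]])   # w_12 = 0: no direct coupling between units 1, 2
b = np.array([0.3, -0.2, 0.1])

states = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
logits = beta * (0.5 * np.einsum('ki,ij,kj->k', states, W, states) + states @ b)
p = np.exp(logits - logits.max()); p /= p.sum()

def prob(x):
    """Probability of the full state vector x."""
    return p[np.all(states == np.asarray(x), axis=1)][0]

# Verify P(x1, x2 | x3) = P(x1 | x3) P(x2 | x3) for all state values
for x3 in (-1.0, 1.0):
    z = sum(prob((a, c, x3)) for a in (-1.0, 1.0) for c in (-1.0, 1.0))
    for x1 in (-1.0, 1.0):
        for x2 in (-1.0, 1.0):
            joint = prob((x1, x2, x3)) / z
            m1 = sum(prob((x1, c, x3)) for c in (-1.0, 1.0)) / z
            m2 = sum(prob((a, x2, x3)) for a in (-1.0, 1.0)) / z
            assert abs(joint - m1 * m2) < 1e-12
```

Setting the (1, 3) or (2, 3) coupling to zero instead would make the corresponding pair conditionally independent, mirroring the dependency-graph construction above.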
where
XG = XC1 ∪ · · · ∪ XCc .
In practice, both procedures are exploited for most models of interest.
References
[1] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. “A Learning Algorithm for Boltzmann Machines”.
Cognitive Science, 9(1):147–169, 1985.
[2] A. Agresti. Categorical Data Analysis. Wiley-Interscience, second edition, 2002.
[3] Daniel J Amit. Modeling Brain Function: The World of Attractor Neural Networks. Cambridge University Press, 1992.
[4] P. Billingsley. Probability and Measure. Wiley-Interscience, 3rd edition, 1995.
[5] P. Brémaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer, 1999.
[6] Herbert B Callen. Thermodynamics and an Introduction to Thermostatistics. John Wiley & Sons, 2nd edition, 1985.
[7] I.A. Cosma and L. Evers. Markov Chains and Monte Carlo Methods – Lecture Notes. AIMS - African Institute for
Mathematical Sciences, 2010.
[8] T.H. Cover and J.A. Thomas. Elements of Information Theory. Wiley, second edition, 2006.
[9] D.R. Cox. “The Analysis of Multivariate Binary Data”. Journal of the Royal Statistical Society, Series C (Applied
Statistics), 21(2):113–120, 1972.
[10] D.R. Cox and N. Wermuth. “A Note on the Quadratic Exponential Binary Distribution”. Biometrika, 81(2):403–408,
1994.
[11] D.R. Cox and N. Wermuth. “On Some Models for Multivariate Binary Variables Parallel in Complexity with the Multivariate Gaussian Distribution”. Biometrika, 89(2):462–469, 2002.
[12] P.J. Dhrymes. Mathematics for Econometrics. Springer, 4th edition, 2013.
[13] S. Fienberg. The Analysis of Cross-Classified Categorical Data. Springer, second edition, 2007.
[14] G.D. Forney Jr. “The Viterbi Algorithm”. Proceedings of the IEEE, 61(3):268–278, March 1973.
[15] R.J. Glauber. “Time-Dependent Statistics of the Ising Model”. Journal of Mathematical Physics, 4(2):294–307, 1963.
[16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[17] C. Gourieroux and A. Monfort. Statistics & Econometric Models – Volume One. Cambridge University Press, 1995.
[18] C. Gourieroux and A. Monfort. Statistics & Econometric Models – Volume Two. Cambridge University Press, 1996.
[19] G. Grimmett. Probability on Graphs: Random Processes on Graphs and Lattices. Cambridge University Press, 2011.
[20] S. Haykin. Neural Networks and Learning Machines. Prentice Hall, 3rd edition, 2008.
[21] John Hertz, Anders Krogh, and Richard G Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley,
1991.
[22] Geoffrey E Hinton and Terrence J Sejnowski. “Optimal Perceptual Inference”. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 448–453. IEEE, New York, 1983.
[23] Geoffrey E Hinton and Terrence J Sejnowski. “Learning and relearning in Boltzmann machines”. Parallel Distributed
Processing, 1, 1986.
[24] Kerson Huang. Statistical Mechanics. Wiley, 1987.
[25] S. Kirkpatrick, C. D. Gelatt Jr., and M.P. Vecchi. “Optimization by Simulated Annealing”. Science, 220(4598):671–680,
1983.
[26] K. Kreutz-Delgado. “Real Vector Derivatives, Gradients & Nonlinear Least-Squares”. Lecture Notes - Report Number
ECE275A-LS2-F17v1.0, 2017.
[27] S.L. Lauritzen. Graphical Models. Claredon Press, 1996.
[28] D.J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
[29] M. Nerlove and S.J. Press. Univariate and Multivariate Log-Linear and Logistic Models. Technical report, Rand Corpo-
ration Technical Report R-1306-EDA/NI, 1973.
[30] J.R. Norris. Markov Chains. Cambridge University Press, 1998.
[34] G. Tkacik, E. Schneidman, M.J. Berry II, and W. Bialek. “Spin Glass Models for a Network of Real Neurons”.
arXiv:0912.5409v1, 2009.
[35] L. Younes. “Synchronous Boltzmann Machines can be Universal Approximators”. Applied Mathematics Letters, 9(3):109–113, 1996.